---
language:
- en
license: apache-2.0
library_name: pytorch
tags:
- text-classification
- fiction-detection
- byte-level
- cnn
datasets:
- HuggingFaceTB/cosmopedia
- BEE-spoke-data/gutenberg-en-v1-clean
- common-pile/arxiv_abstracts
- ccdv/cnn_dailymail
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: TinyByteCNN-Fiction-Classifier
  results:
  - task:
      type: text-classification
      name: Fiction vs Non-Fiction Classification
    dataset:
      name: Custom Fiction/Non-Fiction Dataset (85k samples)
      type: custom
      split: validation
    metrics:
    - type: accuracy
      value: 99.91
      name: Validation Accuracy
    - type: f1
      value: 99.91
      name: F1 Score
    - type: roc_auc
      value: 99.99
      name: ROC AUC
  - task:
      type: text-classification
      name: Curated Test Samples
    dataset:
      name: 18 Diverse Fiction/Non-Fiction Samples
      type: curated
      split: test
    metrics:
    - type: accuracy
      value: 100.0
      name: Test Accuracy
    - type: confidence_avg
      value: 96.3
      name: Average Confidence
---

# TinyByteCNN Fiction vs Non-Fiction Detector

A lightweight, byte-level CNN for detecting fiction vs non-fiction text, reaching 99.91% validation accuracy.

## Model Description

TinyByteCNN is a compact byte-level convolutional neural network designed for binary classification of fiction vs non-fiction text. The model operates directly on UTF-8 byte sequences, eliminating the need for tokenization and making it robust to various text formats and languages.

### Architecture Highlights

- **Model Size**: 942,313 parameters (~3.6 MB)
- **Input**: Raw UTF-8 bytes (max 4096 bytes ≈ 512 words)
- **Architecture**: Depthwise-separable 1D CNN with Squeeze-and-Excitation (SE) attention
- **Receptive Field**: ~2.8 KB, covering multi-paragraph context
- **Key Features**:
  - 4 stages with progressive downsampling (32x total reduction)
  - Dilated convolutions for a larger receptive field
  - SE attention modules for channel recalibration
  - Global average + max pooling head

## Intended Uses & Limitations

### Intended Uses

- Automated content categorization for libraries and archives
- Fiction/non-fiction filtering for content platforms
- Educational content classification
- Writing style analysis
- Content recommendation systems

### Limitations

- **Personal narratives**: May misclassify personal journal entries and memoirs as fiction (~97% fiction confidence observed on some journal entries)
- **Mixed content**: Struggles with creative non-fiction and narrative journalism
- **Length**: Optimized for 512-4096 byte inputs; longer texts should be chunked (see the appendix at the end of this card)
- **Language**: Primarily trained on English text

## Training Data

The model was trained on a diverse dataset of 85,000 samples (60k train, 15k validation, 10k test) drawn from:

### Fiction Sources (50%)

1. **Cosmopedia Stories** (HuggingFaceTB/cosmopedia) - Synthetic fiction stories - License: Apache 2.0
2. **Project Gutenberg** (BEE-spoke-data/gutenberg-en-v1-clean) - Classic literature - License: Public Domain
3. **Reddit WritingPrompts** - Community-generated creative writing, represented via synthetic alternatives

### Non-Fiction Sources (50%)

1. **Cosmopedia Educational** (HuggingFaceTB/cosmopedia) - Textbooks, WikiHow, educational blogs - License: Apache 2.0
2. **Scientific Papers** (common-pile/arxiv_abstracts) - Academic abstracts and introductions - License: Various (permissive)
3. **News Articles** (ccdv/cnn_dailymail) - CNN and Daily Mail articles - License: Apache 2.0

## Training Procedure

### Preprocessing

- Unicode NFC normalization
- Whitespace normalization (max 2 consecutive spaces)
- UTF-8 byte encoding
- Padding/truncation to 4096 bytes
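The repository's `model.py` ships a `preprocess_text` helper (used in the usage examples below). If you need to reproduce the pipeline independently, here is a minimal sketch of the steps above; the zero-valued padding byte, space-only whitespace handling, and simple truncation are assumptions, not confirmed details of the released code:

```python
import re
import unicodedata

import torch

MAX_BYTES = 4096

def preprocess_text(text: str, max_bytes: int = MAX_BYTES) -> torch.Tensor:
    """Approximate the preprocessing described above: NFC-normalize,
    cap runs of spaces at two, encode to UTF-8, pad/truncate to a
    fixed length, and return a [1, max_bytes] long tensor."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r" {3,}", "  ", text)               # max 2 consecutive spaces
    data = text.encode("utf-8")[:max_bytes]           # truncate to the byte limit
    ids = list(data) + [0] * (max_bytes - len(data))  # zero-pad (assumed pad byte)
    return torch.tensor(ids, dtype=torch.long).unsqueeze(0)
```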
### Training Hyperparameters

- **Optimizer**: AdamW (lr=3e-3, betas=(0.9, 0.98), weight_decay=0.01)
- **Schedule**: Cosine decay with 5% warmup
- **Batch Size**: 32
- **Epochs**: 10
- **Label Smoothing**: 0.05
- **Gradient Clipping**: 1.0
- **Device**: Apple M-series (MPS)
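The training script is not published with this card. The sketch below shows one way the hyperparameters above fit together in PyTorch; the `TinyByteCNN()` constructor call, the per-step scheduler, the smoothed-target formulation, and the loop structure are illustrative assumptions, not the original code:

```python
import math

import torch
from torch.optim.lr_scheduler import LambdaLR

from model import TinyByteCNN  # model definition shipped with this repo

EPOCHS, STEPS_PER_EPOCH, SMOOTH = 10, 1875, 0.05  # 60k samples / batch size 32

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = TinyByteCNN().to(device)  # constructor args assumed default
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3,
                              betas=(0.9, 0.98), weight_decay=0.01)

total_steps = EPOCHS * STEPS_PER_EPOCH
warmup = int(0.05 * total_steps)  # 5% linear warmup

def lr_lambda(step: int) -> float:
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero

scheduler = LambdaLR(optimizer, lr_lambda)
criterion = torch.nn.BCEWithLogitsLoss()

def train_step(inputs: torch.Tensor, labels: torch.Tensor) -> float:
    # labels: 1 = non-fiction, 0 = fiction (matches the inference code below)
    optimizer.zero_grad()
    logits = model(inputs.to(device)).squeeze(-1)
    targets = labels.float().to(device) * (1 - 2 * SMOOTH) + SMOOTH  # smoothing 0.05
    loss = criterion(logits, targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # schedule advanced once per optimizer step
    return loss.item()
```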
## Evaluation Results

### Validation Set (15,000 samples)

| Metric | Value |
|--------|-------|
| Accuracy | 99.91% |
| F1 Score | 0.9991 |
| ROC AUC | 0.9999 |
| Loss | 0.1194 |

### Detailed Test Results on 18 Curated Samples

The model achieved **100% accuracy** across all categories, but shows interesting confidence patterns:

| Category | Sample Title/Type | True Label | Predicted | Confidence | Analysis |
|----------|------------------|------------|-----------|------------|----------|
| **FICTION - General** | | | | | |
| Literary | Lighthouse Keeper Storm | Fiction | Fiction | **79.8%** | ⚠️ **Lowest confidence** - realistic setting |
| Sci-Fi | Time Travel Bedroom | Fiction | Fiction | 97.2% | ✅ Clear fantastical elements |
| Mystery | Detective Rose Case | Fiction | Fiction | 97.3% | ✅ Strong narrative structure |
| **FICTION - Children's** | | | | | |
| Animal Tale | Benny's Carrot Problem | Fiction | Fiction | 97.1% | ✅ Clear storytelling markers |
| Fantasy | Princess Luna's Paintings | Fiction | Fiction | 97.3% | ✅ Magical elements detected |
| Magical | Tommy's Dream Sprites | Fiction | Fiction | **96.0%** | ⚠️ Lower confidence - whimsical tone |
| **FICTION - Fantasy** | | | | | |
| Epic Fantasy | Shadowgate & Void Lords | Fiction | Fiction | 97.4% | ✅ High fantasy vocabulary |
| Magic System | Moonlight Weaver Elara | Fiction | Fiction | 96.8% | ✅ Complex world-building |
| Urban Fantasy | Dragon Memory Markets | Fiction | Fiction | 97.3% | ✅ Supernatural commerce |
| **NON-FICTION - Academic** | | | | | |
| Biology | Photosynthesis Process | Non-Fiction | Non-Fiction | 97.8% | ✅ Technical terminology |
| Mathematics | Calculus Theorem | Non-Fiction | Non-Fiction | 97.8% | ✅ Mathematical concepts |
| Economics | Market Equilibrium | Non-Fiction | Non-Fiction | 97.9% | ✅ Economic theory |
| **NON-FICTION - News** | | | | | |
| Financial | Federal Reserve Decision | Non-Fiction | Non-Fiction | 97.8% | ✅ Factual reporting style |
| Local Gov | Homeless Crisis Plan | Non-Fiction | Non-Fiction | 97.9% | ✅ Policy announcement format |
| Science | Exoplanet Discovery | Non-Fiction | Non-Fiction | 97.9% | ✅ Research reporting |
| **NON-FICTION - Journals** | | | | | |
| Financial | Wall Street Journal Market | Non-Fiction | Non-Fiction | 97.7% | ✅ Professional journalism |
| Scientific | Nature Research Report | Non-Fiction | Non-Fiction | 97.7% | ✅ Academic publication style |
| Personal | Kyoto Travel Log | Non-Fiction | Non-Fiction | **97.5%** | ⚠️ Slightly lower - personal narrative |

### Key Insights

- **Weakest performance**: Realistic literary fiction (79.8% confidence) - the lighthouse story lacks obvious fantastical elements
- **Strongest performance**: Academic/news content (97.8-97.9% confidence) - clear technical/factual language
- **Edge cases**: Personal narratives and whimsical children's stories show slightly lower confidence
- **Perfect accuracy**: 18/18 samples correctly classified despite confidence variations

## How to Use

### PyTorch

```python
import torch

from model import TinyByteCNN, preprocess_text

# Load model
model = TinyByteCNN.from_pretrained("username/tinybytecnn-fiction-detector")
model.eval()

# Prepare text
text = "Your text here..."
input_bytes = preprocess_text(text)  # Returns tensor of shape [1, 4096]

# Predict
with torch.no_grad():
    logits = model(input_bytes)
    probability = torch.sigmoid(logits).item()

if probability > 0.5:
    print(f"Non-Fiction (confidence: {probability:.1%})")
else:
    print(f"Fiction (confidence: {1 - probability:.1%})")
```

### Batch Processing

```python
def classify_texts(texts, model, batch_size=32):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # preprocess_text returns [1, 4096], so concatenate along dim 0
        inputs = torch.cat([preprocess_text(t) for t in batch])
        with torch.no_grad():
            logits = model(inputs)
            probs = torch.sigmoid(logits).squeeze(-1)
        for text, prob in zip(batch, probs):
            p = prob.item()
            results.append({
                'text': text[:100] + '...',
                'class': 'Non-Fiction' if p > 0.5 else 'Fiction',
                'confidence': p if p > 0.5 else 1 - p,
            })
    return results
```

## Training Infrastructure

- **Hardware**: Apple M-series with 8GB MPS memory limit
- **Training Time**: ~20 minutes
- **Framework**: PyTorch 2.0+

## Environmental Impact

- **Hardware Type**: Apple Silicon M-series
- **Hours Used**: 0.33
- **Carbon Emitted**: Minimal (ARM-based efficiency, ~10W average draw)

## Citation

```bibtex
@misc{tinybytecnn-fiction-2024,
  title={TinyByteCNN Fiction vs Non-Fiction Detector},
  author={Mitchell Currie},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/username/tinybytecnn-fiction-detector}
}
```

## Acknowledgments

This model uses data from:

- HuggingFace Team (Cosmopedia dataset)
- Project Gutenberg
- Common Pile contributors
- CNN/Daily Mail dataset creators

## License

Apache 2.0

## Contact

For questions or issues, please open an issue on the [model repository](https://huggingface.co/username/tinybytecnn-fiction-detector).
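## Appendix: Classifying Texts Longer Than 4096 Bytes

As noted under Limitations, inputs longer than 4096 bytes should be chunked. Below is a minimal sketch of one way to do this, reusing the `classify_texts` helper from the Batch Processing section; averaging per-chunk probabilities is an illustrative choice, not a procedure from the released code:

```python
def classify_long_text(text: str, model, chunk_bytes: int = 4096) -> dict:
    """Split an over-length document into byte-sized chunks, classify each
    chunk, and average the per-chunk non-fiction probabilities."""
    data = text.encode("utf-8")
    chunks = [data[i:i + chunk_bytes].decode("utf-8", errors="ignore")
              for i in range(0, len(data), chunk_bytes)]
    chunks = chunks or [text]  # guard against empty input
    results = classify_texts(chunks, model)
    # Recover each chunk's raw non-fiction probability from the result dicts
    probs = [r['confidence'] if r['class'] == 'Non-Fiction' else 1 - r['confidence']
             for r in results]
    mean_prob = sum(probs) / len(probs)
    return {
        'class': 'Non-Fiction' if mean_prob > 0.5 else 'Fiction',
        'confidence': mean_prob if mean_prob > 0.5 else 1 - mean_prob,
        'num_chunks': len(chunks),
    }
```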