language:
  - en
license: apache-2.0
library_name: pytorch
tags:
  - text-classification
  - fiction-detection
  - byte-level
  - cnn
datasets:
  - HuggingFaceTB/cosmopedia
  - BEE-spoke-data/gutenberg-en-v1-clean
  - common-pile/arxiv_abstracts
  - ccdv/cnn_dailymail
metrics:
  - accuracy
  - f1
  - roc_auc
model-index:
  - name: TinyByteCNN-Fiction-Classifier
    results:
      - task:
          type: text-classification
          name: Fiction vs Non-Fiction Classification
        dataset:
          name: Custom Fiction/Non-Fiction Dataset (85k samples)
          type: custom
          split: validation
        metrics:
          - type: accuracy
            value: 99.91
            name: Validation Accuracy
          - type: f1
            value: 99.91
            name: F1 Score
          - type: roc_auc
            value: 99.99
            name: ROC AUC
      - task:
          type: text-classification
          name: Curated Test Samples
        dataset:
          name: 18 Diverse Fiction/Non-Fiction Samples
          type: curated
          split: test
        metrics:
          - type: accuracy
            value: 100
            name: Test Accuracy
          - type: confidence_avg
            value: 96.3
            name: Average Confidence
TinyByteCNN Fiction vs Non-Fiction Detector
A lightweight, byte-level CNN model for detecting fiction vs non-fiction text with 99.91% validation accuracy.
Model Description
TinyByteCNN is a highly efficient byte-level convolutional neural network designed for binary classification of fiction vs non-fiction text. The model operates directly on UTF-8 byte sequences, eliminating the need for tokenization and making it robust to various text formats and languages.
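Concretely, the model's "vocabulary" is just the 256 possible byte values, so any string maps to model inputs without a tokenizer. A minimal illustration (not from the released code):

```python
text = "Call me Ishmael."
byte_ids = list(text.encode("utf-8"))  # e.g. [67, 97, 108, 108, ...], each value in 0-255
```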
Architecture Highlights
- Model Size: 942,313 parameters (~3.6MB)
- Input: Raw UTF-8 bytes (max 4,096 bytes ≈ 512 words)
- Architecture: Depthwise-separable 1D CNN with Squeeze-Excitation
- Receptive Field: ~2.8KB covering multi-paragraph context
- Key Features:
  - 4 stages with progressive downsampling (32x reduction)
  - Dilated convolutions for larger receptive field
  - SE attention modules for channel recalibration
  - Global average + max pooling head
 
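The exact layer configuration lives in the repository's model.py; purely as an illustration of the building blocks named above, a depthwise-separable 1D block with dilation and squeeze-excitation might look like the sketch below (channel counts, kernel size, and reduction ratio are assumptions, not the released checkpoint):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-Excitation: recalibrate channels with a globally pooled gating vector."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                    # x: [B, C, L]
        scale = self.fc(x.mean(dim=-1))      # squeeze over the length dimension
        return x * scale.unsqueeze(-1)       # excite each channel

class DWSepBlock(nn.Module):
    """Depthwise-separable 1D conv with optional dilation/stride, followed by SE attention."""
    def __init__(self, in_ch, out_ch, kernel=7, dilation=1, stride=1):
        super().__init__()
        pad = dilation * (kernel - 1) // 2
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel, stride=stride,
                                   padding=pad, dilation=dilation, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, 1)
        self.norm = nn.BatchNorm1d(out_ch)
        self.act = nn.GELU()
        self.se = SEBlock(out_ch)

    def forward(self, x):
        return self.se(self.act(self.norm(self.pointwise(self.depthwise(x)))))
```

Stacking stages of such blocks with strided downsampling, then concatenating global average and max pooling before a linear classifier, gives the overall shape described above.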
Intended Uses & Limitations
Intended Uses
- Automated content categorization for libraries and archives
- Fiction/non-fiction filtering for content platforms
- Educational content classification
- Writing style analysis
- Content recommendation systems
Limitations
- Personal narratives: May misclassify personal journal entries and memoirs as fiction (observed ~97% fiction confidence on journal entries)
- Mixed content: Struggles with creative non-fiction and narrative journalism
- Length: Optimized for 512-4096 byte inputs; longer texts should be chunked (see the sketch after this list)
- Language: Primarily trained on English text
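One simple way to handle texts longer than 4,096 bytes is to score overlapping byte windows and average the probabilities. This is only a sketch: the chunk size, overlap, and averaging strategy are arbitrary choices, not part of the released code; `preprocess_text` is the helper shown in the How to Use section below.

```python
import torch
from model import preprocess_text  # same helper used in the How to Use section

def classify_long_text(text, model, chunk_bytes=4096, overlap=512):
    """Average sigmoid probabilities over overlapping byte windows of a long text."""
    data = text.encode("utf-8")
    step = chunk_bytes - overlap
    chunks = [data[i:i + chunk_bytes] for i in range(0, max(len(data) - overlap, 1), step)]
    probs = []
    with torch.no_grad():
        for chunk in chunks:
            # decode ignoring any multi-byte character split at a chunk boundary
            inputs = preprocess_text(chunk.decode("utf-8", errors="ignore"))
            probs.append(torch.sigmoid(model(inputs)).item())
    avg = sum(probs) / len(probs)
    return ("Non-Fiction", avg) if avg > 0.5 else ("Fiction", 1 - avg)
```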
Training Data
The model was trained on a diverse dataset of 85,000 samples (60k train, 15k validation, 10k test) drawn from:
Fiction Sources (50%)
- Cosmopedia Stories (HuggingFaceTB/cosmopedia) - Synthetic fiction stories
  - License: Apache 2.0
- Project Gutenberg (BEE-spoke-data/gutenberg-en-v1-clean) - Classic literature
  - License: Public Domain
- Reddit WritingPrompts - Community-generated creative writing
  - Via synthetic alternatives
Non-Fiction Sources (50%)
- Cosmopedia Educational (HuggingFaceTB/cosmopedia) - Textbooks, WikiHow, educational blogs
  - License: Apache 2.0
- Scientific Papers (common-pile/arxiv_abstracts) - Academic abstracts and introductions
  - License: Various (permissive)
- News Articles (ccdv/cnn_dailymail) - CNN and Daily Mail articles
  - License: Apache 2.0
Training Procedure
Preprocessing
- Unicode NFC normalization
- Whitespace normalization (max 2 consecutive spaces)
- UTF-8 byte encoding
- Padding/truncation to 4096 bytes
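A minimal re-implementation of these steps (the repository ships its own `preprocess_text`; this sketch only mirrors the description above, and the zero pad value and exact whitespace rule are assumptions):

```python
import re
import unicodedata
import torch

MAX_BYTES = 4096

def preprocess(text, max_bytes=MAX_BYTES):
    """NFC-normalize, cap runs of spaces at 2, encode to UTF-8, pad/truncate to 4096 bytes."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r" {3,}", "  ", text)            # max 2 consecutive spaces (assumed rule)
    data = list(text.encode("utf-8"))[:max_bytes]  # truncate to 4096 bytes
    data += [0] * (max_bytes - len(data))          # zero-pad to fixed length (assumed pad value)
    return torch.tensor(data, dtype=torch.long).unsqueeze(0)  # shape [1, 4096]
```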
Training Hyperparameters
- Optimizer: AdamW (lr=3e-3, betas=(0.9, 0.98), weight_decay=0.01)
- Schedule: Cosine decay with 5% warmup
- Batch Size: 32
- Epochs: 10
- Label Smoothing: 0.05
- Gradient Clipping: 1.0
- Device: Apple M-series (MPS)
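Reconstructed in PyTorch, the setup above would look roughly like the following; the scheduler wiring and loss details are illustrative, not extracted from the training script.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_schedule(model, total_steps, warmup_frac=0.05):
    """AdamW with linear warmup over 5% of steps, then cosine decay to zero."""
    optimizer = AdamW(model.parameters(), lr=3e-3, betas=(0.9, 0.98), weight_decay=0.01)
    warmup_steps = int(total_steps * warmup_frac)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(warmup_steps, 1)
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    return optimizer, LambdaLR(optimizer, lr_lambda)

# Inside the training loop (sketch): smooth binary targets, clip gradients, step both objects.
# targets = targets.float() * (1 - 0.05) + 0.5 * 0.05           # label smoothing 0.05
# loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```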
Evaluation Results
Validation Set (15,000 samples)
| Metric | Value | 
|---|---|
| Accuracy | 99.91% | 
| F1 Score | 0.9991 | 
| ROC AUC | 0.9999 | 
| Loss | 0.1194 | 
Detailed Test Results on 18 Curated Samples
The model achieved 100% accuracy across all categories, but shows interesting confidence patterns:
| Category | Sample Title/Type | True Label | Predicted | Confidence | Analysis | 
|---|---|---|---|---|---|
| FICTION - General | | | | | |
| Literary | Lighthouse Keeper Storm | Fiction | Fiction | 79.8% | ⚠️ Lowest confidence - realistic setting |
| Sci-Fi | Time Travel Bedroom | Fiction | Fiction | 97.2% | ✅ Clear fantastical elements |
| Mystery | Detective Rose Case | Fiction | Fiction | 97.3% | ✅ Strong narrative structure |
| FICTION - Children's | | | | | |
| Animal Tale | Benny's Carrot Problem | Fiction | Fiction | 97.1% | ✅ Clear storytelling markers |
| Fantasy | Princess Luna's Paintings | Fiction | Fiction | 97.3% | ✅ Magical elements detected |
| Magical | Tommy's Dream Sprites | Fiction | Fiction | 96.0% | ⚠️ Lower confidence - whimsical tone |
| FICTION - Fantasy | | | | | |
| Epic Fantasy | Shadowgate & Void Lords | Fiction | Fiction | 97.4% | ✅ High fantasy vocabulary |
| Magic System | Moonlight Weaver Elara | Fiction | Fiction | 96.8% | ✅ Complex world-building |
| Urban Fantasy | Dragon Memory Markets | Fiction | Fiction | 97.3% | ✅ Supernatural commerce |
| NON-FICTION - Academic | | | | | |
| Biology | Photosynthesis Process | Non-Fiction | Non-Fiction | 97.8% | ✅ Technical terminology |
| Mathematics | Calculus Theorem | Non-Fiction | Non-Fiction | 97.8% | ✅ Mathematical concepts |
| Economics | Market Equilibrium | Non-Fiction | Non-Fiction | 97.9% | ✅ Economic theory |
| NON-FICTION - News | | | | | |
| Financial | Federal Reserve Decision | Non-Fiction | Non-Fiction | 97.8% | ✅ Factual reporting style |
| Local Gov | Homeless Crisis Plan | Non-Fiction | Non-Fiction | 97.9% | ✅ Policy announcement format |
| Science | Exoplanet Discovery | Non-Fiction | Non-Fiction | 97.9% | ✅ Research reporting |
| NON-FICTION - Journals | | | | | |
| Financial | Wall Street Journal Market | Non-Fiction | Non-Fiction | 97.7% | ✅ Professional journalism |
| Scientific | Nature Research Report | Non-Fiction | Non-Fiction | 97.7% | ✅ Academic publication style |
| Personal | Kyoto Travel Log | Non-Fiction | Non-Fiction | 97.5% | ⚠️ Slightly lower - personal narrative |
Key Insights:
- Weakest Performance: Realistic literary fiction (79.8% confidence) - the lighthouse story lacks obvious fantastical elements
- Strongest Performance: Academic/news content (97.8-97.9% confidence) - clear technical/factual language
- Edge Cases: Personal narratives and whimsical children's stories show slightly lower confidence
- Perfect Accuracy: 18/18 samples correctly classified despite confidence variations
How to Use
PyTorch
```python
import torch
from model import TinyByteCNN, preprocess_text

# Load model
model = TinyByteCNN.from_pretrained("username/tinybytecnn-fiction-detector")
model.eval()

# Prepare text
text = "Your text here..."
input_bytes = preprocess_text(text)  # Returns tensor of shape [1, 4096]

# Predict: sigmoid output > 0.5 means Non-Fiction, otherwise Fiction
with torch.no_grad():
    logits = model(input_bytes)
    probability = torch.sigmoid(logits).item()

    if probability > 0.5:
        print(f"Non-Fiction (confidence: {probability:.1%})")
    else:
        print(f"Fiction (confidence: {1 - probability:.1%})")
```
Batch Processing
```python
def classify_texts(texts, model, batch_size=32):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # preprocess_text returns [1, 4096], so concatenate along the batch dimension
        inputs = torch.cat([preprocess_text(t) for t in batch], dim=0)

        with torch.no_grad():
            logits = model(inputs)
            probs = torch.sigmoid(logits)

        for text, prob in zip(batch, probs):
            p = prob.item()
            results.append({
                'text': text[:100] + '...',
                'class': 'Non-Fiction' if p > 0.5 else 'Fiction',
                'confidence': p if p > 0.5 else 1 - p
            })

    return results
```
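For example (the texts below are placeholders):

```python
texts = [
    "The dragon circled the tower as the last light faded over the valley...",
    "Photosynthesis converts light energy into chemical energy stored as glucose.",
]
for r in classify_texts(texts, model):
    print(f"{r['class']} ({r['confidence']:.1%}): {r['text']}")
```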
Training Infrastructure
- Hardware: Apple M-series with 8GB MPS memory limit
- Training Time: ~20 minutes
- Framework: PyTorch 2.0+
Environmental Impact
- Hardware Type: Apple Silicon M-series
- Hours used: 0.33
- Carbon Emitted: Minimal (ARM-based efficiency, ~10W average)
Citation
```bibtex
@misc{tinybytecnn-fiction-2024,
  title={TinyByteCNN Fiction vs Non-Fiction Detector},
  author={Mitchell Currie},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/username/tinybytecnn-fiction-detector}
}
```
Acknowledgments
This model uses data from:
- HuggingFace Team (Cosmopedia dataset)
- Project Gutenberg
- Common Pile contributors
- CNN/Daily Mail dataset creators
License
Apache 2.0
Contact
For questions or issues, please open an issue on the model repository.
