Anime Style Classifier · EfficientNet-B0 v2

This checkpoint classifies anime artwork into six coarse styles – flat, grim, modern, moe, painterly, and retro. It replaces the original v1 release (older taxonomy, weaker validation) and reflects the "current" EfficientNet-B0 model used by the internal web classifier. The model was trained and benchmarked exclusively on synthetic anime imagery, so accuracy on hand-drawn or photo-based references depends on how closely they resemble the synthetic generation regime.

Highlights

  • Architecture: EfficientNet-B0 with a 6-way linear head (input 224×224 RGB).
  • Training mix: ~1.9k curated training images, with validation/holdout splits of 402/408 samples respectively. Augmentations included a blend of standard crops and tiled views, but inference works best with a single aspect-preserving center crop.
  • Metrics:
    • Validation accuracy: 96.27%
    • Holdout accuracy: 96.81%
    • Real-world spot check (12 human-labelled photos, two per style): 12/12 correct with aspect-fill inference. Sliding/multi-crop modes lagged (details below).

Files

  • pytorch_model.bin – PyTorch state dict (use with torchvision.models.efficientnet_b0 + custom classifier).
  • config.json – Metadata, label mappings, preprocessing constants.
  • requirements.txt – Minimal deps for the sample script.
  • inference.py – Reference CLI/SDK helper.
  • real_world_eval.json – Aspect vs sliding/multi-crop comparison on the 12-sample real-world set.

Usage

import json, torch
from PIL import Image
from torchvision import models, transforms

# Load label mappings and preprocessing constants shipped with the checkpoint.
with open('config.json') as f:
    cfg = json.load(f)

# Rebuild the architecture: stock EfficientNet-B0 with a 6-way linear head.
model = models.efficientnet_b0(weights=None)
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, cfg['num_labels'])
state = torch.load('pytorch_model.bin', map_location='cpu')
model.load_state_dict(state)
model.eval()

# Aspect-preserving resize of the shorter side followed by a center crop,
# i.e. the "resize" / aspect-fill mode recommended below.
preprocess = transforms.Compose([
    transforms.Resize(cfg['image_size']),
    transforms.CenterCrop(cfg['image_size']),
    transforms.ToTensor(),
    transforms.Normalize(cfg['mean'], cfg['std'])
])

img = Image.open('your_image.jpg').convert('RGB')
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))
    probs = torch.softmax(logits, dim=1)[0]

# id2label in config.json is assumed to be ordered by class index.
for label, prob in zip(cfg['id2label'].values(), probs):
    print(label, float(prob))

The accompanying inference.py exposes the same logic with a CLI (python inference.py path/to/image.png).

Inference modes (and why aspect-fill is preferred)

The internal web server exposes four preprocessing paths:

| Mode | Description | Real-world (12 imgs) |
|---|---|---|
| resize | Aspect-preserving resize + single center crop (a.k.a. aspect-fill). | 12/12 |
| slide-avg | Sliding windows (~⅓ of the shorter side) with logits averaged. | 8/12 |
| slide-mode | Sliding windows with majority vote over argmax labels. | 7/12 |
| multicrop | Center crop + 2×2 grid of full-res tiles, logits averaged. | 10/12 |

Despite being trained with a mix of global crops and tiles, the model is most stable on human-shot references when run with the simple aspect-fill path. Sliding windows tend to overweight local lighting cues (e.g., grim scenes drifting toward modern), while averaging tiled crops can wash out global color palettes. Multi-crop is still available for comparison, but the resize mode should be the default for production use.
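
For comparison, the sliding-window path can be approximated with the same checkpoint and preprocessing constants. The internal server's exact window size and stride are not published, so the ~⅓-of-the-shorter-side window and half-window stride below are assumptions; slide-mode would replace the logit averaging with a majority vote over each window's argmax.

import torch
from torchvision import transforms

def slide_avg(model, img, cfg, stride_frac=0.5):
    # Square windows of roughly one third of the shorter side, each resized to
    # the model's input resolution, with logits averaged across all windows.
    w, h = img.size
    win = max(1, min(w, h) // 3)
    stride = max(1, int(win * stride_frac))
    to_tensor = transforms.Compose([
        transforms.Resize((cfg['image_size'], cfg['image_size'])),
        transforms.ToTensor(),
        transforms.Normalize(cfg['mean'], cfg['std']),
    ])
    logits = []
    with torch.no_grad():
        for top in range(0, h - win + 1, stride):
            for left in range(0, w - win + 1, stride):
                tile = img.crop((left, top, left + win, top + win))
                logits.append(model(to_tensor(tile).unsqueeze(0)))
    return torch.stack(logits).mean(dim=0).softmax(dim=1)[0]

# probs = slide_avg(model, img, cfg)  # compare against the default aspect-fill path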

Per-image breakdown (real-world set)

| File | Ground truth | resize | slide-avg | slide-mode | multicrop |
|---|---|---|---|---|---|
| flat.jpg | flat | flat (98.9%) ✅ | flat (89.1%) ✅ | flat (100%) ✅ | flat (97.9%) ✅ |
| flat-2.jpeg | flat | flat (90.5%) ✅ | flat (55.6%) ✅ | flat (100%) ✅ | flat (84.7%) ✅ |
| grim.jpg | grim | grim (50.7%) ✅ | modern (82.7%) ❌ | modern (100%) ❌ | retro (49.5%) ❌ |
| grim-2.jpg | grim | grim (95.9%) ✅ | grim (66.7%) ✅ | grim (100%) ✅ | grim (72.5%) ✅ |
| modern.webp | modern | modern (91.1%) ✅ | modern (92.7%) ✅ | modern (100%) ✅ | modern (82.2%) ✅ |
| modern-2.jpeg | modern | modern (75.0%) ✅ | modern (82.3%) ✅ | modern (100%) ✅ | modern (93.5%) ✅ |
| moe.webp | moe | moe (86.5%) ✅ | moe (71.6%) ✅ | moe (100%) ✅ | moe (58.9%) ✅ |
| moe-2.jpeg | moe | moe (99.4%) ✅ | moe (94.5%) ✅ | moe (100%) ✅ | moe (65.0%) ✅ |
| painterly.webp | painterly | painterly (60.6%) ✅ | painterly (77.1%) ✅ | painterly (100%) ✅ | modern (61.0%) ❌ |
| painterly-2.jpg | painterly | painterly (81.9%) ✅ | painterly (77.7%) ✅ | painterly (100%) ✅ | painterly (93.6%) ✅ |
| retro.png | retro | retro (98.1%) ✅ | retro (72.4%) ✅ | retro (100%) ✅ | retro (97.4%) ✅ |
| retro-2.jpg | retro | retro (44.9%) ✅ | retro (79.2%) ✅ | retro (100%) ✅ | retro (92.6%) ✅ |

Raw JSON for this benchmark is stored in real_world_eval.json (same label order as the config).
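
To recompute the per-mode tallies from the raw file, something like the sketch below works, assuming (hypothetically) that the JSON maps each filename to its ground-truth label and per-mode probability vectors in the config's label order; check the file itself for the exact schema.

import json

with open('config.json') as f:
    labels = list(json.load(f)['id2label'].values())

with open('real_world_eval.json') as f:
    results = json.load(f)  # hypothetical schema: {file: {"truth": str, "modes": {mode: [probs]}}}

tally = {}
for entry in results.values():
    for mode, probs in entry['modes'].items():
        pred = labels[max(range(len(probs)), key=probs.__getitem__)]
        tally[mode] = tally.get(mode, 0) + (pred == entry['truth'])
print(tally)  # e.g. {'resize': 12, 'slide-avg': 8, ...}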

Intended use & limitations

  • Designed for anime-style classification tasks (dataset curation, filtering, analytics). Not a general-purpose art classifier.
  • Labels can overlap conceptually – e.g., modern vs painterly – so treat probabilities as soft cues rather than a strict taxonomy (see the sketch after this list).
  • Training data is synthetic. While the model performs well on a curated set of real-world references, distributions that diverge from the synthetic renders (e.g., photography, realistic illustration) may degrade accuracy.
  • Please review your downstream dataset licenses when sharing outputs.
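
A minimal sketch of consuming the outputs as soft cues rather than hard labels, continuing from the Usage snippet above (the top-k, probability, and margin thresholds are illustrative assumptions, not tuned values):

labels = list(cfg['id2label'].values())  # label order from config.json

def soft_labels(probs, labels, top_k=2, min_prob=0.20, margin=0.15):
    # Commit to the winner only when it clearly beats the runner-up;
    # otherwise report every sufficiently probable label as a soft cue.
    values, indices = torch.sort(probs, descending=True)
    if values[0] - values[1] >= margin:
        return [labels[int(indices[0])]]
    return [labels[int(i)] for v, i in zip(values[:top_k], indices[:top_k]) if v >= min_prob]

print(soft_labels(probs, labels))  # e.g. ['modern', 'painterly'] on an ambiguous piece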

Citation / attribution

If you use this checkpoint, cite it as Mitchins – Anime Style Classifier EfficientNet-B0 v2 and link back to the Hugging Face repo (e.g., hf.co/Mitchins/anime-style-classifier-efficientnet-b0-v2).
