Anime Style Classifier · EfficientNet-B0 v2
This checkpoint classifies anime artwork into six coarse styles – flat, grim, modern, moe, painterly, and retro. It replaces the original v1 release (older taxonomy, weaker validation) and reflects the "current" EfficientNet-B0 model used by the internal web classifier. The model was trained and benchmarked exclusively on synthetic anime imagery, so accuracy on hand-drawn or photo-based references depends on how closely they resemble the synthetic generation regime.
Highlights
- Architecture: EfficientNet-B0 with a 6-way linear head (input 224×224 RGB).
- Training mix: ~1.9k curated training images, with validation/holdout splits of 402/408 samples respectively. Augmentations included a blend of standard crops and tiled views, but inference works best with a single aspect-preserving center crop.
- Metrics:
- Validation accuracy: 96.27%
- Holdout accuracy: 96.81%
- Real-world spot check (12 human-labelled photos, two per style): 12/12 correct with aspect-fill inference. Sliding/multi-crop modes lagged (details below).
Files
- `pytorch_model.bin` – PyTorch state dict (use with `torchvision.models.efficientnet_b0` + custom classifier).
- `config.json` – Metadata, label mappings, preprocessing constants.
- `requirements.txt` – Minimal deps for the sample script.
- `inference.py` – Reference CLI/SDK helper.
- `real_world_eval.json` – Aspect vs sliding/multi-crop comparison on the 12-sample real-world set.
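The usage snippet below reads `num_labels`, `image_size`, `mean`, `std`, and `id2label` from `config.json`. For orientation, the structure it expects looks roughly like the Python dict below; the label order and the ImageNet-style normalization constants are assumptions here, so read the actual values from the shipped file:

```python
# Rough shape of config.json, shown as a Python dict. Values marked "assumed"
# are illustrative; read the real ones from the file shipped with the model.
EXAMPLE_CONFIG = {
    "num_labels": 6,
    "image_size": 224,
    "mean": [0.485, 0.456, 0.406],   # assumed ImageNet normalization
    "std": [0.229, 0.224, 0.225],    # assumed ImageNet normalization
    "id2label": {                    # label set from the model card; order assumed
        "0": "flat", "1": "grim", "2": "modern",
        "3": "moe", "4": "painterly", "5": "retro",
    },
}
```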
Usage
```python
import json, torch
from PIL import Image
from torchvision import models, transforms

# Load label mappings and preprocessing constants shipped with the checkpoint.
with open('config.json') as f:
    cfg = json.load(f)

# EfficientNet-B0 backbone with a 6-way linear head replacing the ImageNet classifier.
model = models.efficientnet_b0(weights=None)
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, cfg['num_labels'])
state = torch.load('pytorch_model.bin', map_location='cpu')
model.load_state_dict(state)
model.eval()

# Aspect-fill preprocessing: resize the short side, then take a single center crop
# (the `resize` mode recommended below).
preprocess = transforms.Compose([
    transforms.Resize(cfg['image_size']),
    transforms.CenterCrop(cfg['image_size']),
    transforms.ToTensor(),
    transforms.Normalize(cfg['mean'], cfg['std'])
])

img = Image.open('your_image.jpg').convert('RGB')
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))
    probs = torch.softmax(logits, dim=1)[0]

for label, prob in zip(cfg['id2label'].values(), probs):
    print(label, float(prob))
```
The accompanying `inference.py` exposes the same logic with a CLI (`python inference.py path/to/image.png`).
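If you prefer to embed the same flow in your own tooling, a minimal wrapper might look like the sketch below. It is not the shipped `inference.py` (whose exact flags are not documented here); the `load_model`/`predict` helpers and the single positional argument are illustrative choices.

```python
# Illustrative CLI/SDK wrapper around the usage snippet above; not the shipped
# inference.py (its exact flags are undocumented here), just a sketch.
import argparse, json, torch
from PIL import Image
from torchvision import models, transforms


def load_model(config_path='config.json', weights_path='pytorch_model.bin'):
    cfg = json.load(open(config_path))
    model = models.efficientnet_b0(weights=None)
    model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, cfg['num_labels'])
    model.load_state_dict(torch.load(weights_path, map_location='cpu'))
    model.eval()
    return model, cfg


def predict(model, cfg, image_path):
    # Same aspect-fill preprocessing as the usage snippet.
    preprocess = transforms.Compose([
        transforms.Resize(cfg['image_size']),
        transforms.CenterCrop(cfg['image_size']),
        transforms.ToTensor(),
        transforms.Normalize(cfg['mean'], cfg['std']),
    ])
    img = Image.open(image_path).convert('RGB')
    with torch.no_grad():
        probs = torch.softmax(model(preprocess(img).unsqueeze(0)), dim=1)[0]
    return dict(zip(cfg['id2label'].values(), probs.tolist()))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Anime style classifier (illustrative CLI)')
    parser.add_argument('image', help='Path to the image to classify')
    args = parser.parse_args()
    model, cfg = load_model()
    for label, prob in sorted(predict(model, cfg, args.image).items(), key=lambda kv: -kv[1]):
        print(f'{label}: {prob:.3f}')
```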
Inference modes (and why aspect-fill is preferred)
The internal web server exposes four preprocessing paths:
| Mode | Description | Real-world (12 imgs) |
|---|---|---|
| `resize` | Aspect-preserving resize + single center crop (a.k.a. aspect-fill). | 12/12 |
| `slide-avg` | Sliding windows (~⅓ min side) with logits averaged. | 8/12 |
| `slide-mode` | Sliding windows with majority vote over argmax labels. | 7/12 |
| `multicrop` | Center crop + 2×2 grid of full-res tiles, logits averaged. | 10/12 |
Despite being trained with a mix of global crops and tiles, the model is most stable on human-shot references when run with the simple aspect-fill path. Sliding windows tend to overweight local lighting cues (e.g., grim scenes drifting toward modern), while averaging tiled crops can wash out global color palettes. Multi-crop is still available for comparison, but the resize mode should be the default for production use.
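To make the difference between these paths concrete, the sketch below shows one plausible way the multi-view modes could be implemented on top of the single-crop model from the Usage section. It illustrates the aggregation strategies (averaged logits vs. majority vote over per-window argmax); it is not the internal server's actual code, and the window stride and tile layout are assumptions.

```python
# Illustrative aggregation strategies for the multi-view modes; not the
# internal server implementation. Window stride and tiling are assumptions.
import torch
from torchvision import transforms


def _to_tensor(crop, cfg):
    tf = transforms.Compose([
        transforms.Resize((cfg['image_size'], cfg['image_size'])),
        transforms.ToTensor(),
        transforms.Normalize(cfg['mean'], cfg['std']),
    ])
    return tf(crop)


def sliding_windows(img, frac=1 / 3):
    """Square windows of ~frac * min(W, H), stepped half a window at a time (assumed stride)."""
    w, h = img.size
    win = max(1, int(min(w, h) * frac))
    step = max(1, win // 2)
    for top in range(0, max(h - win, 0) + 1, step):
        for left in range(0, max(w - win, 0) + 1, step):
            yield img.crop((left, top, left + win, top + win))


def classify_windows(model, cfg, img, vote='avg'):
    """'avg' ~ slide-avg (mean logits); 'mode' ~ slide-mode (majority vote over argmax)."""
    batch = torch.stack([_to_tensor(c, cfg) for c in sliding_windows(img)])
    with torch.no_grad():
        logits = model(batch)
    if vote == 'avg':
        idx = int(logits.mean(dim=0).argmax())
    else:
        idx = int(torch.mode(logits.argmax(dim=1)).values)
    return list(cfg['id2label'].values())[idx]


def classify_multicrop(model, cfg, img):
    """Center crop plus a 2x2 grid of tiles, logits averaged (~ multicrop)."""
    w, h = img.size
    side = min(w, h)
    center = img.crop(((w - side) // 2, (h - side) // 2,
                       (w - side) // 2 + side, (h - side) // 2 + side))
    tiles = [img.crop((x, y, x + w // 2, y + h // 2))
             for y in (0, h - h // 2) for x in (0, w - w // 2)]
    batch = torch.stack([_to_tensor(c, cfg) for c in [center] + tiles])
    with torch.no_grad():
        logits = model(batch)
    return list(cfg['id2label'].values())[int(logits.mean(dim=0).argmax())]
```

For production, the plain aspect-fill path from the Usage section remains the recommended default; these helpers only illustrate how the alternative modes aggregate their views.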
Per-image breakdown (real-world set)
| File | Ground truth | resize | slide-avg | slide-mode | multicrop |
|---|---|---|---|---|---|
| flat.jpg | flat | flat (98.9%) ✅ | flat (89.1%) ✅ | flat (100%) ✅ | flat (97.9%) ✅ |
| flat-2.jpeg | flat | flat (90.5%) ✅ | flat (55.6%) ✅ | flat (100%) ✅ | flat (84.7%) ✅ |
| grim.jpg | grim | grim (50.7%) ✅ | modern (82.7%) ❌ | modern (100%) ❌ | retro (49.5%) ❌ |
| grim-2.jpg | grim | grim (95.9%) ✅ | grim (66.7%) ✅ | grim (100%) ✅ | grim (72.5%) ✅ |
| modern.webp | modern | modern (91.1%) ✅ | modern (92.7%) ✅ | modern (100%) ✅ | modern (82.2%) ✅ |
| modern-2.jpeg | modern | modern (75.0%) ✅ | modern (82.3%) ✅ | modern (100%) ✅ | modern (93.5%) ✅ |
| moe.webp | moe | moe (86.5%) ✅ | moe (71.6%) ✅ | moe (100%) ✅ | moe (58.9%) ✅ |
| moe-2.jpeg | moe | moe (99.4%) ✅ | moe (94.5%) ✅ | moe (100%) ✅ | moe (65.0%) ✅ |
| painterly.webp | painterly | painterly (60.6%) ✅ | painterly (77.1%) ✅ | painterly (100%) ✅ | modern (61.0%) ❌ |
| painterly-2.jpg | painterly | painterly (81.9%) ✅ | painterly (77.7%) ✅ | painterly (100%) ✅ | painterly (93.6%) ✅ |
| retro.png | retro | retro (98.1%) ✅ | retro (72.4%) ✅ | retro (100%) ✅ | retro (97.4%) ✅ |
| retro-2.jpg | retro | retro (44.9%) ✅ | retro (79.2%) ✅ | retro (100%) ✅ | retro (92.6%) ✅ |
Raw JSON for this benchmark is stored in `real_world_eval.json` (same label order as the config).
Intended use & limitations
- Designed for anime-style classification tasks (dataset curation, filtering, analytics). Not a general-purpose art classifier.
- Labels can overlap conceptually (e.g., `modern` vs `painterly`), so treat probabilities as soft cues rather than a strict taxonomy; see the sketch after this list.
- Training data is synthetic. While the model performs well on a curated set of real-world references, distributions that diverge from the synthetic renders (e.g., photography, realistic illustration) may degrade accuracy.
- Please review your downstream dataset licenses when sharing outputs.
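As one illustration of treating the outputs as soft cues rather than hard labels, downstream code might keep the top two styles whenever the margin between them is small. The 0.15 margin here is an arbitrary example value, not one recommended or tuned for this model.

```python
# Illustrative post-processing only: the 0.15 margin is an arbitrary example
# threshold, not a value recommended or tuned by the model authors.
def soft_labels(probs_by_label, margin=0.15):
    """Return the top style, plus the runner-up when the gap between them is small."""
    ranked = sorted(probs_by_label.items(), key=lambda kv: kv[1], reverse=True)
    (top_label, top_p), (second_label, second_p) = ranked[0], ranked[1]
    if top_p - second_p < margin:
        return [top_label, second_label]   # ambiguous: report both candidate styles
    return [top_label]


# Example with made-up probabilities (not model output):
print(soft_labels({'modern': 0.46, 'painterly': 0.41, 'flat': 0.05,
                   'grim': 0.04, 'moe': 0.02, 'retro': 0.02}))
# -> ['modern', 'painterly']
```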
Citation / attribution
If you use this checkpoint, cite it as Mitchins – Anime Style Classifier EfficientNet-B0 v2 and link back to the Hugging Face repo (e.g., hf.co/Mitchins/anime-style-classifier-efficientnet-b0-v2).