Anime Style Classifier · EfficientNet-B0 v2

This checkpoint classifies anime artwork into six coarse styles – flat, grim, modern, moe, painterly, and retro. It replaces the original v1 release (older taxonomy, weaker validation) and reflects the "current" EfficientNet-B0 model used by the internal web classifier. The model was trained and benchmarked exclusively on synthetic anime imagery, so accuracy on hand-drawn or photo-based references depends on how closely they resemble the synthetic generation regime.

Highlights

  • Architecture: EfficientNet-B0 with a 6-way linear head (input 224×224 RGB).
  • Training mix: ~1.9k curated training images, with validation/holdout splits of 402/408 samples respectively. Augmentations included a blend of standard crops and tiled views, but inference works best with a single aspect-preserving center crop.
  • Metrics:
    • Validation accuracy: 96.27%
    • Holdout accuracy: 96.81%
    • Real-world spot check (12 human-labelled photos, two per style): 12/12 correct with aspect-fill inference. Sliding/multi-crop modes lagged (details below).

Files

  • pytorch_model.bin – PyTorch state dict (use with torchvision.models.efficientnet_b0 + custom classifier).
  • config.json – Metadata, label mappings, preprocessing constants.
  • requirements.txt – Minimal deps for the sample script.
  • inference.py – Reference CLI/SDK helper.
  • real_world_eval.json – Aspect vs sliding/multi-crop comparison on the 12-sample real-world set.

Usage

import json, torch
from PIL import Image
from torchvision import models, transforms

# Load label mappings and preprocessing constants shipped with the checkpoint.
with open('config.json') as f:
    cfg = json.load(f)

# Rebuild the architecture: stock EfficientNet-B0 with a 6-way linear head.
model = models.efficientnet_b0(weights=None)
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, cfg['num_labels'])
state = torch.load('pytorch_model.bin', map_location='cpu')
model.load_state_dict(state)
model.eval()

# Aspect-preserving resize of the shorter side followed by a center crop,
# i.e. the "resize" / aspect-fill mode recommended below.
preprocess = transforms.Compose([
    transforms.Resize(cfg['image_size']),
    transforms.CenterCrop(cfg['image_size']),
    transforms.ToTensor(),
    transforms.Normalize(cfg['mean'], cfg['std'])
])

img = Image.open('your_image.jpg').convert('RGB')
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))
    probs = torch.softmax(logits, dim=1)[0]

# id2label in config.json is assumed to be ordered by class index.
for label, prob in zip(cfg['id2label'].values(), probs):
    print(label, float(prob))

The accompanying inference.py exposes the same logic with a CLI (python inference.py path/to/image.png).

Inference modes (and why aspect-fill is preferred)

The internal web server exposes four preprocessing paths:

| Mode | Description | Real-world (12 imgs) |
|---|---|---|
| resize | Aspect-preserving resize + single center crop (a.k.a. aspect-fill). | 12/12 |
| slide-avg | Sliding windows (~⅓ of the shorter side) with logits averaged. | 8/12 |
| slide-mode | Sliding windows with majority vote over argmax labels. | 7/12 |
| multicrop | Center crop + 2×2 grid of full-res tiles, logits averaged. | 10/12 |

Despite being trained with a mix of global crops and tiles, the model is most stable on human-shot references when run with the simple aspect-fill path. Sliding windows tend to overweight local lighting cues (e.g., grim scenes drifting toward modern), while averaging tiled crops can wash out global color palettes. Multi-crop is still available for comparison, but the resize mode should be the default for production use.
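
For comparison, the sliding-window path can be approximated with the same checkpoint and preprocessing constants. The internal server's exact window size and stride are not published, so the ~⅓-of-the-shorter-side window and half-window stride below are assumptions; slide-mode would replace the logit averaging with a majority vote over each window's argmax.

import torch
from torchvision import transforms

def slide_avg(model, img, cfg, stride_frac=0.5):
    # Square windows of roughly one third of the shorter side, each resized to
    # the model's input resolution, with logits averaged across all windows.
    w, h = img.size
    win = max(1, min(w, h) // 3)
    stride = max(1, int(win * stride_frac))
    to_tensor = transforms.Compose([
        transforms.Resize((cfg['image_size'], cfg['image_size'])),
        transforms.ToTensor(),
        transforms.Normalize(cfg['mean'], cfg['std']),
    ])
    logits = []
    with torch.no_grad():
        for top in range(0, h - win + 1, stride):
            for left in range(0, w - win + 1, stride):
                tile = img.crop((left, top, left + win, top + win))
                logits.append(model(to_tensor(tile).unsqueeze(0)))
    return torch.stack(logits).mean(dim=0).softmax(dim=1)[0]

# probs = slide_avg(model, img, cfg)  # compare against the default aspect-fill path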

Per-image breakdown (real-world set)

| File | Ground truth | resize | slide-avg | slide-mode | multicrop |
|---|---|---|---|---|---|
| flat.jpg | flat | flat (98.9%) ✅ | flat (89.1%) ✅ | flat (100%) ✅ | flat (97.9%) ✅ |
| flat-2.jpeg | flat | flat (90.5%) ✅ | flat (55.6%) ✅ | flat (100%) ✅ | flat (84.7%) ✅ |
| grim.jpg | grim | grim (50.7%) ✅ | modern (82.7%) ❌ | modern (100%) ❌ | retro (49.5%) ❌ |
| grim-2.jpg | grim | grim (95.9%) ✅ | grim (66.7%) ✅ | grim (100%) ✅ | grim (72.5%) ✅ |
| modern.webp | modern | modern (91.1%) ✅ | modern (92.7%) ✅ | modern (100%) ✅ | modern (82.2%) ✅ |
| modern-2.jpeg | modern | modern (75.0%) ✅ | modern (82.3%) ✅ | modern (100%) ✅ | modern (93.5%) ✅ |
| moe.webp | moe | moe (86.5%) ✅ | moe (71.6%) ✅ | moe (100%) ✅ | moe (58.9%) ✅ |
| moe-2.jpeg | moe | moe (99.4%) ✅ | moe (94.5%) ✅ | moe (100%) ✅ | moe (65.0%) ✅ |
| painterly.webp | painterly | painterly (60.6%) ✅ | painterly (77.1%) ✅ | painterly (100%) ✅ | modern (61.0%) ❌ |
| painterly-2.jpg | painterly | painterly (81.9%) ✅ | painterly (77.7%) ✅ | painterly (100%) ✅ | painterly (93.6%) ✅ |
| retro.png | retro | retro (98.1%) ✅ | retro (72.4%) ✅ | retro (100%) ✅ | retro (97.4%) ✅ |
| retro-2.jpg | retro | retro (44.9%) ✅ | retro (79.2%) ✅ | retro (100%) ✅ | retro (92.6%) ✅ |

Raw JSON for this benchmark is stored in real_world_eval.json (same label order as the config).
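
To recompute the per-mode tallies from the raw file, something like the sketch below works, assuming (hypothetically) that the JSON maps each filename to its ground-truth label and per-mode probability vectors in the config's label order; check the file itself for the exact schema.

import json

with open('config.json') as f:
    labels = list(json.load(f)['id2label'].values())

with open('real_world_eval.json') as f:
    results = json.load(f)  # hypothetical schema: {file: {"truth": str, "modes": {mode: [probs]}}}

tally = {}
for entry in results.values():
    for mode, probs in entry['modes'].items():
        pred = labels[max(range(len(probs)), key=probs.__getitem__)]
        tally[mode] = tally.get(mode, 0) + (pred == entry['truth'])
print(tally)  # e.g. {'resize': 12, 'slide-avg': 8, ...}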

Intended use & limitations

  • Designed for anime-style classification tasks (dataset curation, filtering, analytics). Not a general-purpose art classifier.
  • Labels can overlap conceptually – e.g., modern vs painterly – so treat probabilities as soft cues rather than a strict taxonomy (see the sketch after this list).
  • Training data is synthetic. While the model performs well on a curated set of real-world references, distributions that diverge from the synthetic renders (e.g., photography, realistic illustration) may degrade accuracy.
  • Please review your downstream dataset licenses when sharing outputs.
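
A minimal sketch of consuming the outputs as soft cues rather than hard labels, continuing from the Usage snippet above (the top-k, probability, and margin thresholds are illustrative assumptions, not tuned values):

labels = list(cfg['id2label'].values())  # label order from config.json

def soft_labels(probs, labels, top_k=2, min_prob=0.20, margin=0.15):
    # Commit to the winner only when it clearly beats the runner-up;
    # otherwise report every sufficiently probable label as a soft cue.
    values, indices = torch.sort(probs, descending=True)
    if values[0] - values[1] >= margin:
        return [labels[int(indices[0])]]
    return [labels[int(i)] for v, i in zip(values[:top_k], indices[:top_k]) if v >= min_prob]

print(soft_labels(probs, labels))  # e.g. ['modern', 'painterly'] on an ambiguous piece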

Citation / attribution

If you use this checkpoint, cite it as Mitchins – Anime Style Classifier EfficientNet-B0 v2 and link back to the Hugging Face repo (e.g., hf.co/Mitchins/anime-style-classifier-efficientnet-b0-v2).
