---
license: mit
datasets:
- mozilla-foundation/common_voice_17_0
language:
- en
- es
- ar
- fr
- de
- it
- pt
- ru
- zh
- ja
metrics:
- accuracy
base_model:
- hubertsiuzdak/snac_24khz
pipeline_tag: audio-classification
tags:
- audio
- language
- classification
---

# Audio Language Classifier (SNAC backbone, Common Voice 17.0)

First iteration of a lightweight (7M-parameter) model for identifying the language spoken in an audio clip. Code is available on [GitHub](https://github.com/surus-lat/audio-language-classification).

In short:
- Identifies the spoken language in audio (10 languages)
- Backbone: SNAC (hubertsiuzdak/snac_24khz) with attention pooling
- Dataset: Mozilla Common Voice 17.0 (streaming)
- Sample rate: 24 kHz; max audio length: 10 s (pad/trim)
- Mixed precision: FP16
- Best validation accuracy: 0.57

Supported languages (labels):
- en, es, fr, de, it, pt, ru, zh-CN, ja, ar

Intended use:
- Classify the language of short speech segments (≤10 s).
- Not for ASR or dialect/variant classification.

Out-of-scope:
- Very long audio, code-switching, overlapping speakers, noisy or music-heavy inputs.

Data:
- Source: Mozilla Common Voice 17.0 (streaming; per-language subsets).
- License: CC0 (check the dataset card for details).
- Splits: official validation/test splits are used (use_official_splits: true); the dataset's Parquet branch is used to handle the large file sizes.
- Slice of each split used during training: 50%.

Model architecture (an illustrative sketch appears after the usage example below):
- Backbone: SNAC encoder (pretrained).
- Pooling: attention pooling over time.
- Head:
  - Linear(feature_dim → 512), ReLU, Dropout(0.1)
  - Linear(512 → 256), ReLU, Dropout(0.1)
  - Linear(256 → 10)
- Selective tuning:
  - Start frozen (backbone_tune_strategy: "frozen")
  - Unfreeze at epoch 2: "last_n_blocks" with last_n_blocks: 1
- Gradient checkpointing enabled for the backbone.

Training setup (see the optimizer/scheduler sketch after the usage example below):
- Batch size: 48
- Epochs: up to 100 (early stopping patience: 15)
- Streaming steps per epoch: 2000
- Optimizer: AdamW (betas: 0.9, 0.999; eps: 1e-8)
- Learning rate: head 1e-4; backbone 2e-5 (after unfreeze)
- Scheduler: cosine with warmup (num_warmup_steps: 2000)
- Label smoothing: 0.1
- Max grad norm: 1.0
- Seed: 42
- Hardware: 1x RTX 3090; FP16 enabled

Preprocessing:
- Mono waveform at 24 kHz; pad/trim to 10 s (see the waveform-preparation sketch after the usage example below).
- Normalization handled by torchaudio/Tensor transforms in the pipeline.

Evaluation results:
- Validation:
  - Best accuracy: 0.5016
- Test:
  - accuracy: 0.3830
  - f1_micro: 0.3830
  - f1_macro: 0.3624
  - f1_weighted: 0.3666
  - loss: 2.2467

Files and checkpoints:
- Checkpoints dir: ./training
  - best_model.pt
  - language_mapping.txt (idx: language)
  - final_results.txt

How to use (inference):

```python
import torch
from models import LanguageClassifier

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build and load from a directory containing best_model.pt and language_mapping.txt
model = LanguageClassifier.from_pretrained("training", device=device)

# Single-file prediction (auto resample to 24k, pad/trim to 10s)
label, prob = model.predict("example.wav", max_length_seconds=10.0, top_k=1)
print(label, prob)

# Top-3
top3 = model.predict("example.wav", top_k=3)
print(top3)  # [('en', 0.62), ('de', 0.21), ('fr', 0.08)]

# If you already have a waveform tensor:
# wav: torch.Tensor [T] at 24kHz (or provide sample_rate to auto-resample)
# model.predict handles [T] or [B,T]
# label, prob = model.predict(wav, sample_rate=orig_sr, top_k=1)
```
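If you want to prepare the waveform tensor yourself (matching the Preprocessing section: mono, 24 kHz, pad/trim to 10 s), a minimal sketch using torchaudio follows. The helper name `prepare_waveform` is illustrative and not part of the repository; `model.predict` already performs equivalent preprocessing internally.

```python
import torch
import torchaudio

def prepare_waveform(path: str, target_sr: int = 24_000, max_seconds: float = 10.0) -> torch.Tensor:
    """Illustrative helper: load audio, downmix to mono, resample to 24 kHz, pad/trim to 10 s."""
    wav, sr = torchaudio.load(path)        # [channels, T]
    wav = wav.mean(dim=0)                  # downmix to mono -> [T]
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=target_sr)
    max_len = int(target_sr * max_seconds)  # 240_000 samples for 10 s at 24 kHz
    if wav.numel() > max_len:
        wav = wav[:max_len]                                             # trim
    else:
        wav = torch.nn.functional.pad(wav, (0, max_len - wav.numel()))  # right-pad with zeros
    return wav                              # [240_000]

# wav = prepare_waveform("example.wav")
# label, prob = model.predict(wav, sample_rate=24_000, top_k=1)
```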
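For reference, the attention pooling and MLP head described under Model architecture could look roughly like the sketch below. This is a reading of the bullet list in that section, not the repository's implementation; `feature_dim` (the SNAC encoder's frame-level feature size) and the class names are assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Score each frame, softmax over time, and return the attention-weighted sum."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: [B, T, feature_dim] frame-level features from the SNAC encoder
        weights = torch.softmax(self.score(frames), dim=1)  # [B, T, 1]
        return (weights * frames).sum(dim=1)                # [B, feature_dim]

class ClassificationHead(nn.Module):
    """MLP head with the layer sizes listed in the Model architecture section."""
    def __init__(self, feature_dim: int, num_languages: int = 10, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(256, num_languages),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.net(pooled)  # [B, num_languages] logits
```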
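Similarly, the optimizer, scheduler, and loss listed under Training setup correspond to a setup along these lines. The `backbone`/`head` placeholders and the total step count are assumptions; the actual wiring lives in train.py.

```python
import torch
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

# Placeholders standing in for the real SNAC backbone and classification head.
backbone = torch.nn.Linear(8, 8)
head = torch.nn.Linear(8, 10)

# Separate learning rates: head 1e-4, backbone 2e-5 (applied once the backbone is unfrozen).
optimizer = AdamW(
    [
        {"params": head.parameters(), "lr": 1e-4},
        {"params": backbone.parameters(), "lr": 2e-5},
    ],
    betas=(0.9, 0.999),
    eps=1e-8,
)

# Cosine schedule with 2000 warmup steps; total steps assumed to be steps-per-epoch x epochs.
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=2000 * 100
)

# Cross-entropy with label smoothing 0.1; gradients clipped each step to a max norm of 1.0:
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
# torch.nn.utils.clip_grad_norm_([*backbone.parameters(), *head.parameters()], max_norm=1.0)
```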
Limitations and risks:
- Accuracy varies across speakers, accents, microphones, and noise conditions.
- May misclassify short utterances or code-switched speech.
- Not suitable for sensitive decision-making without human review.

Reproducibility:
- Default config: ./config.yaml
- Training script: ./train.py
- To visualize internals: CHECKPOINT_DIR=training/ python viualization.py

Citation and acknowledgements:
- [SNAC: hubertsiuzdak/snac_24khz](https://github.com/hubertsiuzdak/snac/)
- [Dataset: Mozilla Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)