marianbasti/audio-language-classification

+---
+license: mit
+datasets:
+- mozilla-foundation/common_voice_17_0
+language:
+- en
+- es
+- ar
+- fr
+- de
+- it
+- pt
+- ru
+- zh
+- ja
+metrics:
+- accuracy
+base_model:
+- hubertsiuzdak/snac_24khz
+pipeline_tag: audio-classification
+tags:
+- audio
+- language
+- classification
+---
+# Audio Language Classifier (SNAC backbone, Common Voice 17.0)
+Summary:
+- Task: Spoken language identification (10 languages)
+- Backbone: SNAC (hubertsiuzdak/snac_24khz) with attention pooling
+- Dataset: Mozilla Common Voice 17.0 (streaming)
+- Sample rate: 24 kHz; Max audio length: 10 s (pad/trim)
+- Mixed precision: FP16
+- Best validation accuracy: 0.5016
+- Test accuracy: 0.3830
+Supported languages (labels):
+- en, es, fr, de, it, pt, ru, zh-CN, ja, ar
+Intended use:
+- Classify the language of short speech segments (≤10 s).
+- Not for ASR or dialect/variant classification.
+Out-of-scope:
+- Very long audio, code-switching, overlapping speakers, noisy or music-heavy inputs.
+Data:
+- Source: Mozilla Common Voice 17.0 (streaming; per-language subset).
+- License: CC0-1.0 (see the dataset card for details).
+- Splits: Official validation/test splits used (use_official_splits: true).
+- Per-split percent slice (an optional setting) used during training: 25% of each split.
+Model architecture:
+- Backbone: SNAC encoder (pretrained).
+- Pooling: Attention pooling over time.
+- Head:
+  - Linear(feature_dim → 512), ReLU, Dropout(0.1)
+  - Linear(512 → 256), ReLU, Dropout(0.1)
+  - Linear(256 → 10)
+- Selective tuning:
+  - Start frozen (backbone_tune_strategy: "frozen")
+  - At epoch 5, switch to "last_n_blocks" with last_n_blocks: 1 (only the last backbone block is unfrozen)
+  - Gradient checkpointing enabled for backbone.
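The pooling and head listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's `LanguageClassifier`: the names `AttentionPooling`, `ClassifierHead`, and `set_trainable`, the single-linear scoring layer, and `feature_dim=768` are all assumptions, since the card does not state the encoder output width or the exact form of the attention pooling.

```python
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Collapse [B, T, D] frame features into [B, D] using learned weights over time."""

    def __init__(self, feature_dim: int):
        super().__init__()
        # One scalar score per frame (assumed form; the card does not specify it).
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(x), dim=1)  # [B, T, 1], sums to 1 over T
        return (weights * x).sum(dim=1)                # [B, D]


class ClassifierHead(nn.Module):
    """Attention pooling followed by an MLP with the layer sizes listed in the card."""

    def __init__(self, feature_dim: int, num_classes: int = 10, dropout: float = 0.1):
        super().__init__()
        self.pool = AttentionPooling(feature_dim)
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.pool(features))  # logits, [B, num_classes]


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module (one way to do selective tuning)."""
    for p in module.parameters():
        p.requires_grad = trainable


head = ClassifierHead(feature_dim=768)  # 768 is a placeholder width, not from the card
features = torch.randn(2, 50, 768)      # stand-in for encoder output [B, T, D]
logits = head(features)
print(logits.shape)
```

With a helper like `set_trainable`, a "frozen" start would correspond to calling it with `False` on the whole backbone, and the epoch-5 step to calling it with `True` on the last block only; how the repository selects that block is not described here.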
+Training setup:
+- Batch size: 48
+- Epochs: up to 100 (early stopping patience: 15)
+- Streaming steps per epoch: 500
+- Optimizer: AdamW (betas: 0.9, 0.999; eps: 1e-8)
+- Learning rate: head 1e-4; backbone 2e-5 (after unfreeze)
+- Scheduler: cosine with warmup (num_warmup_steps: 2000)
+- Label smoothing: 0.1
+- Max grad norm: 1.0
+- Seed: 42
+- Hardware: CUDA if available; FP16 enabled
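To make the learning-rate settings above concrete, here is a small sketch of a cosine schedule with linear warmup. It assumes the common shape (linear ramp from 0, then a half-cosine decay to 0) and a total of 100 x 500 = 50,000 steps, which is the upper bound if early stopping never triggers; the function name `cosine_warmup_factor` is hypothetical and the training script may compute its schedule differently.

```python
import math


def cosine_warmup_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Half-cosine decay from 1 to 0 over the remaining steps.
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress))))


WARMUP_STEPS = 2000            # from the card
TOTAL_STEPS = 100 * 500        # assumed: max epochs x streaming steps per epoch
HEAD_LR, BACKBONE_LR = 1e-4, 2e-5

for step in (0, 1000, 2000, 26000, 50000):
    factor = cosine_warmup_factor(step, WARMUP_STEPS, TOTAL_STEPS)
    print(step, f"head={HEAD_LR * factor:.2e}", f"backbone={BACKBONE_LR * factor:.2e}")
```

Under these assumptions, and if the scheduler advances once per streaming step, warmup spans the first 2000 / 500 = 4 epochs.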
+Preprocessing:
+- Mono waveform at 24 kHz; pad/trim to 10 s.
+- Normalization is handled by torchaudio/Tensor transforms in the pipeline.
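The mono conversion and pad/trim step can be illustrated as below. This is a sketch under assumptions rather than the repository's pipeline: it expects audio already at 24 kHz (resampling is omitted), averages channels to get mono, keeps the first 10 s of longer clips, and zero-pads shorter clips on the right. The helper name `to_fixed_length_mono` is hypothetical.

```python
import torch
import torch.nn.functional as F

TARGET_SR = 24_000
MAX_SECONDS = 10.0


def to_fixed_length_mono(wav: torch.Tensor, max_seconds: float = MAX_SECONDS,
                         sample_rate: int = TARGET_SR) -> torch.Tensor:
    """Return a 1-D mono tensor of exactly max_seconds * sample_rate samples."""
    if wav.dim() == 2:
        # [channels, T] -> [T] by averaging channels (assumed downmix).
        wav = wav.mean(dim=0)
    target_len = int(max_seconds * sample_rate)
    if wav.shape[-1] >= target_len:
        return wav[:target_len]                          # keep the first 10 s
    return F.pad(wav, (0, target_len - wav.shape[-1]))   # zero-pad on the right


padded = to_fixed_length_mono(torch.randn(2, 3 * TARGET_SR))  # 3 s stereo clip
clipped = to_fixed_length_mono(torch.randn(15 * TARGET_SR))   # 15 s mono clip
print(padded.shape, clipped.shape)
```

Both outputs have 10 x 24,000 = 240,000 samples, which is the fixed input length implied by the settings above.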
+Evaluation results:
+- Validation:
+  - Best accuracy: 0.5016
+- Test:
+  - accuracy: 0.3830
+  - f1_micro: 0.3830
+  - f1_macro: 0.3624
+  - f1_weighted: 0.3666
+  - loss: 2.2467
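The reported f1_micro equals the accuracy, which is expected for single-label multiclass predictions: every error counts as one false positive and one false negative, so pooled precision, recall, and F1 all reduce to accuracy. The sketch below uses the standard definitions on made-up labels to show this; it is not the evaluation code used for the numbers above, and `classification_scores` is a hypothetical helper.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Accuracy plus micro, macro, and weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    n = len(y_true)
    return {
        "accuracy": sum(tp.values()) / n,
        # Micro-F1 pools the counts over all classes before computing F1.
        "f1_micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # Macro-F1 is the unweighted mean of per-class F1.
        "f1_macro": sum(per_class.values()) / len(labels),
        # Weighted-F1 weights each class by its number of true examples.
        "f1_weighted": sum(per_class[c] * support[c] for c in labels) / n,
    }


# Toy labels for illustration only.
y_true = ["en", "en", "en", "es", "es", "fr"]
y_pred = ["en", "en", "es", "es", "fr", "fr"]
scores = classification_scores(y_true, y_pred)
print(scores)
```

The gap between f1_macro and accuracy in a real evaluation indicates that performance is uneven across languages.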
+Files and checkpoints:
+- Checkpoints dir: ./training
+  - best_model.pt
+  - language_mapping.txt (idx: language)
+  - final_results.txt
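For working with the label file outside the provided classes, a parser along these lines may help. It assumes that language_mapping.txt holds one `idx: language` pair per line, as the note above suggests; the function name and the example contents are hypothetical, and the real index order is whatever the checkpoint was trained with.

```python
def parse_language_mapping(text: str) -> dict:
    """Parse lines like '0: en' into {0: 'en'}; blank lines are skipped."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        idx, language = line.split(":", 1)
        mapping[int(idx)] = language.strip()
    return mapping


# Hypothetical file contents, for illustration only.
example = "0: en\n1: es\n\n2: zh-CN\n"
idx_to_lang = parse_language_mapping(example)
print(idx_to_lang)
```

In practice the text would come from reading `training/language_mapping.txt`, and the resulting dictionary maps the argmax of the 10 logits to a language code.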
+How to use (inference):
+```python
+import torch
+from models import LanguageClassifier
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Build and load from a directory containing best_model.pt and language_mapping.txt
+model = LanguageClassifier.from_pretrained("training", device=device)
+# Single-file prediction (auto resample to 24k, pad/trim to 10s)
+label, prob = model.predict("example.wav", max_length_seconds=10.0, top_k=1)
+print(label, prob)
+# Top-3
+top3 = model.predict("example.wav", top_k=3)
+print(top3)  # illustrative output, e.g. [('en', 0.62), ('de', 0.21), ('fr', 0.08)]
+# If you already have a waveform tensor:
+#   wav: torch.Tensor [T] at 24kHz (or provide sample_rate to auto-resample)
+#   model.predict handles [T] or [B,T]
+# label, prob = model.predict(wav, sample_rate=orig_sr, top_k=1)
+```
+Limitations and risks:
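+- Overall accuracy is low (0.3830 on the test split; uniform random guessing over 10 labels would give 0.10) and varies across speakers, accents, microphones, and noise conditions.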
+- May misclassify short utterances or code-switched speech.
+- Not suitable for sensitive decision making without human review.
+Reproducibility:
+- Default config: ./config.yaml
+- Training script: ./train.py
+- To visualize internals: `CHECKPOINT_DIR=training/ python visualization.py`
+Citation and acknowledgements:
+- [SNAC: hubertsiuzdak/snac_24khz](https://github.com/hubertsiuzdak/snac/)
+- [Dataset: Mozilla Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)