---
license: mit
datasets:
- mozilla-foundation/common_voice_17_0
language:
- en
- es
- ar
- fr
- de
- it
- pt
- ru
- zh
- ja
metrics:
- accuracy
base_model:
- hubertsiuzdak/snac_24khz
pipeline_tag: audio-classification
tags:
- audio
- language
- classification
---

# Audio Language Classifier (SNAC backbone, Common Voice 17.0)

First iteration of a lightweight (7M-parameter) model for identifying the language spoken in an audio clip. Code is available on [GitHub](https://github.com/surus-lat/audio-language-classification).

In short:
- Identifies the spoken language in audio (10 languages)
- Backbone: SNAC (hubertsiuzdak/snac_24khz) with attention pooling
- Dataset: Mozilla Common Voice 17.0 (streaming)
- Sample rate: 24 kHz; max audio length: 10 s (pad/trim)
- Mixed precision: FP16
- Best validation accuracy: 0.57

Supported languages (labels):
- en, es, fr, de, it, pt, ru, zh-CN, ja, ar

Intended use:
- Classify the language of short speech segments (≤10 s).
- Not for ASR or dialect/variant classification.

Out-of-scope:
- Very long audio, code-switching, overlapping speakers, noisy or music-heavy inputs.

Data:
- Source: Mozilla Common Voice 17.0 (streaming; per-language subsets).
- License: CC0 (check the dataset card for details).
- Splits: official validation/test splits are used (use_official_splits: true); the dataset's Parquet branch is used to handle the large file sizes.
- Slice of each split used during training: 50%.

Model architecture (an illustrative sketch appears after the usage example below):
- Backbone: SNAC encoder (pretrained).
- Pooling: attention pooling over time.
- Head:
  - Linear(feature_dim → 512), ReLU, Dropout(0.1)
  - Linear(512 → 256), ReLU, Dropout(0.1)
  - Linear(256 → 10)
- Selective tuning:
  - Start frozen (backbone_tune_strategy: "frozen")
  - Unfreeze at epoch 2: "last_n_blocks" with last_n_blocks: 1
- Gradient checkpointing enabled for the backbone.

Training setup (see the optimizer/scheduler sketch after the usage example below):
- Batch size: 48
- Epochs: up to 100 (early stopping patience: 15)
- Streaming steps per epoch: 2000
- Optimizer: AdamW (betas: 0.9, 0.999; eps: 1e-8)
- Learning rate: head 1e-4; backbone 2e-5 (after unfreeze)
- Scheduler: cosine with warmup (num_warmup_steps: 2000)
- Label smoothing: 0.1
- Max grad norm: 1.0
- Seed: 42
- Hardware: 1x RTX 3090; FP16 enabled

Preprocessing:
- Mono waveform at 24 kHz; pad/trim to 10 s (see the waveform-preparation sketch after the usage example below).
- Normalization handled by torchaudio/Tensor transforms in the pipeline.

Evaluation results:
- Validation:
  - Best accuracy: 0.5016
- Test:
  - accuracy: 0.3830
  - f1_micro: 0.3830
  - f1_macro: 0.3624
  - f1_weighted: 0.3666
  - loss: 2.2467

Files and checkpoints:
- Checkpoints dir: ./training
  - best_model.pt
  - language_mapping.txt (idx: language)
  - final_results.txt

How to use (inference):

```python
import torch
from models import LanguageClassifier

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build and load from a directory containing best_model.pt and language_mapping.txt
model = LanguageClassifier.from_pretrained("training", device=device)

# Single-file prediction (auto resample to 24k, pad/trim to 10s)
label, prob = model.predict("example.wav", max_length_seconds=10.0, top_k=1)
print(label, prob)

# Top-3
top3 = model.predict("example.wav", top_k=3)
print(top3)  # [('en', 0.62), ('de', 0.21), ('fr', 0.08)]

# If you already have a waveform tensor:
# wav: torch.Tensor [T] at 24kHz (or provide sample_rate to auto-resample)
# model.predict handles [T] or [B,T]
# label, prob = model.predict(wav, sample_rate=orig_sr, top_k=1)
```
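If you want to prepare the waveform tensor yourself (matching the Preprocessing section: mono, 24 kHz, pad/trim to 10 s), a minimal sketch using torchaudio follows. The helper name `prepare_waveform` is illustrative and not part of the repository; `model.predict` already performs equivalent preprocessing internally.

```python
import torch
import torchaudio

def prepare_waveform(path: str, target_sr: int = 24_000, max_seconds: float = 10.0) -> torch.Tensor:
    """Illustrative helper: load audio, downmix to mono, resample to 24 kHz, pad/trim to 10 s."""
    wav, sr = torchaudio.load(path)        # [channels, T]
    wav = wav.mean(dim=0)                  # downmix to mono -> [T]
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=target_sr)
    max_len = int(target_sr * max_seconds)  # 240_000 samples for 10 s at 24 kHz
    if wav.numel() > max_len:
        wav = wav[:max_len]                                             # trim
    else:
        wav = torch.nn.functional.pad(wav, (0, max_len - wav.numel()))  # right-pad with zeros
    return wav                              # [240_000]

# wav = prepare_waveform("example.wav")
# label, prob = model.predict(wav, sample_rate=24_000, top_k=1)
```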
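For reference, the attention pooling and MLP head described under Model architecture could look roughly like the sketch below. This is a reading of the bullet list in that section, not the repository's implementation; `feature_dim` (the SNAC encoder's frame-level feature size) and the class names are assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Score each frame, softmax over time, and return the attention-weighted sum."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: [B, T, feature_dim] frame-level features from the SNAC encoder
        weights = torch.softmax(self.score(frames), dim=1)  # [B, T, 1]
        return (weights * frames).sum(dim=1)                # [B, feature_dim]

class ClassificationHead(nn.Module):
    """MLP head with the layer sizes listed in the Model architecture section."""
    def __init__(self, feature_dim: int, num_languages: int = 10, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(256, num_languages),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.net(pooled)  # [B, num_languages] logits
```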
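Similarly, the optimizer, scheduler, and loss listed under Training setup correspond to a setup along these lines. The `backbone`/`head` placeholders and the total step count are assumptions; the actual wiring lives in train.py.

```python
import torch
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

# Placeholders standing in for the real SNAC backbone and classification head.
backbone = torch.nn.Linear(8, 8)
head = torch.nn.Linear(8, 10)

# Separate learning rates: head 1e-4, backbone 2e-5 (applied once the backbone is unfrozen).
optimizer = AdamW(
    [
        {"params": head.parameters(), "lr": 1e-4},
        {"params": backbone.parameters(), "lr": 2e-5},
    ],
    betas=(0.9, 0.999),
    eps=1e-8,
)

# Cosine schedule with 2000 warmup steps; total steps assumed to be steps-per-epoch x epochs.
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=2000 * 100
)

# Cross-entropy with label smoothing 0.1; gradients clipped each step to a max norm of 1.0:
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
# torch.nn.utils.clip_grad_norm_([*backbone.parameters(), *head.parameters()], max_norm=1.0)
```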
Limitations and risks:
- Accuracy varies across speakers, accents, microphones, and noise conditions.
- May misclassify short utterances or code-switched speech.
- Not suitable for sensitive decision-making without human review.

Reproducibility:
- Default config: ./config.yaml
- Training script: ./train.py
- To visualize internals: CHECKPOINT_DIR=training/ python viualization.py

Citation and acknowledgements:
- [SNAC: hubertsiuzdak/snac_24khz](https://github.com/hubertsiuzdak/snac/)
- [Dataset: Mozilla Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)