---
license: mit
datasets:
- mozilla-foundation/common_voice_17_0
language:
- en
- es
- ar
- fr
- de
- it
- pt
- ru
- zh
- ja
metrics:
- accuracy
base_model:
- hubertsiuzdak/snac_24khz
pipeline_tag: audio-classification
tags:
- audio
- language
- classification
---
# Audio Language Classifier (SNAC backbone, Common Voice 17.0)

Summary:
- Task: Spoken language identification (10 languages)
- Backbone: SNAC (hubertsiuzdak/snac_24khz) with attention pooling
- Dataset: Mozilla Common Voice 17.0 (streaming)
- Sample rate: 24 kHz; max audio length: 10 s (pad/trim)
- Mixed precision: FP16
- Best validation accuracy: 0.5016
- Test accuracy: 0.3830

Supported languages (labels):
- en, es, fr, de, it, pt, ru, zh-CN, ja, ar

Intended use:
- Classify the language of short speech segments (≤10 s).
- Not for ASR or dialect/variant classification.

Out-of-scope:
- Very long audio, code-switching, overlapping speakers, noisy or music-heavy inputs.

Data:
- Source: Mozilla Common Voice 17.0 (streaming; per-language subset; see the loading sketch below).
- License: CC0 (check the dataset card for details).
- Splits: official validation/test splits used (use_official_splits: true).
- Per-split slice used during training: 25%.

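For reference, a minimal sketch of pulling one per-language streaming subset with the `datasets` library. The exact loader arguments used in training are an assumption; the dataset is gated, so it requires Hugging Face authentication and accepting its terms:

```python
# Hypothetical loading sketch; the actual pipeline lives in train.py.
from itertools import islice
from datasets import load_dataset

# Stream one language config at a time (no full download).
stream = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "es",            # one of the 10 language configs
    split="train",
    streaming=True,
)

# Take a bounded slice, mirroring the per-split 25% / steps-per-epoch limits.
for example in islice(stream, 4):
    audio = example["audio"]  # {"array": ..., "sampling_rate": ...}
    print(example["locale"], audio["sampling_rate"])
```
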
Model architecture:
- Backbone: SNAC encoder (pretrained).
- Pooling: attention pooling over time.
- Head (sketched below):
  - Linear(feature_dim → 512), ReLU, Dropout(0.1)
  - Linear(512 → 256), ReLU, Dropout(0.1)
  - Linear(256 → 10)
- Selective tuning:
  - Start frozen (backbone_tune_strategy: "frozen")
  - Unfreeze at epoch 5: "last_n_blocks" with last_n_blocks: 1
- Gradient checkpointing enabled for the backbone.

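The pooling and head are small enough to sketch. This is an illustrative reconstruction of the list above, not necessarily what `models.LanguageClassifier` implements; `feature_dim` stands in for the SNAC encoder's hidden size:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Attention pooling over time: softmax-scored weighted sum of frame features."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, feature_dim]
        weights = torch.softmax(self.score(x), dim=1)  # [B, T, 1]
        return (weights * x).sum(dim=1)                # [B, feature_dim]

def make_head(feature_dim: int, num_classes: int = 10) -> nn.Sequential:
    # Mirrors the listed head: 512 -> 256 -> num_classes with ReLU + Dropout(0.1).
    return nn.Sequential(
        nn.Linear(feature_dim, 512), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(256, num_classes),
    )
```
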
Training setup:
- Batch size: 48
- Epochs: up to 100 (early stopping patience: 15)
- Streaming steps per epoch: 500
- Optimizer: AdamW (betas: 0.9, 0.999; eps: 1e-8); see the sketch below
- Learning rate: head 1e-4; backbone 2e-5 (after unfreeze)
- Scheduler: cosine with warmup (num_warmup_steps: 2000)
- Label smoothing: 0.1
- Max grad norm: 1.0
- Seed: 42
- Hardware: CUDA if available; FP16 enabled

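The two learning rates imply separate parameter groups. A minimal sketch under assumed names: `backbone` and `head` are stand-ins for the real modules, and `transformers.get_cosine_schedule_with_warmup` is used here as one common cosine-with-warmup implementation:

```python
import torch
import torch.nn as nn
from transformers import get_cosine_schedule_with_warmup

# Stand-in modules so the sketch runs; substitute the real backbone/head.
backbone = nn.Linear(768, 768)
head = nn.Linear(768, 10)

optimizer = torch.optim.AdamW(
    [
        {"params": head.parameters(), "lr": 1e-4},
        {"params": backbone.parameters(), "lr": 2e-5},  # active after unfreeze
    ],
    betas=(0.9, 0.999),
    eps=1e-8,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,
    num_training_steps=100 * 500,  # epochs * streaming steps per epoch
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Per step, after loss.backward():
# torch.nn.utils.clip_grad_norm_(params, max_norm=1.0); optimizer.step(); scheduler.step()
```
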
Preprocessing:
- Mono waveform at 24 kHz; pad/trim to 10 s (see the sketch below).
- Normalization handled by torchaudio/Tensor transforms in the pipeline.

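A minimal version of the resample/pad/trim step (normalization is handled elsewhere in the pipeline, so it is omitted here):

```python
import torch
import torchaudio

TARGET_SR = 24_000
MAX_SECONDS = 10.0

def prepare(wav: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Return a mono 24 kHz waveform padded/trimmed to exactly 10 s."""
    if wav.dim() == 2:                       # [channels, T] -> mono [T]
        wav = wav.mean(dim=0)
    if sample_rate != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sample_rate, TARGET_SR)
    max_len = int(TARGET_SR * MAX_SECONDS)
    if wav.numel() >= max_len:
        wav = wav[:max_len]                  # trim
    else:
        wav = torch.nn.functional.pad(wav, (0, max_len - wav.numel()))  # right-pad
    return wav

wav, sr = torchaudio.load("example.wav")
print(prepare(wav, sr).shape)  # torch.Size([240000])
```
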
Evaluation results (metric definitions are sketched below):
- Validation:
  - Best accuracy: 0.5016
- Test:
  - accuracy: 0.3830
  - f1_micro: 0.3830
  - f1_macro: 0.3624
  - f1_weighted: 0.3666
  - loss: 2.2467

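The reported scores follow the standard accuracy/F1 definitions; assuming the evaluation used equivalent definitions, this is how the variants are computed with scikit-learn. Note that for single-label multiclass classification micro-F1 equals accuracy, consistent with the identical 0.3830 values above:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy integer class ids; in evaluation these come from the test split.
y_true = [0, 1, 2, 2, 3]
y_pred = [0, 1, 1, 2, 3]

print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average="micro"))     # = accuracy for single-label
print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print(f1_score(y_true, y_pred, average="weighted"))  # support-weighted mean
```
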
Files and checkpoints:
- Checkpoints dir: ./training
  - best_model.pt
  - language_mapping.txt (idx: language; parsing sketch below)
  - final_results.txt

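`language_mapping.txt` stores one `idx: language` pair per line; a small parsing sketch (the exact file contents are an assumption, e.g. `0: en`):

```python
# Parse lines of the form "idx: language" into {idx: language}.
def load_mapping(path: str = "training/language_mapping.txt") -> dict[int, str]:
    mapping: dict[int, str] = {}
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue
            idx, lang = line.split(":", 1)
            mapping[int(idx)] = lang.strip()
    return mapping
```
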
How to use (inference):
```python
import torch
from models import LanguageClassifier

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build and load from a directory containing best_model.pt and language_mapping.txt
model = LanguageClassifier.from_pretrained("training", device=device)

# Single-file prediction (auto-resamples to 24 kHz, pads/trims to 10 s)
label, prob = model.predict("example.wav", max_length_seconds=10.0, top_k=1)
print(label, prob)

# Top-3
top3 = model.predict("example.wav", top_k=3)
print(top3)  # e.g. [('en', 0.62), ('de', 0.21), ('fr', 0.08)]

# If you already have a waveform tensor:
# wav: torch.Tensor [T] at 24 kHz (or provide sample_rate to auto-resample)
# model.predict handles [T] or [B, T]
# label, prob = model.predict(wav, sample_rate=orig_sr, top_k=1)
```

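Continuing from the block above, a hedged end-to-end variant that starts from a file you load yourself (this assumes `model.predict` accepts a waveform plus `sample_rate`, as the comments above indicate):

```python
import torchaudio

wav, sr = torchaudio.load("example.wav")  # [channels, T]
wav = wav.mean(dim=0)                     # mono [T]
label, prob = model.predict(wav, sample_rate=sr, top_k=1)
print(f"{label}: {prob:.2f}")
```
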
Limitations and risks:
- Accuracy varies across speakers, accents, microphones, and noise conditions.
- May misclassify short utterances or code-switched speech.
- Not suitable for sensitive decision-making without human review.

Reproducibility:
- Default config: ./config.yaml
- Training script: ./train.py
- To visualize internals: `CHECKPOINT_DIR=training/ python visualization.py`

Citation and acknowledgements:
- [SNAC: hubertsiuzdak/snac_24khz](https://github.com/hubertsiuzdak/snac/)
- [Dataset: Mozilla Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)