arunps committed on
Commit
26584b0
·
verified ·
1 Parent(s): 80f5a49

docs: add complete Hugging Face model card for wav2vec2-base-adsids with dataset details, usage examples, and updated performance metrics

Files changed (1): README.md (+153 -4)
README.md CHANGED
@@ -1,4 +1,153 @@
- Wav2Vec2-base-ADS and IDS Classification Fine-tuned facebook/wav2vec2-base on Adult and Infant directed speech dataset.
- The data used for training was randomly sampled.
- The data was 8kHz and hence it was upsampled to 16kHz for training.
- When using this model, make sure that your speech input is sampled at 16kHz.

---
license: mit
base_model:
- facebook/wav2vec2-base
pipeline_tag: audio-classification
---
# 🗣️ Wav2Vec2-Base-ADSIDS
Fine-tuned `wav2vec2-base` model for **classifying speech register and vocal mode**:
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult Song (ADS-song), and Infant Song (IDS-song).

---

## 🧠 Model Overview
This model classifies a given speech or song segment into one of four vocalization categories:
- 👩‍🏫 **Adult-Directed Speech (ADS)**
- 🧸 **Infant-Directed Speech (IDS)**
- 🎵 **Adult Song (ADS-song)**
- 🎶 **Infant Song (IDS-song)**

It was fine-tuned from **facebook/wav2vec2-base** on the
[**Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo)**](https://zenodo.org/record/5525161),
which includes over **1,600 natural recordings** of infant- and adult-directed speech and song collected across **21 societies** worldwide.

---

## 📚 Dataset

**Dataset:** *The Naturalistic Human Vocalizations Corpus*
**Reference:** Hilton, E., Mehr, S. A. et al. (2021).
[Zenodo DOI: 10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

This dataset captures both **speech** and **song**, directed to **infants** and **adults**, with consistent annotations across cultures, languages, and recording environments.

---

## ⚙️ Training Details

- **Base model:** `facebook/wav2vec2-base`
- **Framework:** PyTorch + 🤗 Transformers
- **Task:** 4-way classification
- **Optimizer:** AdamW
- **Learning rate:** 3e-5
- **Loss function:** Cross-Entropy
- **Epochs:** 10–15 (with early stopping)
- **Sampling rate:** 16 kHz
- **Segment duration:** 2–6 seconds
- **Hardware:** 1 × NVIDIA A100 GPU

+ ---
50
+
51
+ ## 📊 Example Performance (on held-out data)
52
+
53
+ | Class | Precision | Recall | F1-score |
54
+ |:--------------|:----------|:--------|:----------|
55
+ | ADS | 0.61 | 0.58 | 0.59 |
56
+ | IDS | 0.47 | 0.45 | 0.46 |
57
+ | ADS-song | 0.55 | 0.53 | 0.54 |
58
+ | IDS-song | 0.48 | 0.47 | 0.47 |
59
+ | **Macro Avg** | **0.53** | **0.51** | **0.52** |
60
+
61
+ > > The model achieves a **macro-average F1-score of around 52%**,
62
+ > indicating that it successfully captures the **broad acoustic differences**
63
+ > between speech and song, and between adult- and infant-directed registers.
64
+ >
65
+ > However, performance is **lower for IDS and IDS-song**, suggesting that
66
+ > infant-directed vocalizations share **overlapping prosodic and melodic cues**
67
+ > (e.g., higher pitch, slower tempo, greater variability), making them
68
+ > more challenging to distinguish purely from acoustic information.
69
+
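The held-out split itself is not distributed with the model, but a report in the same format as the table above can be produced with scikit-learn once gold and predicted labels are available; the label lists below are toy placeholders, not real evaluation data.

```python
# Sketch: per-class precision/recall/F1 report in the style of the table above.
# `y_true` / `y_pred` are placeholders; substitute your own held-out annotations.
from sklearn.metrics import classification_report

label_names = ["ADS", "IDS", "ADS-song", "IDS-song"]
y_true = ["ADS", "IDS", "IDS-song", "ADS-song", "IDS"]
y_pred = ["ADS", "IDS-song", "IDS-song", "ADS-song", "IDS"]

print(classification_report(y_true, y_pred, labels=label_names, zero_division=0))
```
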
---
## 🧩 How to use from the 🤗 Transformers library

### 🧱 Use a pipeline (simple helper)
```python
from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

preds = pipe("example_audio.wav")
print(preds)
```

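The pipeline returns a list of `{"label": ..., "score": ...}` dictionaries, one per class, sorted by descending score. Note that decoding a file path like this relies on `ffmpeg` being available on the system.
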
### 🧰 Load the model directly
```python
from transformers import AutoProcessor, AutoModelForAudioClassification
import torch, librosa

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

# Load the clip and resample it to the 16 kHz rate the model expects
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and map them back to label names
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
```

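If you only need the top prediction, `model.config.id2label[int(probs[0].argmax())]` gives the most likely class name directly.
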
---

## 🧬 Research Context

This model builds on findings from the cross-cultural study of **infant-directed communication**:
> Hilton, E. et al. (2021). *The Naturalistic Human Vocalizations Corpus.* Zenodo. DOI: [10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

The study demonstrated that **infant-directed vocalizations**, both speech and song, share
universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody.
This fine-tuned Wav2Vec2 model captures these features for automatic classification.

---

## ✅ Intended Uses
- Research on **caregiver–infant vocal interaction**
- Acoustic analysis of **speech vs. song registers**
- Feature extraction for **prosody, emotion, or language learning studies** (see the embedding sketch below)

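For the feature-extraction use case, one possible approach is to read the encoder's hidden states from the fine-tuned checkpoint. This is a sketch only: mean-pooling the last layer into a clip-level embedding is an assumption of this example, not something prescribed by the model card.

```python
# Sketch: reusing the fine-tuned encoder as a clip-level embedding extractor.
import torch
import librosa
from transformers import AutoProcessor, AutoModelForAudioClassification

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last transformer layer, averaged over time: one 768-d vector per clip.
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768]) for wav2vec2-base
```
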
## ⚠️ Limitations
- Trained on short, clean audio segments (2–6 s); longer recordings should be split into similar-length windows before classification (one way to do this is sketched below)
- Cross-cultural variability may influence predictions
- Not intended for speech recognition or word-level tasks

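A minimal sketch of that windowing step, assuming a hypothetical longer file `long_recording.wav` and illustrative non-overlapping 4-second windows:

```python
# Sketch: classify a long recording window by window (4 s sits inside the
# 2-6 s range the model was trained on; window length is an illustrative choice).
import librosa
from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

audio, sr = librosa.load("long_recording.wav", sr=16000)  # hypothetical input file
window = 4 * sr  # non-overlapping 4-second windows

for start in range(0, max(len(audio) - window + 1, 1), window):
    chunk = audio[start:start + window]
    preds = pipe(chunk)  # raw arrays are assumed to already be at 16 kHz
    print(f"{start / sr:5.1f}s  {preds[0]['label']}  {preds[0]['score']:.2f}")
```
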
---

## 🪪 License

- **Model License:** MIT
- **Dataset License:** CC BY 4.0 (Hilton et al., 2021, Zenodo)

---

## 🧾 Citation
If you use or build upon this model, please cite:

```bibtex
@misc{wav2vec2_adsids,
  author = {Arun Prakash Singh},
  title = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}
```

---

## 👤 Author
**Arun Prakash Singh**
Department of Linguistics and Scandinavian Studies, University of Oslo

🔗 [https://github.com/arunps12](https://github.com/arunps12)