---
license: mit
base_model:
- facebook/wav2vec2-base
pipeline_tag: audio-classification
---
# 🗣️ Wav2Vec2-Base-ADSIDS

Fine-tuned `wav2vec2-base` model for **classifying speech register and vocal mode**:
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult Song (ADS-song), and Infant Song (IDS-song).

---

## 🧠 Model Overview

This model classifies a given speech or song segment into one of four vocalization categories:

- 👩🏫 **Adult-Directed Speech (ADS)**
- 🧸 **Infant-Directed Speech (IDS)**
- 🎵 **Adult Song (ADS-song)**
- 🎶 **Infant Song (IDS-song)**

It was fine-tuned from **facebook/wav2vec2-base** on the
[**Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo)**](https://zenodo.org/record/5525161),
which includes over **1,600 natural recordings** of infant- and adult-directed speech and song collected across **21 societies** worldwide.

---

## 📚 Dataset

**Dataset:** *The Naturalistic Human Vocalizations Corpus*
**Reference:** Hilton, E., Mehr, S. A. et al. (2021).
[Zenodo DOI: 10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

This dataset captures both **speech** and **song**, directed to **infants** and **adults**, with consistent annotations across cultures, languages, and recording environments.

---

## ⚙️ Training Details

- **Base model:** `facebook/wav2vec2-base`
- **Framework:** PyTorch + 🤗 Transformers
- **Task:** 4-way classification
- **Optimizer:** AdamW
- **Learning rate:** 3e-5
- **Loss function:** Cross-Entropy
- **Epochs:** 10–15 (with early stopping)
- **Sampling rate:** 16 kHz
- **Segment duration:** 2–6 seconds
- **Hardware:** 1 × NVIDIA A100 GPU

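The hyperparameters above map onto a standard 🤗 `Trainer` setup. The sketch below is illustrative rather than the original training script: the batch size, the dataset handling, and the random placeholder audio are assumptions made only so the snippet runs end to end.

```python
import numpy as np
from datasets import Dataset
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Trainer,
    TrainingArguments,
)

labels = ["ADS", "IDS", "ADS-song", "IDS-song"]
label2id = {name: i for i, name in enumerate(labels)}
id2label = {i: name for name, i in label2id.items()}

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),
    label2id=label2id,
    id2label=id2label,
)

# Placeholder data: random 3-second clips standing in for real 16 kHz corpus segments.
dummy = {
    "audio": [np.random.randn(16000 * 3).astype("float32") for _ in range(8)],
    "label": [i % len(labels) for i in range(8)],
}

def preprocess(batch):
    out = extractor(batch["audio"], sampling_rate=16000, padding=True)
    out["labels"] = batch["label"]
    return out

train_ds = Dataset.from_dict(dummy).map(
    preprocess, batched=True, remove_columns=["audio", "label"]
)

args = TrainingArguments(
    output_dir="wav2vec2-base-adsids",
    learning_rate=3e-5,             # as listed above
    num_train_epochs=1,             # the card reports 10-15 epochs with early stopping
    per_device_train_batch_size=4,  # assumption: batch size is not stated in the card
    logging_steps=1,
    report_to=[],
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```

In a real run, early stopping would typically be added via `transformers.EarlyStoppingCallback` together with a held-out evaluation set.
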
---

## 📊 Example Performance (on held-out data)

| Class         | Precision | Recall   | F1-score |
|:--------------|:----------|:---------|:---------|
| ADS           | 0.61      | 0.58     | 0.59     |
| IDS           | 0.47      | 0.45     | 0.46     |
| ADS-song      | 0.55      | 0.53     | 0.54     |
| IDS-song      | 0.48      | 0.47     | 0.47     |
| **Macro Avg** | **0.53**  | **0.51** | **0.52** |

> The model achieves a **macro-average F1-score of around 52%**,
> indicating that it successfully captures the **broad acoustic differences**
> between speech and song, and between adult- and infant-directed registers.
>
> However, performance is **lower for IDS and IDS-song**, suggesting that
> infant-directed vocalizations share **overlapping prosodic and melodic cues**
> (e.g., higher pitch, slower tempo, greater variability), making them
> more challenging to distinguish purely from acoustic information.

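For reference, per-class precision/recall/F1 tables like the one above are typically produced with `sklearn.metrics.classification_report`. The snippet below is a toy illustration with placeholder labels, not the actual held-out evaluation data.

```python
# Illustrative only: how a per-class precision/recall/F1 table can be produced.
from sklearn.metrics import classification_report

labels = ["ADS", "IDS", "ADS-song", "IDS-song"]
# y_true / y_pred would come from running the model on held-out segments;
# the values below are placeholders.
y_true = ["ADS", "IDS", "ADS-song", "IDS-song", "ADS", "IDS-song"]
y_pred = ["ADS", "IDS", "IDS-song", "IDS-song", "ADS", "ADS-song"]

print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
# The "macro avg" row averages the per-class scores without weighting by class frequency.
```
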
---

## 🧩 How to use from the 🤗 Transformers library

### 🧱 Use a pipeline (simple helper)

```python
from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

preds = pipe("example_audio.wav")
print(preds)
```

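The pipeline returns a list of `{"label": ..., "score": ...}` dictionaries sorted by score, so the first entry is the model's top prediction for the clip.
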
### 🧰 Load the model directly

```python
from transformers import AutoProcessor, AutoModelForAudioClassification
import torch
import librosa

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
```

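If you only need the most likely class rather than the full distribution, take the argmax over the probabilities, e.g. `labels[int(probs[0].argmax())]`.
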
---

## 🧬 Research Context

This model builds on findings from the cross-cultural study of **infant-directed communication**:

> Hilton, E. et al. (2021). *The Naturalistic Human Vocalizations Corpus.* Zenodo. DOI: [10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

The study demonstrated that **infant-directed vocalizations**, both speech and song, share
universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody.
This fine-tuned Wav2Vec2 model captures these features for automatic classification.

---

## ✅ Intended Uses

- Research on **caregiver–infant vocal interaction**
- Acoustic analysis of **speech vs. song registers**
- Feature extraction for **prosody, emotion, or language learning studies** (see the sketch below)

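For the feature-extraction use case above, one possible approach (an illustrative sketch, not a documented workflow for this checkpoint) is to mean-pool the model's last hidden states into a clip-level embedding:

```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForAudioClassification

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the final transformer layer over time -> one 768-dim vector per clip,
# usable as an acoustic feature for downstream prosody or register analyses.
embedding = outputs.hidden_states[-1].mean(dim=1).squeeze(0)
print(embedding.shape)  # torch.Size([768])
```
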
## ⚠️ Limitations

- Trained on short, clean audio segments (2–6 s); see the chunking sketch below for longer recordings
- Cross-cultural variability may influence predictions
- Not intended for speech recognition or word-level tasks

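For recordings longer than the 2–6 s training segments, a simple workaround is to split the audio into short windows, classify each window, and average the class scores. The sketch below assumes a hypothetical file name and window/hop lengths; none of these values come from the original training setup.

```python
import librosa
from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

audio, sr = librosa.load("long_recording.wav", sr=16000)  # hypothetical input file
win, hop = 4 * sr, 2 * sr  # 4 s windows with 2 s hop (assumed values)

# Classify each window and average the scores across the whole recording.
chunks = [audio[s:s + win] for s in range(0, max(len(audio) - win, 1), hop)]
scores = {}
for chunk in chunks:
    for pred in pipe({"raw": chunk, "sampling_rate": sr}):
        scores[pred["label"]] = scores.get(pred["label"], 0.0) + pred["score"] / len(chunks)

print(max(scores, key=scores.get), scores)
```
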
---

## 🪪 License

- **Model License:** MIT
- **Dataset License:** CC BY 4.0 (Hilton et al., 2021, Zenodo)

---

## 🧾 Citation

If you use or build upon this model, please cite:

```bibtex
@misc{wav2vec2_adsids,
  author       = {Arun Prakash Singh},
  title        = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note         = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}
```

---

## 👤 Author

**Arun Prakash Singh**
Department of Linguistics and Scandinavian Studies, University of Oslo

🔗 [https://github.com/arunps12](https://github.com/arunps12)