---
license: mit
base_model:
- facebook/wav2vec2-base
pipeline_tag: audio-classification
---
# 🗣️ Wav2Vec2-Base-ADSIDS
Fine-tuned `wav2vec2-base` model for **classifying speech register and vocal mode**:  
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult Song (ADS-song), and Infant Song (IDS-song).

---

## 🧠 Model Overview
This model classifies a given speech or song segment into one of four vocalization categories:  
- 👩‍🏫 **Adult-Directed Speech (ADS)**  
- 🧸 **Infant-Directed Speech (IDS)**  
- 🎵 **Adult Song (ADS-song)**  
- 🎶 **Infant Song (IDS-song)**  

It was fine-tuned from **facebook/wav2vec2-base** on the  
[**Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo)**](https://zenodo.org/record/5525161),  
which includes over **1,600 natural recordings** of infant- and adult-directed speech and song collected across **21 societies** worldwide.

---

## 📚 Dataset

**Dataset:** *The Naturalistic Human Vocalizations Corpus*  
**Reference:** Hilton, E., Mehr, S. A. et al. (2021).  
[Zenodo DOI: 10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

This dataset captures both **speech** and **song**, directed to **infants** and **adults**, with consistent annotations across cultures, languages, and recording environments.

---

## ⚙️ Training Details

- **Base model:** `facebook/wav2vec2-base`  
- **Framework:** PyTorch + 🤗 Transformers  
- **Task:** 4-way classification  
- **Optimizer:** AdamW  
- **Learning rate:** 3e-5  
- **Loss function:** Cross-Entropy  
- **Epochs:** 10–15 (with early stopping)  
- **Sampling rate:** 16 kHz  
- **Segment duration:** 2–6 seconds  
- **Hardware:** 1 × NVIDIA A100 GPU
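
The 16 kHz sampling rate and 2–6 second segment duration listed above imply some windowing step before training. This card does not describe how segments were cut, so the following is only a minimal sketch under assumed rules: non-overlapping windows of at most 6 s, with a trailing remainder shorter than 2 s dropped. The function name `segment_bounds` and its parameters are hypothetical.

```python
def segment_bounds(num_samples, sr=16000, min_sec=2.0, max_sec=6.0):
    """Return (start, end) sample indices for non-overlapping windows.

    Assumed scheme (not taken from the model card): cut windows of
    max_sec seconds and keep the final remainder only if it lasts at
    least min_sec seconds.
    """
    max_len = int(max_sec * sr)
    min_len = int(min_sec * sr)
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + max_len, num_samples)
        if end - start >= min_len:
            bounds.append((start, end))
        start = end
    return bounds


# A 15 s recording at 16 kHz yields two 6 s windows and one 3 s window.
print(segment_bounds(15 * 16000))
```

Each `(start, end)` pair can then be used to slice a loaded waveform, for example `audio[start:end]`, before passing it to the processor.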

---

## 📊 Example Performance (on held-out data)

| Class        | Precision | Recall | F1-score |
|:--------------|:----------|:--------|:----------|
| ADS           | 0.61 | 0.58 | 0.59 |
| IDS           | 0.47 | 0.45 | 0.46 |
| ADS-song      | 0.55 | 0.53 | 0.54 |
| IDS-song      | 0.48 | 0.47 | 0.47 |
| **Macro Avg** | **0.53** | **0.51** | **0.52** |

> The model reaches a **macro-average F1-score of about 0.52**, compared with  
> roughly 0.25 for uniform random guessing over four balanced classes.  
> This suggests that it picks up some of the **broad acoustic differences**  
> between speech and song, and between adult- and infant-directed registers.  
>
> However, performance is **lower for IDS and IDS-song**, suggesting that  
> infant-directed vocalizations share **overlapping prosodic and melodic cues**  
> (e.g., higher pitch, slower tempo, greater variability), making them  
> more challenging to distinguish purely from acoustic information.
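
For reference, the Macro Avg row in the table above is the unweighted mean of the four per-class rows, and each F1-score is the harmonic mean of that class's precision and recall. The short check below recomputes both from the table values; the variable names are illustrative only.

```python
# Per-class (precision, recall, f1) copied from the table above.
scores = {
    "ADS": (0.61, 0.58, 0.59),
    "IDS": (0.47, 0.45, 0.46),
    "ADS-song": (0.55, 0.53, 0.54),
    "IDS-song": (0.48, 0.47, 0.47),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recompute each F1 from its precision and recall.
recomputed_f1 = {name: f1(p, r) for name, (p, r, _) in scores.items()}

# Macro average: unweighted mean over the four classes.
macro = [sum(row[i] for row in scores.values()) / len(scores) for i in range(3)]

for name, value in recomputed_f1.items():
    print(f"{name}: reported F1 {scores[name][2]:.2f}, recomputed {value:.3f}")
print("macro precision, recall, F1:", [f"{m:.3f}" for m in macro])
```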

---
## 🧩 How to use from the 🤗 Transformers library

### 🧱 Use a pipeline (simple helper)
```python
from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

preds = pipe("example_audio.wav")
print(preds)
```

### 🧰 Load the model directly
```python
import librosa
import torch
from transformers import AutoModelForAudioClassification, AutoProcessor

model_id = "arunps/wav2vec2-base-adsids"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id)

# Load the audio as mono and resample it to the 16 kHz rate used in training.
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to per-class probabilities and print them by label name.
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
```
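
Because the four labels combine two binary distinctions (adult- vs infant-directed, speech vs song), it can be convenient to collapse the class probabilities onto each axis. The sketch below does this for a list of `{"label", "score"}` dictionaries, the shape returned by the `audio-classification` pipeline, assuming scores for all four classes are present. The label strings `ADS`, `IDS`, `ADS-song`, and `IDS-song` are an assumption; check `model.config.id2label` for the names actually stored with the model. The helper name `marginalize` is hypothetical.

```python
def marginalize(preds):
    """Collapse four-class scores into audience and mode probabilities.

    preds: list of {"label": str, "score": float} covering all four classes.
    Label names are assumed, not confirmed by the model card.
    """
    scores = {p["label"]: p["score"] for p in preds}
    infant = scores.get("IDS", 0.0) + scores.get("IDS-song", 0.0)
    song = scores.get("ADS-song", 0.0) + scores.get("IDS-song", 0.0)
    total = sum(scores.values())  # renormalize in case scores do not sum to 1
    return {
        "infant_directed": infant / total,
        "adult_directed": 1.0 - infant / total,
        "song": song / total,
        "speech": 1.0 - song / total,
    }


# Made-up scores for illustration, not real model output.
example = [
    {"label": "IDS", "score": 0.40},
    {"label": "IDS-song", "score": 0.30},
    {"label": "ADS", "score": 0.20},
    {"label": "ADS-song", "score": 0.10},
]
print(marginalize(example))
```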

---

## 🧬 Research Context

This model builds on findings from the cross-cultural study of **infant-directed communication**:  
> Hilton, E. et al. (2021). *The Naturalistic Human Vocalizations Corpus.* Zenodo. DOI: [10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

The study reported that **infant-directed vocalizations**, both speech and song, share  
universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody.  
This fine-tuned Wav2Vec2 model is intended to exploit such cues for automatic classification.

---

## ✅ Intended Uses
- Research on **caregiver–infant vocal interaction**  
- Acoustic analysis of **speech vs song registers**  
- Feature extraction for **prosody, emotion, or language learning studies**

## ⚠️ Limitations
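- Trained on short, clean audio segments (2–6 s); longer or noisier recordings may be classified less reliably  
- Held-out macro F1 is about 0.52, with the weakest scores for IDS and IDS-song, so individual predictions should be treated with caution  
- The training data spans 21 societies with different languages and recording environments, so accuracy may vary across cultures  
- Not intended for speech recognition or word-level tasks  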
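
Since the model was trained on 2–6 s segments, one possible way to handle a longer recording is to classify it window by window and then average the per-window probabilities. The following sketches that aggregation step only; it is not a procedure described or evaluated in this card. `average_predictions` is a hypothetical helper, and the label names in the example are assumed.

```python
from collections import defaultdict


def average_predictions(window_preds):
    """Average class probabilities over windows and pick the top label.

    window_preds: one {label: probability} dict per window.
    """
    if not window_preds:
        raise ValueError("need at least one window")
    totals = defaultdict(float)
    for probs in window_preds:
        for label, p in probs.items():
            totals[label] += p
    mean = {label: s / len(window_preds) for label, s in totals.items()}
    return max(mean, key=mean.get), mean


# Made-up probabilities for three windows of one recording.
windows = [
    {"ADS": 0.5, "IDS": 0.3, "ADS-song": 0.1, "IDS-song": 0.1},
    {"ADS": 0.2, "IDS": 0.6, "ADS-song": 0.1, "IDS-song": 0.1},
    {"ADS": 0.3, "IDS": 0.5, "ADS-song": 0.1, "IDS-song": 0.1},
]
label, mean = average_predictions(windows)
print(label, mean)
```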

---

## 🪪 License

- **Model License:** MIT  
- **Dataset License:** CC BY 4.0 (Hilton et al., 2021, Zenodo)

---

## 🧾 Citation
If you use or build upon this model, please cite:

```bibtex
@misc{wav2vec2_adsids,
  author       = {Arun Prakash Singh},
  title        = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note         = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}
```

---

## 👤 Author
**Arun Prakash Singh**  
Department of Linguistics and Scandinavian Studies, University of Oslo  
📧 arunps@uio.no  
🔗 [https://github.com/arunps12](https://github.com/arunps12)