---
license: mit
base_model:
- facebook/wav2vec2-base
pipeline_tag: audio-classification
---

# 🗣️ Wav2Vec2-Base-ADSIDS

Fine-tuned `wav2vec2-base` model for **classifying speech register and vocal mode**:
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult Song (ADS-song), and Infant Song (IDS-song).

---

## 🧠 Model Overview

This model classifies a given speech or song segment into one of four vocalization categories:

- 👩🏫 **Adult-Directed Speech (ADS)**
- 🧸 **Infant-Directed Speech (IDS)**
- 🎵 **Adult Song (ADS-song)**
- 🎶 **Infant Song (IDS-song)**

It was fine-tuned from **facebook/wav2vec2-base** on the
[**Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo)**](https://zenodo.org/record/5525161),
which includes over **1,600 natural recordings** of infant- and adult-directed speech and song collected across **21 societies** worldwide.
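
The four categories above are exposed through the model configuration. A quick, optional check of the exact label strings and their indices (using the checkpoint ID from the usage section below; the mapping shown in the comment is illustrative, not guaranteed):

```python
from transformers import AutoConfig

# Inspect the label mapping shipped with the checkpoint
config = AutoConfig.from_pretrained("arunps/wav2vec2-base-adsids")
print(config.id2label)  # e.g. {0: "ADS", 1: "IDS", 2: "ADS-song", 3: "IDS-song"}
```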

---

## 📚 Dataset

**Dataset:** *The Naturalistic Human Vocalizations Corpus*

**Reference:** Hilton, C. B., Mehr, S. A., et al. (2021). [Zenodo DOI: 10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

This dataset captures both **speech** and **song**, directed to **infants** and **adults**, with consistent annotations across cultures, languages, and recording environments.

---

## ⚙️ Training Details

- **Base model:** `facebook/wav2vec2-base`
- **Framework:** PyTorch + 🤗 Transformers
- **Task:** 4-way classification
- **Optimizer:** AdamW
- **Learning rate:** 3e-5
- **Loss function:** Cross-Entropy
- **Epochs:** 10–15 (with early stopping)
- **Sampling rate:** 16 kHz
- **Segment duration:** 2–6 seconds
- **Hardware:** 1 × NVIDIA A100 GPU
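
The exact training script is not published with this card; the following is a minimal sketch of a comparable 🤗 `Trainer` setup using the hyperparameters listed above (cross-entropy is the default loss for this model class). The dataset objects (`train_ds`, `eval_ds`) and the batch size are placeholders/assumptions, not values reported on the card.

```python
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

labels = ["ADS", "IDS", "ADS-song", "IDS-song"]
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),
    label2id={l: i for i, l in enumerate(labels)},
    id2label={i: l for i, l in enumerate(labels)},
)

args = TrainingArguments(
    output_dir="wav2vec2-base-adsids",
    learning_rate=3e-5,              # AdamW is the Trainer default optimizer
    num_train_epochs=15,             # upper bound; early stopping can end training sooner
    per_device_train_batch_size=8,   # assumption, not reported on the card
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: pre-segmented 16 kHz clips (2–6 s) with labels
    eval_dataset=eval_ds,    # placeholder: held-out split
    processing_class=feature_extractor,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```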

---

## 📊 Example Performance (on held-out data)

| Class         | Precision | Recall   | F1-score |
|:--------------|:----------|:---------|:---------|
| ADS           | 0.61      | 0.58     | 0.59     |
| IDS           | 0.47      | 0.45     | 0.46     |
| ADS-song      | 0.55      | 0.53     | 0.54     |
| IDS-song      | 0.48      | 0.47     | 0.47     |
| **Macro Avg** | **0.53**  | **0.51** | **0.52** |

> The model achieves a **macro-average F1-score of around 52%**,
> indicating that it captures the **broad acoustic differences**
> between speech and song, and between adult- and infant-directed registers.
>
> However, performance is **lower for IDS and IDS-song**, suggesting that
> infant-directed vocalizations share **overlapping prosodic and melodic cues**
> (e.g., higher pitch, slower tempo, greater variability), making them
> more challenging to distinguish purely from acoustic information.
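
For reference, the macro averages in the last row of the table are simply the unweighted means of the four per-class scores:

```python
# Per-class scores from the table above: (precision, recall, F1)
scores = {
    "ADS":      (0.61, 0.58, 0.59),
    "IDS":      (0.47, 0.45, 0.46),
    "ADS-song": (0.55, 0.53, 0.54),
    "IDS-song": (0.48, 0.47, 0.47),
}

# Macro average = unweighted mean over the four classes
for name, column in zip(("precision", "recall", "F1"), zip(*scores.values())):
    print(f"macro {name}: {sum(column) / len(column):.4f}")
# macro precision: 0.5275, macro recall: 0.5075, macro F1: 0.5150
# -> rounded to two decimals these match the Macro Avg row (0.53 / 0.51 / 0.52)
```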

---

## 🧩 How to use from the 🤗 Transformers library

### 🧱 Use a pipeline (simple helper)

```python
from transformers import pipeline

# Load the audio-classification pipeline with this checkpoint
pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

# Classify an audio file (returns a score for each of the four classes)
preds = pipe("example_audio.wav")
print(preds)
```
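
If the waveform is already loaded in memory (for example after your own preprocessing), the pipeline also accepts a dict of raw samples plus their sampling rate instead of a file path; the file name below is only a placeholder:

```python
import librosa
from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

# Load and resample the audio yourself, then hand the raw samples to the pipeline
audio, sr = librosa.load("example_audio.wav", sr=16000)
preds = pipe({"raw": audio, "sampling_rate": sr}, top_k=4)
print(preds)
```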

### 🧰 Load the model directly

```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForAudioClassification

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

# Load and resample the audio to the 16 kHz rate the model expects
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to class probabilities and map them to label names
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
```
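
Because the model was trained on 2–6 second segments, longer recordings are better scored in short windows than as one long input. A minimal sketch of windowed inference (the 4 s window, 2 s hop, and probability averaging are illustrative choices, and `long_recording.wav` is a placeholder):

```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForAudioClassification

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")
model.eval()

# Load a longer recording at the 16 kHz rate the model expects
audio, sr = librosa.load("long_recording.wav", sr=16000)

win, hop = 4 * sr, 2 * sr  # 4 s windows with a 2 s hop (illustrative values)
window_probs = []
for start in range(0, max(len(audio) - win, 0) + 1, hop):
    chunk = audio[start:start + win]
    inputs = processor(chunk, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    window_probs.append(torch.softmax(logits, dim=-1))

# Average per-window probabilities into a single clip-level prediction
probs = torch.cat(window_probs).mean(dim=0)
print({model.config.id2label[i]: float(p) for i, p in enumerate(probs)})
```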

---

## 🧬 Research Context

This model builds on findings from the cross-cultural study of **infant-directed communication**:

> Hilton, C. B., et al. (2021). *The Naturalistic Human Vocalizations Corpus.* Zenodo. DOI: [10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

The study demonstrated that **infant-directed vocalizations**—both speech and song—share universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody. This fine-tuned Wav2Vec2 model captures these features for automatic classification.

---

## ✅ Intended Uses

- Research on **caregiver–infant vocal interaction**
- Acoustic analysis of **speech vs song registers**
- Feature extraction for **prosody, emotion, or language learning studies** (see the sketch below)
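
For the feature-extraction use case, the encoder's hidden states can be pooled into a fixed-size utterance embedding instead of reading the classification head. A minimal sketch (mean pooling over time is one common choice, not the only one; `example_audio.wav` is a placeholder):

```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForAudioClassification

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")
model.eval()

audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last transformer layer has shape (batch, time, hidden); mean-pool over time to get
# one 768-dim embedding per clip (wav2vec2-base hidden size) for downstream analyses.
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```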

## ⚠️ Limitations

- Trained on short, clean audio segments (2–6 s)
- Cross-cultural variability may influence predictions
- Not intended for speech recognition or word-level tasks

---

## 🪪 License

- **Model License:** MIT
- **Dataset License:** CC BY 4.0 (Hilton et al., 2021, Zenodo)

---

## 🧾 Citation

If you use or build upon this model, please cite:

```bibtex
@misc{wav2vec2_adsids,
  author       = {Arun Prakash Singh},
  title        = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note         = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}
```

---

## 👤 Author

**Arun Prakash Singh**

Department of Linguistics and Scandinavian Studies, University of Oslo

📧 [email protected]

🔗 [https://github.com/arunps12](https://github.com/arunps12)