---
license: mit
base_model:
- facebook/wav2vec2-base
pipeline_tag: audio-classification
---

# 🗣️ Wav2Vec2-Base-ADSIDS

Fine-tuned `wav2vec2-base` model for **classifying speech register and vocal mode**:
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult Song (ADS-song), and Infant Song (IDS-song).

---

## 🧠 Model Overview

This model classifies a given speech or song segment into one of four vocalization categories:

- 👩🏫 **Adult-Directed Speech (ADS)**
- 🧸 **Infant-Directed Speech (IDS)**
- 🎵 **Adult Song (ADS-song)**
- 🎶 **Infant Song (IDS-song)**

It was fine-tuned from **facebook/wav2vec2-base** on the
[**Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo)**](https://zenodo.org/record/5525161),
which includes over **1,600 natural recordings** of infant- and adult-directed speech and song collected across **21 societies** worldwide.
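
The four categories above are exposed through the model configuration. A quick, optional check of the exact label strings and their indices (using the checkpoint ID from the usage section below; the mapping shown in the comment is illustrative, not guaranteed):

```python
from transformers import AutoConfig

# Inspect the label mapping shipped with the checkpoint
config = AutoConfig.from_pretrained("arunps/wav2vec2-base-adsids")
print(config.id2label)  # e.g. {0: "ADS", 1: "IDS", 2: "ADS-song", 3: "IDS-song"}
```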

---

## 📚 Dataset

**Dataset:** *The Naturalistic Human Vocalizations Corpus*

**Reference:** Hilton, C. B., Mehr, S. A., et al. (2021). [Zenodo DOI: 10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

This dataset captures both **speech** and **song**, directed to **infants** and **adults**, with consistent annotations across cultures, languages, and recording environments.

---

## ⚙️ Training Details

- **Base model:** `facebook/wav2vec2-base`
- **Framework:** PyTorch + 🤗 Transformers
- **Task:** 4-way classification
- **Optimizer:** AdamW
- **Learning rate:** 3e-5
- **Loss function:** Cross-Entropy
- **Epochs:** 10–15 (with early stopping)
- **Sampling rate:** 16 kHz
- **Segment duration:** 2–6 seconds
- **Hardware:** 1 × NVIDIA A100 GPU
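
The exact training script is not published with this card; the following is a minimal sketch of a comparable 🤗 `Trainer` setup using the hyperparameters listed above (cross-entropy is the default loss for this model class). The dataset objects (`train_ds`, `eval_ds`) and the batch size are placeholders/assumptions, not values reported on the card.

```python
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

labels = ["ADS", "IDS", "ADS-song", "IDS-song"]
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),
    label2id={l: i for i, l in enumerate(labels)},
    id2label={i: l for i, l in enumerate(labels)},
)

args = TrainingArguments(
    output_dir="wav2vec2-base-adsids",
    learning_rate=3e-5,              # AdamW is the Trainer default optimizer
    num_train_epochs=15,             # upper bound; early stopping can end training sooner
    per_device_train_batch_size=8,   # assumption, not reported on the card
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: pre-segmented 16 kHz clips (2–6 s) with labels
    eval_dataset=eval_ds,    # placeholder: held-out split
    processing_class=feature_extractor,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```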

---

## 📊 Example Performance (on held-out data)

| Class         | Precision | Recall   | F1-score |
|:--------------|:----------|:---------|:---------|
| ADS           | 0.61      | 0.58     | 0.59     |
| IDS           | 0.47      | 0.45     | 0.46     |
| ADS-song      | 0.55      | 0.53     | 0.54     |
| IDS-song      | 0.48      | 0.47     | 0.47     |
| **Macro Avg** | **0.53**  | **0.51** | **0.52** |

> The model achieves a **macro-average F1-score of around 52%**,
> indicating that it captures the **broad acoustic differences**
> between speech and song, and between adult- and infant-directed registers.
>
> However, performance is **lower for IDS and IDS-song**, suggesting that
> infant-directed vocalizations share **overlapping prosodic and melodic cues**
> (e.g., higher pitch, slower tempo, greater variability), making them
> more challenging to distinguish purely from acoustic information.
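
For reference, the macro averages in the last row of the table are simply the unweighted means of the four per-class scores:

```python
# Per-class scores from the table above: (precision, recall, F1)
scores = {
    "ADS":      (0.61, 0.58, 0.59),
    "IDS":      (0.47, 0.45, 0.46),
    "ADS-song": (0.55, 0.53, 0.54),
    "IDS-song": (0.48, 0.47, 0.47),
}

# Macro average = unweighted mean over the four classes
for name, column in zip(("precision", "recall", "F1"), zip(*scores.values())):
    print(f"macro {name}: {sum(column) / len(column):.4f}")
# macro precision: 0.5275, macro recall: 0.5075, macro F1: 0.5150
# -> rounded to two decimals these match the Macro Avg row (0.53 / 0.51 / 0.52)
```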

---

## 🧩 How to use from the 🤗 Transformers library

### 🧱 Use a pipeline (simple helper)

```python
from transformers import pipeline

# Load the audio-classification pipeline with this checkpoint
pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

# Classify an audio file (returns a score for each of the four classes)
preds = pipe("example_audio.wav")
print(preds)
```
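
If the waveform is already loaded in memory (for example after your own preprocessing), the pipeline also accepts a dict of raw samples plus their sampling rate instead of a file path; the file name below is only a placeholder:

```python
import librosa
from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

# Load and resample the audio yourself, then hand the raw samples to the pipeline
audio, sr = librosa.load("example_audio.wav", sr=16000)
preds = pipe({"raw": audio, "sampling_rate": sr}, top_k=4)
print(preds)
```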

### 🧰 Load the model directly

```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForAudioClassification

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

# Load and resample the audio to the 16 kHz rate the model expects
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to class probabilities and map them to label names
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
```
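
Because the model was trained on 2–6 second segments, longer recordings are better scored in short windows than as one long input. A minimal sketch of windowed inference (the 4 s window, 2 s hop, and probability averaging are illustrative choices, and `long_recording.wav` is a placeholder):

```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForAudioClassification

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")
model.eval()

# Load a longer recording at the 16 kHz rate the model expects
audio, sr = librosa.load("long_recording.wav", sr=16000)

win, hop = 4 * sr, 2 * sr  # 4 s windows with a 2 s hop (illustrative values)
window_probs = []
for start in range(0, max(len(audio) - win, 0) + 1, hop):
    chunk = audio[start:start + win]
    inputs = processor(chunk, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    window_probs.append(torch.softmax(logits, dim=-1))

# Average per-window probabilities into a single clip-level prediction
probs = torch.cat(window_probs).mean(dim=0)
print({model.config.id2label[i]: float(p) for i, p in enumerate(probs)})
```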

---

## 🧬 Research Context

This model builds on findings from the cross-cultural study of **infant-directed communication**:

> Hilton, C. B., et al. (2021). *The Naturalistic Human Vocalizations Corpus.* Zenodo. DOI: [10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

The study demonstrated that **infant-directed vocalizations**—both speech and song—share universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody. This fine-tuned Wav2Vec2 model captures these features for automatic classification.

---

## ✅ Intended Uses

- Research on **caregiver–infant vocal interaction**
- Acoustic analysis of **speech vs song registers**
- Feature extraction for **prosody, emotion, or language learning studies** (see the sketch below)
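
For the feature-extraction use case, the encoder's hidden states can be pooled into a fixed-size utterance embedding instead of reading the classification head. A minimal sketch (mean pooling over time is one common choice, not the only one; `example_audio.wav` is a placeholder):

```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForAudioClassification

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")
model.eval()

audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last transformer layer has shape (batch, time, hidden); mean-pool over time to get
# one 768-dim embedding per clip (wav2vec2-base hidden size) for downstream analyses.
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```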

## ⚠️ Limitations

- Trained on short, clean audio segments (2–6 s)
- Cross-cultural variability may influence predictions
- Not intended for speech recognition or word-level tasks

---

## 🪪 License

- **Model License:** MIT
- **Dataset License:** CC BY 4.0 (Hilton et al., 2021, Zenodo)

---

## 🧾 Citation

If you use or build upon this model, please cite:

```bibtex
@misc{wav2vec2_adsids,
  author       = {Arun Prakash Singh},
  title        = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note         = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}
```

---

## 👤 Author

**Arun Prakash Singh**

Department of Linguistics and Scandinavian Studies, University of Oslo

📧 [email protected]

🔗 [https://github.com/arunps12](https://github.com/arunps12)