# 🗣️ Wav2Vec2-Base-ADSIDS
Fine-tuned wav2vec2-base model for classifying speech register and vocal mode:
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult Song (ADS-song), and Infant Song (IDS-song).
## 🧠 Model Overview
This model classifies a given speech or song segment into one of four vocalization categories:
- 👩‍🏫 Adult-Directed Speech (ADS)
- 🧸 Infant-Directed Speech (IDS)
- 🎵 Adult Song (ADS-song)
- 🎶 Infant Song (IDS-song)
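The mapping between output indices and these four labels ships with the checkpoint and can be inspected directly from the config:

```python
from transformers import AutoConfig

# Read the index-to-label mapping stored in the model config
config = AutoConfig.from_pretrained("arunps/wav2vec2-base-adsids")
print(config.id2label)
```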
It was fine-tuned from facebook/wav2vec2-base on the
Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo),
which includes over 1,600 natural recordings of infant- and adult-directed speech and song collected across 21 societies worldwide.
## 📊 Dataset
Dataset: The Naturalistic Human Vocalizations Corpus
Reference: Hilton, C. B., & Mehr, S. A. (2021).
Zenodo DOI: 10.5281/zenodo.5525161
This dataset captures both speech and song, directed to infants and adults, with consistent annotations across cultures, languages, and recording environments.
## ⚙️ Training Details
- Base model: facebook/wav2vec2-base
- Framework: PyTorch + 🤗 Transformers
- Task: 4-way classification
- Optimizer: AdamW
- Learning rate: 3e-5
- Loss function: Cross-Entropy
- Epochs: 10–15 (with early stopping)
- Sampling rate: 16 kHz
- Segment duration: 2–6 seconds
- Hardware: 1 Γ NVIDIA A100 GPU
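For reference, a minimal sketch of how such a run could be set up with the 🤗 `Trainer`; `train_ds` and `eval_ds` are assumed placeholders for datasets already preprocessed into 16 kHz `input_values` with integer labels 0–3, and the early-stopping patience is illustrative:

```python
from transformers import (AutoFeatureExtractor, AutoModelForAudioClassification,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

# The feature extractor turns raw 16 kHz audio into input_values during preprocessing
extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=4)

args = TrainingArguments(
    output_dir="wav2vec2-base-adsids",
    learning_rate=3e-5,             # AdamW is the Trainer's default optimizer
    num_train_epochs=15,            # upper bound; early stopping usually ends sooner
    eval_strategy="epoch",          # older releases call this evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,    # required for early stopping
    metric_for_best_model="eval_loss",
)

# Cross-entropy is applied automatically when integer labels are passed to the model
trainer = Trainer(
    model=model, args=args,
    train_dataset=train_ds, eval_dataset=eval_ds,  # assumed preprocessed datasets
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
)
trainer.train()
```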
## 📈 Example Performance (on held-out data)
| Class | Precision | Recall | F1-score |
|---|---|---|---|
| ADS | 0.61 | 0.58 | 0.59 |
| IDS | 0.47 | 0.45 | 0.46 |
| ADS-song | 0.55 | 0.53 | 0.54 |
| IDS-song | 0.48 | 0.47 | 0.47 |
| Macro Avg | 0.53 | 0.51 | 0.52 |
The model achieves a macro-average F1-score of about 0.52, well above the 0.25 chance level of a four-way task, indicating that it captures the broad acoustic differences between speech and song and between adult- and infant-directed registers.

Performance is lower for IDS and IDS-song, suggesting that infant-directed vocalizations share overlapping prosodic and melodic cues (e.g., higher pitch, slower tempo, greater variability) that make them harder to distinguish from acoustic information alone.
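These figures are the standard per-class precision/recall/F1 metrics; a sketch of how they could be reproduced with scikit-learn, assuming `y_true` and `y_pred` hold integer labels for the held-out segments:

```python
from sklearn.metrics import classification_report

# y_true / y_pred: illustrative names for held-out gold and predicted labels
label_names = ["ADS", "IDS", "ADS-song", "IDS-song"]
print(classification_report(y_true, y_pred, target_names=label_names, digits=2))
```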
## 🧩 How to use from the 🤗 Transformers library
### 🧱 Use a pipeline (simple helper)
```python
from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")
preds = pipe("example_audio.wav")
print(preds)
```
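The pipeline returns a list of `{"label", "score"}` dictionaries sorted by descending score, so the first entry is the predicted class; it also handles loading and resampling the audio file internally.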
### 🧰 Load the model directly
```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForAudioClassification

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")
model.eval()

# Load audio at the 16 kHz rate the model expects
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to class probabilities and map indices to label names
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
```
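To get a single hard prediction instead of the full distribution, take the argmax over the logits:

```python
# Index of the highest-probability class
pred_id = int(torch.argmax(logits, dim=-1))
print(labels[pred_id])
```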
## 🧬 Research Context
This model builds on findings from the cross-cultural study of infant-directed communication:
Hilton, C. B., & Mehr, S. A. (2021). The Naturalistic Human Vocalizations Corpus. Zenodo. DOI: 10.5281/zenodo.5525161
The study demonstrated that infant-directed vocalizations, both speech and song, share
universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody.
This fine-tuned Wav2Vec2 model captures these features for automatic classification.
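As a rough illustration (separate from the model itself), one of these cues, fundamental frequency, can be estimated from a segment with librosa's pYIN implementation:

```python
import librosa
import numpy as np

# Estimate F0 per frame; IDS segments typically show a higher median pitch than ADS
audio, sr = librosa.load("example_audio.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(audio, fmin=65.0, fmax=600.0, sr=sr)
print("median F0 (Hz):", np.nanmedian(f0))
```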
## ✅ Intended Uses
- Research on caregiverβinfant vocal interaction
- Acoustic analysis of speech vs song registers
- Feature extraction for prosody, emotion, or language learning studies
## ⚠️ Limitations
- Trained on short, clean audio segments (2–6 s); for longer recordings, see the windowing sketch below
- Cross-cultural variability may influence predictions
- Not intended for speech recognition or word-level tasks
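Given the first limitation, one workable (untested) strategy for longer recordings is to score overlapping fixed-length windows and average the class probabilities; a sketch reusing `model` and `processor` from the loading example above:

```python
import torch
import librosa

def classify_long_audio(path, model, processor, win_s=4.0, hop_s=2.0):
    """Average class probabilities over sliding windows (illustrative strategy)."""
    audio, sr = librosa.load(path, sr=16000)
    win, hop = int(win_s * sr), int(hop_s * sr)
    window_probs = []
    for start in range(0, max(1, len(audio) - win + 1), hop):
        chunk = audio[start:start + win]
        inputs = processor(chunk, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        window_probs.append(torch.softmax(logits, dim=-1))
    mean_probs = torch.cat(window_probs).mean(dim=0)
    return {model.config.id2label[i]: float(p) for i, p in enumerate(mean_probs)}
```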
## 🪪 License
- Model License: MIT
- Dataset License: CC BY 4.0 (Hilton et al., 2021, Zenodo)
## 🧾 Citation
If you use or build upon this model, please cite:
```bibtex
@misc{wav2vec2_adsids,
  author       = {Arun Prakash Singh},
  title        = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note         = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}
```
## 👤 Author
Arun Prakash Singh
Department of Linguistics and Scandinavian Studies, University of Oslo
π§ [email protected]
π https://github.com/arunps12