πŸ—£οΈ Wav2Vec2-Base-ADSIDS

Fine-tuned wav2vec2-base model for classifying speech register and vocal mode:
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult-Directed Song (ADS-song), and Infant-Directed Song (IDS-song).


🧠 Model Overview

This model classifies a given speech or song segment into one of four vocalization categories:

  • πŸ‘©β€πŸ« Adult-Directed Speech (ADS)
  • 🧸 Infant-Directed Speech (IDS)
  • 🎡 Adult-Directed Song (ADS-song)
  • 🎢 Infant-Directed Song (IDS-song)

It was fine-tuned from facebook/wav2vec2-base on the
Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo),
which includes over 1,600 natural recordings of infant- and adult-directed speech and song collected across 21 societies worldwide.


πŸ“š Dataset

Dataset: The Naturalistic Human Vocalizations Corpus
Reference: Hilton, Mehr, et al. (2021).
Zenodo DOI: 10.5281/zenodo.5525161

This dataset captures both speech and song, directed to infants and adults, with consistent annotations across cultures, languages, and recording environments.


βš™οΈ Training Details

  • Base model: facebook/wav2vec2-base
  • Framework: PyTorch + πŸ€— Transformers
  • Task: 4-way classification
  • Optimizer: AdamW
  • Learning rate: 3e-5
  • Loss function: Cross-Entropy
  • Epochs: 10–15 (with early stopping; see the training sketch after this list)
  • Sampling rate: 16 kHz
  • Segment duration: 2–6 seconds
  • Hardware: 1 Γ— NVIDIA A100 GPU
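
The released training script is not included here; the sketch below shows one way such a setup could look with the πŸ€— Trainer API, mirroring the hyperparameters above. The names train_ds and eval_ds are hypothetical placeholders for πŸ€— Datasets of 2–6 s segments with 16 kHz "audio" arrays and integer "label" columns.

import numpy as np
from sklearn.metrics import f1_score
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)

labels = ["ADS", "IDS", "ADS-song", "IDS-song"]
label2id = {name: i for i, name in enumerate(labels)}
id2label = {i: name for i, name in enumerate(labels)}

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),          # 4-way classification head (cross-entropy loss)
    label2id=label2id,
    id2label=id2label,
)

def preprocess(batch):
    # Turn raw 16 kHz waveforms into fixed-length model inputs (capped at 6 s).
    arrays = [a["array"] for a in batch["audio"]]
    return feature_extractor(
        arrays, sampling_rate=16000,
        max_length=16000 * 6, truncation=True, padding="max_length",
    )

def compute_metrics(eval_pred):
    # Macro F1 over the four classes, used for model selection.
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"f1": f1_score(eval_pred.label_ids, preds, average="macro")}

# train_ds / eval_ds: hypothetical segmented datasets, e.g.
# train_ds = train_ds.map(preprocess, batched=True, remove_columns=["audio"])

args = TrainingArguments(
    output_dir="wav2vec2-base-adsids",
    learning_rate=3e-5,
    num_train_epochs=15,             # upper bound; early stopping usually ends sooner
    per_device_train_batch_size=8,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,          # hypothetical preprocessed dataset
    eval_dataset=eval_ds,            # hypothetical preprocessed dataset
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()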

πŸ“Š Example Performance (on held-out data)

Class       Precision   Recall   F1-score
ADS         0.61        0.58     0.59
IDS         0.47        0.45     0.46
ADS-song    0.55        0.53     0.54
IDS-song    0.48        0.47     0.47
Macro Avg   0.53        0.51     0.52

The model achieves a macro-average F1-score of about 0.52, well above the 0.25 chance level
for a four-way task, indicating that it captures the broad acoustic differences
between speech and song, and between adult- and infant-directed registers.

However, performance is lower for IDS and IDS-song, suggesting that
infant-directed vocalizations share overlapping prosodic and melodic cues
(e.g., higher pitch, slower tempo, greater variability), making them
more challenging to distinguish purely from acoustic information.
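
For reference, per-class precision, recall, and F1 of the kind shown above can be computed with scikit-learn's classification_report. The snippet below is a generic sketch; y_true and y_pred are hypothetical placeholders for held-out labels and model predictions.

from sklearn.metrics import classification_report

labels = ["ADS", "IDS", "ADS-song", "IDS-song"]

# y_true / y_pred are hypothetical placeholders: integer class indices for a
# held-out set and the corresponding model predictions.
y_true = [0, 1, 2, 3, 1, 0]
y_pred = [0, 1, 2, 3, 0, 0]

print(classification_report(y_true, y_pred, target_names=labels, digits=2))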


🧩 How to use from the πŸ€— Transformers library

🧱 Use a pipeline (simple helper)

from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

preds = pipe("example_audio.wav")
print(preds)
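
The pipeline returns one {"label": ..., "score": ...} dictionary per class, sorted by descending score, so the first entry is the predicted category.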

🧰 Load the model directly

from transformers import AutoProcessor, AutoModelForAudioClassification
import torch, librosa

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

# Load the audio at the 16 kHz sampling rate the model was trained on
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# Forward pass without gradients; logits has shape (1, 4)
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to class probabilities and map indices to label names
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
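
If only the top prediction is needed, the highest-probability index can be mapped through id2label, for example:

pred_label = labels[int(probs[0].argmax())]
print(pred_label)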

🧬 Research Context

This model builds on findings from the cross-cultural study of infant-directed communication:

Hilton et al. (2021). The Naturalistic Human Vocalizations Corpus. Zenodo. DOI: 10.5281/zenodo.5525161

The study demonstrated that infant-directed vocalizationsβ€”both speech and songβ€”share
universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody.
This fine-tuned Wav2Vec2 model captures these features for automatic classification.


βœ… Intended Uses

  • Research on caregiver–infant vocal interaction
  • Acoustic analysis of speech vs song registers
  • Feature extraction for prosody, emotion, or language learning studies (see the embedding sketch after this list)
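
For the feature-extraction use case, one possible approach (a sketch, not part of the released code) is to request the backbone's hidden states and pool them into a clip-level embedding; the audio file name is a placeholder.

from transformers import AutoProcessor, AutoModelForAudioClassification
import torch, librosa

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

audio, sr = librosa.load("example_audio.wav", sr=16000)  # placeholder file name
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last transformer layer: shape (batch, frames, 768); mean-pool over time
# to get a single embedding per clip for downstream prosody/emotion studies.
clip_embedding = outputs.hidden_states[-1].mean(dim=1)
print(clip_embedding.shape)  # torch.Size([1, 768])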

⚠️ Limitations

  • Trained on short, clean audio segments (2–6 s); longer recordings are best split into similar-length windows before classification (see the sketch after this list)
  • Accuracy may vary across languages, cultures, and recording conditions that are under-represented in the training data
  • Not intended for speech recognition or word-level tasks
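
Because training segments were 2–6 s long, one reasonable workaround for longer recordings (an assumption, not part of the released code) is to slice the audio into fixed windows and classify each window separately:

from transformers import pipeline
import librosa

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

# "long_recording.wav" is a placeholder for any recording longer than ~6 s
audio, sr = librosa.load("long_recording.wav", sr=16000)

window = 4 * sr  # 4-second windows, within the 2–6 s training range
for start in range(0, max(len(audio) - window + 1, 1), window):
    chunk = audio[start:start + window]
    top = pipe(chunk)[0]  # highest-scoring class for this window
    print(f"{start / sr:5.1f}s  {top['label']}  {top['score']:.2f}")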

πŸͺͺ License

  • Model License: MIT
  • Dataset License: CC BY 4.0 (Hilton et al., 2021, Zenodo)

🧾 Citation

If you use or build upon this model, please cite:

@misc{wav2vec2_adsids,
  author       = {Arun Prakash Singh},
  title        = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note         = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}

πŸ‘€ Author

Arun Prakash Singh
Department of Linguistics and Scandinavian Studies, University of Oslo
πŸ“§ [email protected]
πŸ”— https://github.com/arunps12
