---
license: mit
base_model:
  - facebook/wav2vec2-base
pipeline_tag: audio-classification
---

🗣️ Wav2Vec2-Base-ADSIDS

A fine-tuned wav2vec2-base model for classifying speech register and vocal mode:
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult-Directed Song (ADS-song), and Infant-Directed Song (IDS-song).


🧠 Model Overview

This model classifies a given speech or song segment into one of four vocalization categories:

  • 👩‍🏫 Adult-Directed Speech (ADS)
  • 🧸 Infant-Directed Speech (IDS)
  • 🎵 Adult-Directed Song (ADS-song)
  • 🎶 Infant-Directed Song (IDS-song)

It was fine-tuned from facebook/wav2vec2-base on the
Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo),
which includes over 1,600 natural recordings of infant- and adult-directed speech and song collected across 21 societies worldwide.


📚 Dataset

Dataset: The Naturalistic Human Vocalizations Corpus
Reference: Hilton, E., Mehr, S. A. et al. (2021).
Zenodo DOI: 10.5281/zenodo.5525161

This dataset captures both speech and song, directed to infants and adults, with consistent annotations across cultures, languages, and recording environments.


⚙️ Training Details

  • Base model: facebook/wav2vec2-base
  • Framework: PyTorch + 🤗 Transformers
  • Task: 4-way classification
  • Optimizer: AdamW
  • Learning rate: 3e-5
  • Loss function: Cross-Entropy
  • Epochs: 10–15 (with early stopping)
  • Sampling rate: 16 kHz
  • Segment duration: 2–6 seconds
  • Hardware: 1 × NVIDIA A100 GPU
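
The training script itself is not included in this repository; the snippet below is a minimal sketch of a comparable fine-tuning setup with the 🤗 Trainer, using the hyperparameters listed above. The train_ds/val_ds splits, batch size, and early-stopping patience are illustrative assumptions, not the author's exact configuration.

from transformers import (AutoFeatureExtractor, AutoModelForAudioClassification,
                          TrainingArguments, Trainer, EarlyStoppingCallback)

labels = ["ADS", "IDS", "ADS-song", "IDS-song"]
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),
    label2id={l: i for i, l in enumerate(labels)},
    id2label={i: l for i, l in enumerate(labels)},
)

def preprocess(batch):
    # batch["audio"]: raw 16 kHz waveforms (2-6 s segments); batch["label"]: class indices
    inputs = feature_extractor(batch["audio"], sampling_rate=16000,
                               max_length=6 * 16000, truncation=True)
    inputs["labels"] = batch["label"]
    return inputs

args = TrainingArguments(
    output_dir="wav2vec2-base-adsids",
    learning_rate=3e-5,                # AdamW is the Trainer's default optimizer
    num_train_epochs=15,
    per_device_train_batch_size=8,     # assumption: batch size is not stated in the card
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(preprocess, batched=True),  # hypothetical 🤗 Datasets splits
    eval_dataset=val_ds.map(preprocess, batched=True),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()

Cross-entropy loss is applied automatically by AutoModelForAudioClassification when labels are passed, matching the loss function listed above.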

📊 Example Performance (on held-out data)

| Class     | Precision | Recall | F1-score |
|-----------|-----------|--------|----------|
| ADS       | 0.61      | 0.58   | 0.59     |
| IDS       | 0.47      | 0.45   | 0.46     |
| ADS-song  | 0.55      | 0.53   | 0.54     |
| IDS-song  | 0.48      | 0.47   | 0.47     |
| Macro Avg | 0.53      | 0.51   | 0.52     |

The model achieves a macro-average F1-score of approximately 0.52,
indicating that it captures the broad acoustic differences
between speech and song, and between adult- and infant-directed registers.

However, performance is lower for IDS and IDS-song, suggesting that
infant-directed speech and song share overlapping prosodic and melodic cues
(e.g., higher pitch, slower tempo, greater variability) that make them
harder to distinguish from acoustic information alone.
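
As a point of reference, per-class precision, recall, and F1 in the form of the table above can be computed from held-out predictions with scikit-learn; the y_true and y_pred arrays here are placeholders for your own integer-encoded gold labels and model predictions.

from sklearn.metrics import classification_report

label_names = ["ADS", "IDS", "ADS-song", "IDS-song"]
# y_true: gold class indices for held-out segments; y_pred: predicted indices (placeholders)
print(classification_report(y_true, y_pred, target_names=label_names, digits=2))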


🧩 How to use from the 🤗 Transformers library

🧱 Use a pipeline (simple helper)

from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

preds = pipe("example_audio.wav")
print(preds)
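
The pipeline also accepts an in-memory waveform instead of a file path; a dict carrying the raw samples and their sampling rate works, with librosa shown here as just one way to load the array.

import librosa

audio, sr = librosa.load("example_audio.wav", sr=16000)
preds = pipe({"raw": audio, "sampling_rate": sr})
print(preds)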

🧰 Load the model directly

from transformers import AutoProcessor, AutoModelForAudioClassification
import torch
import librosa

# Load the preprocessor and the fine-tuned classification model
processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

# Load the audio and resample it to the 16 kHz rate the model expects
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to class probabilities and map class indices to label names
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
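
To report only the single most likely class instead of the full distribution, take the argmax over the logits:

pred_id = int(torch.argmax(logits, dim=-1))
print("Predicted class:", labels[pred_id])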

🧬 Research Context

This model builds on findings from the cross-cultural study of infant-directed communication:

Hilton, E. et al. (2021). The Naturalistic Human Vocalizations Corpus. Zenodo. DOI: 10.5281/zenodo.5525161

The study demonstrated that infant-directed vocalizations—both speech and song—share
universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody.
This fine-tuned Wav2Vec2 model captures these features for automatic classification.


✅ Intended Uses

  • Research on caregiver–infant vocal interaction
  • Acoustic analysis of speech vs song registers
  • Feature extraction for prosody, emotion, or language learning studies
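
For the feature-extraction use case above, the encoder's hidden states can serve as segment-level embeddings. A minimal sketch, reusing the model, processor, and inputs from the loading example:

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the last hidden layer over time to get one embedding vector per segment
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # (batch_size, 768) for wav2vec2-base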

⚠️ Limitations

  • Trained on short, clean audio segments (2–6 s); longer recordings should be chunked before classification (see the sketch after this list)
  • Cross-cultural variability may influence predictions
  • Not intended for speech recognition or word-level tasks
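
A minimal chunking sketch for longer files, assuming the pipe object from the usage section above; the 4-second window and simple score averaging are illustrative choices, not a validated protocol.

import numpy as np
import librosa

audio, sr = librosa.load("long_recording.wav", sr=16000)
window = 4 * sr  # 4-second chunks, within the 2-6 s range used in training

# Classify each chunk, then average the scores per label across chunks
scores = {}
for start in range(0, len(audio), window):
    chunk = audio[start:start + window]
    if len(chunk) < sr:          # skip trailing fragments shorter than 1 second
        continue
    for pred in pipe({"raw": chunk, "sampling_rate": sr}):
        scores.setdefault(pred["label"], []).append(pred["score"])

print({label: float(np.mean(s)) for label, s in scores.items()})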

🪪 License

  • Model License: MIT
  • Dataset License: CC BY 4.0 (Hilton et al., 2021, Zenodo)

🧾 Citation

If you use or build upon this model, please cite:

@misc{wav2vec2_adsids,
  author       = {Arun Prakash Singh},
  title        = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note         = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}

👤 Author

Arun Prakash Singh
Department of Linguistics and Scandinavian Studies, University of Oslo
📧 arunps@uio.no
🔗 https://github.com/arunps12