---
license: mit
base_model:
  - facebook/wav2vec2-base
pipeline_tag: audio-classification
---

🗣️ Wav2Vec2-Base-ADSIDS

A fine-tuned wav2vec2-base model for classifying speech register and vocal mode:
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult-Directed Song (ADS-song), and Infant-Directed Song (IDS-song).


🧠 Model Overview

This model classifies a given speech or song segment into one of four vocalization categories:

  • 👩‍🏫 Adult-Directed Speech (ADS)
  • 🧸 Infant-Directed Speech (IDS)
  • 🎵 Adult-Directed Song (ADS-song)
  • 🎶 Infant-Directed Song (IDS-song)

It was fine-tuned from facebook/wav2vec2-base on the
Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo),
which includes over 1,600 natural recordings of infant- and adult-directed speech and song collected across 21 societies worldwide.


📚 Dataset

Dataset: The Naturalistic Human Vocalizations Corpus
Reference: Hilton, E., Mehr, S. A. et al. (2021).
Zenodo DOI: 10.5281/zenodo.5525161

This dataset captures both speech and song, directed to infants and adults, with consistent annotations across cultures, languages, and recording environments.


⚙️ Training Details

  • Base model: facebook/wav2vec2-base
  • Framework: PyTorch + 🤗 Transformers
  • Task: 4-way classification
  • Optimizer: AdamW
  • Learning rate: 3e-5
  • Loss function: Cross-Entropy
  • Epochs: 10–15 (with early stopping)
  • Sampling rate: 16 kHz
  • Segment duration: 2–6 seconds
  • Hardware: 1 × NVIDIA A100 GPU
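
The training script itself is not included in this repository; the snippet below is a minimal sketch of a comparable fine-tuning setup with the 🤗 Trainer, using the hyperparameters listed above. The train_ds/val_ds splits, batch size, and early-stopping patience are illustrative assumptions, not the author's exact configuration.

from transformers import (AutoFeatureExtractor, AutoModelForAudioClassification,
                          TrainingArguments, Trainer, EarlyStoppingCallback)

labels = ["ADS", "IDS", "ADS-song", "IDS-song"]
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),
    label2id={l: i for i, l in enumerate(labels)},
    id2label={i: l for i, l in enumerate(labels)},
)

def preprocess(batch):
    # batch["audio"]: raw 16 kHz waveforms (2-6 s segments); batch["label"]: class indices
    inputs = feature_extractor(batch["audio"], sampling_rate=16000,
                               max_length=6 * 16000, truncation=True)
    inputs["labels"] = batch["label"]
    return inputs

args = TrainingArguments(
    output_dir="wav2vec2-base-adsids",
    learning_rate=3e-5,                # AdamW is the Trainer's default optimizer
    num_train_epochs=15,
    per_device_train_batch_size=8,     # assumption: batch size is not stated in the card
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(preprocess, batched=True),  # hypothetical 🤗 Datasets splits
    eval_dataset=val_ds.map(preprocess, batched=True),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()

Cross-entropy loss is applied automatically by AutoModelForAudioClassification when labels are passed, matching the loss function listed above.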

📊 Example Performance (on held-out data)

| Class     | Precision | Recall | F1-score |
|-----------|-----------|--------|----------|
| ADS       | 0.61      | 0.58   | 0.59     |
| IDS       | 0.47      | 0.45   | 0.46     |
| ADS-song  | 0.55      | 0.53   | 0.54     |
| IDS-song  | 0.48      | 0.47   | 0.47     |
| Macro Avg | 0.53      | 0.51   | 0.52     |

The model achieves a macro-average F1-score of approximately 0.52,
indicating that it captures the broad acoustic differences
between speech and song, and between adult- and infant-directed registers.

However, performance is lower for IDS and IDS-song, suggesting that
infant-directed speech and song share overlapping prosodic and melodic cues
(e.g., higher pitch, slower tempo, greater variability) that make them
harder to distinguish from acoustic information alone.
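
As a point of reference, per-class precision, recall, and F1 in the form of the table above can be computed from held-out predictions with scikit-learn; the y_true and y_pred arrays here are placeholders for your own integer-encoded gold labels and model predictions.

from sklearn.metrics import classification_report

label_names = ["ADS", "IDS", "ADS-song", "IDS-song"]
# y_true: gold class indices for held-out segments; y_pred: predicted indices (placeholders)
print(classification_report(y_true, y_pred, target_names=label_names, digits=2))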


🧩 How to use from the 🤗 Transformers library

🧱 Use a pipeline (simple helper)

from transformers import pipeline

pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

preds = pipe("example_audio.wav")
print(preds)
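
The pipeline also accepts an in-memory waveform instead of a file path; a dict carrying the raw samples and their sampling rate works, with librosa shown here as just one way to load the array.

import librosa

audio, sr = librosa.load("example_audio.wav", sr=16000)
preds = pipe({"raw": audio, "sampling_rate": sr})
print(preds)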

🧰 Load the model directly

from transformers import AutoProcessor, AutoModelForAudioClassification
import torch
import librosa

# Load the preprocessor and the fine-tuned classification model
processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

# Load the audio and resample it to the 16 kHz rate the model expects
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to class probabilities and map class indices to label names
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
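
To report only the single most likely class instead of the full distribution, take the argmax over the logits:

pred_id = int(torch.argmax(logits, dim=-1))
print("Predicted class:", labels[pred_id])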

🧬 Research Context

This model builds on findings from the cross-cultural study of infant-directed communication:

Hilton, E. et al. (2021). The Naturalistic Human Vocalizations Corpus. Zenodo. DOI: 10.5281/zenodo.5525161

The study demonstrated that infant-directed vocalizations—both speech and song—share
universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody.
This fine-tuned Wav2Vec2 model captures these features for automatic classification.


✅ Intended Uses

  • Research on caregiver–infant vocal interaction
  • Acoustic analysis of speech vs song registers
  • Feature extraction for prosody, emotion, or language learning studies
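
For the feature-extraction use case above, the encoder's hidden states can serve as segment-level embeddings. A minimal sketch, reusing the model, processor, and inputs from the loading example:

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the last hidden layer over time to get one embedding vector per segment
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # (batch_size, 768) for wav2vec2-base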

⚠️ Limitations

  • Trained on short, clean audio segments (2–6 s); longer recordings should be chunked before classification (see the sketch after this list)
  • Cross-cultural variability may influence predictions
  • Not intended for speech recognition or word-level tasks
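
A minimal chunking sketch for longer files, assuming the pipe object from the usage section above; the 4-second window and simple score averaging are illustrative choices, not a validated protocol.

import numpy as np
import librosa

audio, sr = librosa.load("long_recording.wav", sr=16000)
window = 4 * sr  # 4-second chunks, within the 2-6 s range used in training

# Classify each chunk, then average the scores per label across chunks
scores = {}
for start in range(0, len(audio), window):
    chunk = audio[start:start + window]
    if len(chunk) < sr:          # skip trailing fragments shorter than 1 second
        continue
    for pred in pipe({"raw": chunk, "sampling_rate": sr}):
        scores.setdefault(pred["label"], []).append(pred["score"])

print({label: float(np.mean(s)) for label, s in scores.items()})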

🪪 License

  • Model License: MIT
  • Dataset License: CC BY 4.0 (Hilton et al., 2021, Zenodo)

🧾 Citation

If you use or build upon this model, please cite:

@misc{wav2vec2_adsids,
  author       = {Arun Prakash Singh},
  title        = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note         = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}

👤 Author

Arun Prakash Singh
Department of Linguistics and Scandinavian Studies, University of Oslo
📧 arunps@uio.no
🔗 https://github.com/arunps12