---
license: mit
base_model:
- facebook/wav2vec2-base
pipeline_tag: audio-classification
---

# 🗣️ Wav2Vec2-Base-ADSIDS

Fine-tuned `wav2vec2-base` model for **classifying speech register and vocal mode**:
Adult-Directed Speech (ADS), Infant-Directed Speech (IDS), Adult Song (ADS-song), and Infant Song (IDS-song).

---

## 🧠 Model Overview

This model classifies a given speech or song segment into one of four vocalization categories:

- 👩‍🏫 **Adult-Directed Speech (ADS)**
- 🧸 **Infant-Directed Speech (IDS)**
- 🎵 **Adult Song (ADS-song)**
- 🎶 **Infant Song (IDS-song)**

It was fine-tuned from **facebook/wav2vec2-base** on the [**Naturalistic Human Vocalizations Corpus (Hilton et al., 2021, Zenodo)**](https://zenodo.org/record/5525161), which includes over **1,600 natural recordings** of infant- and adult-directed speech and song collected across **21 societies** worldwide.

---

## 📚 Dataset

**Dataset:** *The Naturalistic Human Vocalizations Corpus*

**Reference:** Hilton, E., Mehr, S. A. et al. (2021). [Zenodo DOI: 10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

This dataset captures both **speech** and **song**, directed to **infants** and **adults**, with consistent annotations across cultures, languages, and recording environments.


---

## ⚙️ Training Details

- **Base model:** `facebook/wav2vec2-base`
- **Framework:** PyTorch + 🤗 Transformers
- **Task:** 4-way classification
- **Optimizer:** AdamW
- **Learning rate:** 3e-5
- **Loss function:** Cross-Entropy
- **Epochs:** 10–15 (with early stopping)
- **Sampling rate:** 16 kHz
- **Segment duration:** 2–6 seconds
- **Hardware:** 1 × NVIDIA A100 GPU

---

## 📊 Example Performance (on held-out data)

| Class         | Precision | Recall  | F1-score  |
|:--------------|:----------|:--------|:----------|
| ADS           | 0.61      | 0.58    | 0.59      |
| IDS           | 0.47      | 0.45    | 0.46      |
| ADS-song      | 0.55      | 0.53    | 0.54      |
| IDS-song      | 0.48      | 0.47    | 0.47      |
| **Macro Avg** | **0.53**  | **0.51** | **0.52** |

> The model achieves a **macro-average F1-score of about 0.52**,
> indicating that it captures some of the **broad acoustic differences**
> between speech and song, and between adult- and infant-directed registers,
> while leaving many segments misclassified.
>
> Performance is **lower for IDS and IDS-song** (F1 of 0.46 and 0.47) than for
> ADS and ADS-song (0.59 and 0.54), suggesting that infant-directed vocalizations
> share **overlapping prosodic and melodic cues** (e.g., higher pitch, slower tempo,
> greater variability), making them more challenging to distinguish purely
> from acoustic information.

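
The **Macro Avg** row is the unweighted mean of the four per-class rows, so each class counts equally regardless of how many held-out segments it contains. The sketch below recomputes it from the rounded values in the table. The names `per_class` and `macro_average` are illustrative and not part of the model's code, and because the inputs are already rounded to two decimals, the results may differ from the table in the last digit.

```python
# Per-class scores copied from the table above: (precision, recall, F1).
per_class = {
    "ADS": (0.61, 0.58, 0.59),
    "IDS": (0.47, 0.45, 0.46),
    "ADS-song": (0.55, 0.53, 0.54),
    "IDS-song": (0.48, 0.47, 0.47),
}


def macro_average(scores):
    """Return the unweighted mean of each metric across all classes."""
    n = len(scores)
    precision = sum(p for p, _, _ in scores.values()) / n
    recall = sum(r for _, r, _ in scores.values()) / n
    f1 = sum(f for _, _, f in scores.values()) / n
    return precision, recall, f1


macro_p, macro_r, macro_f1 = macro_average(per_class)
print(f"macro precision={macro_p:.4f} recall={macro_r:.4f} F1={macro_f1:.4f}")
```
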

---

## 🧩 How to use from the 🤗 Transformers library

### 🧱 Use a pipeline (simple helper)

```python
from transformers import pipeline

# The pipeline handles audio loading, preprocessing, and label mapping
pipe = pipeline("audio-classification", model="arunps/wav2vec2-base-adsids")

preds = pipe("example_audio.wav")
print(preds)
```

### 🧰 Load the model directly

```python
from transformers import AutoProcessor, AutoModelForAudioClassification
import torch
import librosa

processor = AutoProcessor.from_pretrained("arunps/wav2vec2-base-adsids")
model = AutoModelForAudioClassification.from_pretrained("arunps/wav2vec2-base-adsids")

# Load the audio as mono, resampled to the 16 kHz rate used in training
audio, sr = librosa.load("example_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# Forward pass without gradient tracking, then convert logits to probabilities
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)

# Map each class index to its label name
labels = model.config.id2label
print({labels[i]: float(p) for i, p in enumerate(probs[0])})
```

---

## 🧬 Research Context

This model builds on findings from the cross-cultural study of **infant-directed communication**:

> Hilton, E. et al. (2021). *The Naturalistic Human Vocalizations Corpus.* Zenodo. DOI: [10.5281/zenodo.5525161](https://zenodo.org/record/5525161)

The study demonstrated that **infant-directed vocalizations**, both speech and song, share universal acoustic properties: higher pitch, expanded vowel space, and smoother prosody.
This fine-tuned Wav2Vec2 model aims to capture these features for automatic classification.

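
Naturalistic recordings are often much longer than the 2–6 second segments the model was trained on, so applying the classifier to them usually means cutting the audio into short windows first. The sketch below shows one possible way to do that with NumPy. The function `split_into_segments`, its parameters, and the choice of fixed non-overlapping windows are assumptions made for illustration; they do not describe how the training segments were actually produced.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, the sampling rate listed under Training Details


def split_into_segments(audio, sr=SAMPLE_RATE, max_s=6.0, min_s=2.0):
    """Cut a 1-D waveform into consecutive windows of at most max_s seconds.

    A trailing remainder shorter than min_s seconds is dropped, so every
    returned segment falls inside the 2-6 s range used for training.
    """
    max_len = int(max_s * sr)
    min_len = int(min_s * sr)
    segments = []
    for start in range(0, len(audio), max_len):
        chunk = audio[start:start + max_len]
        if len(chunk) >= min_len:
            segments.append(chunk)
    return segments


# Synthetic stand-in for a 15-second mono recording (silence).
recording = np.zeros(15 * SAMPLE_RATE, dtype=np.float32)
segments = split_into_segments(recording)
print([len(s) / SAMPLE_RATE for s in segments])  # [6.0, 6.0, 3.0]
```

Each resulting segment can then be classified separately, and the per-segment predictions aggregated however suits the analysis (for example, by majority vote).
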

---

## ✅ Intended Uses

- Research on **caregiver–infant vocal interaction**
- Acoustic analysis of **speech vs song registers**
- Feature extraction for **prosody, emotion, or language learning studies**

## ⚠️ Limitations

- Trained on short, clean audio segments (2–6 s)
- Cross-cultural variability may influence predictions
- Not intended for speech recognition or word-level tasks

---

## 🪪 License

- **Model License:** MIT
- **Dataset License:** CC BY 4.0 (Hilton et al., 2021, Zenodo)

---

## 🧾 Citation

If you use or build upon this model, please cite:

```bibtex
@misc{wav2vec2_adsids,
  author       = {Arun Prakash Singh},
  title        = {Wav2Vec2-Base-ADSIDS: Fine-tuned model for Adult-Directed Speech, Infant-Directed Speech, and Song classification},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/arunps/wav2vec2-base-adsids}},
  note         = {MIT License, trained on the Naturalistic Human Vocalizations Corpus (Hilton et al., 2021)}
}
```

---

## 👤 Author

**Arun Prakash Singh**  
Department of Linguistics and Scandinavian Studies, University of Oslo  
📧 arunps@uio.no  
🔗 [https://github.com/arunps12](https://github.com/arunps12)