Update README.md
README.md CHANGED
@@ -3,44 +3,31 @@ library_name: transformers
tags: []
---
# Model Card
UltraVAD is a context-aware, audio-native neural endpointing model. It estimates the probability that a speaker has finished their turn in real time by fusing dialog context (text) with the user's speech (audio). UltraVAD frames endpointing as next-token prediction of an explicit end-of-turn token `<eot>`, and uses the Ultravox audio projector to become audio-native.
## Model Details
The model estimates `P(<eot> | context, user audio)` and returns an end-of-turn decision using a configurable threshold. This design captures semantic intent (from text) and paralinguistic cues (from audio) such as pauses, intonation, and pitch.

- **Developed by:** Ultravox.ai
- **Model type:** Context-aware audio-text fusion endpointing model
- **Language(s):** ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi
- **Finetuned from model:** Llama-8b
- **Website/Repo:** https://ultravox.ai
- **Demo:** https://demo.ultravox.ai/
- **Benchmark:** https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts
## Usage
To use the model, try the following:
```python
# pip install transformers peft librosa

import transformers
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True)
path = "<path-to-input-audio>"
audio, sr = librosa.load(path, sr=16000)
turns = [
    {
        "role": "assistant",
        "content": "Hi, how are you?"
    },
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```
### Training Details
UltraVAD is trained in two stages to be context-aware and audio-native.

**Training Data**

**Text stage:** Synthetic conversational corpora with inserted `<eot>` tokens after valid stopping points, covering multiple languages via translation augmentation.

**Audio stage:** The Ultravox audio projector is initialized from Ultravox training (robust to real-world noise, device/mic variance, overlapping speech). Additional single-turn samples emphasize prosodic cues (intonation, pitch, drawn-out syllables).

**Languages:** 26 supported; easy to extend by translating the textual corpus and continuing training.
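
The corpus construction itself is not published here; as a minimal sketch of the text-stage labeling described above (only the `<eot>` token comes from this card, the helper and example dialogs are assumptions):

```python
# Hypothetical illustration of <eot>-labeled text samples (not the actual
# training pipeline): a turn that ends at a valid stopping point gets the
# explicit end-of-turn token appended; an unfinished turn does not.
EOT = "<eot>"

def render_sample(turns, finished):
    text = "\n".join(f"{t['role']}: {t['content']}" for t in turns)
    return text + (f" {EOT}" if finished else "")

finished_sample = render_sample(
    [{"role": "assistant", "content": "Hi, how are you?"},
     {"role": "user", "content": "I'm doing well, thanks."}],
    finished=True,   # valid stopping point -> append <eot>
)
unfinished_sample = render_sample(
    [{"role": "assistant", "content": "Hi, how are you?"},
     {"role": "user", "content": "I'm doing well, but"}],
    finished=False,  # trails off mid-thought -> no <eot>
)
print(finished_sample)
print(unfinished_sample)
```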
## Training Procedure
**LLM next-token post-training (text-only):**
Post-train a backbone LLM to predict `<eot>` in dialogue. This yields a distribution over likely stop points rather than brittle true/false labels.

**Audio-native fusion (Ultravox projector):**
Attach and finetune the Ultravox audio projector so the model conditions on both audio embeddings and textual context, aligning audio cues with the `<eot>` objective.

**Calibration:**
Choose a decision threshold for the `<eot>` probability to balance precision and recall for your application. Thresholds can be tuned per language/domain. We recommend starting with 0.1.
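
The released pipeline wraps this logic, but conceptually the decision reduces to reading the `<eot>` mass from the model's next-token distribution and comparing it to the calibrated threshold. A minimal sketch, assuming a Hugging Face causal-LM-style interface in which `<eot>` is a vocabulary token; the `model`/`tokenizer` objects and helper names are illustrative, not the UltraVAD API:

```python
import torch

EOT_THRESHOLD = 0.1  # recommended starting point; tune per language/domain

def eot_probability(model, tokenizer, input_ids):
    """Probability mass the model assigns to <eot> as the next token."""
    # Assumes <eot> is a single token in the tokenizer's vocabulary.
    eot_id = tokenizer.convert_tokens_to_ids("<eot>")
    with torch.no_grad():
        logits = model(input_ids).logits          # (batch, seq_len, vocab)
    next_token_probs = logits[0, -1].softmax(dim=-1)
    return next_token_probs[eot_id].item()

def is_end_of_turn(p_eot, threshold=EOT_THRESHOLD):
    # Lower thresholds favor recall (respond sooner, risk interrupting);
    # higher thresholds favor precision (wait longer before responding).
    return p_eot >= threshold
```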
## Speeds, Sizes, Times
**Forward pass latency:** ~100–200 ms for a typical UltraVAD inference.

**Deployment pattern:** Often paired with a lightweight streaming VAD (e.g., Silero VAD); UltraVAD is invoked when short silences are detected, so its latency is hidden under the TTS time-to-first-token (TTFT).
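
The card does not prescribe a specific integration; below is a rough sketch of that gating pattern, where every callable (frame reader, Silero-style silence check, UltraVAD scorer, TTS trigger) is an assumed placeholder you would wire to your own stack:

```python
import time

EOT_THRESHOLD = 0.1    # see Calibration above
MIN_SILENCE_MS = 200   # assumed length of the short pause that triggers a check

def endpointing_loop(read_frame, buffered_audio, is_silent, p_eot, respond):
    """Gate the heavier UltraVAD call behind a cheap streaming VAD.

    read_frame()      -> next 20-30 ms audio frame from the mic stream
    buffered_audio()  -> audio of the user's current turn so far
    is_silent(frame)  -> lightweight streaming VAD (e.g. Silero VAD)
    p_eot(audio)      -> UltraVAD's <eot> probability for this turn
    respond()         -> kick off the agent's TTS response
    """
    silence_start = None
    while True:
        frame = read_frame()
        if not is_silent(frame):
            silence_start = None              # speech resumed; keep listening
            continue
        silence_start = silence_start or time.monotonic()
        if (time.monotonic() - silence_start) * 1000 < MIN_SILENCE_MS:
            continue
        # Short silence detected: UltraVAD's ~100-200 ms forward pass runs here,
        # overlapping with the TTS time-to-first-token if the turn has ended.
        if p_eot(buffered_audio()) >= EOT_THRESHOLD:
            respond()
            silence_start = None
```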
## Evaluation
UltraVAD is evaluated on both context-dependent and single-turn datasets.

**Testing Data, Factors & Metrics**

**Contextual benchmark:** 400 held-out samples requiring dialog history (fixie-ai/turntaking-contextual-tts).

**Single-turn sets:** Smart-Turn V2’s Orpheus synthetic datasets (aggregate).

**Metrics:** Accuracy, Precision, Recall, F1, AUC for end-of-turn classification at recommended thresholds.
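
These metrics can be reproduced from per-sample `<eot>` probabilities and binary end-of-turn labels; a generic scikit-learn sketch (not the evaluation harness behind the numbers below; `y_true` and `p_eot` are placeholders):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def end_of_turn_metrics(y_true, p_eot, threshold=0.1):
    """Threshold <eot> probabilities, then score the binary decisions.
    AUC is threshold-free and uses the raw probabilities."""
    y_pred = [int(p >= threshold) for p in p_eot]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, p_eot),
    }

# Toy example with made-up labels and scores:
print(end_of_turn_metrics([1, 0, 1, 1, 0], [0.8, 0.05, 0.3, 0.6, 0.2]))
```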

**Results**

**Context-dependent turn-taking (400 held-out samples)**

| **Metric** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **Accuracy** | 77.5% | 63.0% |
| **Precision** | 69.6% | 59.8% |
| **Recall** | 97.5% | 79.0% |
| **F1-Score** | 81.3% | 68.1% |
| **AUC** | 89.6% | 70.0% |

**Single-turn datasets (Orpheus aggregate)**

| **Dataset** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **orpheus-aggregate-train** | 93.7% | N/A |
| **orpheus-aggregate-test** | N/A | 94.3% |

**Notes:**
Smart-Turn V2 test scores are reported from their paper; UltraVAD is evaluated on their train splits because the test set is unavailable. The aggregate numbers are within about one percentage point, suggesting the two are comparable.

Thresholds can be re-calibrated per deployment to trade precision vs. recall.

tags: []
---
# Model Card: UltraVAD
UltraVAD is a context-aware, audio-native endpointing model. It estimates the probability that a speaker has finished their turn in real time by fusing recent dialog text with the user’s audio. UltraVAD frames endpointing as next-token prediction of an explicit end-of-turn token `<eot>`, using the Ultravox audio projector to become audio-native.
## Model Details
- **Developer:** Ultravox.ai
- **Type:** Context-aware audio–text fusion endpointing
- **Backbone:** Llama-8B (post-trained)
- **Languages (26):** ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi
**What it predicts.** UltraVAD computes a calibrated probability
`P(<eot> | context, user_audio)`
and emits an end-of-turn decision using a configurable threshold. This captures both **semantics** (from text) and **paralinguistic cues** (from audio: pauses, intonation, pitch).
## Sources
- **Website/Repo:** https://ultravox.ai
- **Demo:** https://demo.ultravox.ai/
- **Benchmark:** https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts
## Usage
Use UltraVAD as a turn-taking oracle in voice agents. Run it alongside a lightweight streaming VAD; when short silences are detected, call UltraVAD and trigger your agent’s response once the `<eot>` probability crosses your threshold (see **Calibration**).
```python
# pip install transformers peft librosa

import transformers
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True)
path = "<path-to-input-audio>"
audio, sr = librosa.load(path, sr=16000)
turns = [
{"role": "assistant", "content": "Hi, how are you?"},
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```