]

pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```

## Training

**Text-only post-training (LLM):** post-train the backbone to predict `<eot>` in dialog, yielding a probability over likely stop points rather than brittle binary labels.

Data: synthetic conversational corpora with inserted `<eot>` tokens, translation-augmented across 26 languages.

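As an illustration of this stage, inserting `<eot>` labels into synthetic dialog could look like the sketch below; the turn format and helper function are hypothetical, not the actual UltraVAD data pipeline:

```python
# Hypothetical sketch of building <eot>-labeled training text.
# The turn schema and formatting are assumptions for illustration only.
EOT = "<eot>"

def build_training_text(turns):
    """Append an end-of-turn token after each completed utterance so the
    LM learns to place probability mass on <eot> at valid stop points."""
    pieces = []
    for turn in turns:
        pieces.append(f"{turn['role']}: {turn['content']}{EOT}")
    return "\n".join(pieces)

turns = [
    {"role": "user", "content": "Could you book the flight for Tuesday?"},
    {"role": "assistant", "content": "Sure, which airport do you prefer?"},
]
print(build_training_text(turns))
```

Translation augmentation would then apply the same insertion to each of the 26 language variants.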
**Audio-native fusion (Ultravox projector):** attach and fine-tune the Ultravox audio projector so the model conditions jointly on audio embeddings and text, aligning prosodic cues with the `<eot>` objective.

Data: mixed-condition audio covering real-world noise, device/mic variance, and overlapping speech; extra single-turn samples emphasize intonation, pitch, and drawn-out syllables.

**Calibration:** choose a decision threshold to balance precision vs. recall per language/domain. Recommended starting threshold: 0.1.

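A minimal sketch of applying the calibrated threshold, assuming the application already has the model's `<eot>` probability for the current utterance (the helper name is illustrative):

```python
# Illustrative thresholding of the model's <eot> probability.
# `eot_prob` would come from the model's next-token distribution.
RECOMMENDED_THRESHOLD = 0.1  # starting point; re-tune per language/domain

def is_end_of_turn(eot_prob, threshold=RECOMMENDED_THRESHOLD):
    """Return True when the predicted <eot> probability clears the threshold.
    Lower thresholds favor recall (fewer missed turn ends); higher
    thresholds favor precision (fewer premature interruptions)."""
    return eot_prob >= threshold

print(is_end_of_turn(0.42))  # clears the 0.1 default
print(is_end_of_turn(0.03))  # does not
```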
## Performance & Deployment

Latency (forward pass): ~100–200 ms (typical UltraVAD inference).

Common pattern: pair with a streaming VAD (e.g., Silero) and invoke UltraVAD on short silences; its latency is often hidden under TTS time-to-first-token.

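The gating pattern above might be sketched as follows; both model calls are stubs standing in for a real Silero VAD and the UltraVAD pipeline:

```python
# Illustrative gating pattern: a cheap streaming VAD flags short silences,
# and UltraVAD is invoked only on those candidates. Both functions below
# are stubs, not real Silero or UltraVAD APIs.
def streaming_vad_is_silent(frame):
    """Stub for a Silero-style frame-level VAD (True = silence)."""
    return frame["energy"] < 0.05

def ultravad_eot_prob(audio, turns):
    """Stub for an UltraVAD forward pass returning P(<eot>)."""
    return 0.42  # placeholder probability

def should_hand_over(frame, audio, turns, threshold=0.1):
    # Pay UltraVAD's ~100-200 ms forward pass only when the cheap VAD
    # already sees a pause; TTS time-to-first-token usually hides it.
    if not streaming_vad_is_silent(frame):
        return False
    return ultravad_eot_prob(audio, turns) >= threshold

print(should_hand_over({"energy": 0.01}, audio=None, turns=[]))
print(should_hand_over({"energy": 0.30}, audio=None, turns=[]))
```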
## Evaluation

UltraVAD is evaluated on both context-dependent and single-turn datasets.

### Testing Data, Factors & Metrics

Contextual benchmark: 400 held-out samples requiring dialog history (`fixie-ai/turntaking-contextual-tts`).

Single-turn sets: Smart-Turn V2's Orpheus synthetic datasets (aggregate).

Metrics: accuracy, precision, recall, F1, and AUC for end-of-turn classification at the recommended thresholds.

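For reference, the thresholded metrics reduce to confusion counts; this plain-Python sketch with toy labels shows the relationship (AUC is omitted since it needs the raw score distribution, not thresholded labels):

```python
# Accuracy, precision, recall, and F1 from confusion counts,
# with 1 = end of turn and 0 = still speaking. Labels are toy data.
def turn_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(turn_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```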
### Results

#### Context-dependent turn-taking (400 held-out samples)

| **Metric** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **Accuracy** | 77.5% | 63.0% |
| **Precision** | 69.6% | 59.8% |
| **Recall** | 97.5% | 79.0% |
| **F1-Score** | 81.3% | 68.1% |
| **AUC** | 89.6% | 70.0% |

#### Single-turn datasets (Orpheus aggregate)

| **Dataset** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **orpheus-aggregate-train** | 93.7% | N/A |
| **orpheus-aggregate-test** | N/A | 94.3% |

Notes:

Smart-Turn V2 test scores are reported from their paper; UltraVAD is evaluated on their train splits because the test sets are unavailable. The aggregate numbers are within about one percentage point, suggesting comparability.

Thresholds can be re-calibrated per deployment to trade precision against recall.
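One way to re-calibrate per deployment is a simple threshold sweep over held-out `(score, label)` pairs and inspection of the precision/recall trade-off; the scores below are toy values:

```python
# Hedged sketch of per-deployment threshold calibration: sweep candidate
# thresholds and report precision/recall at each. Data is illustrative.
def sweep_thresholds(scores, labels, thresholds):
    rows = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((t, precision, recall))
    return rows

scores = [0.05, 0.2, 0.6, 0.9, 0.15]   # toy <eot> probabilities
labels = [0, 0, 1, 1, 1]               # toy ground-truth end-of-turn labels
for t, p, r in sweep_thresholds(scores, labels, [0.1, 0.3, 0.5]):
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold raises recall (faster hand-over, more interruptions); raising it does the opposite, which is why per-language/domain tuning matters.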