fixie-ai/ultraVAD

@@ -85,7 +85,7 @@ Data: Synthetic conversational corpora with inserted <eot> tokens, translation-a
 Audio-native fusion (Ultravox projector):
 Attach and fine-tune the Ultravox audio projector so the model conditions jointly on audio embeddings and text, aligning prosodic cues with the <eot> objective.
-Data: Robust to real-world noise, device/mic variance, overlapping speech; extra single-turn samples emphasize intonation, pitch, drawn-out syllables.
+Data: Robust to real-world noise, device/mic variance, overlapping speech.
 Calibration:
 Choose a decision threshold to balance precision vs. recall per language/domain. Recommended starting threshold: 0.1.
@@ -93,7 +93,7 @@ Raise the threshold if you find the model interrupting too eagerly, and lower th
 ## Performance & Deployment
-Latency (forward pass): ~30-100 ms on an H100.
+Latency (forward pass): ~65-110 ms on an A6000.
 Common pattern: Pair with a streaming VAD (e.g., Silero). Invoke UltraVAD on short silences; its latency is often hidden under TTS time-to-first-token.
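
Reviewer note: the Calibration context lines in this diff describe picking a decision threshold per language/domain, starting from 0.1. A minimal sketch of that sweep follows, assuming you already have per-utterance end-of-turn probabilities and ground-truth labels from your own evaluation set. The names `is_end_of_turn`, `sweep_thresholds`, and `Metrics` are hypothetical and not part of UltraVAD; the `>=` comparison is also an assumption about how the threshold is applied.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Metrics:
    threshold: float
    precision: float
    recall: float


def is_end_of_turn(eot_prob: float, threshold: float = 0.1) -> bool:
    """Hypothetical decision rule: treat the turn as finished once the
    end-of-turn probability reaches the threshold (0.1 is the starting
    value recommended in the README)."""
    return eot_prob >= threshold


def sweep_thresholds(
    probs: Sequence[float],
    labels: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Metrics]:
    """Compute precision and recall of the end-of-turn decision at each
    candidate threshold. `probs` and `labels` come from your own labeled
    data; nothing here calls the model itself."""
    results = []
    for t in thresholds:
        preds = [is_end_of_turn(p, t) for p in probs]
        tp = sum(1 for pred, y in zip(preds, labels) if pred and y)
        fp = sum(1 for pred, y in zip(preds, labels) if pred and not y)
        fn = sum(1 for pred, y in zip(preds, labels) if not pred and y)
        # Convention: report 1.0 when the denominator is empty.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 1.0
        results.append(Metrics(t, precision, recall))
    return results
```

Raising the threshold trades recall for precision, which corresponds to the README's advice to raise it when the model interrupts too eagerly.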