---
library_name: transformers
tags:
- VAD
- audio
- transformer
- endpointing
---
# Model Card: UltraVAD
UltraVAD is a context-aware, audio-native endpointing model. It estimates the probability that a speaker has finished their turn in real time by fusing recent dialog text with the user’s audio. UltraVAD consumes the dialogue history and the last user audio turn, then produces a probability for the end-of-turn token `<|eot_id|>`.
## Model Details
- **Developer:** Ultravox.ai
- **Type:** Context-aware audio–text fusion endpointing
- **Backbone:** Llama-8B (post-trained)
- **Languages (26):** ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi
**What it predicts.** UltraVAD computes `P(<|eot_id|> | context, user_audio)`, the probability that the user's turn is complete.
## Sources
- **Website/Repo:** https://ultravox.ai
- **Demo:** https://demo.ultravox.ai/
- **Benchmark:** https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts
## Usage
Use UltraVAD as a turn-taking oracle in voice agents. Run it alongside a lightweight streaming VAD; when short silences are detected, call UltraVAD and trigger your agent’s response once the `<eot>` probability crosses your threshold.
```python
import transformers
import torch
import librosa
import os
pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True, device="cpu")
sr = 16000
wav_path = os.path.join(os.path.dirname(__file__), "sample.wav")
audio, sr = librosa.load(wav_path, sr=sr)
turns = [
    {"role": "assistant", "content": "Hi, how are you?"},
]

# Build model inputs via pipeline preprocess
inputs = {"audio": audio, "turns": turns, "sampling_rate": sr}
model_inputs = pipe.preprocess(inputs)

# Move tensors to model device
device = next(pipe.model.parameters()).device
model_inputs = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in model_inputs.items()}

# Forward pass (no generation)
with torch.inference_mode():
    output = pipe.model.forward(**model_inputs, return_dict=True)

# Compute last-audio token position
logits = output.logits  # (1, seq_len, vocab)
audio_pos = int(
    model_inputs["audio_token_start_idx"].item() +
    model_inputs["audio_token_len"].item() - 1
)

# Resolve <|eot_id|> token id and compute probability at last-audio index
token_id = pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
if token_id is None or token_id == pipe.tokenizer.unk_token_id:
    raise RuntimeError("<|eot_id|> not found in tokenizer.")

audio_logits = logits[0, audio_pos, :]
audio_probs = torch.softmax(audio_logits.float(), dim=-1)
eot_prob_audio = audio_probs[token_id].item()
print(f"P(<|eot_id|>) = {eot_prob_audio:.6f}")

threshold = 0.1
if eot_prob_audio > threshold:
    print("Is End of Turn")
else:
    print("Is Not End of Turn")
```
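For use inside a voice agent loop, the same computation can be wrapped in a small helper. This is a sketch that reuses the `pipe` object loaded above; the name `predict_eot_prob` is introduced here for illustration and is not part of the pipeline API.
```python
import torch

def predict_eot_prob(audio, turns, sampling_rate=16000):
    """Return P(<|eot_id|>) for one user audio turn given the dialog history."""
    model_inputs = pipe.preprocess({"audio": audio, "turns": turns, "sampling_rate": sampling_rate})
    device = next(pipe.model.parameters()).device
    model_inputs = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in model_inputs.items()}
    with torch.inference_mode():
        output = pipe.model.forward(**model_inputs, return_dict=True)
    # Read the next-token distribution at the last audio token position
    audio_pos = int(model_inputs["audio_token_start_idx"].item() + model_inputs["audio_token_len"].item() - 1)
    probs = torch.softmax(output.logits[0, audio_pos, :].float(), dim=-1)
    return probs[pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")].item()
```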
## Training
**Text-only post-training (LLM):**
- Post-train the backbone to predict `<eot>` in dialog, yielding a probability over likely stop points rather than brittle binary labels.
- Data: synthetic conversational corpora with inserted `<eot>` tokens, translation-augmented across 26 languages.

**Audio-native fusion (Ultravox projector):**
- Attach and fine-tune the Ultravox audio projector so the model conditions jointly on audio embeddings and text, aligning prosodic cues with the `<eot>` objective.
- Data: audio covering real-world noise, device/mic variance, and overlapping speech.

**Calibration:**
- Choose a decision threshold to balance precision vs. recall per language/domain. Recommended starting threshold: 0.1.
- Raise the threshold if the model interrupts too eagerly; lower it if the model fails to respond when it should. A threshold-sweep sketch follows the list.
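To pick a threshold for your own language or domain, a simple sweep over a labeled validation set works well. A minimal sketch, assuming `probs` holds `P(<|eot_id|>)` scores computed as in the Usage section and `labels` holds your own ground-truth end-of-turn labels (the values below are placeholders):
```python
import numpy as np

# Placeholder data: replace with scores from UltraVAD and your own labels.
probs = np.array([0.02, 0.45, 0.88, 0.07, 0.63])
labels = np.array([0, 1, 1, 0, 1])  # 1 = true end of turn, 0 = not finished

best_t, best_f1 = 0.1, 0.0
for t in np.linspace(0.01, 0.90, 90):
    preds = probs > t
    tp = int(np.sum(preds & (labels == 1)))
    fp = int(np.sum(preds & (labels == 0)))
    fn = int(np.sum(~preds & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    if f1 > best_f1:
        best_t, best_f1 = t, f1

print(f"suggested threshold = {best_t:.2f} (F1 = {best_f1:.3f})")
```
Optimizing F1 is only one choice; weight precision more heavily if interruptions are costlier than delayed responses in your application.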
## Performance & Deployment
- **Latency (forward pass):** ~65-110 ms on an A6000.
- **Common pattern:** Pair with a streaming VAD (e.g., Silero). Invoke UltraVAD on short silences; its latency is often hidden under TTS time-to-first-token. A sketch of this pattern follows the list.
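A minimal sketch of that pattern, assuming you supply the audio capture, the streaming VAD, and the response trigger yourself (`get_audio_chunk`, `is_silence`, and `on_end_of_turn` are placeholders; `predict_eot_prob` is assumed to be an UltraVAD scorer with the current dialog turns bound, e.g. built from the helper in the Usage section):
```python
import time
from typing import Callable, List

import numpy as np

def run_endpointer(
    get_audio_chunk: Callable[[], np.ndarray],        # latest mic audio, 16 kHz mono (placeholder)
    is_silence: Callable[[np.ndarray], bool],         # lightweight streaming VAD, e.g. Silero (placeholder)
    predict_eot_prob: Callable[[np.ndarray], float],  # UltraVAD scorer with dialog turns bound
    on_end_of_turn: Callable[[], None],               # trigger the agent's response (placeholder)
    threshold: float = 0.1,                           # recommended starting threshold
    min_silence_s: float = 0.2,                       # pause length before consulting UltraVAD
) -> None:
    buffer: List[np.ndarray] = []
    silence_started = None
    while True:
        chunk = get_audio_chunk()
        buffer.append(chunk)
        if is_silence(chunk):
            silence_started = silence_started or time.monotonic()
            if time.monotonic() - silence_started >= min_silence_s:
                # Only now pay for the heavier UltraVAD forward pass.
                if predict_eot_prob(np.concatenate(buffer)) > threshold:
                    on_end_of_turn()
                    buffer, silence_started = [], None
        else:
            silence_started = None
```
The cheap streaming VAD gates how often the heavier model runs, which is why UltraVAD's forward-pass latency is usually hidden under the TTS time-to-first-token.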
## Evaluation
UltraVAD is evaluated on both context-dependent and single-turn datasets.
- **Contextual benchmark:** 400 held-out samples requiring dialog history (`fixie-ai/turntaking-contextual-tts`).
- **Single-turn sets:** Smart-Turn V2's Orpheus synthetic datasets (aggregate).

### Results

**Context-dependent turn-taking (400 held-out samples)**
| **Metric** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **Accuracy** | 77.5% | 63.0% |
| **Precision** | 69.6% | 59.8% |
| **Recall** | 97.5% | 79.0% |
| **F1-Score** | 81.3% | 68.1% |
| **AUC** | 89.6% | 70.0% |
**Single-turn datasets (Orpheus aggregate)**
| **Dataset** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **orpheus-aggregate-test** | 93.7% | 94.3% |