---
library_name: transformers
tags:
- VAD
- audio
- transformer
- endpointing
---

# Model Card: UltraVAD

UltraVAD is a context-aware, audio-native endpointing model. It estimates, in real time, the probability that a speaker has finished their turn by fusing recent dialog text with the user's audio: given the dialog history and the last user audio turn, it produces a probability for the end-of-turn token `<|eot_id|>`.

## Model Details

- **Developer:** Ultravox.ai  
- **Type:** Context-aware audio–text fusion endpointing  
- **Backbone:** Llama-8B (post-trained)  
- **Languages (26):** ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi

**What it predicts.** UltraVAD computes the probability `P(<|eot_id|> | context, user_audio)`: how likely it is that the user's turn is complete given the dialog so far and the last user audio turn.

## Sources

- **Website/Repo:** https://ultravox.ai  
- **Demo:** https://demo.ultravox.ai/  
- **Benchmark:** https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts

## Usage

Use UltraVAD as a turn-taking oracle in voice agents. Run it alongside a lightweight streaming VAD; when a short silence is detected, call UltraVAD and trigger your agent's response once the `<|eot_id|>` probability crosses your threshold.

```python
import transformers
import torch
import librosa
import os

pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True, device="cpu")

sr = 16000
wav_path = os.path.join(os.path.dirname(__file__), "sample.wav")
audio, sr = librosa.load(wav_path, sr=sr)

turns = [
  {"role": "assistant", "content": "Hi, how are you?"},
]

# Build model inputs via pipeline preprocess
inputs = {"audio": audio, "turns": turns, "sampling_rate": sr}
model_inputs = pipe.preprocess(inputs)

# Move tensors to model device
device = next(pipe.model.parameters()).device
model_inputs = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in model_inputs.items()}

# Forward pass (no generation); calling the module is equivalent to
# .forward() but also runs any registered hooks
with torch.inference_mode():
  output = pipe.model(**model_inputs, return_dict=True)

# Compute last-audio token position
logits = output.logits  # (1, seq_len, vocab)
audio_pos = int(
  model_inputs["audio_token_start_idx"].item() +
  model_inputs["audio_token_len"].item() - 1
)

# Resolve <|eot_id|> token id and compute probability at last-audio index
token_id = pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
if token_id is None or token_id == pipe.tokenizer.unk_token_id:
  raise RuntimeError("<|eot_id|> not found in tokenizer.")

audio_logits = logits[0, audio_pos, :]
audio_probs = torch.softmax(audio_logits.float(), dim=-1)
eot_prob_audio = audio_probs[token_id].item()
print(f"P(<|eot_id|>) = {eot_prob_audio:.6f}")
threshold = 0.1
if eot_prob_audio > threshold:
  print("Is End of Turn")
else:
  print("Is Not End of Turn")
```
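
Note that the probability is read at the position of the last audio token (`audio_token_start_idx + audio_token_len - 1`): the model's next-token distribution there reflects whether the user's turn is complete, so a single forward pass suffices and no generation step is needed.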

## Training

**Text-only post-training (LLM).** Post-train the backbone to predict `<eot>` in dialog, yielding a probability over likely stop points rather than brittle binary labels. Data: synthetic conversational corpora with inserted `<eot>` tokens, translation-augmented across the 26 supported languages. A hypothetical illustration of this idea follows.
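
As an illustration only (the exact corpus construction is not published), a text-only sample might insert `<eot>` at plausible stop points so the model learns graded end-of-turn probabilities; the `dialog` structure below is hypothetical:

```python
# Hypothetical text-only post-training sample, assuming <eot> marks
# plausible stop points in the user's turn. Not the actual data format.
dialog = [
    {"role": "assistant", "content": "Hi, how are you?"},
    # "Pretty good" is a plausible stop point, but the turn continues,
    # so the target probability there is lower than at the true end.
    {"role": "user", "content": "Pretty good<eot> I just got back from a trip.<eot>"},
]
```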

**Audio-native fusion (Ultravox projector).** Attach and fine-tune the Ultravox audio projector so the model conditions jointly on audio embeddings and text, aligning prosodic cues with the `<eot>` objective. Data: real-world audio covering background noise, device/mic variance, and overlapping speech, so the model stays robust under deployment conditions.

**Calibration.** Choose a decision threshold to balance precision vs. recall per language/domain. Recommended starting threshold: 0.1. Raise the threshold if you find the model interrupting too eagerly, and lower it if you find the model not responding when it's supposed to. A sketch of a threshold sweep follows.
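
A minimal sketch of picking a threshold on held-out data, assuming you have per-example `P(<|eot_id|>)` scores (e.g., from the Usage code above) and binary end-of-turn labels; the synthetic `scores`/`labels` here are placeholders:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder data; substitute real P(<|eot_id|>) scores and labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)                       # 1 = turn ended
scores = np.clip(0.5 * labels + rng.normal(0.2, 0.2, 500), 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(labels, scores)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-9)
best = int(np.argmax(f1[:-1]))  # the final P/R point has no threshold
print(f"threshold={thresholds[best]:.3f} "
      f"precision={precision[best]:.3f} recall={recall[best]:.3f}")
```

Sweep per language and domain if your traffic is mixed; the optimal operating point can differ across them.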

## Performance & Deployment

**Latency (forward pass):** ~65–110 ms on an A6000.

**Common pattern:** Pair with a streaming VAD (e.g., Silero). Invoke UltraVAD on short silences; its latency is often hidden under the TTS time-to-first-token. A sketch of this gating pattern follows.
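
A minimal sketch of the gating loop, assuming Silero VAD from `torch.hub` (512-sample frames at 16 kHz) and two hypothetical helpers: `eot_probability`, wrapping the forward pass from the Usage example, and `trigger_agent_response`, your agent's reply hook:

```python
import torch

# Silero VAD takes 512-sample chunks at 16 kHz, returns speech probability.
silero, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")

FRAME_MS = 512 / 16000 * 1000   # 32 ms per frame
SILENCE_MS = 200                # silence run before consulting UltraVAD
THRESHOLD = 0.1                 # starting point; see Calibration

silence_frames = 0

def on_audio_frame(frame: torch.Tensor, buffered_audio, turns):
    """Call once per 512-sample frame of the user's audio stream."""
    global silence_frames
    speech_prob = silero(frame, 16000).item()
    silence_frames = 0 if speech_prob > 0.5 else silence_frames + 1
    if silence_frames * FRAME_MS >= SILENCE_MS:
        # Only now pay for the heavier UltraVAD forward pass.
        if eot_probability(buffered_audio, turns) > THRESHOLD:  # hypothetical helper
            trigger_agent_response()                            # hypothetical helper
        silence_frames = 0  # don't re-query on every frame of the same pause
```

This keeps the cheap VAD on the hot path and reserves UltraVAD's forward pass for candidate pauses, which is why its latency typically hides under TTS time-to-first-token.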

## Evaluation

UltraVAD is evaluated on both context-dependent and single-turn datasets.

**Contextual benchmark:** 400 held-out samples requiring dialog history ([fixie-ai/turntaking-contextual-tts](https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts)).

**Single-turn sets:** Smart-Turn V2's Orpheus synthetic datasets (aggregate).

### Results

**Context-dependent turn-taking (400 held-out samples)**
| **Metric** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **Accuracy** | 77.5% | 63.0% |
| **Precision** | 69.6% | 59.8% |
| **Recall** | 97.5% | 79.0% |
| **F1-Score** | 81.3% | 68.1% |
| **AUC** | 89.6% | 70.0% |


**Single-turn datasets (Orpheus aggregate)**
| **Dataset** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **orpheus-aggregate-test** | 93.7% | 94.3% |