patricklifixie committed
Commit ca1966e · verified · 1 Parent(s): 05a5a68

Update README.md

Files changed (1)
  1. README.md +16 -102
README.md CHANGED
@@ -3,44 +3,31 @@ library_name: transformers
  tags: []
  ---

- # Model Card for Model ID
-
- Ultravad (UltraVAD) is a context-aware, audio-native neural endpointing model. It estimates the probability that a speaker has finished their turn in real time by fusing dialog context (text) with the user's speech (audio). UltraVAD frames endpointing as next-token prediction of an explicit end-of-turn token <eot>, and uses the Ultravox audio projector to become audio-native.


  ## Model Details

- ### Model Description
-
- UltraVAD consumes both audio and text: recent conversation turns (text) plus the current user's audio. It predicts a calibrated probability for <eot>, i.e. P(<eot> | context, user audio), and returns an end-of-turn decision using a configurable threshold. This design captures semantic intent (from text) and paralinguistic cues (from audio) such as pauses, intonation, and pitch.
-
- - **Developed by:** Ultravox.ai
- - **Model type:** Context-aware audio-text fusion endpointing model
- - **Language(s):** ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi
- - **Finetuned from model:** Llama-8b

- ### Model Sources [optional]

- <!-- Provide the basic links for the model. -->

- - **Repository:** https://ultravox.ai
- - **Demo [optional]:** https://demo.ultravox.ai/
  - **Benchmark:** https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts

  ## Usage

- Think of UltraVAD as a turn-taking oracle for voice agents. It is best run whenever a pause is detected during user speech: when its <eot> probability crosses a threshold, you trigger your agent's response. It also helps with short replies, numerical answers, and self-repairs, where semantics and prosody determine whether the user is actually done.

- To use the model, try the following:
  ```python
  # pip install transformers peft librosa

@@ -50,84 +37,11 @@ import librosa

  pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True)

- path = "<path-to-input-audio>" # TODO: pass the audio here
  audio, sr = librosa.load(path, sr=16000)

-
  turns = [
-     {
-         "role": "assistant",
-         "content": "Hi how are you?"
-     },
  ]
- pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
- ```
-
-
- ### Training Details
-
- UltraVAD is trained in two stages to be context-aware and audio-native.
-
- Training Data
-
- Text stage: Synthetic conversational corpora with inserted <eot> tokens after valid stopping points, covering multiple languages via translation augmentation.
-
- Audio stage: The Ultravox audio projector initialized from Ultravox training (robust to real-world noise, device/mic variance, overlapping speech). Additional single-turn samples emphasize prosodic cues (intonation, pitch, drawn-out syllables).
-
- Languages: 26 supported; easy to extend by translating the textual corpus and continuing training.
-
- ## Training Procedure
-
- LLM next-token post-training (text-only):
- Post-train a backbone LLM to predict <eot> in dialogue. This yields a distribution over likely stop points rather than brittle true/false labels.
-
- Audio-native fusion (Ultravox projector):
- Attach and finetune the Ultravox audio projector so the model conditions on both audio embeddings and textual context, aligning audio cues with the <eot> objective.
-
- Calibration:
- Choose a decision threshold for <eot> probability to balance precision/recall for your application. Thresholds can be tuned per language/domain. We recommend starting with 0.1.
-
- ## Speeds, Sizes, Times
-
- Forward pass latency: ~100–200 ms for a typical UltraVAD inference.
-
- Deployment pattern: Often paired with a lightweight streaming VAD (e.g. silero VAD); UltraVAD is invoked when short silences are detected, so its latency is hidden under TTS TTFT.
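
To make this deployment pattern concrete, here is a minimal sketch (editor's illustration, not from the model card): a lightweight streaming VAD decides *when* to call UltraVAD, and the recommended 0.1 starting threshold turns the `<eot>` probability into a decision. The pipeline's output format is not documented here, so `eot_probability` is a hypothetical stub you would adapt.

```python
# Minimal sketch of the silence-gated deployment pattern described above.
# Assumptions (not from the card): a streaming VAD such as Silero VAD decides
# when to call this function, and `eot_probability` is a hypothetical stub,
# since the card does not document the pipeline's output format.
import transformers

pipe = transformers.pipeline(model="fixie-ai/ultraVAD", trust_remote_code=True)
EOT_THRESHOLD = 0.1  # recommended starting point; tune per language/domain


def eot_probability(pipeline_output) -> float:
    """Hypothetical helper: adapt to return P(<eot>) from the actual output."""
    raise NotImplementedError


def user_turn_finished(turns, audio, sr=16000, threshold=EOT_THRESHOLD) -> bool:
    """Run when the streaming VAD reports a short pause in the user's speech."""
    out = pipe({"audio": audio, "turns": turns, "sampling_rate": sr},
               max_new_tokens=30)
    return eot_probability(out) >= threshold
```
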
-
- ## Evaluation
-
- UltraVAD is evaluated on both context-dependent and single-turn datasets.
-
- Testing Data, Factors & Metrics
-
- Contextual benchmark: 400 held-out samples requiring dialog history (fixie-ai/turntaking-contextual-tts).
-
- Single-turn sets: Smart-Turn V2's Orpheus synthetic datasets (aggregate).
-
- Metrics: Accuracy, Precision, Recall, F1, AUC for end-of-turn classification at recommended thresholds.
-
- Results
-
- # Context-dependent turn-taking (400 held-out samples)
-
- | **Metric** | **UltraVAD** | **Smart-Turn V2** |
- |---|---:|---:|
- | **Accuracy** | 77.5% | 63.0% |
- | **Precision** | 69.6% | 59.8% |
- | **Recall** | 97.5% | 79.0% |
- | **F1-Score** | 81.3% | 68.1% |
- | **AUC** | 89.6% | 70.0% |
-
-
- # Single-turn datasets (Orpheus aggregate)
-
- | **Dataset** | **UltraVAD** | **Smart-Turn V2** |
- |---|---:|---:|
- | **orpheus-aggregate-train** | 93.7% | N/A |
- | **orpheus-aggregate-test** | N/A | 94.3% |
-
- Notes:
-
- Smart-Turn V2 test scores are reported from their paper; UltraVAD is evaluated on their train splits for comparison because the test sets are unavailable. The aggregate numbers are within about one percentage point, suggesting comparability.
-
- Thresholds can be re-calibrated per deployment to trade precision vs. recall.
 
 
 
  tags: []
  ---

+ # Model Card: UltraVAD

+ UltraVAD is a context-aware, audio-native endpointing model. It estimates the probability that a speaker has finished their turn in real time by fusing recent dialog text with the user's audio. UltraVAD frames endpointing as next-token prediction of an explicit end-of-turn token `<eot>`, using the Ultravox audio projector to become audio-native.

  ## Model Details

+ - **Developer:** Ultravox.ai
+ - **Type:** Context-aware audio–text fusion endpointing
+ - **Backbone:** Llama-8B (post-trained)
+ - **Languages (26):** ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi

+ **What it predicts.** UltraVAD computes a calibrated probability `P(<eot> | context, user_audio)` and emits an end-of-turn decision using a configurable threshold. This captures both **semantics** (from text) and **paralinguistic cues** (from audio: pauses, intonation, pitch).

+ ## Sources

+ - **Website/Repo:** https://ultravox.ai
+ - **Demo:** https://demo.ultravox.ai/
  - **Benchmark:** https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts

  ## Usage

+ Use UltraVAD as a turn-taking oracle in voice agents. Run it alongside a lightweight streaming VAD; when short silences are detected, call UltraVAD and trigger your agent's response once the `<eot>` probability crosses your threshold (see **Calibration**).

  ```python
  # pip install transformers peft librosa


  pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True)

+ path = "<path-to-input-audio>"
  audio, sr = librosa.load(path, sr=16000)

  turns = [
+     {"role": "assistant", "content": "Hi, how are you?"},
  ]

+ pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
+ ```
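
The example ends with the raw pipeline call. Below is a hedged sketch of turning that result into an end-of-turn decision; the output format is not documented on this card, so the extraction step is a labeled assumption, and the 0.1 threshold is the starting point suggested in the earlier calibration guidance.

```python
# Hedged continuation of the Usage example (assumptions, not card-documented):
# `extract_eot_probability` is a hypothetical stub; inspect the pipeline output
# and adapt it to return P(<eot>) as a float in [0, 1].
def extract_eot_probability(result) -> float:
    raise NotImplementedError("return P(<eot>) from the pipeline output")


EOT_THRESHOLD = 0.1  # starting point; raise for precision, lower for recall

result = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
if extract_eot_probability(result) >= EOT_THRESHOLD:
    print("End of turn: trigger the agent's response.")
else:
    print("Likely a mid-turn pause: keep listening.")
```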