Update README.md
README.md CHANGED
@@ -3,44 +3,31 @@ library_name: transformers
tags: []
---
# Model Card
UltraVAD is a context-aware, audio-native neural endpointing model. It estimates the probability that a speaker has finished their turn in real time by fusing dialog context (text) with the user's speech (audio). UltraVAD frames endpointing as next-token prediction of an explicit end-of-turn token `<eot>`, and uses the Ultravox audio projector to become audio-native.
## Model Details
The model estimates `P(<eot> | context, user audio)` and returns an end-of-turn decision using a configurable threshold. This design captures semantic intent (from text) and paralinguistic cues (from audio) such as pauses, intonation, and pitch.

- **Developed by:** Ultravox.ai
- **Model type:** Context-aware audio-text fusion endpointing model
- **Language(s):** ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi
- **Finetuned from model:** Llama-8b
- **Website/Repo:** https://ultravox.ai
- **Demo:** https://demo.ultravox.ai/
- **Benchmark:** https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts
## Usage
To use the model, try the following:
```python
# pip install transformers peft librosa

import transformers
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True)
path = "<path-to-input-audio>"
audio, sr = librosa.load(path, sr=16000)
turns = [
    {
        "role": "assistant",
        "content": "Hi, how are you?"
    },
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```
### Training Details
UltraVAD is trained in two stages to be context-aware and audio-native.

**Training Data**

**Text stage:** Synthetic conversational corpora with inserted `<eot>` tokens after valid stopping points, covering multiple languages via translation augmentation.

**Audio stage:** The Ultravox audio projector is initialized from Ultravox training (robust to real-world noise, device/mic variance, overlapping speech). Additional single-turn samples emphasize prosodic cues (intonation, pitch, drawn-out syllables).

**Languages:** 26 supported; easy to extend by translating the textual corpus and continuing training.
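
The corpus construction itself is not published here; as a minimal sketch of the text-stage labeling described above (only the `<eot>` token comes from this card, the helper and example dialogs are assumptions):

```python
# Hypothetical illustration of <eot>-labeled text samples (not the actual
# training pipeline): a turn that ends at a valid stopping point gets the
# explicit end-of-turn token appended; an unfinished turn does not.
EOT = "<eot>"

def render_sample(turns, finished):
    text = "\n".join(f"{t['role']}: {t['content']}" for t in turns)
    return text + (f" {EOT}" if finished else "")

finished_sample = render_sample(
    [{"role": "assistant", "content": "Hi, how are you?"},
     {"role": "user", "content": "I'm doing well, thanks."}],
    finished=True,   # valid stopping point -> append <eot>
)
unfinished_sample = render_sample(
    [{"role": "assistant", "content": "Hi, how are you?"},
     {"role": "user", "content": "I'm doing well, but"}],
    finished=False,  # trails off mid-thought -> no <eot>
)
print(finished_sample)
print(unfinished_sample)
```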
## Training Procedure
**LLM next-token post-training (text-only):**
Post-train a backbone LLM to predict `<eot>` in dialogue. This yields a distribution over likely stop points rather than brittle true/false labels.

**Audio-native fusion (Ultravox projector):**
Attach and finetune the Ultravox audio projector so the model conditions on both audio embeddings and textual context, aligning audio cues with the `<eot>` objective.

**Calibration:**
Choose a decision threshold for the `<eot>` probability to balance precision and recall for your application. Thresholds can be tuned per language/domain. We recommend starting with 0.1.
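
The released pipeline wraps this logic, but conceptually the decision reduces to reading the `<eot>` mass from the model's next-token distribution and comparing it to the calibrated threshold. A minimal sketch, assuming a Hugging Face causal-LM-style interface in which `<eot>` is a vocabulary token; the `model`/`tokenizer` objects and helper names are illustrative, not the UltraVAD API:

```python
import torch

EOT_THRESHOLD = 0.1  # recommended starting point; tune per language/domain

def eot_probability(model, tokenizer, input_ids):
    """Probability mass the model assigns to <eot> as the next token."""
    # Assumes <eot> is a single token in the tokenizer's vocabulary.
    eot_id = tokenizer.convert_tokens_to_ids("<eot>")
    with torch.no_grad():
        logits = model(input_ids).logits          # (batch, seq_len, vocab)
    next_token_probs = logits[0, -1].softmax(dim=-1)
    return next_token_probs[eot_id].item()

def is_end_of_turn(p_eot, threshold=EOT_THRESHOLD):
    # Lower thresholds favor recall (respond sooner, risk interrupting);
    # higher thresholds favor precision (wait longer before responding).
    return p_eot >= threshold
```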
## Speeds, Sizes, Times
**Forward pass latency:** ~100–200 ms for a typical UltraVAD inference.

**Deployment pattern:** Often paired with a lightweight streaming VAD (e.g., Silero VAD); UltraVAD is invoked when short silences are detected, so its latency is hidden under the TTS time-to-first-token (TTFT).
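
The card does not prescribe a specific integration; below is a rough sketch of that gating pattern, where every callable (frame reader, Silero-style silence check, UltraVAD scorer, TTS trigger) is an assumed placeholder you would wire to your own stack:

```python
import time

EOT_THRESHOLD = 0.1    # see Calibration above
MIN_SILENCE_MS = 200   # assumed length of the short pause that triggers a check

def endpointing_loop(read_frame, buffered_audio, is_silent, p_eot, respond):
    """Gate the heavier UltraVAD call behind a cheap streaming VAD.

    read_frame()      -> next 20-30 ms audio frame from the mic stream
    buffered_audio()  -> audio of the user's current turn so far
    is_silent(frame)  -> lightweight streaming VAD (e.g. Silero VAD)
    p_eot(audio)      -> UltraVAD's <eot> probability for this turn
    respond()         -> kick off the agent's TTS response
    """
    silence_start = None
    while True:
        frame = read_frame()
        if not is_silent(frame):
            silence_start = None              # speech resumed; keep listening
            continue
        silence_start = silence_start or time.monotonic()
        if (time.monotonic() - silence_start) * 1000 < MIN_SILENCE_MS:
            continue
        # Short silence detected: UltraVAD's ~100-200 ms forward pass runs here,
        # overlapping with the TTS time-to-first-token if the turn has ended.
        if p_eot(buffered_audio()) >= EOT_THRESHOLD:
            respond()
            silence_start = None
```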
## Evaluation
UltraVAD is evaluated on both context-dependent and single-turn datasets.

**Testing Data, Factors & Metrics**

**Contextual benchmark:** 400 held-out samples requiring dialog history (fixie-ai/turntaking-contextual-tts).

**Single-turn sets:** Smart-Turn V2’s Orpheus synthetic datasets (aggregate).

**Metrics:** Accuracy, Precision, Recall, F1, AUC for end-of-turn classification at recommended thresholds.
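
These metrics can be reproduced from per-sample `<eot>` probabilities and binary end-of-turn labels; a generic scikit-learn sketch (not the evaluation harness behind the numbers below; `y_true` and `p_eot` are placeholders):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def end_of_turn_metrics(y_true, p_eot, threshold=0.1):
    """Threshold <eot> probabilities, then score the binary decisions.
    AUC is threshold-free and uses the raw probabilities."""
    y_pred = [int(p >= threshold) for p in p_eot]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, p_eot),
    }

# Toy example with made-up labels and scores:
print(end_of_turn_metrics([1, 0, 1, 1, 0], [0.8, 0.05, 0.3, 0.6, 0.2]))
```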

**Results**

**Context-dependent turn-taking (400 held-out samples)**

| **Metric** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **Accuracy** | 77.5% | 63.0% |
| **Precision** | 69.6% | 59.8% |
| **Recall** | 97.5% | 79.0% |
| **F1-Score** | 81.3% | 68.1% |
| **AUC** | 89.6% | 70.0% |

**Single-turn datasets (Orpheus aggregate)**

| **Dataset** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **orpheus-aggregate-train** | 93.7% | N/A |
| **orpheus-aggregate-test** | N/A | 94.3% |

**Notes:**
Smart-Turn V2 test scores are reported from their paper; UltraVAD is evaluated on their train splits because the test set is unavailable. The aggregate numbers are within about one percentage point, suggesting the two are comparable.

Thresholds can be re-calibrated per deployment to trade precision vs. recall.

tags: []
---
# Model Card: UltraVAD
UltraVAD is a context-aware, audio-native endpointing model. It estimates the probability that a speaker has finished their turn in real time by fusing recent dialog text with the user’s audio. UltraVAD frames endpointing as next-token prediction of an explicit end-of-turn token `<eot>`, using the Ultravox audio projector to become audio-native.
## Model Details
- **Developer:** Ultravox.ai
- **Type:** Context-aware audio–text fusion endpointing
- **Backbone:** Llama-8B (post-trained)
- **Languages (26):** ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi
**What it predicts.** UltraVAD computes a calibrated probability
`P(<eot> | context, user_audio)`
and emits an end-of-turn decision using a configurable threshold. This captures both **semantics** (from text) and **paralinguistic cues** (from audio: pauses, intonation, pitch).
## Sources
- **Website/Repo:** https://ultravox.ai
- **Demo:** https://demo.ultravox.ai/
- **Benchmark:** https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts
## Usage
Use UltraVAD as a turn-taking oracle in voice agents. Run it alongside a lightweight streaming VAD; when short silences are detected, call UltraVAD and trigger your agent’s response once the `<eot>` probability crosses your threshold (see **Calibration**).
```python
# pip install transformers peft librosa

import transformers
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True)
path = "<path-to-input-audio>"
audio, sr = librosa.load(path, sr=16000)
turns = [
{"role": "assistant", "content": "Hi, how are you?"},
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```