]

pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```

## Training

**Text-only post-training (LLM):** post-train the backbone to predict `<eot>` in dialog, yielding a probability over likely stop points rather than brittle binary labels.

Data: synthetic conversational corpora with inserted `<eot>` tokens, translation-augmented across 26 languages.

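As an illustration of this stage, inserting `<eot>` labels into synthetic dialog could look like the sketch below; the turn format and helper function are hypothetical, not the actual UltraVAD data pipeline:

```python
# Hypothetical sketch of building <eot>-labeled training text.
# The turn schema and formatting are assumptions for illustration only.
EOT = "<eot>"

def build_training_text(turns):
    """Append an end-of-turn token after each completed utterance so the
    LM learns to place probability mass on <eot> at valid stop points."""
    pieces = []
    for turn in turns:
        pieces.append(f"{turn['role']}: {turn['content']}{EOT}")
    return "\n".join(pieces)

turns = [
    {"role": "user", "content": "Could you book the flight for Tuesday?"},
    {"role": "assistant", "content": "Sure, which airport do you prefer?"},
]
print(build_training_text(turns))
```

Translation augmentation would then apply the same insertion to each of the 26 language variants.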
**Audio-native fusion (Ultravox projector):** attach and fine-tune the Ultravox audio projector so the model conditions jointly on audio embeddings and text, aligning prosodic cues with the `<eot>` objective.

Data: mixed-condition audio covering real-world noise, device/mic variance, and overlapping speech; extra single-turn samples emphasize intonation, pitch, and drawn-out syllables.

**Calibration:** choose a decision threshold to balance precision vs. recall per language/domain. Recommended starting threshold: 0.1.

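A minimal sketch of applying the calibrated threshold, assuming the application already has the model's `<eot>` probability for the current utterance (the helper name is illustrative):

```python
# Illustrative thresholding of the model's <eot> probability.
# `eot_prob` would come from the model's next-token distribution.
RECOMMENDED_THRESHOLD = 0.1  # starting point; re-tune per language/domain

def is_end_of_turn(eot_prob, threshold=RECOMMENDED_THRESHOLD):
    """Return True when the predicted <eot> probability clears the threshold.
    Lower thresholds favor recall (fewer missed turn ends); higher
    thresholds favor precision (fewer premature interruptions)."""
    return eot_prob >= threshold

print(is_end_of_turn(0.42))  # clears the 0.1 default
print(is_end_of_turn(0.03))  # does not
```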
## Performance & Deployment

Latency (forward pass): ~100–200 ms (typical UltraVAD inference).

Common pattern: pair with a streaming VAD (e.g., Silero) and invoke UltraVAD on short silences; its latency is often hidden under TTS time-to-first-token.

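The gating pattern above might be sketched as follows; both model calls are stubs standing in for a real Silero VAD and the UltraVAD pipeline:

```python
# Illustrative gating pattern: a cheap streaming VAD flags short silences,
# and UltraVAD is invoked only on those candidates. Both functions below
# are stubs, not real Silero or UltraVAD APIs.
def streaming_vad_is_silent(frame):
    """Stub for a Silero-style frame-level VAD (True = silence)."""
    return frame["energy"] < 0.05

def ultravad_eot_prob(audio, turns):
    """Stub for an UltraVAD forward pass returning P(<eot>)."""
    return 0.42  # placeholder probability

def should_hand_over(frame, audio, turns, threshold=0.1):
    # Pay UltraVAD's ~100-200 ms forward pass only when the cheap VAD
    # already sees a pause; TTS time-to-first-token usually hides it.
    if not streaming_vad_is_silent(frame):
        return False
    return ultravad_eot_prob(audio, turns) >= threshold

print(should_hand_over({"energy": 0.01}, audio=None, turns=[]))
print(should_hand_over({"energy": 0.30}, audio=None, turns=[]))
```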
## Evaluation

UltraVAD is evaluated on both context-dependent and single-turn datasets.

### Testing Data, Factors & Metrics

Contextual benchmark: 400 held-out samples requiring dialog history (`fixie-ai/turntaking-contextual-tts`).

Single-turn sets: Smart-Turn V2's Orpheus synthetic datasets (aggregate).

Metrics: accuracy, precision, recall, F1, and AUC for end-of-turn classification at the recommended thresholds.

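For reference, the thresholded metrics reduce to confusion counts; this plain-Python sketch with toy labels shows the relationship (AUC is omitted since it needs the raw score distribution, not thresholded labels):

```python
# Accuracy, precision, recall, and F1 from confusion counts,
# with 1 = end of turn and 0 = still speaking. Labels are toy data.
def turn_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(turn_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```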
### Results

#### Context-dependent turn-taking (400 held-out samples)

| **Metric** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **Accuracy** | 77.5% | 63.0% |
| **Precision** | 69.6% | 59.8% |
| **Recall** | 97.5% | 79.0% |
| **F1-Score** | 81.3% | 68.1% |
| **AUC** | 89.6% | 70.0% |

#### Single-turn datasets (Orpheus aggregate)

| **Dataset** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **orpheus-aggregate-train** | 93.7% | N/A |
| **orpheus-aggregate-test** | N/A | 94.3% |

Notes:

Smart-Turn V2 test scores are reported from their paper; UltraVAD is evaluated on their train splits because the test sets are unavailable. The aggregate numbers are within about one percentage point, suggesting comparability.

Thresholds can be re-calibrated per deployment to trade precision against recall.
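One way to re-calibrate per deployment is a simple threshold sweep over held-out `(score, label)` pairs and inspection of the precision/recall trade-off; the scores below are toy values:

```python
# Hedged sketch of per-deployment threshold calibration: sweep candidate
# thresholds and report precision/recall at each. Data is illustrative.
def sweep_thresholds(scores, labels, thresholds):
    rows = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((t, precision, recall))
    return rows

scores = [0.05, 0.2, 0.6, 0.9, 0.15]   # toy <eot> probabilities
labels = [0, 0, 1, 1, 1]               # toy ground-truth end-of-turn labels
for t, p, r in sweep_thresholds(scores, labels, [0.1, 0.3, 0.5]):
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold raises recall (faster hand-over, more interruptions); raising it does the opposite, which is why per-language/domain tuning matters.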