alexwengg committed on
Commit c5d20e7 · verified · 1 Parent(s): eed79f4

Create README.md

Files changed (1)
  1. README.md +196 -0
README.md ADDED
@@ -0,0 +1,196 @@
1
+ ---
2
+ license: cc-by-4.0
3
+ track_downloads: true
4
+ language:
5
+ - en
6
+ - es
7
+ - fr
8
+ - de
9
+ - bg
10
+ - hr
11
+ - cs
12
+ - da
13
+ - nl
14
+ - et
15
+ - fi
16
+ - el
17
+ - hu
18
+ - it
19
+ - lv
20
+ - lt
21
+ - mt
22
+ - pl
23
+ - pt
24
+ - ro
25
+ - ru
26
+ - sk
27
+ - sl
28
+ - sv
29
+ - uk
30
+ pipeline_tag: automatic-speech-recognition
31
+ library_name: openvino
32
+ datasets:
33
+ - nvidia/Granary
34
+ - nemo/asr-set-3.0
35
+ thumbnail: null
36
+ tags:
37
+ - automatic-speech-recognition
38
+ - speech
39
+ - audio
40
+ - Transducer
41
+ - TDT
42
+ - FastConformer
43
+ - Conformer
44
+ - NeMo
45
+ - OpenVINO
46
+ - Intel NPU
47
+ - hf-asr-leaderboard
48
+ widget:
49
+ - example_title: Librispeech sample 1
50
+ src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
51
+ - example_title: Librispeech sample 2
52
+ src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
53
+ base_model:
54
+ - nvidia/parakeet-tdt_1.1b-v3
55
+ ---
56
+
57
+ # **<span style="color:#5DAF8D">🧃 parakeet-tdt-1.1b-v3: Multilingual Speech-to-Text OpenVINO</span>**
58
+
59
+
60
+ ## Model Details
61
+
62
+ - **Architecture**: Parakeet TDT v3 (Token-and-Duration Transducer, 1.1B parameters)
63
+ - **Input audio**: 16 kHz, mono, Float32 PCM in range [-1, 1]
64
+ - **Languages**: 24 European languages (see below)
65
+ - **Precision**: FP16 (CPU/GPU), INT8 (NPU)
66
+ - **Backend**: OpenVINO 2025.x
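
The input format listed above can be illustrated with a small conversion helper. This is a minimal sketch, not code from the eddy repository: `pcm16_to_float32` is a hypothetical name, and it assumes the source audio is already sampled at 16 kHz and stored as little-endian 16-bit PCM (resampling is out of scope here).

```python
import struct


def pcm16_to_float32(pcm_bytes: bytes, channels: int = 1) -> list[float]:
    """Convert interleaved little-endian int16 PCM to mono floats in [-1, 1]."""
    count = len(pcm_bytes) // 2
    samples = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    frames = len(samples) // channels
    mono = []
    for i in range(frames):
        frame = samples[i * channels:(i + 1) * channels]
        # Average the channels (downmix to mono), then scale by the int16 range.
        mono.append(sum(frame) / channels / 32768.0)
    return mono


if __name__ == "__main__":
    raw = struct.pack("<3h", 0, 16384, -32768)
    print(pcm16_to_float32(raw))
```

Dividing by 32768 maps the int16 minimum to exactly -1.0 and keeps the maximum just below 1.0, so the result always stays inside the stated range.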
67
+
68
+ ## Performance
69
+
70
+ ### LibriSpeech Benchmark (English)
71
+
72
+ ```
73
+ ================================================================================
74
+ BENCHMARK RESULTS
75
+ ================================================================================
76
+ Dataset: librispeech test-clean
77
+ Model: parakeet-v3
78
+ Device: NPU
79
+ Files processed: 2620
80
+ Average WER: 3.7%
81
+ Median WER: 0.0%
82
+ Average CER: 1.9%
83
+ Median CER: 0.0%
84
+ Median RTFx: 23.5x
85
+ Overall RTFx: 25.7x (19452.5s / 756.4s)
86
+ Benchmark runtime: 789.8s
87
+ Normalization: OpenAI Whisper English
88
+ ================================================================================
89
+ ```
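
The overall RTFx figure is the ratio of audio duration to processing time, as the parenthesised values in the report indicate. A quick check of that arithmetic (the `rtfx` helper is a hypothetical name used only for illustration):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # Totals taken from the LibriSpeech report above.
    print(f"{rtfx(19452.5, 756.4):.1f}x")
```

An RTFx above 1 means faster than real time; at roughly 25x, one hour of audio takes a little over two minutes to transcribe.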
90
+
91
+ ### FLEURS Benchmark (24 languages, 350 samples per language)
92
+
93
+ <details>
94
+ <summary><b>View all 24 languages</b></summary>
95
+
96
+ | Language | WER | CER | RTFx |
97
+ |----------|-----|-----|------|
98
+ | Bulgarian (bg_bg) | 16.76% | 4.66% | 41.7× |
99
+ | Finnish (fi_fi) | 16.81% | 3.68% | 41.5× |
100
+ | Romanian (ro_ro) | 17.51% | 5.89% | 38.9× |
101
+ | Croatian (hr_hr) | 17.76% | 5.84% | 41.0× |
102
+ | Czech (cs_cz) | 18.52% | 5.30% | 43.1× |
103
+ | Swedish (sv_se) | 18.88% | 5.64% | 41.5× |
104
+ | Estonian (et_ee) | 20.78% | 4.90% | 43.4× |
105
+ | Hungarian (hu_hu) | 20.74% | 6.39% | 41.1× |
106
+ | Lithuanian (lt_lt) | 24.55% | 6.66% | 40.4× |
107
+ | Danish (da_dk) | 25.44% | 9.31% | 44.1× |
108
+ | Maltese (mt_mt) | 25.29% | 9.17% | 41.3× |
109
+ | Slovenian (sl_si) | 28.06% | 9.42% | 38.7× |
110
+ | Latvian (lv_lv) | 30.64% | 8.09% | 42.6× |
111
+ | Greek (el_gr) | 42.74% | 14.99% | 37.2× |
112
+
113
+ **Average**: 16.98% WER, 5.39% CER, 41.1× RTFx
114
+
115
+ </details>
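
For reference, WER and CER are conventionally the word-level and character-level edit distance divided by the reference length. The sketch below shows that standard definition only. It is an assumption that the benchmark scripts follow it exactly, and the text normalization applied before scoring is omitted.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because both rates are computed per utterance, a median of 0.0% alongside a non-zero average simply means that more than half of the files were transcribed without any error.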
116
+
117
+
118
+
119
+ ## Usage
120
+
121
+ ### Installation
122
+
123
+ ```bash
124
+ git clone https://github.com/FluidInference/eddy.git
125
+ cd eddy
126
+
127
+ # Build with vcpkg (handles dependencies)
128
+ cmake -S . -B build -DCMAKE_TOOLCHAIN_FILE=[vcpkg]/scripts/buildsystems/vcpkg.cmake
129
+ cmake --build build --config Release
130
+ ```
131
+
132
+ Models auto-download on first run. Cache location:
133
+ - **Windows**: `%LOCALAPPDATA%\eddy\models\parakeet-v3\files\`
134
+ - **Linux**: `~/.cache/eddy/models/parakeet-v3/files/`
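
To locate the cached files from a script, the two paths above can be rebuilt as follows. This is a sketch under stated assumptions: `model_cache_dir` is a hypothetical helper, it mirrors only the two locations listed here, and whether eddy honours overrides such as `XDG_CACHE_HOME` is not stated in this README.

```python
import os
import sys
from pathlib import PurePosixPath, PureWindowsPath


def model_cache_dir(model: str = "parakeet-v3",
                    platform: str = sys.platform,
                    env=os.environ):
    """Return the documented cache directory for a model on the given platform."""
    if platform.startswith("win"):
        # %LOCALAPPDATA%\eddy\models\<model>\files
        base = PureWindowsPath(env["LOCALAPPDATA"]) / "eddy"
    else:
        # ~/.cache/eddy/models/<model>/files
        base = PurePosixPath(env.get("HOME", "~")) / ".cache" / "eddy"
    return base / "models" / model / "files"


if __name__ == "__main__":
    print(model_cache_dir())
```

Passing `platform` and `env` explicitly makes it easy to check the Windows layout from a Linux machine and vice versa.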
135
+
136
+ ### CLI
137
+
138
+ ```bash
139
+ # CPU inference
140
+ build/examples/cpp/Release/parakeet_cli.exe audio.wav --model parakeet-v3
141
+
142
+ # NPU inference (6-10× faster)
143
+ build/examples/cpp/Release/parakeet_cli.exe audio.wav --model parakeet-v3 --device NPU
144
+
145
+ # FLEURS benchmark (all 24 languages)
146
+ build/examples/cpp/Release/benchmark_fleurs.exe "%LOCALAPPDATA%\eddy\datasets\FLEURS" --device NPU
147
+ ```
148
+
149
+ ## Supported Languages
150
+
151
+ 🇮🇹 Italian • 🇪🇸 Spanish • 🇬🇧 English • 🇩🇪 German • 🇫🇷 French • 🇳🇱 Dutch • 🇷🇺 Russian • 🇵🇱 Polish • 🇺🇦 Ukrainian • 🇸🇰 Slovak • 🇧🇬 Bulgarian • 🇫🇮 Finnish • 🇷🇴 Romanian • 🇭🇷 Croatian • 🇨🇿 Czech • 🇸🇪 Swedish • 🇪🇪 Estonian • 🇭🇺 Hungarian • 🇱🇹 Lithuanian • 🇩🇰 Danish • 🇲🇹 Maltese • 🇸🇮 Slovenian • 🇱🇻 Latvian • 🇬🇷 Greek
152
+
153
+ ## Model Architecture
154
+
155
+ 4-model FastConformer transducer (TDT) pipeline:
156
+
157
+ 1. **Mel Spectrogram** (preprocessing)
158
+ - Converts raw audio → 80 mel-frequency bins
159
+ - 25ms window, 10ms hop length
160
+
161
+ 2. **Encoder** (FastConformer)
162
+ - Processes acoustic features
163
+ - Outputs embeddings every 80ms
164
+
165
+ 3. **Decoder** (LSTM)
166
+ - Prediction network that acts as an implicit language model over previously emitted tokens
167
+ - Maintains state across chunks
168
+
169
+ 4. **Joint Network**
170
+ - Combines encoder + decoder outputs
171
+ - Greedy decoding for token prediction
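
The timing figures in the pipeline above translate into sample and frame counts as follows. This is a back-of-the-envelope sketch assuming the 16 kHz input rate from Model Details. The helper names are hypothetical, and the step count ignores any padding the real front end may apply.

```python
SAMPLE_RATE_HZ = 16_000
WINDOW_MS = 25        # mel analysis window
HOP_MS = 10           # mel hop length
ENCODER_STEP_MS = 80  # one encoder output per 80 ms


def ms_to_samples(ms: int, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a sample count."""
    return sample_rate * ms // 1000


def encoder_steps(duration_ms: int) -> int:
    """Approximate number of encoder outputs for a clip (padding ignored)."""
    return duration_ms // ENCODER_STEP_MS


def step_to_seconds(step_index: int) -> float:
    """Start time of an encoder step, which sets the timestamp granularity."""
    return step_index * ENCODER_STEP_MS / 1000


if __name__ == "__main__":
    print(ms_to_samples(WINDOW_MS))    # samples per analysis window
    print(ms_to_samples(HOP_MS))       # samples per hop
    print(ENCODER_STEP_MS // HOP_MS)   # mel frames per encoder output
    print(encoder_steps(10_000))       # encoder outputs for a 10 s clip
```

The ratio of 80 ms to 10 ms implies that the encoder reduces the mel frame rate by a factor of 8.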
172
+
173
+ **Key Features**:
174
+ - LSTM state continuity across audio chunks
175
+ - Token deduplication via 2D search algorithm
176
+ - Batch chunking: 10s windows with 3s overlap
177
+ - Per-token timestamps (80ms granularity) & confidence scores
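
The chunking scheme in the list above (10 s windows with 3 s overlap) amounts to advancing the window start by 7 s each time. The sketch below only computes the window boundaries. `chunk_spans` is a hypothetical helper, and how eddy handles the final partial window or merges tokens in the overlap region is not specified here.

```python
def chunk_spans(total_samples: int,
                window_s: float = 10.0,
                overlap_s: float = 3.0,
                sample_rate: int = 16_000) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    if total_samples <= 0:
        return []
    window = int(window_s * sample_rate)
    step = int((window_s - overlap_s) * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    spans = []
    start = 0
    while True:
        end = min(start + window, total_samples)
        spans.append((start, end))
        if end >= total_samples:
            break
        start += step  # consecutive windows share overlap_s seconds of audio
    return spans


if __name__ == "__main__":
    for start, end in chunk_spans(24 * 16_000):
        print(start / 16_000, end / 16_000)
```

For a 24 s clip this yields three windows (0 to 10 s, 7 to 17 s, 14 to 24 s). Tokens decoded twice in the shared 3 s regions are what the deduplication step would need to reconcile.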
178
+
179
+ **Recommendation**: Use [V2](https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v2-ov) for English-only applications. Use V3 for multilingual support.
180
+
181
+ ## Limitations
182
+
183
+ - **Language Coverage**: Optimized for 24 European languages; performance may degrade for non-European languages or heavy accents.
184
+ - **Noise Robustness**: Best on clean audio; WER increases with background noise.
185
+ - **Streaming Latency**: ~6 seconds with default buffering (configurable).
186
+
187
+ ## License
188
+
189
+ **CC-BY-4.0** - See [LICENSE](LICENSE) for details.
190
+
191
+ ## Acknowledgments
192
+
193
+ - **Base Model**: NVIDIA NeMo Team for Parakeet TDT architecture
194
+ - **Optimization**: Intel OpenVINO for cross-platform inference
195
+ - **Benchmarks**: Google Research (FLEURS), OpenSLR (LibriSpeech)
196
+ - **Hardware**: Intel Core Ultra NPU acceleration