alexwengg committed on
Commit dfd55eb · verified · 1 Parent(s): c5d20e7

Update README.md

Files changed (1)
  1. README.md +63 -139
README.md CHANGED
@@ -1,37 +1,31 @@
1
  ---
2
  license: cc-by-4.0
3
- track_downloads: true
4
  language:
5
  - en
6
  - es
 
7
  - fr
8
  - de
 
 
 
 
 
9
  - bg
 
 
10
  - hr
11
  - cs
12
- - da
13
- - nl
14
  - et
15
- - fi
16
- - el
17
  - hu
18
- - it
19
- - lv
20
  - lt
 
21
  - mt
22
- - pl
23
- - pt
24
- - ro
25
- - ru
26
- - sk
27
  - sl
28
- - sv
29
- - uk
30
  pipeline_tag: automatic-speech-recognition
31
- library_name: openvino
32
- datasets:
33
- - nvidia/Granary
34
- - nemo/asr-set-3.0
35
  thumbnail: null
36
  tags:
37
  - automatic-speech-recognition
@@ -41,156 +35,86 @@ tags:
41
  - TDT
42
  - FastConformer
43
  - Conformer
 
44
  - NeMo
45
  - OpenVINO
46
- - Intel NPU
47
- - hf-asr-leaderboard
48
- widget:
49
- - example_title: Librispeech sample 1
50
- src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
51
- - example_title: Librispeech sample 2
52
- src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
53
  base_model:
54
- - nvidia/parakeet-tdt_1.1b-v3
55
  ---
56
 
57
- # **<span style="color:#5DAF8D">🧃 parakeet-tdt-1.1b-v3: Multilingual Speech-to-Text OpenVINO</span>**
58
-
59
 
60
- ## Model Details
 
61
 
62
- - **Architecture**: Parakeet TDT v3 (Token Duration Transducer, 1.1B parameters)
63
- - **Input audio**: 16 kHz, mono, Float32 PCM in range [-1, 1]
64
- - **Languages**: 24 European languages (see below)
65
- - **Precision**: FP16 (CPU/GPU), INT8 (NPU)
66
- - **Backend**: OpenVINO 2025.x
67
-
68
- ## Performance
69
-
70
- ### librispeech Benchmark, English
71
-
72
- ```
73
- ================================================================================
74
- BENCHMARK RESULTS
75
- ================================================================================
76
- Dataset: librispeech test-clean
77
- Model: parakeet-v3
78
- Device: NPU
79
- Files processed: 2620
80
- Average WER: 3.7%
81
- Median WER: 0.0%
82
- Average CER: 1.9%
83
- Median CER: 0.0%
84
- Median RTFx: 23.5x
85
- Overall RTFx: 25.7x (19452.5s / 756.4s)
86
- Benchmark runtime: 789.8s
87
- Normalization: OpenAI Whisper English
88
- ================================================================================
89
- ```
90
-
91
- ### FLEURS Benchmark (350 samples per 24 languages)
92
-
93
- <details>
94
- <summary><b>View all 24 languages</b></summary>
95
-
96
- | Language | WER | CER | RTFx |
97
- |----------|-----|-----|------|
98
- | Bulgarian (bg_bg) | 16.76% | 4.66% | 41.7× |
99
- | Finnish (fi_fi) | 16.81% | 3.68% | 41.5× |
100
- | Romanian (ro_ro) | 17.51% | 5.89% | 38.9× |
101
- | Croatian (hr_hr) | 17.76% | 5.84% | 41.0× |
102
- | Czech (cs_cz) | 18.52% | 5.30% | 43.1× |
103
- | Swedish (sv_se) | 18.88% | 5.64% | 41.5× |
104
- | Estonian (et_ee) | 20.78% | 4.90% | 43.4× |
105
- | Hungarian (hu_hu) | 20.74% | 6.39% | 41.1× |
106
- | Lithuanian (lt_lt) | 24.55% | 6.66% | 40.4× |
107
- | Danish (da_dk) | 25.44% | 9.31% | 44.1× |
108
- | Maltese (mt_mt) | 25.29% | 9.17% | 41.3× |
109
- | Slovenian (sl_si) | 28.06% | 9.42% | 38.7× |
110
- | Latvian (lv_lv) | 30.64% | 8.09% | 42.6× |
111
- | Greek (el_gr) | 42.74% | 14.99% | 37.2× |
112
-
113
- **Average**: 16.98% WER, 5.39% CER, 41.1× RTFx
114
-
115
- </details>
116
 
 
117
 
 
 
118
 
119
- ## Usage
120
 
121
- ### Installation
 
 
 
 
 
 
 
122
 
123
- ```bash
124
- git clone https://github.com/FluidInference/eddy.git
125
- cd eddy
126
 
127
- # Build with vcpkg (handles dependencies)
128
- cmake -S . -B build -DCMAKE_TOOLCHAIN_FILE=[vcpkg]/scripts/buildsystems/vcpkg.cmake
129
- cmake --build build --config Release
130
- ```
 
 
131
 
132
- Models auto-download on first run. Cache location:
133
- - **Windows**: `%LOCALAPPDATA%\eddy\models\parakeet-v3\files\`
134
- - **Linux**: `~/.cache/eddy/models/parakeet-v3/files/`
135
 
136
- ### CLI
137
 
138
- ```bash
139
- # CPU inference
140
- build/examples/cpp/Release/parakeet_cli.exe audio.wav --model parakeet-v3
141
 
142
- # NPU inference (6-10× faster)
143
- build/examples/cpp/Release/parakeet_cli.exe audio.wav --model parakeet-v3 --device NPU
 
 
 
144
 
145
- # FLEURS benchmark (all 24 languages)
146
- build/examples/cpp/Release/benchmark_fleurs.exe "%LOCALAPPDATA%\eddy\datasets\FLEURS" --device NPU
147
- ```
148
 
149
  ## Supported Languages
150
 
151
- 🇮🇹 Italian 🇪🇸 Spanish 🇬🇧 English • 🇩🇪 German • 🇫🇷 French • 🇳🇱 Dutch • 🇷🇺 Russian • 🇵🇱 Polish • 🇺🇦 Ukrainian • 🇸🇰 Slovak • 🇧🇬 Bulgarian • 🇫🇮 Finnish • 🇷🇴 Romanian • 🇭🇷 Croatian • 🇨🇿 Czech • 🇸🇪 Swedish • 🇪🇪 Estonian • 🇭🇺 Hungarian • 🇱🇹 Lithuanian • 🇩🇰 Danish • 🇲🇹 Maltese • 🇸🇮 Slovenian • 🇱🇻 Latvian • 🇬🇷 Greek
152
 
153
- ## Model Architecture
154
-
155
- 4-model FastConformer-RNNT pipeline:
156
-
157
- 1. **Mel Spectrogram** (preprocessing)
158
- - Converts raw audio → 80 mel-frequency bins
159
- - 25ms window, 10ms hop length
160
-
161
- 2. **Encoder** (FastConformer)
162
- - Processes acoustic features
163
- - Outputs embeddings every 80ms
164
-
165
- 3. **Decoder** (LSTM)
166
- - Prediction network with language model
167
- - Maintains state across chunks
168
 
169
- 4. **Joint Network**
170
- - Combines encoder + decoder outputs
171
- - Greedy decoding for token prediction
172
 
173
- **Key Features**:
174
- - LSTM state continuity across audio chunks
175
- - Token deduplication via 2D search algorithm
176
- - Batch chunking: 10s windows with 3s overlap
177
- - Per-token timestamps (80ms granularity) & confidence scores
178
 
179
- **Recommendation**: Use [V2](https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v2-ov) for English-only applications. Use V3 for multilingual support.
 
 
 
 
 
180
 
181
- ## Limitations
182
 
183
- - **Language Coverage**: Optimized for 24 European languages; performance may degrade for non-European languages or heavy accents.
184
- - **Noise Robustness**: Best on clean audio; WER increases with background noise.
185
- - **Streaming Latency**: ~6 seconds with default buffering (configurable).
186
 
187
- ## License
188
 
189
- **CC-BY-4.0** - See [LICENSE](LICENSE) for details.
 
 
190
 
191
  ## Acknowledgments
192
 
193
- - **Base Model**: NVIDIA NeMo Team for Parakeet TDT architecture
194
- - **Optimization**: Intel OpenVINO for cross-platform inference
195
- - **Benchmarks**: Google Research (FLEURS), OpenSLR (LibriSpeech)
196
- - **Hardware**: Intel Core Ultra NPU acceleration
 
1
  ---
2
  license: cc-by-4.0
 
3
  language:
4
  - en
5
  - es
6
+ - it
7
  - fr
8
  - de
9
+ - nl
10
+ - ru
11
+ - pl
12
+ - uk
13
+ - sk
14
  - bg
15
+ - fi
16
+ - ro
17
  - hr
18
  - cs
19
+ - sv
 
20
  - et
 
 
21
  - hu
 
 
22
  - lt
23
+ - da
24
  - mt
 
 
 
 
 
25
  - sl
26
+ - lv
27
+ - el
28
  pipeline_tag: automatic-speech-recognition
 
 
 
 
29
  thumbnail: null
30
  tags:
31
  - automatic-speech-recognition
 
35
  - TDT
36
  - FastConformer
37
  - Conformer
38
+ - multilingual
39
  - NeMo
40
  - OpenVINO
 
 
 
 
 
 
 
41
  base_model:
42
+ - nvidia/parakeet-tdt-1.1b
43
  ---
44
 
45
+ # Parakeet TDT 1.1B V3 - OpenVINO
 
46
 
47
+ [![Discord](https://img.shields.io/badge/Discord-Join%20Chat-7289da.svg)](https://discord.gg/WNsvaCtmDe)
48
+ [![GitHub Repo stars](https://img.shields.io/github/stars/FluidInference/eddy?style=flat&logo=github)](https://github.com/FluidInference/eddy)
49
 
50
+ OpenVINO-optimized version of NVIDIA's Parakeet TDT 1.1B V3 model for high-performance multilingual automatic speech recognition on Intel NPUs and CPUs.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
+ ## Benchmark Results
53
 
54
+ **Hardware**: Intel Core Ultra 7 155H (Meteor Lake) with Intel AI Boost NPU
55
+ **Software**: OpenVINO 2025.x
56
 
57
+ ### LibriSpeech test-clean (English)
58
 
59
+ | Metric | Value |
60
+ |--------|-------|
61
+ | **Average WER** | 3.7% |
62
+ | **Median WER** | 0.0% |
63
+ | **Average CER** | 1.9% |
64
+ | **RTFx (NPU)** | 25.7× |
65
+ | **RTFx (CPU)** | 5-8× |
66
+ | **Files processed** | 2,620 (5.4 hours) |
67
 
68
+ ### FLEURS Multilingual (24 Languages)
 
 
69
 
70
+ | Metric | Value |
71
+ |--------|-------|
72
+ | **Average WER** | 17.0% |
73
+ | **Average CER** | 5.4% |
74
+ | **Average RTFx** | 41.1× |
75
+ | **Total samples** | ~15,000+ |
76
 
77
+ **Best performing languages** (WER): Italian 4.3%, Spanish 5.4%, English 6.1%, German 7.4%, French 7.7%
 
 
78
 
79
+ See [BENCHMARK_RESULTS.md](https://github.com/FluidInference/eddy/blob/main/BENCHMARK_RESULTS.md) for complete per-language results.
80
 
81
+ ## Performance Comparison
 
 
82
 
83
+ | Implementation | Device | RTFx (Avg) | WER (LibriSpeech) |
84
+ |----------------|--------|------------|-------------------|
85
+ | **eddy (OpenVINO)** | Intel Core Ultra 7 155H NPU | **25.7×** | 3.7% |
86
+ | Parakeet (PyTorch) | Intel Arc 140V GPU | ~20×* | ~2.5%* |
87
+ | **eddy (OpenVINO)** | Intel Core Ultra 7 155H CPU | **5-8×** | 3.7% |
88
 
89
+ > **Note**: Benchmarked on HP EliteBook Ultra G1i. eddy NPU is ~1.3× faster than PyTorch on Intel Arc GPU, with lower power consumption. *V3 estimated from V2 benchmark.
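
The headline numbers in the benchmark tables are related by simple arithmetic, which the sketch below reproduces. It assumes RTFx means audio duration divided by processing time, and it takes the 19452.5 s / 756.4 s figures from the benchmark log in the previous README revision. The function name `rtfx` is hypothetical and is not part of eddy.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Figures quoted in the LibriSpeech test-clean benchmark log (NPU run).
AUDIO_SECONDS = 19452.5
PROCESSING_SECONDS = 756.4

npu_rtfx = rtfx(AUDIO_SECONDS, PROCESSING_SECONDS)
audio_hours = AUDIO_SECONDS / 3600.0

# Ratio behind the "~1.3x faster" note: NPU RTFx vs. the estimated ~20x PyTorch figure.
speedup_vs_pytorch = 25.7 / 20.0

print(f"NPU RTFx: {npu_rtfx:.1f}x")
print(f"Audio processed: {audio_hours:.1f} hours")
print(f"Speedup vs. PyTorch estimate: {speedup_vs_pytorch:.1f}x")
```

Because the PyTorch figure is itself an estimate carried over from V2, the resulting speedup ratio should be read as approximate.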
 
 
90
 
91
  ## Supported Languages
92
 
93
+ **24 European languages**: English, Spanish, Italian, French, German, Dutch, Russian, Polish, Ukrainian, Slovak, Bulgarian, Finnish, Romanian, Croatian, Czech, Swedish, Estonian, Hungarian, Lithuanian, Danish, Maltese, Slovenian, Latvian, Greek
94
 
95
+ ## Usage
 
 
 
 
 
 
 
 
 
 
 
 
 
 
96
 
97
+ Python usage via ctypes available - see [eddy repository](https://github.com/FluidInference/eddy) for details.
 
 
98
 
99
+ ## Model Details
 
 
 
 
100
 
101
+ - **Parameters**: 1.1B
102
+ - **Architecture**: FastConformer-RNNT (4-model pipeline)
103
+ - **Languages**: 24 European languages
104
+ - **Blank token ID**: 8192
105
+ - **Context window**: 10s chunks with 3s overlap
106
+ - **Features**: LSTM state continuity, token deduplication, per-token timestamps
107
 
108
+ ## License
109
 
110
+ CC-BY-4.0 - See [LICENSE](LICENSE) for details.
 
 
111
 
112
+ ## Links
113
 
114
+ - **GitHub**: [FluidInference/eddy](https://github.com/FluidInference/eddy)
115
+ - **Base Model**: [nvidia/parakeet-tdt-1.1b](https://huggingface.co/nvidia/parakeet-tdt-1.1b)
116
+ - **Documentation**: [Benchmark Results](https://github.com/FluidInference/eddy/blob/main/BENCHMARK_RESULTS.md)
117
 
118
  ## Acknowledgments
119
 
120
+ Based on NVIDIA's Parakeet TDT model. OpenVINO conversion and optimization by the FluidInference team.
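
The "10s chunks with 3s overlap" context window listed under Model Details can be illustrated with a minimal sketch. This is not eddy's implementation: the function name `chunk_bounds` is hypothetical, and the 16 kHz sample rate is an assumption taken from the previous README revision. It only shows how such a window and overlap translate into sample ranges.

```python
from typing import List, Tuple


def chunk_bounds(
    num_samples: int,
    sample_rate: int = 16_000,   # assumed: 16 kHz mono input
    window_s: float = 10.0,      # chunk length from Model Details
    overlap_s: float = 3.0,      # overlap between consecutive chunks
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    window = int(window_s * sample_rate)
    hop = int((window_s - overlap_s) * sample_rate)
    if hop <= 0:
        raise ValueError("overlap must be shorter than the window")
    if num_samples <= 0:
        return []

    bounds = []
    start = 0
    while True:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end >= num_samples:  # last chunk reached the end of the audio
            break
        start += hop            # advance by 7 s with the default settings
    return bounds


# A 25-second clip at 16 kHz.
for start, end in chunk_bounds(25 * 16_000):
    print(f"{start / 16_000:5.1f}s -> {end / 16_000:5.1f}s")
```

With the default settings each chunk starts 7 s after the previous one, so neighbouring chunks share 3 s of audio. That shared region is where a token deduplication step would have to reconcile repeated tokens.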