Create README.md
README.md
ADDED
@@ -0,0 +1,196 @@
---
license: cc-by-4.0
track_downloads: true
language:
- en
- es
- fr
- de
- bg
- hr
- cs
- da
- nl
- et
- fi
- el
- hu
- it
- lv
- lt
- mt
- pl
- pt
- ro
- ru
- sk
- sl
- sv
- uk
pipeline_tag: automatic-speech-recognition
library_name: openvino
datasets:
- nvidia/Granary
- nemo/asr-set-3.0
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- NeMo
- OpenVINO
- Intel NPU
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
base_model:
- nvidia/parakeet-tdt_1.1b-v3
---

# **<span style="color:#5DAF8D">parakeet-tdt-1.1b-v3: Multilingual Speech-to-Text OpenVINO</span>**

## Model Details

- **Architecture**: Parakeet TDT v3 (Token Duration Transducer, 1.1B parameters)
- **Input audio**: 16 kHz, mono, Float32 PCM in range [-1, 1]
- **Languages**: 24 European languages (see below)
- **Precision**: FP16 (CPU/GPU), INT8 (NPU)
- **Backend**: OpenVINO 2025.x
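
The input format listed above (16 kHz, mono, values in [-1, 1]) usually means 16-bit WAV data has to be rescaled and downmixed before inference. The following is a minimal sketch of that step, not code from this repository: the function name `pcm16_to_unit_mono` is hypothetical, it assumes interleaved little-endian 16-bit PCM, and it returns plain Python floats that a caller would still need to pack into a Float32 buffer.

```python
from array import array
import sys

TARGET_SAMPLE_RATE = 16_000  # Hz, as stated in the model details


def pcm16_to_unit_mono(raw: bytes, channels: int, sample_rate: int) -> list:
    """Convert interleaved little-endian 16-bit PCM to mono floats in [-1, 1].

    Hypothetical helper for illustration; resampling is out of scope,
    so any other sample rate is rejected.
    """
    if sample_rate != TARGET_SAMPLE_RATE:
        raise ValueError(f"expected {TARGET_SAMPLE_RATE} Hz, got {sample_rate} Hz")
    if channels < 1:
        raise ValueError("channels must be at least 1")

    samples = array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # the input is assumed to be little-endian
    if len(samples) % channels != 0:
        raise ValueError("sample count is not a multiple of the channel count")

    mono = []
    for i in range(0, len(samples), channels):
        frame = samples[i:i + channels]
        # Average the channels, then scale by the int16 magnitude limit.
        mono.append(sum(frame) / (channels * 32768.0))
    return mono
```

Dividing by 32768 maps the full int16 range onto [-1, 1); averaging channels is one common downmix choice, and other weightings are equally valid.
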

## Performance

### LibriSpeech Benchmark (English)

```
================================================================================
BENCHMARK RESULTS
================================================================================
Dataset: librispeech test-clean
Model: parakeet-v3
Device: NPU
Files processed: 2620
Average WER: 3.7%
Median WER: 0.0%
Average CER: 1.9%
Median CER: 0.0%
Median RTFx: 23.5x
Overall RTFx: 25.7x (19452.5s / 756.4s)
Benchmark runtime: 789.8s
Normalization: OpenAI Whisper English
================================================================================
```
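
For readers unfamiliar with the metrics in the report above, here is a small sketch of how they are commonly computed. It is an illustration only, not the benchmark tool's code: `rtfx` and `wer` are hypothetical helper names, the WER uses a plain word-level edit distance, and the Whisper English text normalization mentioned in the report is not applied here.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / processing_seconds


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# The "Overall RTFx" line above: 19452.5 s of audio in 756.4 s of processing.
overall = rtfx(19452.5, 756.4)
```

Rounded to one decimal place, `overall` matches the 25.7x in the report. A median WER of 0.0% alongside an average of 3.7% suggests that more than half of the files were transcribed without any word errors, with the errors concentrated in a minority of files.
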

### FLEURS Benchmark (350 samples per 24 languages)

<details>
<summary><b>View all 24 languages</b></summary>

| Language | WER | CER | RTFx |
|----------|-----|-----|------|
| Bulgarian (bg_bg) | 16.76% | 4.66% | 41.7× |
| Finnish (fi_fi) | 16.81% | 3.68% | 41.5× |
| Romanian (ro_ro) | 17.51% | 5.89% | 38.9× |
| Croatian (hr_hr) | 17.76% | 5.84% | 41.0× |
| Czech (cs_cz) | 18.52% | 5.30% | 43.1× |
| Swedish (sv_se) | 18.88% | 5.64% | 41.5× |
| Estonian (et_ee) | 20.78% | 4.90% | 43.4× |
| Hungarian (hu_hu) | 20.74% | 6.39% | 41.1× |
| Lithuanian (lt_lt) | 24.55% | 6.66% | 40.4× |
| Danish (da_dk) | 25.44% | 9.31% | 44.1× |
| Maltese (mt_mt) | 25.29% | 9.17% | 41.3× |
| Slovenian (sl_si) | 28.06% | 9.42% | 38.7× |
| Latvian (lv_lv) | 30.64% | 8.09% | 42.6× |
| Greek (el_gr) | 42.74% | 14.99% | 37.2× |

**Average**: 16.98% WER, 5.39% CER, 41.1× RTFx

</details>

## Usage

### Installation

```bash
git clone https://github.com/FluidInference/eddy.git
cd eddy

# Build with vcpkg (handles dependencies).
# Replace [vcpkg] with the path to your vcpkg checkout.
cmake -S . -B build -DCMAKE_TOOLCHAIN_FILE=[vcpkg]/scripts/buildsystems/vcpkg.cmake
cmake --build build --config Release
```

Models are downloaded automatically on first run and cached at:
- **Windows**: `%LOCALAPPDATA%\eddy\models\parakeet-v3\files\`
- **Linux**: `~/.cache/eddy/models/parakeet-v3/files/`
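
If a script needs to locate or clear the cached model files, the two paths above can be resolved as in the sketch below. This is only an illustration of the documented locations: `model_cache_dir` is a hypothetical helper, not part of eddy, and it assumes the `LOCALAPPDATA` and `HOME` environment variables are set. Whether eddy honors overrides such as `XDG_CACHE_HOME` is not stated here, so the sketch ignores them.

```python
from pathlib import PurePosixPath, PureWindowsPath


def model_cache_dir(model: str, system: str, env: dict):
    """Return the documented cache directory for `model` on `system`.

    `system` is "Windows" or "Linux" (the two platforms listed above);
    `env` is a mapping of environment variables, e.g. dict(os.environ).
    """
    if system == "Windows":
        base = PureWindowsPath(env["LOCALAPPDATA"]) / "eddy"
    elif system == "Linux":
        base = PurePosixPath(env["HOME"]) / ".cache" / "eddy"
    else:
        raise ValueError(f"no documented cache location for {system!r}")
    return base / "models" / model / "files"
```

Pure path classes are used so the result can be computed for either platform regardless of where the script runs.
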

### CLI

```bash
# CPU inference
build/examples/cpp/Release/parakeet_cli.exe audio.wav --model parakeet-v3

# NPU inference (6-10× faster)
build/examples/cpp/Release/parakeet_cli.exe audio.wav --model parakeet-v3 --device NPU

# FLEURS benchmark (all 24 languages)
build/examples/cpp/Release/benchmark_fleurs.exe "%LOCALAPPDATA%\eddy\datasets\FLEURS" --device NPU
```

## Supported Languages

🇮🇹 Italian • 🇪🇸 Spanish • 🇬🇧 English • 🇩🇪 German • 🇫🇷 French • 🇳🇱 Dutch • 🇷🇺 Russian • 🇵🇱 Polish • 🇺🇦 Ukrainian • 🇸🇰 Slovak • 🇧🇬 Bulgarian • 🇫🇮 Finnish • 🇷🇴 Romanian • 🇭🇷 Croatian • 🇨🇿 Czech • 🇸🇪 Swedish • 🇪🇪 Estonian • 🇭🇺 Hungarian • 🇱🇹 Lithuanian • 🇩🇰 Danish • 🇲🇹 Maltese • 🇸🇮 Slovenian • 🇱🇻 Latvian • 🇬🇷 Greek

## Model Architecture

4-model FastConformer-RNNT pipeline:

1. **Mel Spectrogram** (preprocessing)
   - Converts raw audio → 80 mel-frequency bins
   - 25 ms window, 10 ms hop length

2. **Encoder** (FastConformer)
   - Processes acoustic features
   - Outputs embeddings every 80 ms

3. **Decoder** (LSTM)
   - Prediction network with language model
   - Maintains state across chunks

4. **Joint Network**
   - Combines encoder + decoder outputs
   - Greedy decoding for token prediction
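
The timing figures in the list above translate into concrete sample and frame counts at the 16 kHz input rate. The sketch below works through that arithmetic; the constant and function names are illustrative, and the frame counts are approximate because the exact numbers depend on how the preprocessor pads the edges of the signal, which is not specified here.

```python
SAMPLE_RATE_HZ = 16_000   # input rate from the model details
WINDOW_MS = 25            # mel spectrogram window
HOP_MS = 10               # mel spectrogram hop length
ENCODER_STEP_MS = 80      # one encoder embedding per 80 ms


def ms_to_samples(ms: int) -> int:
    """Convert a duration in milliseconds to a sample count at 16 kHz."""
    return SAMPLE_RATE_HZ * ms // 1000


WINDOW_SAMPLES = ms_to_samples(WINDOW_MS)   # samples per analysis window
HOP_SAMPLES = ms_to_samples(HOP_MS)         # samples between window starts
SUBSAMPLING = ENCODER_STEP_MS // HOP_MS     # mel frames per encoder frame


def approx_frame_counts(duration_s: float) -> tuple:
    """Approximate (mel_frames, encoder_frames) for a clip, ignoring padding."""
    duration_ms = round(duration_s * 1000)
    return duration_ms // HOP_MS, duration_ms // ENCODER_STEP_MS
```

With these figures the window spans 400 samples, the hop is 160 samples, and the encoder reduces the frame rate by a factor of 8, so a 10 s clip yields roughly 1000 mel frames and 125 encoder frames.
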

**Key Features**:
- LSTM state continuity across audio chunks
- Token deduplication via 2D search algorithm
- Batch chunking: 10s windows with 3s overlap
- Per-token timestamps (80ms granularity) & confidence scores
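
The chunking and timestamp figures above can be illustrated with a short sketch. It assumes a simple fixed-stride scheme (10 s window, 3 s overlap, so each chunk starts 7 s after the previous one) and a shorter final chunk; the function names are hypothetical and the project's actual boundary handling may differ. The token deduplication step in the overlap region is not covered.

```python
def chunk_spans(total_samples: int, sample_rate: int = 16_000,
                window_s: float = 10.0, overlap_s: float = 3.0) -> list:
    """Return (start, end) sample indices of overlapping chunks."""
    if overlap_s >= window_s:
        raise ValueError("overlap must be shorter than the window")
    window = int(window_s * sample_rate)
    stride = int((window_s - overlap_s) * sample_rate)

    spans = []
    start = 0
    while start < total_samples:
        end = min(start + window, total_samples)
        spans.append((start, end))
        if end == total_samples:
            break  # the last chunk reached the end of the audio
        start += stride
    return spans


def token_time_ms(chunk_start_sample: int, encoder_frame: int,
                  sample_rate: int = 16_000, frame_ms: int = 80) -> int:
    """Absolute token time: chunk offset plus 80 ms per encoder frame."""
    return chunk_start_sample * 1000 // sample_rate + encoder_frame * frame_ms


# A 25 s clip at 16 kHz.
spans = chunk_spans(25 * 16_000)
```

For the 25 s clip this gives four chunks starting at 0 s, 7 s, 14 s, and 21 s, the last one only 4 s long. A token emitted at encoder frame 10 of the second chunk would be placed at 7000 ms + 10 × 80 ms = 7800 ms.
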

**Recommendation**: Use [V2](https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v2-ov) for English-only applications. Use V3 for multilingual support.

## Limitations

- **Language Coverage**: Optimized for 24 European languages; performance may degrade for non-European languages or heavy accents.
- **Noise Robustness**: Best on clean audio; WER increases with background noise.
- **Streaming Latency**: ~6 seconds with default buffering (configurable).

## License

**CC-BY-4.0** - See [LICENSE](LICENSE) for details.

## Acknowledgments

- **Base Model**: NVIDIA NeMo Team for Parakeet TDT architecture
- **Optimization**: Intel OpenVINO for cross-platform inference
- **Benchmarks**: Google Research (FLEURS), OpenSLR (LibriSpeech)
- **Hardware**: Intel Core Ultra NPU acceleration