---
license: apache-2.0
language:
- zh
- nan
tags:
- whisper
- asr
- taiwanese
- coreml
- ane
- breeze-asr-25
base_model: MediaTek-Research/Breeze-ASR-25
---

# Breeze-ASR-25 CoreML ANE Optimized

Taiwanese/Mandarin mixed speech recognition model optimized for the Apple Neural Engine (ANE).

## 🎯 Model Overview

Based on [MediaTek-Research/Breeze-ASR-25](https://huggingface.co/MediaTek-Research/Breeze-ASR-25), converted to CoreML format with ANE optimization for macOS/iOS.

### Model Components

| Component | File | Precision | Hardware | Size |
|-----------|------|-----------|----------|------|
| **Encoder** | `encoder/ggml-breeze-asr-25-encoder.mlmodelc/` | FP16 | ANE | ~1.2 GB |
| **Decoder** | `decoder/ggml-breeze-asr-25-q5k.bin` | Q5_K_M | CPU/GPU | ~1.0 GB |

### Decoder Attribution

GGML decoder from [alan314159/Breeze-ASR-25-whispercpp](https://huggingface.co/alan314159/Breeze-ASR-25-whispercpp). Thank you for sharing!

**SHA256**: `8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6`

---

## 🚀 Quick Start

### 1. Download Models

```bash
# Install the Hugging Face Hub CLI
pip install huggingface_hub

# Download all models
huggingface-cli download sheep52031/breeze-asr-25-coreml-ane \
  --local-dir ./models
```
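If you prefer to stay in Python, the same files can be fetched with `huggingface_hub` directly (repo ID as in the CLI command above):

```python
from huggingface_hub import snapshot_download

# Fetch the CoreML encoder and GGML decoder into ./models
snapshot_download(
    repo_id="sheep52031/breeze-asr-25-coreml-ane",
    local_dir="./models",
)
```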
### 2. Swift Integration (macOS/iOS)

```swift
import CoreML
import whisper

// Load CoreML Encoder
let encoderURL = Bundle.main.url(
    forResource: "ggml-breeze-asr-25-encoder",
    withExtension: "mlmodelc"
)!

// Load GGML Decoder
let decoderPath = Bundle.main.path(
    forResource: "ggml-breeze-asr-25-q5k",
    ofType: "bin"
)!

// Initialize whisper.cpp context
var ctxParams = whisper_context_default_params()
ctxParams.use_gpu = true // Metal GPU for the decoder; the CoreML encoder runs on the ANE

let ctx = whisper_init_from_file_with_params(decoderPath, ctxParams)
whisper_context_set_coreml_encoder(ctx, encoderURL.path)

// Transcribe (loadAudio is your own helper returning 16 kHz mono Float samples)
let audioData: [Float] = loadAudio("audio.wav")
let fullParams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
whisper_full(ctx, fullParams, audioData, Int32(audioData.count))

let text = whisper_full_get_segment_text(ctx, 0)
print("Result: \(String(cString: text!))")
```
---

## 📊 Performance Benchmarks

### macOS (Apple Silicon)

**Test Environment**: MacBook Pro M1/M2, 16GB RAM

**Configuration**:
- Window size: 30s (audio_ctx=3000)
- Overlap: 5s (see the windowing sketch below)
- Processing: Serial (parallelism=1)
- State management: Shared state reuse

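The window and overlap settings above imply a simple serial sliding-window pass over the audio. A minimal sketch of that schedule, assuming 16 kHz mono float samples (illustrative only, not the actual benchmark harness):

```python
SAMPLE_RATE = 16_000
WINDOW_S = 30                   # 30 s window (audio_ctx = 3000 mel frames)
OVERLAP_S = 5
STEP_S = WINDOW_S - OVERLAP_S   # consecutive windows start 25 s apart


def windows(samples):
    """Yield (start_sample, chunk) pairs: 30 s chunks, 5 s overlap, processed serially."""
    size = WINDOW_S * SAMPLE_RATE
    step = STEP_S * SAMPLE_RATE
    start = 0
    while True:
        yield start, samples[start:start + size]
        if start + size >= len(samples):
            break
        start += step
```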
**Actual Performance** (Verified 2025-10-05):

| Audio Length | Processing Time | RTR | Status |
|--------------|-----------------|------|--------|
| 30s | ~10s | 0.33x | ✅ Stable |
| 60s | ~19s | 0.32x | ✅ Stable |
| 70s | ~22s | 0.31x | ✅ Verified |
| 120s | ~37s | 0.31x | ✅ Final |

**RTR (Real-Time Ratio)**: lower is better; an RTR of 0.31 means roughly 3.2x faster than real time.

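Worked out for the 120 s row above:

```python
audio_s, processing_s = 120, 37
rtr = processing_s / audio_s      # ~0.31: processing time / audio length (lower is better)
speedup = audio_s / processing_s  # ~3.2x faster than real time
```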
110
+ ### Comparison
111
+
112
+ | Configuration | 120s Audio | RTR | Note |
113
+ |---------------|------------|-----|------|
114
+ | **This Project (FP16 ANE + Q5_K)** | **~37s** | **0.31x** | βœ… Verified |
115
+ | Full GGML (Estimated) | ~72s | 0.60x | πŸ“Š Theoretical |
116
+
117
+ **Note**: "Full GGML" is theoretical estimation based on ANE acceleration ratio. Performance may vary based on:
118
+ - Audio content (speech density)
119
+ - System resources
120
+ - Background tasks
121
+
122
+ ---
123
+
124
+ ## πŸ”§ Technical Modifications for Breeze-ASR-25 Support
125
+
126
+ This project implements a hybrid inference architecture combining CoreML-accelerated Encoder with GGML-quantized Decoder to support Breeze-ASR-25 on Apple Silicon.
127
+
128
+ ### Why Official whisper.cpp Doesn't Work
129
+
130
+ Breeze-ASR-25 is a fine-tuned Whisper model with key differences:
131
+
132
+ - **Vocabulary Size**: 51,865 tokens (vs 51,864 in standard Whisper)
133
+ - **Sequence Length**: `max_source_positions=1500` (encoder output length)
134
+ - **Audio Window**: Supports 30-second audio (3000 mel frames)
135
+
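These values can be read straight off the upstream checkpoint's configuration; a minimal check using the standard `transformers.WhisperConfig` fields (expected values taken from the list above):

```python
from transformers import WhisperConfig

cfg = WhisperConfig.from_pretrained("MediaTek-Research/Breeze-ASR-25")
print(cfg.vocab_size)            # expected: 51865
print(cfg.max_source_positions)  # expected: 1500 (encoder output positions)
print(cfg.num_mel_bins)          # 80 mel bins; the input window is 3000 mel frames
```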
Official whisper.cpp assumptions:

1. ❌ Hardcodes `input_shape = (1, 80, 3000)` in CoreML conversion
2. ❌ Expects vocab_size=51,864
3. ❌ Lacks dynamic audio_ctx configuration API

### Our Key Modifications

#### 1. CoreML Conversion Script Enhancement

```python
# whisper.cpp/models/convert-whisper-to-coreml.py

# Dynamic sequence length (not hardcoded 3000)
input_shape = (1, hparams.n_mels, hparams.n_audio_ctx)

# Correct feature names for whisper.cpp compatibility
inputs=[ct.TensorType(name="mel", shape=input_shape)]
outputs=[ct.TensorType(name="encoder_output")]
```
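For context, one way `hparams` could be populated is directly from the checkpoint's `config.json`. The sketch below is illustrative rather than the script's exact code; the local file path is an assumption, and the x2 factor follows the conv-stride relationship described under Technical Insights:

```python
import json
import coremltools as ct

# Derive the encoder input shape from the model's own configuration
# instead of hardcoding (1, 80, 3000).
with open("Breeze-ASR-25/config.json") as f:
    cfg = json.load(f)

n_mels = cfg["num_mel_bins"]                # 80
n_frames = cfg["max_source_positions"] * 2  # 1500 output positions x conv stride 2 = 3000 mel frames

inputs = [ct.TensorType(name="mel", shape=(1, n_mels, n_frames))]
outputs = [ct.TensorType(name="encoder_output")]
# These are then passed to ct.convert() on the TorchScript-traced encoder,
# with compute_precision=ct.precision.FLOAT16 to target the ANE.
```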
#### 2. whisper.cpp API Extension

```cpp
// Added whisper_set_audio_ctx() for runtime configuration
// Allows models with smaller n_audio_ctx (like Breeze-ASR-25 with 1500)
// to work correctly instead of being padded to 30 seconds (3000 frames)
int whisper_set_audio_ctx(struct whisper_context * ctx, int n_audio_ctx);
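// Example usage (illustrative; exact value depends on the model): after
// whisper_init_*, request the model's native context so inputs are not
// padded out to 3000 frames:
//   whisper_set_audio_ctx(ctx, 1500);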
```

*Note: This is our custom modification to support Breeze-ASR-25; it is not yet in official whisper.cpp.*

#### 3. Modified whisper.cpp Fork

We maintain a fork with all necessary modifications:

**Repository**: [sheep52031/whisper.cpp](https://github.com/sheep52031/whisper.cpp) (branch: `breeze-asr-25-support`)

**Key modifications**:

- `whisper_set_audio_ctx()` API for dynamic audio context
- CoreML conversion enhancements for fine-tuned models
- Metal bfloat16 optimizations for M2+ GPUs
- Based on [Splend1d/whisper-patch-breeze](https://github.com/Splend1d/whisper-patch-breeze) for vocab support

**To use**:

```bash
git clone -b breeze-asr-25-support https://github.com/sheep52031/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build
```

#### 4. Hybrid Inference Architecture

```text
Audio Input (16kHz)
→ Log-Mel Features (80 × 3000)
→ CoreML Encoder (FP16, ANE-accelerated)
→ Hidden States [1, 1500, 1280]
→ GGML Decoder (Q5_K quantized, Metal GPU)
→ Text Output
```

### Technical Insights

#### Understanding `max_source_positions=1500`

- This is the encoder **output** sequence length
- Actual input length = 1500 × 2 (conv_stride) = 3000 mel frames
- Equivalent to 30 seconds of audio (100 fps)
- Common misconception: "1500 = 15 seconds" ❌ (see the arithmetic below)

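The arithmetic behind those bullets, spelled out:

```python
max_source_positions = 1500            # encoder OUTPUT sequence length
mel_frames = max_source_positions * 2  # conv stride 2 -> 3000 mel frames at the input
seconds = mel_frames / 100             # mel features run at 100 frames per second
print(seconds)                         # 30.0 -> a full 30-second window, not 15
```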
209
+ #### Why GGML Conversion Works But CoreML Fails
210
+
211
+ - GGML: Directly reads config.json, preserves tensor shapes, dynamic runtime
212
+ - CoreML: Requires TorchScript trace with fixed shapes, hardcoded assumptions
213
+ - Our fix: Make CoreML conversion respect model configuration
214
+
215
+ ### Contributions to Open Source
216
+
217
+ We've identified and fixed critical issues in whisper.cpp's CoreML conversion:
218
+
219
+ 1. βœ… Dynamic sequence length support (not just 3000 frames)
220
+ 2. βœ… Runtime audio_ctx configuration API
221
+ 3. βœ… Correct feature naming for hybrid inference
222
+
223
+ These modifications enable support for **all fine-tuned Whisper variants**, not just Breeze-ASR-25.
224
+
225
+ **Source Code**: All modifications are open-sourced at [sheep52031/whisper.cpp](https://github.com/sheep52031/whisper.cpp) (branch: `breeze-asr-25-support`)
226
+
227
+ ---
228
+
229
+ ## πŸ› οΈ Convert From Scratch
230
+
231
+ ### Requirements
232
+
233
+ \`\`\`bash
234
+ # macOS 13+, Xcode 14+, Python 3.9+
235
+ pip install -r conversion_tools/requirements.txt
236
+ \`\`\`
237
+
238
+ ### Convert Encoder
239
+
240
+ \`\`\`bash
241
+ cd conversion_tools
242
+ python convert_encoder.py --output ../encoder
243
+ \`\`\`
244
+
245
+ ### Convert Decoder
246
+
247
+ \`\`\`bash
248
+ cd conversion_tools
249
+ python convert_decoder.py --output ./output --quantize q5_k
250
+ \`\`\`
251
+
252
+ ---
253
+
254
+ ## βœ… Verification
255
+
256
+ \`\`\`bash
257
+ # Encoder precision check
258
+ cat encoder/ggml-breeze-asr-25-encoder.mlmodelc/metadata.json | grep dataType
259
+ # Should show: "dataType" : "Float16"
260
+
261
+ # Decoder SHA256 check
262
+ shasum -a 256 decoder/ggml-breeze-asr-25-q5k.bin
263
+ # Expected: 8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6
264
+ \`\`\`
265
+
266
+ ---
267
+
268
+ ## πŸ“„ License
269
+
270
+ Based on [MediaTek-Research/Breeze-ASR-25](https://huggingface.co/MediaTek-Research/Breeze-ASR-25) (Apache 2.0).
271
+
272
+ **Attribution**:
273
+ - CoreML ANE Optimization: sheep52031 (MIT License)
274
+ - GGML Conversion: alan314159
275
+
276
+ ---
277
+
278
+ ## πŸ™ Acknowledgments
279
+
280
+ - **MediaTek Research**: Breeze-ASR-25 model
281
+ - **alan314159**: GGML conversion & pretrained model
282
+ - **ggerganov**: whisper.cpp framework
283
+ - **Apple**: CoreML Tools & ANE
284
+ - **OpenAI**: Whisper base model
285
+
286
+ ---
287
+
288
+ **Last Updated**: 2025-10-06
289
+ **Version**: 1.0.0