---
license: apache-2.0
language:
- zh
- nan
tags:
- whisper
- asr
- taiwanese
- coreml
- ane
- breeze-asr-25
base_model: MediaTek-Research/Breeze-ASR-25
---

# Breeze-ASR-25 CoreML ANE Optimized

Taiwanese/Mandarin mixed speech recognition model optimized for the Apple Neural Engine (ANE).

## 🎯 Model Overview

Based on [MediaTek-Research/Breeze-ASR-25](https://huggingface.co/MediaTek-Research/Breeze-ASR-25), converted to CoreML format with ANE optimization for macOS/iOS.

### Model Components

| Component | File | Precision | Hardware | Size |
|-----------|------|-----------|----------|------|
| **Encoder** | `encoder/ggml-breeze-asr-25-encoder.mlmodelc/` | FP16 | ANE | ~1.2 GB |
| **Decoder** | `decoder/ggml-breeze-asr-25-q5k.bin` | Q5_K_M | CPU/GPU | ~1.0 GB |

### Decoder Attribution

GGML decoder from [alan314159/Breeze-ASR-25-whispercpp](https://huggingface.co/alan314159/Breeze-ASR-25-whispercpp). Thank you for sharing!

**SHA256**: `8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6`

---

## 🚀 Quick Start

### 1. Download Models

```bash
# Install the Hugging Face Hub CLI
pip install huggingface_hub

# Download all models
huggingface-cli download sheep52031/breeze-asr-25-coreml-ane \
  --local-dir ./models
```
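If you prefer to stay in Python, the same files can be fetched with `huggingface_hub` directly (repo ID as in the CLI command above):

```python
from huggingface_hub import snapshot_download

# Fetch the CoreML encoder and GGML decoder into ./models
snapshot_download(
    repo_id="sheep52031/breeze-asr-25-coreml-ane",
    local_dir="./models",
)
```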
### 2. Swift Integration (macOS/iOS)

```swift
import CoreML
import whisper

// Load CoreML Encoder
let encoderURL = Bundle.main.url(
    forResource: "ggml-breeze-asr-25-encoder",
    withExtension: "mlmodelc"
)!

// Load GGML Decoder
let decoderPath = Bundle.main.path(
    forResource: "ggml-breeze-asr-25-q5k",
    ofType: "bin"
)!

// Initialize whisper.cpp context
var ctxParams = whisper_context_default_params()
ctxParams.use_gpu = true // Metal GPU for the decoder; the CoreML encoder runs on the ANE

let ctx = whisper_init_from_file_with_params(decoderPath, ctxParams)
whisper_context_set_coreml_encoder(ctx, encoderURL.path)

// Transcribe (loadAudio is your own helper returning 16 kHz mono Float samples)
let audioData: [Float] = loadAudio("audio.wav")
let fullParams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
whisper_full(ctx, fullParams, audioData, Int32(audioData.count))

let text = whisper_full_get_segment_text(ctx, 0)
print("Result: \(String(cString: text!))")
```
---

## 📊 Performance Benchmarks

### macOS (Apple Silicon)

**Test Environment**: MacBook Pro M1/M2, 16GB RAM

**Configuration**:
- Window size: 30s (audio_ctx=3000)
- Overlap: 5s (see the windowing sketch below)
- Processing: Serial (parallelism=1)
- State management: Shared state reuse

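The window and overlap settings above imply a simple serial sliding-window pass over the audio. A minimal sketch of that schedule, assuming 16 kHz mono float samples (illustrative only, not the actual benchmark harness):

```python
SAMPLE_RATE = 16_000
WINDOW_S = 30                   # 30 s window (audio_ctx = 3000 mel frames)
OVERLAP_S = 5
STEP_S = WINDOW_S - OVERLAP_S   # consecutive windows start 25 s apart


def windows(samples):
    """Yield (start_sample, chunk) pairs: 30 s chunks, 5 s overlap, processed serially."""
    size = WINDOW_S * SAMPLE_RATE
    step = STEP_S * SAMPLE_RATE
    start = 0
    while True:
        yield start, samples[start:start + size]
        if start + size >= len(samples):
            break
        start += step
```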
**Actual Performance** (Verified 2025-10-05):

| Audio Length | Processing Time | RTR | Status |
|--------------|-----------------|------|--------|
| 30s | ~10s | 0.33x | ✅ Stable |
| 60s | ~19s | 0.32x | ✅ Stable |
| 70s | ~22s | 0.31x | ✅ Verified |
| 120s | ~37s | 0.31x | ✅ Final |

**RTR (Real-Time Ratio)**: lower is better; an RTR of 0.31 means roughly 3.2x faster than real time.

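Worked out for the 120 s row above:

```python
audio_s, processing_s = 120, 37
rtr = processing_s / audio_s      # ~0.31: processing time / audio length (lower is better)
speedup = audio_s / processing_s  # ~3.2x faster than real time
```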
110
+ ### Comparison
111
+
112
+ | Configuration | 120s Audio | RTR | Note |
113
+ |---------------|------------|-----|------|
114
+ | **This Project (FP16 ANE + Q5_K)** | **~37s** | **0.31x** | βœ… Verified |
115
+ | Full GGML (Estimated) | ~72s | 0.60x | πŸ“Š Theoretical |
116
+
117
+ **Note**: "Full GGML" is theoretical estimation based on ANE acceleration ratio. Performance may vary based on:
118
+ - Audio content (speech density)
119
+ - System resources
120
+ - Background tasks
121
+
122
+ ---
123
+
124
+ ## πŸ”§ Technical Modifications for Breeze-ASR-25 Support
125
+
126
+ This project implements a hybrid inference architecture combining CoreML-accelerated Encoder with GGML-quantized Decoder to support Breeze-ASR-25 on Apple Silicon.
127
+
128
+ ### Why Official whisper.cpp Doesn't Work
129
+
130
+ Breeze-ASR-25 is a fine-tuned Whisper model with key differences:
131
+
132
+ - **Vocabulary Size**: 51,865 tokens (vs 51,864 in standard Whisper)
133
+ - **Sequence Length**: `max_source_positions=1500` (encoder output length)
134
+ - **Audio Window**: Supports 30-second audio (3000 mel frames)
135
+
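These values can be read straight off the upstream checkpoint's configuration; a minimal check using the standard `transformers.WhisperConfig` fields (expected values taken from the list above):

```python
from transformers import WhisperConfig

cfg = WhisperConfig.from_pretrained("MediaTek-Research/Breeze-ASR-25")
print(cfg.vocab_size)            # expected: 51865
print(cfg.max_source_positions)  # expected: 1500 (encoder output positions)
print(cfg.num_mel_bins)          # 80 mel bins; the input window is 3000 mel frames
```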
Official whisper.cpp assumptions:

1. ❌ Hardcodes `input_shape = (1, 80, 3000)` in CoreML conversion
2. ❌ Expects vocab_size=51,864
3. ❌ Lacks dynamic audio_ctx configuration API

### Our Key Modifications

#### 1. CoreML Conversion Script Enhancement

```python
# whisper.cpp/models/convert-whisper-to-coreml.py

# Dynamic sequence length (not hardcoded 3000)
input_shape = (1, hparams.n_mels, hparams.n_audio_ctx)

# Correct feature names for whisper.cpp compatibility
inputs=[ct.TensorType(name="mel", shape=input_shape)]
outputs=[ct.TensorType(name="encoder_output")]
```
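For context, one way `hparams` could be populated is directly from the checkpoint's `config.json`. The sketch below is illustrative rather than the script's exact code; the local file path is an assumption, and the x2 factor follows the conv-stride relationship described under Technical Insights:

```python
import json
import coremltools as ct

# Derive the encoder input shape from the model's own configuration
# instead of hardcoding (1, 80, 3000).
with open("Breeze-ASR-25/config.json") as f:
    cfg = json.load(f)

n_mels = cfg["num_mel_bins"]                # 80
n_frames = cfg["max_source_positions"] * 2  # 1500 output positions x conv stride 2 = 3000 mel frames

inputs = [ct.TensorType(name="mel", shape=(1, n_mels, n_frames))]
outputs = [ct.TensorType(name="encoder_output")]
# These are then passed to ct.convert() on the TorchScript-traced encoder,
# with compute_precision=ct.precision.FLOAT16 to target the ANE.
```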
#### 2. whisper.cpp API Extension

```cpp
// Added whisper_set_audio_ctx() for runtime configuration
// Allows models with smaller n_audio_ctx (like Breeze-ASR-25 with 1500)
// to work correctly instead of being padded to 30 seconds (3000 frames)
int whisper_set_audio_ctx(struct whisper_context * ctx, int n_audio_ctx);
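// Example usage (illustrative; exact value depends on the model): after
// whisper_init_*, request the model's native context so inputs are not
// padded out to 3000 frames:
//   whisper_set_audio_ctx(ctx, 1500);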
```

*Note: This is our custom modification to support Breeze-ASR-25; it is not yet in official whisper.cpp.*

#### 3. Modified whisper.cpp Fork

We maintain a fork with all necessary modifications:

**Repository**: [sheep52031/whisper.cpp](https://github.com/sheep52031/whisper.cpp) (branch: `breeze-asr-25-support`)

**Key modifications**:

- `whisper_set_audio_ctx()` API for dynamic audio context
- CoreML conversion enhancements for fine-tuned models
- Metal bfloat16 optimizations for M2+ GPUs
- Based on [Splend1d/whisper-patch-breeze](https://github.com/Splend1d/whisper-patch-breeze) for vocab support

**To use**:

```bash
git clone -b breeze-asr-25-support https://github.com/sheep52031/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build
```

#### 4. Hybrid Inference Architecture

```text
Audio Input (16kHz)
→ Log-Mel Features (80 × 3000)
→ CoreML Encoder (FP16, ANE-accelerated)
→ Hidden States [1, 1500, 1280]
→ GGML Decoder (Q5_K quantized, Metal GPU)
→ Text Output
```

### Technical Insights

#### Understanding `max_source_positions=1500`

- This is the encoder **output** sequence length
- Actual input length = 1500 × 2 (conv_stride) = 3000 mel frames
- Equivalent to 30 seconds of audio (100 fps)
- Common misconception: "1500 = 15 seconds" ❌ (see the arithmetic below)

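The arithmetic behind those bullets, spelled out:

```python
max_source_positions = 1500            # encoder OUTPUT sequence length
mel_frames = max_source_positions * 2  # conv stride 2 -> 3000 mel frames at the input
seconds = mel_frames / 100             # mel features run at 100 frames per second
print(seconds)                         # 30.0 -> a full 30-second window, not 15
```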
209
+ #### Why GGML Conversion Works But CoreML Fails
210
+
211
+ - GGML: Directly reads config.json, preserves tensor shapes, dynamic runtime
212
+ - CoreML: Requires TorchScript trace with fixed shapes, hardcoded assumptions
213
+ - Our fix: Make CoreML conversion respect model configuration
214
+
215
+ ### Contributions to Open Source
216
+
217
+ We've identified and fixed critical issues in whisper.cpp's CoreML conversion:
218
+
219
+ 1. βœ… Dynamic sequence length support (not just 3000 frames)
220
+ 2. βœ… Runtime audio_ctx configuration API
221
+ 3. βœ… Correct feature naming for hybrid inference
222
+
223
+ These modifications enable support for **all fine-tuned Whisper variants**, not just Breeze-ASR-25.
224
+
225
+ **Source Code**: All modifications are open-sourced at [sheep52031/whisper.cpp](https://github.com/sheep52031/whisper.cpp) (branch: `breeze-asr-25-support`)
226
+
227
+ ---
228
+
229
+ ## πŸ› οΈ Convert From Scratch
230
+
231
+ ### Requirements
232
+
233
+ \`\`\`bash
234
+ # macOS 13+, Xcode 14+, Python 3.9+
235
+ pip install -r conversion_tools/requirements.txt
236
+ \`\`\`
237
+
238
+ ### Convert Encoder
239
+
240
+ \`\`\`bash
241
+ cd conversion_tools
242
+ python convert_encoder.py --output ../encoder
243
+ \`\`\`
244
+
245
+ ### Convert Decoder
246
+
247
+ \`\`\`bash
248
+ cd conversion_tools
249
+ python convert_decoder.py --output ./output --quantize q5_k
250
+ \`\`\`
251
+
252
+ ---
253
+
254
+ ## βœ… Verification
255
+
256
+ \`\`\`bash
257
+ # Encoder precision check
258
+ cat encoder/ggml-breeze-asr-25-encoder.mlmodelc/metadata.json | grep dataType
259
+ # Should show: "dataType" : "Float16"
260
+
261
+ # Decoder SHA256 check
262
+ shasum -a 256 decoder/ggml-breeze-asr-25-q5k.bin
263
+ # Expected: 8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6
264
+ \`\`\`
265
+
266
+ ---
267
+
268
+ ## πŸ“„ License
269
+
270
+ Based on [MediaTek-Research/Breeze-ASR-25](https://huggingface.co/MediaTek-Research/Breeze-ASR-25) (Apache 2.0).
271
+
272
+ **Attribution**:
273
+ - CoreML ANE Optimization: sheep52031 (MIT License)
274
+ - GGML Conversion: alan314159
275
+
276
+ ---
277
+
278
+ ## πŸ™ Acknowledgments
279
+
280
+ - **MediaTek Research**: Breeze-ASR-25 model
281
+ - **alan314159**: GGML conversion & pretrained model
282
+ - **ggerganov**: whisper.cpp framework
283
+ - **Apple**: CoreML Tools & ANE
284
+ - **OpenAI**: Whisper base model
285
+
286
+ ---
287
+
288
+ **Last Updated**: 2025-10-06
289
+ **Version**: 1.0.0