---
license: apache-2.0
language:
- zh
- nan
tags:
- whisper
- asr
- taiwanese
- coreml
- ane
- breeze-asr-25
base_model: MediaTek-Research/Breeze-ASR-25
---

# Breeze-ASR-25 CoreML ANE Optimized

Taiwanese/Mandarin mixed speech recognition model optimized for the Apple Neural Engine (ANE).

## 🎯 Model Overview

Based on [MediaTek-Research/Breeze-ASR-25](https://huggingface.co/MediaTek-Research/Breeze-ASR-25), converted to CoreML format with ANE optimization for macOS/iOS.

### Model Components

| Component | File | Precision | Hardware | Size |
|-----------|------|-----------|----------|------|
| **Encoder** | `encoder/ggml-breeze-asr-25-encoder.mlmodelc/` | FP16 | ANE | ~1.2 GB |
| **Decoder** | `decoder/ggml-breeze-asr-25-q5k.bin` | Q5_K_M | CPU/GPU | ~1.0 GB |

### Decoder Attribution

GGML decoder from [alan314159/Breeze-ASR-25-whispercpp](https://huggingface.co/alan314159/Breeze-ASR-25-whispercpp). Thank you for sharing!

**SHA256**: `8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6`

---

## 🚀 Quick Start

### 1. Download Models

```bash
# Install the Hugging Face CLI
pip install huggingface_hub

# Download all models
huggingface-cli download sheep52031/breeze-asr-25-coreml-ane \
  --local-dir ./models
```

### 2. Swift Integration (macOS/iOS)

```swift
import CoreML
import whisper

// Load CoreML Encoder
let encoderURL = Bundle.main.url(
    forResource: "ggml-breeze-asr-25-encoder",
    withExtension: "mlmodelc"
)!

// Load GGML Decoder
let decoderPath = Bundle.main.path(
    forResource: "ggml-breeze-asr-25-q5k",
    ofType: "bin"
)!

// Initialize the whisper.cpp context
var ctxParams = whisper_context_default_params()
ctxParams.use_gpu = true // Enable Metal GPU; the CoreML encoder runs on the ANE
let ctx = whisper_init_from_file_with_params(decoderPath, ctxParams)
whisper_context_set_coreml_encoder(ctx, encoderURL.path)

// Transcribe (full-run parameters, greedy sampling)
let fullParams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
let audioData: [Float] = loadAudio("audio.wav") // your own 16 kHz mono PCM loader
whisper_full(ctx, fullParams, audioData, Int32(audioData.count))
let text = whisper_full_get_segment_text(ctx, 0)
print("Result: \(String(cString: text!))")
```

---

## 📊 Performance Benchmarks

### macOS (Apple Silicon)

**Test Environment**: MacBook Pro M1/M2, 16 GB RAM

**Configuration**:
- Window size: 30 s (audio_ctx=3000)
- Overlap: 5 s
- Processing: serial (parallelism=1)
- State management: shared state reuse

**Measured Performance** (verified 2025-10-05):

| Audio Length | Processing Time | RTR | Status |
|--------------|-----------------|-----|--------|
| 30 s | ~10 s | 0.33x | ✅ Stable |
| 60 s | ~19 s | 0.32x | ✅ Stable |
| 70 s | ~22 s | 0.31x | ✅ Verified |
| 120 s | ~37 s | 0.31x | ✅ Final |

**RTR (Real-Time Ratio)**: processing time divided by audio length; lower is better. An RTR of 0.31 means roughly 3.2x faster than real time.

### Comparison

| Configuration | 120 s Audio | RTR | Note |
|---------------|-------------|-----|------|
| **This Project (FP16 ANE + Q5_K)** | **~37 s** | **0.31x** | ✅ Verified |
| Full GGML (estimated) | ~72 s | 0.60x | 📊 Theoretical |

**Note**: "Full GGML" is a theoretical estimate derived from the ANE acceleration ratio. Performance may vary with:

- Audio content (speech density)
- System resources
- Background tasks

---

## 🔧 Technical Modifications for Breeze-ASR-25 Support

This project implements a hybrid inference architecture that pairs a CoreML-accelerated encoder with a GGML-quantized decoder to run Breeze-ASR-25 on Apple Silicon.
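As a sanity check on the benchmark figures above, the RTR and window-count arithmetic can be sketched in a few lines of Python. This is illustrative only: the 30 s window and 5 s overlap come from the configuration listed above, while the helper names and the ceiling-based window count are my own assumptions about how the serial windowing is tallied.

```python
import math

def real_time_ratio(processing_s: float, audio_s: float) -> float:
    """RTR = processing time / audio length; lower is better."""
    return processing_s / audio_s

def num_windows(audio_s: float, window_s: float = 30.0, overlap_s: float = 5.0) -> int:
    """Windows needed to cover the audio, stepping by (window - overlap) seconds."""
    if audio_s <= window_s:
        return 1
    step = window_s - overlap_s
    return 1 + math.ceil((audio_s - window_s) / step)

# 120 s clip processed in ~37 s
print(round(real_time_ratio(37, 120), 2))  # 0.31, i.e. ~3.2x faster than real time
print(num_windows(120))                    # 5 serial windows of 30 s with 5 s overlap
```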
### Why Official whisper.cpp Doesn't Work

Breeze-ASR-25 is a fine-tuned Whisper model with key differences:

- **Vocabulary Size**: 51,865 tokens (vs. 51,864 in standard Whisper)
- **Sequence Length**: `max_source_positions=1500` (encoder output length)
- **Audio Window**: supports 30-second audio (3000 mel frames)

Official whisper.cpp assumes:

1. ❌ A hardcoded `input_shape = (1, 80, 3000)` in the CoreML conversion
2. ❌ `vocab_size=51,864`
3. ❌ No dynamic audio_ctx configuration API

### Our Key Modifications

#### 1. CoreML Conversion Script Enhancement

```python
# whisper.cpp/models/convert-whisper-to-coreml.py

# Dynamic sequence length (not hardcoded 3000)
input_shape = (1, hparams.n_mels, hparams.n_audio_ctx)

# Correct feature names for whisper.cpp compatibility
inputs=[ct.TensorType(name="mel", shape=input_shape)]
outputs=[ct.TensorType(name="encoder_output")]
```

#### 2. whisper.cpp API Extension

```cpp
// Added whisper_set_audio_ctx() for runtime configuration.
// Allows models with a smaller n_audio_ctx (like Breeze-ASR-25 with 1500)
// to work correctly instead of being padded to 30 seconds (3000 frames).
int whisper_set_audio_ctx(struct whisper_context * ctx, int n_audio_ctx);
```

*Note: this is our custom modification to support Breeze-ASR-25; it is not yet in official whisper.cpp.*

#### 3. Modified whisper.cpp Fork

We maintain a fork with all necessary modifications:

**Repository**: [sheep52031/whisper.cpp](https://github.com/sheep52031/whisper.cpp) (branch: `breeze-asr-25-support`)

**Key modifications**:

- `whisper_set_audio_ctx()` API for dynamic audio context
- CoreML conversion enhancements for fine-tuned models
- Metal bfloat16 optimizations for M2+ GPUs
- Based on [Splend1d/whisper-patch-breeze](https://github.com/Splend1d/whisper-patch-breeze) for vocab support

**To use**:

```bash
git clone -b breeze-asr-25-support https://github.com/sheep52031/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build
```

#### 4. Hybrid Inference Architecture

```text
Audio Input (16 kHz)
  → Log-Mel Features (80 × 3000)
  → CoreML Encoder (FP16, ANE-accelerated)
  → Hidden States [1, 1500, 1280]
  → GGML Decoder (Q5_K quantized, Metal GPU)
  → Text Output
```

### Technical Insights

#### Understanding `max_source_positions=1500`

- This is the encoder **output** sequence length
- Actual input length = 1500 × 2 (conv stride) = 3000 mel frames
- Equivalent to 30 seconds of audio (100 frames per second)
- Common misconception: "1500 = 15 seconds" ❌

#### Why GGML Conversion Works But CoreML Fails

- GGML: reads config.json directly, preserves tensor shapes, resolves shapes at runtime
- CoreML: requires a TorchScript trace with fixed shapes, so hardcoded assumptions leak in
- Our fix: make the CoreML conversion respect the model configuration

### Contributions to Open Source

We've identified and fixed critical issues in whisper.cpp's CoreML conversion:

1. ✅ Dynamic sequence length support (not just 3000 frames)
2. ✅ Runtime audio_ctx configuration API
3. ✅ Correct feature naming for hybrid inference

These modifications enable support for **all fine-tuned Whisper variants**, not just Breeze-ASR-25.
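The `max_source_positions` arithmetic described above is easy to verify numerically. A minimal sketch, using only the constants stated in this section (conv stride 2, 100 mel frames per second):

```python
# Encoder *output* sequence length, as reported in config.json
max_source_positions = 1500

# The conv front-end of the Whisper encoder downsamples time by 2x overall,
# so the encoder *input* is twice as long.
conv_stride = 2
mel_frames = max_source_positions * conv_stride
assert mel_frames == 3000

# Log-mel features are produced at 100 frames per second,
# so 3000 frames correspond to a 30-second audio window.
seconds = mel_frames / 100
assert seconds == 30.0  # 1500 is not "15 seconds" — it maps to 30 s of audio
```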
**Source Code**: All modifications are open-sourced at [sheep52031/whisper.cpp](https://github.com/sheep52031/whisper.cpp) (branch: `breeze-asr-25-support`)

---

## 🛠️ Convert From Scratch

### Requirements

```bash
# macOS 13+, Xcode 14+, Python 3.9+
pip install -r conversion_tools/requirements.txt
```

### Convert Encoder

```bash
cd conversion_tools
python convert_encoder.py --output ../encoder
```

### Convert Decoder

```bash
cd conversion_tools
python convert_decoder.py --output ./output --quantize q5_k
```

---

## ✅ Verification

```bash
# Encoder precision check
cat encoder/ggml-breeze-asr-25-encoder.mlmodelc/metadata.json | grep dataType
# Should show: "dataType" : "Float16"

# Decoder SHA256 check
shasum -a 256 decoder/ggml-breeze-asr-25-q5k.bin
# Expected: 8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6
```

---

## 📄 License

Based on [MediaTek-Research/Breeze-ASR-25](https://huggingface.co/MediaTek-Research/Breeze-ASR-25) (Apache 2.0).

**Attribution**:

- CoreML ANE optimization: sheep52031 (MIT License)
- GGML conversion: alan314159

---

## 🙏 Acknowledgments

- **MediaTek Research**: Breeze-ASR-25 model
- **alan314159**: GGML conversion & pretrained model
- **ggerganov**: whisper.cpp framework
- **Apple**: CoreML Tools & ANE
- **OpenAI**: Whisper base model

---

**Last Updated**: 2025-10-06
**Version**: 1.0.0