---
license: apache-2.0
language:
- zh
- nan
tags:
- whisper
- asr
- taiwanese
- coreml
- ane
- breeze-asr-25
base_model: MediaTek-Research/Breeze-ASR-25
---

# Breeze-ASR-25 CoreML ANE Optimized

Taiwanese/Mandarin mixed speech recognition model optimized for the Apple Neural Engine (ANE).

## 🎯 Model Overview

Based on [MediaTek-Research/Breeze-ASR-25](https://huggingface.co/MediaTek-Research/Breeze-ASR-25), converted to CoreML format with ANE optimization for macOS/iOS.

### Model Components

| Component | File | Precision | Hardware | Size |
|-----------|------|-----------|----------|------|
| **Encoder** | `encoder/ggml-breeze-asr-25-encoder.mlmodelc/` | FP16 | ANE | ~1.2 GB |
| **Decoder** | `decoder/ggml-breeze-asr-25-q5k.bin` | Q5_K_M | CPU/GPU | ~1.0 GB |

### Decoder Attribution

GGML decoder from [alan314159/Breeze-ASR-25-whispercpp](https://huggingface.co/alan314159/Breeze-ASR-25-whispercpp). Thank you for sharing!

**SHA256**: `8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6`

---

## 🚀 Quick Start

### 1. Download Models

```bash
# Install the Hugging Face CLI
pip install huggingface_hub

# Download all models
huggingface-cli download sheep52031/breeze-asr-25-coreml-ane \
  --local-dir ./models
```

### 2. Swift Integration (macOS/iOS)

```swift
import CoreML
import whisper

// Load CoreML Encoder
let encoderURL = Bundle.main.url(
    forResource: "ggml-breeze-asr-25-encoder",
    withExtension: "mlmodelc"
)!

// Load GGML Decoder
let decoderPath = Bundle.main.path(
    forResource: "ggml-breeze-asr-25-q5k",
    ofType: "bin"
)!

// Initialize the whisper.cpp context
var ctxParams = whisper_context_default_params()
ctxParams.use_gpu = true // Enable Metal GPU; the CoreML encoder runs on the ANE
let ctx = whisper_init_from_file_with_params(decoderPath, ctxParams)
whisper_context_set_coreml_encoder(ctx, encoderURL.path)

// Transcribe (full-run parameters, greedy sampling)
let fullParams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
let audioData: [Float] = loadAudio("audio.wav") // your own 16 kHz mono PCM loader
whisper_full(ctx, fullParams, audioData, Int32(audioData.count))
let text = whisper_full_get_segment_text(ctx, 0)
print("Result: \(String(cString: text!))")
```

---

## 📊 Performance Benchmarks

### macOS (Apple Silicon)

**Test Environment**: MacBook Pro M1/M2, 16 GB RAM

**Configuration**:
- Window size: 30 s (audio_ctx=3000)
- Overlap: 5 s
- Processing: serial (parallelism=1)
- State management: shared state reuse

**Measured Performance** (verified 2025-10-05):

| Audio Length | Processing Time | RTR | Status |
|--------------|-----------------|-----|--------|
| 30 s | ~10 s | 0.33x | ✅ Stable |
| 60 s | ~19 s | 0.32x | ✅ Stable |
| 70 s | ~22 s | 0.31x | ✅ Verified |
| 120 s | ~37 s | 0.31x | ✅ Final |

**RTR (Real-Time Ratio)**: processing time divided by audio length; lower is better. An RTR of 0.31 means roughly 3.2x faster than real time.

### Comparison

| Configuration | 120 s Audio | RTR | Note |
|---------------|-------------|-----|------|
| **This Project (FP16 ANE + Q5_K)** | **~37 s** | **0.31x** | ✅ Verified |
| Full GGML (estimated) | ~72 s | 0.60x | 📊 Theoretical |

**Note**: "Full GGML" is a theoretical estimate derived from the ANE acceleration ratio. Performance may vary with:

- Audio content (speech density)
- System resources
- Background tasks

---

## 🔧 Technical Modifications for Breeze-ASR-25 Support

This project implements a hybrid inference architecture that pairs a CoreML-accelerated encoder with a GGML-quantized decoder to run Breeze-ASR-25 on Apple Silicon.
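As a sanity check on the benchmark figures above, the RTR and window-count arithmetic can be sketched in a few lines of Python. This is illustrative only: the 30 s window and 5 s overlap come from the configuration listed above, while the helper names and the ceiling-based window count are my own assumptions about how the serial windowing is tallied.

```python
import math

def real_time_ratio(processing_s: float, audio_s: float) -> float:
    """RTR = processing time / audio length; lower is better."""
    return processing_s / audio_s

def num_windows(audio_s: float, window_s: float = 30.0, overlap_s: float = 5.0) -> int:
    """Windows needed to cover the audio, stepping by (window - overlap) seconds."""
    if audio_s <= window_s:
        return 1
    step = window_s - overlap_s
    return 1 + math.ceil((audio_s - window_s) / step)

# 120 s clip processed in ~37 s
print(round(real_time_ratio(37, 120), 2))  # 0.31, i.e. ~3.2x faster than real time
print(num_windows(120))                    # 5 serial windows of 30 s with 5 s overlap
```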
### Why Official whisper.cpp Doesn't Work

Breeze-ASR-25 is a fine-tuned Whisper model with key differences:

- **Vocabulary Size**: 51,865 tokens (vs. 51,864 in standard Whisper)
- **Sequence Length**: `max_source_positions=1500` (encoder output length)
- **Audio Window**: supports 30-second audio (3000 mel frames)

Official whisper.cpp assumes:

1. ❌ A hardcoded `input_shape = (1, 80, 3000)` in the CoreML conversion
2. ❌ `vocab_size=51,864`
3. ❌ No dynamic audio_ctx configuration API

### Our Key Modifications

#### 1. CoreML Conversion Script Enhancement

```python
# whisper.cpp/models/convert-whisper-to-coreml.py

# Dynamic sequence length (not hardcoded 3000)
input_shape = (1, hparams.n_mels, hparams.n_audio_ctx)

# Correct feature names for whisper.cpp compatibility
inputs=[ct.TensorType(name="mel", shape=input_shape)]
outputs=[ct.TensorType(name="encoder_output")]
```

#### 2. whisper.cpp API Extension

```cpp
// Added whisper_set_audio_ctx() for runtime configuration.
// Allows models with a smaller n_audio_ctx (like Breeze-ASR-25 with 1500)
// to work correctly instead of being padded to 30 seconds (3000 frames).
int whisper_set_audio_ctx(struct whisper_context * ctx, int n_audio_ctx);
```

*Note: this is our custom modification to support Breeze-ASR-25; it is not yet in official whisper.cpp.*

#### 3. Modified whisper.cpp Fork

We maintain a fork with all necessary modifications:

**Repository**: [sheep52031/whisper.cpp](https://github.com/sheep52031/whisper.cpp) (branch: `breeze-asr-25-support`)

**Key modifications**:

- `whisper_set_audio_ctx()` API for dynamic audio context
- CoreML conversion enhancements for fine-tuned models
- Metal bfloat16 optimizations for M2+ GPUs
- Based on [Splend1d/whisper-patch-breeze](https://github.com/Splend1d/whisper-patch-breeze) for vocab support

**To use**:

```bash
git clone -b breeze-asr-25-support https://github.com/sheep52031/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build
```

#### 4. Hybrid Inference Architecture

```text
Audio Input (16 kHz)
  → Log-Mel Features (80 × 3000)
  → CoreML Encoder (FP16, ANE-accelerated)
  → Hidden States [1, 1500, 1280]
  → GGML Decoder (Q5_K quantized, Metal GPU)
  → Text Output
```

### Technical Insights

#### Understanding `max_source_positions=1500`

- This is the encoder **output** sequence length
- Actual input length = 1500 × 2 (conv stride) = 3000 mel frames
- Equivalent to 30 seconds of audio (100 frames per second)
- Common misconception: "1500 = 15 seconds" ❌

#### Why GGML Conversion Works But CoreML Fails

- GGML: reads config.json directly, preserves tensor shapes, resolves shapes at runtime
- CoreML: requires a TorchScript trace with fixed shapes, so hardcoded assumptions leak in
- Our fix: make the CoreML conversion respect the model configuration

### Contributions to Open Source

We've identified and fixed critical issues in whisper.cpp's CoreML conversion:

1. ✅ Dynamic sequence length support (not just 3000 frames)
2. ✅ Runtime audio_ctx configuration API
3. ✅ Correct feature naming for hybrid inference

These modifications enable support for **all fine-tuned Whisper variants**, not just Breeze-ASR-25.
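The `max_source_positions` arithmetic described above is easy to verify numerically. A minimal sketch, using only the constants stated in this section (conv stride 2, 100 mel frames per second):

```python
# Encoder *output* sequence length, as reported in config.json
max_source_positions = 1500

# The conv front-end of the Whisper encoder downsamples time by 2x overall,
# so the encoder *input* is twice as long.
conv_stride = 2
mel_frames = max_source_positions * conv_stride
assert mel_frames == 3000

# Log-mel features are produced at 100 frames per second,
# so 3000 frames correspond to a 30-second audio window.
seconds = mel_frames / 100
assert seconds == 30.0  # 1500 is not "15 seconds" — it maps to 30 s of audio
```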
**Source Code**: All modifications are open-sourced at [sheep52031/whisper.cpp](https://github.com/sheep52031/whisper.cpp) (branch: `breeze-asr-25-support`)

---

## 🛠️ Convert From Scratch

### Requirements

```bash
# macOS 13+, Xcode 14+, Python 3.9+
pip install -r conversion_tools/requirements.txt
```

### Convert Encoder

```bash
cd conversion_tools
python convert_encoder.py --output ../encoder
```

### Convert Decoder

```bash
cd conversion_tools
python convert_decoder.py --output ./output --quantize q5_k
```

---

## ✅ Verification

```bash
# Encoder precision check
cat encoder/ggml-breeze-asr-25-encoder.mlmodelc/metadata.json | grep dataType
# Should show: "dataType" : "Float16"

# Decoder SHA256 check
shasum -a 256 decoder/ggml-breeze-asr-25-q5k.bin
# Expected: 8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6
```

---

## 📄 License

Based on [MediaTek-Research/Breeze-ASR-25](https://huggingface.co/MediaTek-Research/Breeze-ASR-25) (Apache 2.0).

**Attribution**:

- CoreML ANE optimization: sheep52031 (MIT License)
- GGML conversion: alan314159

---

## 🙏 Acknowledgments

- **MediaTek Research**: Breeze-ASR-25 model
- **alan314159**: GGML conversion & pretrained model
- **ggerganov**: whisper.cpp framework
- **Apple**: CoreML Tools & ANE
- **OpenAI**: Whisper base model

---

**Last Updated**: 2025-10-06
**Version**: 1.0.0