---
language: en
datasets:
- librispeech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form transcription
- librispeech
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer-Large-En-Libri-960h
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: test-clean
      type: librispeech
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.69
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: test-other
      type: librispeech
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 6.91
---
# **ChunkFormer-Large-En-Libri-960h: ChunkFormer-Large Pretrained on 960 Hours of LibriSpeech**
[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) |
[GitHub](https://github.com/khanld/chunkformer) |
[Paper](https://arxiv.org/abs/2502.14673)

---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)
---
## Model Description
**ChunkFormer-Large-En-Libri-960h** is an English Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model has been fine-tuned on 960 hours of LibriSpeech, a widely used dataset for ASR research.
---
## Documentation and Implementation
The [Documentation]() and [Implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.
---
## Benchmark Results
We evaluate the models using **Word Error Rate (WER)**, where lower is better. To ensure a fair comparison, all models are trained exclusively with the [**WeNet**](https://github.com/wenet-e2e/wenet) framework.
| No. | Model | Test-Clean WER (%) | Test-Other WER (%) | Avg. WER (%) |
|-----|-------|--------------------|--------------------|--------------|
| 1 | **ChunkFormer** | 2.69 | 6.91 | 4.80 |
| 2 | **Efficient Conformer** | 2.71 | 6.95 | 4.83 |
| 3 | **Conformer** | 2.77 | 6.93 | 4.85 |
| 4 | **Squeezeformer** | 2.87 | 7.16 | 5.02 |
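
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and its reference, divided by the number of reference words. The snippet below is a minimal sketch of that definition only. It is not the scoring script used to produce the table above, the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution ("a" -> "the") over four reference words.
print(word_error_rate("this is a test", "this is the test"))  # 25.0
```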
---
## Quick Usage
To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps:
### Option 1: Install from PyPI (Recommended)
```bash
pip install chunkformer
```
### Option 2: Install from source
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -e .
```
### Python API Usage
```python
from chunkformer import ChunkFormerModel

# Load the English model from Hugging Face
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-large-en-libri-960h")

# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True,
)
print(transcription)

# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800,  # total batch duration in seconds
)
for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")
```
### Command Line Usage
After installation, you can use the command line interface:
```bash
chunkformer-decode \
    --model_checkpoint khanhld/chunkformer-large-en-libri-960h \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
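
If the timestamped lines need further processing (for example, to build subtitles), they can be parsed into numeric segments. The snippet below is a minimal sketch that assumes the output follows exactly the `[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text` layout shown in the example above. The names `Segment` and `parse_segments` are hypothetical helpers and are not part of the `chunkformer` package.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str


# Assumed line layout: "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text"
TIMESTAMP = r"\[(\d+):(\d{2}):(\d{2}(?:\.\d+)?)\]"
LINE_RE = re.compile(TIMESTAMP + r"\s*-\s*" + TIMESTAMP + r":\s*(.*)")


def to_seconds(hours: str, minutes: str, seconds: str) -> float:
    """Convert the three timestamp fields into seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_segments(output: str) -> List[Segment]:
    """Parse timestamped lines; lines that do not match are skipped."""
    segments = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        h1, m1, s1, h2, m2, s2, text = match.groups()
        segments.append(
            Segment(to_seconds(h1, m1, s1), to_seconds(h2, m2, s2), text)
        )
    return segments


example = """\
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
"""
for segment in parse_segments(example):
    print(segment.start, segment.end, segment.text)
```

Skipping non-matching lines keeps the parser tolerant of log messages or blank lines mixed into the output, at the cost of silently ignoring any line whose format differs from the assumed one.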
For **advanced usage**, see the [ChunkFormer GitHub README](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage).
---
## Citation
If you use this work in your research, please cite:
```bibtex
@INPROCEEDINGS{10888640,
author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
doi={10.1109/ICASSP49660.2025.10888640}
}
```
---
## Contact
- khanhld218@gmail.com
- [GitHub](https://github.com/khanld)
- [LinkedIn](https://www.linkedin.com/in/khanhld257/)