---
tags:
- speech-recognition
- audio
- chunkformer
- ctc
- pytorch
- transformers
- automatic-speech-recognition
- long-form transcription
- asr
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
---
# ChunkFormer Model

<style>
img {
    display: inline;
}
</style>
[GitHub](https://github.com/khanld/chunkformer)
[Paper](https://arxiv.org/abs/2502.14673)
## Usage

Install the package:

```bash
pip install chunkformer
```
```python
from chunkformer import ChunkFormerModel

# Load the model
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-small-libri-100h")

# For long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/your/audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    return_timestamps=True,
)
print(transcription)

# For batch processing
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
)
print(transcriptions)
```
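The exact structure of the timestamped output depends on your chunkformer version. Assuming each segment can be unpacked as a `(start_seconds, end_seconds, text)` tuple (an assumption, not the library's documented format), a small helper like this can render the result in a readable form:

```python
# NOTE: the (start, end, text) tuple layout is an assumption; adapt the
# unpacking below to whatever endless_decode(..., return_timestamps=True)
# actually returns in your chunkformer version.

def format_segments(segments):
    """Render (start_sec, end_sec, text) tuples, one '[HH:MM:SS.mmm - HH:MM:SS.mmm] text' line each."""
    def ts(seconds):
        hours, rem = divmod(seconds, 3600)
        minutes, secs = divmod(rem, 60)
        return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"
    return "\n".join(f"[{ts(start)} - {ts(end)}] {text}" for start, end, text in segments)

print(format_segments([(0.0, 2.48, "hello world"), (2.48, 5.1, "this is chunkformer")]))
```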
## Training

This model was trained with the ChunkFormer framework. For details on the training process and access to the source code, visit https://github.com/khanld/chunkformer

Paper: https://arxiv.org/abs/2502.14673
## Citation

If you use this work in your research, please cite:

```bibtex
@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}}
```