---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---
# 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper
This repository hosts SE-DiCoW, a target-speaker multi-talker automatic speech recognition (TS-ASR) model developed by BUT Speech@FIT in collaboration with JHU CLSP/HLTCOE and CMU LTI.
## 🔧 Key Innovations

- **Self-Enrollment (SE):** Automatically selects the most informative segment of the target speaker within a conversation and integrates it via cross-attention at each encoder layer.
- **Improved Initialization & Segmentation:** Refined FDDT initialization and corrected data segmentation for more stable training.
- **Augmentations** (a minimal sketch of the first two follows this list):
  - Gaussian noise injection into STNO masks
  - Segment-wise flipping of dominant STNO classes
  - Joint SpecAugment on input + STNO
  - MUSAN noise mixing
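
To make the mask augmentations concrete, here is a minimal, hypothetical sketch of the first two items (Gaussian noise injection and segment-wise class flipping). It assumes frame-level STNO masks shaped frames × 4, i.e. per-frame distributions over silence/target/non-target/overlap; helper names, class ordering, and parameter values are illustrative assumptions, not the project's training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(stno: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Perturb frame-level STNO masks (frames x 4) with Gaussian noise and
    renormalize each frame back to a valid class distribution."""
    noisy = np.clip(stno + rng.normal(0.0, sigma, stno.shape), 0.0, None)
    return noisy / np.clip(noisy.sum(axis=-1, keepdims=True), 1e-8, None)

def flip_dominant_class(stno: np.ndarray, start: int, end: int) -> np.ndarray:
    """Within frames [start, end), replace the segment's dominant STNO class
    with a different, randomly chosen one, simulating diarization errors."""
    flipped = stno.copy()
    segment = flipped[start:end]
    dominant = int(segment.sum(axis=0).argmax())
    new_class = rng.choice([c for c in range(stno.shape[-1]) if c != dominant])
    segment[segment.argmax(axis=-1) == dominant] = np.eye(stno.shape[-1])[new_class]
    return flipped

# Example: 100 frames labelled 'target', lightly perturbed,
# then a 20-frame span flipped to another class.
stno = np.tile(np.eye(4)[1], (100, 1))
stno = flip_dominant_class(add_gaussian_noise(stno), start=40, end=60)
```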
➡️ Together, these changes yield a 49.7% tcpWER reduction over the original DiCoW on the EMMA MT-ASR benchmark, with gains of over 70% on the heavily overlapped Libri3Mix.
## 🛠️ Model Usage
```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"

# trust_remote_code loads the custom SE-DiCoW model code shipped with the repository.
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
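
For a slightly fuller loading sketch, standard device and dtype handling can be added. The keyword arguments below are generic transformers options, not SE-DiCoW-specific; the full inference pipeline (including diarization conditioning) is covered by the source repositories referenced below.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"

# Load the custom SE-DiCoW model code and place it on GPU in half precision
# when available; these are ordinary transformers options, not model-specific.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
model.eval()
```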
➡️ Training and inference pipelines: see the Source Repositories section below.
## 🏆 Performance

**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker)

- SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both oracle and real diarization, particularly in highly overlapped conditions (Libri3Mix).
- Achieves performance that is state-of-the-art or comparable to domain-tuned systems on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.
## 📦 Model Details

**Base model:** Whisper large-v3-turbo

**Training datasets:**
- NOTSOFAR-1
- AMI Meeting Corpus
- Libri2Mix / Libri3Mix
- LibriSpeech synthetic mixtures
## 🧬 Source Repositories
## 📚 Related Publications

- 📰 **ICASSP 2026:** SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper
- 📰 **Computer Speech & Language (2026):** DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR
- 📰 **ICASSP 2025:** Target Speaker ASR with Whisper
## 📝 Citation
If you use this model, please cite the following works:
```bibtex
@INPROCEEDINGS{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  year={2026},
  pages={1-5},
}

@article{POLOK2026101841,
  author={Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
  title={DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal={Computer Speech \& Language},
  volume={95},
  pages={101841},
  year={2026},
  doi={10.1016/j.csl.2025.101841},
}

@INPROCEEDINGS{10887683,
  author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Target Speaker ASR with Whisper},
  year={2025},
  doi={10.1109/ICASSP49660.2025.10887683},
}
```
## 📬 Contact

For questions or collaboration inquiries:

- 📧 Email: [email protected]
- 🏢 Affiliation: BUT Speech@FIT, Brno University of Technology
- 🔗 GitHub: BUTSpeechFIT
