---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---

# 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper

This repository hosts the **SE-DiCoW** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT), in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**, tailored for **target-speaker multi-talker automatic speech recognition (TS-ASR)**.

## 🔧 Key Innovations

* **Self-Enrollment (SE):** Automatically selects the most informative segment of the target speaker within a conversation and integrates it via **cross-attention** at each encoder layer.
* **Improved Initialization & Segmentation:** Refined FDDT initialization and corrected data segmentation for more stable training.
* **Augmentations:**
  - Gaussian noise injection into STNO masks
  - Segment-wise flipping of dominant STNO classes
  - Joint **SpecAugment** on the input features and STNO masks
  - **MUSAN** noise mixing

➡️ Together, these yield a **49.7% tcpWER reduction** over the original DiCoW on the **EMMA MT-ASR benchmark**, with gains of over **70%** on heavily overlapped Libri3Mix.

---

## 🛠️ Model Usage

```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"

model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```

➡️ Training and inference pipelines:

* [**Training Code (TS-ASR-Whisper)**](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* [**Inference Code**](https://github.com/BUTSpeechFIT/DiCoW)
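
For a quick smoke test, the sketch below runs a plain Whisper-style decoding pass with the loaded checkpoint. It is a minimal illustration under assumptions: it presumes the repository exposes a standard Whisper processor via `AutoProcessor` and that `generate` runs without explicit diarization conditioning; `audio.wav` is a placeholder path, and `torchaudio` is used only for loading and resampling. For proper target-speaker transcription with STNO masks and self-enrollment, use the DiCoW inference pipeline linked above.

```python
# Minimal sketch, not the official pipeline: assumes a standard Whisper-style
# processor and that generate() works without explicit diarization conditioning.
# "audio.wav" is a placeholder path.
import torch
import torchaudio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

MODEL_NAME = "BUT-FIT/SE_DiCoW"

# If the repository ships no processor config, the base
# "openai/whisper-large-v3-turbo" processor can be loaded instead.
processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
model.eval()

# Load and resample to 16 kHz mono, as expected by Whisper's feature extractor.
waveform, sr = torchaudio.load("audio.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

# Compute log-mel input features and decode.
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)

print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```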
---

## 🏆 Performance

**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker)

* SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both **oracle** and **real diarization**, particularly in highly overlapped conditions (Libri3Mix).
* Achieves performance that is **state-of-the-art** or comparable to domain-tuned systems on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.

🔗 [**EMMA-MT ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)

---

## 📦 Model Details

* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Training Datasets:**
  * [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
  * [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
  * [Libri2Mix / Libri3Mix](https://github.com/JorisCos/LibriMix)
  * [LibriSpeech](https://www.openslr.org/12) synthetic mixtures

---

## 🧬 Source Repositories

* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* 🚀 [Inference (DiCoW)](https://github.com/BUTSpeechFIT/DiCoW)

---

## 📚 Related Publications

* 📰 **ICASSP 2026:** *SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper* [IEEE ICASSP 2026]
* 📰 **Journal Paper (CSL 2026):** *DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR* [Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X)
* 📰 **ICASSP 2025:** *Target Speaker ASR with Whisper* [IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683)

---

## 📝 Citation

If you use this model, please cite the following works:

```bibtex
@inproceedings{polok2026sedicow,
  author    = {Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle = {ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  year      = {2026},
  pages     = {1-5},
}

@article{POLOK2026101841,
  author  = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
  title   = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal = {Computer Speech \& Language},
  volume  = {95},
  pages   = {101841},
  year    = {2026},
  doi     = {10.1016/j.csl.2025.101841},
}

@inproceedings{10887683,
  author    = {Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {Target Speaker ASR with Whisper},
  year      = {2025},
  doi       = {10.1109/ICASSP49660.2025.10887683},
}
```

---

## 📬 Contact

For questions or collaboration inquiries:

📧 **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)
🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology
🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT)