+---
+library_name: transformers
+tags:
+- speech
+- automatic-speech-recognition
+- whisper
+- multilingual
+- speaker-diarization
+- meeting-transcription
+- target-speaker-asr
+- SE-DiCoW
+- BUT-FIT
+pipeline_tag: automatic-speech-recognition
+license: apache-2.0
+datasets:
+- microsoft/NOTSOFAR
+- edinburghcstr/ami
+- LibriSpeechMix
+- LibriMix
+---
+# 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper
+This repository hosts the **SE-DiCoW** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT), in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**, tailored for **target-speaker multi-talker automatic speech recognition (TS-ASR)**.
+## 🔧 Key Innovations
+* **Self-Enrollment (SE):**
+  Automatically selects the most informative segment of the target speaker within a conversation and integrates it via **cross-attention** at each encoder layer.
+* **Improved Initialization & Segmentation:**
+  Refined initialization of the frame-level diarization-dependent transformations (FDDT) and corrected data segmentation for more stable training.
+* **Augmentations:**
+  - Gaussian noise injection into STNO (silence, target, non-target, overlap) masks
+  - Segment-wise flipping of dominant STNO classes
+  - Joint **SpecAugment** on input + STNO
+  - **MUSAN** noise mixing
+➡️ Together, these yield a **49.7% tcpWER reduction** over the original DiCoW on the **EMMA MT-ASR benchmark**, with gains of over **70%** on heavily overlapped Libri3Mix.
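
To make the first two STNO mask augmentations listed above more concrete, here is a minimal NumPy sketch. It is not the TS-ASR-Whisper implementation: the function names, the `(frames, 4)` mask layout, the renormalisation step, and the reading of "segment-wise flipping" as swapping a segment's dominant class with a randomly chosen other class are all assumptions made for illustration.

```python
import numpy as np


def add_gaussian_noise(stno, std, rng):
    """Perturb per-frame STNO probabilities with Gaussian noise.

    Assumption: `stno` has shape (frames, 4) and each row sums to 1.
    Rows are clipped and renormalised so they stay valid distributions.
    """
    noisy = np.clip(stno + rng.normal(0.0, std, size=stno.shape), 1e-6, None)
    return noisy / noisy.sum(axis=-1, keepdims=True)


def flip_dominant_class(stno, seg_len, flip_prob, rng):
    """Swap the dominant class of randomly selected fixed-length segments.

    Assumption: "flipping" means exchanging the column of the segment's
    dominant class with the column of a random other class.
    """
    out = stno.copy()
    n_classes = stno.shape[1]
    for start in range(0, len(stno), seg_len):
        if rng.random() >= flip_prob:
            continue  # leave this segment untouched
        seg = out[start:start + seg_len]  # view into `out`
        dominant = int(seg.sum(axis=0).argmax())
        other = int(rng.choice([c for c in range(n_classes) if c != dominant]))
        seg[:, [dominant, other]] = seg[:, [other, dominant]]
    return out


rng = np.random.default_rng(0)
stno = np.eye(4)[[0, 1, 1, 3, 2, 2]]  # hypothetical one-hot mask, 6 frames
augmented = flip_dominant_class(
    add_gaussian_noise(stno, std=0.1, rng=rng), seg_len=3, flip_prob=0.5, rng=rng
)
print(augmented.shape)  # (6, 4)
```

Both functions return a new array and keep every frame a valid probability vector, so they can be chained in either order. The noise level, segment length, and flip probability shown here are placeholders, not the values used to train SE-DiCoW.
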
+---
+## 🛠️ Model Usage
+```python
+from transformers import AutoModelForSpeechSeq2Seq
+MODEL_NAME = "BUT-FIT/SE_DiCoW"
+model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
+```
+➡️ Training and inference pipelines:
+* [**Training Code (TS-ASR-Whisper)**](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
+* [**Inference Code**](https://github.com/BUTSpeechFIT/DiCoW)
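
SE-DiCoW is conditioned on diarization output, so loading the checkpoint alone is not enough to transcribe a recording. As a rough illustration of what that conditioning signal could look like, the sketch below maps a binary speaker-activity matrix to per-frame STNO (silence, target, non-target, overlap) labels for one target speaker. The function name, the class order, and the hard-label formulation are assumptions for this example and are not taken from the linked repositories, which remain the reference for the actual preprocessing.

```python
import numpy as np

# Class ids are an assumption for this sketch: Silence, Target, Non-target, Overlap.
SILENCE, TARGET, NON_TARGET, OVERLAP = range(4)


def stno_from_activity(activity, target):
    """Map a binary (speakers, frames) activity matrix to per-frame STNO ids."""
    activity = np.asarray(activity).astype(bool)
    target_active = activity[target]
    # True wherever at least one speaker other than the target is talking.
    others_active = np.delete(activity, target, axis=0).any(axis=0)

    stno = np.full(activity.shape[1], SILENCE, dtype=np.int64)
    stno[target_active & ~others_active] = TARGET
    stno[~target_active & others_active] = NON_TARGET
    stno[target_active & others_active] = OVERLAP
    return stno


# Hypothetical diarization output: 2 speakers, 5 frames.
activity = np.array([
    [0, 1, 1, 0, 0],  # speaker 0
    [0, 0, 1, 1, 0],  # speaker 1
])
print(stno_from_activity(activity, target=0))  # [0 1 3 2 0]
```

Calling the function once per speaker gives one mask per target, which matches the target-speaker setting in which each speaker is decoded separately.
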
+---
+## 🏆 Performance
+**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker)
+* SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both **oracle** and **real diarization**, particularly in highly overlapped conditions (Libri3Mix).
+* Achieves **state-of-the-art** performance, or performance comparable to domain-tuned systems, on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.
+🔗 [**EMMA-MT ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)
+---
+## 📦 Model Details
+* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
+* **Training Datasets:**
+  * [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
+  * [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
+  * [Libri2Mix / Libri3Mix](https://github.com/JorisCos/LibriMix)
+  * [LibriSpeech](https://www.openslr.org/12) synthetic mixtures
+---
+## 🧬 Source Repositories
+* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
+* 🚀 [Inference (DiCoW)](https://github.com/BUTSpeechFIT/DiCoW)
+---
+## 📚 Related Publications
+* 📰 **ICASSP 2026:**
+  *SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper*
+  IEEE ICASSP 2026
+* 📰 **Journal Paper (CSL 2026):**
+  *DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR*
+  [Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X)
+* 📰 **ICASSP 2025:**
+  *Target Speaker ASR with Whisper*
+  [IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683)
+---
+## 📝 Citation
+If you use this model, please cite the following works:
+```bibtex
+@INPROCEEDINGS{polok2026sedicow,
+  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
+  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
+  year={2026},
+  pages={1-5},
+}
+@article{POLOK2026101841,
+    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
+    journal = {Computer Speech \& Language},
+    volume = {95},
+    pages = {101841},
+    year = {2026},
+    doi = {10.1016/j.csl.2025.101841},
+    author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
+}
+@INPROCEEDINGS{10887683,
+  author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
+  booktitle={ICASSP 2025},
+  title={Target Speaker ASR with Whisper},
+  year={2025},
+  doi={10.1109/ICASSP49660.2025.10887683}
+}
+```
+---
+## 📬 Contact
+For questions or collaboration inquiries:
+📧 **Email:** [[email protected]](mailto:[email protected])
+🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology
+🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT)