---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---
# 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper
This repository hosts the **SE-DiCoW** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT), in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**, tailored for **target-speaker multi-talker automatic speech recognition (TS-ASR)**.
## 🔧 Key Innovations
* **Self-Enrollment (SE):**
Automatically selects the most informative segment of the target speaker within a conversation and integrates it via **cross-attention** at each encoder layer.
* **Improved Initialization & Segmentation:**
Refined FDDT (Frame-Level Diarization-Dependent Transformations) initialization and corrected data segmentation for more stable training.
* **Augmentations:**
- Gaussian noise injection into STNO (silence / target / non-target / overlap) masks
- Segment-wise flipping of dominant STNO classes
- Joint **SpecAugment** on the input features and STNO masks
- **MUSAN** noise mixing
➡️ Together, these yield a **49.7% tcpWER reduction** over the original DiCoW on the **EMMA MT-ASR benchmark**, with gains of over **70%** on the heavily overlapped Libri3Mix subset.
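The first two mask augmentations can be sketched as follows, assuming frame-level STNO masks stored as an `(n_frames, 4)` probability array; the function name, hyperparameters, and shapes are illustrative and not the released training code:

```python
import numpy as np

def augment_stno(stno, noise_std=0.05, flip_prob=0.1, seg_len=50, rng=None):
    """Inject Gaussian noise into an (n_frames, 4) STNO mask and, per
    fixed-length segment, randomly swap the dominant class with another."""
    rng = rng or np.random.default_rng()
    out = np.clip(stno + rng.normal(0.0, noise_std, size=stno.shape), 0.0, None)
    for start in range(0, len(out), seg_len):
        if rng.random() < flip_prob:
            seg = out[start:start + seg_len]           # view into `out`
            dominant = int(seg.sum(axis=0).argmax())   # dominant STNO class
            other = int(rng.choice([c for c in range(4) if c != dominant]))
            seg[:, [dominant, other]] = seg[:, [other, dominant]]
    return out / out.sum(axis=1, keepdims=True)        # renormalise per frame

# Example: a mask dominated by the Target class.
stno = np.tile(np.array([0.1, 0.7, 0.1, 0.1]), (200, 1))
aug = augment_stno(stno, rng=np.random.default_rng(0))
```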
![SE-DiCoW Architecture](./SE-DiCoW_figure.png)
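The self-enrollment mechanism can be illustrated with a single-head cross-attention step; the actual model applies multi-head cross-attention inside each Whisper encoder layer, so the shapes, function name, and residual connection below are simplifying assumptions:

```python
import numpy as np

def enroll_cross_attention(frames, enroll):
    """frames: (T, d) encoder states; enroll: (E, d) embeddings of the
    selected enrollment segment. Returns (T, d) conditioned states."""
    d = frames.shape[-1]
    scores = frames @ enroll.T / np.sqrt(d)        # (T, E) scaled similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over enrollment frames
    return frames + weights @ enroll               # residual add

rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 8))   # 6 encoder frames, model dim 8
enroll = rng.normal(size=(3, 8))   # 3 enrollment frames
conditioned = enroll_cross_attention(frames, enroll)
```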
---
## 🛠️ Model Usage
```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"

# trust_remote_code=True is required: SE-DiCoW ships custom modeling code.
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
➡️ Training and inference pipelines:
* [**Training Code (TS-ASR-Whisper)**](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* [**Inference Code**](https://github.com/BUTSpeechFIT/DiCoW)
---
## 🏆 Performance
**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker)
* SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both **oracle** and **real diarization**, particularly in highly overlapped conditions (Libri3Mix).
* Achieves **state-of-the-art** or comparable performance to domain-tuned systems on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.
🔗 [**EMMA-MT ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)
---
## 📦 Model Details
* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Training Datasets:**
* [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
* [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
* [Libri2Mix / Libri3Mix](https://github.com/JorisCos/LibriMix)
* [LibriSpeech](https://www.openslr.org/12) synthetic mixtures
---
## 🧬 Source Repositories
* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* 🚀 [Inference (DiCoW)](https://github.com/BUTSpeechFIT/DiCoW)
---
## 📚 Related Publications
* 📰 **ICASSP 2026:**
*SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper*
[IEEE ICASSP 2026]
* 📰 **Journal Paper (CSL 2026):**
*DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR*
[Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X)
* 📰 **ICASSP 2025:**
*Target Speaker ASR with Whisper*
[IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683)
---
## 📝 Citation
If you use this model, please cite the following works:
```bibtex
@INPROCEEDINGS{polok2026sedicow,
author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
year={2026},
pages={1-5},
}
@article{POLOK2026101841,
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
journal = {Computer Speech \& Language},
volume = {95},
pages = {101841},
year = {2026},
doi = {10.1016/j.csl.2025.101841},
author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
}
@INPROCEEDINGS{10887683,
author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Target Speaker ASR with Whisper},
year={2025},
doi={10.1109/ICASSP49660.2025.10887683}
}
```
---
## 📬 Contact
For questions or collaboration inquiries:
📧 **Email:** [[email protected]](mailto:[email protected])
🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology
🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT)