|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- speech |
|
|
- automatic-speech-recognition |
|
|
- whisper |
|
|
- multilingual |
|
|
- speaker-diarization |
|
|
- meeting-transcription |
|
|
- target-speaker-asr |
|
|
- SE-DiCoW |
|
|
- BUT-FIT |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- microsoft/NOTSOFAR |
|
|
- edinburghcstr/ami |
|
|
- LibriSpeechMix |
|
|
- LibriMix |
|
|
--- |
|
|
|
|
|
# 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper |
|
|
|
|
|
This repository hosts **SE-DiCoW**, a model for **target-speaker multi-talker automatic speech recognition (TS-ASR)** developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**.
|
|
|
|
|
## 🔧 Key Innovations |
|
|
|
|
|
* **Self-Enrollment (SE):** |
|
|
Automatically selects the most informative segment of the target speaker within a conversation and integrates it via **cross-attention** at each encoder layer. |
|
|
* **Improved Initialization & Segmentation:** |
|
|
Refined FDDT initialization and corrected data segmentation for more stable training. |
|
|
* **Augmentations** (see the illustrative sketch after this list):
|
|
- Gaussian noise injection into STNO masks
|
|
- Segment-wise flipping of dominant STNO classes |
|
|
- Joint **SpecAugment** applied to the input features and STNO masks
|
|
- **MUSAN** noise mixing |
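
To make the STNO-mask augmentations more concrete, here is a minimal, illustrative sketch (not the actual training code). It assumes a soft STNO mask of shape `(T, 4)` holding per-frame probabilities over the four classes (silence, target, non-target, overlap); the function names, noise scale, and segment length are hypothetical choices for this example.

```python
# Illustrative sketch of two STNO-mask augmentations (not the actual training code).
# Assumes a soft mask of shape (T, 4): per-frame probabilities over the four
# STNO classes (Silence, Target, Non-target, Overlap).
import torch

def add_gaussian_noise(stno_mask: torch.Tensor, std: float = 0.1) -> torch.Tensor:
    """Perturb the soft STNO probabilities with Gaussian noise and renormalize."""
    noisy = (stno_mask + std * torch.randn_like(stno_mask)).clamp(min=0.0)
    return noisy / noisy.sum(dim=-1, keepdim=True).clamp(min=1e-6)

def flip_dominant_segment(stno_mask: torch.Tensor, max_len: int = 50) -> torch.Tensor:
    """Pick a random segment and swap its dominant class with another random class."""
    T = stno_mask.shape[0]
    start = torch.randint(0, T, (1,)).item()
    end = min(T, start + torch.randint(1, max_len + 1, (1,)).item())
    out = stno_mask.clone()
    dominant = out[start:end].mean(dim=0).argmax().item()
    other = (dominant + torch.randint(1, 4, (1,)).item()) % 4  # guaranteed != dominant
    out[start:end, [dominant, other]] = out[start:end, [other, dominant]]
    return out

# Example: 100 frames that are purely "target" (class 1) before augmentation.
mask = torch.zeros(100, 4)
mask[:, 1] = 1.0
augmented = flip_dominant_segment(add_gaussian_noise(mask))
```

The intent of such perturbations is to make the model robust to imperfect diarization conditioning at inference time.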
|
|
|
|
|
➡️ Together, these changes yield a **49.7% tcpWER reduction** over the original DiCoW on the **EMMA MT-ASR benchmark**, with reductions of more than **70%** on the heavily overlapped Libri3Mix.
|
|
|
|
|
 |
|
|
--- |
|
|
|
|
|
## 🛠️ Model Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForSpeechSeq2Seq |
|
|
|
|
|
MODEL_NAME = "BUT-FIT/SE_DiCoW" |
|
|
# trust_remote_code loads the custom diarization-conditioned architecture from the model repo
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
|
|
```
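
Note that SE-DiCoW also consumes diarization-conditioning inputs (per-speaker STNO masks), which are constructed by the inference pipeline linked below rather than by the snippet above. As a minimal, illustrative sketch of the audio preprocessing side only, the following assumes the standard Whisper processor of the base model (whisper-large-v3-turbo); the dummy waveform and printed shape are purely for illustration:

```python
# Minimal preprocessing sketch (illustrative only): standard Whisper-style
# log-Mel feature extraction with the base model's processor.
# The diarization-conditioning inputs (STNO masks) required by SE-DiCoW are
# NOT built here; see the DiCoW inference repository for the full pipeline.
import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

# Dummy 30 s mono waveform at 16 kHz as a stand-in for real conversational audio.
waveform = np.zeros(16_000 * 30, dtype=np.float32)

inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
print(inputs.input_features.shape)  # torch.Size([1, 128, 3000]) for large-v3-style models
```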
|
|
|
|
|
➡️ Training and inference pipelines: |
|
|
|
|
|
* [**Training Code (TS-ASR-Whisper)**](https://github.com/BUTSpeechFIT/TS-ASR-Whisper) |
|
|
* [**Inference Code**](https://github.com/BUTSpeechFIT/DiCoW) |
|
|
|
|
|
--- |
|
|
|
|
|
## 🏆 Performance |
|
|
|
|
|
**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker) |
|
|
|
|
|
* SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both **oracle** and **real diarization**, particularly in highly overlapped conditions (Libri3Mix). |
|
|
* Achieves performance that is **state-of-the-art** or comparable to domain-tuned systems on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.
|
|
|
|
|
🔗 [**EMMA MT-ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)
|
|
|
|
|
--- |
|
|
|
|
|
## 📦 Model Details |
|
|
|
|
|
* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) |
|
|
* **Training Datasets:** |
|
|
|
|
|
* [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge) |
|
|
* [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/) |
|
|
* [Libri2Mix / Libri3Mix](https://github.com/JorisCos/LibriMix) |
|
|
* [LibriSpeech](https://www.openslr.org/12) synthetic mixtures |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧬 Source Repositories |
|
|
|
|
|
* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper) |
|
|
* 🚀 [Inference (DiCoW)](https://github.com/BUTSpeechFIT/DiCoW) |
|
|
|
|
|
--- |
|
|
|
|
|
## 📚 Related Publications |
|
|
|
|
|
* 📰 **ICASSP 2026:** |
|
|
*SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper* |
|
|
IEEE ICASSP 2026
|
|
|
|
|
* 📰 **Journal Paper (CSL 2026):** |
|
|
*DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR* |
|
|
[Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X) |
|
|
|
|
|
* 📰 **ICASSP 2025:** |
|
|
*Target Speaker ASR with Whisper* |
|
|
[IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683) |
|
|
|
|
|
--- |
|
|
|
|
|
## 📝 Citation |
|
|
|
|
|
If you use this model, please cite the following works: |
|
|
|
|
|
```bibtex |
|
|
@INPROCEEDINGS{polok2026sedicow, |
|
|
author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš}, |
|
|
booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, |
|
|
title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper}, |
|
|
year={2026}, |
|
|
pages={1-5}, |
|
|
} |
|
|
|
|
|
@article{POLOK2026101841, |
|
|
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition}, |
|
|
journal = {Computer Speech \& Language},
|
|
volume = {95}, |
|
|
pages = {101841}, |
|
|
year = {2026}, |
|
|
doi = {10.1016/j.csl.2025.101841},
|
|
author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget}, |
|
|
} |
|
|
|
|
|
@INPROCEEDINGS{10887683, |
|
|
author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš}, |
|
|
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
|
|
title={Target Speaker ASR with Whisper}, |
|
|
year={2025}, |
|
|
doi={10.1109/ICASSP49660.2025.10887683} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 📬 Contact |
|
|
|
|
|
For questions or collaboration inquiries: |
|
|
|
|
|
📧 **Email:** [[email protected]](mailto:[email protected]) |
|
|
|
|
|
🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology |
|
|
|
|
|
🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT) |
|
|
|