---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---

# 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper

This repository hosts the **SE-DiCoW** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**, tailored for **target-speaker multi-talker automatic speech recognition (TS-ASR)**.

## 🔧 Key Innovations

* **Self-Enrollment (SE):**
  Automatically selects the most informative segment of the target speaker within a conversation and integrates it via **cross-attention** at each encoder layer.
* **Improved Initialization & Segmentation:**
  Refined FDDT (Frame-level Diarization-Dependent Transformations) initialization and corrected data segmentation for more stable training.
* **Augmentations:**
  - Gaussian noise injection into STNO (silence/target/non-target/overlap) masks
  - Segment-wise flipping of dominant STNO classes
  - Joint **SpecAugment** on the input features and STNO masks
  - **MUSAN** noise mixing

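The first two mask-level augmentations can be sketched roughly as follows, assuming the STNO masks are per-frame probability distributions over the four classes. The noise scale, segment length, and flip probability below are illustrative placeholders, not the values used in training:

```python
import numpy as np

def augment_stno(stno, noise_std=0.1, flip_prob=0.2, seg_len=50, rng=None):
    """Sketch of two STNO augmentations: Gaussian noise injection and
    segment-wise flipping of the dominant class.

    stno: array of shape (T, 4), soft per-frame masks over
    {silence, target, non-target, overlap}.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = stno.copy()

    # 1) Gaussian noise injection: perturb the soft masks, then clip and
    #    renormalize so each frame stays a valid distribution.
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    out = np.clip(out, 1e-6, None)
    out = out / out.sum(axis=1, keepdims=True)

    # 2) Segment-wise flipping: for random fixed-length segments, swap the
    #    dominant class column with a randomly chosen other class column.
    T = out.shape[0]
    for start in range(0, T, seg_len):
        if rng.random() < flip_prob:
            seg = slice(start, min(start + seg_len, T))
            dom = int(out[seg].sum(axis=0).argmax())  # dominant class in segment
            other = int(rng.choice([c for c in range(4) if c != dom]))
            out[seg, [dom, other]] = out[seg, [other, dom]]
    return out
```

Renormalizing after the noise step keeps the conditioning input on the simplex; the flip simulates diarization errors where the wrong speaker activity is attributed to a segment.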
➡️ Together, these changes yield a **49.7% relative tcpWER reduction** over the original DiCoW on the **EMMA MT-ASR benchmark**, with over **70% gains** on the heavily overlapped Libri3Mix subset.

---

## 🛠️ Model Usage

```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```

➡️ Training and inference pipelines:

* [**Training Code (TS-ASR-Whisper)**](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* [**Inference Code**](https://github.com/BUTSpeechFIT/DiCoW)

---

## 🏆 Performance

**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker)

* SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both **oracle** and **real diarization**, particularly in highly overlapped conditions (Libri3Mix).
* Achieves **state-of-the-art** or comparable performance to domain-tuned systems on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.

🔗 [**EMMA-MT ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)

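For reference, the percentage gains quoted here are relative error-rate reductions, 100 × (baseline − improved) / baseline. A quick sanity check with made-up tcpWER values (not numbers from the benchmark):

```python
def relative_reduction(baseline, improved):
    """Relative error-rate reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline

# Purely illustrative values: a drop from 20.0% to 10.06% tcpWER
# corresponds to a 49.7% relative reduction.
print(round(relative_reduction(20.0, 10.06), 1))  # → 49.7
```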
---

## 📦 Model Details

* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Training Datasets:**
  * [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
  * [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
  * [Libri2Mix / Libri3Mix](https://github.com/JorisCos/LibriMix)
  * [LibriSpeech](https://www.openslr.org/12) synthetic mixtures

---

## 🧬 Source Repositories

* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* 🚀 [Inference (DiCoW)](https://github.com/BUTSpeechFIT/DiCoW)

---

## 📚 Related Publications

* 📰 **ICASSP 2026:**
  *SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper*
  IEEE ICASSP 2026

* 📰 **Journal Paper (CSL 2026):**
  *DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR*
  [Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X)

* 📰 **ICASSP 2025:**
  *Target Speaker ASR with Whisper*
  [IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683)

---

## 📝 Citation

If you use this model, please cite the following works:

```bibtex
@inproceedings{polok2026sedicow,
  author    = {Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle = {ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  year      = {2026},
  pages     = {1-5},
}

@article{POLOK2026101841,
  author  = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
  title   = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal = {Computer Speech & Language},
  volume  = {95},
  pages   = {101841},
  year    = {2026},
  doi     = {10.1016/j.csl.2025.101841},
}

@inproceedings{10887683,
  author    = {Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {Target Speaker ASR with Whisper},
  year      = {2025},
  doi       = {10.1109/ICASSP49660.2025.10887683},
}
```

---

## 📬 Contact

For questions or collaboration inquiries:

📧 **Email:** [[email protected]](mailto:[email protected])

🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology

🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT)