Commit d4fefc1 by nicolajreck (verified) · 1 parent: fd90081

Upload readme.md with huggingface_hub

  1. readme.md +213 -0
readme.md ADDED
@@ -0,0 +1,213 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - da
5
+ base_model: sesame/csm-1b
6
+ tags:
7
+ - text-to-speech
8
+ - tts
9
+ - danish
10
+ - lora
11
+ - csm
12
+ - audio-generation
13
+ - speech-synthesis
14
+ library_name: transformers
15
+ pipeline_tag: text-to-speech
16
+ datasets:
17
+ - mozilla-foundation/common_voice_17_0
18
+ - CoRal-project/coral-tts
19
+ ---
20
+
21
+ # CSM-1B Danish Text-to-Speech (LoRA)
22
+
23
+ A natural-sounding Danish text-to-speech model based on CSM-1B, fine-tuned using LoRA (Low-Rank Adaptation) on a combination of Common Voice 17, CoRal-TTS, and private Danish speech data. Authored by [Nicolaj Reck](https://www.linkedin.com/in/nicolaj-reck-053aa38a/).
24
+
25
+ ## Model Description
26
+
27
+ This model is a LoRA adapter for [`sesame/csm-1b`](https://huggingface.co/sesame/csm-1b) that enables natural Danish speech synthesis with optional voice control. The adapter was trained specifically for Danish TTS while preserving the multilingual capabilities of the base model.
28
+
29
+ - **Base Model**: [`sesame/csm-1b`](https://huggingface.co/sesame/csm-1b)
30
+ - **Language**: Danish (da)
31
+ - **Task**: Text-to-Speech
32
+ - **License**: Apache 2.0
33
+ - **Model Type**: LoRA Adapter
34
+ - **Precision**: FP16/BF16
35
+
36
+ ## Key Features
37
+
38
+ - 🎯 **Natural Danish synthesis** with clear pronunciation and fluent prosody
39
+ - 🇬🇧 **English with a Danish accent** - useful for bilingual content
40
+ - 🔄 **Voice control** with male/female speaker selection
41
+ - ⚡ **Efficient fine-tuning** using LoRA (only ~16M parameters trained)
42
+ - 🛡️ **Voice leakage prevention** through frozen speaker/codec modules
43
+ - 📱 **Ready-to-use Gradio interface** included
44
+
45
+ ## Quick Start
46
+
47
+ ### Installation
48
+
49
+ ```bash
50
+ pip install transformers torch torchaudio gradio
51
+ ```
52
+
53
+ ### Basic Usage
54
+
55
+ ```python
56
+ import torch
57
+ from transformers import CsmForConditionalGeneration, AutoProcessor
58
+
59
+ # Load model and processor
60
+ model = CsmForConditionalGeneration.from_pretrained("nicolajreck/csm-1b-danish-tts").to("cuda")  # same device as the inputs below
61
+ processor = AutoProcessor.from_pretrained("nicolajreck/csm-1b-danish-tts")
62
+
63
+ # Generate speech
64
+ text = "[1]Hej! Velkommen til dansk talesyntese." # [1] for female voice
65
+ inputs = processor(text, add_special_tokens=True).to("cuda")
66
+ audio = model.generate(**inputs, output_audio=True)
67
+
68
+ # Save audio
69
+ processor.save_audio(audio, "output.wav")
70
+ ```
71
+
72
+ ### Web Interface
73
+
74
+ Launch the included Gradio interface:
75
+
76
+ ```bash
77
+ python danish_tts.py
78
+ ```
79
+
80
+ Then open `http://localhost:7860` in a browser to use the interactive TTS interface.
81
+
82
+ ## Voice Control
83
+
84
+ The model supports two speaker voices:
85
+ - `[0]` - Male voice
86
+ - `[1]` - Female voice
87
+
88
+ Simply prefix your Danish text with the speaker token:
89
+ - `[0]God morgen! Hvordan har du det?` (Male)
90
+ - `[1]God morgen! Hvordan har du det?` (Female)
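
The prefixing rule above can be wrapped in a small helper. This is a minimal sketch: the `SPEAKER_IDS` mapping and the `with_speaker` function are hypothetical names introduced here, not part of the model or of `transformers`; only the `[0]`/`[1]` token convention comes from this README.

```python
# Hypothetical helper (not part of this repository): builds the prompt string
# using the [0]/[1] speaker tokens described above.
SPEAKER_IDS = {"male": 0, "female": 1}


def with_speaker(text: str, voice: str = "female") -> str:
    """Return `text` prefixed with the speaker token for `voice`."""
    try:
        speaker_id = SPEAKER_IDS[voice.lower()]
    except KeyError:
        raise ValueError(f"voice must be one of {sorted(SPEAKER_IDS)}, got {voice!r}")
    return f"[{speaker_id}]{text.strip()}"


prompt = with_speaker("God morgen! Hvordan har du det?", voice="male")
print(prompt)
```

The resulting string can be passed to the processor in the same way as the hand-written prompts in Basic Usage.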
91
+
92
+ ## Training Details
93
+
94
+ ### Training Data
95
+
96
+ The model was trained on a carefully curated mix of Danish speech data:
97
+
98
+ - **[Common Voice 17 Danish](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)**: ~10,224 validated samples
99
+ - **[CoRal-TTS Danish](https://huggingface.co/datasets/CoRal-project/coral-tts)**: ~16,547 filtered samples
100
+ - **Private Extension**: ~8,644 additional samples
101
+
102
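+ Total: ~35,415 Danish speech samples. The three sources differ in size, so balance comes from the equal-probability mixing described under Data Processing rather than from the raw counts.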
103
+
104
+ ### Training Configuration
105
+
106
+ - **Method**: LoRA (Low-Rank Adaptation)
107
+ - **Rank**: 16, Alpha: 32, Dropout: 0.05
108
+ - **Target Modules**: `{q_proj, k_proj, v_proj, o_proj, out_proj, gate_proj, up_proj, down_proj, fc1, fc2}`
109
+ - **Hardware**: Single RTX 3090 (24GB)
110
+ - **Precision**: FP16 training, supports FP16/BF16 inference
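
To put the rank and alpha values in context: a rank-`r` LoRA update on a linear layer with `d_in` inputs and `d_out` outputs adds two small matrices, so it trains `r * (d_in + d_out)` parameters, and the update is scaled by `alpha / r`. The sketch below works this out. The 2048 x 2048 layer shape is an illustrative assumption, not a dimension taken from CSM-1B, and the function names are made up for this example.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by LoRA to one linear layer.

    LoRA learns A (rank x d_in) and B (d_out x rank); the frozen weight
    matrix itself is not counted.
    """
    return rank * d_in + d_out * rank


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update B @ A (standard LoRA scaling)."""
    return alpha / rank


RANK, ALPHA = 16, 32  # values from the training configuration above

# Hypothetical layer shape, for illustration only.
added = lora_param_count(d_in=2048, d_out=2048, rank=RANK)
full = 2048 * 2048
print(f"LoRA parameters for this layer: {added:,} ({added / full:.2%} of the full matrix)")
print(f"Update scaling alpha / r: {lora_scaling(ALPHA, RANK)}")
```

Summing this count over every targeted projection in the model is what yields a total trainable-parameter figure such as the ~16M quoted under Key Features.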
111
+
112
+ ### Data Processing
113
+
114
+ - Duration filtering: 0.6-16 seconds
115
+ - Text normalization: Quote stripping, terminal punctuation
116
+ - Equal-probability dataset mixing to prevent bias
117
+ - Chat-style formatting with Danish language cue
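
The filtering, normalization, and mixing steps listed above could look roughly like this. It is a sketch under assumptions: the function names, the sample dictionaries, and the exact normalization rules (which quote characters are stripped, which punctuation is appended) are illustrative and not taken from the actual training scripts.

```python
import random

MIN_SECONDS, MAX_SECONDS = 0.6, 16.0  # duration window from the list above
QUOTE_CHARS = "\"'«»“”„"  # assumed set of quote characters to strip
TERMINAL_PUNCTUATION = ".!?"


def keep_by_duration(duration_s: float) -> bool:
    """Keep clips whose length lies inside the 0.6-16 second window."""
    return MIN_SECONDS <= duration_s <= MAX_SECONDS


def normalize_text(text: str) -> str:
    """Strip surrounding quotes and make sure the text ends with punctuation."""
    text = text.strip().strip(QUOTE_CHARS).strip()
    if text and text[-1] not in TERMINAL_PUNCTUATION:
        text += "."
    return text


def sample_mixed(datasets: dict, rng: random.Random):
    """Pick a dataset uniformly at random, then a sample uniformly within it.

    Each dataset is chosen with equal probability regardless of its size,
    so the largest corpus does not dominate training.
    """
    name = rng.choice(sorted(datasets))
    return name, rng.choice(datasets[name])


# Toy data standing in for the three corpora (sizes deliberately unequal).
datasets = {
    "common_voice": [{"text": '"Hej med dig"', "duration": 2.1}] * 10,
    "coral_tts": [{"text": "God morgen!", "duration": 3.4}] * 16,
    "private": [{"text": "Tak for i dag", "duration": 0.3}] * 8,
}
# Apply the duration filter before mixing, then drop datasets left empty.
datasets = {
    name: [s for s in samples if keep_by_duration(s["duration"])]
    for name, samples in datasets.items()
}
datasets = {name: samples for name, samples in datasets.items() if samples}

rng = random.Random(0)
name, sample = sample_mixed(datasets, rng)
print(name, "->", normalize_text(sample["text"]))
```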
118
+
119
+ ## Recommended Settings
120
+
121
+ For the most natural and fluent speech, use these generation parameters:
122
+
123
+ ```python
124
+ # Natural speech settings
125
+ audio = model.generate(
126
+     **inputs,
127
+     output_audio=True,
128
+     do_sample=True,
129
+     temperature=0.96,
130
+     depth_decoder_temperature=0.7,
131
+     top_k=50,
132
+     top_p=0.9,
133
+     repetition_penalty=1.0
134
+ )
135
+ ```
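
For readers unfamiliar with these parameters, the sketch below shows how temperature, top-k, and top-p are conventionally combined when sampling one token from a list of logits. It is a generic pure-Python illustration on toy logits, not the `transformers` implementation, and the function name `filter_distribution` is made up for this example.

```python
import math


def filter_distribution(logits, temperature=0.96, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p.

    1. Divide logits by the temperature (values < 1 sharpen the distribution).
    2. Keep only the `top_k` highest-scoring tokens.
    3. Keep the smallest prefix of those whose cumulative probability
       reaches `top_p`, then renormalize.
    """
    scaled = [x / temperature for x in logits]
    # Top-k: indices of the k largest scaled logits, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p (nucleus): stop once the cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in zip(ranked, probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    mass = sum(p for _, p in kept)
    return {index: p / mass for index, p in kept}


toy_logits = [4.0, 3.0, 1.0, 0.5, -2.0]
for token, prob in filter_distribution(toy_logits).items():
    print(f"token {token}: {prob:.3f}")
```

Lowering `top_p` or `top_k` narrows the set of candidate tokens, which tends to make output more stable at the cost of variety; `repetition_penalty=1.0` leaves the logits unchanged.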
136
+
137
+ ## Example Outputs
138
+
139
+ The model handles various Danish text types effectively:
140
+
141
+ | Danish Text | Audio |
142
+ |-------------|-------|
143
+ | *"Husk at gemme arbejdet, før computeren genstarter, ellers risikerer du at miste både filer og vigtige ændringer."* | <audio controls><source src="./tts_examples/technical_instructions.wav" type="audio/wav"><source src="./tts_examples/technical_instructions.mp3" type="audio/mp3">Your browser does not support the audio element.</audio> |
144
+ | *"Pakken leveres i morgen mellem 9 og 12, og du får en SMS-besked, så snart den er klar til afhentning."* | <audio controls><source src="./tts_examples/service_message.wav" type="audio/wav"><source src="./tts_examples/service_message.mp3" type="audio/mp3">Your browser does not support the audio element.</audio> |
145
+ | *"Vi gør opmærksom på, at toget mod Københavns Hovedbanegård er forsinket med omkring 15 minutter. Vi undskylder ventetiden."* | <audio controls><source src="./tts_examples/announcement.wav" type="audio/wav"><source src="./tts_examples/announcement.mp3" type="audio/mp3">Your browser does not support the audio element.</audio> |
146
+ | *"Når du planlægger en rejse, kan det betale sig at undersøge, både transportmuligheder, overnatning og oplevelser inden da. Sådan får du mest muligt ud af tiden, og du slipper for unødvendig stress undervejs."* | <audio controls><source src="./tts_examples/travel_planning.wav" type="audio/wav"><source src="./tts_examples/travel_planning.mp3" type="audio/mp3">Your browser does not support the audio element.</audio> |
147
+
148
+ ## Performance
149
+
150
+ Compared to the base CSM-1B model, the adapter gives these qualitative improvements on Danish text:
151
+ - ✅ Clearer pronunciation and word articulation
152
+ - ✅ More natural rhythm and speaking flow
153
+ - ✅ Fewer dropped sounds
154
+ - ✅ A pleasant voice across different text types
155
+
156
+ ## Gradio Interface Features
157
+
158
+ The included `danish_tts.py` provides a comprehensive web interface with:
159
+
160
+ - **Three-column layout**: Input settings, sampling controls, audio output
161
+ - **Auto max-length calculation** with adjustable multiplier
162
+ - **Advanced parameter control**: Dual temperatures, Top-K/Top-P, repetition penalty
163
+ - **Pre-configured examples** with optimized settings
164
+ - **Real-time generation** and audio playback
165
+
166
+ ## Limitations
167
+
168
+ - Optimized specifically for Danish - other languages may have reduced quality
169
+ - Requires base model `sesame/csm-1b` to function
170
+ - Voice control limited to male/female binary selection
171
+ - Generated audio should be identified as synthetic in production use
172
+
173
+ ## Technical Details
174
+
175
+ ### Model Architecture
176
+ - **Base**: CSM-1B, a Llama-style backbone decoder paired with a smaller depth decoder
177
+ - **Audio Format**: 24 kHz, decoded from generated audio codec tokens
178
+ - **LoRA Integration**: Language projections only, speaker/codec frozen
179
+ - **Memory Requirements**: ~8GB VRAM for inference
180
+
181
+ ### Files Included
182
+ - LoRA adapter weights
183
+ - Processor configuration
184
+ - Gradio web interface (`danish_tts.py`)
185
+ - Training scripts and utilities
186
+
187
+ ## Citation
188
+
189
+ If you use this model, please cite:
190
+
191
+ ```bibtex
192
+ @misc{csm1b-danish-2024,
193
+ title={High-Quality Danish Text-to-Speech with CSM-1B: Data Mixing, Voice Control, and LoRA Fine-Tuning},
194
+ author={Nicolaj Reck},
195
+ year={2024},
196
+ howpublished={\url{https://huggingface.co/nicolajreck/csm-1b-danish-tts}},
197
+ note={LinkedIn: https://www.linkedin.com/in/nicolaj-reck-053aa38a/}
198
+ }
199
+ ```
200
+
201
+ ## Acknowledgments
202
+
203
+ **Authored by**: [Nicolaj Reck](https://www.linkedin.com/in/nicolaj-reck-053aa38a/)
204
+
205
+ Thanks to:
206
+ - **[Mozilla Foundation](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)** for the Common Voice 17 dataset
207
+ - **[CoRal-TTS project](https://huggingface.co/datasets/CoRal-project/coral-tts)** for the Danish speech corpus
208
+ - **[Sesame Research](https://huggingface.co/sesame/csm-1b)** for the base CSM-1B model
209
+ - The open-source community for tools and frameworks
210
+
211
+ ## License
212
+
213
+ This model is released under the Apache 2.0 license. Please see the base model license for additional terms.