Upload readme.md with huggingface_hub
readme.md
ADDED
@@ -0,0 +1,213 @@
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
language:
|
| 4 |
+
- da
|
| 5 |
+
base_model: sesame/csm-1b
|
| 6 |
+
tags:
|
| 7 |
+
- text-to-speech
|
| 8 |
+
- tts
|
| 9 |
+
- danish
|
| 10 |
+
- lora
|
| 11 |
+
- csm
|
| 12 |
+
- audio-generation
|
| 13 |
+
- speech-synthesis
|
| 14 |
+
library_name: transformers
|
| 15 |
+
pipeline_tag: text-to-speech
|
| 16 |
+
datasets:
|
| 17 |
+
- mozilla-foundation/common_voice_17_0
|
| 18 |
+
- CoRal-project/coral-tts
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
# CSM-1B Danish Text-to-Speech (LoRA)
|
| 22 |
+
|
| 23 |
+
A natural-sounding Danish text-to-speech model based on CSM-1B, fine-tuned using LoRA (Low-Rank Adaptation) on a combination of Common Voice 17, CoRal-TTS, and private Danish speech data. Authored by [Nicolaj Reck](https://www.linkedin.com/in/nicolaj-reck-053aa38a/).
|
| 24 |
+
|
| 25 |
+
## Model Description
|
| 26 |
+
|
| 27 |
+
This model is a LoRA adapter for [`sesame/csm-1b`](https://huggingface.co/sesame/csm-1b) that enables natural Danish speech synthesis with optional voice control. The adapter was trained specifically for Danish TTS while preserving the multilingual capabilities of the base model.
|
| 28 |
+
|
| 29 |
+
- **Base Model**: [`sesame/csm-1b`](https://huggingface.co/sesame/csm-1b)
|
| 30 |
+
- **Language**: Danish (da)
|
| 31 |
+
- **Task**: Text-to-Speech
|
| 32 |
+
- **License**: Apache 2.0
|
| 33 |
+
- **Model Type**: LoRA Adapter
|
| 34 |
+
- **Precision**: FP16/BF16
|
| 35 |
+
|
| 36 |
+
## Key Features
|
| 37 |
+
|
| 38 |
+
- 🎯 **Natural Danish synthesis** with clear pronunciation and fluent prosody
|
| 39 |
+
- 🇬🇧 **Exceptional English with Danish accent** - Perfect for bilingual content
|
| 40 |
+
- 🔄 **Voice control** with male/female speaker selection
|
| 41 |
+
- ⚡ **Efficient fine-tuning** using LoRA (only ~16M parameters trained)
|
| 42 |
+
- 🛡️ **Voice leakage prevention** through frozen speaker/codec modules
|
| 43 |
+
- 📱 **Ready-to-use Gradio interface** included
|
| 44 |
+
|
| 45 |
+
## Quick Start
|
| 46 |
+
|
| 47 |
+
### Installation
|
| 48 |
+
|
| 49 |
+
```bash
|
| 50 |
+
pip install transformers torch torchaudio gradio
|
| 51 |
+
```
|
| 52 |
+
|
| 53 |
+
### Basic Usage
|
| 54 |
+
|
| 55 |
+
```python
|
| 56 |
+
import torch
|
| 57 |
+
from transformers import CsmForConditionalGeneration, AutoProcessor
|
| 58 |
+
|
| 59 |
+
# Load model and processor
|
| 60 |
+
model = CsmForConditionalGeneration.from_pretrained("nicolajreck/csm-1b-danish-tts")
|
| 61 |
+
processor = AutoProcessor.from_pretrained("nicolajreck/csm-1b-danish-tts")
|
| 62 |
+
|
| 63 |
+
# Generate speech
|
| 64 |
+
text = "[1]Hej! Velkommen til dansk talesyntese." # [1] for female voice
|
| 65 |
+
inputs = processor(text, add_special_tokens=True).to(model.device)  # keep inputs on the same device as the model
|
| 66 |
+
audio = model.generate(**inputs, output_audio=True)
|
| 67 |
+
|
| 68 |
+
# Save audio
|
| 69 |
+
processor.save_audio(audio, "output.wav")
|
| 70 |
+
```
|
| 71 |
+
|
| 72 |
+
### Web Interface
|
| 73 |
+
|
| 74 |
+
Launch the included Gradio interface:
|
| 75 |
+
|
| 76 |
+
```bash
|
| 77 |
+
python danish_tts.py
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
Access at `http://localhost:7860` for an interactive TTS experience.
|
| 81 |
+
|
| 82 |
+
## Voice Control
|
| 83 |
+
|
| 84 |
+
The model supports two speaker voices:
|
| 85 |
+
- `[0]` - Male voice
|
| 86 |
+
- `[1]` - Female voice
|
| 87 |
+
|
| 88 |
+
Simply prefix your Danish text with the speaker token:
|
| 89 |
+
- `[0]God morgen! Hvordan har du det?` (Male)
|
| 90 |
+
- `[1]God morgen! Hvordan har du det?` (Female)
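The prefixing rule above can be wrapped in a small helper so that callers pick a voice by name rather than by token. This is a minimal sketch: `SPEAKER_IDS` and `with_speaker` are hypothetical names introduced here for illustration and are not part of the model repository.

```python
# Mapping taken from the speaker list above: [0] = male, [1] = female.
SPEAKER_IDS = {"male": 0, "female": 1}


def with_speaker(text: str, voice: str = "female") -> str:
    """Prefix `text` with the speaker token for the requested voice."""
    if voice not in SPEAKER_IDS:
        raise ValueError(
            f"unknown voice {voice!r}; expected one of {sorted(SPEAKER_IDS)}"
        )
    # Strip surrounding whitespace so the token sits directly before the text.
    return f"[{SPEAKER_IDS[voice]}]{text.strip()}"


print(with_speaker("God morgen! Hvordan har du det?", "male"))
# prints: [0]God morgen! Hvordan har du det?
```

The returned string can be passed to the processor in place of a hand-written prompt.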
|
| 91 |
+
|
| 92 |
+
## Training Details
|
| 93 |
+
|
| 94 |
+
### Training Data
|
| 95 |
+
|
| 96 |
+
The model was trained on a carefully curated mix of Danish speech data:
|
| 97 |
+
|
| 98 |
+
- **[Common Voice 17 Danish](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)**: ~10,224 validated samples
|
| 99 |
+
- **[CoRal-TTS Danish](https://huggingface.co/datasets/CoRal-project/coral-tts)**: ~16,547 filtered samples
|
| 100 |
+
- **Private Extension**: ~8,644 additional samples
|
| 101 |
+
|
| 102 |
+
Total: ~35,415 Danish speech samples with balanced representation across datasets.
|
| 103 |
+
|
| 104 |
+
### Training Configuration
|
| 105 |
+
|
| 106 |
+
- **Method**: LoRA (Low-Rank Adaptation)
|
| 107 |
+
- **Rank**: 16, Alpha: 32, Dropout: 0.05
|
| 108 |
+
- **Target Modules**: `{q_proj, k_proj, v_proj, o_proj, out_proj, gate_proj, up_proj, down_proj, fc1, fc2}`
|
| 109 |
+
- **Hardware**: Single RTX 3090 (24GB)
|
| 110 |
+
- **Precision**: FP16 training, supports FP16/BF16 inference
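For readers who want to reproduce a similar setup, the hyperparameters above could be expressed as follows. This is a sketch under assumptions: the README does not say which library was used for training, so the use of `peft.LoraConfig` here is a guess, and `LORA_KWARGS` and `build_lora_config` are hypothetical names.

```python
# Hyperparameters copied from the Training Configuration list above.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj", "out_proj",
        "gate_proj", "up_proj", "down_proj", "fc1", "fc2",
    ],
}


def build_lora_config():
    """Create a peft LoraConfig (assumes `pip install peft`)."""
    # Imported lazily so the plain dict stays usable without peft installed.
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)


# LoRA scales the low-rank update by alpha / r before adding it to the
# frozen weight, so rank 16 with alpha 32 gives a scaling factor of 2.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(scaling)  # prints: 2.0
```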
|
| 111 |
+
|
| 112 |
+
### Data Processing
|
| 113 |
+
|
| 114 |
+
- Duration filtering: 0.6-16 seconds
|
| 115 |
+
- Text normalization: Quote stripping, terminal punctuation
|
| 116 |
+
- Equal-probability dataset mixing to prevent bias
|
| 117 |
+
- Chat-style formatting with Danish language cue
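The first three steps above can be sketched in plain Python. This is an illustration rather than the actual training pipeline: the function names are hypothetical, the duration bounds are assumed to be inclusive, and the exact set of quote characters that were stripped is a guess.

```python
import random

MIN_SECONDS, MAX_SECONDS = 0.6, 16.0  # duration bounds from the list above


def keep_duration(seconds: float) -> bool:
    """Return True if a clip falls inside the duration window (inclusive)."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS


def normalize_text(text: str) -> str:
    """Strip surrounding quotes and make sure the text ends with punctuation."""
    text = text.strip().strip("\"'“”«»").strip()
    if text and text[-1] not in ".!?":
        text += "."
    return text


def sample_mixed(datasets: dict, rng: random.Random):
    """Pick a dataset uniformly at random, then one sample from it.

    Each dataset is equally likely regardless of its size, so a large
    corpus does not dominate the training mix.
    """
    name = rng.choice(sorted(datasets))
    return name, rng.choice(datasets[name])


print(normalize_text('"Hej med dig"'))  # prints: Hej med dig.
```

With equal-probability mixing, the smaller private set is drawn about as often as the larger CoRal-TTS set, which is one way to read "to prevent bias" in the list above.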
|
| 118 |
+
|
| 119 |
+
## Recommended Settings
|
| 120 |
+
|
| 121 |
+
For the most natural and fluent speech, use these generation parameters:
|
| 122 |
+
|
| 123 |
+
```python
|
| 124 |
+
# Natural speech settings
|
| 125 |
+
audio = model.generate(
|
| 126 |
+
**inputs,
|
| 127 |
+
output_audio=True,
|
| 128 |
+
do_sample=True,
|
| 129 |
+
temperature=0.96,
|
| 130 |
+
depth_decoder_temperature=0.7,
|
| 131 |
+
top_k=50,
|
| 132 |
+
top_p=0.9,
|
| 133 |
+
repetition_penalty=1.0
|
| 134 |
+
)
|
| 135 |
+
```
|
| 136 |
+
|
| 137 |
+
## Example Outputs
|
| 138 |
+
|
| 139 |
+
The model handles various Danish text types effectively:
|
| 140 |
+
|
| 141 |
+
| Danish Text | Audio |
|
| 142 |
+
|-------------|-------|
|
| 143 |
+
| *"Husk at gemme arbejdet, før computeren genstarter, ellers risikerer du at miste både filer og vigtige ændringer."* | <audio controls><source src="./tts_examples/technical_instructions.wav" type="audio/wav"><source src="./tts_examples/technical_instructions.mp3" type="audio/mp3">Your browser does not support the audio element.</audio> |
|
| 144 |
+
| *"Pakken leveres i morgen mellem 9 og 12, og du får en SMS-besked, så snart den er klar til afhentning."* | <audio controls><source src="./tts_examples/service_message.wav" type="audio/wav"><source src="./tts_examples/service_message.mp3" type="audio/mp3">Your browser does not support the audio element.</audio> |
|
| 145 |
+
| *"Vi gør opmærksom på, at toget mod Københavns Hovedbanegård er forsinket med omkring 15 minutter. Vi undskylder ventetiden."* | <audio controls><source src="./tts_examples/announcement.wav" type="audio/wav"><source src="./tts_examples/announcement.mp3" type="audio/mp3">Your browser does not support the audio element.</audio> |
|
| 146 |
+
| *"Når du planlægger en rejse, kan det betale sig at undersøge, både transportmuligheder, overnatning og oplevelser inden da. Sådan får du mest muligt ud af tiden, og du slipper for unødvendig stress undervejs."* | <audio controls><source src="./tts_examples/travel_planning.wav" type="audio/wav"><source src="./tts_examples/travel_planning.mp3" type="audio/mp3">Your browser does not support the audio element.</audio> |
|
| 147 |
+
|
| 148 |
+
## Performance
|
| 149 |
+
|
| 150 |
+
Compared to the base CSM-1B model on Danish text:
|
| 151 |
+
- ✅ Clearer pronunciation and word articulation
|
| 152 |
+
- ✅ More natural rhythm and speaking flow
|
| 153 |
+
- ✅ Speech with fewer dropped sounds
|
| 154 |
+
- ✅ More pleasant, consistent voice across different text types
|
| 155 |
+
|
| 156 |
+
## Gradio Interface Features
|
| 157 |
+
|
| 158 |
+
The included `danish_tts.py` provides a comprehensive web interface with:
|
| 159 |
+
|
| 160 |
+
- **Three-column layout**: Input settings, sampling controls, audio output
|
| 161 |
+
- **Auto max-length calculation** with adjustable multiplier
|
| 162 |
+
- **Advanced parameter control**: Dual temperatures, Top-K/Top-P, repetition penalty
|
| 163 |
+
- **Pre-configured examples** with optimized settings
|
| 164 |
+
- **Real-time generation** and audio playback
|
| 165 |
+
|
| 166 |
+
## Limitations
|
| 167 |
+
|
| 168 |
+
- Optimized specifically for Danish - other languages may have reduced quality
|
| 169 |
+
- Requires base model `sesame/csm-1b` to function
|
| 170 |
+
- Voice control limited to male/female binary selection
|
| 171 |
+
- Generated audio should be identified as synthetic in production use
|
| 172 |
+
|
| 173 |
+
## Technical Details
|
| 174 |
+
|
| 175 |
+
### Model Architecture
|
| 176 |
+
- **Base**: CSM-1B, a Llama-style autoregressive backbone decoder paired with a smaller depth decoder
|
| 177 |
+
- **Audio Format**: 24 kHz waveform, decoded from generated Mimi codec tokens
|
| 178 |
+
- **LoRA Integration**: Language projections only, speaker/codec frozen
|
| 179 |
+
- **Memory Requirements**: ~8GB VRAM for inference
|
| 180 |
+
|
| 181 |
+
### Files Included
|
| 182 |
+
- LoRA adapter weights
|
| 183 |
+
- Processor configuration
|
| 184 |
+
- Gradio web interface (`danish_tts.py`)
|
| 185 |
+
- Training scripts and utilities
|
| 186 |
+
|
| 187 |
+
## Citation
|
| 188 |
+
|
| 189 |
+
If you use this model, please cite:
|
| 190 |
+
|
| 191 |
+
```bibtex
|
| 192 |
+
@misc{csm1b-danish-2024,
|
| 193 |
+
title={High-Quality Danish Text-to-Speech with CSM-1B: Data Mixing, Voice Control, and LoRA Fine-Tuning},
|
| 194 |
+
author={Nicolaj Reck},
|
| 195 |
+
year={2024},
|
| 196 |
+
howpublished={\url{https://huggingface.co/nicolajreck/csm-1b-danish-tts}},
|
| 197 |
+
note={LinkedIn: https://www.linkedin.com/in/nicolaj-reck-053aa38a/}
|
| 198 |
+
}
|
| 199 |
+
```
|
| 200 |
+
|
| 201 |
+
## Acknowledgments
|
| 202 |
+
|
| 203 |
+
**Authored by**: [Nicolaj Reck](https://www.linkedin.com/in/nicolaj-reck-053aa38a/)
|
| 204 |
+
|
| 205 |
+
Thanks to:
|
| 206 |
+
- **[Mozilla Foundation](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)** for the Common Voice 17 dataset
|
| 207 |
+
- **[CoRal-TTS project](https://huggingface.co/datasets/CoRal-project/coral-tts)** for the Danish speech corpus
|
| 208 |
+
- **[Sesame Research](https://huggingface.co/sesame/csm-1b)** for the base CSM-1B model
|
| 209 |
+
- The open-source community for tools and frameworks
|
| 210 |
+
|
| 211 |
+
## License
|
| 212 |
+
|
| 213 |
+
This model is released under the Apache 2.0 license. Please see the base model license for additional terms.
|