Update README.md (commit 816a849 by adefossez, verified; parent 2ed3db9)

Files changed (1): README.md (+47 −0)
This model is provided primarily for the purpose of scientific comparisons on public benchmarks.
In particular, please check our pipeline for running TTS model evaluations on a number of benchmarks: [tts_longeval](https://github.com/kyutai-labs/tts_longeval).
66
+ Here is an example, first install `moshi`, for instance with
67
+ ```bash
68
+ pip install -U "git+https://[email protected]/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi
69
+ ```
70
+
71
+ ```python
72
+ import torch
73
+ from moshi.models.loaders import CheckpointInfo
74
+ from moshi.models.tts import DEFAULT_DSM_TTS_VOICE_REPO, TTSModel
75
+
76
+ text = "Hey there! How are you? I had the craziest day today."
77
+ voice = "expresso/ex03-ex01_happy_001_channel1_334s.wav"
78
+
79
+ checkpoint_info = CheckpointInfo.from_hf_repo('kyutai/tts-0.75b-en-public')
80
+ tts_model = TTSModel.from_checkpoint_info(
81
+ checkpoint_info, n_q=16, temp=0.6, cfg_coef=3, device=torch.device("cuda")
82
+ )
83
+ entries = tts_model.prepare_script([text], padding_between=1)
84
+ # `voice` could also be a local wav file.
85
+ voice_path = tts_model.get_voice_path(voice)
86
+ prefix = tts_model.get_prefix(voice_path)
87
+
88
+ print("Generating audio...")
89
+ pcms = []
90
+ def _on_frame(frame):
91
+ print("Step", len(pcms), end="\r")
92
+ if (frame[:, 1:] != -1).all():
93
+ pcm = tts_model.mimi.decode(frame[:, 1:, :]).cpu()
94
+ pcms.append(pcm.clip(-1, 1))
95
+
96
+ # You could also generate multiple audios at once by extending the following lists.
97
+ all_entries = [entries]
98
+ prefixes = [prefix]
99
+ with tts_model.mimi.streaming(len(all_entries)):
100
+ result = tts_model.generate(all_entries, [], on_frame=_on_frame, prefixes=prefixes)
101
+
102
+ print("Done generating.")
103
+ audios = torch.cat(pcms, dim=-1)
104
+
105
+ for audio, prefix in zip(audios, prefixes):
106
+ # We need to skip the audio prefix.
107
+ skip = int((tts_model.mimi.sample_rate * prefix.shape[-1]) / tts_model.mimi.frame_rate)
108
+ audio = audio[..., skip:]
109
+ # Now do something with this audio!
110
+ ```
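The prefix-skipping step at the end converts a length in Mimi frames into a length in PCM samples. A standalone sketch of that arithmetic (the 24 kHz sample rate and 12.5 Hz frame rate are Mimi's usual values, assumed here rather than read from a loaded model):

```python
def prefix_skip_samples(n_prefix_frames: int,
                        sample_rate: float = 24000.0,
                        frame_rate: float = 12.5) -> int:
    """PCM samples covered by a voice prefix of `n_prefix_frames` Mimi frames."""
    return int(sample_rate * n_prefix_frames / frame_rate)

# At 24 kHz and 12.5 Hz, each frame covers 24000 / 12.5 = 1920 samples.
print(prefix_skip_samples(1))    # 1920
print(prefix_skip_samples(125))  # 240000, i.e. 10 seconds of audio
```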
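To keep the result, the float PCM in [-1, 1] can be written out as 16-bit WAV using only the standard library. This is a sketch: `save_wav` is a hypothetical helper, not part of the moshi API, and the 24 kHz rate is assumed; with the example above you would pass it `audio[0].numpy()`.

```python
import math
import struct
import wave

def save_wav(path: str, samples, sample_rate: int = 24000) -> None:
    """Write mono float samples in [-1, 1] to `path` as 16-bit PCM WAV."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample -> 16-bit PCM
        f.setframerate(sample_rate)
        f.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        ))

# Example: a 0.1 s, 440 Hz sine tone at 24 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 24000) for t in range(2400)]
save_wav("out.wav", tone)
```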

## Model Card Authors

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez