readme update
README.md CHANGED

@@ -18,6 +18,7 @@ datasets:
 - MLCommons/peoples_speech
 thumbnail: null
 tags:
+- transformers
 - automatic-speech-recognition
 - speech
 - audio

@@ -182,6 +183,77 @@ img {
 It is an XL version of the FastConformer CTC [1] model (around 600M parameters).
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) for complete architecture details.

+## Transformers
+
+You can now run Parakeet CTC natively with [Transformers](https://github.com/huggingface/transformers) 🤗 Support is currently on the `main` branch, so install from source:
+
+```bash
+pip install git+https://github.com/huggingface/transformers
+```
+
+<details>
+<summary>➡️ Pipeline usage</summary>
+
+```python
+from transformers import pipeline
+
+pipe = pipeline("automatic-speech-recognition", model="nvidia/parakeet-ctc-0.6b")
+out = pipe("https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3")
+print(out)
+```
+</details>
+
+<details>
+<summary>➡️ AutoModel</summary>
+
+```python
+from transformers import AutoModelForCTC, AutoProcessor
+from datasets import load_dataset, Audio
+import torch
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+processor = AutoProcessor.from_pretrained("nvidia/parakeet-ctc-0.6b")
+model = AutoModelForCTC.from_pretrained("nvidia/parakeet-ctc-0.6b", dtype="auto", device_map=device)
+
+ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
+speech_samples = [el["array"] for el in ds["audio"][:5]]
+
+inputs = processor(speech_samples, sampling_rate=processor.feature_extractor.sampling_rate)
+inputs = inputs.to(model.device, dtype=model.dtype)
+outputs = model.generate(**inputs)
+print(processor.batch_decode(outputs))
+```
+</details>
+
+<details>
+<summary>➡️ Training</summary>
+
+```python
+from transformers import AutoModelForCTC, AutoProcessor
+from datasets import load_dataset, Audio
+import torch
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+processor = AutoProcessor.from_pretrained("nvidia/parakeet-ctc-0.6b")
+model = AutoModelForCTC.from_pretrained("nvidia/parakeet-ctc-0.6b", dtype="auto", device_map=device)
+
+ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
+speech_samples = [el["array"] for el in ds["audio"][:5]]
+text_samples = ds["text"][:5]
+
+# passing `text` to the processor prepares the inputs' `labels` key
+inputs = processor(audio=speech_samples, text=text_samples, sampling_rate=processor.feature_extractor.sampling_rate)
+inputs = inputs.to(device, dtype=model.dtype)
+
+outputs = model(**inputs)
+outputs.loss.backward()
+```
+</details>
+
 ## NVIDIA NeMo: Training

 To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you've installed the latest PyTorch version.
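
A minimal sketch of that NeMo setup (assuming the `nemo_toolkit` PyPI package and its `asr` extra; check the NeMo README for the current install command):

```bash
# Install the latest PyTorch build first (pick the matching CUDA variant from pytorch.org),
# then NeMo with its ASR dependencies.
pip install torch
pip install -U "nemo_toolkit[asr]"
```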