Update Model Card: Clarify WER value discrepancy and add workaround for GPUs that don't support CUDA Graphs

README.md +12 -2

@@ -59,7 +59,9 @@ img {
 `soloni-114m-tdt-ctc` is a fine-tuned version of NVIDIA's [`parakeet-tdt_ctc-110m`](https://huggingface.co/nvidia/parakeet-tdt_ctc-110m) that transcribes Bambara speech. Unlike its base model, this model cannot produce punctuation or capitalization, since these were absent from its training data.
 The model was fine-tuned using **NVIDIA NeMo** and supports **both TDT (Token-and-Duration Transducer) and CTC (Connectionist Temporal Classification) decoding**.
-## **🚨 Important Note**
 This model, along with its associated resources, is part of an **ongoing research effort**; improvements and refinements are expected in future versions. Users should be aware that:
 - **The model may not generalize very well across all speaking conditions and dialects.**
@@ -89,6 +91,14 @@ asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_na
 asr_model.transcribe(['sample_audio.wav'])
 ```
 ### Input
 This model accepts **16000 Hz mono-channel** audio (wav files) as input.
@@ -126,7 +136,7 @@ These are greedy WER numbers without external LM. By default the main decoder br
 ```python
 # Retrieve the CTC decoding config
-ctc_decoding_cfg = model.cfg.aux_ctc.decoding
 # Then change the decoding strategy
 asr_model.change_decoding_strategy(decoder_type='ctc', decoding_cfg=ctc_decoding_cfg)
 # Transcribe with the CTC decoder

 `soloni-114m-tdt-ctc` is a fine-tuned version of NVIDIA's [`parakeet-tdt_ctc-110m`](https://huggingface.co/nvidia/parakeet-tdt_ctc-110m) that transcribes Bambara speech. Unlike its base model, this model cannot produce punctuation or capitalization, since these were absent from its training data.
 The model was fine-tuned using **NVIDIA NeMo** and supports **both TDT (Token-and-Duration Transducer) and CTC (Connectionist Temporal Classification) decoding**.
+## **🚨 Important Note**
+**Update (February 17th):** We observed a significantly lower WER (~36%) for the TDT branch when using an external WER calculation method that relies solely on the predicted and reference transcriptions. However, the WER values reported in this model card are derived from the standard NeMo workflow using PyTorch Lightning's trainer, where the TDT branch yielded higher WER scores (~66%). Differences may arise due to variations in post-processing, alignment handling, or evaluation methodologies.
 This model, along with its associated resources, is part of an **ongoing research effort**; improvements and refinements are expected in future versions. Users should be aware that:
 - **The model may not generalize very well across all speaking conditions and dialects.**
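
The update note refers to a WER computed only from predicted and reference transcriptions. As a rough illustration of that kind of calculation, here is a minimal sketch in plain Python: word-level edit distance summed over all utterances, divided by the total number of reference words. The function names are hypothetical, and this is neither the NeMo metric nor necessarily the external method the authors used. Text normalization (casing, punctuation, whitespace) is left to the caller and can shift the result noticeably.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words = reference.split()
        hyp_words = hypothesis.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("References contain no words; WER is undefined.")
    return total_edits / total_words
```

Calling `word_error_rate(references, hypotheses)` with two equal-length lists of strings returns a fraction; multiply by 100 to compare against the percentages quoted in the note.
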
 asr_model.transcribe(['sample_audio.wav'])
 ```
+Note that the TDT decoder's decoding strategy uses CUDA Graphs by default, but not all GPUs and CUDA versions support them. If you run into a `RuntimeError: CUDA error: invalid argument`, set `use_cuda_graph_decoder` to `False` in the decoding config before calling `asr_model.transcribe()`:
+```python
+decoding_cfg = asr_model.cfg.decoding
+# Disable CUDA Graphs
+decoding_cfg.greedy.use_cuda_graph_decoder = False
+# Then change the decoding strategy
+asr_model.change_decoding_strategy(decoding_cfg=decoding_cfg)
+```
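
The workaround above could also be wrapped in a small helper that only disables CUDA Graphs when the error actually occurs. This is a sketch under assumptions: the helper name is hypothetical, it relies only on the attributes and methods shown in the snippet above, and it assumes the process can continue after the failure. If the CUDA context is left in a bad state, disabling CUDA Graphs up front is the safer route.

```python
def transcribe_with_cuda_graph_fallback(asr_model, audio_paths):
    """Transcribe, retrying once with CUDA Graphs disabled if they are unsupported."""
    try:
        return asr_model.transcribe(audio_paths)
    except RuntimeError as err:
        # Only handle the specific failure described in the model card.
        if "CUDA error: invalid argument" not in str(err):
            raise
        decoding_cfg = asr_model.cfg.decoding
        # Disable CUDA Graphs, then apply the updated decoding strategy
        decoding_cfg.greedy.use_cuda_graph_decoder = False
        asr_model.change_decoding_strategy(decoding_cfg=decoding_cfg)
        return asr_model.transcribe(audio_paths)
```

Usage would look like `transcribe_with_cuda_graph_fallback(asr_model, ['sample_audio.wav'])`, with `asr_model` loaded as shown earlier in the README.
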
 ### Input
 This model accepts **16000 Hz mono-channel** audio (wav files) as input.
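
To check that a file matches the input format stated above before transcribing it, something like the following could be used. It is a minimal sketch using only Python's standard `wave` module; the function name is hypothetical, and it only handles uncompressed PCM wav files.

```python
import wave


def is_16khz_mono_wav(path):
    """Return True if the wav file at `path` is 16000 Hz with a single channel."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == 16000 and wav_file.getnchannels() == 1
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before being passed to `asr_model.transcribe()`.
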
 ```python
 # Retrieve the CTC decoding config
+ctc_decoding_cfg = asr_model.cfg.aux_ctc.decoding
 # Then change the decoding strategy
 asr_model.change_decoding_strategy(decoder_type='ctc', decoding_cfg=ctc_decoding_cfg)
 # Transcribe with the CTC decoder