Is CUDA supported when running on Jetson Orin?
Some NeMo models (like https://huggingface.co/nvidia/mel-codec-22khz, for example) explicitly list Jetson as supported hardware, but this model does not. Is it runnable with CUDA? I managed to run it on the CPU of a Jetson Orin Nano, but failed to get it running on CUDA despite trying various base containers and even rebuilding the NeMo framework from the latest main branch.
We haven't tried it on Jetson.
Could you share the error you are getting?
Running a container based on nvcr.io/nvidia/nemo:dev fails this way:
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 540.4.0 which has support for CUDA 12.6. This container
was built with CUDA 12.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for NeMo Framework. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
[NeMo W 2025-05-08 20:11:55 nemo_logging:405] Please use the EncDecSpeakerLabelModel instead of this model. EncDecClassificationModel model is kept for backward compatibility with older models.
[NeMo I 2025-05-08 20:12:04 nemo_logging:393] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo W 2025-05-08 20:12:05 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
use_lhotse: true
skip_missing_manifest_entries: true
input_cfg: null
tarred_audio_filepaths: null
manifest_filepath: null
sample_rate: 16000
shuffle: true
num_workers: 2
pin_memory: true
max_duration: 40.0
min_duration: 0.1
text_field: answer
batch_duration: null
use_bucketing: true
bucket_duration_bins: null
bucket_batch_size: null
num_buckets: 30
bucket_buffer_size: 20000
shuffle_buffer_size: 10000
[NeMo W 2025-05-08 20:12:05 nemo_logging:405] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
use_lhotse: true
manifest_filepath: null
sample_rate: 16000
batch_size: 16
shuffle: false
max_duration: 40.0
min_duration: 0.1
num_workers: 2
pin_memory: true
text_field: answer
[NeMo I 2025-05-08 20:12:05 nemo_logging:393] PADDING: 0
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
Traceback (most recent call last):
File "/app/asr.py", line 4, in <module>
asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="asr.nemo")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/NeMo/nemo/core/classes/modelPT.py", line 482, in restore_from
instance = cls._save_restore_connector.restore_from(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 260, in restore_from
loaded_params = self.load_config_and_state_dict(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 183, in load_config_and_state_dict
instance = instance.to(map_location)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 55, in to
return super().to(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1355, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 290, in _apply
self._init_flat_weights()
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 215, in _init_flat_weights
self.flatten_parameters()
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 271, in flatten_parameters
torch._cudnn_rnn_flatten_weight(
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
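For reference, the asr.py from the traceback is essentially this minimal script (the CUDA check and the audio path are additions/placeholders of mine):

```python
import nemo.collections.asr as nemo_asr
import torch

# Quick sanity check: does PyTorch see the Orin GPU at all?
print("CUDA available:", torch.cuda.is_available())

# Restore the Parakeet TDT checkpoint from a local .nemo file;
# this is the call that raises the "no kernel image" error above
asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="asr.nemo")

# Transcribe a sample file (placeholder path)
print(asr_model.transcribe(["sample.wav"]))
```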
Running a container based on nvcr.io/nvidia/pytorch:24.07-py3 and installing "nemo_toolkit['asr']@git+https://github.com/NVIDIA/NeMo" inside the container works:
[NeMo I 2025-05-08 20:19:32 features:305] PADDING: 0
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-05-08 20:19:37 tdt_loop_labels_computer:305] No conditional node support for Cuda.
Cuda graphs with while loops are disabled, decoding speed will be slower
Reason: CUDA is not available
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-05-08 20:19:37 tdt_loop_labels_computer:305] No conditional node support for Cuda.
Cuda graphs with while loops are disabled, decoding speed will be slower
Reason: CUDA is not available
[NeMo I 2025-05-08 20:19:41 save_restore_connector:275] Model EncDecRNNTBPEModel was successfully restored from /app/asr.nemo.
Transcribing: 100%|██████████| 1/1 [00:01<00:00, 1.92s/it]
but from what I see in the logs, it seems to be running on the CPU, not CUDA.
The logs above show that CUDA is not available. I presume it's a CUDA installation issue rather than a problem with the model inference.
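A quick way to confirm is to run a small check inside the container before loading the model (a minimal sketch, nothing model-specific):

```python
import torch

# If this prints False, PyTorch was built without support for the Orin GPU
# (or cannot reach the driver) and NeMo will silently fall back to CPU.
print("CUDA available:", torch.cuda.is_available())
print("PyTorch CUDA build:", torch.version.cuda)

if torch.cuda.is_available():
    # The Orin GPU should appear here with its compute capability (8.7)
    print("Device:", torch.cuda.get_device_name(0))
    print("Capability:", torch.cuda.get_device_capability(0))
    # CUDA architectures this PyTorch build was compiled for
    print("Arch list:", torch.cuda.get_arch_list())
```

If sm_87 (the Orin GPU architecture) is missing from the arch list, that would also explain the "no kernel image is available" error from the first container.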
Hello, I've encountered a similar problem. May I ask whether you have solved it?
When I use parakeet-tdt-0.6b-v2 to perform automatic speech recognition on audio files, I get the error below, even though I am pretty sure I have cuda-python installed in the current environment:
(full-duplex-bench) root@i003757-767469748d-br25v:/data-mnt/data/personal/yfxu/Full-Duplex-Bench# pip list | grep cuda
cuda-bindings 13.0.1
cuda-pathfinder 1.2.2
cuda-python 13.0.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
(full-duplex-bench) root@i003757-767469748d-br25v:/data-mnt/data/personal/yfxu/Full-Duplex-Bench# python get_transcript/asr.py --root_dir /data-mnt/data/personal/yfxu/output_audios --task full
[NeMo I 2025-09-10 23:46:19 mixins:181] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo W 2025-09-10 23:46:19 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
use_lhotse: true
skip_missing_manifest_entries: true
input_cfg: null
tarred_audio_filepaths: null
manifest_filepath: null
sample_rate: 16000
shuffle: true
num_workers: 2
pin_memory: true
max_duration: 40.0
min_duration: 0.1
text_field: answer
batch_duration: null
use_bucketing: true
bucket_duration_bins: null
bucket_batch_size: null
num_buckets: 30
bucket_buffer_size: 20000
shuffle_buffer_size: 10000
[NeMo W 2025-09-10 23:46:19 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
use_lhotse: true
manifest_filepath: null
sample_rate: 16000
batch_size: 16
shuffle: false
max_duration: 40.0
min_duration: 0.1
num_workers: 2
pin_memory: true
text_field: answer
[NeMo I 2025-09-10 23:46:19 features:305] PADDING: 0
[NeMo I 2025-09-10 23:46:25 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-09-10 23:46:25 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-09-10 23:46:25 tdt_loop_labels_computer:281] No conditional node support for Cuda.
Cuda graphs with while loops are disabled, decoding speed will be slower
Reason: No cuda-python module. Please do pip install cuda-python>=12.3
[NeMo I 2025-09-10 23:46:25 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-09-10 23:46:25 tdt_loop_labels_computer:281] No conditional node support for Cuda.
Cuda graphs with while loops are disabled, decoding speed will be slower
Reason: No cuda-python module. Please do pip install cuda-python>=12.3
[NeMo I 2025-09-10 23:46:29 save_restore_connector:275] Model EncDecRNNTBPEModel was successfully restored from /data-mnt/data/personal/yfxu/.cache/huggingface/hub/models--nvidia--parakeet-tdt-0.6b-v2/snapshots/4f7f0088738aa056a90bdacbd6a0e22672b0f206/parakeet-tdt-0.6b-v2.nemo.
0it [00:00, ?it/s]
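For what it's worth, that warning comes from NeMo's CUDA-graphs decoder, which probes for the cuda-python bindings when it builds the decoding loop. A rough way to reproduce the probe yourself (a sketch, not the exact NeMo code) is:

```python
# Sketch of the kind of probe that produces the "No cuda-python module" warning;
# the exact import NeMo performs may differ between NeMo and cuda-python versions.
try:
    from cuda import cuda as cuda_driver  # provided by the cuda-python package
    print("cuda-python driver bindings import OK")
except Exception as exc:  # the import layout has changed across cuda-python releases
    print("cuda-python bindings not usable:", exc)

import torch
print("torch sees CUDA:", torch.cuda.is_available())
```

Since cuda-python 13.0.1 is installed but NeMo still reports no module, it may be worth checking whether this NeMo version expects the older import layout and, if so, trying a cuda-python 12.x release instead.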
I couldn't figure out how to run the NeMo model with CUDA on the Jetson Orin Nano. I am currently running this model via https://github.com/k2-fsa/sherpa-onnx instead.
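In case it helps others, this is roughly how the exported model can be driven from the sherpa-onnx Python API (a sketch following their offline transducer examples; the ONNX file names are assumptions about how the export was done, and parameter names may differ across sherpa-onnx versions):

```python
import sherpa_onnx
import soundfile as sf

# Offline (non-streaming) recognizer built from the ONNX export of the model;
# the file names below are placeholders for the exported encoder/decoder/joiner.
recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    tokens="tokens.txt",
    num_threads=4,
    model_type="nemo_transducer",
)

# Decode a single file (placeholder path)
samples, sample_rate = sf.read("sample.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)
```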
https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2/discussions/66 <-- I think this will help. You may need to be careful with the RAM.