Is CUDA supported when running on Jetson Orin?
Some NeMo models (like https://huggingface.co/nvidia/mel-codec-22khz, for example) explicitly list Jetson as supported hardware, but this model does not. Is it runnable with CUDA? I managed to run it on the CPU of a Jetson Orin Nano, but failed to get it running on CUDA despite trying various base containers and even rebuilding the NeMo framework from the latest main branch.
We haven't tried it on Jetson.
Could you share the error you are getting?
Running a container based on nvcr.io/nvidia/nemo:dev fails this way:
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 540.4.0 which has support for CUDA 12.6. This container
was built with CUDA 12.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for NeMo Framework. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
[NeMo W 2025-05-08 20:11:55 nemo_logging:405] Please use the EncDecSpeakerLabelModel instead of this model. EncDecClassificationModel model is kept for backward compatibility with older models.
[NeMo I 2025-05-08 20:12:04 nemo_logging:393] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo W 2025-05-08 20:12:05 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
use_lhotse: true
skip_missing_manifest_entries: true
input_cfg: null
tarred_audio_filepaths: null
manifest_filepath: null
sample_rate: 16000
shuffle: true
num_workers: 2
pin_memory: true
max_duration: 40.0
min_duration: 0.1
text_field: answer
batch_duration: null
use_bucketing: true
bucket_duration_bins: null
bucket_batch_size: null
num_buckets: 30
bucket_buffer_size: 20000
shuffle_buffer_size: 10000
[NeMo W 2025-05-08 20:12:05 nemo_logging:405] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
use_lhotse: true
manifest_filepath: null
sample_rate: 16000
batch_size: 16
shuffle: false
max_duration: 40.0
min_duration: 0.1
num_workers: 2
pin_memory: true
text_field: answer
[NeMo I 2025-05-08 20:12:05 nemo_logging:393] PADDING: 0
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
Traceback (most recent call last):
File "/app/asr.py", line 4, in <module>
asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="asr.nemo")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/NeMo/nemo/core/classes/modelPT.py", line 482, in restore_from
instance = cls._save_restore_connector.restore_from(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 260, in restore_from
loaded_params = self.load_config_and_state_dict(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 183, in load_config_and_state_dict
instance = instance.to(map_location)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 55, in to
return super().to(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1355, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 290, in _apply
self._init_flat_weights()
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 215, in _init_flat_weights
self.flatten_parameters()
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 271, in flatten_parameters
torch._cudnn_rnn_flatten_weight(
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
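For reference, the asr.py from the traceback is essentially this minimal script (the CUDA check and the audio path are additions/placeholders of mine):

```python
import nemo.collections.asr as nemo_asr
import torch

# Quick sanity check: does PyTorch see the Orin GPU at all?
print("CUDA available:", torch.cuda.is_available())

# Restore the Parakeet TDT checkpoint from a local .nemo file;
# this is the call that raises the "no kernel image" error above
asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="asr.nemo")

# Transcribe a sample file (placeholder path)
print(asr_model.transcribe(["sample.wav"]))
```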
Running a container based on nvcr.io/nvidia/pytorch:24.07-py3 and installing "nemo_toolkit['asr']@git+https://github.com/NVIDIA/NeMo" inside the container works:
[NeMo I 2025-05-08 20:19:32 features:305] PADDING: 0
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-05-08 20:19:37 tdt_loop_labels_computer:305] No conditional node support for Cuda.
Cuda graphs with while loops are disabled, decoding speed will be slower
Reason: CUDA is not available
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-05-08 20:19:37 tdt_loop_labels_computer:305] No conditional node support for Cuda.
Cuda graphs with while loops are disabled, decoding speed will be slower
Reason: CUDA is not available
[NeMo I 2025-05-08 20:19:41 save_restore_connector:275] Model EncDecRNNTBPEModel was successfully restored from /app/asr.nemo.
Transcribing: 100%|██████████| 1/1 [00:01<00:00, 1.92s/it]
but from what I see in the logs, it seems to be running on the CPU, not CUDA.
The logs above show that CUDA is not available. I presume it's a CUDA installation issue rather than a problem with the model inference.
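A quick way to confirm is to run a small check inside the container before loading the model (a minimal sketch, nothing model-specific):

```python
import torch

# If this prints False, PyTorch was built without support for the Orin GPU
# (or cannot reach the driver) and NeMo will silently fall back to CPU.
print("CUDA available:", torch.cuda.is_available())
print("PyTorch CUDA build:", torch.version.cuda)

if torch.cuda.is_available():
    # The Orin GPU should appear here with its compute capability (8.7)
    print("Device:", torch.cuda.get_device_name(0))
    print("Capability:", torch.cuda.get_device_capability(0))
    # CUDA architectures this PyTorch build was compiled for
    print("Arch list:", torch.cuda.get_arch_list())
```

If sm_87 (the Orin GPU architecture) is missing from the arch list, that would also explain the "no kernel image is available" error from the first container.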
Hello, I've encountered a similar problem. May I ask whether you have solved it?
When I use parakeet-tdt-0.6b-v2 to perform automatic speech recognition on audio files, I get the error below, even though I am pretty sure I have cuda-python installed in the current environment:
(full-duplex-bench) root@i003757-767469748d-br25v:/data-mnt/data/personal/yfxu/Full-Duplex-Bench# pip list | grep cuda
cuda-bindings 13.0.1
cuda-pathfinder 1.2.2
cuda-python 13.0.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
(full-duplex-bench) root@i003757-767469748d-br25v:/data-mnt/data/personal/yfxu/Full-Duplex-Bench# python get_transcript/asr.py --root_dir /data-mnt/data/personal/yfxu/output_audios --task full
[NeMo I 2025-09-10 23:46:19 mixins:181] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo W 2025-09-10 23:46:19 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
use_lhotse: true
skip_missing_manifest_entries: true
input_cfg: null
tarred_audio_filepaths: null
manifest_filepath: null
sample_rate: 16000
shuffle: true
num_workers: 2
pin_memory: true
max_duration: 40.0
min_duration: 0.1
text_field: answer
batch_duration: null
use_bucketing: true
bucket_duration_bins: null
bucket_batch_size: null
num_buckets: 30
bucket_buffer_size: 20000
shuffle_buffer_size: 10000
[NeMo W 2025-09-10 23:46:19 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
use_lhotse: true
manifest_filepath: null
sample_rate: 16000
batch_size: 16
shuffle: false
max_duration: 40.0
min_duration: 0.1
num_workers: 2
pin_memory: true
text_field: answer
[NeMo I 2025-09-10 23:46:19 features:305] PADDING: 0
[NeMo I 2025-09-10 23:46:25 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-09-10 23:46:25 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-09-10 23:46:25 tdt_loop_labels_computer:281] No conditional node support for Cuda.
Cuda graphs with while loops are disabled, decoding speed will be slower
Reason: No cuda-python module. Please do pip install cuda-python>=12.3
[NeMo I 2025-09-10 23:46:25 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-09-10 23:46:25 tdt_loop_labels_computer:281] No conditional node support for Cuda.
Cuda graphs with while loops are disabled, decoding speed will be slower
Reason: No cuda-python module. Please do pip install cuda-python>=12.3
[NeMo I 2025-09-10 23:46:29 save_restore_connector:275] Model EncDecRNNTBPEModel was successfully restored from /data-mnt/data/personal/yfxu/.cache/huggingface/hub/models--nvidia--parakeet-tdt-0.6b-v2/snapshots/4f7f0088738aa056a90bdacbd6a0e22672b0f206/parakeet-tdt-0.6b-v2.nemo.
0it [00:00, ?it/s]
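For what it's worth, that warning comes from NeMo's CUDA-graphs decoder, which probes for the cuda-python bindings when it builds the decoding loop. A rough way to reproduce the probe yourself (a sketch, not the exact NeMo code) is:

```python
# Sketch of the kind of probe that produces the "No cuda-python module" warning;
# the exact import NeMo performs may differ between NeMo and cuda-python versions.
try:
    from cuda import cuda as cuda_driver  # provided by the cuda-python package
    print("cuda-python driver bindings import OK")
except Exception as exc:  # the import layout has changed across cuda-python releases
    print("cuda-python bindings not usable:", exc)

import torch
print("torch sees CUDA:", torch.cuda.is_available())
```

Since cuda-python 13.0.1 is installed but NeMo still reports no module, it may be worth checking whether this NeMo version expects the older import layout and, if so, trying a cuda-python 12.x release instead.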
I couldn't figure out how to run the NeMo model with CUDA on the Jetson Orin Nano. I am currently running this model via https://github.com/k2-fsa/sherpa-onnx instead.
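In case it helps others, this is roughly how the exported model can be driven from the sherpa-onnx Python API (a sketch following their offline transducer examples; the ONNX file names are assumptions about how the export was done, and parameter names may differ across sherpa-onnx versions):

```python
import sherpa_onnx
import soundfile as sf

# Offline (non-streaming) recognizer built from the ONNX export of the model;
# the file names below are placeholders for the exported encoder/decoder/joiner.
recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    tokens="tokens.txt",
    num_threads=4,
    model_type="nemo_transducer",
)

# Decode a single file (placeholder path)
samples, sample_rate = sf.read("sample.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)
```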
https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2/discussions/66 <-- I think this will help. You may need to be careful with the RAM.