RuntimeError: operator _C::marlin_qqq_gemm does not exist

by sunnykaibai - opened Aug 23


sunnykaibai

Aug 23

I followed the guide to set up the environment:

Patched vLLM (see: https://github.com/vllm-project/vllm/pull/22716)

git clone -b glm-45 https://github.com/zRzRzRzRzRzRzR/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install .

Install preview build of Transformers with GLM-4.5V support

pip install transformers-v4.55.0-GLM-4.5V-preview

but I still get this error:

INFO 08-23 17:18:09 [__init__.py:241] Automatically detected platform cuda.
Traceback (most recent call last):
  File "/mnt/workspace/zichen.shx/infer/infer_v6_cmd_glm.py", line 25, in <module>
    from vllm import LLM, SamplingParams
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/__init__.py", line 64, in __getattr__
    module = import_module(module_name, package)
  File "/opt/conda/envs/glmfp4/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 24, in <module>
    from vllm.engine.llm_engine import LLMEngine
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 28, in <module>
    from vllm.engine.output_processor.util import create_output_by_sequence_group
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/engine/output_processor/util.py", line 8, in <module>
    from vllm.model_executor.layers.sampler import SamplerOutput
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 16, in <module>
    from vllm.model_executor.layers.utils import apply_penalties
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/model_executor/layers/utils.py", line 8, in <module>
    from vllm import _custom_ops as ops
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/_custom_ops.py", line 472, in <module>
    def _marlin_qqq_gemm_fake(a: torch.Tensor, b_q_weight: torch.Tensor,
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/torch/library.py", line 1023, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/torch/library.py", line 214, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator _C::marlin_qqq_gemm does not exist
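
The last frames show where this fails: vllm/_custom_ops.py calls torch.library.register_fake for _C::marlin_qqq_gemm, and PyTorch raises because the compiled _C extension that was loaded never registered an operator with that name. One plausible cause (an assumption, not confirmed in this thread) is that the Python sources and the precompiled binaries fetched via VLLM_USE_PRECOMPILED=1 come from different vLLM commits. The sketch below checks whether the loaded extension defines the op; the helper names op_registered and check_vllm_kernels are hypothetical, and it assumes that importing vllm._C registers the kernels under torch.ops._C.

```python
import importlib


def op_registered(namespace: str, op_name: str) -> bool:
    """Return True if torch.ops.<namespace>.<op_name> resolves to an operator."""
    import torch  # imported lazily so this module can be loaded without torch

    try:
        getattr(getattr(torch.ops, namespace), op_name)
    except (AttributeError, RuntimeError):
        # An unknown op surfaces as AttributeError (RuntimeError on some versions).
        return False
    return True


def check_vllm_kernels(op_names=("marlin_qqq_gemm",)) -> dict:
    """Load vLLM's compiled extension (assumed: vllm._C) and report which ops exist."""
    try:
        importlib.import_module("vllm._C")
    except ImportError as exc:  # vllm not installed, or the binary failed to load
        return {"error": f"could not load vllm._C: {exc}"}
    return {name: op_registered("_C", name) for name in op_names}
```

Running print(check_vllm_kernels()) in the affected environment and getting {'marlin_qqq_gemm': False} would be consistent with a Python/binary mismatch rather than a problem in the inference script itself.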

JunHowie

QuantTrio org Aug 25

After installing vLLM, simply downgrade transformers to the GLM-4.5V preview build.
The commands you need are:

pip install -U vllm
pip install transformers-v4.55.0-GLM-4.5V-preview

Check the installed versions with:

pip list
pip show vllm transformers
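
If you prefer to check from Python (for example at the top of an inference script), the standard-library importlib.metadata module reports the same installed versions as pip show. A minimal sketch; the helper name installed_versions is hypothetical, and whether the GLM-4.5V preview wheel registers itself under the distribution name transformers is an assumption.

```python
from importlib import metadata


def installed_versions(dist_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in dist_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package, e.g. "vllm: not installed" when it is missing.
    for name, version in installed_versions(["vllm", "transformers", "torch"]).items():
        print(f"{name}: {version or 'not installed'}")
```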

It is also recommended to clear the pip cache:

pip cache purge

To serve this model with vLLM, use:

vllm serve \
    QuantTrio/GLM-4.5V-AWQ \
    --served-model-name GLM-4.5V-AWQ \
    --enable-expert-parallel \
    --tensor-parallel-size 4   # replace 4 with the actual number of GPUs
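
The original post loads the model through the offline LLM class (from vllm import LLM, SamplingParams) rather than vllm serve. A rough Python counterpart of the command above might look like the sketch below. It assumes that LLM(...) forwards tensor_parallel_size and enable_expert_parallel to vLLM's engine arguments the same way the CLI flags do; the helper names and the prompt are hypothetical.

```python
def glm45v_llm_kwargs(num_gpus: int) -> dict:
    """Engine arguments mirroring the `vllm serve` flags above (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "model": "QuantTrio/GLM-4.5V-AWQ",
        "tensor_parallel_size": num_gpus,  # same role as --tensor-parallel-size
        "enable_expert_parallel": True,  # same role as --enable-expert-parallel
    }


def run_glm45v(num_gpus: int, prompt: str, max_tokens: int = 64) -> str:
    """Load the model with vLLM's offline API and return one completion."""
    # Imported here so glm45v_llm_kwargs stays usable without vllm installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**glm45v_llm_kwargs(num_gpus))
    outputs = llm.generate([prompt], SamplingParams(max_tokens=max_tokens))
    return outputs[0].outputs[0].text
```

Usage: print(run_glm45v(4, "Describe this model in one sentence.")), replacing 4 with the actual number of GPUs. --served-model-name only sets the model name exposed by the OpenAI-compatible server, so it has no counterpart here.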

tclf90

QuantTrio org Aug 25

Thank you for your feedback. This error occurs frequently with recent vLLM nightly builds.
As of this post, you can simply install the official vLLM release instead:

pip install vllm==0.10.1.1
pip install transformers-v4.55.0-GLM-4.5V-preview

I have updated the README accordingly.
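
To catch a wrong install before the import fails with the opaque RuntimeError, a script could check the pinned version first. A minimal sketch, assuming the 0.10.1.1 pin recommended here and an exact string comparison (a locally built or suffixed version such as one ending in +cu128 would be reported as a mismatch); the function name vllm_pin_problem is hypothetical.

```python
from importlib import metadata

# Version recommended in this thread at the time of posting.
EXPECTED_VLLM = "0.10.1.1"


def vllm_pin_problem(expected: str = EXPECTED_VLLM):
    """Return a description of the problem, or None if vllm matches the pin."""
    try:
        found = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return "vllm is not installed"
    if found != expected:
        return f"vllm {found} is installed, expected {expected}"
    return None
```

Calling problem = vllm_pin_problem() and raising SystemExit(problem) when it is not None, before from vllm import LLM, SamplingParams, turns the failure into a readable message.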

sunnykaibai

Aug 25

Thank you for your answer; it works!
