The new vLLM does not appear to be functioning.

#3
by eggward - opened

vLLM automatically converts AWQ to awq_marlin!
You can see this in the log:
INFO: The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
At lines 180-181 of awq_marlin.py:

if is_layer_skipped(
prefix, self.modules_to_not_convert, self.packed_modules_mapping
):

But note: it does not pass the skip_with_substr=True parameter! This means:

  1. vLLM automatically converts AWQ to awq_marlin to improve performance
  2. The is_layer_skipped call in awq_marlin.py is missing the skip_with_substr=True argument
  3. By default, is_layer_skipped does an exact match of the full layer prefix against modules_to_not_convert, not a substring match
  4. Under substring matching, modules_to_not_convert: ["model.layers.0."] matches model.layers.0.mlp.down_proj
  5. Under exact matching it does not match, so the layer is (wrongly) quantized! (see the sketch after this list)
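A minimal sketch of the two matching modes, using simplified stand-ins for is_layer_skipped (illustrative only, not vLLM's actual implementation):

def is_skipped_exact(prefix: str, modules_to_not_convert: list[str]) -> bool:
    # Default behaviour: the full layer prefix must appear verbatim in the list.
    return prefix in modules_to_not_convert

def is_skipped_substr(prefix: str, modules_to_not_convert: list[str]) -> bool:
    # skip_with_substr=True behaviour: any list entry contained in the prefix counts.
    return any(m in prefix for m in modules_to_not_convert)

modules = ["model.layers.0."]
layer = "model.layers.0.mlp.down_proj"
print(is_skipped_exact(layer, modules))   # False -> layer would still be quantized
print(is_skipped_substr(layer, modules))  # True  -> layer is left unquantized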

What needs to be modified: /site-packages/vllm/model_executor/layers/quantization/awq_marlin.py
178 isinstance(layer, ParallelLMHead) and self.lm_head_quantized
179 ):
180 if is_layer_skipped(
181 - prefix, self.modules_to_not_convert, self.packed_modules_mapping
181 + prefix, self.modules_to_not_convert, self.packed_modules_mapping,
182 + skip_with_substr=True
183 ):
184 return UnquantizedLinearMethod()
185 # Check if the layer is supported by AWQMarlin.

Also fix the call at line 197 (for FusedMoE):

site-packages/vllm/model_executor/layers/quantization/awq_marlin.py
195 elif isinstance(layer, FusedMoE):
196 from vllm.model_executor.layers.quantization.moe_wna16 import MoeWNA16Config
197
198 - if is_layer_skipped(prefix, getattr(self, "modules_to_not_convert", [])):
198 + if is_layer_skipped(prefix, getattr(self, "modules_to_not_convert", []),
199 + skip_with_substr=True):
200 return UnquantizedFusedMoEMethod(layer.moe_config)
201 if not check_moe_marlin_supports_layer(layer, self.group_size):
202 logger.warning_once(
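For reference, the list that ends up in modules_to_not_convert comes from the model's quantization config; a quick way to inspect it is to read config.json directly (a sketch, assuming the standard AWQ layout with a quantization_config.modules_to_not_convert field; adjust the path to your local download):

import json

config_path = "/mnt/cache/models/deepseek-ai/DeepSeek-V3.2-Exp-AWQ/config.json"
with open(config_path) as f:
    config = json.load(f)

# Entries such as "model.layers.0." only behave as intended under substring matching.
print(config.get("quantization_config", {}).get("modules_to_not_convert"))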

QuantTrio org

We would need to see the launch command on your side (if you cannot bring this model up directly with vLLM).

MODEL_PATH="/mnt/cache/models/deepseek-ai/DeepSeek-V3.2-Exp-AWQ"
MODEL_NAME="MY_MODEL"
HOST="0.0.0.0"
PORT="8000"

# Server settings
MAX_MODEL_LEN=32768
MAX_NUM_SEQS=32
GPU_MEMORY_UTIL=0.95
SWAP_SPACE=16

# Set environment
export VLLM_USE_MODELSCOPE=true

echo "=========================================="
echo "Starting vLLM DeepSeek V3.2 Server"
echo "=========================================="
echo "Model: $MODEL_PATH"
echo "Host: $HOST:$PORT"
echo "Max Model Length: $MAX_MODEL_LEN"
echo "GPU Memory Utilization: $GPU_MEMORY_UTIL"
echo "=========================================="
echo ""

# Start server
vllm serve "$MODEL_PATH" \
    --served-model-name "$MODEL_NAME" \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v31 \
    --swap-space "$SWAP_SPACE" \
    --max-num-seqs "$MAX_NUM_SEQS" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization "$GPU_MEMORY_UTIL" \
    --trust-remote-code \
    --disable-log-requests \
    --host "$HOST" \
    --port "$PORT"

The installation steps followed exactly what you provided.

QuantTrio org

Indeed, the skip_with_substr you pointed out is a parameter newly introduced in upstream vLLM; they updated the awq.py side correctly but missed the awq_marlin.py side.
The commit that introduced the bug is https://github.com/vllm-project/vllm/commit/352c0c8a285414b11373e65fef095af7b07b94d8
An issue needs to be filed upstream.
If you force skip_with_substr=True into awq_marlin.py, does the model load correctly? (We happen to have no H-series GPUs on hand at the moment, so we cannot test it ourselves right now.)
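One way to sanity-check the matching behaviour without loading the whole model is to call is_layer_skipped directly (a sketch; the import path is assumed to be the same module awq_marlin.py pulls it from):

from vllm.model_executor.layers.quantization.utils.quant_utils import is_layer_skipped

modules_to_not_convert = ["model.layers.0."]
layer_prefix = "model.layers.0.mlp.down_proj"

# Unpatched call pattern: exact match, so the layer is NOT skipped.
print(is_layer_skipped(layer_prefix, modules_to_not_convert))
# Patched call pattern: substring match, so the layer IS skipped (kept unquantized).
print(is_layer_skipped(layer_prefix, modules_to_not_convert, skip_with_substr=True))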

I looked into this carefully: with that change the model should load correctly, and in my testing it does, giving reasonable replies.
