The new vLLM does not appear to be working.
vLLM automatically converted AWQ to awq_marlin!
You can see it in the log:
  INFO: The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
At lines 180-181 of awq_marlin.py:
  if is_layer_skipped(
      prefix, self.modules_to_not_convert, self.packed_modules_mapping
  ):                         
But note: it does not pass the skip_with_substr=True parameter! This means:
- vLLM automatically converts AWQ to awq_marlin to improve performance
- The is_layer_skipped call in awq_marlin.py is missing the skip_with_substr=True parameter
- By default, is_layer_skipped uses exact prefix matching, not substring matching
- Under substring matching, modules_to_not_convert: ["model.layers.0."] would match model.layers.0.mlp.down_proj
- But under exact prefix matching it does not match! (See the sketch below.)
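A minimal sketch of the two matching modes (simplified stand-ins, not vLLM's actual is_layer_skipped, which also consults packed_modules_mapping) shows why "model.layers.0." only matches under substring matching:

# Illustrative sketch only: simplified versions of the two matching modes.
def skipped_exact(prefix: str, modules_to_not_convert: list[str]) -> bool:
    # Default behaviour: the layer name must appear verbatim in the list.
    return prefix in modules_to_not_convert

def skipped_substr(prefix: str, modules_to_not_convert: list[str]) -> bool:
    # skip_with_substr=True behaviour: any listed pattern contained in the name counts.
    return any(pattern in prefix for pattern in modules_to_not_convert)

modules_to_not_convert = ["model.layers.0."]
layer = "model.layers.0.mlp.down_proj"

print(skipped_exact(layer, modules_to_not_convert))   # False -> layer would be (wrongly) quantized
print(skipped_substr(layer, modules_to_not_convert))  # True  -> layer is left unquantized as intended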
File to modify: /site-packages/vllm/model_executor/layers/quantization/awq_marlin.py
       178                isinstance(layer, ParallelLMHead) and self.lm_head_quantized
       179            ):
       180                if is_layer_skipped(
       181 -                  prefix, self.modules_to_not_convert, self.packed_modules_mapping
       181 +                  prefix, self.modules_to_not_convert, self.packed_modules_mapping,
       182 +                  skip_with_substr=True
       183                ):
       184                    return UnquantizedLinearMethod()
       185                # Check if the layer is supported by AWQMarlin.
Fix the is_layer_skipped call for FusedMoE as well (line 198 in the snippet below):
 site-packages/vllm/model_executor/layers/quantization/awq_marlin.py
       195            elif isinstance(layer, FusedMoE):
       196                from vllm.model_executor.layers.quantization.moe_wna16 import MoeWNA16Config
       197
       198 -              if is_layer_skipped(prefix, getattr(self, "modules_to_not_convert", [])):
       198 +              if is_layer_skipped(prefix, getattr(self, "modules_to_not_convert", []),
       199 +                                  skip_with_substr=True):
       200                    return UnquantizedFusedMoEMethod(layer.moe_config)
       201                if not check_moe_marlin_supports_layer(layer, self.group_size):
       202                    logger.warning_once(
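Before restarting, it may also be worth confirming which modules the checkpoint actually lists in modules_to_not_convert. A quick check (assumes the standard AutoAWQ-style quantization_config section in config.json; the path is the one used in the launch script below):

# Inspect the checkpoint's quantization config (assumption: AutoAWQ-style
# "quantization_config" in config.json; MODEL_PATH matches the launch script).
import json, os

MODEL_PATH = "/mnt/cache/models/deepseek-ai/DeepSeek-V3.2-Exp-AWQ"
with open(os.path.join(MODEL_PATH, "config.json")) as f:
    cfg = json.load(f)

quant_cfg = cfg.get("quantization_config", {})
print(quant_cfg.get("quant_method"))            # expected: "awq"
print(quant_cfg.get("modules_to_not_convert"))  # expected to include "model.layers.0."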
Could you share the launch command you're using so we can take a look (in case you can't bring the model up with vLLM directly)?
MODEL_PATH="/mnt/cache/models/deepseek-ai/DeepSeek-V3.2-Exp-AWQ"
MODEL_NAME="MY_MODEL"
HOST="0.0.0.0"
PORT="8000"
# Server settings
MAX_MODEL_LEN=32768
MAX_NUM_SEQS=32
GPU_MEMORY_UTIL=0.95
SWAP_SPACE=16
# Set environment
export VLLM_USE_MODELSCOPE=true
echo "=========================================="
echo "Starting vLLM DeepSeek V3.2 Server"
echo "=========================================="
echo "Model: $MODEL_PATH"
echo "Host: $HOST:$PORT"
echo "Max Model Length: $MAX_MODEL_LEN"
echo "GPU Memory Utilization: $GPU_MEMORY_UTIL"
echo "=========================================="
echo ""
# Start server
vllm serve "$MODEL_PATH" \
    --served-model-name "$MODEL_NAME" \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v31 \
    --swap-space "$SWAP_SPACE" \
    --max-num-seqs "$MAX_NUM_SEQS" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization "$GPU_MEMORY_UTIL" \
    --trust-remote-code \
    --disable-log-requests \
    --host "$HOST" \
    --port "$PORT"
The installation followed exactly the steps you provided.
Indeed, the skip_with_substr parameter you pointed out is newly introduced in upstream vLLM; they updated awq.py correctly but did not make the corresponding change in awq_marlin.py.
The commit that introduced the bug: https://github.com/vllm-project/vllm/commit/352c0c8a285414b11373e65fef095af7b07b94d8
An issue needs to be filed upstream.
If we force skip_with_substr=True into awq_marlin.py, will the model load correctly? (I don't have any H-series GPUs on hand right now, so I can't test it for the moment.)
I looked into this carefully: with that change the model should load correctly, and in testing it gives reasonable responses.
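For reference, a minimal sanity check against the server started by the script above (assumes it is reachable at localhost:8000 and serves the model under the name MY_MODEL), using vLLM's OpenAI-compatible chat endpoint:

# Minimal sanity check against the OpenAI-compatible endpoint exposed by
# `vllm serve` (host/port and served model name taken from the script above).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "MY_MODEL",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])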

