The new vLLM does not appear to be functioning.

#3
by eggward - opened

vLLM automatically converts AWQ to awq_marlin!
You can see this in the log:
INFO: The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
At lines 180-181 of awq_marlin.py:

if is_layer_skipped(
prefix, self.modules_to_not_convert, self.packed_modules_mapping
):

But note: it does not pass the skip_with_substr=True parameter! This means:

  1. vLLM automatically converts AWQ to awq_marlin to improve performance
  2. The is_layer_skipped call in awq_marlin.py is missing the skip_with_substr=True argument
  3. By default, is_layer_skipped does an exact match of the full layer prefix against modules_to_not_convert, not a substring match
  4. Under substring matching, modules_to_not_convert: ["model.layers.0."] matches model.layers.0.mlp.down_proj
  5. Under exact matching it does not match, so the layer is (wrongly) quantized! (see the sketch after this list)
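A minimal sketch of the two matching modes, using simplified stand-ins for is_layer_skipped (illustrative only, not vLLM's actual implementation):

def is_skipped_exact(prefix: str, modules_to_not_convert: list[str]) -> bool:
    # Default behaviour: the full layer prefix must appear verbatim in the list.
    return prefix in modules_to_not_convert

def is_skipped_substr(prefix: str, modules_to_not_convert: list[str]) -> bool:
    # skip_with_substr=True behaviour: any list entry contained in the prefix counts.
    return any(m in prefix for m in modules_to_not_convert)

modules = ["model.layers.0."]
layer = "model.layers.0.mlp.down_proj"
print(is_skipped_exact(layer, modules))   # False -> layer would still be quantized
print(is_skipped_substr(layer, modules))  # True  -> layer is left unquantized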

What needs to be modified: /site-packages/vllm/model_executor/layers/quantization/awq_marlin.py
178 isinstance(layer, ParallelLMHead) and self.lm_head_quantized
179 ):
180 if is_layer_skipped(
181 - prefix, self.modules_to_not_convert, self.packed_modules_mapping
181 + prefix, self.modules_to_not_convert, self.packed_modules_mapping,
182 + skip_with_substr=True
183 ):
184 return UnquantizedLinearMethod()
185 # Check if the layer is supported by AWQMarlin.

Also fix the call at line 197 (for FusedMoE):

site-packages/vllm/model_executor/layers/quantization/awq_marlin.py
195 elif isinstance(layer, FusedMoE):
196 from vllm.model_executor.layers.quantization.moe_wna16 import MoeWNA16Config
197
198 - if is_layer_skipped(prefix, getattr(self, "modules_to_not_convert", [])):
198 + if is_layer_skipped(prefix, getattr(self, "modules_to_not_convert", []),
199 + skip_with_substr=True):
200 return UnquantizedFusedMoEMethod(layer.moe_config)
201 if not check_moe_marlin_supports_layer(layer, self.group_size):
202 logger.warning_once(
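For reference, the list that ends up in modules_to_not_convert comes from the model's quantization config; a quick way to inspect it is to read config.json directly (a sketch, assuming the standard AWQ layout with a quantization_config.modules_to_not_convert field; adjust the path to your local download):

import json

config_path = "/mnt/cache/models/deepseek-ai/DeepSeek-V3.2-Exp-AWQ/config.json"
with open(config_path) as f:
    config = json.load(f)

# Entries such as "model.layers.0." only behave as intended under substring matching.
print(config.get("quantization_config", {}).get("modules_to_not_convert"))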

QuantTrio org

We would need to see the launch command on your side (if you cannot bring this model up directly with vLLM).

MODEL_PATH="/mnt/cache/models/deepseek-ai/DeepSeek-V3.2-Exp-AWQ"
MODEL_NAME="MY_MODEL"
HOST="0.0.0.0"
PORT="8000"

# Server settings
MAX_MODEL_LEN=32768
MAX_NUM_SEQS=32
GPU_MEMORY_UTIL=0.95
SWAP_SPACE=16

# Set environment
export VLLM_USE_MODELSCOPE=true

echo "=========================================="
echo "Starting vLLM DeepSeek V3.2 Server"
echo "=========================================="
echo "Model: $MODEL_PATH"
echo "Host: $HOST:$PORT"
echo "Max Model Length: $MAX_MODEL_LEN"
echo "GPU Memory Utilization: $GPU_MEMORY_UTIL"
echo "=========================================="
echo ""

# Start server
vllm serve "$MODEL_PATH" \
    --served-model-name "$MODEL_NAME" \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v31 \
    --swap-space "$SWAP_SPACE" \
    --max-num-seqs "$MAX_NUM_SEQS" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization "$GPU_MEMORY_UTIL" \
    --trust-remote-code \
    --disable-log-requests \
    --host "$HOST" \
    --port "$PORT"

The installation steps followed exactly what you provided.

QuantTrio org

Indeed, the skip_with_substr you pointed out is a parameter newly introduced in upstream vLLM; they updated the awq.py side correctly but missed the awq_marlin.py side.
The commit that introduced the bug is https://github.com/vllm-project/vllm/commit/352c0c8a285414b11373e65fef095af7b07b94d8
An issue needs to be filed upstream.
If you force skip_with_substr=True into awq_marlin.py, does the model load correctly? (We happen to have no H-series GPUs on hand at the moment, so we cannot test it ourselves right now.)
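One way to sanity-check the matching behaviour without loading the whole model is to call is_layer_skipped directly (a sketch; the import path is assumed to be the same module awq_marlin.py pulls it from):

from vllm.model_executor.layers.quantization.utils.quant_utils import is_layer_skipped

modules_to_not_convert = ["model.layers.0."]
layer_prefix = "model.layers.0.mlp.down_proj"

# Unpatched call pattern: exact match, so the layer is NOT skipped.
print(is_layer_skipped(layer_prefix, modules_to_not_convert))
# Patched call pattern: substring match, so the layer IS skipped (kept unquantized).
print(is_layer_skipped(layer_prefix, modules_to_not_convert, skip_with_substr=True))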

I looked into this carefully: with that change the model should load correctly, and in my testing it does, giving reasonable replies.
