Activation Quantization Process

#1
by wantsleep - opened

When I load this quantized model from HuggingFace, am I only loading quantized weights? How does activation quantization work during inference, since I didn't change any forward method?

  • Also, how can I verify whether activation tensors are actually quantized at runtime?
wantsleep changed discussion status to closed
wantsleep changed discussion status to open

W8A8 is a weights-and-activations quantization scheme: the checkpoint you load contains INT8 weights, while activations are quantized to INT8 dynamically at runtime, inside the inference engine's linear kernels rather than in the model's Python forward method. That is why you don't need to change any forward code. When using vLLM, no additional configuration is required: it reads the quantization config from the checkpoint and dispatches to INT8 kernels automatically. Because the quantize/dequantize steps are fused into those kernels, you generally won't observe INT8 activation tensors from a Python-level hook; you can, however, confirm that the weights themselves are INT8 by inspecting the checkpoint's state dict.
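To make the mechanism concrete, here is a minimal NumPy sketch of dynamic symmetric INT8 activation quantization as used conceptually in a W8A8 linear layer. The function names (`quantize_int8`, `w8a8_linear`) are illustrative only, not vLLM's actual API; real kernels fuse these steps and typically use per-channel weight scales and per-token activation scales, but the arithmetic is the same idea.

```python
import numpy as np

def quantize_int8(x):
    """Dynamic symmetric per-tensor INT8 quantization.

    The scale is computed from the tensor's runtime max, so no
    calibration of activations is strictly required here.
    """
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    x_q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return x_q, scale

def w8a8_linear(x, w_q, w_scale):
    """W8A8 linear layer sketch: INT8 weights, INT8 activations.

    Activations are quantized on the fly *inside* the layer, which is
    why the caller's forward code never has to change.
    """
    x_q, x_scale = quantize_int8(x)
    # Integer matmul with INT32 accumulation, then rescale to float.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    return acc.astype(np.float32) * (x_scale * w_scale)

# Toy check that the quantized layer tracks the float reference.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((8, 16)).astype(np.float32)

w_q, w_scale = quantize_int8(w)      # weights quantized once, offline
y_ref = x @ w.T                      # float reference
y_q = w8a8_linear(x, w_q, w_scale)   # W8A8 path

rel_err = np.abs(y_q - y_ref).max() / np.abs(y_ref).max()
print(rel_err < 0.05)
```

Note the asymmetry: weights are quantized once ahead of time (that is what the checkpoint stores), while the activation scale is recomputed per call, since activation ranges depend on the input.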
