Activation Quantization Process

#1
by wantsleep - opened

When I load this quantized model from HuggingFace, am I only loading quantized weights? How does activation quantization work during inference, since I didn't change any forward method?

  • Also, how can I verify whether activation tensors are actually quantized at runtime?
wantsleep changed discussion status to closed
wantsleep changed discussion status to open

W8A8 is a weights-and-activations quantization scheme: the checkpoint you load contains INT8 weights, while activations are quantized to INT8 dynamically at runtime, inside the inference engine's linear kernels rather than in the model's Python forward method. That is why you don't need to change any forward code. When using vLLM, no additional configuration is required: it reads the quantization config from the checkpoint and dispatches to INT8 kernels automatically. Because the quantize/dequantize steps are fused into those kernels, you generally won't observe INT8 activation tensors from a Python-level hook; you can, however, confirm that the weights themselves are INT8 by inspecting the checkpoint's state dict.
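To make the mechanism concrete, here is a minimal NumPy sketch of dynamic symmetric INT8 activation quantization as used conceptually in a W8A8 linear layer. The function names (`quantize_int8`, `w8a8_linear`) are illustrative only, not vLLM's actual API; real kernels fuse these steps and typically use per-channel weight scales and per-token activation scales, but the arithmetic is the same idea.

```python
import numpy as np

def quantize_int8(x):
    """Dynamic symmetric per-tensor INT8 quantization.

    The scale is computed from the tensor's runtime max, so no
    calibration of activations is strictly required here.
    """
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    x_q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return x_q, scale

def w8a8_linear(x, w_q, w_scale):
    """W8A8 linear layer sketch: INT8 weights, INT8 activations.

    Activations are quantized on the fly *inside* the layer, which is
    why the caller's forward code never has to change.
    """
    x_q, x_scale = quantize_int8(x)
    # Integer matmul with INT32 accumulation, then rescale to float.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    return acc.astype(np.float32) * (x_scale * w_scale)

# Toy check that the quantized layer tracks the float reference.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((8, 16)).astype(np.float32)

w_q, w_scale = quantize_int8(w)      # weights quantized once, offline
y_ref = x @ w.T                      # float reference
y_q = w8a8_linear(x, w_q, w_scale)   # W8A8 path

rel_err = np.abs(y_q - y_ref).max() / np.abs(y_ref).max()
print(rel_err < 0.05)
```

Note the asymmetry: weights are quantized once ahead of time (that is what the checkpoint stores), while the activation scale is recomputed per call, since activation ranges depend on the input.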
