# Llama-2-70b-chat-hf-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8

## Introduction

This model was created by applying Quark with calibration samples from the Pile dataset.

## Quantization Strategy

- **Quantized layers:** all linear layers excluding `lm_head`
- **Weight:** auto mixed precision (AMP) quantization by Quark; each weight tensor is assigned one of the following candidate schemes (see the sketch after this list):
  - FP8 symmetric per-tensor
  - OCP Microscaling (MX) FP4
- **Activation:** auto mixed precision quantization by Quark; each activation input uses the same quantization scheme as its corresponding weight, i.e., one of:
  - FP8 symmetric per-tensor
  - OCP Microscaling (MX) FP4
- **KV cache:** FP8 symmetric per-tensor
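
The two candidate schemes differ mainly in scale granularity: FP8 uses one scale for the whole tensor, while MXFP4 shares a power-of-two scale across blocks of 32 values. Below is a minimal pseudo-quantization sketch of both in plain PyTorch. It illustrates the numerics only; it is not Quark's API, all names are hypothetical, and `choose_scheme` is only a conceptual stand-in for Quark's AMP search.

```python
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0
FP4_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp8_per_tensor(w: torch.Tensor) -> torch.Tensor:
    # FP8 symmetric per-tensor: a single scale for the whole tensor.
    scale = w.float().abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    return (w / scale).to(torch.float8_e4m3fn).to(w.dtype) * scale

def fake_quant_mxfp4(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    # OCP MX FP4: each block of 32 values shares a power-of-two (E8M0)
    # scale; elements are rounded to the FP4 (E2M1) grid, whose largest
    # magnitude is 6.0 = 1.5 * 2^2, hence the "- 2" below.
    blocks = w.float().reshape(-1, block)  # assumes numel divisible by block
    amax = blocks.abs().max(dim=1, keepdim=True).values.clamp(min=1e-12)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2)
    scaled = (blocks / scale).clamp(-6.0, 6.0)
    # Round each element to the nearest representable FP4 magnitude.
    idx = (scaled.abs().unsqueeze(-1) - FP4_E2M1_GRID).abs().argmin(dim=-1)
    q = FP4_E2M1_GRID[idx] * torch.sign(scaled)
    return (q * scale).reshape(w.shape).to(w.dtype)

def choose_scheme(w: torch.Tensor) -> str:
    # Conceptual stand-in for AMP scheme selection: prefer the candidate
    # with lower reconstruction error (the real search may also weigh
    # memory and latency budgets).
    err_fp8 = (w - fake_quant_fp8_per_tensor(w)).square().mean()
    err_fp4 = (w - fake_quant_mxfp4(w)).square().mean()
    return "fp8" if err_fp8 <= err_fp4 else "mxfp4"
```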
## Quick Start

1. Download and install Quark.
2. [TODO] We will provide example script(s) to run auto mixed precision (AMP) quantization later.

## Deployment

Quark-quantized Auto Mixed Precision (AMP) models can now be deployed directly on the vLLM backend (vLLM-compatible).
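
As a rough illustration of what vLLM-compatible deployment could look like, here is a minimal sketch. It assumes a vLLM build that recognizes Quark-quantized checkpoints and supports an FP8 KV cache; the exact flags may differ.

```python
from vllm import LLM, SamplingParams

# Minimal sketch, not an official recipe.
llm = LLM(
    model="amd/Llama-2-70b-chat-hf-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8",
    quantization="quark",    # assumption: vLLM's Quark integration name
    kv_cache_dtype="fp8",    # matches the FP8 KV-cache scheme above
    tensor_parallel_size=4,  # a 70B model typically spans multiple GPUs
)

outputs = llm.generate(["Briefly explain FP8 quantization."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```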

## Evaluation

Evaluation is conducted in pseudo-quantization mode, so the scores may differ slightly from actual quantized-inference accuracy. These results are provided for reference only.
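
Pseudo-quantization (often called fake quantization) means each tensor is rounded to the target format and immediately dequantized, while the compute itself stays in high precision. A minimal sketch of the idea, with hypothetical helper names:

```python
import torch

def qdq_fp8(t: torch.Tensor) -> torch.Tensor:
    # Quantize-dequantize: round-trip through FP8 E4M3, returning values
    # in the original dtype.
    scale = t.float().abs().max().clamp(min=1e-12) \
            / torch.finfo(torch.float8_e4m3fn).max
    return (t / scale).to(torch.float8_e4m3fn).to(t.dtype) * scale

x = torch.randn(4, 64, dtype=torch.float16)
w = torch.randn(64, 64, dtype=torch.float16)
# The matmul still runs in FP16; only its operands are rounded, so the
# measured score reflects quantization error, not low-precision kernels.
y = qdq_fp8(x) @ qdq_fp8(w)
```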

### Evaluation scores

| Quant scheme | arc_challenge (acc) ↑ | recovery rate | gsm8k (strict-match) ↑ | recovery rate | mmlu (acc) ↑ | recovery rate | winogrande (acc) ↑ | recovery rate |
|---|---|---|---|---|---|---|---|---|
| FP16 | 0.5290 | 100.0% | 0.5049 | 100.0% | 0.6110 | 100.0% | 0.7490 | 100.0% |
| FP8 | 0.5265 | 99.5% | 0.5262 | 104.2% | 0.6107 | 100.0% | 0.7451 | 99.5% |
| AMP | 0.5273 | 99.7% | 0.5125 | 101.5% | 0.6007 | 98.3% | 0.7324 | 97.8% |
| MXFP4 | 0.5094 | 96.3% | 0.4572 | 90.6% | 0.5869 | 96.1% | 0.7316 | 97.7% |
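
The recovery rate is the quantized score divided by the FP16 baseline on the same task; for example, FP8 on arc_challenge recovers 0.5265 / 0.5290 ≈ 99.5%, while its gsm8k score slightly exceeds the baseline (0.5262 / 0.5049 ≈ 104.2%).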

## License

Modifications copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.

Built with Meta Llama.

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
