Update README.md
README.md
CHANGED
@@ -223,7 +223,7 @@ Since the checkpoint is tuned on `mmlu_pro`, we check against the accuracy for `
 | Benchmark | | | |
 |----------------------------------|----------------|---------------------------|---------------------------|
 | | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4
-| mmlu_pro
+| mmlu_pro | 46.43 | 36.74 | 43.13 |


 <details>
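As a companion to the accuracy row above, here is a minimal sketch of how the `mmlu_pro` number could be checked with EleutherAI's lm-evaluation-harness. The task name and checkpoint id come from the table; the Python API call and batch size are assumptions, and the repo's own reproduce details section remains the authoritative recipe.

```python
# Minimal sketch, assuming a recent lm-eval and a transformers build that can
# load torchao-quantized checkpoints; batch_size here is an arbitrary choice.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-AWQ-INT4",
    tasks=["mmlu_pro"],
    batch_size=8,
)
print(results["results"])  # exact result keys vary by lm-eval version
```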
@@ -313,13 +313,14 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 ## Results (H100 machine)
 | Benchmark (Latency) | | | |
 |----------------------------------|----------------|--------------------------|--------------------------|
-| | microsoft/Phi-4-mini-instruct
-| latency (batch_size=1) | 1.
-| latency (batch_size=256) | 5.
+| | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4
+| latency (batch_size=1) | 1.61s | 1.09s (1.48x speedup) | 1.37s (1.17x speedup) |
+| latency (batch_size=256) | 5.31s | 32.33s | 5.44s (0.98x speedup) |

 Note: it's expected that the AWQ-INT4 checkpoint is slower at batch size 256, since the workload is no longer memory bound but becomes compute bound at larger batch sizes, while the
 INT4 weight-only checkpoint is only expected to show a speedup in memory-bound situations.
-Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4 which is a checkpoint for H100, since the AWQ-INT4 is
+Note: we are comparing against jerryzh168/Phi-4-mini-instruct-INT4, which is a checkpoint for H100, since the AWQ-INT4 checkpoint uses the new INT4 config that's optimized for H100. It's possible to generate
+AWQ-INT4 for A100 as well using `Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")`.

 <details>
 <summary> Reproduce Model Performance Results </summary>
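The first note above is a roofline argument. Here is a back-of-the-envelope sketch (not from the README) of why INT4 weight-only helps at batch size 1 but not at 256; the H100 spec numbers and the roughly 3072-wide projection shape are ballpark assumptions, and real kernels add overheads this idealized model ignores (which is why the measured INT4 row can even be slower than bf16 when compute bound).

```python
# Illustrative roofline estimate for a single linear layer on one GPU.
# Assumed H100 SXM ballpark specs; all numbers are approximations.
PEAK_BF16_TFLOPS = 990   # dense tensor-core throughput
HBM_BW_TBPS = 3.35       # HBM3 bandwidth

def gemm_time_us(batch, n=3072, k=3072, weight_bytes=2.0):
    """Estimated time for a (batch x k) @ (k x n) projection."""
    flops = 2 * batch * n * k
    bytes_moved = n * k * weight_bytes  # weight traffic dominates at small batch
    t_compute = flops / (PEAK_BF16_TFLOPS * 1e12)
    t_memory = bytes_moved / (HBM_BW_TBPS * 1e12)
    return max(t_compute, t_memory) * 1e6  # bound by the slower of the two

for batch in (1, 256):
    bf16 = gemm_time_us(batch, weight_bytes=2.0)  # bf16 weights
    int4 = gemm_time_us(batch, weight_bytes=0.5)  # int4 weights (scales ignored)
    print(f"batch={batch:>3}: bf16 ~{bf16:.2f}us, int4 ~{int4:.2f}us "
          f"-> {bf16 / int4:.1f}x")
```

Under these assumptions the int4 weights cut memory traffic ~4x at batch size 1 (memory bound), while at batch size 256 both variants sit near the same compute ceiling, so the speedup collapses toward 1x.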
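To make the config in the second note concrete, here is a hedged sketch of how one might apply it with torchao's `quantize_`. The config kwargs are quoted from the note (with the `int4_packing_format` spelling corrected); the model-loading details are assumptions, and the AWQ calibration flow from the repo's reproduce section is omitted, so this produces a plain INT4 weight-only checkpoint.

```python
# Sketch only: applies the README's A100-oriented INT4 config with torchao.
# Assumes a recent torchao release that exposes these config knobs.
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Config quoted from the note above (typo fixed to int4_packing_format).
config = Int4WeightOnlyConfig(
    group_size=128,
    int4_packing_format="tile_packed_to_4d",
    int4_choose_qparams_algorithm="hqq",
)
quantize_(model, config)  # replaces linear weights with int4-packed tensors in place
```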