Update README.md
README.md
CHANGED
@@ -223,7 +223,7 @@ Since the checkpoint is tuned on `mmlu_pro`, we check against the accuracy for `
 | Benchmark | | | |
 |----------------------------------|----------------|---------------------------|---------------------------|
 | | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4
-| mmlu_pro
+| mmlu_pro | 46.43 | 36.74 | 43.13 |


 <details>
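As a companion to the accuracy row above, here is a minimal sketch of how the `mmlu_pro` number could be checked with EleutherAI's lm-evaluation-harness. The task name and checkpoint id come from the table; the Python API call and batch size are assumptions, and the repo's own reproduce details section remains the authoritative recipe.

```python
# Minimal sketch, assuming a recent lm-eval and a transformers build that can
# load torchao-quantized checkpoints; batch_size here is an arbitrary choice.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-AWQ-INT4",
    tasks=["mmlu_pro"],
    batch_size=8,
)
print(results["results"])  # exact result keys vary by lm-eval version
```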
@@ -313,13 +313,14 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 ## Results (H100 machine)
 | Benchmark (Latency) | | | |
 |----------------------------------|----------------|--------------------------|--------------------------|
-| | microsoft/Phi-4-mini-instruct
-| latency (batch_size=1) | 1.
-| latency (batch_size=256) | 5.
+| | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4
+| latency (batch_size=1) | 1.61s | 1.09s (1.48x speedup) | 1.37s (1.17x speedup) |
+| latency (batch_size=256) | 5.31s | 32.33s | 5.44s (0.98x speedup) |

 Note: it's expected that the AWQ-INT4 checkpoint is slower at batch size 256, since the workload is no longer memory bound but becomes compute bound at larger batch sizes, while the
 INT4 weight-only checkpoint is only expected to show a speedup in memory-bound situations.
-Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4 which is a checkpoint for H100, since the AWQ-INT4 is
+Note: we are comparing against jerryzh168/Phi-4-mini-instruct-INT4, which is a checkpoint for H100, since the AWQ-INT4 checkpoint uses the new INT4 config that's optimized for H100. It's possible to generate
+AWQ-INT4 for A100 as well using `Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")`.

 <details>
 <summary> Reproduce Model Performance Results </summary>
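The first note above is a roofline argument. Here is a back-of-the-envelope sketch (not from the README) of why INT4 weight-only helps at batch size 1 but not at 256; the H100 spec numbers and the roughly 3072-wide projection shape are ballpark assumptions, and real kernels add overheads this idealized model ignores (which is why the measured INT4 row can even be slower than bf16 when compute bound).

```python
# Illustrative roofline estimate for a single linear layer on one GPU.
# Assumed H100 SXM ballpark specs; all numbers are approximations.
PEAK_BF16_TFLOPS = 990   # dense tensor-core throughput
HBM_BW_TBPS = 3.35       # HBM3 bandwidth

def gemm_time_us(batch, n=3072, k=3072, weight_bytes=2.0):
    """Estimated time for a (batch x k) @ (k x n) projection."""
    flops = 2 * batch * n * k
    bytes_moved = n * k * weight_bytes  # weight traffic dominates at small batch
    t_compute = flops / (PEAK_BF16_TFLOPS * 1e12)
    t_memory = bytes_moved / (HBM_BW_TBPS * 1e12)
    return max(t_compute, t_memory) * 1e6  # bound by the slower of the two

for batch in (1, 256):
    bf16 = gemm_time_us(batch, weight_bytes=2.0)  # bf16 weights
    int4 = gemm_time_us(batch, weight_bytes=0.5)  # int4 weights (scales ignored)
    print(f"batch={batch:>3}: bf16 ~{bf16:.2f}us, int4 ~{int4:.2f}us "
          f"-> {bf16 / int4:.1f}x")
```

Under these assumptions the int4 weights cut memory traffic ~4x at batch size 1 (memory bound), while at batch size 256 both variants sit near the same compute ceiling, so the speedup collapses toward 1x.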
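To make the config in the second note concrete, here is a hedged sketch of how one might apply it with torchao's `quantize_`. The config kwargs are quoted from the note (with the `int4_packing_format` spelling corrected); the model-loading details are assumptions, and the AWQ calibration flow from the repo's reproduce section is omitted, so this produces a plain INT4 weight-only checkpoint.

```python
# Sketch only: applies the README's A100-oriented INT4 config with torchao.
# Assumes a recent torchao release that exposes these config knobs.
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Config quoted from the note above (typo fixed to int4_packing_format).
config = Int4WeightOnlyConfig(
    group_size=128,
    int4_packing_format="tile_packed_to_4d",
    int4_choose_qparams_algorithm="hqq",
)
quantize_(model, config)  # replaces linear weights with int4-packed tensors in place
```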