jerryzh168 committed
Commit a2f67ff · verified · Parent(s): 32d06e3

Update README.md

Files changed (1):
  1. README.md (+6 -5)
README.md CHANGED

@@ -223,7 +223,7 @@ Since the checkpoint is tuned on `mmlu_pro`, we check against the accuracy for `mmlu_pro`:
 | Benchmark | | | |
 |----------------------------------|----------------|---------------------------|---------------------------|
 | | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
-| mmlu_pro | 46.43 | 36.74 | 43.13 |
+| mmlu_pro | 46.43 | 36.74 | 43.13 |
 
 
 <details>

@@ -313,13 +313,14 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 ## Results (H100 machine)
 | Benchmark (Latency) | | | |
 |----------------------------------|----------------|--------------------------|--------------------------|
-| | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
-| latency (batch_size=1) | 1.60s | TODO | 1.37s (1.17x speedup) |
-| latency (batch_size=256) | 5.47s | TODO | 5.55s (0.98x speedup) |
+| | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
+| latency (batch_size=1) | 1.61s | 1.09s (1.48x speedup) | 1.37s (1.17x speedup) |
+| latency (batch_size=256) | 5.31s | 32.33s | 5.44s (0.98x speedup) |
 
 Note: it's expected that the AWQ-INT4 checkpoint is slower at batch size 256: the problem is memory bound at small batch sizes but becomes compute bound at larger ones, and the
 int4 weight-only checkpoint is only expected to give a speedup in memory-bound situations.
-Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4, which is a checkpoint for H100, since AWQ-INT4 is for the new INT4 config that's optimized for H100.
+Note: we are comparing to jerryzh168/Phi-4-mini-instruct-INT4, which is a checkpoint for H100, since AWQ-INT4 uses the new INT4 config that's optimized for H100. It's possible to generate
+AWQ-INT4 for A100 as well using `Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")`.
 
 
 <details>
 <summary> Reproduce Model Performance Results </summary>
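
For context (not part of the commit): the note in the second hunk points at torchao's `Int4WeightOnlyConfig` for producing an A100-friendly INT4 checkpoint. Below is a minimal sketch of how that config could be applied through transformers' `TorchAoConfig`, assuming a recent torchao that exposes the `int4_packing_format` and `int4_choose_qparams_algorithm` arguments; the target repo name is hypothetical.

```python
# Sketch only: applies the Int4WeightOnlyConfig variant mentioned in the note
# above. Assumes recent transformers + torchao; argument support varies by
# torchao version, so check yours.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "microsoft/Phi-4-mini-instruct"

# A100-oriented variant from the note; the released AWQ-INT4 checkpoint uses
# the H100-optimized INT4 config instead.
quant_config = Int4WeightOnlyConfig(
    group_size=128,
    int4_packing_format="tile_packed_to_4d",
    int4_choose_qparams_algorithm="hqq",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=TorchAoConfig(quant_type=quant_config),
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical repo name; torchao checkpoints need safe_serialization=False.
# model.push_to_hub("your-org/Phi-4-mini-instruct-INT4-A100", safe_serialization=False)
```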
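The mmlu_pro numbers in the first hunk can presumably be reproduced with EleutherAI's lm-evaluation-harness; here is a sketch using its Python API (the exact harness version and settings behind the table are not shown in this diff, and the batch size is an assumption).

```python
# Sketch only: evaluating the AWQ-INT4 checkpoint on mmlu_pro with lm-eval
# (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-AWQ-INT4",
    tasks=["mmlu_pro"],
    batch_size=8,
)
# mmlu_pro is reported as a group score in the results dict.
print(results["results"]["mmlu_pro"])
```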
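And a rough probe of the memory-bound vs compute-bound behavior the latency note describes, timing `generate` at batch sizes 1 and 256; the prompt and token counts here are assumptions, so numbers won't match the table exactly.

```python
# Sketch only: crude latency probe for the memory- vs compute-bound note.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-AWQ-INT4"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def time_generate(batch_size: int, new_tokens: int = 128) -> float:
    prompts = ["What are we having for dinner?"] * batch_size
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    model.generate(**inputs, max_new_tokens=new_tokens)  # warmup
    torch.cuda.synchronize()
    start.record()
    model.generate(**inputs, max_new_tokens=new_tokens)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000  # seconds

# Decode at batch 1 is memory bound (INT4 weights help); at batch 256 the
# matmuls become compute bound and INT4 weight-only no longer wins.
for bs in (1, 256):
    print(f"batch_size={bs}: {time_generate(bs):.2f}s")
```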