pytorch/gemma-3-12b-it-INT4

@@ -188,7 +188,7 @@ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-h
 | Benchmark                        |                |                           |
 |----------------------------------|----------------|---------------------------|
 |                                  | google/gemma-3-12b-it   | jerryzh168/gemma-3-12b-it-INT4         |
-| mmlu                             | To be filled   | To be filled                      |
 <details>
@@ -219,7 +219,7 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 | Benchmark        |                |                                |
 |------------------|----------------|--------------------------------|
 |                  | google/gemma-3-12b-it   | jerryzh168/gemma-3-12b-it-INT4              |
-| Peak Memory (GB) | To be filled   | To be filled (?% reduction)    |
@@ -279,7 +279,7 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 | Benchmark (Latency)              |                |                          |
 |----------------------------------|----------------|--------------------------|
 |                                  | google/gemma-3-12b-it   | jerryzh168/gemma-3-12b-it-INT4        |
-| latency (batch_size=1)           | ?s          | ?s (?x speedup)    |
 <details>
 <summary> Reproduce Model Performance Results </summary>
@@ -311,48 +311,6 @@ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
 export MODEL=jerryzh168/gemma-3-12b-it-INT4
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
 ```
-## benchmark_serving
-We benchmarked the throughput in a serving environment.
-Download sharegpt dataset:
-```Shell
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-```
-Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
-Note: you can change the number of prompts to be benchmarked with `--num-prompts` argument for `benchmark_serving` script.
-### baseline
-Server:
-```Shell
-export MODEL=google/gemma-3-12b-it
-vllm serve $MODEL --tokenizer $MODEL -O3
-```
-Client:
-```Shell
-export MODEL=google/gemma-3-12b-it
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
-```
-### INT4
-Server:
-```Shell
-export MODEL=jerryzh168/gemma-3-12b-it-INT4
-VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer $MODEL -O3 --pt-load-map-location cuda:0
-```
-Client:
-```Shell
-export MODEL=jerryzh168/gemma-3-12b-it-INT4
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
-```
 </details>

 | Benchmark                        |                |                           |
 |----------------------------------|----------------|---------------------------|
 |                                  | google/gemma-3-12b-it   | jerryzh168/gemma-3-12b-it-INT4         |
+| mmlu                             | 71.51   | 68.96                      |
 <details>
 | Benchmark        |                |                                |
 |------------------|----------------|--------------------------------|
 |                  | google/gemma-3-12b-it   | jerryzh168/gemma-3-12b-it-INT4              |
+| Peak Memory (GB) | 24.50   | 8.68 (65% reduction)    |
 | Benchmark (Latency)              |                |                          |
 |----------------------------------|----------------|--------------------------|
 |                                  | google/gemma-3-12b-it   | jerryzh168/gemma-3-12b-it-INT4        |
+| latency (batch_size=1)           | 3.73s          | 2.16s (1.73x speedup)    |
 <details>
 <summary> Reproduce Model Performance Results </summary>
 export MODEL=jerryzh168/gemma-3-12b-it-INT4
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
 ```
 </details>
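
The derived figures in the added rows ("65% reduction", "1.73x speedup") can be checked from the raw numbers in the same tables. The snippet below is a minimal sketch of that arithmetic. The helper names `percent_reduction` and `speedup` are hypothetical and chosen for illustration; they are not part of the model card or of any benchmark script.

```python
# Sanity-check the derived figures reported in the updated tables.
# All input numbers are copied from the added rows of this diff;
# the helper functions are hypothetical, for illustration only.

def percent_reduction(baseline: float, quantized: float) -> float:
    """Relative reduction of `quantized` versus `baseline`, in percent."""
    return (1.0 - quantized / baseline) * 100.0


def speedup(baseline_latency: float, quantized_latency: float) -> float:
    """How many times faster the quantized run is than the baseline."""
    return baseline_latency / quantized_latency


# Peak Memory (GB): 24.50 for google/gemma-3-12b-it, 8.68 for the INT4 checkpoint
peak_memory_reduction = percent_reduction(24.50, 8.68)

# latency (batch_size=1): 3.73s baseline, 2.16s INT4
latency_speedup = speedup(3.73, 2.16)

# mmlu: 71.51 baseline, 68.96 INT4
mmlu_drop = 71.51 - 68.96

print(f"Peak memory reduction: {peak_memory_reduction:.0f}%")  # 65%
print(f"Latency speedup: {latency_speedup:.2f}x")              # 1.73x
print(f"MMLU drop: {mmlu_drop:.2f} points")                    # 2.55 points
```

Under these numbers the INT4 checkpoint gives up about 2.55 MMLU points in exchange for the memory and latency gains shown above.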