kunhunjon committed (verified)
Commit 9452995 · Parent(s): 6223ae4

Update to continuous batching model (batch_size=4, neuronxcc 2.21)

Files changed (3):
  1. README.md +24 -4
  2. model.pt +2 -2
  3. neuron_config.json +4 -4
README.md CHANGED
@@ -9,22 +9,25 @@ tags:
 - aws-trainium
 - vllm
 - optimum-neuron
+- continuous-batching
 base_model: karanps/ChessLM_Qwen3
 ---
 
-# ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia
+# ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia (Continuous Batching)
 
-This is a Neuron-traced version of [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3) optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM.
+This is a Neuron-traced version of [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3) optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM, with **continuous batching enabled**.
 
 ## Model Details
 
 - **Base Model**: Qwen3-2B fine-tuned for chess
 - **Compilation**: optimum-neuron[vllm]==0.3.0
+- **Compiler Version**: neuronxcc 2.21.33363.0
 - **Target Hardware**: AWS Trainium (trn1) / Inferentia (inf2)
 - **Precision**: BF16
 - **Tensor Parallelism**: 2 cores
-- **Batch Size**: 1
+- **Batch Size**: 4 (continuous batching enabled)
 - **Max Sequence Length**: 2048
+- **On-Device Sampling**: Disabled (due to a runtime limitation with TP=2)
 
 ## Requirements
 
@@ -62,12 +65,29 @@ print(result)
 ## Compilation Details
 
 This model was traced with the following parameters:
-- `batch_size=1`
+- `batch_size=4`
 - `sequence_length=2048`
 - `num_cores=2`
 - `auto_cast_type="bf16"`
+- `continuous_batching=True`
 - vLLM-compatible compilation
 
+### Continuous Batching
+
+This model is compiled with **continuous batching enabled**, which allows vLLM to:
+- Process multiple requests simultaneously with dynamic batch sizes up to 4
+- Improve throughput by batching requests with different sequence lengths
+- Reduce latency for concurrent inference workloads
+
+**Note**: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism across 2 cores; sampling is handled on the host instead.
+
+## Compilation Metrics
+
+- **Total compilation time**: ~8.1 minutes
+- **Token generation model**: 219 seconds
+- **Context encoding model**: 165 seconds
+- **Model size**: 17 GB
+
 ## License
 
 This model inherits the license from the base model [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3).
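For context, the "Compilation Details" parameters above map onto optimum-neuron's decoder export API. Below is a minimal re-export sketch, assuming the `NeuronModelForCausalLM` export path; whether `continuous_batching` must be passed as an explicit keyword or is inferred from `batch_size > 1` varies by optimum-neuron version, so treat that flag and the output directory name as assumptions, not this repo's exact build command.

```python
# Minimal re-export sketch, not taken from this repo: re-traces the base model
# with the parameters listed in the updated "Compilation Details" section.
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained(
    "karanps/ChessLM_Qwen3",
    export=True,             # compile for Neuron rather than load a traced artifact
    batch_size=4,            # mirrors "batch_size": 4 in neuron_config.json
    sequence_length=2048,    # mirrors "max_context_length": 2048
    num_cores=2,             # tensor parallelism across 2 Neuron cores
    auto_cast_type="bf16",   # BF16 precision
)
model.save_pretrained("chesslm-qwen3-neuron-cb")  # hypothetical output directory
```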
model.pt CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:541d0dc6938c399d6dd91de76b6fa232a47fa6ab82fbf84010f547afd8988c48
-size 16707986819
+oid sha256:464989b5c79dac0618dd8b9d1c58df8196ec48f89f913ca9ad1e530e04edff5f
+size 17614391015
neuron_config.json CHANGED
@@ -2,12 +2,12 @@
   "_serialized_key": "NxDNeuronConfig",
   "async_mode": false,
   "attn_kernel_enabled": false,
-  "batch_size": 1,
+  "batch_size": 4,
   "capacity_factor": null,
   "cc_pipeline_tiling_factor": 2,
   "checkpoint_id": "karanps/ChessLM_Qwen3",
   "checkpoint_revision": "e0d57507d96b2be2dd0dc901ecb231dec2dd6330",
-  "continuous_batching": false,
+  "continuous_batching": true,
   "enable_bucketing": false,
   "ep_degree": 1,
   "flash_decoding_enabled": false,
@@ -16,7 +16,7 @@
   "is_chunked_prefill": false,
   "local_ranks_size": 2,
   "logical_nc_config": 1,
-  "max_batch_size": 1,
+  "max_batch_size": 4,
   "max_context_length": 2048,
   "max_topk": 256,
   "mlp_kernel_enabled": false,
@@ -24,7 +24,7 @@
   "n_active_tokens": 2048,
   "neuronxcc_version": "2.21.33363.0+82129205",
   "num_cores_per_group": 1,
-  "on_device_sampling": true,
+  "on_device_sampling": false,
   "optimum_neuron_version": "0.3.0",
   "output_logits": false,
   "padding_side": "right",