kunhunjon committed (verified)
Commit 9452995 · Parent(s): 6223ae4

Update to continuous batching model (batch_size=4, neuronxcc 2.21)

Files changed (3):
  1. README.md +24 -4
  2. model.pt +2 -2
  3. neuron_config.json +4 -4
README.md CHANGED
@@ -9,22 +9,25 @@ tags:
 - aws-trainium
 - vllm
 - optimum-neuron
+- continuous-batching
 base_model: karanps/ChessLM_Qwen3
 ---
 
-# ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia
+# ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia (Continuous Batching)
 
-This is a Neuron-traced version of [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3) optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM.
+This is a Neuron-traced version of [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3) optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM, with **continuous batching enabled**.
 
 ## Model Details
 
 - **Base Model**: Qwen3-2B fine-tuned for chess
 - **Compilation**: optimum-neuron[vllm]==0.3.0
+- **Compiler Version**: neuronxcc 2.21.33363.0
 - **Target Hardware**: AWS Trainium (trn1) / Inferentia (inf2)
 - **Precision**: BF16
 - **Tensor Parallelism**: 2 cores
-- **Batch Size**: 1
+- **Batch Size**: 4 (continuous batching enabled)
 - **Max Sequence Length**: 2048
+- **On-Device Sampling**: Disabled (due to a runtime limitation with TP=2)
 
 ## Requirements
 
@@ -62,12 +65,29 @@ print(result)
 ## Compilation Details
 
 This model was traced with the following parameters:
-- `batch_size=1`
+- `batch_size=4`
 - `sequence_length=2048`
 - `num_cores=2`
 - `auto_cast_type="bf16"`
+- `continuous_batching=True`
 - vLLM-compatible compilation
 
+### Continuous Batching
+
+This model is compiled with **continuous batching enabled**, which allows vLLM to:
+- Process multiple requests simultaneously with dynamic batch sizes up to 4
+- Improve throughput by batching requests with different sequence lengths
+- Reduce latency for concurrent inference workloads
+
+**Note**: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism across 2 cores; sampling is handled on the host instead.
+
+## Compilation Metrics
+
+- **Total compilation time**: ~8.1 minutes
+- **Token generation model**: 219 seconds
+- **Context encoding model**: 165 seconds
+- **Model size**: 17 GB
+
 ## License
 
 This model inherits the license from the base model [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3).
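For context, the "Compilation Details" parameters above map onto optimum-neuron's decoder export API. Below is a minimal re-export sketch, assuming the `NeuronModelForCausalLM` export path; whether `continuous_batching` must be passed as an explicit keyword or is inferred from `batch_size > 1` varies by optimum-neuron version, so treat that flag and the output directory name as assumptions, not this repo's exact build command.

```python
# Minimal re-export sketch, not taken from this repo: re-traces the base model
# with the parameters listed in the updated "Compilation Details" section.
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained(
    "karanps/ChessLM_Qwen3",
    export=True,             # compile for Neuron rather than load a traced artifact
    batch_size=4,            # mirrors "batch_size": 4 in neuron_config.json
    sequence_length=2048,    # mirrors "max_context_length": 2048
    num_cores=2,             # tensor parallelism across 2 Neuron cores
    auto_cast_type="bf16",   # BF16 precision
)
model.save_pretrained("chesslm-qwen3-neuron-cb")  # hypothetical output directory
```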
model.pt CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:541d0dc6938c399d6dd91de76b6fa232a47fa6ab82fbf84010f547afd8988c48
-size 16707986819
+oid sha256:464989b5c79dac0618dd8b9d1c58df8196ec48f89f913ca9ad1e530e04edff5f
+size 17614391015
neuron_config.json CHANGED
@@ -2,12 +2,12 @@
   "_serialized_key": "NxDNeuronConfig",
   "async_mode": false,
   "attn_kernel_enabled": false,
-  "batch_size": 1,
+  "batch_size": 4,
   "capacity_factor": null,
   "cc_pipeline_tiling_factor": 2,
   "checkpoint_id": "karanps/ChessLM_Qwen3",
   "checkpoint_revision": "e0d57507d96b2be2dd0dc901ecb231dec2dd6330",
-  "continuous_batching": false,
+  "continuous_batching": true,
   "enable_bucketing": false,
   "ep_degree": 1,
   "flash_decoding_enabled": false,
@@ -16,7 +16,7 @@
   "is_chunked_prefill": false,
   "local_ranks_size": 2,
   "logical_nc_config": 1,
-  "max_batch_size": 1,
+  "max_batch_size": 4,
   "max_context_length": 2048,
   "max_topk": 256,
   "mlp_kernel_enabled": false,
@@ -24,7 +24,7 @@
   "n_active_tokens": 2048,
   "neuronxcc_version": "2.21.33363.0+82129205",
   "num_cores_per_group": 1,
-  "on_device_sampling": true,
+  "on_device_sampling": false,
   "optimum_neuron_version": "0.3.0",
   "output_logits": false,
   "padding_side": "right",