Update to continuous batching model (batch_size=4, neuronxcc 2.21)
Files changed:
- README.md (+24 -4)
- model.pt (+2 -2)
- neuron_config.json (+4 -4)
README.md
CHANGED
@@ -9,22 +9,25 @@ tags:
 - aws-trainium
 - vllm
 - optimum-neuron
+- continuous-batching
 base_model: karanps/ChessLM_Qwen3
 ---
 
-# ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia
+# ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia (Continuous Batching)
 
-This is a Neuron-traced version of [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3) optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM
+This is a Neuron-traced version of [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3) optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM with **continuous batching enabled**.
 
 ## Model Details
 
 - **Base Model**: Qwen3-2B fine-tuned for chess
 - **Compilation**: optimum-neuron[vllm]==0.3.0
+- **Compiler Version**: neuronxcc 2.21.33363.0
 - **Target Hardware**: AWS Trainium (trn1) / Inferentia (inf2)
 - **Precision**: BF16
 - **Tensor Parallelism**: 2 cores
-- **Batch Size**:
+- **Batch Size**: 4 (continuous batching enabled)
 - **Max Sequence Length**: 2048
+- **On-Device Sampling**: Disabled (due to runtime limitation with TP=2)
 
 ## Requirements
@@ -62,12 +65,29 @@ print(result)
 ## Compilation Details
 
 This model was traced with the following parameters:
-- `batch_size=
+- `batch_size=4`
 - `sequence_length=2048`
 - `num_cores=2`
 - `auto_cast_type="bf16"`
+- `continuous_batching=True`
 - vLLM-compatible compilation
 
+### Continuous Batching
+
+This model is compiled with **continuous batching enabled**, which allows vLLM to:
+
+- Process multiple requests simultaneously with dynamic batch sizes up to 4
+- Optimize throughput by batching requests with different sequence lengths
+- Reduce latency for concurrent inference workloads
+
+**Note**: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.
+
+## Compilation Metrics
+
+- **Total compilation time**: ~8.1 minutes
+- **Token generation model**: 219 seconds
+- **Context encoding model**: 165 seconds
+- **Model size**: 17GB
+
 ## License
 
 This model inherits the license from the base model [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3).
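The "Continuous Batching" section added above can be illustrated with a toy scheduler. This is a pure-Python sketch of the scheduling idea only, not vLLM's actual implementation; the function names and the step-count model (one token per active request per step) are our simplifying assumptions.

```python
from collections import deque

MAX_BATCH = 4  # matches the compiled batch_size of this model

def continuous_batching_steps(request_lengths):
    """Simulate decode steps: each step serves up to MAX_BATCH active
    requests, and finished slots are refilled immediately (continuous
    batching) instead of waiting for the whole batch to drain."""
    waiting = deque(request_lengths)   # remaining tokens per queued request
    active = []                        # remaining tokens per in-flight request
    steps = 0
    while waiting or active:
        # Admit new requests into free slots as soon as they open up.
        while waiting and len(active) < MAX_BATCH:
            active.append(waiting.popleft())
        # One decode step generates one token for every active request.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps

def static_batching_steps(request_lengths):
    """Static batching for contrast: a batch admits new work only after
    every request in it finishes, so each batch costs max(lengths) steps."""
    lengths = list(request_lengths)
    steps = 0
    for i in range(0, len(lengths), MAX_BATCH):
        steps += max(lengths[i:i + MAX_BATCH])
    return steps
```

With mixed-length requests such as `[8, 1, 1, 1, 8]`, the continuous scheduler finishes in fewer decode steps than the static one, which is the throughput benefit the README describes for requests with different sequence lengths.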
model.pt
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:464989b5c79dac0618dd8b9d1c58df8196ec48f89f913ca9ad1e530e04edff5f
+size 17614391015
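The model.pt entry above is a Git LFS pointer file, not the weights themselves: three `key value` lines giving the spec version, the SHA-256 of the real blob, and its size in bytes. A minimal parser (stdlib only; the helper name `parse_lfs_pointer` is ours):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# Pointer contents from the updated model.pt in this commit.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:464989b5c79dac0618dd8b9d1c58df8196ec48f89f913ca9ad1e530e04edff5f
size 17614391015
"""

info = parse_lfs_pointer(pointer)
size_bytes = int(info["size"])  # ~17.6 GB decimal / ~16.4 GiB; README rounds to 17GB
```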
neuron_config.json
CHANGED
@@ -2,12 +2,12 @@
 "_serialized_key": "NxDNeuronConfig",
 "async_mode": false,
 "attn_kernel_enabled": false,
-"batch_size":
+"batch_size": 4,
 "capacity_factor": null,
 "cc_pipeline_tiling_factor": 2,
 "checkpoint_id": "karanps/ChessLM_Qwen3",
 "checkpoint_revision": "e0d57507d96b2be2dd0dc901ecb231dec2dd6330",
-"continuous_batching":
+"continuous_batching": true,
 "enable_bucketing": false,
 "ep_degree": 1,
 "flash_decoding_enabled": false,
@@ -16,7 +16,7 @@
 "is_chunked_prefill": false,
 "local_ranks_size": 2,
 "logical_nc_config": 1,
-"max_batch_size":
+"max_batch_size": 4,
 "max_context_length": 2048,
 "max_topk": 256,
 "mlp_kernel_enabled": false,
@@ -24,7 +24,7 @@
 "n_active_tokens": 2048,
 "neuronxcc_version": "2.21.33363.0+82129205",
 "num_cores_per_group": 1,
-"on_device_sampling":
+"on_device_sampling": false,
 "optimum_neuron_version": "0.3.0",
 "output_logits": false,
 "padding_side": "right",
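The four fields changed above have to stay mutually consistent with what the README claims. A small sanity-check sketch (the values are copied from the updated neuron_config.json; the `check_config` helper and its invariants are our own reading of the README, not part of optimum-neuron):

```python
# Values copied from the updated neuron_config.json in this commit.
neuron_config = {
    "batch_size": 4,
    "continuous_batching": True,
    "max_batch_size": 4,
    "max_context_length": 2048,
    "n_active_tokens": 2048,
    "on_device_sampling": False,
    "local_ranks_size": 2,  # tensor parallelism across 2 cores
    "neuronxcc_version": "2.21.33363.0+82129205",
}

def check_config(cfg):
    """Check the invariants the README describes: batch sizes agree,
    continuous batching has room to batch, and on-device sampling stays
    off because of the TP=2 runtime limitation."""
    errors = []
    if cfg["batch_size"] != cfg["max_batch_size"]:
        errors.append("batch_size and max_batch_size disagree")
    if cfg["continuous_batching"] and cfg["max_batch_size"] < 2:
        errors.append("continuous batching needs max_batch_size >= 2")
    if cfg["local_ranks_size"] == 2 and cfg["on_device_sampling"]:
        errors.append("on-device sampling unsupported with TP=2 here")
    return errors
```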