ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia (Continuous Batching)
This is a Neuron-traced version of karanps/ChessLM_Qwen3, optimized for AWS Trainium (trn1) and Inferentia (inf2) instances and served through vLLM with continuous batching enabled.
Model Details
- Base Model: Qwen3-8B fine-tuned for chess
- Compilation: optimum-neuron[vllm]==0.3.0
- Compiler Version: neuronxcc 2.21.33363.0
- Target Hardware: AWS Trainium (trn1) / Inferentia (inf2)
- Precision: BF16
- Tensor Parallelism: 2 cores
- Batch Size: 4 (continuous batching enabled)
- Max Sequence Length: 2048
- On-Device Sampling: Disabled (due to runtime limitation with TP=2)
Requirements
pip install "optimum-neuron[vllm]==0.3.0"
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com
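Before loading the model, you can confirm that the Neuron devices are visible with the neuron-ls utility that ships with the Neuron SDK (assuming the instance has the Neuron runtime installed):
neuron-ls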
Usage
Loading the Model
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer
# Load the traced model
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")
# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
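Because the model was traced with a static batch size of 4, up to four prompts can be encoded and generated in a single call. A minimal batched sketch follows; the pad-token fallback is an assumption, since the base tokenizer may already define one:
# Batched inference: the trace supports up to 4 sequences at once
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS for padding
prompts = ["e2e4", "d2d4 d7d5", "c2c4", "g1f3"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_new_tokens=20)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))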
Hardware Requirements
- AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
- At least 2 Neuron cores (as configured during tracing)
- Minimum 32GB RAM recommended
Compilation Details
This model was traced with the following parameters:
- batch_size=4
- sequence_length=2048
- num_cores=2
- auto_cast_type="bf16"
- continuous_batching=True
- vLLM-compatible compilation
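For reference, a trace with these parameters can be reproduced along the following lines with the optimum-neuron export API. This is a sketch: the continuous_batching keyword in particular is an assumption and may differ between optimum-neuron releases.
from optimum.neuron import NeuronModelForCausalLM

# Re-export sketch: keyword arguments mirror the parameters listed above
model = NeuronModelForCausalLM.from_pretrained(
    "karanps/ChessLM_Qwen3",
    export=True,
    batch_size=4,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="bf16",
    continuous_batching=True,  # assumption: exact kwarg name varies by release
)
model.save_pretrained("./ChessLM_Qwen3_Trainium")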
Continuous Batching
This model is compiled with continuous batching enabled, which allows vLLM to:
- Process multiple requests simultaneously with dynamic batch sizes up to 4
- Optimize throughput by batching requests with different sequence lengths
- Reduce latency for concurrent inference workloads
Note: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.
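As a sketch of how the continuous-batching trace might be served, the following uses vLLM's offline LLM API with limits that mirror the traced configuration. The device="neuron" flag is an assumption; some vLLM builds detect Neuron hardware automatically, and exact integration details vary by vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="kunhunjon/ChessLM_Qwen3_Trainium",
    max_num_seqs=4,          # traced batch size
    max_model_len=2048,      # traced sequence length
    tensor_parallel_size=2,  # traced core count
    device="neuron",         # assumption: may be auto-detected in newer builds
)

params = SamplingParams(temperature=0.7, max_tokens=20)
for out in llm.generate(["e2e4", "d2d4"], params):
    print(out.outputs[0].text)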
Compilation Metrics
- Total compilation time: ~8.1 minutes
- Token generation model compilation: 219 seconds
- Context encoding model compilation: 165 seconds
- Compiled model size: 17 GB
License
This model inherits the license from the base model karanps/ChessLM_Qwen3.
Citation
If you use this model, please cite the original ChessLM model and AWS Neuron tools.