# ChessLM Qwen3 - Neuron Traced (Sharded Model)

This is a sharded version of the Neuron-traced `karanps/ChessLM_Qwen3`, optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM with continuous batching enabled.

The original `model.pt` file (16.4GB) has been split into 9 shards of ~2GB each for easier downloading and storage.
## Model Details
- Base Model: Qwen3-8B fine-tuned for chess
- Compilation: optimum-neuron[vllm]==0.3.0
- Compiler Version: neuronxcc 2.21.33363.0
- Target Hardware: AWS Trainium (trn1) / Inferentia (inf2)
- Precision: BF16
- Tensor Parallelism: 2 cores
- Batch Size: 4 (continuous batching enabled)
- Max Sequence Length: 2048
- Model Format: Sharded (9 parts)
## Files

### Model Shards

- `model.shard0000.pt` through `model.shard0007.pt`: 2GB each
- `model.shard0008.pt`: 799MB (final shard)
- `model.shards.json`: metadata with SHA256 hashes for verification
- `reconstruct.py`: script to reconstruct the original `model.pt`
### Configuration Files

- `config.json`: model configuration
- `neuron_config.json`: Neuron compilation settings
- Tokenizer files: `tokenizer.json`, `vocab.json`, `merges.txt`, etc.
## Usage

### Option 1: Reconstruct the Full Model

If you need the complete `model.pt` file:
```bash
# Clone the repository
git clone https://huggingface.co/kunhunjon/ChessLM_Qwen3_Trainium_Sharded
cd ChessLM_Qwen3_Trainium_Sharded

# Reconstruct the original model.pt
python3 reconstruct.py

# This will create model.pt (16.4GB) from the shards
```
### Option 2: Use Directly with optimum-neuron
The model can be loaded directly without reconstruction:
```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the model (will handle shards automatically if needed)
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_Sharded")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_Sharded")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
## Requirements

```bash
pip install "optimum-neuron[vllm]==0.3.0"
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com
```
## Hardware Requirements
- AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
- At least 2 Neuron cores (as configured during tracing)
- Minimum 32GB RAM recommended
## Sharding Details
The model was sharded using a custom script that:
- Splits the 16.4GB model.pt into 9 chunks of ~2GB each
- Generates SHA256 hashes for each shard for integrity verification
- Includes a reconstruction script to reassemble the original file
- Preserves all original model functionality
## Verification

The `model.shards.json` file contains a SHA256 hash for each shard. The reconstruction script automatically verifies these hashes when reassembling the model.
## Continuous Batching
This model is compiled with continuous batching enabled, which allows vLLM to:
- Process multiple requests simultaneously with dynamic batch sizes up to 4
- Optimize throughput by batching requests with different sequence lengths
- Reduce latency for concurrent inference workloads
**Note:** On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.
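A hypothetical launch command for serving this model with vLLM is sketched below. Flag names follow vLLM's generic CLI and may differ across vLLM / optimum-neuron versions, so treat this as a starting point rather than a verified invocation; the key point is that the parallelism, batch, and sequence-length values must match the compiled configuration above:

```bash
# Illustrative only: check your vLLM version's CLI before using.
# Values mirror the trace: 2 Neuron cores, batch size 4, max length 2048.
vllm serve kunhunjon/ChessLM_Qwen3_Trainium_Sharded \
  --device neuron \
  --tensor-parallel-size 2 \
  --max-num-seqs 4 \
  --max-model-len 2048
```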
## Compilation Details

- `batch_size=4`
- `sequence_length=2048`
- `num_cores=2`
- `auto_cast_type="bf16"`
- `continuous_batching=True`
- Total compilation time: ~8.1 minutes
## License

This model inherits its license from the base model `karanps/ChessLM_Qwen3`.
## Citation
If you use this model, please cite the original ChessLM model and AWS Neuron tools.