ChessLM Qwen3 - Neuron Traced (Sharded Model)

This is a sharded version of the Neuron-traced karanps/ChessLM_Qwen3 model, optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM with continuous batching enabled.

The 16.4GB model.pt file has been split into 9 shards (eight of ~2GB each plus a smaller final shard) for easier downloading and storage.

Model Details

  • Base Model: Qwen3-8B fine-tuned for chess
  • Compilation: optimum-neuron[vllm]==0.3.0
  • Compiler Version: neuronxcc 2.21.33363.0
  • Target Hardware: AWS Trainium (trn1) / Inferentia (inf2)
  • Precision: BF16
  • Tensor Parallelism: 2 cores
  • Batch Size: 4 (continuous batching enabled)
  • Max Sequence Length: 2048
  • Model Format: Sharded (9 parts)

Files

Model Shards

  • model.shard0000.pt through model.shard0007.pt: 2GB each
  • model.shard0008.pt: 799MB (final shard)
  • model.shards.json: Metadata with SHA256 hashes for verification
  • reconstruct.py: Script to reconstruct the original model.pt

Configuration Files

  • config.json: Model configuration
  • neuron_config.json: Neuron compilation settings
  • Tokenizer files: tokenizer.json, vocab.json, merges.txt, etc.
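Before loading the model on hardware, it can be useful to sanity-check the compile-time settings recorded in neuron_config.json against your serving configuration. The sketch below assumes key names like batch_size and tp_degree; inspect the actual file, since the real keys may differ:

```python
import json

def check_neuron_config(path: str = "neuron_config.json") -> dict:
    """Load compile-time settings that must match the serving configuration."""
    with open(path) as f:
        cfg = json.load(f)
    # Keys of interest per the model card; actual key names may differ.
    wanted = ("batch_size", "sequence_length", "tp_degree", "auto_cast_type")
    return {k: cfg.get(k) for k in wanted}
```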

Usage

Option 1: Reconstruct the Full Model

If you need the complete model.pt file:

# Clone the repository
git clone https://huggingface.co/kunhunjon/ChessLM_Qwen3_Trainium_Sharded
cd ChessLM_Qwen3_Trainium_Sharded

# Reconstruct the original model.pt
python3 reconstruct.py

# This will create model.pt (16.4GB) from the shards

Option 2: Use Directly with optimum-neuron

The model can be loaded directly without reconstruction:

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the model (will handle shards automatically if needed)
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_Sharded")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_Sharded")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Requirements

pip install "optimum-neuron[vllm]==0.3.0"
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com

Hardware Requirements

  • AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
  • At least 2 Neuron cores (as configured during tracing)
  • Minimum 32GB RAM recommended

Sharding Details

The model was sharded using a custom script that:

  • Splits the 16.4GB model.pt into 9 sequential chunks (eight of ~2GB plus a 799MB final chunk)
  • Generates SHA256 hashes for each shard for integrity verification
  • Includes a reconstruction script to reassemble the original file
  • Preserves all original model functionality

Verification

The model.shards.json file contains SHA256 hashes for each shard. The reconstruction script automatically verifies these hashes when reassembling the model.
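The verification step boils down to hashing each shard, comparing against the recorded digest, and only then concatenating. A minimal sketch, assuming model.shards.json is an ordered list of entries with file and sha256 fields (the actual reconstruct.py may differ):

```python
import hashlib
import json
from pathlib import Path

def reconstruct(metadata: str = "model.shards.json", out: str = "model.pt") -> None:
    """Verify each shard's SHA256 digest, then concatenate shards into out."""
    shards = json.loads(Path(metadata).read_text())
    with open(out, "wb") as dst:
        for entry in shards:
            data = Path(entry["file"]).read_bytes()
            digest = hashlib.sha256(data).hexdigest()
            if digest != entry["sha256"]:
                raise ValueError(f"hash mismatch for {entry['file']}")
            dst.write(data)
```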

Continuous Batching

This model is compiled with continuous batching enabled, which allows vLLM to:

  • Process multiple requests simultaneously with dynamic batch sizes up to 4
  • Optimize throughput by batching requests with different sequence lengths
  • Reduce latency for concurrent inference workloads
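With continuous batching compiled in, the model can in principle be served through vLLM's OpenAI-compatible server on a Neuron instance. The invocation below is a sketch only; flag names vary across vLLM releases, so verify against your installed version before relying on it:

```shell
# Sketch: serve the traced model with vLLM on a Neuron instance.
# Flag names vary across vLLM releases; check `vllm serve --help` first.
vllm serve kunhunjon/ChessLM_Qwen3_Trainium_Sharded \
    --device neuron \
    --tensor-parallel-size 2 \
    --max-num-seqs 4 \
    --max-model-len 2048
```

The --max-num-seqs and --max-model-len values mirror the batch size and sequence length the model was compiled with; a Neuron-traced model cannot exceed its compile-time shapes at serve time.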

Note: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.

Compilation Details

  • batch_size=4
  • sequence_length=2048
  • num_cores=2
  • auto_cast_type="bf16"
  • continuous_batching=True
  • Total compilation time: ~8.1 minutes
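These settings correspond to an optimum-neuron export along the lines of the command below. This is a reconstruction from the listed parameters, not the exact command used; the output directory name is illustrative:

```shell
# Sketch: re-export the base model with the same compile settings
# (option names from optimum-neuron's export CLI; verify with --help).
optimum-cli export neuron \
    --model karanps/ChessLM_Qwen3 \
    --batch_size 4 \
    --sequence_length 2048 \
    --num_cores 2 \
    --auto_cast_type bf16 \
    chesslm_qwen3_neuron/
```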

License

This model inherits the license from the base model karanps/ChessLM_Qwen3.

Citation

If you use this model, please cite the original ChessLM model and AWS Neuron tools.
