ChessLM Qwen3 - Neuron Traced (Sharded Model)

This is a sharded version of the Neuron-traced karanps/ChessLM_Qwen3 model, optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM with continuous batching enabled.

The 16.4GB model.pt file has been split into 9 shards (eight of ~2GB each plus a smaller final shard) for easier downloading and storage.

Model Details

  • Base Model: Qwen3-8B fine-tuned for chess
  • Compilation: optimum-neuron[vllm]==0.3.0
  • Compiler Version: neuronxcc 2.21.33363.0
  • Target Hardware: AWS Trainium (trn1) / Inferentia (inf2)
  • Precision: BF16
  • Tensor Parallelism: 2 cores
  • Batch Size: 4 (continuous batching enabled)
  • Max Sequence Length: 2048
  • Model Format: Sharded (9 parts)

Files

Model Shards

  • model.shard0000.pt through model.shard0007.pt: 2GB each
  • model.shard0008.pt: 799MB (final shard)
  • model.shards.json: Metadata with SHA256 hashes for verification
  • reconstruct.py: Script to reconstruct the original model.pt

Configuration Files

  • config.json: Model configuration
  • neuron_config.json: Neuron compilation settings
  • Tokenizer files: tokenizer.json, vocab.json, merges.txt, etc.
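Before loading the model on hardware, it can be useful to sanity-check the compile-time settings recorded in neuron_config.json against your serving configuration. The sketch below assumes key names like batch_size and tp_degree; inspect the actual file, since the real keys may differ:

```python
import json

def check_neuron_config(path: str = "neuron_config.json") -> dict:
    """Load compile-time settings that must match the serving configuration."""
    with open(path) as f:
        cfg = json.load(f)
    # Keys of interest per the model card; actual key names may differ.
    wanted = ("batch_size", "sequence_length", "tp_degree", "auto_cast_type")
    return {k: cfg.get(k) for k in wanted}
```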

Usage

Option 1: Reconstruct the Full Model

If you need the complete model.pt file:

# Clone the repository
git clone https://huggingface.co/kunhunjon/ChessLM_Qwen3_Trainium_Sharded
cd ChessLM_Qwen3_Trainium_Sharded

# Reconstruct the original model.pt
python3 reconstruct.py

# This will create model.pt (16.4GB) from the shards

Option 2: Use Directly with optimum-neuron

The model can be loaded directly without reconstruction:

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the model (will handle shards automatically if needed)
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_Sharded")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_Sharded")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Requirements

pip install "optimum-neuron[vllm]==0.3.0"
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com

Hardware Requirements

  • AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
  • At least 2 Neuron cores (as configured during tracing)
  • Minimum 32GB RAM recommended

Sharding Details

The model was sharded using a custom script that:

  • Splits the 16.4GB model.pt into 9 sequential chunks (eight of ~2GB plus a 799MB final chunk)
  • Generates SHA256 hashes for each shard for integrity verification
  • Includes a reconstruction script to reassemble the original file
  • Preserves all original model functionality

Verification

The model.shards.json file contains SHA256 hashes for each shard. The reconstruction script automatically verifies these hashes when reassembling the model.
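The verification step boils down to hashing each shard, comparing against the recorded digest, and only then concatenating. A minimal sketch, assuming model.shards.json is an ordered list of entries with file and sha256 fields (the actual reconstruct.py may differ):

```python
import hashlib
import json
from pathlib import Path

def reconstruct(metadata: str = "model.shards.json", out: str = "model.pt") -> None:
    """Verify each shard's SHA256 digest, then concatenate shards into out."""
    shards = json.loads(Path(metadata).read_text())
    with open(out, "wb") as dst:
        for entry in shards:
            data = Path(entry["file"]).read_bytes()
            digest = hashlib.sha256(data).hexdigest()
            if digest != entry["sha256"]:
                raise ValueError(f"hash mismatch for {entry['file']}")
            dst.write(data)
```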

Continuous Batching

This model is compiled with continuous batching enabled, which allows vLLM to:

  • Process multiple requests simultaneously with dynamic batch sizes up to 4
  • Optimize throughput by batching requests with different sequence lengths
  • Reduce latency for concurrent inference workloads
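With continuous batching compiled in, the model can in principle be served through vLLM's OpenAI-compatible server on a Neuron instance. The invocation below is a sketch only; flag names vary across vLLM releases, so verify against your installed version before relying on it:

```shell
# Sketch: serve the traced model with vLLM on a Neuron instance.
# Flag names vary across vLLM releases; check `vllm serve --help` first.
vllm serve kunhunjon/ChessLM_Qwen3_Trainium_Sharded \
    --device neuron \
    --tensor-parallel-size 2 \
    --max-num-seqs 4 \
    --max-model-len 2048
```

The --max-num-seqs and --max-model-len values mirror the batch size and sequence length the model was compiled with; a Neuron-traced model cannot exceed its compile-time shapes at serve time.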

Note: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.

Compilation Details

  • batch_size=4
  • sequence_length=2048
  • num_cores=2
  • auto_cast_type="bf16"
  • continuous_batching=True
  • Total compilation time: ~8.1 minutes
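These settings correspond to an optimum-neuron export along the lines of the command below. This is a reconstruction from the listed parameters, not the exact command used; the output directory name is illustrative:

```shell
# Sketch: re-export the base model with the same compile settings
# (option names from optimum-neuron's export CLI; verify with --help).
optimum-cli export neuron \
    --model karanps/ChessLM_Qwen3 \
    --batch_size 4 \
    --sequence_length 2048 \
    --num_cores 2 \
    --auto_cast_type bf16 \
    chesslm_qwen3_neuron/
```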

License

This model inherits the license from the base model karanps/ChessLM_Qwen3.

Citation

If you use this model, please cite the original ChessLM model and AWS Neuron tools.
