---
language:
- en
license: apache-2.0
pipeline_tag: text-generation
tags:
- chess
- neuron
- aws-trainium
- vllm
- optimum-neuron
- continuous-batching
- sharded
base_model: karanps/ChessLM_Qwen3
---

# ChessLM Qwen3 - Neuron Traced (Sharded Model)

This is a **sharded version** of the Neuron-traced [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3), optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM with **continuous batching enabled**. The model.pt file (16.4GB) has been split into **9 shards** of ~2GB each for easier downloading and storage.

## Model Details

- **Base Model**: Qwen3-8B fine-tuned for chess
- **Compilation**: optimum-neuron[vllm]==0.3.0
- **Compiler Version**: neuronxcc 2.21.33363.0
- **Target Hardware**: AWS Trainium (trn1) / Inferentia (inf2)
- **Precision**: BF16
- **Tensor Parallelism**: 2 cores
- **Batch Size**: 4 (continuous batching enabled)
- **Max Sequence Length**: 2048
- **Model Format**: Sharded (9 parts)

## Files

### Model Shards

- `model.shard0000.pt` through `model.shard0007.pt`: 2GB each
- `model.shard0008.pt`: 799MB (final shard)
- `model.shards.json`: Metadata with SHA256 hashes for verification
- `reconstruct.py`: Script to reconstruct the original model.pt

### Configuration Files

- `config.json`: Model configuration
- `neuron_config.json`: Neuron compilation settings
- Tokenizer files: `tokenizer.json`, `vocab.json`, `merges.txt`, etc.

## Usage

### Option 1: Reconstruct the Full Model

If you need the complete `model.pt` file:

```bash
# Clone the repository
git clone https://huggingface.co/kunhunjon/ChessLM_Qwen3_Trainium_Sharded
cd ChessLM_Qwen3_Trainium_Sharded

# Reconstruct the original model.pt
python3 reconstruct.py

# This creates model.pt (16.4GB) from the shards
```

### Option 2: Use Directly with optimum-neuron

The model can be loaded directly, without reconstruction:

```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the model (shards are handled automatically if needed)
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_Sharded")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_Sharded")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Requirements

```bash
pip install "optimum-neuron[vllm]==0.3.0"
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com
```

## Hardware Requirements

- AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
- At least 2 Neuron cores (as configured during tracing)
- Minimum 32GB RAM recommended

## Sharding Details

The model was sharded using a custom script that:

- Splits the 16.4GB model.pt into 9 chunks of ~2GB each
- Generates a SHA256 hash for each shard for integrity verification
- Includes a reconstruction script to reassemble the original file
- Preserves all original model functionality

### Verification

The `model.shards.json` file contains SHA256 hashes for each shard. The reconstruction script verifies these hashes automatically when reassembling the model.
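For reference, the sketch below shows the kind of logic the reconstruction step performs. It is illustrative only: the exact `model.shards.json` schema assumed here (a `shards` list with `file` and `sha256` fields) may differ from the actual metadata layout, and the bundled `reconstruct.py` remains the authoritative implementation.

```python
import hashlib
import json
from pathlib import Path

# Illustrative only -- the bundled reconstruct.py is authoritative. The
# model.shards.json schema assumed here ("shards" list with "file" and
# "sha256" fields) is an assumption for this sketch.
repo = Path(".")
meta = json.loads((repo / "model.shards.json").read_text())

with open(repo / "model.pt", "wb") as out:
    for shard in meta["shards"]:
        sha = hashlib.sha256()
        with open(repo / shard["file"], "rb") as f:
            # Stream each ~2GB shard in 64MB chunks to keep memory bounded
            while chunk := f.read(64 * 1024 * 1024):
                sha.update(chunk)
                out.write(chunk)
        if sha.hexdigest() != shard["sha256"]:
            raise ValueError(f"checksum mismatch for {shard['file']}")

print("model.pt reconstructed; all shard checksums verified")
```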
## Continuous Batching

This model is compiled with **continuous batching enabled**, which allows vLLM to:

- Process multiple requests simultaneously with dynamic batch sizes up to 4
- Optimize throughput by batching requests with different sequence lengths
- Reduce latency for concurrent inference workloads

**Note**: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.

## Compilation Details

- `batch_size=4`
- `sequence_length=2048`
- `num_cores=2`
- `auto_cast_type="bf16"`
- `continuous_batching=True`
- Total compilation time: ~8.1 minutes

## License

This model inherits its license from the base model [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3).

## Citation

If you use this model, please cite the original ChessLM model and the AWS Neuron tools.
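## Example: Serving with vLLM (Sketch)

As a concrete illustration of the continuous-batching setup described above, the sketch below uses vLLM's offline `LLM` API. It is a hedged example, not a verified recipe: it assumes the `optimum-neuron[vllm]` plugin registers the Neuron platform with vLLM, and the engine arguments simply mirror the compiled configuration (`max_num_seqs=4`, `max_model_len=2048`, `tensor_parallel_size=2`); exact behavior can vary across vLLM versions.

```python
from vllm import LLM, SamplingParams

# Sketch of offline serving with continuous batching on Neuron. Assumes the
# optimum-neuron[vllm]==0.3.0 plugin makes the Neuron platform visible to
# vLLM; the engine arguments mirror the compiled configuration above.
llm = LLM(
    model="kunhunjon/ChessLM_Qwen3_Trainium_Sharded",
    max_num_seqs=4,          # matches the compiled batch size
    max_model_len=2048,      # matches the compiled sequence length
    tensor_parallel_size=2,  # matches num_cores used during tracing
)

# Prompts of different lengths are batched dynamically by the scheduler;
# sampling runs on the host, per the note in the Continuous Batching section
prompts = ["e2e4", "e2e4 e7e5 g1f3", "d2d4 d7d5 c2c4 e7e6 b1c3 g8f6"]
params = SamplingParams(temperature=0.7, max_tokens=20)

for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```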