---
language:
- en
license: apache-2.0
pipeline_tag: text-generation
tags:
- chess
- neuron
- aws-trainium
- vllm
- optimum-neuron
- continuous-batching
base_model: karanps/ChessLM_Qwen3
---

# ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia (Continuous Batching)

This is a Neuron-traced version of [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3) optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM with **continuous batching enabled**.

## Model Details

- **Base Model**: Qwen3-8B fine-tuned for chess
- **Compilation**: optimum-neuron[vllm]==0.3.0
- **Compiler Version**: neuronxcc 2.21.33363.0
- **Target Hardware**: AWS Trainium (trn1) / Inferentia (inf2)
- **Precision**: BF16
- **Tensor Parallelism**: 2 cores
- **Batch Size**: 4 (continuous batching enabled)
- **Max Sequence Length**: 2048
- **On-Device Sampling**: Disabled (due to a runtime limitation with TP=2)

## Requirements

```bash
pip install optimum-neuron[vllm]==0.3.0
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com
```

## Usage

### Loading the Model

```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the traced model
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

### Hardware Requirements

- AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
- At least 2 Neuron cores (as configured during tracing)
- Minimum 32 GB RAM recommended

## Compilation Details

This model was traced with the following parameters (a re-tracing sketch is included at the end of this card):

- `batch_size=4`
- `sequence_length=2048`
- `num_cores=2`
- `auto_cast_type="bf16"`
- `continuous_batching=True`
- vLLM-compatible compilation

### Continuous Batching

This model is compiled with **continuous batching enabled**, which allows vLLM to:

- Process multiple requests simultaneously with dynamic batch sizes up to 4
- Improve throughput by batching requests with different sequence lengths
- Reduce latency for concurrent inference workloads

A minimal vLLM serving sketch is provided at the end of this card.

**Note**: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism across 2 cores. Sampling is handled on the host instead.

## Compilation Metrics

- **Total compilation time**: ~8.1 minutes
- **Token generation model**: 219 seconds
- **Context encoding model**: 165 seconds
- **Model size**: 17 GB

## License

This model inherits its license from the base model [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3).

## Citation

If you use this model, please cite the original ChessLM model and the AWS Neuron tooling.
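
## Example: Serving with vLLM (Sketch)

The sketch below shows how this traced checkpoint could be served through vLLM's Python API with continuous batching. It assumes the optimum-neuron vLLM plugin (installed via `optimum-neuron[vllm]==0.3.0`) registers the Neuron platform so the standard vLLM entry points can be used; the argument values mirror the tracing configuration above, but Neuron-specific behavior may differ across versions.

```python
from vllm import LLM, SamplingParams

# A batch of chess-move prompts; with continuous batching enabled they can be
# scheduled together even though they finish at different times.
prompts = ["e2e4", "d2d4 d7d5", "e2e4 c7c5 g1f3"]

# Values mirror the compilation settings of this checkpoint:
# batch_size=4 -> max_num_seqs, sequence_length=2048 -> max_model_len,
# num_cores=2 -> tensor_parallel_size. Neuron plugin behavior is an assumption.
llm = LLM(
    model="kunhunjon/ChessLM_Qwen3_Trainium",
    max_num_seqs=4,
    max_model_len=2048,
    tensor_parallel_size=2,
)

# Sampling runs on the host, since on-device sampling is disabled for this trace.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=20)

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```

The same configuration can also be exposed through vLLM's OpenAI-compatible server; keep `max_num_seqs` at or below 4 and `max_model_len` at or below 2048 to stay within the traced shapes.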
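
## Reproducing the Trace (Sketch)

For reference, this is a minimal sketch of how a comparable artifact could be exported from the base model with optimum-neuron's decoder export path. It assumes `NeuronModelForCausalLM.from_pretrained` accepts the `export=True` flow with these keyword arguments, as in earlier optimum-neuron releases; the `continuous_batching` keyword in particular is an assumption and may be configured differently in 0.3.0.

```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

base_model_id = "karanps/ChessLM_Qwen3"

# Trace the base model for Neuron with the parameters listed under
# "Compilation Details". Tracing takes several minutes on trn1/inf2.
model = NeuronModelForCausalLM.from_pretrained(
    base_model_id,
    export=True,
    batch_size=4,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="bf16",
    continuous_batching=True,  # assumption: exposed as an export kwarg in this release
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Save the traced model and tokenizer so they can be pushed to the Hub.
model.save_pretrained("ChessLM_Qwen3_Trainium")
tokenizer.save_pretrained("ChessLM_Qwen3_Trainium")
```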