---
language:
- en
license: apache-2.0
pipeline_tag: text-generation
tags:
- chess
- neuron
- aws-trainium
- vllm
- optimum-neuron
- continuous-batching
base_model: karanps/ChessLM_Qwen3
---

# ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia (Continuous Batching)

This is a Neuron-traced version of [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3) optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM with **continuous batching enabled**.

## Model Details

- **Base Model**: Qwen3-8B fine-tuned for chess
- **Compilation**: optimum-neuron[vllm]==0.3.0
- **Compiler Version**: neuronxcc 2.21.33363.0
- **Target Hardware**: AWS Trainium (trn1) / Inferentia (inf2)
- **Precision**: BF16
- **Tensor Parallelism**: 2 cores
- **Batch Size**: 4 (continuous batching enabled)
- **Max Sequence Length**: 2048
- **On-Device Sampling**: Disabled (due to a runtime limitation with TP=2)

## Requirements

```bash
pip install optimum-neuron[vllm]==0.3.0
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com
```

## Usage

### Loading the Model

```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the traced model
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

### Hardware Requirements

- AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
- At least 2 Neuron cores (as configured during tracing)
- Minimum 32 GB RAM recommended

## Compilation Details

This model was traced with the following parameters (a re-tracing sketch is included at the end of this card):

- `batch_size=4`
- `sequence_length=2048`
- `num_cores=2`
- `auto_cast_type="bf16"`
- `continuous_batching=True`
- vLLM-compatible compilation

### Continuous Batching

This model is compiled with **continuous batching enabled**, which allows vLLM to:

- Process multiple requests simultaneously with dynamic batch sizes up to 4
- Improve throughput by batching requests with different sequence lengths
- Reduce latency for concurrent inference workloads

A minimal vLLM serving sketch is provided at the end of this card.

**Note**: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism across 2 cores. Sampling is handled on the host instead.

## Compilation Metrics

- **Total compilation time**: ~8.1 minutes
- **Token generation model**: 219 seconds
- **Context encoding model**: 165 seconds
- **Model size**: 17 GB

## License

This model inherits its license from the base model [karanps/ChessLM_Qwen3](https://huggingface.co/karanps/ChessLM_Qwen3).

## Citation

If you use this model, please cite the original ChessLM model and the AWS Neuron tooling.
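
## Example: Serving with vLLM (Sketch)

The sketch below shows how this traced checkpoint could be served through vLLM's Python API with continuous batching. It assumes the optimum-neuron vLLM plugin (installed via `optimum-neuron[vllm]==0.3.0`) registers the Neuron platform so the standard vLLM entry points can be used; the argument values mirror the tracing configuration above, but Neuron-specific behavior may differ across versions.

```python
from vllm import LLM, SamplingParams

# A batch of chess-move prompts; with continuous batching enabled they can be
# scheduled together even though they finish at different times.
prompts = ["e2e4", "d2d4 d7d5", "e2e4 c7c5 g1f3"]

# Values mirror the compilation settings of this checkpoint:
# batch_size=4 -> max_num_seqs, sequence_length=2048 -> max_model_len,
# num_cores=2 -> tensor_parallel_size. Neuron plugin behavior is an assumption.
llm = LLM(
    model="kunhunjon/ChessLM_Qwen3_Trainium",
    max_num_seqs=4,
    max_model_len=2048,
    tensor_parallel_size=2,
)

# Sampling runs on the host, since on-device sampling is disabled for this trace.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=20)

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```

The same configuration can also be exposed through vLLM's OpenAI-compatible server; keep `max_num_seqs` at or below 4 and `max_model_len` at or below 2048 to stay within the traced shapes.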
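
## Reproducing the Trace (Sketch)

For reference, this is a minimal sketch of how a comparable artifact could be exported from the base model with optimum-neuron's decoder export path. It assumes `NeuronModelForCausalLM.from_pretrained` accepts the `export=True` flow with these keyword arguments, as in earlier optimum-neuron releases; the `continuous_batching` keyword in particular is an assumption and may be configured differently in 0.3.0.

```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

base_model_id = "karanps/ChessLM_Qwen3"

# Trace the base model for Neuron with the parameters listed under
# "Compilation Details". Tracing takes several minutes on trn1/inf2.
model = NeuronModelForCausalLM.from_pretrained(
    base_model_id,
    export=True,
    batch_size=4,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="bf16",
    continuous_batching=True,  # assumption: exposed as an export kwarg in this release
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Save the traced model and tokenizer so they can be pushed to the Hub.
model.save_pretrained("ChessLM_Qwen3_Trainium")
tokenizer.save_pretrained("ChessLM_Qwen3_Trainium")
```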