ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia (Continuous Batching)

This is a Neuron-traced version of karanps/ChessLM_Qwen3, compiled for AWS Trainium (trn1) and Inferentia (inf2) instances and intended to be served through vLLM with continuous batching enabled.

Model Details

  • Base Model: Qwen3-8B fine-tuned for chess
  • Compilation: optimum-neuron[vllm]==0.3.0
  • Compiler Version: neuronxcc 2.21.33363.0
  • Target Hardware: AWS Trainium (trn1) / Inferentia (inf2)
  • Precision: BF16
  • Tensor Parallelism: 2 cores
  • Batch Size: 4 (continuous batching enabled)
  • Max Sequence Length: 2048
  • On-Device Sampling: Disabled (known Neuron runtime limitation with TP=2; sampling runs on the host)

Requirements

pip install optimum-neuron[vllm]==0.3.0
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com

Usage

Loading the Model

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the traced model
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
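
Because the trace was compiled with batch_size=4, up to four prompts can be processed in one call. A minimal sketch, assuming the traced model accepts padded batches like the standard transformers generate API (the prompts here are illustrative):

# Batched inference: the trace supports up to 4 sequences at once
prompts = ["e2e4", "d2d4", "c2c4", "g1f3"]
if tokenizer.pad_token is None:  # Qwen tokenizers usually define one, but be safe
    tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=20)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))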

Hardware Requirements

  • AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
  • At least 2 Neuron cores (as configured during tracing)
  • Minimum 32 GB of host RAM recommended

Compilation Details

This model was traced with the following parameters (a re-export sketch follows the list):

  • batch_size=4
  • sequence_length=2048
  • num_cores=2
  • auto_cast_type="bf16"
  • continuous_batching=True
  • vLLM-compatible compilation
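
For reference, a re-export along these lines could look like the sketch below. This is a hedged reconstruction using optimum-neuron's export API, not the exact invocation used to produce this artifact; in particular, the continuous_batching keyword is an assumption about how that option is spelled in this release.

from optimum.neuron import NeuronModelForCausalLM

# Hedged re-export sketch; not the exact command used for this artifact.
model = NeuronModelForCausalLM.from_pretrained(
    "karanps/ChessLM_Qwen3",
    export=True,               # trigger neuronx compilation on load
    batch_size=4,
    sequence_length=2048,
    num_cores=2,               # tensor parallelism across 2 Neuron cores
    auto_cast_type="bf16",
    continuous_batching=True,  # assumption: keyword spelling may differ by release
)
model.save_pretrained("./ChessLM_Qwen3_Trainium")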

Continuous Batching

This model is compiled with continuous batching enabled, which allows vLLM to:

  • Process multiple requests simultaneously with dynamic batch sizes up to 4
  • Optimize throughput by batching requests with different sequence lengths
  • Reduce latency for concurrent inference workloads

Note: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.
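
A minimal offline-serving sketch with vLLM is shown below. It assumes a vLLM build where the optimum-neuron plugin handles Neuron device selection; max_num_seqs, max_model_len, and tensor_parallel_size are standard vLLM parameters and must stay within the compiled batch size (4), sequence length (2048), and core count (2).

from vllm import LLM, SamplingParams

# Keep max_num_seqs <= compiled batch size and max_model_len <= compiled sequence length.
llm = LLM(
    model="kunhunjon/ChessLM_Qwen3_Trainium",
    max_num_seqs=4,
    max_model_len=2048,
    tensor_parallel_size=2,
)
params = SamplingParams(temperature=0.7, max_tokens=20)  # sampling runs on the host
outputs = llm.generate(["e2e4", "d2d4"], params)
for out in outputs:
    print(out.outputs[0].text)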

Compilation Metrics

  • Total compilation time: ~8.1 minutes
  • Token-generation graph: 219 seconds
  • Context-encoding graph: 165 seconds
  • Compiled model size: 17 GB

License

This model inherits the license from the base model karanps/ChessLM_Qwen3.

Citation

If you use this model, please cite the original ChessLM model and AWS Neuron tools.
