Connect 4 QLoRA Adapter for Qwen/Qwen3-0.6B-Base
Model Summary
This repository distributes a QLoRA adapter trained to steer the Qwen/Qwen3-0.6B-Base model toward Connect Four next-move prediction and short-form move generation. Prompts encode the game history as concatenated column indices (0–6) along with the starter and side-to-move context, allowing the adapter to focus on legal column selection. The weights are stored separately from the base checkpoint; load or merge them into the matching base revision before running inference. Inference runs on a single consumer GPU with 4 GB of VRAM in 4-bit mode.
This updated version was trained on a larger dataset, and the published weights are the checkpoint with the lowest evaluation loss, selected automatically for improved reliability and performance.
How to Use
Load for inference with PEFT
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
BASE_MODEL = "Qwen/Qwen3-0.6B-Base"
ADAPTER_PATH = "RenaudGaudron/Qwen3-0.6B-Connect4"
# The bnb_4bit_* options must be passed through a BitsAndBytesConfig rather
# than directly to from_pretrained.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
model.eval()
PROMPT_TEMPLATE = (
"Connect Four is played on a 7-column by 6-row grid. Players alternate"
" dropping discs that stack upwards, and full columns cannot be used."
"\nGame starter: Player {starter}"
"\nPlayer to move: Player {current_player}"
"\nMoves so far: {history}"
"\nSelect a legal column (0-6) for Player {current_player} and avoid full columns."
"\nAnswer with a single digit representing the column index."
"\nResponse:"
)
example_prompt = PROMPT_TEMPLATE.format(
starter=1,
current_player=2,
history="32344553",
)
inputs = tokenizer(example_prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
outputs = model.generate(
**inputs,
max_new_tokens=2,
do_sample=False,
)
# Decode only the newly generated move, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
Merge adapters into the base (optional)
from pathlib import Path
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
BASE_MODEL = "Qwen/Qwen3-0.6B-Base"
ADAPTER_PATH = "RenaudGaudron/Qwen3-0.6B-Connect4"
OUTPUT_DIR = Path("./merged-connect4-qwen3-0.6b")
# Load the base without quantisation: merging requires unquantised weights.
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
merged = model.merge_and_unload()
merged.save_pretrained(OUTPUT_DIR)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Merged model saved to {OUTPUT_DIR}. Review the base model license before redistribution.")
Prompt format. Training examples follow the template above. The `Moves so far` string serialises column indices (0–6) without separators so column legality remains reconstructible. Use `none` when the board is empty, and always align the move history with the declared starter to avoid illegal column suggestions. Responses remain single digits; deterministic decoding or a constrained vocabulary helps preserve the format, as in the sketch below.
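The following is a minimal sketch of constrained decoding. It assumes each digit "0"–"6" maps to a single token in the Qwen tokenizer (verify this for other bases), reuses `model`, `tokenizer`, and `inputs` from the inference snippet above, and uses `ColumnOnlyLogits` as an illustrative helper, not something shipped in this repository:

import torch
from transformers import LogitsProcessor, LogitsProcessorList

class ColumnOnlyLogits(LogitsProcessor):
    """Mask every vocabulary entry except the digit tokens for columns 0-6."""
    def __init__(self, allowed_ids):
        self.allowed_ids = allowed_ids

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0.0
        return scores + mask

# Token ids for "0".."6"; assumes each digit encodes to a single token.
digit_ids = [tokenizer(str(c), add_special_tokens=False).input_ids[0] for c in range(7)]

outputs = model.generate(
    **inputs,
    max_new_tokens=1,
    do_sample=False,
    logits_processor=LogitsProcessorList([ColumnOnlyLogits(digit_ids)]),
)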
Training Details
LoRA configuration
- Method: QLoRA with a frozen 4-bit base model.
- Rank (`r`): 128.
- Scaling (`lora_alpha`): 256.
- Dropout (`lora_dropout`): 0.1.
- Target modules: attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and MLP projections (`gate_proj`, `up_proj`, `down_proj`).
- Adapter bias: disabled; only the rank-update matrices are trainable (the full configuration is sketched below).
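For reference, this corresponds to a `peft` `LoraConfig` along these lines (a sketch mirroring the values above, not a file shipped in this repository):

from peft import LoraConfig

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",           # adapter bias disabled
    task_type="CAUSAL_LM",
)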
QLoRA keeps the dense Qwen3 backbone quantised to 4-bit NF4 while learning a lightweight LoRA stack initialised from the resumed checkpoint. The adapters stay in float32 while the quantised backbone executes in 4-bit, keeping peak VRAM well under 8 GB on the training GPU.
Optimisation setup
- Optimiser: `adamw_torch_fused` with β₂ = 0.98 and ε = 1e-6.
- Learning rate: 5e-7 with `constant_with_warmup` scheduling and a warmup ratio of 5 %.
- Weight decay: 0.0.
- Gradient accumulation: 8 steps with per-device batch size 8 → effective batch size 64 sequences.
- Max gradient norm: 25.0 with clipping applied every optimisation step.
- Label smoothing: disabled.
- Attention backend: PyTorch SDPA with math kernel fallback; flash and memory-efficient kernels were unavailable on the training GPU.
- Length-aware sampling: 64-bucket sampler enabled to reduce padding skew during both training and evaluation.
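Assembled into `transformers.TrainingArguments`, the setup above would look roughly like this. A hedged sketch: the `output_dir` is a hypothetical path, and `group_by_length` is only the closest built-in approximation of the custom 64-bucket sampler:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./connect4-qlora",   # hypothetical path
    optim="adamw_torch_fused",
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    learning_rate=5e-7,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.05,
    weight_decay=0.0,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,   # 8 x 8 = 64 effective sequences
    max_grad_norm=25.0,
    group_by_length=True,            # approximates the length-aware bucketing
    bf16=True,
    tf32=True,
    num_train_epochs=5,
    evaluation_strategy="steps",
    eval_steps=1_000,
    save_strategy="steps",
    save_steps=1_000,
    load_best_model_at_end=True,     # capture the lowest-eval-loss checkpoint
    metric_for_best_model="eval_loss",
    seed=42,
)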
Precision and memory
- Base model loaded in 4-bit NF4 with double quantisation; LoRA weights stored in float32.
- Computation dtype: BF16 with TF32 matmuls enabled; FP16 disabled.
- Gradient checkpointing: disabled.
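Expressed as a quantisation config, these precision settings correspond to roughly the following sketch (note the BF16 compute dtype used at training time, unlike the FP16 in the inference example):

import torch
from transformers import BitsAndBytesConfig

# NF4 with double quantisation and BF16 compute for the frozen backbone.
bnb_training_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)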
Data
- Dataset source: private self-play Connect Four rollouts across three CSV files.
- Aggregated examples: 114 363 generated prompts (102 926 train / 11 437 validation) from 4 200 games.
- Per-file caps: 100 level-0 games (two players making random legal moves), 4 000 games played between Minimax agents of depth 8, and 100 mixed-level games (depth 4 vs 8). Each game is expanded into multiple move-prefix supervision sequences for training.
- Minimum move threshold: unset (`min_moves=0`), so every legal position contributes.
- Validation split: 10 % stratified after shuffle.
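To illustrate how one game expands into move-prefix supervision sequences, here is a hypothetical sketch; the `expand_game` helper and its field names are assumptions, not the actual preprocessing code:

def expand_game(moves: str, starter: int) -> list[dict]:
    """Turn one finished game into (prompt history, next-move label) pairs."""
    examples = []
    for i in range(len(moves)):
        history = moves[:i] or "none"          # empty board serialises as "none"
        current = starter if i % 2 == 0 else 3 - starter  # players are 1 and 2
        examples.append({
            "history": history,
            "current_player": current,
            "label": moves[i],                 # the move actually played
        })
    return examples

# A 4-move game started by Player 1 yields 4 supervision examples.
print(expand_game("3234", starter=1))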
Training run
- Epochs: 5 (planned).
- Best observed validation loss: 0.7196 at step 50 000 (saved as the published checkpoint).
- Training log span: ~60 h, with periodic evaluations every 1 000 steps.
- Hardware: single-process Accelerate session on a Windows workstation with a single CUDA GPU, using bitsandbytes CUDA 12.4 bindings.
- Seed: 42 across Python, NumPy, and PyTorch. Dataloader workers (10) and CUDA kernels may still introduce nondeterminism.
Evaluation
Evaluation reuses the training prompt template with `eval_max_seq_length=128` truncation. The primary metric is token-level cross-entropy (reported as loss), measuring next-move prediction quality.
- Published checkpoint validation loss: 0.7196 (best checkpoint at step 50 000).
The loss indicates that the adapter tracks legal play patterns well after the expanded curriculum, though output quality near terminal states still benefits from decoding constraints.
Stability controls mirrored training: gradient clipping at norm 25.0, SDPA math kernel fallback, and warmup scheduling. No extra label smoothing or dropout beyond the LoRA stack was introduced.
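The reported loss can be approximated per example with a short sketch like the following. It reuses `model`, `tokenizer`, and `example_prompt` from the inference snippet; the appended answer digit is illustrative, and scoring the full sequence is only a rough proxy if prompt tokens were masked during training:

# Score one example with token-level cross-entropy over the full sequence.
enc = tokenizer(
    example_prompt + "5",
    return_tensors="pt",
    truncation=True,
    max_length=128,   # mirrors eval_max_seq_length=128
).to(model.device)
with torch.inference_mode():
    out = model(**enc, labels=enc["input_ids"])
print(f"token-level cross-entropy: {out.loss.item():.4f}")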
Special Tokens
The adapter ships without introducing new tokens. The bundled tokenizer metadata mirrors the base checkpoint, reusing <|endoftext|> as both EOS and padding alongside the stock Qwen multimodal specials.
Intended Use & Limitations
Intended use.
Pair the adapter with Qwen/Qwen3-0.6B-Base for Connect Four next-move suggestion or short move-sequence generation. Prompts should follow the documented template for reliable legality.
Limitations.
- The compact 0.6B backbone cannot guarantee optimal play in deep tactical lines; illegal or low-quality moves remain possible, especially in endgame positions.
- Outputs are restricted to single-digit column indices; free-form chat or multi-turn dialogue falls outside the training distribution.
- The adapter does not include safety layers, toxicity filtering, or alignment for open-domain generation. Avoid deploying it in user-facing production systems.
- Quality still depends on accurate, complete move histories and deterministic decoding (e.g., `do_sample=False` or constrained-vocabulary sampling).
Compatibility
The adapter was trained and validated with the following stack:
- `transformers` ≥ 4.39.0
- `peft` ≥ 0.8.2
- `bitsandbytes` ≥ 0.43.0
- `torch` ≥ 2.1 (CUDA 12.4 build per the bitsandbytes log)
- `accelerate` ≥ 0.25.0
Windows-based CUDA 12.4 bindings powered the training environment. On Linux or other CUDA releases, PyTorch selects from the available SDPA kernels; if the flash and memory-efficient kernels are missing (as during training), it falls back to the math implementation. Always load the same Qwen/Qwen3-0.6B-Base revision used during fine-tuning to avoid key mismatches.
Reproducibility & Seeds
The global seed remained 42 across Python, NumPy, and PyTorch. Despite deterministic settings, CUDA kernel scheduling, dataloader worker ordering, and filesystem timing can introduce minor nondeterminism.
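The same seeding can be reproduced with the `transformers` utility:

from transformers import set_seed

# Seeds Python's random module, NumPy, and PyTorch (CPU and CUDA) in one call.
set_seed(42)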
Changelog
- 2025-11-04 – Expanded dataset fine-tune with automated best-checkpoint capture.
- 2025-10-25 – Initial public adapter release.
License
The adapter is released under the MIT License. The base model, Qwen/Qwen3-0.6B-Base, ships under Apache 2.0. Please ensure that downstream usage respects both licenses before merging, redistributing, or further fine-tuning.
Citations
- Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv:2106.09685.
- Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs,” arXiv:2305.14314.
- Wolf et al., “HuggingFace's Transformers: State-of-the-Art Natural Language Processing,” arXiv:1910.03771.