Instructions to use m-beps/qwen3-8b-finetune-multit-thinking with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use m-beps/qwen3-8b-finetune-multit-thinking with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base_model, "m-beps/qwen3-8b-finetune-multit-thinking")

Transformers

How to use m-beps/qwen3-8b-finetune-multit-thinking with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="m-beps/qwen3-8b-finetune-multit-thinking")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("m-beps/qwen3-8b-finetune-multit-thinking", dtype="auto")

Local Apps

vLLM

How to use m-beps/qwen3-8b-finetune-multit-thinking with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "m-beps/qwen3-8b-finetune-multit-thinking"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "m-beps/qwen3-8b-finetune-multit-thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
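
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes the server is listening on `localhost:8000` as in the commands above. The `chat_template_kwargs` field is an assumption based on Qwen's deployment notes for toggling thinking mode; drop it if your server rejects it. The function names are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumed values: they mirror the vLLM commands above (default port 8000).
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "m-beps/qwen3-8b-finetune-multit-thinking"


def build_chat_request(prompt: str, enable_thinking: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server forwards this to the Qwen3 chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Once the server is running, `send_chat_request(build_chat_request("What is the capital of France?"))` returns the reply text. For the SGLang commands further down, change the port in `BASE_URL` to 30000.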

Use Docker

docker model run hf.co/m-beps/qwen3-8b-finetune-multit-thinking

SGLang

How to use m-beps/qwen3-8b-finetune-multit-thinking with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "m-beps/qwen3-8b-finetune-multit-thinking" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "m-beps/qwen3-8b-finetune-multit-thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "m-beps/qwen3-8b-finetune-multit-thinking" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "m-beps/qwen3-8b-finetune-multit-thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use m-beps/qwen3-8b-finetune-multit-thinking with Docker Model Runner:
```
docker model run hf.co/m-beps/qwen3-8b-finetune-multit-thinking
```

Qwen3 8B — Italian Cultural Alignment [V2 — Thinking Only]

Qwen3 8B [V2] is a LoRA adapter fine-tuned on top of Qwen/Qwen3-8B to improve Italian cultural alignment using exclusively thinking-format (synthetic chain-of-thought) training data. It was trained on a thinking-converted version of the Mult-IT dataset and evaluated on the ITALIC benchmark. V2 is the second version in a series of experiments exploring how supervised fine-tuning data format affects both cultural performance and the chain-of-thought reasoning capabilities of Qwen3's hybrid-reasoning architecture.

Author: Maruf Bepary, King's College London
Research report: Alignment in Large Language Models

Note: Thinking mode is the primary use case. V2 was trained exclusively on thinking-format data. This substantially improved Qwen3's chain-of-thought (<think>) performance whilst leaving No Thinking mode near-baseline. Use Thinking mode (enable_thinking=True) to benefit from fine-tuning. See Key Finding below.

Model Summary

Property	Value
Base model	`Qwen/Qwen3-8B`
PEFT type	LoRA
Task	Causal language modelling (Italian Q&A / instruction following)
Training dataset	Mult-IT (~21,108 thinking-format samples)
Evaluation benchmark	ITALIC (10,000 questions)
No Thinking accuracy (V2)	70.27% (+0.10 pp over baseline — near-unchanged)
Thinking accuracy (V2)	77.87% (+3.38 pp over baseline — improved)
Trainable parameters	65,470,464 / 8,256,205,824 (0.79%)

Intended Use

This model is intended for:

Italian reasoning tasks — multiple-choice Q&A, cultural knowledge, and instruction following in Italian using chain-of-thought reasoning.
Research — studying the effect of thinking-format SFT on hybrid-reasoning language models, and confirming the role of training data format in reasoning mode degradation.
Benchmarking — comparing Italian cultural alignment across model sizes, training strategies, and inference modes.

Not recommended for:

Tasks that require maximum No Thinking mode performance — use V1 or V3 for that.
High-stakes or safety-critical applications.
Languages other than Italian.

Key Finding — Thinking Mode Recovery

Training Qwen3 exclusively on thinking-format (synthetic chain-of-thought) data produces the inverse of the V1 result. Where V1 boosted No Thinking (+3.60 pp) whilst collapsing Thinking (−15.16 pp), V2 boosts Thinking (+3.38 pp) whilst leaving No Thinking virtually unchanged (+0.10 pp):

Mode	Baseline	V2	Delta
No Thinking (total)	70.17%	70.27%	+0.10 pp
Thinking (total)	74.49%	77.87%	+3.38 pp

This confirms that the training data format directly determines which inference mode benefits from SFT. The reasoning-mode collapse observed in V1 was caused entirely by the non-thinking data format, not by the LoRA fine-tuning process itself.

A notable additional result: V2 Thinking (77.87%) surpasses the baseline Thinking (74.49%) and comes within 0.91 pp of Qwen3 14B Thinking (78.78%) despite being an 8B model, suggesting that targeted SFT compensates meaningfully for the size gap.

Version Series

Version	Training Format	No Thinking Total	Thinking Total	Key Result
V1	Non-thinking only	73.77% (+3.60)	59.33% (−15.16)	Thinking collapsed
V2	Thinking only	70.27% (+0.10)	77.87% (+3.38)	Thinking recovered and improved
V3	Mixed (both)	73.81% (+3.64)	77.57% (+3.08)	Both modes improved
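
The deltas in the table above can be recomputed from the total scores. This is a small sketch that copies the totals from this card and subtracts the baseline for each mode; the variable and function names are illustrative.

```python
# Total ITALIC accuracy (%) copied from the tables in this card.
BASELINE = {"no_thinking": 70.17, "thinking": 74.49}
VERSIONS = {
    "V1": {"no_thinking": 73.77, "thinking": 59.33},
    "V2": {"no_thinking": 70.27, "thinking": 77.87},
    "V3": {"no_thinking": 73.81, "thinking": 77.57},
}


def deltas_vs_baseline(scores):
    """Return the change in percentage points for each inference mode."""
    return {mode: round(scores[mode] - BASELINE[mode], 2) for mode in BASELINE}


for name, scores in VERSIONS.items():
    d = deltas_vs_baseline(scores)
    print(f"{name}: No Thinking {d['no_thinking']:+.2f} pp, Thinking {d['thinking']:+.2f} pp")
```

Reading the two deltas side by side makes the trade-off explicit: each single-format run helps mainly the mode it was trained on, while the mixed run helps both.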

Training Details

LoRA Configuration

Parameter	Value
LoRA rank (`r`)	24
LoRA alpha	48
LoRA dropout	0.1
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Bias	none
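
As a sanity check, the trainable-parameter count in the Model Summary can be reproduced from this configuration. The sketch below assumes the Qwen3-8B layer dimensions from the public model config (hidden size 4096, MLP size 12288, 36 layers, 32 query heads and 8 key/value heads of dimension 128); these dimensions are not stated in this card. Each adapted projection adds `r × (in_features + out_features)` weights.

```python
# Assumed Qwen3-8B dimensions, taken from the public model config
# (they are not stated in this card).
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
Q_OUT = 32 * 128   # query heads * head_dim
KV_OUT = 8 * 128   # key/value heads * head_dim

RANK = 24
ALPHA = 48

# (in_features, out_features) of each targeted projection in one layer.
TARGETS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapted projection adds A (rank x in) and B (out x rank); bias is "none".
per_layer = sum(RANK * (n_in + n_out) for n_in, n_out in TARGETS.values())
trainable = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # LoRA update is scaled by alpha / r

print(f"{trainable:,} trainable parameters, scaling factor {scaling}")
```

Under these assumptions the count comes to 65,470,464, matching the figure reported above, and the alpha/rank scaling factor is 2.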

Training Hyperparameters

Parameter	Value
Epochs	1
Total steps	~577
Per-device batch size	4
Gradient accumulation	8 steps (effective batch: 32)
Sequence packing	Yes (max 2,048 tokens per slot)
Peak learning rate	5×10⁻⁵
LR schedule	Cosine
Warmup steps	~29 steps (~5%)
Max sequence length	2,048 tokens
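
The arithmetic behind these values can be sketched as follows. The card states only that the schedule is cosine; the linear warmup followed by cosine decay to zero is an assumption, as is treating the approximate step count of 577 as exact. `lr_at` is a hypothetical helper, not a library function.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM = 8
NUM_GPUS = 1          # single RTX 3090
TOTAL_STEPS = 577     # approximate value from the table
WARMUP_RATIO = 0.05
PEAK_LR = 5e-5

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
warmup_steps = round(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at an optimiser step (assumed linear warmup, cosine decay)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, warmup_steps, lr_at(warmup_steps))
```

Without packing, roughly 21,108 samples at an effective batch of 32 would take about 660 steps per epoch, so the lower step count is consistent with several samples sharing each 2,048-token slot.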

Framework & Hardware

Component	Version / Spec
TRL	0.21.0
PEFT	0.17.0
Transformers	4.55.0
PyTorch	2.5.1+cu121
Hardware	NVIDIA GeForce RTX 3090

Training Dataset — Mult-IT (Thinking Format)

Dataset: Mult-IT — Multiple Choice Questions on Multiple Topics in Italian
Source: CALAMITA Shared Task @ CLiC-it 2024
Language: Italian
Size: ~21,108 training samples
Format: JSONL, multiple-choice Q&A with synthetic CoT reasoning traces. Only those training samples answered correctly by the model were included (no non-thinking format examples).
Reference: Mult-IT: Multiple Choice Questions on Multiple Topics in Italian (2024)

ITALIC Benchmark Results

Benchmark: ITALIC (NAACL 2025) — Italian Culture-Aware Natural Language Benchmark
Format: Zero-shot, multiple-choice (12 categories, 10,000 questions)
System prompt: "Sei un assistente utile."

No Thinking Mode — V2 vs Baseline (virtually unchanged)

Category	Baseline	V2	Δ
Art	69.29	68.27	−1.02
Civic	73.18	72.86	−0.32
Events	76.09	75.83	−0.26
Geography	75.89	74.87	−1.02
History	71.37	71.68	+0.31
Literature	64.33	66.77	+2.44
Tourism	68.27	66.63	−1.64
Lexicon	84.27	85.29	+1.02
Morphology	50.71	47.29	−3.42
Orthography	54.04	55.51	+1.47
Synonyms	84.04	84.45	+0.41
Syntax	59.20	59.10	−0.10
Culture (subtotal)	70.47	70.26	−0.21
Language (subtotal)	69.73	70.28	+0.55
Total	70.17	70.27	+0.10

Thinking Mode — V2 vs Baseline (improved)

Category	Baseline	V2	Δ
Art	74.85	76.48	+1.63
Civic	74.07	76.90	+2.83
Events	76.09	76.91	+0.82
Geography	81.44	83.77	+2.33
History	71.26	76.62	+5.36
Literature	70.06	74.48	+4.42
Tourism	66.87	71.13	+4.26
Lexicon	89.82	91.36	+1.54
Morphology	54.14	61.43	+7.29
Orthography	68.67	72.18	+3.51
Synonyms	91.57	92.61	+1.04
Syntax	59.04	65.58	+6.54
Culture (subtotal)	73.13	76.57	+3.44
Language (subtotal)	76.49	79.79	+3.30
Total	74.49	77.87	+3.38
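
To see where Thinking mode gained the most, the per-category rows can be ranked by delta. This is a small sketch over the values in the table above; the dictionary and variable names are illustrative.

```python
# Thinking-mode accuracy (%) per category as (baseline, V2), from the table above.
THINKING = {
    "Art": (74.85, 76.48),
    "Civic": (74.07, 76.90),
    "Events": (76.09, 76.91),
    "Geography": (81.44, 83.77),
    "History": (71.26, 76.62),
    "Literature": (70.06, 74.48),
    "Tourism": (66.87, 71.13),
    "Lexicon": (89.82, 91.36),
    "Morphology": (54.14, 61.43),
    "Orthography": (68.67, 72.18),
    "Synonyms": (91.57, 92.61),
    "Syntax": (59.04, 65.58),
}

# Sort categories by gain in percentage points, largest first.
gains = sorted(
    ((round(v2 - base, 2), category) for category, (base, v2) in THINKING.items()),
    reverse=True,
)

for delta, category in gains:
    print(f"{category:<12} {delta:+.2f} pp")
```

The largest gains fall on Morphology, Syntax, and History, which were among the lower-scoring categories at baseline.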

Comparison with Other Models (Thinking Mode, ITALIC Total)

Model	Total	Parameters
Llama 3.1 70B	83.61%	70B
GPT-4o Mini	82.22%	~8B
Note: the GPT-4o Mini parameter count is an unofficial estimate; its size has not been publicly disclosed.
Qwen3 14B (Thinking)	78.78%	14B
Qwen3 8B (Thinking) [V2]	77.87%	8B
Qwen3 8B (Thinking) [V3 / Mixed]	77.57%	8B
Qwen3 8B (Thinking) baseline	74.49%	8B
Qwen3 8B (Thinking) [V1]	59.33%	8B

All scores evaluated under identical zero-shot conditions on the ITALIC benchmark.

Usage

✅ Thinking mode is recommended. Use enable_thinking=True to activate chain-of-thought reasoning and benefit from V2 fine-tuning. No Thinking mode also works but performance is near-baseline.

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import re

base_model_id = "Qwen/Qwen3-8B"
adapter_id = "m-beps/qwen3-8b-finetune-multit-thinking"

# Load tokeniser and base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# Example: Italian multiple-choice question
messages = [
    {"role": "system", "content": "Sei un assistente utile."},
    {
        "role": "user",
        "content": (
            "Qual è la capitale d'Italia?\n"
            "A) Milano\nB) Roma\nC) Napoli\nD) Torino\n\n"
            "Rispondi con la lettera della risposta corretta."
        ),
    },
]

# Apply chat template — enable thinking mode (recommended for V2)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,   # <-- primary mode for V2
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,  # raise this if the reasoning is cut off before </think>
        do_sample=False,
        temperature=None,
        top_p=None,
    )

full_response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)

# Strip the <think>...</think> block to extract the final answer
final_answer = re.sub(r"<think>.*?</think>", "", full_response, flags=re.DOTALL).strip()
print(final_answer)
# Expected output: "B"
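
For scoring multiple-choice replies, the stripping step above can be wrapped in a helper that also handles truncated reasoning. This is a minimal sketch, not the evaluation code used for the reported results. `extract_answer_letter` is a hypothetical name, and the A to D option range is an assumption based on the example prompt.

```python
import re
from typing import Optional

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
ANSWER_LETTER = re.compile(r"\b([A-D])\b")


def extract_answer_letter(response: str) -> Optional[str]:
    """Return the first standalone option letter after the reasoning block."""
    visible = THINK_BLOCK.sub("", response)
    if "<think>" in visible:
        # The reasoning was cut off before </think>, so no final answer exists.
        return None
    match = ANSWER_LETTER.search(visible)
    return match.group(1) if match else None
```

The pattern takes the first standalone capital letter from A to D, so it suits short replies such as `B` or `B) Roma`; a reply that opens with a capitalised word like "A" would need stricter parsing.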

No Thinking mode is also available but yields near-baseline performance:

# No Thinking mode — near-baseline performance
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # near-baseline; thinking mode is preferred for V2
)

Limitations

No Thinking mode shows negligible improvement — thinking-only training does not improve direct (non-reasoning) inference. Use Thinking mode to benefit from fine-tuning.
Morphology in No Thinking mode degraded by 3.42 pp, the largest per-category drop in that mode; thinking-only training appears to have diminished direct morphological recall when the reasoning pathway is not used.
Benchmark scope — evaluation was conducted solely on ITALIC; Italian cultural performance on other benchmarks (e.g. MMLU-IT, HellaSwag-IT) is unverified.
Single-GPU training — training used one RTX 3090; larger batch sizes or multi-GPU configurations may yield different results.
Dataset bias — Mult-IT is a multiple-choice dataset; the model may not generalise equally well to open-ended Italian generation tasks.

References

Related resources:

Research report: Alignment in Large Language Models
Base model: Qwen/Qwen3-8B
ITALIC benchmark: RiTA-nlp/ITALIC
Mult-IT dataset: sapienzanlp/Mult-IT
PEFT documentation: huggingface.co/docs/peft
