Qwen3 8B: Italian Cultural Alignment [V2: Thinking Only]
Qwen3 8B [V2] is a LoRA adapter fine-tuned on top of Qwen/Qwen3-8B to improve Italian cultural alignment using exclusively thinking-format (synthetic chain-of-thought) training data. It was trained on a thinking-converted version of the Mult-IT dataset and evaluated on the ITALIC benchmark. V2 is the second version in a series of experiments exploring how supervised fine-tuning data format affects both cultural performance and the chain-of-thought reasoning capabilities of Qwen3's hybrid-reasoning architecture.
Author: Maruf Bepary, King's College London
Research report: Alignment in Large Language Models
Note: Thinking mode is the primary use case. V2 was trained exclusively on thinking-format data, which substantially improved Qwen3's chain-of-thought (<think>) performance whilst leaving No Thinking mode near baseline. Use Thinking mode (enable_thinking=True) to benefit from the fine-tuning. See Key Finding below.
Model Summary
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| PEFT type | LoRA |
| Task | Causal language modelling (Italian Q&A / instruction following) |
| Training dataset | Mult-IT (~21,108 thinking-format samples) |
| Evaluation benchmark | ITALIC (10,000 questions) |
| No Thinking accuracy (V2) | 70.27% (+0.10 pp over baseline, near-unchanged) |
| Thinking accuracy (V2) | 77.87% (+3.38 pp over baseline, improved) |
| Trainable parameters | 65,470,464 / 8,256,205,824 (0.79%) |
Intended Use
This model is intended for:
- Italian reasoning tasks: multiple-choice Q&A, cultural knowledge, and instruction following in Italian using chain-of-thought reasoning.
- Research: studying the effect of thinking-format SFT on hybrid-reasoning language models, and confirming the role of training data format in reasoning mode degradation.
- Benchmarking: comparing Italian cultural alignment across model sizes, training strategies, and inference modes.
Not recommended for:
- Tasks that require maximum No Thinking mode performance; use V1 or V3 for that.
- High-stakes or safety-critical applications.
- Languages other than Italian.
Key Finding: Thinking Mode Recovery
Training Qwen3 exclusively on thinking-format (synthetic chain-of-thought) data produces the inverse of the V1 result. Where V1 boosted No Thinking (+3.60 pp) whilst collapsing Thinking (-15.16 pp), V2 boosts Thinking (+3.38 pp) whilst leaving No Thinking virtually unchanged (+0.10 pp):
| Mode | Baseline | V2 | Delta |
|---|---|---|---|
| No Thinking (total) | 70.17% | 70.27% | +0.10 pp |
| Thinking (total) | 74.49% | 77.87% | +3.38 pp |
This confirms that the training data format directly determines which inference mode benefits from SFT. The reasoning-mode collapse observed in V1 was caused entirely by the non-thinking data format, not by the LoRA fine-tuning process itself.
A notable additional result: V2 Thinking (77.87%) surpasses the baseline Thinking (74.49%) and approaches Qwen3 14B Thinking (78.78%) despite being an 8B model. The gap to the 14B model narrows from 4.29 pp at baseline to 0.91 pp, which suggests that targeted SFT compensates for much of the size difference on this benchmark.
Version Series
| Version | Training Format | No Thinking Total | Thinking Total | Key Result |
|---|---|---|---|---|
| V1 | Non-thinking only | 73.77% (+3.60) | 59.33% (-15.16) | Thinking collapsed |
| V2 | Thinking only | 70.27% (+0.10) | 77.87% (+3.38) | Thinking recovered and improved |
| V3 | Mixed (both) | 73.81% (+3.64) | 77.57% (+3.08) | Both modes improved |
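The deltas in the table above can be re-derived from the raw totals. The following is a minimal sketch that only uses the scores reported in this card; the dictionary layout and the `deltas` helper are illustrative names, not part of any released evaluation code.

```python
# ITALIC totals (percent accuracy) copied from the Version Series table.
BASELINE = {"no_thinking": 70.17, "thinking": 74.49}
VERSIONS = {
    "V1": {"no_thinking": 73.77, "thinking": 59.33},
    "V2": {"no_thinking": 70.27, "thinking": 77.87},
    "V3": {"no_thinking": 73.81, "thinking": 77.57},
}


def deltas(scores, baseline=BASELINE):
    """Return the change versus baseline, in percentage points, per mode."""
    return {mode: round(scores[mode] - baseline[mode], 2) for mode in baseline}


for name, scores in VERSIONS.items():
    d = deltas(scores)
    print(name, f"{d['no_thinking']:+.2f} pp", f"{d['thinking']:+.2f} pp")
```

Reading the three rows side by side makes the pattern explicit: the sign of each delta follows the format of the training data used for that version.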
Training Details
LoRA Configuration
| Parameter | Value |
|---|---|
| LoRA rank (r) | 24 |
| LoRA alpha | 48 |
| LoRA dropout | 0.1 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | none |
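As a sanity check, the trainable parameter count in the Model Summary can be reproduced from the LoRA rank and the target modules. This sketch assumes the Qwen3-8B projection shapes below (hidden size 4096, key/value width 1024, MLP width 12288, 36 layers); those shapes are not stated in this card and should be checked against the base model's configuration.

```python
# Assumed Qwen3-8B shapes (not stated in this card; verify against the config).
HIDDEN = 4096
KV_DIM = 1024        # assumed: 8 key/value heads x 128 head dimension
INTERMEDIATE = 12288
NUM_LAYERS = 36

# (in_features, out_features) for each LoRA target module in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank, shapes, num_layers):
    """Each adapted matrix adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_trainable_params(24, TARGET_SHAPES, NUM_LAYERS)
total = 8_256_205_824  # total parameter count reported in the Model Summary
print(trainable, f"{100 * trainable / total:.2f}%")
```

Under these assumed shapes the result matches the reported 65,470,464 trainable parameters (0.79%).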
Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Total steps | ~577 |
| Per-device batch size | 4 |
| Gradient accumulation | 8 steps (effective batch: 32) |
| Sequence packing | Yes (max 2,048 tokens per slot) |
| Peak learning rate | 5×10⁻⁵ |
| LR schedule | Cosine |
| Warmup steps | |
| Max sequence length | 2,048 tokens |
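To make the schedule concrete, here is a small sketch of the effective batch size and a cosine learning-rate curve using the values in the table. The warmup value is blank in this card, so `warmup_steps` defaults to 0 purely as a placeholder assumption, and the function is a generic cosine decay rather than the exact scheduler used in training.

```python
import math

PEAK_LR = 5e-5
TOTAL_STEPS = 577        # approximate, as reported in the table
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 8


def cosine_lr(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR, warmup_steps=0):
    """Linear warmup followed by cosine decay to zero.

    warmup_steps=0 is a placeholder; the card does not state the real value.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Samples per optimiser update on a single GPU.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
print("effective batch:", effective_batch)

for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{cosine_lr(step):.2e}")
```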
Framework & Hardware
| Component | Version / Spec |
|---|---|
| TRL | 0.21.0 |
| PEFT | 0.17.0 |
| Transformers | 4.55.0 |
| PyTorch | 2.5.1+cu121 |
| Hardware | NVIDIA GeForce RTX 3090 |
Training Dataset: Mult-IT (Thinking Format)
- Dataset: Mult-IT (Multiple Choice Questions on Multiple Topics in Italian)
- Source: CALAMITA Shared Task @ CLiC-it 2024
- Language: Italian
- Size: ~21,108 training samples
- Format: JSONL, multiple-choice Q&A with synthetic CoT reasoning traces. Only those training samples answered correctly by the model were included (no non-thinking format examples).
- Reference: Mult-IT: Multiple Choice Questions on Multiple Topics in Italian (2024)
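The correctness filter described in the Format bullet could look roughly like the sketch below. The field names (`question`, `answer`, `predicted`, `reasoning`) and the toy records are hypothetical; the actual schema of the thinking-converted dataset is not given in this card.

```python
import json


def keep_correct(jsonl_lines):
    """Keep only records whose predicted letter matches the gold letter."""
    kept = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL stream
        record = json.loads(line)
        if record["predicted"].strip().upper() == record["answer"].strip().upper():
            kept.append(record)
    return kept


# Toy records with hypothetical field names, serialised as JSONL lines.
samples = [
    {"question": "Qual è la capitale d'Italia?", "answer": "B",
     "predicted": "B", "reasoning": "Roma è la capitale d'Italia."},
    {"question": "Quale fiume attraversa Firenze?", "answer": "A",
     "predicted": "C", "reasoning": "Il Tevere attraversa Firenze."},
]
sample_lines = [json.dumps(s, ensure_ascii=False) for s in samples]

kept = keep_correct(sample_lines)
print(len(kept), "of", len(sample_lines), "samples kept")
```

Dropping traces that end in a wrong answer avoids training on reasoning that leads to an incorrect conclusion, at the cost of a smaller dataset.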
ITALIC Benchmark Results
Benchmark: ITALIC (NAACL 2025), an Italian Culture-Aware Natural Language Benchmark
Format: Zero-shot, multiple-choice (12 categories, 10,000 questions)
System prompt: "Sei un assistente utile."
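For illustration, a zero-shot multiple-choice prompt and an accuracy score might be assembled as follows. The system prompt is the one stated above; the user-message layout mirrors the example in the Usage section and is an assumption about the evaluation format, and both helper names are hypothetical.

```python
SYSTEM_PROMPT = "Sei un assistente utile."
INSTRUCTION = "Rispondi con la lettera della risposta corretta."


def build_messages(question, options):
    """Build chat messages for one zero-shot multiple-choice question."""
    letters = "ABCDEFGH"
    option_lines = [f"{letters[i]}) {opt}" for i, opt in enumerate(options)]
    content = question + "\n" + "\n".join(option_lines) + "\n\n" + INSTRUCTION
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": content},
    ]


def accuracy(predictions, gold):
    """Percentage of predictions that exactly match the gold letters."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


messages = build_messages(
    "Qual è la capitale d'Italia?", ["Milano", "Roma", "Napoli", "Torino"]
)
print(messages[1]["content"])
print(accuracy(["B", "A", "C", "D"], ["B", "B", "C", "D"]))
```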
No Thinking Mode: V2 vs Baseline (virtually unchanged)
| Category | Baseline | V2 | Δ |
|---|---|---|---|
| Art | 69.29 | 68.27 | -1.02 |
| Civic | 73.18 | 72.86 | -0.32 |
| Events | 76.09 | 75.83 | -0.26 |
| Geography | 75.89 | 74.87 | -1.02 |
| History | 71.37 | 71.68 | +0.31 |
| Literature | 64.33 | 66.77 | +2.44 |
| Tourism | 68.27 | 66.63 | -1.64 |
| Lexicon | 84.27 | 85.29 | +1.02 |
| Morphology | 50.71 | 47.29 | -3.42 |
| Orthography | 54.04 | 55.51 | +1.47 |
| Synonyms | 84.04 | 84.45 | +0.41 |
| Syntax | 59.20 | 59.10 | -0.10 |
| Culture (subtotal) | 70.47 | 70.26 | -0.21 |
| Language (subtotal) | 69.73 | 70.28 | +0.55 |
| Total | 70.17 | 70.27 | +0.10 |
Thinking Mode: V2 vs Baseline (improved)
| Category | Baseline | V2 | Δ |
|---|---|---|---|
|---|---|---|---|
| Art | 74.85 | 76.48 | +1.63 |
| Civic | 74.07 | 76.90 | +2.83 |
| Events | 76.09 | 76.91 | +0.82 |
| Geography | 81.44 | 83.77 | +2.33 |
| History | 71.26 | 76.62 | +5.36 |
| Literature | 70.06 | 74.48 | +4.42 |
| Tourism | 66.87 | 71.13 | +4.26 |
| Lexicon | 89.82 | 91.36 | +1.54 |
| Morphology | 54.14 | 61.43 | +7.29 |
| Orthography | 68.67 | 72.18 | +3.51 |
| Synonyms | 91.57 | 92.61 | +1.04 |
| Syntax | 59.04 | 65.58 | +6.54 |
| Culture (subtotal) | 73.13 | 76.57 | +3.44 |
| Language (subtotal) | 76.49 | 79.79 | +3.30 |
| Total | 74.49 | 77.87 | +3.38 |
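To see where Thinking mode gained the most, the per-category rows above can be ranked by delta. This is a small sketch over the numbers in the table; the variable names are illustrative.

```python
# Thinking-mode ITALIC scores per category: (baseline, V2), from the table above.
THINKING = {
    "Art": (74.85, 76.48),
    "Civic": (74.07, 76.90),
    "Events": (76.09, 76.91),
    "Geography": (81.44, 83.77),
    "History": (71.26, 76.62),
    "Literature": (70.06, 74.48),
    "Tourism": (66.87, 71.13),
    "Lexicon": (89.82, 91.36),
    "Morphology": (54.14, 61.43),
    "Orthography": (68.67, 72.18),
    "Synonyms": (91.57, 92.61),
    "Syntax": (59.04, 65.58),
}

# Sort categories by gain in percentage points, largest first.
gains = sorted(
    ((round(v2 - base, 2), category) for category, (base, v2) in THINKING.items()),
    reverse=True,
)

for gain, category in gains:
    print(f"{category:<12} {gain:+.2f} pp")
```

Every category improves, and the three largest gains (Morphology, Syntax, History) all fall in categories where the baseline scored below its own total of 74.49.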
Comparison with Other Models (Thinking Mode, ITALIC Total)
| Model | Total | Parameters |
|---|---|---|
| Llama 3.1 70B | 83.61% | 70B |
| GPT-4o Mini | 82.22% | ~8B |
| Qwen3 14B (Thinking) | 78.78% | 14B |
| Qwen3 8B (Thinking) [V2] | 77.87% | 8B |
| Qwen3 8B (Thinking) [V3 / Mixed] | 77.57% | 8B |
| Qwen3 8B (Thinking) baseline | 74.49% | 8B |
| Qwen3 8B (Thinking) [V1] | 59.33% | 8B |
All scores evaluated under identical zero-shot conditions on the ITALIC benchmark.
Usage
Note: Thinking mode is recommended. Use enable_thinking=True to activate chain-of-thought reasoning and benefit from the V2 fine-tuning. No Thinking mode also works, but its performance is near baseline.
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import re
base_model_id = "Qwen/Qwen3-8B"
adapter_id = "m-beps/qwen3-8b-finetune-multit-thinking"
# Load tokeniser and base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()
# Example: Italian multiple-choice question
messages = [
    {"role": "system", "content": "Sei un assistente utile."},
    {
        "role": "user",
        "content": (
            "Qual è la capitale d'Italia?\n"
            "A) Milano\nB) Roma\nC) Napoli\nD) Torino\n\n"
            "Rispondi con la lettera della risposta corretta."
        ),
    },
]
# Apply the chat template with thinking mode enabled (recommended for V2)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # <-- primary mode for V2
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        temperature=None,
        top_p=None,
    )
# Decode only the newly generated tokens, not the prompt
full_response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
# Strip the <think>...</think> block to extract the final answer
final_answer = re.sub(r"<think>.*?</think>", "", full_response, flags=re.DOTALL).strip()
print(final_answer)
# Expected output: "B"
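The regex substitution above assumes the model always emits a closed <think>...</think> block. With a limited max_new_tokens budget the reasoning may be cut off before the closing tag, so a slightly more defensive parser can be useful. The helpers below are a hypothetical sketch, and the letter extraction is a simple heuristic rather than the method used for the reported scores.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from a decoded response.

    If the <think> block was never closed (for example, generation was
    truncated), everything after <think> is treated as reasoning and the
    answer is empty.
    """
    match = THINK_RE.search(text)
    if match is None:
        if "<think>" in text:
            return text.split("<think>", 1)[1].strip(), ""
        return "", text.strip()
    return match.group(1).strip(), THINK_RE.sub("", text).strip()


def extract_choice(answer, letters="ABCD"):
    """Heuristic: first standalone option letter in the answer, or None."""
    match = re.search(rf"\b([{letters}])\b", answer)
    return match.group(1) if match else None


reasoning, answer = split_thinking("<think>Roma è la capitale.</think>\n\nB")
print(repr(reasoning), repr(answer), extract_choice(answer))
```

An empty answer is a signal to retry with a larger max_new_tokens value rather than to score the response as wrong.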
No Thinking mode is also available but yields near-baseline performance:
# No Thinking mode: near-baseline performance
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # near-baseline; thinking mode is preferred for V2
)
Limitations
- No Thinking mode shows negligible improvement: thinking-only training does not improve direct (non-reasoning) inference. Use Thinking mode to benefit from fine-tuning.
- Morphology in No Thinking mode degraded slightly (-3.42 pp): thinking-only training appears to have diminished direct morphological recall when the reasoning pathway is not used.
- Benchmark scope: evaluation was conducted solely on ITALIC; Italian cultural performance on other benchmarks (e.g. MMLU-IT, HellaSwag-IT) is unverified.
- Single-GPU training: training used one RTX 3090; larger batch sizes or multi-GPU configurations may yield different results.
- Dataset bias: Mult-IT is a multiple-choice dataset; the model may not generalise equally well to open-ended Italian generation tasks.
References
Related resources:
- Research report: Alignment in Large Language Models
- Base model: Qwen/Qwen3-8B
- ITALIC benchmark: RiTA-nlp/ITALIC
- Mult-IT dataset: sapienzanlp/Mult-IT
- PEFT documentation: huggingface.co/docs/peft