Qwen3 8B: Italian Cultural Alignment [V2, Thinking Only]

Qwen3 8B [V2] is a LoRA adapter fine-tuned on top of Qwen/Qwen3-8B to improve Italian cultural alignment using exclusively thinking-format (synthetic chain-of-thought) training data. It was trained on a thinking-converted version of the Mult-IT dataset and evaluated on the ITALIC benchmark. V2 is the second version in a series of experiments exploring how supervised fine-tuning data format affects both cultural performance and the chain-of-thought reasoning capabilities of Qwen3's hybrid-reasoning architecture.

Author: Maruf Bepary, King's College London
Research report: Alignment in Large Language Models

Note: Thinking mode is the primary use case. V2 was trained exclusively on thinking-format data. This substantially improved Qwen3's chain-of-thought (<think>) performance whilst leaving No Thinking mode near-baseline. Use Thinking mode (enable_thinking=True) to benefit from fine-tuning. See Key Finding below.


Model Summary

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen3-8B |
| PEFT type | LoRA |
| Task | Causal language modelling (Italian Q&A / instruction following) |
| Training dataset | Mult-IT (~21,108 thinking-format samples) |
| Evaluation benchmark | ITALIC (10,000 questions) |
| No Thinking accuracy (V2) | 70.27% (+0.10 pp over baseline, near-unchanged) |
| Thinking accuracy (V2) | 77.87% (+3.38 pp over baseline, improved) |
| Trainable parameters | 65,470,464 / 8,256,205,824 (0.79%) |
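
The trainable-parameter figure can be reproduced from the LoRA rank and target modules listed under LoRA Configuration. The sketch below is only a sanity check: the layer dimensions (hidden size 4096, MLP size 12288, 36 layers, 32 query heads, 8 key/value heads, head dimension 128) are assumed from the public Qwen3-8B configuration and are not stated in this card.

```python
# Sanity check of the LoRA trainable-parameter count reported in the table.
# Assumed Qwen3-8B dimensions (not stated in this card):
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

RANK = 24  # LoRA rank from the LoRA Configuration table

# (in_features, out_features) of each targeted linear layer
shapes = {
    "q_proj": (HIDDEN, NUM_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair adds A (rank x in) and B (out x rank) weights, no bias."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
trainable = per_layer * NUM_LAYERS
total = 8_256_205_824  # base model plus adapter, as reported in the table

print(f"{trainable:,} trainable ({100 * trainable / total:.2f}% of {total:,})")
```

Under these assumptions the count matches the reported 65,470,464 exactly, which suggests the adapter covers all seven projection types in every decoder layer.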

Intended Use

This model is intended for:

  • Italian reasoning tasks: multiple-choice Q&A, cultural knowledge, and instruction following in Italian using chain-of-thought reasoning.
  • Research: studying the effect of thinking-format SFT on hybrid-reasoning language models, and confirming the role of training data format in reasoning mode degradation.
  • Benchmarking: comparing Italian cultural alignment across model sizes, training strategies, and inference modes.

Not recommended for:

  • Tasks that require maximum No Thinking mode performance (use V1 or V3 for that).
  • High-stakes or safety-critical applications.
  • Languages other than Italian.

Key Finding: Thinking Mode Recovery

Training Qwen3 exclusively on thinking-format (synthetic chain-of-thought) data produces the inverse of the V1 result. Where V1 boosted No Thinking (+3.60 pp) whilst collapsing Thinking (−15.16 pp), V2 boosts Thinking (+3.38 pp) whilst leaving No Thinking virtually unchanged (+0.10 pp):

| Mode | Baseline | V2 | Delta |
| --- | --- | --- | --- |
| No Thinking (total) | 70.17% | 70.27% | +0.10 pp |
| Thinking (total) | 74.49% | 77.87% | +3.38 pp |

These results indicate that the training data format determines which inference mode benefits from SFT: the reasoning-mode collapse observed in V1 stems from the non-thinking data format, not from the LoRA fine-tuning process itself.

A notable additional result: V2 Thinking (77.87%) surpasses the baseline Thinking (74.49%) and comes within 0.91 pp of Qwen3 14B Thinking (78.78%) despite being an 8B model, suggesting that targeted SFT can compensate for much of the size gap on this benchmark.
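
As a quick arithmetic check of the figures quoted in this section, the following sketch recomputes the per-mode deltas and the gap to Qwen3 14B from the reported totals. It uses only numbers already given in this card; the variable names are illustrative.

```python
# Recompute the headline deltas from the reported ITALIC totals (percent).
scores = {
    "No Thinking": (70.17, 70.27),  # (baseline, V2)
    "Thinking": (74.49, 77.87),
}
QWEN3_14B_THINKING = 78.78

# Delta in percentage points for each inference mode
deltas = {mode: round(v2 - base, 2) for mode, (base, v2) in scores.items()}

# Gap between the 8B model and Qwen3 14B in Thinking mode, before and after SFT
gap_before = round(QWEN3_14B_THINKING - scores["Thinking"][0], 2)
gap_after = round(QWEN3_14B_THINKING - scores["Thinking"][1], 2)

for mode, delta in deltas.items():
    print(f"{mode}: {delta:+.2f} pp")
print(f"Gap to Qwen3 14B (Thinking): {gap_before:.2f} pp -> {gap_after:.2f} pp")
```

The gap to the 14B model narrows from 4.29 pp to 0.91 pp in Thinking mode.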


Version Series

| Version | Training Format | No Thinking Total | Thinking Total | Key Result |
| --- | --- | --- | --- | --- |
| V1 | Non-thinking only | 73.77% (+3.60) | 59.33% (−15.16) | Thinking collapsed |
| V2 | Thinking only | 70.27% (+0.10) | 77.87% (+3.38) | Thinking recovered and improved |
| V3 | Mixed (both) | 73.81% (+3.64) | 77.57% (+3.08) | Both modes improved |

Training Details

LoRA Configuration

| Parameter | Value |
| --- | --- |
| LoRA rank (r) | 24 |
| LoRA alpha | 48 |
| LoRA dropout | 0.1 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | none |
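
For readers who want to reproduce a comparable adapter, the settings above map onto the keyword arguments of `peft.LoraConfig` roughly as follows. This is a sketch rather than the author's training script: the `task_type` value is an assumption, and the `LoraConfig` call is left as a comment so the snippet runs without PEFT installed.

```python
# LoRA settings from the table, expressed as keyword arguments for peft.LoraConfig.
lora_kwargs = {
    "r": 24,
    "lora_alpha": 48,
    "lora_dropout": 0.1,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "bias": "none",
    "task_type": "CAUSAL_LM",  # assumption: not stated in the table
}

# With standard LoRA scaling, the low-rank update is multiplied by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor (alpha / r): {scaling}")

# To build the actual config (requires the peft package):
# from peft import LoraConfig
# lora_config = LoraConfig(**lora_kwargs)
```

An alpha of twice the rank gives a scaling factor of 2.0, a common choice when the rank is varied between experiments.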

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Epochs | 1 |
| Total steps | ~577 |
| Per-device batch size | 4 |
| Gradient accumulation | 8 steps (effective batch: 32) |
| Sequence packing | Yes (max 2,048 tokens per slot) |
| Peak learning rate | 5×10⁻⁵ |
| LR schedule | Cosine |
| Warmup steps | 29 steps (5%) |
| Max sequence length | 2,048 tokens |
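
The derived values in this table (effective batch size and warmup steps) follow from the other hyperparameters. The sketch below recomputes them and illustrates a cosine schedule with linear warmup. It assumes a single GPU (as listed under Framework & Hardware), a peak learning rate of 5e-5 as read from the table, and the common "linear warmup, then cosine decay to zero" shape; the exact scheduler used in training is not stated in this card.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8
NUM_GPUS = 1            # assumption: single RTX 3090
TOTAL_STEPS = 577       # approximate, per the table
WARMUP_RATIO = 0.05
PEAK_LR = 5e-5          # as read from the table

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
warmup_steps = round(WARMUP_RATIO * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Learning rate at an optimiser step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch: {effective_batch}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

Because sequence packing is enabled, each of the 32 slots per optimiser step may hold several short samples, so the step count does not translate directly into a sample count.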

Framework & Hardware

| Component | Version / Spec |
| --- | --- |
| TRL | 0.21.0 |
| PEFT | 0.17.0 |
| Transformers | 4.55.0 |
| PyTorch | 2.5.1+cu121 |
| Hardware | NVIDIA GeForce RTX 3090 |

Training Dataset: Mult-IT (Thinking Format)

  • Dataset: Mult-IT β€” Multiple Choice Questions on Multiple Topics in Italian
  • Source: CALAMITA Shared Task @ CLiC-it 2024
  • Language: Italian
  • Size: ~21,108 training samples
  • Format: JSONL, multiple-choice Q&A with synthetic CoT reasoning traces. Only samples that the model answered correctly were retained, and no non-thinking-format examples were included.
  • Reference: Mult-IT: Multiple Choice Questions on Multiple Topics in Italian (2024)
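
The filtering and formatting described above could look roughly like the sketch below. It is illustrative only: the field names (`question`, `reasoning`, `predicted`, `gold`), the `messages` layout, and the exact placement of the `<think>` block are assumptions, not the author's actual conversion script or the Mult-IT schema.

```python
import json
from typing import Optional


def to_thinking_sample(question: str, reasoning: str,
                       predicted: str, gold: str) -> Optional[dict]:
    """Build one thinking-format chat sample, or None if the answer was wrong."""
    answer = predicted.strip().upper()
    if answer != gold.strip().upper():
        return None  # keep only samples that were answered correctly
    # Assumed layout: reasoning inside <think>...</think>, then the final answer
    assistant = f"<think>\n{reasoning.strip()}\n</think>\n\n{answer}"
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": assistant},
        ]
    }


# Hypothetical candidates: the second is discarded because it is incorrect.
candidates = [
    {"question": "Qual è la capitale d'Italia?\nA) Milano\nB) Roma",
     "reasoning": "Roma è la capitale dal 1871.", "predicted": "B", "gold": "B"},
    {"question": "Qual è il fiume più lungo d'Italia?\nA) Po\nB) Tevere",
     "reasoning": "Il Tevere attraversa Roma.", "predicted": "B", "gold": "A"},
]

samples = [to_thinking_sample(**c) for c in candidates]
jsonl_lines = [json.dumps(s, ensure_ascii=False) for s in samples if s is not None]
print("\n".join(jsonl_lines))
```

Filtering on answer correctness keeps traces whose reasoning at least ends at the right option, although it does not guarantee that every intermediate step is sound.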

ITALIC Benchmark Results

Benchmark: ITALIC (NAACL 2025), the Italian Culture-Aware Natural Language Benchmark
Format: Zero-shot, multiple-choice (12 categories, 10,000 questions)
System prompt: "Sei un assistente utile."

No Thinking Mode: V2 vs Baseline (virtually unchanged)

| Category | Baseline | V2 | Δ |
| --- | --- | --- | --- |
| Art | 69.29 | 68.27 | −1.02 |
| Civic | 73.18 | 72.86 | −0.32 |
| Events | 76.09 | 75.83 | −0.26 |
| Geography | 75.89 | 74.87 | −1.02 |
| History | 71.37 | 71.68 | +0.31 |
| Literature | 64.33 | 66.77 | +2.44 |
| Tourism | 68.27 | 66.63 | −1.64 |
| Lexicon | 84.27 | 85.29 | +1.02 |
| Morphology | 50.71 | 47.29 | −3.42 |
| Orthography | 54.04 | 55.51 | +1.47 |
| Synonyms | 84.04 | 84.45 | +0.41 |
| Syntax | 59.20 | 59.10 | −0.10 |
| Culture (subtotal) | 70.47 | 70.26 | −0.21 |
| Language (subtotal) | 69.73 | 70.28 | +0.55 |
| Total | 70.17 | 70.27 | +0.10 |

Thinking Mode: V2 vs Baseline (improved)

| Category | Baseline | V2 | Δ |
| --- | --- | --- | --- |
| Art | 74.85 | 76.48 | +1.63 |
| Civic | 74.07 | 76.90 | +2.83 |
| Events | 76.09 | 76.91 | +0.82 |
| Geography | 81.44 | 83.77 | +2.33 |
| History | 71.26 | 76.62 | +5.36 |
| Literature | 70.06 | 74.48 | +4.42 |
| Tourism | 66.87 | 71.13 | +4.26 |
| Lexicon | 89.82 | 91.36 | +1.54 |
| Morphology | 54.14 | 61.43 | +7.29 |
| Orthography | 68.67 | 72.18 | +3.51 |
| Synonyms | 91.57 | 92.61 | +1.04 |
| Syntax | 59.04 | 65.58 | +6.54 |
| Culture (subtotal) | 73.13 | 76.57 | +3.44 |
| Language (subtotal) | 76.49 | 79.79 | +3.30 |
| Total | 74.49 | 77.87 | +3.38 |

Comparison with Other Models (Thinking Mode, ITALIC Total)

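| Model | Total | Parameters |
| --- | --- | --- |
| Llama 3.1 70B | 83.61% | 70B |
| GPT-4o Mini | 82.22% | ~8B (unofficial estimate; not disclosed by OpenAI) |
| Qwen3 14B (Thinking) | 78.78% | 14B |
| Qwen3 8B (Thinking) [V2] | 77.87% | 8B |
| Qwen3 8B (Thinking) [V3 / Mixed] | 77.57% | 8B |
| Qwen3 8B (Thinking) baseline | 74.49% | 8B |
| Qwen3 8B (Thinking) [V1] | 59.33% | 8B |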

All scores evaluated under identical zero-shot conditions on the ITALIC benchmark.


Usage

✅ Thinking mode is recommended. Use enable_thinking=True to activate chain-of-thought reasoning and benefit from V2 fine-tuning. No Thinking mode also works but performance is near-baseline.

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import re

base_model_id = "Qwen/Qwen3-8B"
adapter_id = "maruf-bepary/qwen3-8b-italian-v2-thinking"

# Load tokeniser and base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# Example: Italian multiple-choice question
messages = [
    {"role": "system", "content": "Sei un assistente utile."},
    {
        "role": "user",
        "content": (
            "Qual è la capitale d'Italia?\n"
            "A) Milano\nB) Roma\nC) Napoli\nD) Torino\n\n"
            "Rispondi con la lettera della risposta corretta."
        ),
    },
]

# Apply chat template: enable thinking mode (recommended for V2)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,   # <-- primary mode for V2
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        temperature=None,
        top_p=None,
    )

full_response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)

# Strip the <think>...</think> block to extract the final answer
final_answer = re.sub(r"<think>.*?</think>", "", full_response, flags=re.DOTALL).strip()
print(final_answer)
# Expected output: "B"
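
The regex above assumes the reasoning block is always closed. If generation stops at max_new_tokens before `</think>` is emitted, nothing is stripped and the partial reasoning is returned as the "answer". A more defensive parser might look like the sketch below; `extract_answer` is a hypothetical helper, not part of Transformers or PEFT, and the single-letter pattern assumes options labelled A to D as in the example prompt.

```python
import re
from typing import Optional

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
ANSWER_LETTER = re.compile(r"\b([A-D])\b")  # assumes options labelled A-D


def extract_answer(full_response: str) -> Optional[str]:
    """Return the chosen option letter, or None if no final answer is present."""
    text = THINK_BLOCK.sub("", full_response)
    if "<think>" in text:
        # Reasoning was opened but never closed: generation was truncated.
        return None
    # Tolerate a closing tag without an opening one (e.g. a prefilled <think>).
    text = text.split("</think>")[-1]
    match = ANSWER_LETTER.search(text)
    return match.group(1) if match else None


print(extract_answer("<think>Roma è la capitale.</think>\n\nB"))
print(extract_answer("<think>Devo considerare le opzioni"))
```

A `None` result is a signal to retry with a larger max_new_tokens budget. Note that a standalone capital "A" at the start of an Italian sentence would also match, so stricter prompts or patterns may be needed for free-form answers.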

No Thinking mode is also available but yields near-baseline performance:

# No Thinking mode β€” near-baseline performance
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # near-baseline; thinking mode is preferred for V2
)

Limitations

  • No Thinking mode shows negligible improvement (+0.10 pp): thinking-only training does not improve direct (non-reasoning) inference. Use Thinking mode to benefit from fine-tuning.
  • Morphology in No Thinking mode regressed by 3.42 pp, the largest per-category drop: thinking-only training appears to have weakened direct morphological recall when the reasoning pathway is not used.
  • Benchmark scope: evaluation was conducted solely on ITALIC; Italian cultural performance on other benchmarks (e.g. MMLU-IT, HellaSwag-IT) is unverified.
  • Single-GPU training: training used one RTX 3090; larger batch sizes or multi-GPU configurations may yield different results.
  • Dataset bias: Mult-IT is a multiple-choice dataset; the model may not generalise equally well to open-ended Italian generation tasks.
