Qwen3 8B: Italian Cultural Alignment [V1]

Qwen3 8B [V1] is a LoRA adapter fine-tuned on top of Qwen/Qwen3-8B to improve Italian cultural alignment. It was trained on the Mult-IT dataset and evaluated on the ITALIC benchmark. This is the first version in a series of experiments exploring how supervised fine-tuning affects both cultural performance and the chain-of-thought reasoning capabilities of Qwen3's hybrid-reasoning architecture.

Author: Maruf Bepary, King's College London
Research report: Alignment in Large Language Models

⚠️ Important: V1 was trained exclusively on non-thinking format data. This caused catastrophic forgetting of Qwen3's chain-of-thought (<think>) capability. Use No Thinking mode only with this adapter. See Key Finding below.


Model Summary

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen3-8B |
| PEFT type | LoRA |
| Task | Causal language modelling (Italian Q&A / instruction following) |
| Training dataset | Mult-IT (~86,929 samples) |
| Evaluation benchmark | ITALIC (10,000 questions) |
| No Thinking accuracy (V1) | 73.77% (+3.60 pp over baseline) |
| Thinking accuracy (V1) | 59.33% (−15.16 pp vs baseline; collapsed) |
| Trainable parameters | 65,470,464 / 8,256,205,824 (0.79%) |
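
The trainable-parameter figure can be reproduced from the LoRA rank (24) and the seven target modules listed under LoRA Configuration. The sketch below is a back-of-the-envelope check, not output from the training run. The Qwen3-8B layer dimensions are assumptions taken from the base model's published configuration and are not stated in this card.

```python
# Assumed Qwen3-8B dimensions (not stated in this card).
HIDDEN = 4096
LAYERS = 36
Q_OUT = 32 * 128       # 32 query heads x head_dim 128
KV_OUT = 8 * 128       # 8 key/value heads x head_dim 128
INTERMEDIATE = 12288
RANK = 24              # LoRA rank from this card

# (in_features, out_features) of each targeted projection in one decoder layer.
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * LAYERS
share = total / 8_256_205_824  # denominator as reported in this card

print(f"{total:,} trainable parameters ({share:.2%})")
```

Under these assumptions the count matches the reported 65,470,464 exactly, which suggests the adapter covers all 36 decoder layers.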

Intended Use

This model is intended for:

  • Italian language understanding: multiple-choice Q&A, cultural knowledge, and general instruction following in Italian.
  • Research: studying the effects of SFT on hybrid-reasoning language models, particularly reasoning mode degradation.
  • Benchmarking: comparing Italian cultural alignment across model sizes and training strategies.

Not recommended for:

  • Tasks requiring chain-of-thought reasoning (Thinking mode is severely degraded in V1).
  • High-stakes or safety-critical applications.
  • Languages other than Italian.

Key Finding: Reasoning Degradation

Training Qwen3 (a hybrid-reasoning model) exclusively on non-thinking format supervised fine-tuning data causes catastrophic forgetting of chain-of-thought capability:

| Mode | Baseline | V1 | Delta |
| --- | --- | --- | --- |
| No Thinking (total) | 70.17% | 73.77% | +3.60 pp |
| Thinking (total) | 74.49% | 59.33% | −15.16 pp |

V1 improved No Thinking performance in 11 of the 12 ITALIC categories (Events was essentially flat at −0.07 pp) whilst severely degrading the <think>…</think> reasoning pathway. This finding motivated a mixed-training approach in V2 and V3, where both thinking and non-thinking formatted examples are interleaved within a single SFT pass.
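
The interleaving idea can be illustrated with a minimal sketch. This is not the V2/V3 pipeline, which this card does not describe. The function name, the `mode` field, and the sample records are all hypothetical.

```python
import random

def interleave_examples(thinking, non_thinking, seed=0):
    """Tag each example with its format, then shuffle both sets into one list."""
    mixed = [dict(ex, mode="thinking") for ex in thinking]
    mixed += [dict(ex, mode="no_thinking") for ex in non_thinking]
    random.Random(seed).shuffle(mixed)  # fixed seed keeps the order reproducible
    return mixed

# Hypothetical records: one answer with a reasoning trace, two without.
thinking = [{"answer": "<think>Roma è la capitale d'Italia.</think>\n\nB"}]
non_thinking = [{"answer": "B"}, {"answer": "C"}]

mixed = interleave_examples(thinking, non_thinking)
print([ex["mode"] for ex in mixed])
```

Shuffling both formats into one list means every epoch exposes the model to reasoning traces as well as direct answers, which is the property the mixed approach relies on.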


Training Details

LoRA Configuration

| Parameter | Value |
| --- | --- |
| LoRA rank (r) | 24 |
| LoRA alpha | 48 |
| LoRA dropout | 0.1 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | none |
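
For readers who want to reproduce the adapter setup, the values above map onto PEFT's `LoraConfig` roughly as follows. This is a sketch rather than the author's training script. The `task_type` value and the helper `build_lora_config` are assumptions added for illustration.

```python
# Values copied from the LoRA Configuration table above.
lora_kwargs = {
    "r": 24,
    "lora_alpha": 48,
    "lora_dropout": 0.1,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "bias": "none",
    "task_type": "CAUSAL_LM",  # assumption: not stated in the table
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")

def build_lora_config():
    """Hypothetical helper: build the PEFT config (requires `peft` installed)."""
    from peft import LoraConfig  # imported lazily so the dict can be inspected without PEFT
    return LoraConfig(**lora_kwargs)
```

With alpha set to twice the rank, the adapter update is scaled by 2.0.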

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Epochs | 2 |
| Total steps | 3,076 |
| Per-device batch size | 4 |
| Sequence packing | Yes (max 2,048 tokens per slot) |
| Peak learning rate | ~4×10⁻⁵ |
| LR schedule | Cosine |
| Warmup steps | 308 (10% of total steps) |
| Max sequence length | 2,048 tokens |
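
The learning-rate trajectory implied by these hyperparameters can be sketched as below. The card only states "Cosine" with 308 warmup steps, so the linear warmup from zero and the decay to zero are assumptions (this is the common default shape). The peak value is approximate, as in the table.

```python
import math

PEAK_LR = 4e-5        # reported as approximately 4e-5
TOTAL_STEPS = 3076
WARMUP_STEPS = math.ceil(0.10 * TOTAL_STEPS)  # 10% of total steps -> 308

def lr_at(step):
    """Learning rate at a given optimiser step (assumed warmup/decay shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS  # linear warmup from 0
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

# Start, mid-warmup, peak, end of epoch 1 (checkpoint-1538), final step.
for step in (0, 154, 308, 1538, 3076):
    print(step, f"{lr_at(step):.2e}")
```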

Checkpoints

| Checkpoint | Step | Epoch |
| --- | --- | --- |
| checkpoint-1538 | 1,538 | 1 |
| checkpoint-3076 | 3,076 | 2 (final) |

Framework & Hardware

| Component | Version / Spec |
| --- | --- |
| TRL | 0.21.0 |
| PEFT | 0.17.0 |
| Transformers | 4.55.0 |
| PyTorch | 2.5.1+cu121 |
| Hardware | NVIDIA GeForce RTX 3090 |

Training Dataset: Mult-IT

  • Dataset: Mult-IT (Multiple Choice Questions on Multiple Topics in Italian)
  • Source: CALAMITA Shared Task @ CLiC-it 2024
  • Language: Italian
  • Size: ~86,929 training samples
  • Format: JSONL, multiple-choice Q&A
  • Reference: Mult-IT: Multiple Choice Questions on Multiple Topics in Italian (2024)

ITALIC Benchmark Results

Benchmark: ITALIC (NAACL 2025), the Italian Culture-Aware Natural Language Benchmark
Format: Zero-shot, multiple-choice (12 categories, 10,000 questions)
System prompt: "Sei un assistente utile."

No Thinking Mode: V1 vs Baseline

| Category | Baseline | V1 | Δ |
| --- | --- | --- | --- |
| Art | 69.29 | 71.02 | +1.73 |
| Civic | 73.18 | 76.98 | +3.80 |
| Events | 76.09 | 76.02 | −0.07 |
| Geography | 75.89 | 77.22 | +1.33 |
| History | 71.37 | 74.44 | +3.07 |
| Literature | 64.33 | 68.09 | +3.76 |
| Tourism | 68.27 | 69.49 | +1.22 |
| Lexicon | 84.27 | 87.33 | +3.06 |
| Morphology | 50.71 | 54.71 | +4.00 |
| Orthography | 54.04 | 63.44 | +9.40 |
| Synonyms | 84.04 | 90.42 | +6.38 |
| Syntax | 59.20 | 61.87 | +2.67 |
| Culture (subtotal) | 70.47 | 72.91 | +2.44 |
| Language (subtotal) | 69.73 | 75.05 | +5.32 |
| Total | 70.17 | 73.77 | +3.60 |
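
The per-category deltas can be recomputed from the baseline and V1 scores in the No Thinking table. The snippet below is a small arithmetic check that uses only the numbers reported in this card. It also counts how many categories improved and finds the largest gain.

```python
# (baseline, V1) No Thinking accuracy per ITALIC category, from the table above.
scores = {
    "Art": (69.29, 71.02),
    "Civic": (73.18, 76.98),
    "Events": (76.09, 76.02),
    "Geography": (75.89, 77.22),
    "History": (71.37, 74.44),
    "Literature": (64.33, 68.09),
    "Tourism": (68.27, 69.49),
    "Lexicon": (84.27, 87.33),
    "Morphology": (50.71, 54.71),
    "Orthography": (54.04, 63.44),
    "Synonyms": (84.04, 90.42),
    "Syntax": (59.20, 61.87),
}

# Delta in percentage points, rounded to the table's two decimals.
deltas = {name: round(v1 - base, 2) for name, (base, v1) in scores.items()}
improved = [name for name, d in deltas.items() if d > 0]
regressed = [name for name, d in deltas.items() if d < 0]
best = max(deltas, key=deltas.get)

print(f"improved: {len(improved)}/12, regressed: {regressed}")
print(f"largest gain: {best} ({deltas[best]:+.2f} pp)")
```

Eleven categories improve and Events is the only one that dips. Orthography shows the largest gain.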

Thinking Mode: V1 vs Baseline (collapsed)

| Metric | Baseline | V1 | Δ |
| --- | --- | --- | --- |
| Total | 74.49 | 59.33 | −15.16 |
| Culture | 73.13 | 57.25 | −15.88 |
| Language | 76.49 | 62.42 | −14.07 |

Comparison with Other Models (No Thinking, ITALIC Total)

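| Model | Total | Parameters |
| --- | --- | --- |
| Llama 3.1 70B | 83.61% | 70B |
| GPT-4o Mini | 82.22% | ~8B (unofficial estimate; not disclosed by OpenAI) |
| Qwen3 14B (No Thinking) | 77.78% | 14B |
| Llama 3.1 8B Ita [V1] | 73.91% | 8B |
| Qwen3 8B (No Thinking) [V3] | 73.81% | 8B |
| Qwen3 8B (No Thinking) [V1] | 73.77% | 8B |
| Qwen3 8B (No Thinking) baseline | 70.17% | 8B |
| Llama 3.1 8B | 66.38% | 8B |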

All scores were obtained under identical zero-shot conditions on the ITALIC benchmark.


Usage

⚠️ Thinking mode must be disabled. V1 fine-tuning disrupted Qwen3's chain-of-thought capability. Always pass enable_thinking=False when using this adapter.

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "Qwen/Qwen3-8B"
adapter_id = "m-beps/qwen3-8b-finetune-multit-nothinking"  # repository id as listed on the model page

# Load tokeniser and base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# Example: Italian multiple-choice question
messages = [
    {"role": "system", "content": "Sei un assistente utile."},
    {
        "role": "user",
        "content": (
            "Qual è la capitale d'Italia?\n"
            "A) Milano\nB) Roma\nC) Napoli\nD) Torino\n\n"
            "Rispondi con la lettera della risposta corretta."
        ),
    },
]

# Apply chat template with thinking mode disabled (critical for V1)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,   # <-- must be False for V1
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        # Greedy decoding: unset the sampling defaults so generate() does not warn
        temperature=None,
        top_p=None,
        top_k=None,
    )

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(response)
# Expected output: "B"
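
When scoring multiple-choice output, the generated text usually has to be reduced to a single option letter before it is compared with the gold answer. The helpers below are a minimal sketch of that step and are not the evaluation code used for the reported results. The names `extract_choice` and `accuracy`, and the assumption of four options labelled A to D, are illustrative.

```python
import re

def extract_choice(response, letters="ABCD"):
    """Return the first standalone option letter in the response, or None."""
    match = re.search(rf"\b([{letters}])\b", response.strip())
    return match.group(1) if match else None

def accuracy(predictions, gold):
    """Percentage of predictions that equal the gold answers."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# Hypothetical model responses and gold labels.
responses = ["B", "La risposta corretta è C) Napoli", "Non lo so"]
predictions = [extract_choice(r) for r in responses]
print(predictions)
print(accuracy(predictions, ["B", "C", "A"]))
```

The regular expression is case-sensitive and takes the first match, so a response that opens with the capitalised Italian preposition "A" would be misread as option A. A stricter pattern may be needed for free-form answers.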

Limitations

  • Thinking mode is severely degraded: chain-of-thought reasoning was catastrophically disrupted during V1 training (59.33% vs 74.49% for the baseline). Use No Thinking mode exclusively.
  • Morphology remains the weakest category at 54.71%, suggesting limited morphological generalisation.
  • Benchmark scope: evaluation was conducted solely on ITALIC; Italian cultural performance on other benchmarks (e.g. MMLU-IT, HellaSwag-IT) is unverified.
  • Single-GPU training: training used one RTX 3090; larger batch sizes or multi-GPU configurations may yield different results.
  • Dataset bias: Mult-IT is a multiple-choice dataset; the model may not generalise equally well to open-ended Italian generation tasks.
