Qwen3 8B — Italian Cultural Alignment [V1]

Qwen3 8B [V1] is a LoRA adapter fine-tuned on top of Qwen/Qwen3-8B to improve Italian cultural alignment. It was trained on the Mult-IT dataset and evaluated on the ITALIC benchmark. This is the first version in a series of experiments exploring how supervised fine-tuning affects both cultural performance and the chain-of-thought reasoning capabilities of Qwen3's hybrid-reasoning architecture.

Author: Maruf Bepary, King's College London
Research report: Alignment in Large Language Models

⚠️ Important: V1 was trained exclusively on non-thinking format data. This caused catastrophic forgetting of Qwen3's chain-of-thought (<think>) capability. Use No Thinking mode only with this adapter. See Key Finding below.

Model Summary

Property	Value
Base model	`Qwen/Qwen3-8B`
PEFT type	LoRA
Task	Causal language modelling (Italian Q&A / instruction following)
Training dataset	Mult-IT (~86,929 samples)
Evaluation benchmark	ITALIC (10,000 questions)
No Thinking accuracy (V1)	73.77% (+3.60 pp over baseline)
Thinking accuracy (V1)	59.33% (−15.16 pp — collapsed)
Trainable parameters	65,470,464 / 8,256,205,824 (0.79%)

Intended Use

This model is intended for:

Italian language understanding — multiple-choice Q&A, cultural knowledge, and general instruction following in Italian.
Research — studying the effects of SFT on hybrid-reasoning language models, particularly reasoning mode degradation.
Benchmarking — comparing Italian cultural alignment across model sizes and training strategies.

Not recommended for:

Tasks requiring chain-of-thought reasoning (Thinking mode accuracy fell from 74.49% to 59.33% in V1).
High-stakes or safety-critical applications.
Languages other than Italian.

Key Finding — Reasoning Degradation

Training Qwen3 (a hybrid-reasoning model) exclusively on non-thinking format supervised fine-tuning data causes catastrophic forgetting of chain-of-thought capability:

Mode	Baseline	V1	Delta
No Thinking (total)	70.17%	73.77%	+3.60 pp
Thinking (total)	74.49%	59.33%	−15.16 pp

V1 improved No Thinking performance in 11 of the 12 ITALIC categories (Events was essentially unchanged at −0.07 pp) whilst disrupting the <think>…</think> reasoning pathway. This finding motivated a mixed-training approach in V2 and V3, where both thinking and non-thinking formatted examples are interleaved within a single SFT pass.

Training Details

LoRA Configuration

Parameter	Value
LoRA rank (`r`)	24
LoRA alpha	48
LoRA dropout	0.1
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Bias	none
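
The trainable-parameter figure in the Model Summary can be reproduced from the LoRA configuration above. The sketch below is a minimal check, not the training code: the Qwen3-8B layer dimensions (hidden size, head counts, MLP width, number of layers) are assumptions that do not appear in this card, and the alpha/rank scaling factor assumes standard (non rank-stabilised) LoRA.

```python
# Assumed Qwen3-8B dimensions (not stated in this card).
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

# Values taken from the LoRA Configuration table.
RANK = 24
ALPHA = 48

# (in_features, out_features) of each targeted linear layer.
TARGETS = {
    "q_proj": (HIDDEN, NUM_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, targets, num_layers):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in targets.values())
    return per_layer * num_layers


trainable = lora_param_count(RANK, TARGETS, NUM_LAYERS)
total = 8_256_205_824  # total reported in the Model Summary table

print(f"trainable parameters: {trainable:,}")
print(f"share of total: {100 * trainable / total:.2f}%")
print(f"LoRA scaling (alpha / r): {ALPHA / RANK}")
```

Under these assumptions the count comes to 65,470,464, matching the reported 0.79% share.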

Training Hyperparameters

Parameter	Value
Epochs	2
Total steps	3,076
Per-device batch size	4
Sequence packing	Yes (max 2,048 tokens per slot)
Peak learning rate	~4×10⁻⁵
LR schedule	Cosine
Warmup steps	~308 steps (~10%)
Max sequence length	2,048 tokens
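
To make the schedule concrete, the sketch below evaluates the learning rate at a few steps. It assumes a linear warmup followed by cosine decay to zero, which is the shape of the default cosine schedule in Transformers; the card itself only states "Cosine", roughly 308 warmup steps and a peak of about 4×10⁻⁵, so the exact curve used in training may differ. The function name `lr_at` is illustrative.

```python
import math

# Values taken from the Training Hyperparameters table.
PEAK_LR = 4e-5
WARMUP_STEPS = 308
TOTAL_STEPS = 3076


def lr_at(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at a given optimiser step (assumed schedule shape)."""
    if step < warmup:
        # Linear ramp from 0 up to the peak learning rate.
        return peak * step / max(1, warmup)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup) / max(1, total - warmup)
    return peak * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Start, mid-warmup, end of warmup, end of epoch 1, final step.
for step in (0, 154, 308, 1538, 3076):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```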

Checkpoints

Checkpoint	Step	Epoch
`checkpoint-1538`	1,538	1
`checkpoint-3076`	3,076	2 (final)

Framework & Hardware

Component	Version / Spec
TRL	0.21.0
PEFT	0.17.0
Transformers	4.55.0
PyTorch	2.5.1+cu121
Hardware	NVIDIA GeForce RTX 3090

Training Dataset — Mult-IT

Dataset: Mult-IT — Multiple Choice Questions on Multiple Topics in Italian
Source: CALAMITA Shared Task @ CLiC-it 2024
Language: Italian
Size: ~86,929 training samples
Format: JSONL, multiple-choice Q&A
Reference: Mult-IT: Multiple Choice Questions on Multiple Topics in Italian (2024)

ITALIC Benchmark Results

Benchmark: ITALIC (NAACL 2025) — Italian Culture-Aware Natural Language Benchmark
Format: Zero-shot, multiple-choice (12 categories, 10,000 questions)
System prompt: "Sei un assistente utile."

No Thinking Mode — V1 vs Baseline

Category	Baseline (%)	V1 (%)	Δ (pp)
Art	69.29	71.02	+1.73
Civic	73.18	76.98	+3.80
Events	76.09	76.02	−0.07
Geography	75.89	77.22	+1.33
History	71.37	74.44	+3.07
Literature	64.33	68.09	+3.76
Tourism	68.27	69.49	+1.22
Lexicon	84.27	87.33	+3.06
Morphology	50.71	54.71	+4.00
Orthography	54.04	63.44	+9.40
Synonyms	84.04	90.42	+6.38
Syntax	59.20	61.87	+2.67
Culture (subtotal)	70.47	72.91	+2.44
Language (subtotal)	69.73	75.05	+5.32
Total	70.17	73.77	+3.60
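
The Δ column can be recomputed directly from the Baseline and V1 columns. The short check below uses only the per-category values in the table above; it lists which categories improved or regressed and which gained the most. The variable names are illustrative.

```python
# (baseline, V1) No Thinking accuracy per ITALIC category, from the table above.
SCORES = {
    "Art": (69.29, 71.02),
    "Civic": (73.18, 76.98),
    "Events": (76.09, 76.02),
    "Geography": (75.89, 77.22),
    "History": (71.37, 74.44),
    "Literature": (64.33, 68.09),
    "Tourism": (68.27, 69.49),
    "Lexicon": (84.27, 87.33),
    "Morphology": (50.71, 54.71),
    "Orthography": (54.04, 63.44),
    "Synonyms": (84.04, 90.42),
    "Syntax": (59.20, 61.87),
}

# Delta in percentage points, rounded to the table's two decimals.
deltas = {name: round(v1 - base, 2) for name, (base, v1) in SCORES.items()}

improved = [name for name, d in deltas.items() if d > 0]
regressed = [name for name, d in deltas.items() if d < 0]
largest_gain = max(deltas, key=deltas.get)

print(f"improved: {len(improved)} of {len(SCORES)} categories")
print(f"regressed: {regressed}")
print(f"largest gain: {largest_gain} ({deltas[largest_gain]:+.2f} pp)")
```

With these figures, 11 of the 12 categories improve, Events dips by 0.07 pp, and Orthography shows the largest gain.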

Thinking Mode — V1 vs Baseline (collapsed)

Metric	Baseline (%)	V1 (%)	Δ (pp)
Total	74.49	59.33	−15.16
Culture	73.13	57.25	−15.88
Language	76.49	62.42	−14.07

Comparison with Other Models (No Thinking, ITALIC Total)

Model	Total	Parameters
Llama 3.1 70B	83.61%	70B
GPT-4o Mini	82.22%	Undisclosed (~8B is an unofficial estimate)
Qwen3 14B (No Thinking)	77.78%	14B
Llama 3.1 8B Ita [V1]	73.91%	8B
Qwen3 8B (No Thinking) [V3]	73.81%	8B
Qwen3 8B (No Thinking) [V1]	73.77%	8B
Qwen3 8B (No Thinking) baseline	70.17%	8B
Llama 3.1 8B	66.38%	8B

All scores evaluated under identical zero-shot conditions on the ITALIC benchmark.

Usage

⚠️ Thinking mode must be disabled. V1 fine-tuning disrupted Qwen3's chain-of-thought capability. Always pass enable_thinking=False to tokenizer.apply_chat_template when using this adapter.

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "Qwen/Qwen3-8B"
# Adapter repository on the Hugging Face Hub (the repository hosting this model card)
adapter_id = "m-beps/qwen3-8b-finetune-multit-nothinking"

# Load tokeniser and base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# Example: Italian multiple-choice question
messages = [
    {"role": "system", "content": "Sei un assistente utile."},
    {
        "role": "user",
        "content": (
            "Qual è la capitale d'Italia?\n"
            "A) Milano\nB) Roma\nC) Napoli\nD) Torino\n\n"
            "Rispondi con la lettera della risposta corretta."
        ),
    },
]

# Apply chat template — disable thinking mode (critical for V1)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,   # <-- must be False for V1
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

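# Greedy decoding; sampling parameters are set to None so that the base
# model's sampling defaults do not trigger generation-config warnings
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        temperature=None,
        top_p=None,
        top_k=None,
    )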

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(response)
# Expected output: "B"
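
For multiple-choice evaluation, the generated text usually has to be reduced to a single option letter before scoring. The helper below is a hypothetical sketch of one way to do that; `extract_choice` and its regular expression are illustrative and are not the parsing logic used for the ITALIC results reported here.

```python
import re

# First standalone capital letter A-D in the response (illustrative rule).
_CHOICE_PATTERN = re.compile(r"\b([A-D])\b")


def extract_choice(response):
    """Return the first standalone option letter, or None if there is none."""
    match = _CHOICE_PATTERN.search(response.strip())
    return match.group(1) if match else None


print(extract_choice("B"))
print(extract_choice("B) Roma"))
print(extract_choice("Non lo so"))
```

A rule this simple can misfire on free-form answers, for example when a sentence begins with the Italian preposition "A", so it is best paired with a prompt that asks for the letter only, as in the example above.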

Limitations

Thinking mode is degraded: chain-of-thought reasoning was disrupted during V1 training (59.33% on ITALIC against a 74.49% baseline). Use No Thinking mode exclusively.
Morphology remains the weakest category at 54.71%, suggesting limited morphological generalisation.
Benchmark scope — evaluation was conducted solely on ITALIC; Italian cultural performance on other benchmarks (e.g. MMLU-IT, HellaSwag-IT) is unverified.
Single-GPU training — training used one RTX 3090; larger batch sizes or multi-GPU configurations may yield different results.
Dataset bias — Mult-IT is a multiple-choice dataset; the model may not generalise equally well to open-ended Italian generation tasks.

Related resources:

Research report: Alignment in Large Language Models
Base model: Qwen/Qwen3-8B
ITALIC benchmark: RiTA-nlp/ITALIC
Mult-IT dataset: sapienzanlp/Mult-IT
PEFT documentation: huggingface.co/docs/peft
