# ALIA-40B Distill Vapol
`apol/alia-40b-distill-vapol` is a post-trained release derived from `BSC-LT/ALIA-40b-instruct-2601`, optimized for practical multilingual assistant behavior, structured output reliability, tool-call formatting, RAG-style answers, and coding/debugging tasks.

- Interactive demo Space: `apol/alia-40b-distill-vapol-demo`
- Detailed technical article (in Spanish): `BLOG.md`
## Deliverables

This repo contains:

| Artifact | Location | Use |
|---|---|---|
| Q4_K_M GGUF | `gguf_chunks/ALIA-40b-distill-vapol-Q4_K_M.gguf.part-*` | Transport chunks for reconstructing the single-file llama.cpp / LM Studio deployment. |
| PEFT adapter | `adapter/` | Highest-fidelity Hub artifact; load on top of `BSC-LT/ALIA-40b-instruct-2601` for adapter-based inference or further research. |
| Runtime helper | `runtime/repair_eval_responses.py` | Optional deterministic repair layer for strict JSON/tool/RAG/code contracts. |
| Evaluation reports | `reports/` | Local task metrics, hidden-suite validation outputs, and distillation summaries. |
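The runtime repair helper ships as `runtime/repair_eval_responses.py`; its exact behavior is repo-specific, but a minimal sketch of what such a deterministic repair layer can do (hypothetical helper, not the shipped code) looks like this:

```python
import json
import re

def repair_json_output(raw: str):
    """Deterministically coerce a model response toward valid JSON.

    Tries, in order: direct parse, stripping Markdown code fences,
    and extracting the outermost {...} span. Returns None if all fail.
    """
    candidates = [raw.strip()]
    # Strip ```json ... ``` fences if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1).strip())
    # Fall back to the outermost brace-delimited span.
    brace = re.search(r"\{.*\}", raw, re.DOTALL)
    if brace:
        candidates.append(brace.group(0))
    for cand in candidates:
        try:
            return json.loads(cand)
        except json.JSONDecodeError:
            continue
    return None
```

Because every step is rule-based, the layer is fully deterministic: the same raw response always repairs to the same result, which keeps strict validators reproducible.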
## Intended Use
The model is intended for general assistant use, with emphasis on:
- Spanish assistant tasks.
- Catalan, Basque, and Galician instruction following.
- Structured JSON output.
- Tool-call formatting and missing-argument clarification.
- Administrative and legal-style summarization.
- Coding/debugging assistance.
- Source-grounded long-context and RAG-style synthesis.
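To illustrate the tool-call behaviors above, a caller can check a model-emitted tool call for missing required arguments before execution and, when something is absent, route back to the model for a clarification turn. The contract and helper below are hypothetical; the model card does not prescribe a specific tool-call schema:

```python
import json

# Hypothetical tool contract: tool name mapped to its required argument keys.
TOOL_CONTRACTS = {
    "buscar_expediente": {"required": ["numero_expediente", "organismo"]},
}

def missing_arguments(tool_call_json: str):
    """Return the required arguments absent from a model-emitted tool call."""
    call = json.loads(tool_call_json)
    contract = TOOL_CONTRACTS[call["name"]]
    provided = call.get("arguments", {})
    return [arg for arg in contract["required"] if arg not in provided]
```

If the returned list is non-empty, the deployment asks the user (or the model) for the missing fields instead of executing an underspecified call.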
## Loading

### PEFT Adapter
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "BSC-LT/ALIA-40b-instruct-2601"
repo = "apol/alia-40b-distill-vapol"

# The tokenizer ships alongside the adapter in the `adapter/` subfolder.
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="adapter")
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto", torch_dtype="auto")
model = PeftModel.from_pretrained(model, repo, subfolder="adapter")
```
### llama.cpp / LM Studio
The Q4_K_M GGUF is published as transport chunks for reliable Hub distribution. Reassemble it locally before loading:
```shell
cat gguf_chunks/ALIA-40b-distill-vapol-Q4_K_M.gguf.part-* > ALIA-40b-distill-vapol-Q4_K_M.gguf
sha256sum ALIA-40b-distill-vapol-Q4_K_M.gguf
```
Expected SHA256:

```
45f75478c721cf26617dc10f89bbfc663f5946a3779ddd19982bb7787790d285
```
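Equivalently, the chunks can be concatenated and verified in one pass from Python (the expected digest is the SHA256 stated above):

```python
import hashlib
from pathlib import Path

def reassemble_and_hash(chunk_glob: str, out_path: str) -> str:
    """Concatenate sorted chunk files into one GGUF and return its SHA256."""
    digest = hashlib.sha256()
    with open(out_path, "wb") as out:
        for part in sorted(Path(".").glob(chunk_glob)):
            data = part.read_bytes()
            out.write(data)
            digest.update(data)
    return digest.hexdigest()

EXPECTED = "45f75478c721cf26617dc10f89bbfc663f5946a3779ddd19982bb7787790d285"
# sha = reassemble_and_hash(
#     "gguf_chunks/ALIA-40b-distill-vapol-Q4_K_M.gguf.part-*",
#     "ALIA-40b-distill-vapol-Q4_K_M.gguf",
# )
# assert sha == EXPECTED, "GGUF reassembly failed integrity check"
```

Sorting the glob matches the lexicographic order `cat part-*` relies on, so both paths produce an identical file.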
Then load the reassembled file:
```shell
llama-cli \
  -m ALIA-40b-distill-vapol-Q4_K_M.gguf \
  -c 4096 \
  -ngl 99 \
  --temp 0.2 \
  -p "<prompt>"
```
## What Was Improved
The work focused on competence and performance on practical assistant tasks rather than broad memorization. The main interventions were:
| Lever | What was applied | Research influence |
|---|---|---|
| Targeted QLoRA SFT | Efficient LoRA/QLoRA post-training on high-value assistant behaviors. | QLoRA and HF FSDP/QLoRA practice. |
| Hard-example active distillation | Data came from actual model failures: invalid JSON, missing tool fields, citation mistakes, weak multilingual responses, and incomplete code fixes. | DeepSeek-style staged post-training and rejection-sampling distillation. |
| DPO preference alignment | Chosen/rejected pairs contrasted corrected outputs against current-model failure patterns. | DPO, SimPO/ORPO-style preference optimization ideas. |
| Verifier-first gates | Deterministic validators controlled JSON validity, tool-call shape, citations, and task constraints before promotion. | RLVR/GRPO-style emphasis on verifiable rewards and automatic gates. |
| Tool/RAG task shaping | Training examples used realistic tool contracts, missing arguments, citation requirements, and multilingual source-grounded answers. | DeepSeek V4, Kimi agentic training reports, HF Cookbook, and Smol Training Playbook. |
These references informed design choices. This release does not claim to reproduce frontier-scale RL or agentic training.
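The verifier-first lever from the table above can be sketched as a promotion gate: a candidate training example enters the dataset only if every deterministic check passes. The check names and sample shape below are illustrative, not the repo's actual validators:

```python
import json

def check_valid_json(sample) -> bool:
    """Gate: the output must parse as JSON."""
    try:
        json.loads(sample["output"])
        return True
    except (json.JSONDecodeError, TypeError):
        return False

def check_cites_source(sample) -> bool:
    """RAG-style gate: the answer must mention at least one provided source id."""
    return any(src in sample["output"] for src in sample.get("source_ids", []))

GATES = [check_valid_json, check_cites_source]

def promote(sample) -> bool:
    """A candidate is promoted only if all deterministic gates pass."""
    return all(gate(sample) for gate in GATES)
```

Because the gates are deterministic, rejected candidates can be logged and fed back as hard examples, which is the loop the "hard-example active distillation" row describes.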
## Local Evaluation
The following local suites are deterministic assistant-task evaluations. They measure structured output, tool-call behavior, source-grounded answers, code fixes, and language constraints. They are not a substitute for a full academic benchmark campaign.
| Model / Artifact | Visible assistant eval | Hidden verifier-first suite | Hidden competence suite | Notes |
|---|---|---|---|---|
| BSC-LT/ALIA-40b base | not directly comparable | not applicable | not applicable | Raw completion model; not instruction aligned. |
| BSC-LT/ALIA-40b-instruct-2601 | 21/80 rows, 386/519 checks | baseline not included | baseline not included | Original instruction model under local validator style. |
| Distill Vapol adapter | 33/80 rows, 446/519 checks | 16/20 rows, 111/115 checks | 11/20 rows, 100/115 checks | Best model-only result. |
| Distill Vapol with deterministic runtime repair | 41/80 rows, 458/519 checks | 20/20 rows, 115/115 checks | 20/20 rows, 115/115 checks | Best practical deployment path when strict validators are available. |
| Distill Vapol Q4_K_M GGUF | portable artifact | integrity verified | integrity verified | Quantized release for LM Studio/llama.cpp; published as chunks and reassembled into one file locally. The PEFT adapter is the canonical highest-fidelity artifact. |
Relative local improvement over the original ALIA instruct model on the visible assistant eval:

- Row pass rate: 21/80 -> 33/80, a +57.1% relative increase.
- Check pass rate: 386/519 -> 446/519, a +15.5% relative increase.
- With deterministic runtime repair: 21/80 -> 41/80 rows, a +95.2% relative increase.
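The relative figures above follow directly from the raw pass counts; a minimal arithmetic check:

```python
def relative_increase(before: int, after: int) -> float:
    """Percent relative increase from `before` to `after`, rounded to one decimal."""
    return round((after - before) / before * 100, 1)

# Row pass rate: 21/80 -> 33/80
print(relative_increase(21, 33))    # 57.1
# Check pass rate: 386/519 -> 446/519
print(relative_increase(386, 446))  # 15.5
# With runtime repair, rows: 21/80 -> 41/80
print(relative_increase(21, 41))    # 95.2
```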
## Official Reference Scores
The official BSC model cards report broad benchmark numbers for the source models. These are reference points, not direct comparisons to the local task evals above.
Selected official BSC-LT/ALIA-40b-instruct-2601 reference scores:
| Area | Benchmark | Official score |
|---|---|---|
| English knowledge | MMLU | 0.45 |
| English reasoning | ARC Challenge | 0.40 |
| English reasoning | ARC Easy | 0.73 |
| English reading | Belebele English | 0.77 |
| English commonsense | HellaSwag acc | 0.54 |
| Spanish knowledge | MMMLU Spanish | 0.41 |
| Spanish reading | Belebele Spanish | 0.72 |
| Catalan reading | Belebele Catalan | 0.71 |
| Basque reading | Belebele Basque | 0.67 |
| Galician reading | Belebele Galician | 0.73 |
Expectations for academic benchmark movement should be conservative. The post-training targeted assistant reliability, output formats, tool/RAG behavior, and multilingual task compliance; it should not be expected to substantially shift broad pretrained-knowledge benchmarks such as MMLU.
## Notes
- The adapter is the highest-fidelity Hub artifact.
- The Q4_K_M GGUF is the recommended portable local artifact; the Hub copy is chunked for reliable transport and reconstructs to one GGUF file.
- The optional runtime repair helper is not embedded in the GGUF; it is a deployment-side deterministic layer for strict formal outputs.
- For practical GGUF inference, use LM Studio or a CUDA-enabled llama.cpp build with GPU offload.