# ALIA-40B Distill Vapol
`apol/alia-40b-distill-vapol` is a post-trained release derived from `BSC-LT/ALIA-40b-instruct-2601`, optimized for practical multilingual assistant behavior, structured output reliability, tool-call formatting, RAG-style answers, and coding/debugging tasks.

- Interactive demo Space: `apol/alia-40b-distill-vapol-demo`
- Detailed technical article (in Spanish): `BLOG.md`
## Deliverables

This repo contains:

| Artifact | Location | Use |
|---|---|---|
| Q4_K_M GGUF | `gguf_chunks/ALIA-40b-distill-vapol-Q4_K_M.gguf.part-*` | Transport chunks for reconstructing the single-file llama.cpp / LM Studio deployment. |
| PEFT adapter | `adapter/` | Highest-fidelity Hub artifact; load on top of `BSC-LT/ALIA-40b-instruct-2601` for adapter-based inference or further research. |
| Runtime helper | `runtime/repair_eval_responses.py` | Optional deterministic repair layer for strict JSON/tool/RAG/code contracts. |
| Evaluation reports | `reports/` | Local task metrics, hidden-suite validation outputs, and distillation summaries. |
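The runtime repair helper ships as `runtime/repair_eval_responses.py`; its exact behavior is repo-specific, but a minimal sketch of what such a deterministic repair layer can do (hypothetical helper, not the shipped code) looks like this:

```python
import json
import re

def repair_json_output(raw: str):
    """Deterministically coerce a model response toward valid JSON.

    Tries, in order: direct parse, stripping Markdown code fences,
    and extracting the outermost {...} span. Returns None if all fail.
    """
    candidates = [raw.strip()]
    # Strip ```json ... ``` fences if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1).strip())
    # Fall back to the outermost brace-delimited span.
    brace = re.search(r"\{.*\}", raw, re.DOTALL)
    if brace:
        candidates.append(brace.group(0))
    for cand in candidates:
        try:
            return json.loads(cand)
        except json.JSONDecodeError:
            continue
    return None
```

Because every step is rule-based, the layer is fully deterministic: the same raw response always repairs to the same result, which keeps strict validators reproducible.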
## Intended Use
The model is intended for general assistant use, with emphasis on:
- Spanish assistant tasks.
- Catalan, Basque, and Galician instruction following.
- Structured JSON output.
- Tool-call formatting and missing-argument clarification.
- Administrative and legal-style summarization.
- Coding/debugging assistance.
- Source-grounded long-context and RAG-style synthesis.
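To illustrate the tool-call behaviors above, a caller can check a model-emitted tool call for missing required arguments before execution and, when something is absent, route back to the model for a clarification turn. The contract and helper below are hypothetical; the model card does not prescribe a specific tool-call schema:

```python
import json

# Hypothetical tool contract: tool name mapped to its required argument keys.
TOOL_CONTRACTS = {
    "buscar_expediente": {"required": ["numero_expediente", "organismo"]},
}

def missing_arguments(tool_call_json: str):
    """Return the required arguments absent from a model-emitted tool call."""
    call = json.loads(tool_call_json)
    contract = TOOL_CONTRACTS[call["name"]]
    provided = call.get("arguments", {})
    return [arg for arg in contract["required"] if arg not in provided]
```

If the returned list is non-empty, the deployment asks the user (or the model) for the missing fields instead of executing an underspecified call.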
## Loading

### PEFT Adapter
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "BSC-LT/ALIA-40b-instruct-2601"
repo = "apol/alia-40b-distill-vapol"

# The tokenizer ships alongside the adapter in the `adapter/` subfolder.
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="adapter")
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto", torch_dtype="auto")
model = PeftModel.from_pretrained(model, repo, subfolder="adapter")
```
### llama.cpp / LM Studio
The Q4_K_M GGUF is published as transport chunks for reliable Hub distribution. Reassemble it locally before loading:
```shell
cat gguf_chunks/ALIA-40b-distill-vapol-Q4_K_M.gguf.part-* > ALIA-40b-distill-vapol-Q4_K_M.gguf
sha256sum ALIA-40b-distill-vapol-Q4_K_M.gguf
```
Expected SHA256:

```
45f75478c721cf26617dc10f89bbfc663f5946a3779ddd19982bb7787790d285
```
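Equivalently, the chunks can be concatenated and verified in one pass from Python (the expected digest is the SHA256 stated above):

```python
import hashlib
from pathlib import Path

def reassemble_and_hash(chunk_glob: str, out_path: str) -> str:
    """Concatenate sorted chunk files into one GGUF and return its SHA256."""
    digest = hashlib.sha256()
    with open(out_path, "wb") as out:
        for part in sorted(Path(".").glob(chunk_glob)):
            data = part.read_bytes()
            out.write(data)
            digest.update(data)
    return digest.hexdigest()

EXPECTED = "45f75478c721cf26617dc10f89bbfc663f5946a3779ddd19982bb7787790d285"
# sha = reassemble_and_hash(
#     "gguf_chunks/ALIA-40b-distill-vapol-Q4_K_M.gguf.part-*",
#     "ALIA-40b-distill-vapol-Q4_K_M.gguf",
# )
# assert sha == EXPECTED, "GGUF reassembly failed integrity check"
```

Sorting the glob matches the lexicographic order `cat part-*` relies on, so both paths produce an identical file.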
Then load the reassembled file:
```shell
llama-cli \
  -m ALIA-40b-distill-vapol-Q4_K_M.gguf \
  -c 4096 \
  -ngl 99 \
  --temp 0.2 \
  -p "<prompt>"
```
## What Was Improved
The work focused on competence and performance on practical assistant tasks rather than broad memorization. The main interventions were:
| Lever | What was applied | Research influence |
|---|---|---|
| Targeted QLoRA SFT | Efficient LoRA/QLoRA post-training on high-value assistant behaviors. | QLoRA and HF FSDP/QLoRA practice. |
| Hard-example active distillation | Data came from actual model failures: invalid JSON, missing tool fields, citation mistakes, weak multilingual responses, and incomplete code fixes. | DeepSeek-style staged post-training and rejection-sampling distillation. |
| DPO preference alignment | Chosen/rejected pairs contrasted corrected outputs against current-model failure patterns. | DPO, SimPO/ORPO-style preference optimization ideas. |
| Verifier-first gates | Deterministic validators controlled JSON validity, tool-call shape, citations, and task constraints before promotion. | RLVR/GRPO-style emphasis on verifiable rewards and automatic gates. |
| Tool/RAG task shaping | Training examples used realistic tool contracts, missing arguments, citation requirements, and multilingual source-grounded answers. | DeepSeek V4, Kimi agentic training reports, HF Cookbook, and Smol Training Playbook. |
These references informed design choices. This release does not claim to reproduce frontier-scale RL or agentic training.
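The verifier-first lever from the table above can be sketched as a promotion gate: a candidate training example enters the dataset only if every deterministic check passes. The check names and sample shape below are illustrative, not the repo's actual validators:

```python
import json

def check_valid_json(sample) -> bool:
    """Gate: the output must parse as JSON."""
    try:
        json.loads(sample["output"])
        return True
    except (json.JSONDecodeError, TypeError):
        return False

def check_cites_source(sample) -> bool:
    """RAG-style gate: the answer must mention at least one provided source id."""
    return any(src in sample["output"] for src in sample.get("source_ids", []))

GATES = [check_valid_json, check_cites_source]

def promote(sample) -> bool:
    """A candidate is promoted only if all deterministic gates pass."""
    return all(gate(sample) for gate in GATES)
```

Because the gates are deterministic, rejected candidates can be logged and fed back as hard examples, which is the loop the "hard-example active distillation" row describes.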
## Local Evaluation
The following local suites are deterministic assistant-task evaluations. They measure structured output, tool-call behavior, source-grounded answers, code fixes, and language constraints. They are not a substitute for a full academic benchmark campaign.
| Model / Artifact | Visible assistant eval | Hidden verifier-first suite | Hidden competence suite | Notes |
|---|---|---|---|---|
| BSC-LT/ALIA-40b base | not directly comparable | not applicable | not applicable | Raw completion model; not instruction aligned. |
| BSC-LT/ALIA-40b-instruct-2601 | 21/80 rows, 386/519 checks | baseline not included | baseline not included | Original instruction model under local validator style. |
| Distill Vapol adapter | 33/80 rows, 446/519 checks | 16/20 rows, 111/115 checks | 11/20 rows, 100/115 checks | Best model-only result. |
| Distill Vapol with deterministic runtime repair | 41/80 rows, 458/519 checks | 20/20 rows, 115/115 checks | 20/20 rows, 115/115 checks | Best practical deployment path when strict validators are available. |
| Distill Vapol Q4_K_M GGUF | portable artifact | integrity verified | integrity verified | Quantized release for LM Studio/llama.cpp; published as chunks and reassembled into one file locally. The PEFT adapter is the canonical highest-fidelity artifact. |
Relative local improvement over the original ALIA instruct model on the visible assistant eval:

- Row pass rate: 21/80 -> 33/80, a +57.1% relative increase.
- Check pass rate: 386/519 -> 446/519, a +15.5% relative increase.
- With deterministic runtime repair: 21/80 -> 41/80 rows, a +95.2% relative increase.
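The relative figures above follow directly from the raw pass counts; a minimal arithmetic check:

```python
def relative_increase(before: int, after: int) -> float:
    """Percent relative increase from `before` to `after`, rounded to one decimal."""
    return round((after - before) / before * 100, 1)

# Row pass rate: 21/80 -> 33/80
print(relative_increase(21, 33))    # 57.1
# Check pass rate: 386/519 -> 446/519
print(relative_increase(386, 446))  # 15.5
# With runtime repair, rows: 21/80 -> 41/80
print(relative_increase(21, 41))    # 95.2
```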
## Official Reference Scores
The official BSC model cards report broad benchmark numbers for the source models. These are reference points, not direct comparisons to the local task evals above.
Selected official BSC-LT/ALIA-40b-instruct-2601 reference scores:
| Area | Benchmark | Official score |
|---|---|---|
| English knowledge | MMLU | 0.45 |
| English reasoning | ARC Challenge | 0.40 |
| English reasoning | ARC Easy | 0.73 |
| English reading | Belebele English | 0.77 |
| English commonsense | HellaSwag acc | 0.54 |
| Spanish knowledge | MMMLU Spanish | 0.41 |
| Spanish reading | Belebele Spanish | 0.72 |
| Catalan reading | Belebele Catalan | 0.71 |
| Basque reading | Belebele Basque | 0.67 |
| Galician reading | Belebele Galician | 0.73 |
Expectations for academic benchmark movement should be conservative. The post-training targeted assistant reliability, output formats, tool/RAG behavior, and multilingual task compliance; it should not be expected to substantially shift broad pretrained-knowledge benchmarks such as MMLU.
## Notes
- The adapter is the highest-fidelity Hub artifact.
- The Q4_K_M GGUF is the recommended portable local artifact; the Hub copy is chunked for reliable transport and reconstructs to one GGUF file.
- The optional runtime repair helper is not embedded in the GGUF; it is a deployment-side deterministic layer for strict formal outputs.
- For practical GGUF inference, use LM Studio or a CUDA-enabled llama.cpp build with GPU offload.