Hands-On Exercises: Fine-Tuning SmolVLM2-2.2B-Instruct
Welcome to the practical section! Here you’ll put into practice everything you’ve learned about vision language models (VLMs) using HuggingFaceTB/SmolVLM2-2.2B-Instruct.
The exercises progress from foundational concepts to advanced techniques, helping you gain real-world, hands-on experience.
Learning Objectives
By the end of these exercises, you will be able to:
- Work with VLM datasets: Explore and prepare HuggingFaceM4/ChartQA.
- Optimize training: Apply quantization and PEFT for efficient fine-tuning.
- Fine-tune models in practice: Train HuggingFaceTB/SmolVLM2-2.2B-Instruct using both Python APIs and CLI tools.
- Adapt datasets for TRL: Prepare VLM datasets to integrate seamlessly with TRL workflows.
- Move to production: Understand how to scale and manage production-ready fine-tuning workflows for VLMs.
Exercise 1: Explore SmolVLM2-2.2B-Instruct
Objective: Get familiar with the SmolVLM2-2.2B-Instruct model and evaluate the model using a sample from the dataset.
Environment Setup
- You need a GPU with at least 8GB VRAM for training. CPU/MPS can run formatting and dataset exploration, but training larger models will likely fail.
- First run will download several GB of model weights; ensure 15GB+ free disk and a stable connection.
- If you need access to private repos, authenticate with Hugging Face Hub via login().
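As a rough sanity check on the hardware notes above, the sketch below estimates how much memory the model weights alone occupy at different precisions. The parameter count of 2.2 billion is an assumption read off the model name, and `weight_memory_gib` is a hypothetical helper; activations, gradients, and optimizer state come on top of this figure.

```python
# Rough estimate of weight memory only; real training usage is higher.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the memory needed to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


NUM_PARAMS = 2.2e9  # assumption: taken from the "2.2B" in the model name

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

At bfloat16 the weights come to roughly 4 GiB, which is consistent with the 8GB VRAM guidance once training overhead is added.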
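First, install the required libraries: transformers, datasets, trl, huggingface_hub, and trackio. The command below also pins num2words. Later steps import peft and matplotlib as well, so install them too if your environment does not already provide them.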
These packages provide the tools for working with the model, datasets, and Hugging Face Hub.
# Install required packages (run in Colab or your environment)
pip install transformers datasets trl huggingface_hub trackio num2words==0.5.14
Import dependencies
Now, import the main dependencies we’ll use:
# Import dependencies
import torch
import os
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
from transformers.image_utils import load_image
Load the model and processor
1. Select the device
We start by selecting the device where the model will run. It can be a GPU (cuda), Apple Silicon (mps), or the CPU as a fallback.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
2. Authenticate with Hugging Face
To work with private models or to push your fine-tuned model to the Hub (as we’ll do in this exercise), you need to authenticate with your Hugging Face account.
You can create and copy your access token from the Hugging Face tokens page in your profile.
from huggingface_hub import login
login()
3. Load the model and processor
Finally, we load the HuggingFaceTB/SmolVLM2-2.2B-Instruct model.
The AutoProcessor is also initialized here — it ensures that both text and images are preprocessed correctly before being passed to the model.
model_name = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
).to(device)
processor = AutoProcessor.from_pretrained(model_name)
Explore the dataset
In this step, we load a small subset of the ChartQA dataset — just 10% of the training and validation splits — to keep the exercises fast and manageable for learning purposes.
We then display one of the chart images using matplotlib to get a visual sense of the model’s input.
Additionally, we print the corresponding query and label so you can fully understand the dataset structure and the type of tasks the model will handle.
from datasets import load_dataset
import matplotlib.pyplot as plt
train_dataset, eval_dataset = load_dataset("HuggingFaceM4/ChartQA", split=["train[:10%]", "val[:10%]"])
example = train_dataset[1]
image = load_image(example["image"])
print(example["query"])
print(example["label"][0])
Output
How many values are below 40 in Unfavorable graph?
6
plt.imshow(image)
plt.axis("off")
plt.title("Sample Chart Image")
plt.show()
Build a chat-style prompt
We create a chat message list that includes a user query along with the image.
Using processor.apply_chat_template, we transform this into the exact input format the model expects.
# Define a chat-style prompt
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": example["query"]},
    ]}
]
# Apply the chat template
chat_prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True
)
print(chat_prompt)
Output
<|im_start|>User:<image>How many values are below 40 in Unfavorable graph?<end_of_utterance> Assistant:
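To make the structure of that prompt easier to see, here is a minimal sketch that rebuilds the same string by hand. It is a simplified stand-in, not the template shipped with the processor: `render_prompt` is a hypothetical helper, and the newline after `<end_of_utterance>` is an assumption about the separator between turns.

```python
def render_prompt(messages, add_generation_prompt=True):
    """Simplified stand-in for the chat template, not the real one."""
    parts = ["<|im_start|>"]
    for message in messages:
        parts.append(message["role"].capitalize() + ":")
        for item in message["content"]:
            if item["type"] == "image":
                # Placeholder only; the pixels are passed to the processor separately.
                parts.append("<image>")
            elif item["type"] == "text":
                parts.append(item["text"])
        parts.append("<end_of_utterance>\n")  # assumption: turns end with a newline
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("Assistant:")
    return "".join(parts)


messages = [
    {"role": "user", "content": [
        {"type": "image", "image": None},  # the image object is not needed for the text
        {"type": "text", "text": "How many values are below 40 in Unfavorable graph?"},
    ]}
]
print(render_prompt(messages))
```

The point of `add_generation_prompt=True` is visible in the last step: the prompt ends with an open `Assistant:` turn, so generation starts with the answer rather than with another user turn.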
Run inference
We tokenize the chat prompt and image into tensors, then generate a response with the model. Finally, we decode the output tokens back into text.
# Tokenize input
inputs = processor(images=[image], text=chat_prompt, return_tensors="pt").to(device)
# Generate model output
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)
# Trim the generated ids to remove the input ids
trimmed_generated_ids = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, output)]
# Decode the output text
output_text = processor.batch_decode(
    trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
Output
3.
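The trimming step is worth a closer look: for this kind of model, `generate` returns the prompt ids followed by the newly generated ids, so decoding the raw output would repeat the prompt. The sketch below applies the same slicing to plain Python lists; the token ids are made up for illustration.

```python
# Hypothetical token ids standing in for inputs.input_ids and the generate() output.
input_ids = [[101, 7, 8, 9], [101, 4, 5, 6]]
generated = [[101, 7, 8, 9, 42, 43], [101, 4, 5, 6, 77, 2]]

# Same slicing as in the exercise: drop the first len(prompt) ids of each sequence.
new_tokens = [out[len(inp):] for inp, out in zip(input_ids, generated)]
print(new_tokens)  # [[42, 43], [77, 2]]
```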
The model generates a response, but it is wrong: it answers 3 where the label is 6. Fine-tuning can improve this. Now that we’ve seen how to build prompts and generate responses with SmolVLM2-2.2B-Instruct, it’s time to adapt the model efficiently using LoRA (Low-Rank Adaptation). This approach trains large models with fewer resources and prepares the model for specific downstream tasks.
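One way to make the comparison between a prediction and its label precise is a small scoring helper. The sketch below is an illustration only, not an official ChartQA metric: `normalize` and `is_correct` are hypothetical names, and the optional `tolerance` parameter for numeric answers is an assumption added for convenience.

```python
def normalize(text: str) -> str:
    """Lowercase, trim whitespace, and drop a trailing period."""
    return text.strip().lower().rstrip(".").strip()


def is_correct(prediction: str, label: str, tolerance: float = 0.0) -> bool:
    """Exact match after normalization, with optional relative tolerance for numbers."""
    pred, gold = normalize(prediction), normalize(label)
    try:
        pred_num, gold_num = float(pred), float(gold)
    except ValueError:
        # At least one side is not a number: fall back to string comparison.
        return pred == gold
    if gold_num == 0:
        return pred_num == 0
    return abs(pred_num - gold_num) / abs(gold_num) <= tolerance


print(is_correct("3.", "6"))  # False
print(is_correct("6.", "6"))  # True
```

Running a helper like this over the evaluation split before and after fine-tuning gives a simple accuracy figure to compare.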
Exercise 2: Fine-Tune the Model Using LoRA
In this exercise, we’ll apply LoRA (Low-Rank Adaptation) to fine-tune our Vision-Language Model efficiently.
LoRA works by injecting trainable low-rank matrices into existing model layers, enabling large models to be fine-tuned with significantly fewer trainable parameters.
This approach reduces memory usage and speeds up training while maintaining high performance.
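To see where the savings come from, the sketch below counts the parameters a single LoRA adapter adds to one weight matrix. The layer size of 2048 is an illustrative assumption, not a value read from the model; `r` and `lora_alpha` match the values used later in this exercise, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


d_in = d_out = 2048  # assumption: illustrative layer size, not read from the model
r, lora_alpha = 8, 16  # the values used in the LoraConfig later in this exercise

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full weight: {full:,} params")
print(f"LoRA adapter: {added:,} params ({added / full:.2%} of the full weight)")
print(f"scaling factor lora_alpha / r = {lora_alpha / r}")
```

Under these assumptions the adapter trains under 1% of the parameters of the layer it wraps, while the original weight stays frozen.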
system_message = """You are a Vision Language Model specialized in interpreting visual data from chart images.
Your task is to analyze the provided chart image and respond to queries with concise answers, usually a single word, number, or short phrase.
The charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.
Focus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary."""
We’ll format the dataset into a chatbot-style structure, where each example includes:
- A system message defining the assistant’s role
- The chart image
- The user query
- The expected answer
This is the format expected by the SFTTrainer, including the images and messages columns.
You can learn more about preparing datasets for VLM post-training in the documentation.
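Before training, it can help to verify that each example has the shape described above. The checker below is a minimal sketch based only on that description: `check_vlm_sample` is a hypothetical helper and the sample is a hand-written placeholder, not a real dataset row.

```python
def check_vlm_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one formatted sample (empty if none)."""
    problems = []
    if not isinstance(sample.get("images"), list) or not sample["images"]:
        problems.append("'images' must be a non-empty list")
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("'messages' must be a non-empty list")
        return problems
    for i, message in enumerate(messages):
        if message.get("role") not in {"system", "user", "assistant"}:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), list):
            problems.append(f"message {i}: 'content' must be a list of typed parts")
    if messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant answer")
    return problems


sample = {
    "images": ["<PIL image placeholder>"],
    "messages": [
        {"role": "system", "content": [{"type": "text", "text": "You are a chart expert."}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "How many bars?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "4"}]},
    ],
}
print(check_vlm_sample(sample))  # []
```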
Format the Dataset
The first step is to structure the dataset for VLM training.
The system message defined above instructs the model to act as a chart analysis expert, providing concise, accurate answers about chart images. The format_data function attaches it to every example.
def format_data(sample):
    return {
        "images": [sample["image"]],
        "messages": [
            {
                "role": "system",
                "content": [{"type": "text", "text": system_message}],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": sample["image"],
                    },
                    {
                        "type": "text",
                        "text": sample["query"],
                    },
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["label"][0]}],
            },
        ],
    }
Now, let’s format the data using the chatbot structure. This will set up the interactions for the model.
train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset = [format_data(sample) for sample in eval_dataset]
Configure LoRA
Here we define a LoraConfig:
- r and lora_alpha control the rank and scaling of the adaptation matrices.
- target_modules specifies which parts of the model to adapt.
- task_type is set for causal language modeling.
We then apply LoRA to the base model using get_peft_model and print out the trainable parameters to verify the adaptation.
from peft import LoraConfig, get_peft_model
# Configure LoRA
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
# Apply PEFT model adaptation
peft_model = get_peft_model(model, peft_config)
# Print trainable parameters
peft_model.print_trainable_parameters()
Set up the Trainer
We configure the SFTTrainer from trl with SFTConfig:
- num_train_epochs, per_device_train_batch_size, and gradient_accumulation_steps control the training loop.
- gradient_checkpointing and bf16 optimize memory and speed.
- learning_rate manages optimization.
- train_dataset and eval_dataset are aligned with your dataset.
This prepares the trainer to handle fine-tuning with PEFT/LoRA.
from trl import SFTConfig, SFTTrainer
# Configure training arguments using SFTConfig
training_args = SFTConfig(
    output_dir="smol-course-smolvlm2-2.2b-instruct-trl-sft-ChartQA",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    logging_steps=25,
    save_strategy="steps",
    save_steps=25,
    optim="adamw_torch_fused",
    bf16=True,
    push_to_hub=True,
    report_to="trackio",
    max_length=None,
)
# Initialize the Trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
)
# Align the SFTTrainer params with your chosen dataset.
Train and Save the Model
Now we run the training loop:
- trainer.train() starts fine-tuning with LoRA.
- trainer.save_model() saves the trained model locally.
This step ensures the model is ready for downstream tasks with minimal additional parameters.
# Train the model
trainer.train()
# Save the model
trainer.save_model(training_args.output_dir)
With the foundations of Python-based fine-tuning and LoRA in place, we can now move this workflow to a production environment using the TRL CLI. This approach lets you automate fine-tuning and create reproducible pipelines without writing full Python scripts.
Exercise 3: Production Workflow with TRL CLI
In the previous exercises, we focused on using the Python API to fine-tune SmolVLM2-2.2B-Instruct, exploring dataset preparation and generating chat-style prompts.
In this exercise, we’ll demonstrate how to perform fine-tuning using the TRL CLI, a common workflow in production environments. The CLI allows you to run experiments and manage training without writing Python scripts. If you want a refresher, we introduced this tool earlier in the course, and the same concepts and troubleshooting tips apply.
The TRL CLI leverages the same logic and configuration options as the Python API but presents them through a simple command-line interface. This means you can define everything—from the model and dataset to training hyperparameters and output location—in a single command.
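Because the whole run is described by flags, one convenient pattern is to keep the settings in a dictionary and render the command from it, so the same configuration can be logged or reused. The sketch below is one possible way to do that with the standard library; `build_trl_sft_command` is a hypothetical helper, and the flags shown are the ones that appear in this exercise.

```python
import shlex


def build_trl_sft_command(options: dict) -> str:
    """Render a dict of options as a `trl sft` command line."""
    args = ["trl", "sft"]
    for name, value in options.items():
        if value is True:
            # Boolean switch: pass the flag with no value.
            args.append(f"--{name}")
        elif value is not False and value is not None:
            args.extend([f"--{name}", str(value)])
    return shlex.join(args)


options = {
    "model_name_or_path": "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    "dataset_name": "trl-lib/llava-instruct-mix",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "push_to_hub": True,
}
print(build_trl_sft_command(options))
```

Storing the dictionary alongside the run output makes it straightforward to reproduce the exact command later.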
The example below shows how to fine-tune SmolVLM2-2.2B-Instruct on the trl-lib/llava-instruct-mix dataset, using mixed precision for faster training and optional push-to-Hub for sharing your model. We switch datasets here because trl-lib/llava-instruct-mix already comes formatted in the expected VLM structure, as discussed earlier. To train with LoRA rather than update all weights, the TRL CLI also needs the --use_peft flag, which the command below does not include; add it if you want parameter-efficient fine-tuning.
- --model_name_or_path specifies the base model to fine-tune.
- --dataset_name and --dataset_config define the dataset and subset.
- --output_dir sets the local directory for saving the fine-tuned model.
- --per_device_train_batch_size and --gradient_accumulation_steps control effective batch size and memory usage.
- --learning_rate, --num_train_epochs, and --max_length define the core training hyperparameters.
- --bf16 enables mixed precision for faster and more memory-efficient training on compatible GPUs.
- --push_to_hub and --hub_model_id allow automatic upload of the trained model to your Hugging Face Hub repository.
Using the TRL CLI is functionally equivalent to writing a full Python training script, but it’s faster to configure, easier to reproduce, and ideal for production pipelines or automated training workflows.
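The interaction between the two batch flags can be checked with simple arithmetic. The sketch below compares the settings from Exercise 2 with the ones in the CLI command that follows; the helper names and the dataset size of 1,000 are hypothetical, and the step count is an approximation, since the trainer's exact rounding may differ.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples contributing to each optimizer update."""
    return per_device * grad_accum * num_devices


def approx_steps_per_epoch(num_samples: int, batch: int) -> int:
    """Approximate number of optimizer updates per epoch."""
    return math.ceil(num_samples / batch)


python_api = effective_batch_size(per_device=4, grad_accum=4)  # Exercise 2 settings
cli = effective_batch_size(per_device=1, grad_accum=16)        # CLI command settings
print(python_api, cli)  # 16 16

num_samples = 1_000  # hypothetical dataset size, for illustration only
print(approx_steps_per_epoch(num_samples, cli))  # 63
```

Both configurations update the weights on 16 samples at a time; the CLI settings simply hold fewer samples in memory per forward pass and accumulate over more steps.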
trl sft \
--model_name_or_path HuggingFaceTB/SmolVLM2-2.2B-Instruct \
--dataset_name trl-lib/llava-instruct-mix \
--output_dir ./smolvlm-instruct-sft-cli \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--learning_rate 2e-4 \
--num_train_epochs 3 \
--max_length -1 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 50 \
--bf16 True \
--push_to_hub \
--hub_model_id your-username/smolvlm2-2.2b-instruct-sft-cli
Exercise 4: Training with Hugging Face Jobs
In Unit 1, we introduced Hugging Face Jobs (HF Jobs) and demonstrated how to fine-tune a model using this managed cloud service.
HF Jobs provides a fully managed infrastructure for training models, eliminating the need to set up GPUs, manage dependencies, or configure environments locally. This is especially useful for SFT training, which can be both resource-intensive and time-consuming.
Following the same approach, we can use HF Jobs to fine-tune our Vision-Language Model (VLM).
If needed, refer back to Unit 1 to refresh your understanding of HF Jobs and their workflow.
Here’s an example of how to launch a training job using TRL’s maintained SFT script:
# Use TRL's maintained SFT script directly
hf jobs uv run \
--flavor a10g-large \
--timeout 2h \
--secrets HF_TOKEN \
--with num2words==0.5.14 \
"https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" \
--model_name_or_path HuggingFaceTB/SmolVLM2-2.2B-Instruct \
--dataset_name trl-lib/llava-instruct-mix \
--learning_rate 5e-5 \
--per_device_train_batch_size 4 \
--max_length -1 \
--max_steps 1000 \
--output_dir smolvlm2-2.2b-instruct-sft-jobs \
--push_to_hub \
--hub_model_id your-username/smolvlm2-2.2b-instruct-sft-jobs \
--report_to trackio
After launching the job, HF Jobs will handle the entire training process in the cloud. You can monitor progress, view logs, and track metrics directly from the Hugging Face Hub.
Once the job completes:
- The fine-tuned model will be available in the output_dir you specified.
- If --push_to_hub was used, the model will also be accessible from your Hugging Face account, ready for inference or further fine-tuning.
- You can resume, replicate, or scale training easily by re-running or modifying the job configuration.
This workflow removes the overhead of managing local resources, allowing you to focus on model experimentation and evaluation.
Test your knowledge
You’ve completed the unit — great work! Now put your learning to the test by taking the quiz.
Resources for Further Learning
Here are some helpful resources to deepen your understanding and continue experimenting with vision language models and TRL workflows:
- TRL Documentation – Complete reference for using TRL, including Python API and CLI.
- HuggingFaceTB/SmolVLM2-2.2B-Instruct Model Card – Detailed information about the model architecture, training, and usage.
- HuggingFaceM4/ChartQA Dataset – Dataset used for training and fine-tuning VLMs.
- Hugging Face Hub – Platform to share your fine-tuned models and discover community models.
- Hugging Face Discord Community – Join the community for discussions, support, and troubleshooting.