Using Pretrained VLMs

Visual Language Models (VLMs) process images and text simultaneously, enabling advanced tasks like generating captions, answering visual questions, or reasoning across modalities. In this section, we focus on how VLMs work and how to use them practically.

Architecture Overview

VLMs combine image-processing and text-generation components for a unified multimodal understanding. The main elements are:

Image/Vision Encoder: Converts images into compact numerical representations. Examples: CLIP, SigLIP.
Embedding Projector: Aligns image features with text embeddings (often a small MLP or linear layer fine-tuned for the multimodal task).
Multimodal Projector / Fusion Module: Fuses the visual and textual representations. This step goes beyond alignment, enabling rich cross-modal interaction.
Text Decoder: Generates text (or other outputs) from the fused multimodal representations.

Most VLMs leverage pretrained image encoders and text decoders, then fine-tune on paired image-text datasets for efficient training and generalization.
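
To make the data flow through these components concrete, the sketch below tracks only tensor shapes, not real weights. The sizes (a 384x384 image, 14x14 patches, hidden sizes of 1152 and 2048, a 12-token prompt) are illustrative assumptions rather than the configuration of any particular model, and `visual_token_count` and `fused_sequence_shape` are hypothetical helpers.

```python
def visual_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT-style encoder produces for a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def fused_sequence_shape(num_visual_tokens: int, num_text_tokens: int, text_dim: int):
    """Shape of the sequence the text decoder sees once the projected image
    tokens are concatenated with the text token embeddings."""
    return (num_visual_tokens + num_text_tokens, text_dim)


# Assumed, illustrative sizes (not taken from a specific model)
IMAGE_SIZE, PATCH_SIZE = 384, 14
VISION_DIM, TEXT_DIM = 1152, 2048
NUM_TEXT_TOKENS = 12

n_visual = visual_token_count(IMAGE_SIZE, PATCH_SIZE)
encoder_output_shape = (n_visual, VISION_DIM)  # vision encoder output
projected_shape = (n_visual, TEXT_DIM)         # after the projector
decoder_input_shape = fused_sequence_shape(n_visual, NUM_TEXT_TOKENS, TEXT_DIM)

print("encoder output:", encoder_output_shape)
print("after projector:", projected_shape)
print("decoder input:", decoder_input_shape)
```

The point of the projector is visible in the second shape: it changes only the feature dimension, so image tokens can sit in the same sequence as text embeddings.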

Practical Usage

VLMs can be applied to tasks such as:

Image Captioning: generating descriptions for images
Visual Question Answering (VQA): answering questions about an image
Cross-Modal Retrieval: matching images with text and vice versa
Creative Applications: design, art generation, multimedia content

High-quality paired image-text datasets are key to good results on these tasks, and 🤗 Transformers provides pretrained models and streamlined fine-tuning workflows.

Chat Format

Many VLMs support chat-like interactions, with messages structured as:

System message: sets the context, for example "You are an assistant analyzing visual data."
User queries: combine text and images.
Assistant responses: generated text based on multimodal analysis.

Example:

[
  {
    "role": "system",
    "content": [{"type": "text", "text": "You are a VLM specialized in charts."}]
  },
  {
    "role": "user",
    "content": [
      {"type": "image", "image": "<image_data>"},
      {"type": "text", "text": "What is the highest value in this chart?"}
    ]
  },
  {
    "role": "assistant",
    "content": [{"type": "text", "text": "42"}]
  }
]

VLMs can also handle multiple images or video frames as input by passing sequences of images through the same chat template.
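
As a minimal sketch of that idea, the hypothetical helper `build_user_turn` below assembles one user message containing several images followed by a text prompt, using the same message structure as the example above. The frame paths are placeholders, and whether a given model accepts multiple images in one turn depends on its chat template.

```python
def build_user_turn(image_sources, question):
    """Build one user message: every image first, then the text prompt."""
    content = [{"type": "image", "image": source} for source in image_sources]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}


# Placeholder frame paths (hypothetical); URLs or loaded images could be used instead
frames = ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"]

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a VLM specialized in charts."}],
    },
    build_user_turn(frames, "What changes between these frames?"),
]

for message in messages:
    kinds = [item["type"] for item in message["content"]]
    print(message["role"], kinds)
```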

Using a VLM via pipeline

As we saw in Unit 1, the easiest way to use a VLM is through the 🤗 pipeline abstraction:

from transformers import pipeline

# Initialize the pipeline with a VLM
pipe = pipeline("image-text-to-text", "HuggingFaceTB/SmolVLM2-2.2B-Instruct", device_map="auto")

# Define the conversation, including an image
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Generate a response; the pipeline handles multimodal inputs automatically.
# return_full_text=False returns only the newly generated text, not the prompt.
outputs = pipe(text=messages, max_new_tokens=128, return_full_text=False)

print(outputs[0]["generated_text"])  # Print the model's description

Output

The image depicts a close-up view of a flower garden, specifically focusing on a pink flower. The flower is the central subject of the image, and it is a prominent feature due to its vibrant color and intricate details. The flower has a circular shape, with petals that are slightly curled and have a gradient from light to dark pink. The petals are arranged symmetrically around the central pistil, which is visible in the center of the flower. The pistil is a small, yellow structure that is surrounded by a cluster of stamens, which are visible as small, yellow structures. The flower also has a small, black

Using a VLM via Transformers (Full Control)

For advanced use, you can access a VLM directly via 🤗 Transformers, giving you full control over each component.
To reduce memory usage and speed up inference, we can apply 4-bit quantization using bitsandbytes.

Unlike standard LLM usage, VLMs require a processor instead of just a tokenizer. The processor handles both text tokenization and image preprocessing, streamlining the workflow for multimodal inputs.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
from transformers.image_utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"

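# 4-bit quantization for efficiency (bitsandbytes typically needs a CUDA GPU)
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model_name = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
# Place the quantized weights at load time with device_map rather than
# calling .to(device) afterwards, which quantized models may reject
model = AutoModelForImageTextToText.from_pretrained(
    model_name, quantization_config=quant_config, device_map=device
)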
processor = AutoProcessor.from_pretrained(model_name)
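
For a rough sense of what 4-bit quantization saves, the back-of-the-envelope sketch below estimates weight memory only. It assumes about 2.2 billion parameters (read off the model name) and that every parameter is quantized; in practice some modules stay in higher precision and activations add to the total, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 2.2e9  # assumption: taken from the "2.2B" in the model name

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```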

Example: Describe an Image

We can use the chat template to describe images. Each image is represented as {"type": "image"} in the message, and the actual image data is passed to the processor via the images argument. The processor handles both text and visual inputs seamlessly.

# Load image
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = load_image(image_url)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(device)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)

# Keep only the newly generated tokens so the prompt is not decoded again
new_token_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
assistant_response = processor.batch_decode(
    new_token_ids,
    skip_special_tokens=True,
)[0].strip()

print(assistant_response)

Output

The image is of a bee on a flower.

The processor combines the text and image inputs into a single batch of tensors, so the model can ground its generated text in the image.

Similar templates can handle multiple images, OCR tasks, or even video frames, making VLMs highly versatile.
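
When extending the processor-based example to several images, each {"type": "image"} placeholder needs a matching entry in the images list. The sketch below shows one way to check this before calling the processor; `count_image_placeholders` is a hypothetical helper, and the strings in `images` stand in for images loaded with `load_image`.

```python
def count_image_placeholders(messages):
    """Count {"type": "image"} entries across all messages."""
    total = 0
    for message in messages:
        content = message["content"]
        if isinstance(content, list):  # plain-string content holds no images
            total += sum(1 for item in content if item.get("type") == "image")
    return total


messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What differs between these two images?"},
        ],
    },
]

# Stand-ins for real images (assumption: in practice, objects from load_image)
images = ["first_image", "second_image"]

num_placeholders = count_image_placeholders(messages)
if num_placeholders != len(images):
    raise ValueError(
        f"{num_placeholders} image placeholders but {len(images)} images"
    )

# With a real processor, the next steps would mirror the single-image example:
# prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# inputs = processor(text=prompt, images=images, return_tensors="pt")
print(f"{num_placeholders} placeholders match {len(images)} images")
```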
