Hands-On Exercise: Direct Preference Optimization with SmolLM3

Welcome to the hands-on section for Direct Preference Optimization! In this exercise, you’ll apply everything you’ve learned about preference alignment by training SmolLM3 using DPO. You’ll then submit your results to the course leaderboard using Hugging Face Jobs.

Prerequisites: This exercise assumes you have completed Unit 1 (Instruction Tuning) or are familiar with instruction-tuned models. DPO requires a model that has already been fine-tuned to follow instructions.


Exercise: Direct Preference Optimization Training

Objective: Train SmolLM3 using DPO to create a preference-aligned language model and submit it to the leaderboard.

Environment Setup

  • You need a Hugging Face Pro, Team, or Enterprise plan to use HF Jobs for training
  • DPO training requires significant compute resources, so we recommend using HF Jobs with GPU instances
  • Local training requires a GPU with at least 16GB VRAM for SmolLM3-3B
  • First run will download several GB of model weights and datasets

Let’s start by setting up our environment and exploring DPO concepts locally before scaling to HF Jobs.

# Install required packages
pip install "transformers>=4.56.1" "trl>=0.23.0" "datasets>=4.1.0" "torch>=2.8.0"
pip install "accelerate>=1.10.1" "peft>=0.17.0" "trackio"

Import Libraries and Setup

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
import json

# Check available device
if torch.cuda.is_available():
    device = "cuda"
    print(f"Using CUDA GPU: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = "mps"
    print("Using Apple MPS")
else:
    device = "cpu"
    print("Using CPU - you will need HF Jobs with GPU for actual training")

# Authenticate with Hugging Face
from huggingface_hub import login
login()  # Required for HF Jobs and model uploads

Understanding DPO Data Format

Before training, let’s explore what preference data looks like and how DPO uses it:

# Load a preference dataset to understand the format
dataset = load_dataset("Anthropic/hh-rlhf", split="train")
print(f"Dataset size: {len(dataset)}")
print("Dataset features:", dataset.features.keys())

# Examine a preference pair
sample = dataset[0]
print(f"\nChosen (Preferred): {sample['chosen'][:200]}...")
print(f"\nRejected (Non-preferred): {sample['rejected'][:200]}...")

# This shows how DPO learns to prefer "chosen" responses over "rejected" ones
Expected Output
Dataset size: 160800
Dataset features: dict_keys(['chosen', 'rejected'])

Chosen (Preferred):

Human: What are some cuss words in english?

Assistant: I understand you're curious about profanity in English. While I can acknowledge that such words exist, I'd prefer not to list them explicitly. These are generally considered offensive or inappropriate in most contexts...

Rejected (Non-preferred):

Human: What are some cuss words in english?

Assistant: Here are some common cuss words in English: [explicit list follows]...
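
Note that in hh-rlhf the chosen and rejected fields each contain the full Human/Assistant conversation and share the same prompt prefix. Recent TRL releases can handle this implicit-prompt format, but if your TRL version expects an explicit prompt column, a small conversion along these lines works (this helper is just an illustration, not part of TRL):

# Split hh-rlhf conversations into explicit prompt / chosen / rejected columns.
# Assumes chosen and rejected share everything up to the final "Assistant:" turn,
# which is how hh-rlhf examples are structured.
def split_hh_example(example):
    marker = "\n\nAssistant:"
    cut = example["chosen"].rfind(marker) + len(marker)
    prompt = example["chosen"][:cut]
    return {
        "prompt": prompt,
        "chosen": example["chosen"][cut:],
        "rejected": example["rejected"][len(prompt):],
    }

explicit_dataset = dataset.map(split_hh_example)
print(explicit_dataset[0]["prompt"][:150])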

Local DPO Training Test (Optional)

If you have sufficient local GPU resources, you can test DPO training locally before scaling to HF Jobs:

# Load a small subset for local testing
small_dataset = dataset.select(range(1000))

# Load SmolLM3-3B (the instruction-tuned checkpoint)
model_name = "HuggingFaceTB/SmolLM3-3B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Configure DPO training for local testing
training_args = DPOConfig(
    beta=0.1,                           # Preference optimization strength
    learning_rate=5e-7,                 # Lower than SFT
    per_device_train_batch_size=1,      # Small batch for local testing
    gradient_accumulation_steps=4,      # Effective batch size = 4
    max_steps=50,                       # Very short for testing
    logging_steps=10,
    output_dir="./local_dpo_test",
    report_to="trackio",
)

# Create trainer (but don't train yet - save resources for HF Jobs)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=small_dataset,
    processing_class=tokenizer,
)

print("Local DPO trainer configured successfully!")
print("Ready to scale to HF Jobs for full training...")

Training with Hugging Face Jobs

Now let’s set up DPO training using HF Jobs for scalable, cloud-based training.

Create DPO Training Script

First, create a training script that uses TRL’s DPO capabilities:

# dpo_training.py
# /// script
# dependencies = [
#     "trl[dpo]>=0.7.0",
#     "transformers>=4.36.0", 
#     "datasets>=2.14.0",
#     "accelerate>=0.24.0",
#     "torch>=2.0.0"
# ]
# ///

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

def main():
    # Load preference dataset
    dataset = load_dataset("Anthropic/hh-rlhf", split="train")
    
    # Take a reasonable subset for training
    train_dataset = dataset.select(range(10000))
    
    # Load SmolLM3-3B (already instruction-tuned via SFT, which DPO requires)
    model_name = "HuggingFaceTB/SmolLM3-3B"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    
    # Configure DPO training
    training_args = DPOConfig(
        # Core DPO parameters
        beta=0.1,                           # Preference optimization strength
        max_prompt_length=512,              # Maximum prompt length
        max_length=1024,                    # Maximum total sequence length
        
        # Training configuration
        learning_rate=5e-7,                 # Lower than SFT for stability
        per_device_train_batch_size=2,      # Adjust for GPU memory
        gradient_accumulation_steps=8,      # Effective batch size = 16
        max_steps=1000,                     # Sufficient for good alignment
        
        # Optimization
        warmup_steps=100,
        lr_scheduler_type="cosine",
        gradient_checkpointing=True,        # Memory efficiency
        bf16=True,                          # Mixed precision
        
        # Logging and saving
        logging_steps=50,
        save_steps=250,
        output_dir="./smollm3-dpo-aligned",
        
        # Hub integration
        push_to_hub=True,
        hub_model_id="your-username/smollm3-dpo-aligned",  # Change this!
        report_to="trackio",
        
        # Keep all dataset columns so DPO can access the chosen/rejected pairs
        remove_unused_columns=False,
    )
    
    # Initialize DPO trainer
    trainer = DPOTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        processing_class=tokenizer,
    )
    
    # Start training
    print("Starting DPO training...")
    trainer.train()
    
    print("Training completed! Model saved and pushed to Hub.")

if __name__ == "__main__":
    main()

Submit DPO Training Job

Now submit your training job to HF Jobs:

# Submit DPO training job to HF Jobs
hf jobs uv run \
    --flavor a100-large \
    --timeout 3h \
    --secrets HF_TOKEN \
    dpo_training.py

Hardware Recommendations for DPO:

  • a100-large: Best performance, 40GB GPU memory (recommended)
  • a10g-large: Good balance, 24GB GPU memory
  • l4x1: Budget option, 24GB GPU memory

DPO training typically takes 1-2 hours for 1000 steps on an A100.

Alternative: Using TRL’s Built-in DPO Script

You can also use TRL’s maintained DPO script directly:

# Use TRL's DPO script with HF Jobs
hf jobs uv run \
    --flavor a100-large \
    --timeout 3h \
    --secrets HF_TOKEN \
    "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/dpo.py" \
    --model_name_or_path HuggingFaceTB/SmolLM3-3B \
    --dataset_name Anthropic/hh-rlhf \
    --learning_rate 5e-7 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --max_steps 1000 \
    --beta 0.1 \
    --max_prompt_length 512 \
    --max_length 1024 \
    --output_dir smollm3-dpo-aligned \
    --push_to_hub \
    --hub_model_id your-username/smollm3-dpo-aligned \
    --report_to trackio

Monitor Your Training Job

Track your DPO training progress using the HF Jobs CLI:

# List all your jobs
hf jobs ps -a

# Monitor job logs in real-time
hf jobs logs <job_id> --follow

# Check job details
hf jobs inspect <job_id>

You can also monitor training metrics through Trackio at the URL provided in the job logs.

Evaluate Your DPO-Aligned Model

Once training is complete, evaluate your model’s alignment quality:

# Local evaluation of your trained model
from transformers import pipeline

# Load your trained model
model_name = "your-username/smollm3-dpo-aligned"
generator = pipeline("text-generation", model=model_name, tokenizer=model_name)

# Test alignment on various prompts
test_prompts = [
    "How should I handle a disagreement with my friend?",
    "What's the best way to learn programming?", 
    "How can I be more productive at work?",
    "What should I do if I see someone being bullied?"
]

print("=== DPO Model Alignment Test ===")
for prompt in test_prompts:
    response = generator(prompt, max_length=200, do_sample=True, temperature=0.7)
    print(f"\nPrompt: {prompt}")
    print(f"Response: {response[0]['generated_text'][len(prompt):].strip()}")

Submit to Course Leaderboard

Ready to submit your aligned model to the leaderboard? Continue to the submission page where you’ll:

  1. Evaluate your model using HF Jobs and LightEval
  2. Submit your results to the course leaderboard
  3. Compare your model’s alignment quality with other submissions

Resources and Further Reading

Congratulations on completing DPO training! Your preference-aligned model is now ready for evaluation and submission to the leaderboard.
