Submit your final project!
It’s time to submit your DPO-aligned model! This unit uses the same leaderboard-based submission system as Unit 1. Here’s the plan:
- Read the written guide for the chapter ✅
- Train a model using what you learned in the chapter.
- Push the model to the Hugging Face Hub.
- Evaluate the model using hf jobs.
- Open a pull request on the leaderboard.
On this page we will go through each step.
1. Read the written guide for the chapter and 2. Train a model using what you learned in the chapter.
For Unit 3’s submission, you should read all the materials in the unit and train a preference-aligned model using DPO. The training code is provided in:
- DPO Training Exercise - Complete DPO training guide with SmolLM3
You’ll need to combine this with Training with Hugging Face Jobs techniques from Unit 1.
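If you want to see what the hosted TRL script does before launching a job, here is a minimal local sketch, assuming a recent TRL release that provides DPOConfig and DPOTrainer. The hyperparameters mirror the hf jobs command in the next step, and everything here (model, dataset, values) can be swapped for your own choices.

```python
# Minimal local DPO sketch (assumes a recent TRL release with DPOConfig/DPOTrainer).
# Hyperparameters mirror the hf jobs command shown in the next step.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "HuggingFaceTB/SmolLM3-3B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: each example contains a "chosen" and a "rejected" completion.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

config = DPOConfig(
    output_dir="smollm3-dpo-aligned",
    beta=0.1,            # strength of the preference constraint
    learning_rate=5e-7,
    max_steps=1000,
    push_to_hub=True,    # same effect as the --push_to_hub flag used below
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Training a 3B model on the full preference dataset is heavy for a local machine, which is why the steps below run the same recipe on hosted hardware with hf jobs.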
3. Push the model to the Hugging Face Hub
Once you’ve trained your DPO-aligned model, you’ll need to push it to a repo on the Hugging Face Hub. TRL will take care of this for you if you add the --push_to_hub flag to your training command.
DPO Training with Hub Upload:

```bash
hf jobs uv run \
    --flavor a100-large \
    --secrets HF_TOKEN \
    "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/dpo.py" \
    --model_name_or_path HuggingFaceTB/SmolLM3-3B \
    --dataset_name Anthropic/hh-rlhf \
    --learning_rate 5e-7 \
    --beta 0.1 \
    --max_steps 1000 \
    --push_to_hub \
    --hub_model_id your-username/smollm3-dpo-aligned \
    --report_to trackio
```

Your trained model will be available at the repo you pass as --hub_model_id (here, your-username/smollm3-dpo-aligned). For detailed documentation, check out the checkpoints documentation from transformers.
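Before moving on to evaluation, it is worth sanity-checking that the upload worked. A minimal sketch, assuming the hub_model_id used above (replace the repo id with your own):

```python
# Quick check that the pushed checkpoint loads and generates.
# The repo id is the --hub_model_id from the training command; replace it with yours.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/smollm3-dpo-aligned"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "How can I politely decline an invitation?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```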
4. Evaluate the model using hf jobs
Now we will evaluate your DPO-aligned model. We will use hf jobs together with lighteval to run the evaluation and push the results to a dataset on the Hub.
For DPO evaluation, we use tasks that test both the helpfulness and safety aspects of alignment, including truthfulqa, gsm8k, and other alignment-focused benchmarks.
```bash
hf jobs uv run \
    --flavor a10g-large \
    --with "lighteval[vllm]" \
    --secrets HF_TOKEN \
    lighteval vllm "model_name=<your-username>/<your-model-name>" \
    "lighteval|truthfulqa:mc2|0|0,lighteval|hellaswag|0|0,lighteval|arc:challenge|0|0" \
    --push-to-hub --results-org <your-username>
```

This command will evaluate the model using lighteval and vllm and save the results to the Hugging Face Hub in a dataset repo that you define.
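Once the job finishes, you can check that the results actually landed on the Hub before referencing them in your submission. The repo id below is a placeholder; the exact dataset name lighteval creates under your results org may differ, so look it up on your profile first.

```python
# List the files in your results dataset on the Hub.
# The repo id is a placeholder; use the dataset that lighteval created for you.
from huggingface_hub import HfApi

api = HfApi()
results_repo = "<your-username>/<your-results-dataset>"
for path in api.list_repo_files(results_repo, repo_type="dataset"):
    print(path)
```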
We focus on alignment evaluation in Unit 3, but in Unit 2 we explore evaluation in more detail. The key benchmarks for DPO evaluation include:
- TruthfulQA: Tests for truthful and honest responses
- GSM8K: Mathematical reasoning and helpfulness
- MMLU: Broad knowledge and helpfulness across domains
5. Open a pull request on the leaderboard space
You are now ready to submit your DPO-aligned model to the leaderboard! You need to do two things:
- Add your model’s results to submissions.json
- Share your training and evaluation commands in the PR text.
Add your model’s results to submissions.json
Open a pull request on the leaderboard space to submit your model. You just need to add your model info and reference to the dataset you created in the previous step. We will pull the results and display them on the leaderboard.
```json
{
  "submissions": [
    ... // existing submissions
    {
      "username": "<your-username>",
      "model_name": "<your-model-name>",
      "chapter": "3",
      "method": "DPO",
      "submission_date": "<your-submission-date>",
      "results-dataset": "<your-results-dataset>",
      "base_model": "HuggingFaceTB/SmolLM3-3B",
      "preference_dataset": "Anthropic/hh-rlhf"
    }
  ]
}
```

Share your training and evaluation commands in the PR text.
Within the PR text, share both your training and evaluation commands.
Wait for the PR to be merged
Once the PR is merged, your DPO-aligned model will be added to the leaderboard! You can check the leaderboard here.
Test your knowledge
You’ve completed the unit — great work! Now put your learning to the test by taking the quiz.
Resources and Further Reading
- Unit 3 DPO Exercise - Complete DPO training guide
- DPO Paper - Original research paper
- TRL DPO Documentation - Implementation details
- Anthropic HH-RLHF Paper - Human feedback methodology
- Alignment Handbook - Advanced alignment techniques
Good luck with your DPO preference alignment submission! 🚀