Submit your final project!
It’s time to submit your DPO-aligned model! This unit uses the same leaderboard-based submission system as Unit 1. Here’s the plan:
- Read the written guide for the chapter ✅
- Train a model using what you learned in the chapter.
- Push the model to the Hugging Face Hub.
- Evaluate the model using hf jobs.
- Open a pull request on the leaderboard.
On this page we will go through each step.
1. Read the written guide for the chapter and 2. Train a model using what you learned in the chapter.
For Unit 3’s submission, you should read all the materials in the unit and train a preference-aligned model using DPO. The training code is provided in:
- DPO Training Exercise - Complete DPO training guide with SmolLM3
You’ll need to combine this with Training with Hugging Face Jobs techniques from Unit 1.
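If you want to see what the hosted TRL script does before launching a job, here is a minimal local sketch, assuming a recent TRL release that provides DPOConfig and DPOTrainer. The hyperparameters mirror the hf jobs command in the next step, and everything here (model, dataset, values) can be swapped for your own choices.

```python
# Minimal local DPO sketch (assumes a recent TRL release with DPOConfig/DPOTrainer).
# Hyperparameters mirror the hf jobs command shown in the next step.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "HuggingFaceTB/SmolLM3-3B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: each example contains a "chosen" and a "rejected" completion.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

config = DPOConfig(
    output_dir="smollm3-dpo-aligned",
    beta=0.1,            # strength of the preference constraint
    learning_rate=5e-7,
    max_steps=1000,
    push_to_hub=True,    # same effect as the --push_to_hub flag used below
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Training a 3B model on the full preference dataset is heavy for a local machine, which is why the steps below run the same recipe on hosted hardware with hf jobs.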
3. Push the model to the Hugging Face Hub
Once you’ve trained your DPO-aligned model, you’ll need to push it to a repo on the Hugging Face Hub. TRL will take care of this for you if you add the --push_to_hub flag to your training command.
DPO Training with Hub Upload:

```bash
hf jobs uv run \
    --flavor a100-large \
    --secrets HF_TOKEN \
    "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/dpo.py" \
    --model_name_or_path HuggingFaceTB/SmolLM3-3B \
    --dataset_name Anthropic/hh-rlhf \
    --learning_rate 5e-7 \
    --beta 0.1 \
    --max_steps 1000 \
    --push_to_hub \
    --hub_model_id your-username/smollm3-dpo-aligned \
    --report_to trackio
```

Your trained model will be available at the repo you pass as --hub_model_id (here, your-username/smollm3-dpo-aligned). For detailed documentation, check out the checkpoints documentation from transformers.
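Before moving on to evaluation, it is worth sanity-checking that the upload worked. A minimal sketch, assuming the hub_model_id used above (replace the repo id with your own):

```python
# Quick check that the pushed checkpoint loads and generates.
# The repo id is the --hub_model_id from the training command; replace it with yours.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/smollm3-dpo-aligned"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "How can I politely decline an invitation?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```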
4. Evaluate the model using hf jobs
Now we will evaluate your DPO-aligned model. We will use hf jobs together with lighteval to run the evaluation and push the results to a dataset on the Hub.
For DPO evaluation, we use tasks that test both the helpfulness and safety aspects of alignment, including truthfulqa, gsm8k, and other alignment-focused benchmarks.
```bash
hf jobs uv run \
    --flavor a10g-large \
    --with "lighteval[vllm]" \
    --secrets HF_TOKEN \
    lighteval vllm "model_name=<your-username>/<your-model-name>" \
    "lighteval|truthfulqa:mc2|0|0,lighteval|hellaswag|0|0,lighteval|arc:challenge|0|0" \
    --push-to-hub --results-org <your-username>
```

This command will evaluate the model using lighteval and vllm and save the results to the Hugging Face Hub in a dataset repo that you define.
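Once the job finishes, you can check that the results actually landed on the Hub before referencing them in your submission. The repo id below is a placeholder; the exact dataset name lighteval creates under your results org may differ, so look it up on your profile first.

```python
# List the files in your results dataset on the Hub.
# The repo id is a placeholder; use the dataset that lighteval created for you.
from huggingface_hub import HfApi

api = HfApi()
results_repo = "<your-username>/<your-results-dataset>"
for path in api.list_repo_files(results_repo, repo_type="dataset"):
    print(path)
```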
We focus on alignment evaluation in Unit 3, but in Unit 2 we explore evaluation in more detail. The key benchmarks for DPO evaluation include:
- TruthfulQA: Tests for truthful and honest responses
- GSM8K: Mathematical reasoning and helpfulness
- MMLU: Broad knowledge and helpfulness across domains
5. Open a pull request on the leaderboard space
You are now ready to submit your DPO-aligned model to the leaderboard! You need to do two things:
- Add your model’s results to submissions.json
- Share your training and evaluation commands in the PR text.
Add your model’s results to submissions.json
Open a pull request on the leaderboard space to submit your model. You just need to add your model info and reference to the dataset you created in the previous step. We will pull the results and display them on the leaderboard.
```json
{
  "submissions": [
    ... // existing submissions
    {
      "username": "<your-username>",
      "model_name": "<your-model-name>",
      "chapter": "3",
      "method": "DPO",
      "submission_date": "<your-submission-date>",
      "results-dataset": "<your-results-dataset>",
      "base_model": "HuggingFaceTB/SmolLM3-3B",
      "preference_dataset": "Anthropic/hh-rlhf"
    }
  ]
}
```

Share your training and evaluation commands in the PR text.
Within the PR text, share both your training and evaluation commands.
Wait for the PR to be merged
Once the PR is merged, your DPO-aligned model will be added to the leaderboard! You can check the leaderboard here.
Test your knowledge
You’ve completed the unit — great work! Now put your learning to the test by taking the quiz.
Resources and Further Reading
- Unit 3 DPO Exercise - Complete DPO training guide
- DPO Paper - Original research paper
- TRL DPO Documentation - Implementation details
- Anthropic HH-RLHF Paper - Human feedback methodology
- Alignment Handbook - Advanced alignment techniques
Good luck with your DPO preference alignment submission! 🚀