Training with Hugging Face Jobs
Hugging Face Jobs provides fully managed cloud infrastructure for training models without the hassle of setting up GPUs, managing dependencies, or configuring environments locally. This is particularly valuable for SFT training, which can be resource-intensive and time-consuming.
Why Use Jobs for SFT Training?
- Scalable Infrastructure: Access to high-end GPUs (A100, L4, etc.) without hardware investment
- Zero Setup: No need to manage CUDA drivers, Docker containers, or environment configurations
- Cost Effective: Pay only for compute time used, with automatic shutdown after completion
- Integrated Workflow: Seamless integration with Hugging Face Hub for model storage and sharing
- Monitoring: Built-in logging and progress tracking through the Hub interface
Requirements
To use Hugging Face Jobs, you need:
- A Pro, Team, or Enterprise Hugging Face plan (available from the Hugging Face pricing page)
- Authentication via hf auth login
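If you prefer to authenticate from Python instead of the CLI (for example, inside a notebook), the huggingface_hub library offers equivalent helpers; a minimal sketch:

```python
from huggingface_hub import login, whoami

login()  # prompts for a Hugging Face token with write access
print(whoami()["name"])  # confirms which account is authenticated
```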
Running SFT with Jobs: Two Approaches
The easiest way to run TRL with Hugging Face Jobs is to use TRL's built-in scripts. They rely on uv to manage dependencies and on hf jobs to launch the training job.
This guide walks you through using TRL's built-in scripts to train a model with Hugging Face Jobs. If you want to use a custom script instead, declare its dependencies inline as uv script metadata and run it with hf jobs uv run.
Create a custom training script with inline dependencies
```python
# sft_training.py
# /// script
# dependencies = [
#     "trl[sft]>=0.7.0",
#     "transformers>=4.36.0",
#     "datasets>=2.14.0",
#     "accelerate>=0.24.0",
#     "peft>=0.7.0",
# ]
# ///
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base")

# Load dataset
dataset = load_dataset("HuggingFaceTB/smoltalk2", "SFT")

# Configure training
config = SFTConfig(
    output_dir="./smollm3-jobs-sft",
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    max_steps=1000,
    logging_steps=50,
    save_steps=200,
    push_to_hub=True,
    hub_model_id="your-username/smollm3-jobs-sft",
)

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["smoltalk_everyday_convs_reasoning_Qwen3_32B_think"],
    args=config,
)
trainer.train()
```

Then run with the Jobs CLI:
```bash
# Run the UV script on Jobs
hf jobs uv run \
    --flavor a10g-large \
    --timeout 2h \
    --secrets HF_TOKEN \
    sft_training.py
```

Hardware Selection for SFT
Choose the right hardware flavor based on your model size and training requirements:
For SmolLM3-3B (Recommended):
- a10g-large: 24GB GPU memory, cost-effective for most SFT tasks
- a100-large: 40GB GPU memory, fastest training with larger batch sizes
- l4x1: 24GB GPU memory, a single L4 GPU as a lower-cost alternative
For Larger Models (7B+):
- a100-large: Required for 7B+ models
- l4x4: Multi-GPU setup for distributed training
Budget Options:
- t4-small: 16GB GPU memory, slower but economical for experimentation
- l4x1: 24GB GPU memory, good balance of cost and performance
For a detailed comparison of the different hardware flavors, check out the Jobs pricing page.
Advanced Jobs Configuration
```bash
# Use TRL's maintained SFT script directly
hf jobs uv run \
    --flavor a10g-large \
    --timeout 2h \
    --secrets HF_TOKEN \
    "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" \
    --model_name_or_path HuggingFaceTB/SmolLM3-3B-Base \
    --dataset_name HuggingFaceTB/smoltalk2_everyday_convs_think \
    --learning_rate 5e-5 \
    --per_device_train_batch_size 4 \
    --max_steps 1000 \
    --output_dir smollm3-sft-jobs \
    --push_to_hub \
    --hub_model_id your-username/smollm3-sft \
    --report_to trackio
```

Environment Variables and Secrets:
If you're working with a custom script, use the --secrets flag to pass in secrets (such as HF_TOKEN) and the --env flag for plain environment variables.
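Inside the job, values passed with --secrets and --env show up as ordinary environment variables, so a custom script can read them with os.environ; a minimal sketch (the variable names simply mirror the command below):

```python
import os

# Secrets and env vars passed via `hf jobs uv run` are plain environment variables here
hf_token = os.environ["HF_TOKEN"]                          # required secret
wandb_key = os.environ.get("WANDB_API_KEY")                # optional secret
project = os.environ.get("WANDB_PROJECT", "smollm3-sft")   # plain env var with a fallback
```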
```bash
hf jobs uv run \
    --flavor a10g-large \
    --timeout 3h \
    --secrets HF_TOKEN=your_token \
    --secrets WANDB_API_KEY=your_wandb_key \
    --env WANDB_PROJECT=smollm3-sft \
    --env CUDA_VISIBLE_DEVICES=0 \
    my_sft_training.py
```

Monitoring Your Training Job
To check on your training job, you can use the hf jobs command or go to Job Settings on the Hub.
Check Job Status:
```bash
# List all jobs
hf jobs ps -a

# Get detailed job information
hf jobs inspect <job_id>

# Stream job logs in real-time
hf jobs logs <job_id> --follow

# Cancel a running job if needed
hf jobs cancel <job_id>
```

LoRA/PEFT on Jobs (optional)
Enable LoRA when using TRL’s maintained SFT script by passing PEFT flags. See the script for authoritative flags and defaults: https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py.
```bash
hf jobs uv run \
    --flavor a10g-large \
    --timeout 2h \
    --secrets HF_TOKEN \
    "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" \
    --model_name_or_path HuggingFaceTB/SmolLM3-3B-Base \
    --dataset_name HuggingFaceTB/smoltalk2_everyday_convs_think \
    --output_dir smollm3-lora-sft-jobs \
    --per_device_train_batch_size 4 \
    --learning_rate 5e-5 \
    --max_steps 1000 \
    --report_to trackio \
    --push_to_hub \
    --hub_model_id your-username/smollm3-lora-sft \
    --use_peft \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --lora_target_modules all-linear
```

Notes:
- Confirm flag names in the TRL SFT script before running: https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py
- LoRA trains small adapters, which you can keep separate or merge later for deployment.
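If you would rather stay with the custom-script workflow from earlier, the same LoRA setup can be expressed in Python by passing a peft_config to SFTTrainer; a minimal sketch, with hyperparameters mirroring the command above:

```python
# /// script
# dependencies = ["trl", "peft", "datasets"]
# ///
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("HuggingFaceTB/smoltalk2", "SFT")

# LoRA hyperparameters match the CLI flags used above
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM3-3B-Base",  # SFTTrainer also accepts a model id string
    train_dataset=dataset["smoltalk_everyday_convs_reasoning_Qwen3_32B_think"],
    args=SFTConfig(output_dir="./smollm3-lora-sft-jobs", max_steps=1000),
    peft_config=peft_config,
)
trainer.train()
```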
Monitoring with Trackio
The commands above pass --report_to trackio, which sends training metrics to Trackio so you can monitor your training job while it runs.
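If you also want to log custom values from your own script, Trackio exposes a wandb-style API (init / log / finish); a minimal sketch, assuming that API and an illustrative project name:

```python
import trackio

# Assumes Trackio's wandb-style API; the project name is illustrative
trackio.init(project="smollm3-sft")
for step in range(3):
    trackio.log({"train/loss": 1.0 / (step + 1), "step": step})
trackio.finish()
```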
Cost Estimation
Approximate costs for SmolLM3-3B SFT training (1000 steps):
- l4x1: ~$3-4 per hour (24GB GPU memory)
- a10g-large: ~$4-6 per hour (24GB GPU memory)
- a100-large: ~$8-12 per hour (40GB GPU memory)
Training typically takes 30-90 minutes for 1000 steps depending on hardware and configuration, making Jobs cost-effective compared to local GPU rental or cloud instances.
Cost-Saving Tips:
- Use smaller batch sizes with gradient accumulation to fit on cheaper GPUs (see the sketch after this list)
- Start with shorter training runs (500 steps) to validate your setup
- Use l4x1 for initial experiments, then scale to faster GPUs for production
- Set appropriate timeouts to avoid unexpected charges
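As a concrete example of the first tip, here is a minimal sketch of how the SFTConfig from the custom script above could trade per-device batch size for gradient accumulation (the exact numbers are illustrative):

```python
from trl import SFTConfig

# Keeps the effective batch size at 4 (1 x 4), matching the earlier config,
# while using far less GPU memory per step
config = SFTConfig(
    output_dir="./smollm3-jobs-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,  # optional: trades extra compute for lower memory
    max_steps=1000,
)
```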
Troubleshooting Common Issues
Out of Memory Errors:
- Reduce per_device_train_batch_size
- Enable gradient checkpointing
- Use a smaller max_length
Timeout Issues:
- Increase the --timeout value
- Reduce training steps or use more powerful hardware
- Optimize data loading and preprocessing
Authentication Errors:
- Ensure HF_TOKEN is correctly set as a secret
- Verify your Hugging Face account has the required plan
- Check token permissions for model uploads
Resources and Further Reading
- Hugging Face Jobs Documentation - Complete Jobs guide
- TRL Jobs Training Guide - TRL-specific Jobs examples
- Jobs Pricing - Current pricing for different hardware flavors
- Jobs CLI Reference - Command-line interface details