bikmish/llm-course-hw2-reward-model
A reward model for RLHF training that predicts which of two candidate responses a human would prefer, fine-tuned on conversational preference pairs (chosen vs. rejected).
Architecture Details
- Base Model: SmolLM-135M-Instruct
- Head: Linear classification layer (hidden_size → 1), producing a scalar reward per sequence
- Fine-tuning: Last 3 transformer layers + head (see the freezing sketch below)
- Train/Test Split: 9,796/1,088 examples
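As a rough illustration of this setup (not the card's actual training code), the sketch below loads the instruct checkpoint with a single-logit classification head and unfreezes only the last three transformer blocks plus the head. The attribute names `model.model.layers` and `model.score` are assumptions based on the Llama-style layout SmolLM uses in transformers.

```python
from transformers import AutoModelForSequenceClassification

# Load the instruct base model with a 1-output classification head:
# the single logit is the scalar reward assigned to a sequence.
model = AutoModelForSequenceClassification.from_pretrained(
    "HuggingFaceTB/SmolLM-135M-Instruct", num_labels=1
)

# Freeze everything, then unfreeze the last 3 transformer blocks and the head.
# `model.model.layers` / `model.score` are assumed Llama-style attribute names.
for param in model.parameters():
    param.requires_grad = False
for block in model.model.layers[-3:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.score.parameters():
    param.requires_grad = True
```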
Key Metrics
- Accuracy: 89.2% (chosen ranked above rejected; see the computation sketch below)
- AUC-ROC: 0.93
- Reward Margin: +4.77 ± 1.32
- Inference Speed: 28 samples/sec (A10 GPU)
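For context, pairwise accuracy and reward margin can be computed from per-pair chosen/rejected rewards as in this minimal sketch (toy numbers, not the actual evaluation data):

```python
import numpy as np

def pairwise_metrics(chosen_rewards, rejected_rewards):
    """Accuracy: fraction of pairs where the chosen response outscores the
    rejected one. Margin: mean and std of the (chosen - rejected) gap."""
    margins = np.asarray(chosen_rewards) - np.asarray(rejected_rewards)
    return float((margins > 0).mean()), float(margins.mean()), float(margins.std())

# Toy example with made-up rewards, not the actual test set
acc, margin, spread = pairwise_metrics([3.1, 2.4, 0.2], [-1.5, 1.9, 0.8])
print(f"accuracy={acc:.3f}, margin={margin:.2f} ± {spread:.2f}")
```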
Preference Signals
The model assigns higher reward to responses that:
- Use casual/emotional language (+1.2 reward)
- Include follow-up questions (+0.8)
It penalizes:
- Overly formal responses (-1.5)
- Generic disclaimers (-2.1)
- Off-topic digressions (-1.8)
Usage
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bikmish/llm-course-hw2-reward-model")
tokenizer = AutoTokenizer.from_pretrained("bikmish/llm-course-hw2-reward-model")

# Score a chosen/rejected pair: a higher logit means a more preferred response
text_pair = [
    "That's awesome! Tell me more details!",           # chosen
    "I cannot provide opinions about entertainment.",  # rejected
]
inputs = tokenizer(text_pair, padding=True, return_tensors="pt")
with torch.no_grad():
    rewards = model(**inputs).logits
print(f"Chosen: {rewards[0].item():.2f}, Rejected: {rewards[1].item():.2f}")
```