bikmish/llm-course-hw2-reward-model

A preference prediction model for RLHF training, fine-tuned on human-like conversation preferences.

Architecture Details

  • Base Model: SmolLM-135M-Instruct
  • Head: Linear classification layer (hidden_size → 1)
  • Fine-tuning: Last 3 layers + head
  • Train/Test Split: 9,796/1,088 examples
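
The setup above can be sketched in code as follows. This is a minimal, illustrative sketch rather than the exact training script: it assumes the base checkpoint is HuggingFaceTB/SmolLM-135M-Instruct on the Hub and that the backbone exposes Llama-style model.model.layers and model.score attributes.

from transformers import AutoModelForSequenceClassification

# Scalar reward head: a single-logit classification layer on top of the backbone.
model = AutoModelForSequenceClassification.from_pretrained(
    "HuggingFaceTB/SmolLM-135M-Instruct",  # assumed base checkpoint id
    num_labels=1,
)

# Freeze everything, then unfreeze the last 3 transformer blocks and the head.
for param in model.parameters():
    param.requires_grad = False
for block in model.model.layers[-3:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.score.parameters():
    param.requires_grad = True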

Key Metrics

  • Accuracy: 89.2% (chosen ranked above rejected)
  • AUC-ROC: 0.93
  • Reward Margin: +4.77 ± 1.32
  • Inference Speed: 28 samples/sec (A10 GPU)
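
For reference, the pairwise accuracy and reward margin are typically computed from per-pair scores like this (illustrative only: chosen_rewards and rejected_rewards are hypothetical tensors of shape [num_pairs], and the values below are placeholders, not results from this model).

import torch

chosen_rewards = torch.tensor([2.9, 1.4, 3.1])      # placeholder scores
rejected_rewards = torch.tensor([-1.6, 0.2, -2.0])  # placeholder scores

margins = chosen_rewards - rejected_rewards
accuracy = (margins > 0).float().mean()  # fraction of pairs where chosen outranks rejected
print(f"accuracy={accuracy:.3f}, margin={margins.mean():.2f} ± {margins.std():.2f}")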

Preference Signals

The model scores higher for responses that:

  • 😊 Use casual/emotional language (+1.2 reward)
  • 💬 Include follow-up questions (+0.8)

❌ It penalizes:

  • Overly formal responses (-1.5)
  • Generic disclaimers (-2.1)
  • Off-topic digressions (-1.8)

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("bikmish/llm-course-hw2-reward-model")
tokenizer = AutoTokenizer.from_pretrained("bikmish/llm-course-hw2-reward-model")
model.eval()

# Score a chosen/rejected pair; a higher logit means a more preferred response.
text_pair = [
    "That's awesome! Tell me more details! 😄",  # chosen
    "I cannot provide opinions about entertainment."  # rejected
]

inputs = tokenizer(text_pair, padding=True, return_tensors="pt")
with torch.no_grad():
    rewards = model(**inputs).logits.squeeze(-1)  # shape: [2]
print(f"Chosen: {rewards[0].item():.2f}, Rejected: {rewards[1].item():.2f}")