MurDanya/llm-course-hw2-reward-model

Text Classification

text-generation-inference


Reward model for PPO

Task description

In this homework assignment, a reward model was trained on the Human-Like-DPO-Dataset, starting from SmolLM-135M-Instruct with a sequence-classification head.
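The card does not state the training objective. As a rough illustration only, reward models trained on chosen/rejected preference pairs commonly minimize a pairwise (Bradley-Terry style) loss, which pushes the reward of the preferred response above the rejected one. The function name `pairwise_reward_loss` and the scalar inputs below are hypothetical, and this is an assumption about the setup rather than the author's actual training code.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the chosen response is scored further above the rejected one.
loss_good = pairwise_reward_loss(2.0, 0.0)
loss_bad = pairwise_reward_loss(0.0, 2.0)
```

When both rewards are equal the loss is log(2), and it approaches zero as the margin in favor of the chosen response grows.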

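Loading example

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
REWARD_MODEL_REPO_NAME = "MurDanya/llm-course-hw2-reward-model"

tokenizer = AutoTokenizer.from_pretrained(REWARD_MODEL_REPO_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_MODEL_REPO_NAME).to(device)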
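Once `tokenizer` and `reward_model` are loaded, a reward can be read from the classification logits. The sketch below is a minimal example that assumes the head emits a single logit per sequence (the card does not state the number of labels); `score_texts` is a hypothetical helper, not part of the repository. Texts are scored one at a time, so the example does not rely on a padding token being configured.

```python
import torch


@torch.no_grad()
def score_texts(texts, tokenizer, reward_model, device="cpu"):
    """Return one scalar reward per text (assumes a single-logit head)."""
    reward_model.eval()
    rewards = []
    for text in texts:
        # Tokenize a single text; truncation guards against overly long inputs.
        inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        logits = reward_model(**inputs).logits  # assumed shape: (1, 1)
        rewards.append(logits.reshape(-1)[0].item())
    return rewards


# Usage with the objects from the loading example:
# rewards = score_texts(["Hi! How are you?"], tokenizer, reward_model, device)
```

A higher value would indicate a response the model prefers. If the head turns out to have more than one label, the indexing needs to be adapted.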


Safetensors · Model size: 0.1B params · Tensor type: F32

Model tree for MurDanya/llm-course-hw2-reward-model

Base model: HuggingFaceTB/SmolLM-135M
Quantized: HuggingFaceTB/SmolLM-135M-Instruct
Finetuned (178): this model

Dataset used to train MurDanya/llm-course-hw2-reward-model

Collection including MurDanya/llm-course-hw2-reward-model

llm-course-hw2-alignment

Fine-tuning models with DPO and PPO • 3 items • Updated Mar 30