---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- peft
- llama
- preference-learning
model-index:
- name: llama3-dpo-llm-judge
  results: []
---

# Llama-3.2-1B DPO LLM Judge

This model is a fine-tuned version of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), trained with Direct Preference Optimization (DPO) on preference pairs labeled by an LLM judge.

## Model Details

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **Training Method**: Direct Preference Optimization (DPO)
- **Preference Source**: LLM judge
- **LoRA Configuration**:
  - r: 8
  - alpha: 16
  - target_modules: `['q_proj', 'k_proj', 'v_proj', 'o_proj']`
- **Training Steps**: 250
- **Learning Rate**: 0.0002

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-llm-judge")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
```

A fuller generation example is given at the end of this card.

## Training Details

- Dataset: 50 instructions from LIMA
- Responses per instruction: 5
- Preference judgment: LLM judge
- Training framework: TRL `DPOTrainer` (a reproduction sketch is given at the end of this card)

## Performance

See the evaluation results in the repository for detailed performance metrics.
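
## Example: Generating with the Adapter

Beyond loading the adapter, the snippet below shows one way to run a generation with it. This is a minimal sketch: the bfloat16 dtype, `device_map="auto"`, greedy decoding, and the example prompt are illustrative choices, not settings from the original evaluation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, attach the DPO LoRA adapter, and load the tokenizer.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-llm-judge")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Format a single user turn with the Llama 3 chat template and generate.
messages = [{"role": "user", "content": "Summarize what DPO fine-tuning changes about a model."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```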
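
## Training Sketch (TRL DPOTrainer)

The training details above map onto TRL's `DPOTrainer` roughly as follows. This is a minimal sketch, assuming the standard `prompt`/`chosen`/`rejected` preference format; the placeholder dataset records and the batch size are assumptions, since the card only specifies the LoRA settings, step count, and learning rate.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA settings taken from the card: r=8, alpha=16, attention projections only.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Placeholder preference data: DPOTrainer expects prompt / chosen / rejected columns.
# In the actual run these would be the LIMA instructions with LLM-judge-ranked responses.
train_dataset = Dataset.from_list([
    {"prompt": "Example instruction", "chosen": "Preferred response", "rejected": "Dispreferred response"},
])

# Step count and learning rate come from the card; the remaining values are assumed.
training_args = DPOConfig(
    output_dir="llama3-dpo-llm-judge",
    max_steps=250,
    learning_rate=2e-4,
    per_device_train_batch_size=1,  # not stated on the card
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL releases
    peft_config=peft_config,
)
trainer.train()
```

Passing `peft_config` lets TRL wrap the base model with the LoRA adapter itself, so only the adapter weights are updated during DPO.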