---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- peft
- llama
- preference-learning
model-index:
- name: llama3-dpo-llm-judge
  results: []
---

# Llama-3.2-1B DPO LLM Judge

This model is a fine-tuned version of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), trained with Direct Preference Optimization (DPO) on preference pairs labeled by an LLM judge.

## Model Details

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **Training Method**: Direct Preference Optimization (DPO)
- **Preference Source**: LLM judge
- **LoRA Configuration**:
  - r: 8
  - alpha: 16
  - target_modules: `['q_proj', 'k_proj', 'v_proj', 'o_proj']`
- **Training Steps**: 250
- **Learning Rate**: 0.0002

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-llm-judge")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
```

A fuller generation example is given at the end of this card.

## Training Details

- Dataset: 50 instructions from LIMA
- Responses per instruction: 5
- Preference judgment: LLM judge
- Training framework: TRL `DPOTrainer` (a reproduction sketch is given at the end of this card)

## Performance

See the evaluation results in the repository for detailed performance metrics.
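
## Example: Generating with the Adapter

Beyond loading the adapter, the snippet below shows one way to run a generation with it. This is a minimal sketch: the bfloat16 dtype, `device_map="auto"`, greedy decoding, and the example prompt are illustrative choices, not settings from the original evaluation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, attach the DPO LoRA adapter, and load the tokenizer.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-llm-judge")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Format a single user turn with the Llama 3 chat template and generate.
messages = [{"role": "user", "content": "Summarize what DPO fine-tuning changes about a model."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```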
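
## Training Sketch (TRL DPOTrainer)

The training details above map onto TRL's `DPOTrainer` roughly as follows. This is a minimal sketch, assuming the standard `prompt`/`chosen`/`rejected` preference format; the placeholder dataset records and the batch size are assumptions, since the card only specifies the LoRA settings, step count, and learning rate.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA settings taken from the card: r=8, alpha=16, attention projections only.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Placeholder preference data: DPOTrainer expects prompt / chosen / rejected columns.
# In the actual run these would be the LIMA instructions with LLM-judge-ranked responses.
train_dataset = Dataset.from_list([
    {"prompt": "Example instruction", "chosen": "Preferred response", "rejected": "Dispreferred response"},
])

# Step count and learning rate come from the card; the remaining values are assumed.
training_args = DPOConfig(
    output_dir="llama3-dpo-llm-judge",
    max_steps=250,
    learning_rate=2e-4,
    per_device_train_batch_size=1,  # not stated on the card
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL releases
    peft_config=peft_config,
)
trainer.train()
```

Passing `peft_config` lets TRL wrap the base model with the LoRA adapter itself, so only the adapter weights are updated during DPO.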