pyamy committed
Commit 0699fd0 · verified · 1 Parent(s): d781aad

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +34 -56

README.md CHANGED
@@ -1,73 +1,51 @@
  ---
  base_model: meta-llama/Llama-3.2-1B-Instruct
- library_name: peft
- model_name: dpo_llm_judge_model
  tags:
- - base_model:adapter:meta-llama/Llama-3.2-1B-Instruct
  - dpo
- - lora
- - transformers
- - trl
- licence: license
- pipeline_tag: text-generation
  ---

- # Model Card for dpo_llm_judge_model

- This model is a fine-tuned version of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).
- It has been trained using [TRL](https://github.com/huggingface/trl).

- ## Quick start

- ```python
- from transformers import pipeline
-
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="None", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
- ```
-
- ## Training procedure
-
- This model was trained with DPO, a method introduced in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290).
-
- ### Framework versions

- - PEFT 0.17.0
- - TRL: 0.21.0
- - Transformers: 4.55.0
- - Pytorch: 2.5.1+cu121
- - Datasets: 4.0.0
- - Tokenizers: 0.21.4

- ## Citations

- Cite DPO as:

- ```bibtex
- @inproceedings{rafailov2023direct,
-     title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
-     author    = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
-     year      = 2023,
-     booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
-     url       = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
-     editor    = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
- }
- ```

- Cite TRL as:
-
- ```bibtex
- @misc{vonwerra2022trl,
-     title        = {{TRL: Transformer Reinforcement Learning}},
-     author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
-     year         = 2020,
-     journal      = {GitHub repository},
-     publisher    = {GitHub},
-     howpublished = {\url{https://github.com/huggingface/trl}}
- }
- ```
 
  ---
+ license: apache-2.0
  base_model: meta-llama/Llama-3.2-1B-Instruct
  tags:
  - dpo
+ - peft
+ - llama
+ - preference-learning
+ model-index:
+ - name: llama3-dpo-llm judge
+   results: []
  ---

+ # Llama-3.2-1B DPO LLM Judge

+ This model is a fine-tuned version of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) using Direct Preference Optimization (DPO).

+ ## Model Details

+ - **Base Model**: meta-llama/Llama-3.2-1B-Instruct
+ - **Training Method**: Direct Preference Optimization (DPO)
+ - **Preference Source**: LLM Judge
+ - **LoRA Configuration**:
+   - r: 8
+   - alpha: 16
+   - target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
+ - **Training Steps**: 250
+ - **Learning Rate**: 0.0002
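Not part of the commit: a minimal sketch, assuming `peft.LoraConfig` and TRL's `DPOConfig`, of how the hyperparameters listed above might be written down. Only the values stated in the card (r, alpha, target modules, 250 steps, learning rate 0.0002) come from the source; the output directory and beta value are placeholders.

```python
# Sketch of the configuration implied by the Model Details list.
# Everything not commented as "from the card" is an assumption.
from peft import LoraConfig
from trl import DPOConfig

lora_config = LoraConfig(
    r=8,            # LoRA rank, from the card
    lora_alpha=16,  # scaling alpha, from the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # from the card
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="dpo_llm_judge_model",  # placeholder name
    max_steps=250,                     # training steps, from the card
    learning_rate=2e-4,                # 0.0002, from the card
    beta=0.1,                          # TRL default; not stated in the card
)
```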
+ ## Usage

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+
+ base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
+ model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-llm judge")
+
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
+ ```
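Not part of the commit: the snippet above stops at loading the adapter. Continuing from its `model` and `tokenizer`, one way generation might look (the prompt and decoding settings are illustrative):

```python
import torch

# Build a chat-formatted prompt and generate with the adapter-wrapped model.
messages = [{"role": "user", "content": "Summarize the benefits of unit testing."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=128)

# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```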
 
 
 
 
 
+ ## Training Details

+ - Dataset: 50 instructions from LIMA
+ - Responses per instruction: 5
+ - Preference judgment: LLM Judge
+ - Training framework: TRL DPOTrainer
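Not part of the commit: the card names TRL's `DPOTrainer` but ships no training script, so this is only a sketch of how the pieces above might fit together. The toy dataset contents are hypothetical, though the `prompt`/`chosen`/`rejected` column layout is what `DPOTrainer` expects.

```python
from datasets import Dataset
from trl import DPOTrainer

# Hypothetical preference pairs derived from LLM-judge rankings: for each
# LIMA instruction, the judge's preferred response becomes "chosen" and a
# lower-ranked one becomes "rejected".
pairs = Dataset.from_dict({
    "prompt":   ["Explain what overfitting is."],
    "chosen":   ["Overfitting is when a model memorizes training data..."],
    "rejected": ["Overfitting is good."],
})

trainer = DPOTrainer(
    model=base_model,            # base model from the Usage section
    args=training_args,          # DPOConfig sketched under Model Details
    train_dataset=pairs,
    processing_class=tokenizer,  # recent TRL takes the tokenizer here
    peft_config=lora_config,     # LoRA settings from the card
)
trainer.train()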
+ ## Performance

+ See evaluation results in the repository for detailed performance metrics.