pyamy committed · Commit d781aad · verified · 1 Parent(s): a68cdbb

Upload DPO LLM Judge fine-tuned model

Files changed (1)
  1. README.md +56 -34
README.md CHANGED
@@ -1,51 +1,73 @@
  ---
- license: apache-2.0
  base_model: meta-llama/Llama-3.2-1B-Instruct
  tags:
  - dpo
- - peft
- - llama
- - preference-learning
- model-index:
- - name: llama3-dpo-llm judge
-   results: []
  ---

- # Llama-3.2-1B DPO LLM Judge

- This model is a fine-tuned version of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) using Direct Preference Optimization (DPO).

- ## Model Details

- - **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- - **Training Method**: Direct Preference Optimization (DPO)
- - **Preference Source**: LLM Judge
- - **LoRA Configuration**:
-   - r: 8
-   - alpha: 16
-   - target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
- - **Training Steps**: 250
- - **Learning Rate**: 0.0002

- ## Usage

- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from peft import PeftModel

- base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
- model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-llm judge")

- tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
- ```

- ## Training Details

- - Dataset: 50 instructions from LIMA
- - Responses per instruction: 5
- - Preference judgment: LLM Judge
- - Training framework: TRL DPOTrainer

- ## Performance

- See evaluation results in the repository for detailed performance metrics.
  ---
  base_model: meta-llama/Llama-3.2-1B-Instruct
+ library_name: peft
+ model_name: dpo_llm_judge_model
  tags:
+ - base_model:adapter:meta-llama/Llama-3.2-1B-Instruct
  - dpo
+ - lora
+ - transformers
+ - trl
+ licence: license
+ pipeline_tag: text-generation
  ---

+ # Model Card for dpo_llm_judge_model

+ This model is a fine-tuned version of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).
+ It has been trained using [TRL](https://github.com/huggingface/trl).

+ ## Quick start

+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="None", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
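
Note: the autogenerated quick start above was emitted with `model="None"` because the card template was rendered without the final repository id. Below is a minimal sketch of the adapter-loading pattern described in the previous revision of this card; `"pyamy/dpo_llm_judge_model"` is a placeholder repository id, not a confirmed path.

```python
# Minimal sketch: load the base model and apply the DPO LoRA adapter.
# "pyamy/dpo_llm_judge_model" is a placeholder -- substitute the actual repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = PeftModel.from_pretrained(base, "pyamy/dpo_llm_judge_model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Chat-style generation using the Llama 3.2 chat template.
messages = [{"role": "user", "content": "Would you travel to the past or the future, and why?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```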

+ ## Training procedure
+
+ This model was trained with DPO, a method introduced in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290).
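
For orientation, here is a minimal, illustrative sketch of a TRL `DPOTrainer` run under the settings reported in the previous card revision (LoRA r=8, alpha=16 on the attention projections, 250 steps at learning rate 2e-4, LLM-judged preferences over LIMA prompts). The dataset contents and variable names below are invented for the example; this is not the actual training script.

```python
# Illustrative DPO + LoRA training sketch with TRL (not the original training code).
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical preference pairs; the real pairs came from LLM-judged responses to LIMA prompts.
preference_data = Dataset.from_dict({
    "prompt": ["Explain DPO in one sentence."],
    "chosen": ["DPO fits the policy directly to preference pairs without training a separate reward model."],
    "rejected": ["DPO is a kind of database."],
})

# LoRA settings as listed in the earlier card revision.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = DPOConfig(output_dir="dpo_llm_judge_model", max_steps=250, learning_rate=2e-4)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=preference_data,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```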

+ ### Framework versions
+
+ - PEFT: 0.17.0
+ - TRL: 0.21.0
+ - Transformers: 4.55.0
+ - Pytorch: 2.5.1+cu121
+ - Datasets: 4.0.0
+ - Tokenizers: 0.21.4
+
+ ## Citations
+
+ Cite DPO as:
+
+ ```bibtex
+ @inproceedings{rafailov2023direct,
+     title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
+     author    = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
+     year      = 2023,
+     booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
+     url       = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
+     editor    = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
+ }
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+     title        = {{TRL: Transformer Reinforcement Learning}},
+     author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
+     year         = 2020,
+     journal      = {GitHub repository},
+     publisher    = {GitHub},
+     howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```