Blancy committed on
Commit 0e9211a · verified · 1 Parent(s): c4064c4

Model save

Files changed (5)
  1. README.md +67 -0
  2. all_results.json +8 -0
  3. generation_config.json +14 -0
  4. train_results.json +8 -0
  5. trainer_state.json +1204 -0
README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ library_name: transformers
+ model_name: DeepSeek-R1-Distill-Qwen-0.5B-GRPO
+ tags:
+ - generated_from_trainer
+ - trl
+ - grpo
+ licence: license
+ ---
+
+ # Model Card for DeepSeek-R1-Distill-Qwen-0.5B-GRPO
+
+ This model is a fine-tuned version of [None](https://huggingface.co/None).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="Blancy/DeepSeek-R1-Distill-Qwen-0.5B-GRPO", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/224015062-chinese-university-of-hong-kong-shenzhen/huggingface/runs/ylkpu50v)
+
+
+ This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
+
+ ### Framework versions
+
+ - TRL: 0.15.2
+ - Transformers: 4.49.0
+ - Pytorch: 2.5.1
+ - Datasets: 3.3.2
+ - Tokenizers: 0.21.0
+
+ ## Citations
+
+ Cite GRPO as:
+
+ ```bibtex
+ @article{zhihong2024deepseekmath,
+ title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
+ author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
+ year = 2024,
+ eprint = {arXiv:2402.03300},
+ }
+
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
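
For context on the training procedure described in the card above: a GRPO run with TRL 0.15-era APIs can be set up roughly as sketched below. The dataset, reward function, base model, and hyperparameters are illustrative placeholders only (the card's base-model field is None), not the configuration behind this commit; the `max_completion_length` of 2048 simply mirrors the completion lengths logged in trainer_state.json.

```python
# Minimal GRPO sketch with TRL 0.15-style APIs. Everything named here is a
# placeholder assumption, not the actual setup used to produce this checkpoint.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def tag_count_reward(completions, **kwargs):
    # Toy reward: small bonus per reasoning/answer tag present in a completion.
    tags = ("<think>", "</think>", "<answer>", "</answer>")
    return [0.25 * sum(tag in completion for tag in tags) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset

training_args = GRPOConfig(
    output_dir="DeepSeek-R1-Distill-Qwen-0.5B-GRPO",
    learning_rate=1e-6,
    num_train_epochs=1,
    max_completion_length=2048,  # matches the completion lengths logged in trainer_state.json
    logging_steps=1,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder base model; the card lists None
    reward_funcs=tag_count_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

GRPO samples several completions per prompt and weights them by group-relative reward, which is why the log history in trainer_state.json below tracks `reward`, `reward_std`, and the per-component rewards at every step.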
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 0.0,
+ "train_loss": 0.00010365011172895271,
+ "train_runtime": 4988.7462,
+ "train_samples": 1000,
+ "train_samples_per_second": 0.2,
+ "train_steps_per_second": 0.017
+ }
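
The throughput figures above follow directly from the sample count, step count, and runtime; a quick check, with the values copied from this commit's JSON files:

```python
# Quick consistency check of the reported training throughput.
train_runtime_s = 4988.7462   # seconds, from all_results.json / train_results.json
train_samples = 1000
global_steps = 83             # from trainer_state.json

print(round(train_samples / train_runtime_s, 3))  # 0.2   -> train_samples_per_second
print(round(global_steps / train_runtime_s, 3))   # 0.017 -> train_steps_per_second
```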
generation_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+ "bos_token_id": 151643,
+ "do_sample": true,
+ "eos_token_id": [
+ 151645,
+ 151643
+ ],
+ "pad_token_id": 151643,
+ "repetition_penalty": 1.1,
+ "temperature": 0.7,
+ "top_k": 20,
+ "top_p": 0.8,
+ "transformers_version": "4.49.0"
+ }
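
Because generation_config.json ships with the checkpoint, transformers picks these sampling defaults up automatically when the model is loaded; a short sketch (the prompt is arbitrary):

```python
# Illustrative: the committed generation_config.json becomes the default
# sampling configuration when the model is loaded with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Blancy/DeepSeek-R1-Distill-Qwen-0.5B-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(model.generation_config)  # temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.1

inputs = tokenizer("What is 7 * 6?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)  # do_sample=True etc. applied from the config
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```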
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 0.0,
+ "train_loss": 0.00010365011172895271,
+ "train_runtime": 4988.7462,
+ "train_samples": 1000,
+ "train_samples_per_second": 0.2,
+ "train_steps_per_second": 0.017
+ }
trainer_state.json ADDED
@@ -0,0 +1,1204 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 0.9940119760479041,
5
+ "eval_steps": 500,
6
+ "global_step": 83,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "completion_length": 2037.34375,
13
+ "epoch": 0.011976047904191617,
14
+ "grad_norm": 1.4971734285354614,
15
+ "kl": 0.0,
16
+ "learning_rate": 1.111111111111111e-07,
17
+ "loss": -0.0,
18
+ "reward": 0.6054687649011612,
19
+ "reward_std": 0.05388006288558245,
20
+ "rewards/accuracy_reward": 0.5833333432674408,
21
+ "rewards/format_reward": 0.0,
22
+ "rewards/tag_count_reward": 0.02213541674427688,
23
+ "step": 1
24
+ },
25
+ {
26
+ "completion_length": 2048.0,
27
+ "epoch": 0.023952095808383235,
28
+ "grad_norm": 0.2554255723953247,
29
+ "kl": 0.0,
30
+ "learning_rate": 2.222222222222222e-07,
31
+ "loss": -0.0,
32
+ "reward": 0.4388020932674408,
33
+ "reward_std": 0.06019644718617201,
34
+ "rewards/accuracy_reward": 0.4270833507180214,
35
+ "rewards/format_reward": 0.0,
36
+ "rewards/tag_count_reward": 0.011718750232830644,
37
+ "step": 2
38
+ },
39
+ {
40
+ "completion_length": 2041.7239685058594,
41
+ "epoch": 0.03592814371257485,
42
+ "grad_norm": 0.3514683246612549,
43
+ "kl": 0.00010800361633300781,
44
+ "learning_rate": 3.333333333333333e-07,
45
+ "loss": 0.0,
46
+ "reward": 0.4531250251457095,
47
+ "reward_std": 0.0739070875570178,
48
+ "rewards/accuracy_reward": 0.416666679084301,
49
+ "rewards/format_reward": 0.0,
50
+ "rewards/tag_count_reward": 0.03645833465270698,
51
+ "step": 3
52
+ },
53
+ {
54
+ "completion_length": 2046.96875,
55
+ "epoch": 0.04790419161676647,
56
+ "grad_norm": 0.2502233684062958,
57
+ "kl": 0.0001252889633178711,
58
+ "learning_rate": 4.444444444444444e-07,
59
+ "loss": 0.0,
60
+ "reward": 0.5143229415407404,
61
+ "reward_std": 0.03399638505652547,
62
+ "rewards/accuracy_reward": 0.5000000074505806,
63
+ "rewards/format_reward": 0.0,
64
+ "rewards/tag_count_reward": 0.014322917093522847,
65
+ "step": 4
66
+ },
67
+ {
68
+ "completion_length": 2048.0,
69
+ "epoch": 0.059880239520958084,
70
+ "grad_norm": 0.30932390689849854,
71
+ "kl": 0.000125885009765625,
72
+ "learning_rate": 5.555555555555555e-07,
73
+ "loss": 0.0,
74
+ "reward": 0.44661459885537624,
75
+ "reward_std": 0.055322977248579264,
76
+ "rewards/accuracy_reward": 0.416666679084301,
77
+ "rewards/format_reward": 0.0,
78
+ "rewards/tag_count_reward": 0.029947917792014778,
79
+ "step": 5
80
+ },
81
+ {
82
+ "completion_length": 2037.34375,
83
+ "epoch": 0.0718562874251497,
84
+ "grad_norm": 0.9940594434738159,
85
+ "kl": 0.00013077259063720703,
86
+ "learning_rate": 6.666666666666666e-07,
87
+ "loss": 0.0,
88
+ "reward": 0.5338541669771075,
89
+ "reward_std": 0.055845549795776606,
90
+ "rewards/accuracy_reward": 0.5000000074505806,
91
+ "rewards/format_reward": 0.0,
92
+ "rewards/tag_count_reward": 0.03385416732635349,
93
+ "step": 6
94
+ },
95
+ {
96
+ "completion_length": 2043.6823120117188,
97
+ "epoch": 0.08383233532934131,
98
+ "grad_norm": 0.24632015824317932,
99
+ "kl": 0.00012028217315673828,
100
+ "learning_rate": 7.777777777777778e-07,
101
+ "loss": 0.0,
102
+ "reward": 0.677083358168602,
103
+ "reward_std": 0.031139123253524303,
104
+ "rewards/accuracy_reward": 0.666666679084301,
105
+ "rewards/format_reward": 0.0,
106
+ "rewards/tag_count_reward": 0.010416666977107525,
107
+ "step": 7
108
+ },
109
+ {
110
+ "completion_length": 2037.34375,
111
+ "epoch": 0.09580838323353294,
112
+ "grad_norm": 1.0062243938446045,
113
+ "kl": 0.00013971328735351562,
114
+ "learning_rate": 8.888888888888888e-07,
115
+ "loss": 0.0,
116
+ "reward": 0.6093750298023224,
117
+ "reward_std": 0.06374238524585962,
118
+ "rewards/accuracy_reward": 0.5833333507180214,
119
+ "rewards/format_reward": 0.0,
120
+ "rewards/tag_count_reward": 0.026041667442768812,
121
+ "step": 8
122
+ },
123
+ {
124
+ "completion_length": 2047.3177185058594,
125
+ "epoch": 0.10778443113772455,
126
+ "grad_norm": 0.222465381026268,
127
+ "kl": 0.00013458728790283203,
128
+ "learning_rate": 1e-06,
129
+ "loss": 0.0,
130
+ "reward": 0.5143229439854622,
131
+ "reward_std": 0.02985687693580985,
132
+ "rewards/accuracy_reward": 0.5000000074505806,
133
+ "rewards/format_reward": 0.0,
134
+ "rewards/tag_count_reward": 0.014322917093522847,
135
+ "step": 9
136
+ },
137
+ {
138
+ "completion_length": 2048.0,
139
+ "epoch": 0.11976047904191617,
140
+ "grad_norm": 0.23234856128692627,
141
+ "kl": 0.00010788440704345703,
142
+ "learning_rate": 9.995945347921067e-07,
143
+ "loss": 0.0,
144
+ "reward": 0.25911459082271904,
145
+ "reward_std": 0.029856876470148563,
146
+ "rewards/accuracy_reward": 0.2500000074505806,
147
+ "rewards/format_reward": 0.0,
148
+ "rewards/tag_count_reward": 0.009114583488553762,
149
+ "step": 10
150
+ },
151
+ {
152
+ "completion_length": 2044.1927185058594,
153
+ "epoch": 0.1317365269461078,
154
+ "grad_norm": 0.28732362389564514,
155
+ "kl": 0.00012230873107910156,
156
+ "learning_rate": 9.983788698441369e-07,
157
+ "loss": 0.0,
158
+ "reward": 0.5950520932674408,
159
+ "reward_std": 0.040273543912917376,
160
+ "rewards/accuracy_reward": 0.5833333432674408,
161
+ "rewards/format_reward": 0.0,
162
+ "rewards/tag_count_reward": 0.011718750349245965,
163
+ "step": 11
164
+ },
165
+ {
166
+ "completion_length": 2043.7448120117188,
167
+ "epoch": 0.1437125748502994,
168
+ "grad_norm": 0.24524293839931488,
169
+ "kl": 0.00011754035949707031,
170
+ "learning_rate": 9.963551958664945e-07,
171
+ "loss": 0.0,
172
+ "reward": 0.35286459827329963,
173
+ "reward_std": 0.036272107157856226,
174
+ "rewards/accuracy_reward": 0.3333333432674408,
175
+ "rewards/format_reward": 0.0,
176
+ "rewards/tag_count_reward": 0.019531250349245965,
177
+ "step": 12
178
+ },
179
+ {
180
+ "completion_length": 2048.0,
181
+ "epoch": 0.15568862275449102,
182
+ "grad_norm": 0.21733032166957855,
183
+ "kl": 0.0001404285430908203,
184
+ "learning_rate": 9.935271596564688e-07,
185
+ "loss": 0.0,
186
+ "reward": 0.6940104216337204,
187
+ "reward_std": 0.03566407039761543,
188
+ "rewards/accuracy_reward": 0.666666679084301,
189
+ "rewards/format_reward": 0.0,
190
+ "rewards/tag_count_reward": 0.027343751396983862,
191
+ "step": 13
192
+ },
193
+ {
194
+ "completion_length": 2046.890625,
195
+ "epoch": 0.16766467065868262,
196
+ "grad_norm": 0.32550087571144104,
197
+ "kl": 0.00013780593872070312,
198
+ "learning_rate": 9.898998575264588e-07,
199
+ "loss": 0.0,
200
+ "reward": 0.45572917722165585,
201
+ "reward_std": 0.07509249821305275,
202
+ "rewards/accuracy_reward": 0.416666679084301,
203
+ "rewards/format_reward": 0.0,
204
+ "rewards/tag_count_reward": 0.039062501629814506,
205
+ "step": 14
206
+ },
207
+ {
208
+ "completion_length": 2048.0,
209
+ "epoch": 0.17964071856287425,
210
+ "grad_norm": 0.34825780987739563,
211
+ "kl": 0.0001423358917236328,
212
+ "learning_rate": 9.854798261200746e-07,
213
+ "loss": 0.0,
214
+ "reward": 0.47526043467223644,
215
+ "reward_std": 0.08674583956599236,
216
+ "rewards/accuracy_reward": 0.416666679084301,
217
+ "rewards/format_reward": 0.0,
218
+ "rewards/tag_count_reward": 0.058593750931322575,
219
+ "step": 15
220
+ },
221
+ {
222
+ "completion_length": 2048.0,
223
+ "epoch": 0.19161676646706588,
224
+ "grad_norm": 0.29791343212127686,
225
+ "kl": 0.0001347064971923828,
226
+ "learning_rate": 9.80275030632663e-07,
227
+ "loss": 0.0,
228
+ "reward": 0.608072929084301,
229
+ "reward_std": 0.05152899120002985,
230
+ "rewards/accuracy_reward": 0.5833333507180214,
231
+ "rewards/format_reward": 0.0,
232
+ "rewards/tag_count_reward": 0.02473958395421505,
233
+ "step": 16
234
+ },
235
+ {
236
+ "completion_length": 2048.0,
237
+ "epoch": 0.20359281437125748,
238
+ "grad_norm": 0.24608102440834045,
239
+ "kl": 0.00013887882232666016,
240
+ "learning_rate": 9.742948504574879e-07,
241
+ "loss": 0.0,
242
+ "reward": 0.5247395932674408,
243
+ "reward_std": 0.04357585031539202,
244
+ "rewards/accuracy_reward": 0.5000000074505806,
245
+ "rewards/format_reward": 0.0,
246
+ "rewards/tag_count_reward": 0.024739584419876337,
247
+ "step": 17
248
+ },
249
+ {
250
+ "completion_length": 2048.0,
251
+ "epoch": 0.2155688622754491,
252
+ "grad_norm": 0.3837389647960663,
253
+ "kl": 0.00015234947204589844,
254
+ "learning_rate": 9.675500622834293e-07,
255
+ "loss": 0.0,
256
+ "reward": 0.36588541977107525,
257
+ "reward_std": 0.07624713983386755,
258
+ "rewards/accuracy_reward": 0.3333333432674408,
259
+ "rewards/format_reward": 0.0,
260
+ "rewards/tag_count_reward": 0.03255208441987634,
261
+ "step": 18
262
+ },
263
+ {
264
+ "completion_length": 2048.0,
265
+ "epoch": 0.2275449101796407,
266
+ "grad_norm": 0.3275865316390991,
267
+ "kl": 0.0001983642578125,
268
+ "learning_rate": 9.60052820674661e-07,
269
+ "loss": 0.0,
270
+ "reward": 0.6171875102445483,
271
+ "reward_std": 0.054537888150662184,
272
+ "rewards/accuracy_reward": 0.5833333432674408,
273
+ "rewards/format_reward": 0.0,
274
+ "rewards/tag_count_reward": 0.03385416732635349,
275
+ "step": 19
276
+ },
277
+ {
278
+ "completion_length": 2048.0,
279
+ "epoch": 0.23952095808383234,
280
+ "grad_norm": 0.3781125247478485,
281
+ "kl": 0.0002334117889404297,
282
+ "learning_rate": 9.518166361673058e-07,
283
+ "loss": 0.0,
284
+ "reward": 0.4023437649011612,
285
+ "reward_std": 0.08120491355657578,
286
+ "rewards/accuracy_reward": 0.3333333432674408,
287
+ "rewards/format_reward": 0.0,
288
+ "rewards/tag_count_reward": 0.06901041883975267,
289
+ "step": 20
290
+ },
291
+ {
292
+ "completion_length": 2048.0,
293
+ "epoch": 0.25149700598802394,
294
+ "grad_norm": 0.2996104955673218,
295
+ "kl": 0.0002651214599609375,
296
+ "learning_rate": 9.428563509225346e-07,
297
+ "loss": 0.0,
298
+ "reward": 0.4622395932674408,
299
+ "reward_std": 0.0660695880651474,
300
+ "rewards/accuracy_reward": 0.416666679084301,
301
+ "rewards/format_reward": 0.0,
302
+ "rewards/tag_count_reward": 0.04557291674427688,
303
+ "step": 21
304
+ },
305
+ {
306
+ "completion_length": 2048.0,
307
+ "epoch": 0.2634730538922156,
308
+ "grad_norm": 0.38923388719558716,
309
+ "kl": 0.0003142356872558594,
310
+ "learning_rate": 9.3318811197999e-07,
311
+ "loss": 0.0,
312
+ "reward": 0.6471354365348816,
313
+ "reward_std": 0.08566952683031559,
314
+ "rewards/accuracy_reward": 0.5833333432674408,
315
+ "rewards/format_reward": 0.0,
316
+ "rewards/tag_count_reward": 0.0638020858168602,
317
+ "step": 22
318
+ },
319
+ {
320
+ "completion_length": 2048.0,
321
+ "epoch": 0.2754491017964072,
322
+ "grad_norm": 0.3591609299182892,
323
+ "kl": 0.00036334991455078125,
324
+ "learning_rate": 9.228293421597289e-07,
325
+ "loss": 0.0,
326
+ "reward": 0.2200520895421505,
327
+ "reward_std": 0.09116558637470007,
328
+ "rewards/accuracy_reward": 0.1666666716337204,
329
+ "rewards/format_reward": 0.0,
330
+ "rewards/tag_count_reward": 0.0533854179084301,
331
+ "step": 23
332
+ },
333
+ {
334
+ "completion_length": 2048.0,
335
+ "epoch": 0.2874251497005988,
336
+ "grad_norm": 0.33370983600616455,
337
+ "kl": 0.0004825592041015625,
338
+ "learning_rate": 9.117987086651232e-07,
339
+ "loss": 0.0,
340
+ "reward": 0.49869792722165585,
341
+ "reward_std": 0.08588295057415962,
342
+ "rewards/accuracy_reward": 0.416666679084301,
343
+ "rewards/format_reward": 0.0,
344
+ "rewards/tag_count_reward": 0.08203125465661287,
345
+ "step": 24
346
+ },
347
+ {
348
+ "completion_length": 2048.0,
349
+ "epoch": 0.2994011976047904,
350
+ "grad_norm": 0.3089774250984192,
351
+ "kl": 0.0005397796630859375,
352
+ "learning_rate": 9.001160894432978e-07,
353
+ "loss": 0.0,
354
+ "reward": 0.5664062723517418,
355
+ "reward_std": 0.08240717835724354,
356
+ "rewards/accuracy_reward": 0.5000000149011612,
357
+ "rewards/format_reward": 0.0,
358
+ "rewards/tag_count_reward": 0.06640625093132257,
359
+ "step": 25
360
+ },
361
+ {
362
+ "completion_length": 2047.1198120117188,
363
+ "epoch": 0.31137724550898205,
364
+ "grad_norm": 0.39812523126602173,
365
+ "kl": 0.000537872314453125,
366
+ "learning_rate": 8.878025373637259e-07,
367
+ "loss": 0.0,
368
+ "reward": 0.5924479402601719,
369
+ "reward_std": 0.11753918416798115,
370
+ "rewards/accuracy_reward": 0.5000000074505806,
371
+ "rewards/format_reward": 0.0,
372
+ "rewards/tag_count_reward": 0.09244791977107525,
373
+ "step": 26
374
+ },
375
+ {
376
+ "completion_length": 2040.3073120117188,
377
+ "epoch": 0.32335329341317365,
378
+ "grad_norm": 0.3853413760662079,
379
+ "kl": 0.0007343292236328125,
380
+ "learning_rate": 8.748802422795359e-07,
381
+ "loss": 0.0,
382
+ "reward": 0.7500000149011612,
383
+ "reward_std": 0.08755372650921345,
384
+ "rewards/accuracy_reward": 0.666666679084301,
385
+ "rewards/format_reward": 0.0,
386
+ "rewards/tag_count_reward": 0.08333333395421505,
387
+ "step": 27
388
+ },
389
+ {
390
+ "completion_length": 2048.0,
391
+ "epoch": 0.33532934131736525,
392
+ "grad_norm": 0.4009532034397125,
393
+ "kl": 0.0008382797241210938,
394
+ "learning_rate": 8.613724910398959e-07,
395
+ "loss": 0.0,
396
+ "reward": 0.611979179084301,
397
+ "reward_std": 0.10698455851525068,
398
+ "rewards/accuracy_reward": 0.5000000149011612,
399
+ "rewards/format_reward": 0.0,
400
+ "rewards/tag_count_reward": 0.11197916977107525,
401
+ "step": 28
402
+ },
403
+ {
404
+ "completion_length": 2048.0,
405
+ "epoch": 0.3473053892215569,
406
+ "grad_norm": 0.36610516905784607,
407
+ "kl": 0.0007114410400390625,
408
+ "learning_rate": 8.473036255255366e-07,
409
+ "loss": 0.0,
410
+ "reward": 0.36197917349636555,
411
+ "reward_std": 0.09994817152619362,
412
+ "rewards/accuracy_reward": 0.2500000074505806,
413
+ "rewards/format_reward": 0.0,
414
+ "rewards/tag_count_reward": 0.1119791716337204,
415
+ "step": 29
416
+ },
417
+ {
418
+ "completion_length": 2048.0,
419
+ "epoch": 0.3592814371257485,
420
+ "grad_norm": 0.3826785087585449,
421
+ "kl": 0.0009069442749023438,
422
+ "learning_rate": 8.32698998783039e-07,
423
+ "loss": 0.0,
424
+ "reward": 0.3750000074505806,
425
+ "reward_std": 0.11406980641186237,
426
+ "rewards/accuracy_reward": 0.2500000074505806,
427
+ "rewards/format_reward": 0.0,
428
+ "rewards/tag_count_reward": 0.1250000037252903,
429
+ "step": 30
430
+ },
431
+ {
432
+ "completion_length": 2048.0,
433
+ "epoch": 0.3712574850299401,
434
+ "grad_norm": 0.47378554940223694,
435
+ "kl": 0.00110626220703125,
436
+ "learning_rate": 8.17584929336929e-07,
437
+ "loss": 0.0,
438
+ "reward": 0.6601562798023224,
439
+ "reward_std": 0.12305041775107384,
440
+ "rewards/accuracy_reward": 0.5052083432674408,
441
+ "rewards/format_reward": 0.0,
442
+ "rewards/tag_count_reward": 0.1549479216337204,
443
+ "step": 31
444
+ },
445
+ {
446
+ "completion_length": 2048.0,
447
+ "epoch": 0.38323353293413176,
448
+ "grad_norm": 0.40673568844795227,
449
+ "kl": 0.0009860992431640625,
450
+ "learning_rate": 8.019886537619179e-07,
451
+ "loss": 0.0,
452
+ "reward": 0.638020858168602,
453
+ "reward_std": 0.11014635302126408,
454
+ "rewards/accuracy_reward": 0.5000000149011612,
455
+ "rewards/format_reward": 0.0,
456
+ "rewards/tag_count_reward": 0.13802083395421505,
457
+ "step": 32
458
+ },
459
+ {
460
+ "completion_length": 2041.75,
461
+ "epoch": 0.39520958083832336,
462
+ "grad_norm": 0.40154603123664856,
463
+ "kl": 0.0013885498046875,
464
+ "learning_rate": 7.859382776007543e-07,
465
+ "loss": 0.0001,
466
+ "reward": 0.5377604216337204,
467
+ "reward_std": 0.12584633566439152,
468
+ "rewards/accuracy_reward": 0.4166666716337204,
469
+ "rewards/format_reward": 0.0,
470
+ "rewards/tag_count_reward": 0.1210937537252903,
471
+ "step": 33
472
+ },
473
+ {
474
+ "completion_length": 2048.0,
475
+ "epoch": 0.40718562874251496,
476
+ "grad_norm": 0.3680473864078522,
477
+ "kl": 0.0020599365234375,
478
+ "learning_rate": 7.694627247161356e-07,
479
+ "loss": 0.0001,
480
+ "reward": 0.5598958544433117,
481
+ "reward_std": 0.10915113240480423,
482
+ "rewards/accuracy_reward": 0.416666679084301,
483
+ "rewards/format_reward": 0.0,
484
+ "rewards/tag_count_reward": 0.1432291716337204,
485
+ "step": 34
486
+ },
487
+ {
488
+ "completion_length": 2047.0,
489
+ "epoch": 0.41916167664670656,
490
+ "grad_norm": 0.4365445375442505,
491
+ "kl": 0.0013685226440429688,
492
+ "learning_rate": 7.525916851679529e-07,
493
+ "loss": 0.0001,
494
+ "reward": 0.5781250260770321,
495
+ "reward_std": 0.1131261233240366,
496
+ "rewards/accuracy_reward": 0.416666679084301,
497
+ "rewards/format_reward": 0.0,
498
+ "rewards/tag_count_reward": 0.16145833767950535,
499
+ "step": 35
500
+ },
501
+ {
502
+ "completion_length": 2048.0,
503
+ "epoch": 0.4311377245508982,
504
+ "grad_norm": 0.40593045949935913,
505
+ "kl": 0.001617431640625,
506
+ "learning_rate": 7.353555617097967e-07,
507
+ "loss": 0.0001,
508
+ "reward": 0.7057291828095913,
509
+ "reward_std": 0.09066728875041008,
510
+ "rewards/accuracy_reward": 0.5000000149011612,
511
+ "rewards/format_reward": 0.0,
512
+ "rewards/tag_count_reward": 0.2057291716337204,
513
+ "step": 36
514
+ },
515
+ {
516
+ "completion_length": 2047.9323120117188,
517
+ "epoch": 0.4431137724550898,
518
+ "grad_norm": 0.35417431592941284,
519
+ "kl": 0.001926422119140625,
520
+ "learning_rate": 7.177854150011389e-07,
521
+ "loss": 0.0001,
522
+ "reward": 0.5872395932674408,
523
+ "reward_std": 0.09160411357879639,
524
+ "rewards/accuracy_reward": 0.416666679084301,
525
+ "rewards/format_reward": 0.0,
526
+ "rewards/tag_count_reward": 0.1705729216337204,
527
+ "step": 37
528
+ },
529
+ {
530
+ "completion_length": 2048.0,
531
+ "epoch": 0.4550898203592814,
532
+ "grad_norm": 0.36102283000946045,
533
+ "kl": 0.0015583038330078125,
534
+ "learning_rate": 6.999129076339259e-07,
535
+ "loss": 0.0001,
536
+ "reward": 0.5351562611758709,
537
+ "reward_std": 0.13301999680697918,
538
+ "rewards/accuracy_reward": 0.3437500149011612,
539
+ "rewards/format_reward": 0.0,
540
+ "rewards/tag_count_reward": 0.1914062537252903,
541
+ "step": 38
542
+ },
543
+ {
544
+ "completion_length": 2048.0,
545
+ "epoch": 0.46706586826347307,
546
+ "grad_norm": 0.3780408799648285,
547
+ "kl": 0.002948760986328125,
548
+ "learning_rate": 6.817702470744477e-07,
549
+ "loss": 0.0001,
550
+ "reward": 0.6627604477107525,
551
+ "reward_std": 0.09685690514743328,
552
+ "rewards/accuracy_reward": 0.5000000074505806,
553
+ "rewards/format_reward": 0.0,
554
+ "rewards/tag_count_reward": 0.1627604216337204,
555
+ "step": 39
556
+ },
557
+ {
558
+ "completion_length": 2048.0,
559
+ "epoch": 0.47904191616766467,
560
+ "grad_norm": 0.4100407361984253,
561
+ "kl": 0.0018138885498046875,
562
+ "learning_rate": 6.633901276233064e-07,
563
+ "loss": 0.0001,
564
+ "reward": 0.7291666716337204,
565
+ "reward_std": 0.10320629552006721,
566
+ "rewards/accuracy_reward": 0.5833333507180214,
567
+ "rewards/format_reward": 0.0,
568
+ "rewards/tag_count_reward": 0.1458333395421505,
569
+ "step": 40
570
+ },
571
+ {
572
+ "completion_length": 2048.0,
573
+ "epoch": 0.49101796407185627,
574
+ "grad_norm": 0.3577154278755188,
575
+ "kl": 0.0021648406982421875,
576
+ "learning_rate": 6.448056714980767e-07,
577
+ "loss": 0.0001,
578
+ "reward": 0.5182291828095913,
579
+ "reward_std": 0.10152745991945267,
580
+ "rewards/accuracy_reward": 0.3333333432674408,
581
+ "rewards/format_reward": 0.0,
582
+ "rewards/tag_count_reward": 0.1848958395421505,
583
+ "step": 41
584
+ },
585
+ {
586
+ "completion_length": 2046.1041870117188,
587
+ "epoch": 0.5029940119760479,
588
+ "grad_norm": 0.29059839248657227,
589
+ "kl": 0.0022754669189453125,
590
+ "learning_rate": 6.260503691448321e-07,
591
+ "loss": 0.0001,
592
+ "reward": 0.9713542014360428,
593
+ "reward_std": 0.05874503217637539,
594
+ "rewards/accuracy_reward": 0.7500000149011612,
595
+ "rewards/format_reward": 0.0,
596
+ "rewards/tag_count_reward": 0.2213541716337204,
597
+ "step": 42
598
+ },
599
+ {
600
+ "completion_length": 2048.0,
601
+ "epoch": 0.5149700598802395,
602
+ "grad_norm": 0.33205100893974304,
603
+ "kl": 0.002460479736328125,
604
+ "learning_rate": 6.071580188860954e-07,
605
+ "loss": 0.0001,
606
+ "reward": 0.7005208432674408,
607
+ "reward_std": 0.07212240621447563,
608
+ "rewards/accuracy_reward": 0.5000000149011612,
609
+ "rewards/format_reward": 0.0,
610
+ "rewards/tag_count_reward": 0.2005208358168602,
611
+ "step": 43
612
+ },
613
+ {
614
+ "completion_length": 2048.0,
615
+ "epoch": 0.5269461077844312,
616
+ "grad_norm": 0.4120350480079651,
617
+ "kl": 0.002471923828125,
618
+ "learning_rate": 5.881626660139791e-07,
619
+ "loss": 0.0001,
620
+ "reward": 0.699218787252903,
621
+ "reward_std": 0.09029853250831366,
622
+ "rewards/accuracy_reward": 0.5000000074505806,
623
+ "rewards/format_reward": 0.0,
624
+ "rewards/tag_count_reward": 0.1992187574505806,
625
+ "step": 44
626
+ },
627
+ {
628
+ "completion_length": 2048.0,
629
+ "epoch": 0.5389221556886228,
630
+ "grad_norm": 0.3548724949359894,
631
+ "kl": 0.0022029876708984375,
632
+ "learning_rate": 5.690985414382668e-07,
633
+ "loss": 0.0001,
634
+ "reward": 0.6106770932674408,
635
+ "reward_std": 0.09216992743313313,
636
+ "rewards/accuracy_reward": 0.416666679084301,
637
+ "rewards/format_reward": 0.0,
638
+ "rewards/tag_count_reward": 0.1940104253590107,
639
+ "step": 45
640
+ },
641
+ {
642
+ "completion_length": 2048.0,
643
+ "epoch": 0.5508982035928144,
644
+ "grad_norm": 0.3445223569869995,
645
+ "kl": 0.002902984619140625,
646
+ "learning_rate": 5.5e-07,
647
+ "loss": 0.0001,
648
+ "reward": 0.5507812574505806,
649
+ "reward_std": 0.09265115670859814,
650
+ "rewards/accuracy_reward": 0.3333333432674408,
651
+ "rewards/format_reward": 0.0,
652
+ "rewards/tag_count_reward": 0.2174479253590107,
653
+ "step": 46
654
+ },
655
+ {
656
+ "completion_length": 2048.0,
657
+ "epoch": 0.562874251497006,
658
+ "grad_norm": 0.3690991997718811,
659
+ "kl": 0.005420684814453125,
660
+ "learning_rate": 5.309014585617334e-07,
661
+ "loss": 0.0002,
662
+ "reward": 0.3554687649011612,
663
+ "reward_std": 0.10121702961623669,
664
+ "rewards/accuracy_reward": 0.17187500512227416,
665
+ "rewards/format_reward": 0.0,
666
+ "rewards/tag_count_reward": 0.1835937537252903,
667
+ "step": 47
668
+ },
669
+ {
670
+ "completion_length": 2048.0,
671
+ "epoch": 0.5748502994011976,
672
+ "grad_norm": 0.33994194865226746,
673
+ "kl": 0.003192901611328125,
674
+ "learning_rate": 5.11837333986021e-07,
675
+ "loss": 0.0001,
676
+ "reward": 1.0455729365348816,
677
+ "reward_std": 0.11090282909572124,
678
+ "rewards/accuracy_reward": 0.8385416716337204,
679
+ "rewards/format_reward": 0.0,
680
+ "rewards/tag_count_reward": 0.2070312537252903,
681
+ "step": 48
682
+ },
683
+ {
684
+ "completion_length": 2043.703125,
685
+ "epoch": 0.5868263473053892,
686
+ "grad_norm": 0.2940013110637665,
687
+ "kl": 0.00350189208984375,
688
+ "learning_rate": 4.928419811139045e-07,
689
+ "loss": 0.0001,
690
+ "reward": 0.569010429084301,
691
+ "reward_std": 0.06290155602619052,
692
+ "rewards/accuracy_reward": 0.3333333432674408,
693
+ "rewards/format_reward": 0.0,
694
+ "rewards/tag_count_reward": 0.2356770895421505,
695
+ "step": 49
696
+ },
697
+ {
698
+ "completion_length": 2042.140625,
699
+ "epoch": 0.5988023952095808,
700
+ "grad_norm": 0.3399907648563385,
701
+ "kl": 0.002948760986328125,
702
+ "learning_rate": 4.739496308551679e-07,
703
+ "loss": 0.0001,
704
+ "reward": 0.7213541865348816,
705
+ "reward_std": 0.10191570967435837,
706
+ "rewards/accuracy_reward": 0.510416679084301,
707
+ "rewards/format_reward": 0.0,
708
+ "rewards/tag_count_reward": 0.2109375,
709
+ "step": 50
710
+ },
711
+ {
712
+ "completion_length": 2048.0,
713
+ "epoch": 0.6107784431137725,
714
+ "grad_norm": 0.3221840262413025,
715
+ "kl": 0.003246307373046875,
716
+ "learning_rate": 4.551943285019233e-07,
717
+ "loss": 0.0001,
718
+ "reward": 0.635416679084301,
719
+ "reward_std": 0.06191476574167609,
720
+ "rewards/accuracy_reward": 0.416666679084301,
721
+ "rewards/format_reward": 0.0,
722
+ "rewards/tag_count_reward": 0.2187500074505806,
723
+ "step": 51
724
+ },
725
+ {
726
+ "completion_length": 2048.0,
727
+ "epoch": 0.6227544910179641,
728
+ "grad_norm": 0.1775510013103485,
729
+ "kl": 0.003017425537109375,
730
+ "learning_rate": 4.3660987237669377e-07,
731
+ "loss": 0.0001,
732
+ "reward": 0.575520858168602,
733
+ "reward_std": 0.01973361661657691,
734
+ "rewards/accuracy_reward": 0.3333333432674408,
735
+ "rewards/format_reward": 0.0,
736
+ "rewards/tag_count_reward": 0.2421875037252903,
737
+ "step": 52
738
+ },
739
+ {
740
+ "completion_length": 2048.0,
741
+ "epoch": 0.6347305389221557,
742
+ "grad_norm": 0.3201664686203003,
743
+ "kl": 0.003948211669921875,
744
+ "learning_rate": 4.182297529255524e-07,
745
+ "loss": 0.0002,
746
+ "reward": 0.6523437798023224,
747
+ "reward_std": 0.07445824518799782,
748
+ "rewards/accuracy_reward": 0.416666679084301,
749
+ "rewards/format_reward": 0.0,
750
+ "rewards/tag_count_reward": 0.2356770895421505,
751
+ "step": 53
752
+ },
753
+ {
754
+ "completion_length": 2048.0,
755
+ "epoch": 0.6467065868263473,
756
+ "grad_norm": 0.39619770646095276,
757
+ "kl": 0.003509521484375,
758
+ "learning_rate": 4.0008709236607405e-07,
759
+ "loss": 0.0001,
760
+ "reward": 0.645833358168602,
761
+ "reward_std": 0.05563760735094547,
762
+ "rewards/accuracy_reward": 0.416666679084301,
763
+ "rewards/format_reward": 0.0,
764
+ "rewards/tag_count_reward": 0.2291666679084301,
765
+ "step": 54
766
+ },
767
+ {
768
+ "completion_length": 2048.0,
769
+ "epoch": 0.6586826347305389,
770
+ "grad_norm": 0.29894304275512695,
771
+ "kl": 0.00421905517578125,
772
+ "learning_rate": 3.8221458499886115e-07,
773
+ "loss": 0.0002,
774
+ "reward": 0.731770858168602,
775
+ "reward_std": 0.05632513063028455,
776
+ "rewards/accuracy_reward": 0.5000000149011612,
777
+ "rewards/format_reward": 0.0,
778
+ "rewards/tag_count_reward": 0.2317708358168602,
779
+ "step": 55
780
+ },
781
+ {
782
+ "completion_length": 2048.0,
783
+ "epoch": 0.6706586826347305,
784
+ "grad_norm": 0.3364109992980957,
785
+ "kl": 0.005672454833984375,
786
+ "learning_rate": 3.646444382902033e-07,
787
+ "loss": 0.0002,
788
+ "reward": 0.4804687649011612,
789
+ "reward_std": 0.052780346013605595,
790
+ "rewards/accuracy_reward": 0.2500000074505806,
791
+ "rewards/format_reward": 0.0,
792
+ "rewards/tag_count_reward": 0.23046875,
793
+ "step": 56
794
+ },
795
+ {
796
+ "completion_length": 2048.0,
797
+ "epoch": 0.6826347305389222,
798
+ "grad_norm": 0.2491559237241745,
799
+ "kl": 0.003627777099609375,
800
+ "learning_rate": 3.474083148320469e-07,
801
+ "loss": 0.0001,
802
+ "reward": 0.7434895895421505,
803
+ "reward_std": 0.04665324650704861,
804
+ "rewards/accuracy_reward": 0.5000000074505806,
805
+ "rewards/format_reward": 0.0,
806
+ "rewards/tag_count_reward": 0.2434895895421505,
807
+ "step": 57
808
+ },
809
+ {
810
+ "completion_length": 2048.0,
811
+ "epoch": 0.6946107784431138,
812
+ "grad_norm": 0.29894545674324036,
813
+ "kl": 0.0038604736328125,
814
+ "learning_rate": 3.3053727528386457e-07,
815
+ "loss": 0.0002,
816
+ "reward": 0.7096354216337204,
817
+ "reward_std": 0.0567871811799705,
818
+ "rewards/accuracy_reward": 0.5000000149011612,
819
+ "rewards/format_reward": 0.0,
820
+ "rewards/tag_count_reward": 0.2096354253590107,
821
+ "step": 58
822
+ },
823
+ {
824
+ "completion_length": 2048.0,
825
+ "epoch": 0.7065868263473054,
826
+ "grad_norm": 0.26200374960899353,
827
+ "kl": 0.003528594970703125,
828
+ "learning_rate": 3.140617223992458e-07,
829
+ "loss": 0.0001,
830
+ "reward": 0.9895833432674408,
831
+ "reward_std": 0.05609210580587387,
832
+ "rewards/accuracy_reward": 0.7500000149011612,
833
+ "rewards/format_reward": 0.0,
834
+ "rewards/tag_count_reward": 0.2395833358168602,
835
+ "step": 59
836
+ },
837
+ {
838
+ "completion_length": 2043.1666870117188,
839
+ "epoch": 0.718562874251497,
840
+ "grad_norm": 0.32477909326553345,
841
+ "kl": 0.00547027587890625,
842
+ "learning_rate": 2.980113462380821e-07,
843
+ "loss": 0.0002,
844
+ "reward": 0.5572916865348816,
845
+ "reward_std": 0.060260336846113205,
846
+ "rewards/accuracy_reward": 0.3333333358168602,
847
+ "rewards/format_reward": 0.0,
848
+ "rewards/tag_count_reward": 0.2239583358168602,
849
+ "step": 60
850
+ },
851
+ {
852
+ "completion_length": 2048.0,
853
+ "epoch": 0.7305389221556886,
854
+ "grad_norm": 0.2725842595100403,
855
+ "kl": 0.003742218017578125,
856
+ "learning_rate": 2.82415070663071e-07,
857
+ "loss": 0.0001,
858
+ "reward": 0.7434895932674408,
859
+ "reward_std": 0.04541819915175438,
860
+ "rewards/accuracy_reward": 0.5000000149011612,
861
+ "rewards/format_reward": 0.0,
862
+ "rewards/tag_count_reward": 0.2434895895421505,
863
+ "step": 61
864
+ },
865
+ {
866
+ "completion_length": 2048.0,
867
+ "epoch": 0.7425149700598802,
868
+ "grad_norm": 0.3032469153404236,
869
+ "kl": 0.003986358642578125,
870
+ "learning_rate": 2.673010012169609e-07,
871
+ "loss": 0.0002,
872
+ "reward": 0.5559895895421505,
873
+ "reward_std": 0.0574858826585114,
874
+ "rewards/accuracy_reward": 0.3333333432674408,
875
+ "rewards/format_reward": 0.0,
876
+ "rewards/tag_count_reward": 0.2226562574505806,
877
+ "step": 62
878
+ },
879
+ {
880
+ "completion_length": 2048.0,
881
+ "epoch": 0.7544910179640718,
882
+ "grad_norm": 0.2664555311203003,
883
+ "kl": 0.00446319580078125,
884
+ "learning_rate": 2.5269637447446345e-07,
885
+ "loss": 0.0002,
886
+ "reward": 0.559895858168602,
887
+ "reward_std": 0.05363978538662195,
888
+ "rewards/accuracy_reward": 0.3333333432674408,
889
+ "rewards/format_reward": 0.0,
890
+ "rewards/tag_count_reward": 0.2265625037252903,
891
+ "step": 63
892
+ },
893
+ {
894
+ "completion_length": 2048.0,
895
+ "epoch": 0.7664670658682635,
896
+ "grad_norm": 0.2789634168148041,
897
+ "kl": 0.004024505615234375,
898
+ "learning_rate": 2.3862750896010425e-07,
899
+ "loss": 0.0002,
900
+ "reward": 0.5755208544433117,
901
+ "reward_std": 0.04511032486334443,
902
+ "rewards/accuracy_reward": 0.3333333432674408,
903
+ "rewards/format_reward": 0.0,
904
+ "rewards/tag_count_reward": 0.2421875074505806,
905
+ "step": 64
906
+ },
907
+ {
908
+ "completion_length": 2048.0,
909
+ "epoch": 0.7784431137724551,
910
+ "grad_norm": 0.3104996383190155,
911
+ "kl": 0.003612518310546875,
912
+ "learning_rate": 2.25119757720464e-07,
913
+ "loss": 0.0001,
914
+ "reward": 0.984375,
915
+ "reward_std": 0.06761277234181762,
916
+ "rewards/accuracy_reward": 0.7500000074505806,
917
+ "rewards/format_reward": 0.0,
918
+ "rewards/tag_count_reward": 0.2343750037252903,
919
+ "step": 65
920
+ },
921
+ {
922
+ "completion_length": 2048.0,
923
+ "epoch": 0.7904191616766467,
924
+ "grad_norm": 0.3051362931728363,
925
+ "kl": 0.0042266845703125,
926
+ "learning_rate": 2.12197462636274e-07,
927
+ "loss": 0.0002,
928
+ "reward": 0.6445312760770321,
929
+ "reward_std": 0.05897808913141489,
930
+ "rewards/accuracy_reward": 0.416666679084301,
931
+ "rewards/format_reward": 0.0,
932
+ "rewards/tag_count_reward": 0.2278645858168602,
933
+ "step": 66
934
+ },
935
+ {
936
+ "completion_length": 2048.0,
937
+ "epoch": 0.8023952095808383,
938
+ "grad_norm": 0.28398770093917847,
939
+ "kl": 0.004734039306640625,
940
+ "learning_rate": 1.998839105567023e-07,
941
+ "loss": 0.0002,
942
+ "reward": 0.5716145932674408,
943
+ "reward_std": 0.0617390270344913,
944
+ "rewards/accuracy_reward": 0.3333333432674408,
945
+ "rewards/format_reward": 0.0,
946
+ "rewards/tag_count_reward": 0.2382812537252903,
947
+ "step": 67
948
+ },
949
+ {
950
+ "completion_length": 2042.8333435058594,
951
+ "epoch": 0.8143712574850299,
952
+ "grad_norm": 0.3042202591896057,
953
+ "kl": 0.004070281982421875,
954
+ "learning_rate": 1.882012913348768e-07,
955
+ "loss": 0.0002,
956
+ "reward": 0.4023437611758709,
957
+ "reward_std": 0.04687500139698386,
958
+ "rewards/accuracy_reward": 0.1666666716337204,
959
+ "rewards/format_reward": 0.0,
960
+ "rewards/tag_count_reward": 0.2356770895421505,
961
+ "step": 68
962
+ },
963
+ {
964
+ "completion_length": 2048.0,
965
+ "epoch": 0.8263473053892215,
966
+ "grad_norm": 0.1977626532316208,
967
+ "kl": 0.00399017333984375,
968
+ "learning_rate": 1.7717065784027108e-07,
969
+ "loss": 0.0002,
970
+ "reward": 0.6575521007180214,
971
+ "reward_std": 0.024941950105130672,
972
+ "rewards/accuracy_reward": 0.416666679084301,
973
+ "rewards/format_reward": 0.0,
974
+ "rewards/tag_count_reward": 0.2408854216337204,
975
+ "step": 69
976
+ },
977
+ {
978
+ "completion_length": 2048.0,
979
+ "epoch": 0.8383233532934131,
980
+ "grad_norm": 0.2541021406650543,
981
+ "kl": 0.0036773681640625,
982
+ "learning_rate": 1.6681188802000992e-07,
983
+ "loss": 0.0001,
984
+ "reward": 0.7330729216337204,
985
+ "reward_std": 0.040123483166098595,
986
+ "rewards/accuracy_reward": 0.5000000149011612,
987
+ "rewards/format_reward": 0.0,
988
+ "rewards/tag_count_reward": 0.2330729216337204,
989
+ "step": 70
990
+ },
991
+ {
992
+ "completion_length": 2048.0,
993
+ "epoch": 0.8502994011976048,
994
+ "grad_norm": 0.3355337679386139,
995
+ "kl": 0.003818511962890625,
996
+ "learning_rate": 1.5714364907746534e-07,
997
+ "loss": 0.0002,
998
+ "reward": 0.557291679084301,
999
+ "reward_std": 0.0634542522020638,
1000
+ "rewards/accuracy_reward": 0.3333333358168602,
1001
+ "rewards/format_reward": 0.0,
1002
+ "rewards/tag_count_reward": 0.2239583358168602,
1003
+ "step": 71
1004
+ },
1005
+ {
1006
+ "completion_length": 2048.0,
1007
+ "epoch": 0.8622754491017964,
1008
+ "grad_norm": 0.2230527400970459,
1009
+ "kl": 0.004230499267578125,
1010
+ "learning_rate": 1.4818336383269423e-07,
1011
+ "loss": 0.0002,
1012
+ "reward": 0.9127604514360428,
1013
+ "reward_std": 0.02274093870073557,
1014
+ "rewards/accuracy_reward": 0.666666679084301,
1015
+ "rewards/format_reward": 0.0,
1016
+ "rewards/tag_count_reward": 0.2460937537252903,
1017
+ "step": 72
1018
+ },
1019
+ {
1020
+ "completion_length": 2048.0,
1021
+ "epoch": 0.874251497005988,
1022
+ "grad_norm": 0.1775672435760498,
1023
+ "kl": 0.004291534423828125,
1024
+ "learning_rate": 1.3994717932533889e-07,
1025
+ "loss": 0.0002,
1026
+ "reward": 0.8281250298023224,
1027
+ "reward_std": 0.02083333395421505,
1028
+ "rewards/accuracy_reward": 0.5833333507180214,
1029
+ "rewards/format_reward": 0.0,
1030
+ "rewards/tag_count_reward": 0.2447916716337204,
1031
+ "step": 73
1032
+ },
1033
+ {
1034
+ "completion_length": 2043.53125,
1035
+ "epoch": 0.8862275449101796,
1036
+ "grad_norm": 0.2539031505584717,
1037
+ "kl": 0.00445556640625,
1038
+ "learning_rate": 1.324499377165708e-07,
1039
+ "loss": 0.0002,
1040
+ "reward": 0.6497396044433117,
1041
+ "reward_std": 0.049258903600275517,
1042
+ "rewards/accuracy_reward": 0.416666679084301,
1043
+ "rewards/format_reward": 0.0,
1044
+ "rewards/tag_count_reward": 0.2330729216337204,
1045
+ "step": 74
1046
+ },
1047
+ {
1048
+ "completion_length": 2048.0,
1049
+ "epoch": 0.8982035928143712,
1050
+ "grad_norm": 0.3363349437713623,
1051
+ "kl": 0.0041046142578125,
1052
+ "learning_rate": 1.257051495425121e-07,
1053
+ "loss": 0.0002,
1054
+ "reward": 0.645833358168602,
1055
+ "reward_std": 0.06290360447019339,
1056
+ "rewards/accuracy_reward": 0.416666679084301,
1057
+ "rewards/format_reward": 0.0,
1058
+ "rewards/tag_count_reward": 0.2291666716337204,
1059
+ "step": 75
1060
+ },
1061
+ {
1062
+ "completion_length": 2048.0,
1063
+ "epoch": 0.9101796407185628,
1064
+ "grad_norm": 0.32424062490463257,
1065
+ "kl": 0.005748748779296875,
1066
+ "learning_rate": 1.197249693673371e-07,
1067
+ "loss": 0.0002,
1068
+ "reward": 0.895833358168602,
1069
+ "reward_std": 0.06451506866142154,
1070
+ "rewards/accuracy_reward": 0.6666666865348816,
1071
+ "rewards/format_reward": 0.0,
1072
+ "rewards/tag_count_reward": 0.2291666716337204,
1073
+ "step": 76
1074
+ },
1075
+ {
1076
+ "completion_length": 2037.34375,
1077
+ "epoch": 0.9221556886227545,
1078
+ "grad_norm": 7.499696254730225,
1079
+ "kl": 0.02497100830078125,
1080
+ "learning_rate": 1.145201738799255e-07,
1081
+ "loss": 0.001,
1082
+ "reward": 0.74609375,
1083
+ "reward_std": 0.026383287739008665,
1084
+ "rewards/accuracy_reward": 0.5000000074505806,
1085
+ "rewards/format_reward": 0.0,
1086
+ "rewards/tag_count_reward": 0.2460937537252903,
1087
+ "step": 77
1088
+ },
1089
+ {
1090
+ "completion_length": 2048.0,
1091
+ "epoch": 0.9341317365269461,
1092
+ "grad_norm": 0.22542916238307953,
1093
+ "kl": 0.00453948974609375,
1094
+ "learning_rate": 1.1010014247354125e-07,
1095
+ "loss": 0.0002,
1096
+ "reward": 0.6588541828095913,
1097
+ "reward_std": 0.02794927265495062,
1098
+ "rewards/accuracy_reward": 0.416666679084301,
1099
+ "rewards/format_reward": 0.0,
1100
+ "rewards/tag_count_reward": 0.2421875037252903,
1101
+ "step": 78
1102
+ },
1103
+ {
1104
+ "completion_length": 2048.0,
1105
+ "epoch": 0.9461077844311377,
1106
+ "grad_norm": 0.2916065454483032,
1107
+ "kl": 0.00518798828125,
1108
+ "learning_rate": 1.064728403435312e-07,
1109
+ "loss": 0.0002,
1110
+ "reward": 0.7369791865348816,
1111
+ "reward_std": 0.044873480685055256,
1112
+ "rewards/accuracy_reward": 0.5000000149011612,
1113
+ "rewards/format_reward": 0.0,
1114
+ "rewards/tag_count_reward": 0.2369791679084301,
1115
+ "step": 79
1116
+ },
1117
+ {
1118
+ "completion_length": 2048.0,
1119
+ "epoch": 0.9580838323353293,
1120
+ "grad_norm": 0.26393556594848633,
1121
+ "kl": 0.004276275634765625,
1122
+ "learning_rate": 1.0364480413350543e-07,
1123
+ "loss": 0.0002,
1124
+ "reward": 0.9765625149011612,
1125
+ "reward_std": 0.043472426012158394,
1126
+ "rewards/accuracy_reward": 0.7500000149011612,
1127
+ "rewards/format_reward": 0.0,
1128
+ "rewards/tag_count_reward": 0.2265625074505806,
1129
+ "step": 80
1130
+ },
1131
+ {
1132
+ "completion_length": 2048.0,
1133
+ "epoch": 0.9700598802395209,
1134
+ "grad_norm": 0.2773340046405792,
1135
+ "kl": 0.0044403076171875,
1136
+ "learning_rate": 1.0162113015586308e-07,
1137
+ "loss": 0.0002,
1138
+ "reward": 0.7356771044433117,
1139
+ "reward_std": 0.04566440684720874,
1140
+ "rewards/accuracy_reward": 0.5000000074505806,
1141
+ "rewards/format_reward": 0.0,
1142
+ "rewards/tag_count_reward": 0.2356770895421505,
1143
+ "step": 81
1144
+ },
1145
+ {
1146
+ "completion_length": 2046.1614685058594,
1147
+ "epoch": 0.9820359281437125,
1148
+ "grad_norm": 0.30801087617874146,
1149
+ "kl": 0.00536346435546875,
1150
+ "learning_rate": 1.0040546520789337e-07,
1151
+ "loss": 0.0002,
1152
+ "reward": 0.8151041865348816,
1153
+ "reward_std": 0.050114710349589586,
1154
+ "rewards/accuracy_reward": 0.5833333507180214,
1155
+ "rewards/format_reward": 0.0,
1156
+ "rewards/tag_count_reward": 0.2317708395421505,
1157
+ "step": 82
1158
+ },
1159
+ {
1160
+ "completion_length": 2048.0,
1161
+ "epoch": 0.9940119760479041,
1162
+ "grad_norm": 0.2908526360988617,
1163
+ "kl": 0.0043792724609375,
1164
+ "learning_rate": 1e-07,
1165
+ "loss": 0.0002,
1166
+ "reward": 0.8190104365348816,
1167
+ "reward_std": 0.046155727468430996,
1168
+ "rewards/accuracy_reward": 0.5833333432674408,
1169
+ "rewards/format_reward": 0.0,
1170
+ "rewards/tag_count_reward": 0.2356770895421505,
1171
+ "step": 83
1172
+ },
1173
+ {
1174
+ "epoch": 0.9940119760479041,
1175
+ "step": 83,
1176
+ "total_flos": 0.0,
1177
+ "train_loss": 0.00010365011172895271,
1178
+ "train_runtime": 4988.7462,
1179
+ "train_samples_per_second": 0.2,
1180
+ "train_steps_per_second": 0.017
1181
+ }
1182
+ ],
1183
+ "logging_steps": 1,
1184
+ "max_steps": 83,
1185
+ "num_input_tokens_seen": 0,
1186
+ "num_train_epochs": 1,
1187
+ "save_steps": 500,
1188
+ "stateful_callbacks": {
1189
+ "TrainerControl": {
1190
+ "args": {
1191
+ "should_epoch_stop": false,
1192
+ "should_evaluate": false,
1193
+ "should_log": false,
1194
+ "should_save": true,
1195
+ "should_training_stop": true
1196
+ },
1197
+ "attributes": {}
1198
+ }
1199
+ },
1200
+ "total_flos": 0.0,
1201
+ "train_batch_size": 16,
1202
+ "trial_name": null,
1203
+ "trial_params": null
1204
+ }
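
Each entry in the `log_history` above records the per-step reward breakdown (overall reward plus the accuracy, format, and tag-count components); a small sketch for pulling them out of the committed file for inspection or plotting:

```python
# Illustrative: extract the per-step reward curve from trainer_state.json.
import json

with open("trainer_state.json") as f:
    state = json.load(f)

for entry in state["log_history"]:
    if "reward" in entry:  # the final summary entry carries no reward fields
        print(entry["step"],
              round(entry["reward"], 3),
              round(entry["rewards/accuracy_reward"], 3),
              round(entry["rewards/tag_count_reward"], 3))
```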