sunbv56 committed
Commit af9f175 · verified · 1 Parent(s): bbc601d

feat: Upload full training checkpoint for resume

Files changed (5)
  1. README.md +182 -47
  2. adapter_model.safetensors +1 -1
  3. optimizer.pt +1 -1
  4. scheduler.pt +0 -0
  5. trainer_state.json +308 -5
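
Together with the refreshed adapter weights, the optimizer.pt, scheduler.pt, and trainer_state.json files are what `transformers.Trainer` needs to continue this run in place rather than restart from the adapter alone. A minimal resume sketch, assuming the original run's setup (the `model` and `train_ds` objects and the output directory are stand-ins, not part of this commit):

```python
# Sketch only: `model` and `train_ds` stand in for the original run's
# PEFT-wrapped Qwen2.5-VL model and dataset. The output directory follows
# the checkpoint path recorded in trainer_state.json.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(output_dir="./qwen2.5-vl-finetune-checkpoints")
trainer = Trainer(model=model, args=args, train_dataset=train_ds)

# resume_from_checkpoint=True locates the newest checkpoint-<step> directory
# and restores the adapter weights, optimizer.pt, scheduler.pt, and the step
# counter from trainer_state.json before continuing toward max_steps.
trainer.train(resume_from_checkpoint=True)
```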
README.md CHANGED
@@ -1,72 +1,207 @@
  ---
- library_name: peft
  base_model: Qwen/Qwen2.5-VL-3B-Instruct
  tags:
  - base_model:adapter:Qwen/Qwen2.5-VL-3B-Instruct
  - lora
  - transformers
- pipeline_tag: text-generation
- model-index:
- - name: qwen2.5-vl-vqa-vibook-tmp
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # qwen2.5-vl-vqa-vibook-tmp

- This model is a fine-tuned version of [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 1.1527

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 4
- - eval_batch_size: 4
- - seed: 42
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 8
- - optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.1
- - training_steps: 1576

- ### Training results

- | Training Loss | Epoch  | Step | Validation Loss |
- |:-------------:|:------:|:----:|:---------------:|
- | 0.9777        | 0.1111 | 50   | 1.0407          |
- | 0.8787        | 0.2222 | 100  | 0.8106          |
- | 0.9219        | 0.3333 | 150  | 0.7609          |
- | 0.6949        | 0.4444 | 200  | 0.7009          |
- | 0.7088        | 0.5556 | 250  | 0.6456          |
- | 0.6903        | 0.6667 | 300  | 0.5962          |
- | 0.5669        | 0.7778 | 350  | 0.5696          |
- | 0.6577        | 0.8889 | 400  | 0.5607          |
- | 0.4788        | 1.0    | 450  | 0.5549          |

  ### Framework versions

- - PEFT 0.16.0
- - Transformers 4.53.3
- - Pytorch 2.6.0+cu124
- - Datasets 4.4.1
- - Tokenizers 0.21.2
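
For readers reconstructing the run, the removed hyperparameter list above maps roughly onto the `TrainingArguments` below. This is a hedged reconstruction, not the original training script; `output_dir` and the logging/eval/save intervals are taken from trainer_state.json elsewhere in this commit, and the eval strategy is an assumption.

```python
from transformers import TrainingArguments

# Reconstructed from the hyperparameter list above (assumed, not the
# original script).
args = TrainingArguments(
    output_dir="./qwen2.5-vl-finetune-checkpoints",
    learning_rate=1e-4,
    per_device_train_batch_size=4,   # train_batch_size: 4
    per_device_eval_batch_size=4,    # eval_batch_size: 4
    gradient_accumulation_steps=2,   # effective batch size 4 * 2 = 8
    seed=42,
    optim="adamw_torch",             # AdamW, betas=(0.9, 0.999), eps=1e-8
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_steps=1576,
    logging_steps=10,                # from trainer_state.json
    eval_strategy="steps",           # assumed, so eval_steps applies
    eval_steps=50,
    save_steps=50,
)
```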
 
  ---
  base_model: Qwen/Qwen2.5-VL-3B-Instruct
+ library_name: peft
+ pipeline_tag: text-generation
  tags:
  - base_model:adapter:Qwen/Qwen2.5-VL-3B-Instruct
  - lora
  - transformers
  ---

+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]

  ### Framework versions

+ - PEFT 0.16.0
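
The new card's quick-start section is still a stub, and the repository ships only a LoRA adapter (see `adapter_model.safetensors` below), which must be attached to the base model for inference. A hedged loading sketch; the adapter repo id is inferred from the old card's model name and the committer, so treat it as a placeholder:

```python
import torch
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Base model and processor named in the card's front matter.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Attach the LoRA adapter; the repo id below is an assumed placeholder.
model = PeftModel.from_pretrained(base, "sunbv56/qwen2.5-vl-vqa-vibook-tmp")
model.eval()
```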
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:07159a33ca2c1b778fa34d8d79942224bc0f31e7b0936740b7ffcb4734f9f89c
+ oid sha256:3750ccd57d3fdcb6b88d266ceb4058d9820139544a558a1849183cd4df3477ae
  size 148712776
optimizer.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1ca0feda33341f6afe8006871a17ce269886994285439866daa5248ee77a7d5e
+ oid sha256:4e9c16c75f244fe4373934880e5b893fdc5bc9b875528012f878df42fdd3be53
  size 297808698
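
Both weight files above are stored as Git LFS pointers: only the `oid sha256:` line changes, while the byte size stays the same, which is consistent with overwriting tensors of identical shapes. A downloaded copy can be checked against its pointer with a short sketch like this (local path assumed):

```python
import hashlib

def lfs_sha256(path: str, chunk: int = 1 << 20) -> str:
    """Stream the file and return the sha256 hex digest used in the pointer's oid."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Expected value from the new adapter_model.safetensors pointer above.
assert lfs_sha256("adapter_model.safetensors") == (
    "3750ccd57d3fdcb6b88d266ceb4058d9820139544a558a1849183cd4df3477ae"
)
```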
scheduler.pt CHANGED
Binary files a/scheduler.pt and b/scheduler.pt differ
 
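The trainer_state.json diff below extends the file's `log_history` array with the step 1240-1576 records. Once the file is downloaded locally, the newly added eval curve can be read back out with a small sketch like this (filename assumed):

```python
import json

with open("trainer_state.json") as f:
    state = json.load(f)

# Keep only evaluation records, e.g. {"step": 1550, "eval_loss": 1.1526...}.
evals = [(e["step"], e["eval_loss"]) for e in state["log_history"] if "eval_loss" in e]
for step, loss in evals:
    print(f"step {step:>4}: eval_loss {loss:.4f}")
```
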
trainer_state.json CHANGED
@@ -2,9 +2,9 @@
  "best_global_step": 750,
  "best_metric": 0.48672306537628174,
  "best_model_checkpoint": "./qwen2.5-vl-finetune-checkpoints/checkpoint-750",
- "epoch": 2.7511111111111113,
  "eval_steps": 50,
- "global_step": 1238,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
@@ -1106,12 +1106,315 @@
  "train_runtime": 34829.2458,
  "train_samples_per_second": 0.284,
  "train_steps_per_second": 0.036
  }
  ],
  "logging_steps": 10,
- "max_steps": 1238,
  "num_input_tokens_seen": 0,
- "num_train_epochs": 3,
  "save_steps": 50,
  "stateful_callbacks": {
  "TrainerControl": {
@@ -1125,7 +1428,7 @@
  "attributes": {}
  }
  },
- "total_flos": 6.388579291468186e+16,
  "train_batch_size": 4,
  "trial_name": null,
  "trial_params": null
 
  "best_global_step": 750,
  "best_metric": 0.48672306537628174,
  "best_model_checkpoint": "./qwen2.5-vl-finetune-checkpoints/checkpoint-750",
+ "epoch": 4.666666666666667,
  "eval_steps": 50,
+ "global_step": 1576,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,

  "train_runtime": 34829.2458,
  "train_samples_per_second": 0.284,
  "train_steps_per_second": 0.036
+ },
+ {
+ "epoch": 3.66962962962963,
+ "grad_norm": 14.275522232055664,
+ "learning_rate": 1.3300797847207797e-05,
+ "loss": 3.5621,
+ "step": 1240
+ },
+ {
+ "epoch": 3.699259259259259,
+ "grad_norm": 27.858943939208984,
+ "learning_rate": 1.2557515699430094e-05,
+ "loss": 4.3815,
+ "step": 1250
+ },
+ {
+ "epoch": 3.699259259259259,
+ "eval_loss": 2.2658419609069824,
+ "eval_runtime": 995.7526,
+ "eval_samples_per_second": 0.301,
+ "eval_steps_per_second": 0.075,
+ "step": 1250
+ },
+ {
+ "epoch": 3.728888888888889,
+ "grad_norm": 30.557031631469727,
+ "learning_rate": 1.1832611379355878e-05,
+ "loss": 3.2056,
+ "step": 1260
+ },
+ {
+ "epoch": 3.7585185185185184,
+ "grad_norm": 34.28306579589844,
+ "learning_rate": 1.1126440690477996e-05,
+ "loss": 2.8957,
+ "step": 1270
+ },
+ {
+ "epoch": 3.788148148148148,
+ "grad_norm": 29.017297744750977,
+ "learning_rate": 1.0439350241294566e-05,
+ "loss": 2.5225,
+ "step": 1280
+ },
+ {
+ "epoch": 3.8177777777777777,
+ "grad_norm": 23.32266616821289,
+ "learning_rate": 9.771677275183744e-06,
+ "loss": 2.6028,
+ "step": 1290
+ },
+ {
+ "epoch": 3.8474074074074074,
+ "grad_norm": 32.830848693847656,
+ "learning_rate": 9.123749504875135e-06,
+ "loss": 2.7177,
+ "step": 1300
+ },
+ {
+ "epoch": 3.8474074074074074,
+ "eval_loss": 1.3522464036941528,
+ "eval_runtime": 985.7859,
+ "eval_samples_per_second": 0.304,
+ "eval_steps_per_second": 0.076,
+ "step": 1300
+ },
+ {
+ "epoch": 3.877037037037037,
+ "grad_norm": 6.538234233856201,
+ "learning_rate": 8.495884951599142e-06,
+ "loss": 2.2624,
+ "step": 1310
+ },
+ {
+ "epoch": 3.9066666666666667,
+ "grad_norm": 19.523771286010742,
+ "learning_rate": 7.888391788993216e-06,
+ "loss": 2.6275,
+ "step": 1320
+ },
+ {
+ "epoch": 3.9362962962962964,
+ "grad_norm": 11.971488952636719,
+ "learning_rate": 7.301568191841457e-06,
+ "loss": 2.1496,
+ "step": 1330
+ },
+ {
+ "epoch": 3.965925925925926,
+ "grad_norm": 34.24433898925781,
+ "learning_rate": 6.735702189722115e-06,
+ "loss": 2.0774,
+ "step": 1340
+ },
+ {
+ "epoch": 3.9955555555555557,
+ "grad_norm": 12.619851112365723,
+ "learning_rate": 6.191071525634456e-06,
+ "loss": 2.0749,
+ "step": 1350
+ },
+ {
+ "epoch": 3.9955555555555557,
+ "eval_loss": 1.2665727138519287,
+ "eval_runtime": 972.1433,
+ "eval_samples_per_second": 0.309,
+ "eval_steps_per_second": 0.077,
+ "step": 1350
+ },
+ {
+ "epoch": 4.026666666666666,
+ "grad_norm": 21.63642692565918,
+ "learning_rate": 5.667943519674723e-06,
+ "loss": 2.2795,
+ "step": 1360
+ },
+ {
+ "epoch": 4.0562962962962965,
+ "grad_norm": 5.838581562042236,
+ "learning_rate": 5.166574937827867e-06,
+ "loss": 2.6146,
+ "step": 1370
+ },
+ {
+ "epoch": 4.085925925925926,
+ "grad_norm": 11.008721351623535,
+ "learning_rate": 4.687211865939539e-06,
+ "loss": 2.3045,
+ "step": 1380
+ },
+ {
+ "epoch": 4.115555555555556,
+ "grad_norm": 6.246650218963623,
+ "learning_rate": 4.2300895889302805e-06,
+ "loss": 1.823,
+ "step": 1390
+ },
+ {
+ "epoch": 4.145185185185185,
+ "grad_norm": 13.782442092895508,
+ "learning_rate": 3.7954324753109673e-06,
+ "loss": 2.2982,
+ "step": 1400
+ },
+ {
+ "epoch": 4.145185185185185,
+ "eval_loss": 1.2098972797393799,
+ "eval_runtime": 998.8662,
+ "eval_samples_per_second": 0.3,
+ "eval_steps_per_second": 0.075,
+ "step": 1400
+ },
+ {
+ "epoch": 4.174814814814815,
+ "grad_norm": 11.179134368896484,
+ "learning_rate": 3.383453867056452e-06,
+ "loss": 2.5618,
+ "step": 1410
+ },
+ {
+ "epoch": 4.204444444444444,
+ "grad_norm": 73.97550201416016,
+ "learning_rate": 2.9943559748912996e-06,
+ "loss": 1.8831,
+ "step": 1420
+ },
+ {
+ "epoch": 4.234074074074074,
+ "grad_norm": 17.907745361328125,
+ "learning_rate": 2.628329779039057e-06,
+ "loss": 2.2352,
+ "step": 1430
+ },
+ {
+ "epoch": 4.263703703703704,
+ "grad_norm": 81.71790313720703,
+ "learning_rate": 2.2855549354837912e-06,
+ "loss": 2.1651,
+ "step": 1440
+ },
+ {
+ "epoch": 4.293333333333333,
+ "grad_norm": 10.33467960357666,
+ "learning_rate": 1.9661996877898105e-06,
+ "loss": 1.7595,
+ "step": 1450
+ },
+ {
+ "epoch": 4.293333333333333,
+ "eval_loss": 1.1622637510299683,
+ "eval_runtime": 993.3397,
+ "eval_samples_per_second": 0.302,
+ "eval_steps_per_second": 0.076,
+ "step": 1450
+ },
+ {
+ "epoch": 4.322962962962963,
+ "grad_norm": 40.43919372558594,
+ "learning_rate": 1.6704207845230358e-06,
+ "loss": 1.9304,
+ "step": 1460
+ },
+ {
+ "epoch": 4.352592592592592,
+ "grad_norm": 10.497286796569824,
+ "learning_rate": 1.3983634023143511e-06,
+ "loss": 2.098,
+ "step": 1470
+ },
+ {
+ "epoch": 4.3822222222222225,
+ "grad_norm": 9.101359367370605,
+ "learning_rate": 1.1501610746028124e-06,
+ "loss": 1.8441,
+ "step": 1480
+ },
+ {
+ "epoch": 4.411851851851852,
+ "grad_norm": 20.517807006835938,
+ "learning_rate": 9.25935626093688e-07,
+ "loss": 2.3551,
+ "step": 1490
+ },
+ {
+ "epoch": 4.441481481481482,
+ "grad_norm": 7.981099605560303,
+ "learning_rate": 7.257971129634389e-07,
+ "loss": 1.6124,
+ "step": 1500
+ },
+ {
+ "epoch": 4.441481481481482,
+ "eval_loss": 1.1480356454849243,
+ "eval_runtime": 970.9195,
+ "eval_samples_per_second": 0.309,
+ "eval_steps_per_second": 0.077,
+ "step": 1500
+ },
+ {
+ "epoch": 4.471111111111111,
+ "grad_norm": 51.19599533081055,
+ "learning_rate": 5.498437688410463e-07,
+ "loss": 2.0946,
+ "step": 1510
+ },
+ {
+ "epoch": 4.50074074074074,
+ "grad_norm": 7.847194671630859,
+ "learning_rate": 3.981619565921968e-07,
+ "loss": 1.8896,
+ "step": 1520
+ },
+ {
+ "epoch": 4.53037037037037,
+ "grad_norm": 12.63452434539795,
+ "learning_rate": 2.708261259299072e-07,
+ "loss": 2.1132,
+ "step": 1530
+ },
+ {
+ "epoch": 4.5600000000000005,
+ "grad_norm": 8.711173057556152,
+ "learning_rate": 1.6789877687254928e-07,
+ "loss": 1.9074,
+ "step": 1540
+ },
+ {
+ "epoch": 4.58962962962963,
+ "grad_norm": 14.014768600463867,
+ "learning_rate": 8.943042906705001e-08,
+ "loss": 2.4591,
+ "step": 1550
+ },
+ {
+ "epoch": 4.58962962962963,
+ "eval_loss": 1.1526756286621094,
+ "eval_runtime": 1013.0536,
+ "eval_samples_per_second": 0.296,
+ "eval_steps_per_second": 0.074,
+ "step": 1550
+ },
+ {
+ "epoch": 4.619259259259259,
+ "grad_norm": 241.5323486328125,
+ "learning_rate": 3.545959699243207e-08,
+ "loss": 1.9968,
+ "step": 1560
+ },
+ {
+ "epoch": 4.648888888888889,
+ "grad_norm": 41.02328109741211,
+ "learning_rate": 6.0127710558133265e-09,
+ "loss": 1.9328,
+ "step": 1570
+ },
+ {
+ "epoch": 4.666666666666667,
+ "step": 1576,
+ "total_flos": 8.15036810717184e+16,
+ "train_loss": 0.49436442077462445,
+ "train_runtime": 26325.3193,
+ "train_samples_per_second": 0.479,
+ "train_steps_per_second": 0.06
  }
  ],
  "logging_steps": 10,
+ "max_steps": 1576,
  "num_input_tokens_seen": 0,
+ "num_train_epochs": 5,
  "save_steps": 50,
  "stateful_callbacks": {
  "TrainerControl": {

  "attributes": {}
  }
  },
+ "total_flos": 8.15036810717184e+16,
  "train_batch_size": 4,
  "trial_name": null,
  "trial_params": null