W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792]
W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792] *****************************************
W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792] *****************************************
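The banner above notes that torchrun defaults OMP_NUM_THREADS to 1 per process. A minimal sketch of pinning it yourself, assuming the value (8 here) is tuned to your cores-per-process budget; it must be set before torch is imported so the OpenMP runtime picks it up:

import os
os.environ.setdefault("OMP_NUM_THREADS", "8")  # assumption: physical cores / local processes
import torch  # imported after the variable is set so it takes effect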
[2025-10-18 06:44:00,629] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-18 06:44:04,890] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-10-18 06:44:05,666] [INFO] [comm.py:700:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[INFO|2025-10-18 06:44:06] llamafactory.hparams.parser:406 >> Process rank: 5, world size: 8, device: cuda:5, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:06] llamafactory.hparams.parser:406 >> Process rank: 0, world size: 8, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2323] 2025-10-18 06:44:07,355 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:691] 2025-10-18 06:44:07,357 >> loading configuration file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/config.json
[INFO|configuration_utils.py:765] 2025-10-18 06:44:07,359 >> Model config Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3",
"use_cache": false,
"use_sliding_window": false,
"vocab_size": 152064
}
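Reading the Qwen2Config above off directly: a small sketch of the attention geometry it implies (grouped-query attention, with several query heads sharing each KV head):

hidden_size = 3584
num_attention_heads = 28
num_key_value_heads = 4
head_dim = hidden_size // num_attention_heads           # 128
kv_groups = num_attention_heads // num_key_value_heads  # 7 query heads share each KV head
print(head_dim, kv_groups)  # -> 128 7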
[rank5]:[W1018 06:44:07.811198098 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[INFO|2025-10-18 06:44:07] llamafactory.data.loader:143 >> Loading dataset /mmu_nlp_ssd/dongguanting/tool_light_data/method7-qwen2.5-7b-instruct-llama-factory-sft-edition17.json...
[INFO|2025-10-18 06:44:07] llamafactory.hparams.parser:406 >> Process rank: 4, world size: 8, device: cuda:4, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:07] llamafactory.hparams.parser:406 >> Process rank: 1, world size: 8, device: cuda:1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 7, world size: 8, device: cuda:7, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 6, world size: 8, device: cuda:6, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 2, world size: 8, device: cuda:2, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 3, world size: 8, device: cuda:3, distributed training: True, compute dtype: torch.bfloat16
[rank3]:[W1018 06:44:08.022060437 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank4]:[W1018 06:44:08.046568749 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank1]:[W1018 06:44:08.058006051 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank6]:[W1018 06:44:08.087474291 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank7]:[W1018 06:44:08.088871679 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank2]:[W1018 06:44:08.124219695 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
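The per-rank warnings above suggest their own fix. A minimal sketch of that pattern, assuming a torchrun launch that sets LOCAL_RANK (illustrative, not what this script does internally):

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # provided by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", device_id=torch.device(f"cuda:{local_rank}"))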
Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 15077 examples [00:01, 11581.17 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 15077/15077 [00:00<00:00, 64696.89 examples/s]
[rank0]:[W1018 06:44:33.061247415 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Running tokenizer on dataset (num_proc=16): 100%|██████████| 15077/15077 [00:06<00:00, 2277.50 examples/s]
training example:
input_ids:
[151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 2610, 525, 264, 10950, 17847, 429, 646, 11625, 279, 2661, 3405, 3019, 553, 3019, 448, 279, 1492, 315, 279, 58218, 2711, 5392, 323, 10135, 39299, 5392, 13, 16246, 264, 3405, 11, 498, 1184, 311, 1156, 1744, 911, 279, 32711, 1882, 304, 279, 3971, 323, 1221, 3410, 279, 4226, 13, 11954, 7274, 11, 498, 646, 19873, 279, 58218, 2711, 5392, 311, 2711, 323, 10135, 39299, 5392, 311, 11047, 279, 6888, 3491, 369, 2097, 1995, 911, 3151, 13347, 421, 4362, 13, 576, 32711, 1882, 323, 4226, 525, 43810, 2878, 366, 26865, 29, 690, 26865, 29, 323, 366, 9217, 29, 690, 9217, 29, 9492, 15576, 11, 323, 279, 2711, 3239, 323, 1102, 525, 43810, 2878, 366, 1836, 29, 690, 1836, 29, 323, 366, 1382, 29, 690, 1382, 29, 9492, 15576, 13, 1752, 3110, 11, 366, 26865, 29, 1096, 374, 279, 32711, 1882, 13, 690, 26865, 29, 366, 1836, 29, 2711, 3239, 1588, 690, 1836, 29, 366, 1382, 29, 2711, 1102, 1588, 690, 1382, 29, 366, 26865, 29, 1096, 374, 279, 32711, 1882, 13, 690, 26865, 29, 366, 12669, 29, 10135, 2038, 1588, 690, 12669, 29, 366, 1382, 29, 10135, 39299, 1102, 1588, 690, 1382, 29, 366, 26865, 29, 1096, 374, 279, 32711, 1882, 13, 690, 26865, 29, 366, 9217, 29, 576, 1590, 4226, 374, 1124, 58, 1124, 79075, 90, 9217, 1588, 92, 1124, 60, 690, 9217, 14276, 758, 279, 1537, 949, 315, 279, 4226, 11, 279, 1590, 4734, 4226, 374, 43810, 2878, 1124, 79075, 6257, 448, 44536, 3561, 624, 4498, 1521, 11038, 468, 13, 16390, 594, 6981, 2746, 30, 151645, 198, 151644, 77091, 198, 13708, 766, 93376, 8253, 979, 11038, 468, 13, 16390, 594, 6981, 8469, 11, 358, 686, 1401, 705, 6032, 31177, 1995, 389, 1435, 3918, 26865, 1784, 1836, 43960, 2253, 468, 13, 16390, 6981, 4545, 2400, 522, 1836, 1784, 1382, 29, 334, 19357, 8085, 56177, 28715, 389, 279, 3897, 9293, 11, 11038, 468, 13, 16390, 748, 6981, 11, 11038, 506, 13, 16390, 11, 5823, 3123, 389, 6652, 220, 17, 15, 11, 220, 16, 24, 22, 16, 11, 4152, 311, 264, 12654, 13, 1096, 1995, 4041, 5961, 504, 279, 13656, 3565, 911, 11038, 506, 13, 16390, 2878, 1493, 8173, 3918, 1382, 1784, 9217, 16357, 1590, 4226, 374, 1124, 79075, 90, 32146, 220, 17, 15, 11, 220, 16, 24, 22, 16, 92, 7110, 522, 9217, 29, 151645, 198]
inputs:
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
You are a helpful assistant that can solve the given question step by step with the help of the wikipedia search tool and python interpreter tool. Given a question, you need to first think about the reasoning process in the mind and then provide the answer. During thinking, you can invoke the wikipedia search tool to search and python interpreter tool to calculate the math problem for fact information about specific topics if needed. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags respectively, and the search query and result are enclosed within <search> </search> and <result> </result> tags respectively. For example, <think> This is the reasoning process. </think> <search> search query here </search> <result> search result here </result> <think> This is the reasoning process. </think> <python> python code here </python> <result> python interpreter result here </result> <think> This is the reasoning process. </think> <answer> The final answer is \[ \boxed{answer here} \] </answer>. In the last part of the answer, the final exact answer is enclosed within \boxed{} with latex format.
When did Roy E. Disney's father die?<|im_end|>
<|im_start|>assistant
<think>To determine when Roy E. Disney's father died, I will look up biographical information on him.</think><search>Roy E. Disney father death date</search><result>**Final Information**
Based on the provided documents, Roy E. Disney’s father, Roy O. Disney, passed away on December 20, 1971, due to a stroke. This information comes directly from the historical details about Roy O. Disney within these sources.</result><answer>The final answer is \boxed{December 20, 1971}.\</answer><|im_end|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 13708, 766, 93376, 8253, 979, 11038, 468, 13, 16390, 594, 6981, 8469, 11, 358, 686, 1401, 705, 6032, 31177, 1995, 389, 1435, 3918, 26865, 1784, 1836, 43960, 2253, 468, 13, 16390, 6981, 4545, 2400, 522, 1836, 1784, 1382, 29, 334, 19357, 8085, 56177, 28715, 389, 279, 3897, 9293, 11, 11038, 468, 13, 16390, 748, 6981, 11, 11038, 506, 13, 16390, 11, 5823, 3123, 389, 6652, 220, 17, 15, 11, 220, 16, 24, 22, 16, 11, 4152, 311, 264, 12654, 13, 1096, 1995, 4041, 5961, 504, 279, 13656, 3565, 911, 11038, 506, 13, 16390, 2878, 1493, 8173, 3918, 1382, 1784, 9217, 16357, 1590, 4226, 374, 1124, 79075, 90, 32146, 220, 17, 15, 11, 220, 16, 24, 22, 16, 92, 7110, 522, 9217, 29, 151645, 198]
labels:
<think>To determine when Roy E. Disney's father died, I will look up biographical information on him.</think><search>Roy E. Disney father death date</search><result>**Final Information**
Based on the provided documents, Roy E. Disney’s father, Roy O. Disney, passed away on December 20, 1971, due to a stroke. This information comes directly from the historical details about Roy O. Disney within these sources.</result><answer>The final answer is \boxed{December 20, 1971}.\</answer><|im_end|>
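The label_ids above show the standard SFT masking: every prompt token is replaced with -100 so cross-entropy only scores the assistant turn. A minimal sketch of that step (mask_prompt is an illustrative helper, not LLaMA-Factory's API):

IGNORE_INDEX = -100  # positions with this value are skipped by the loss

def mask_prompt(input_ids: list[int], prompt_len: int) -> list[int]:
    # copy the ids and blank out everything before the assistant response
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels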
[INFO|configuration_utils.py:691] 2025-10-18 06:45:06,145 >> loading configuration file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/config.json
[INFO|2025-10-18 06:45:06] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.
Applied Liger kernels to Qwen2
[INFO|2025-10-18 06:45:06] llamafactory.model.model_utils.liger_kernel:143 >> Liger kernel has been applied to the model.
[INFO|modeling_utils.py:1121] 2025-10-18 06:45:07,293 >> loading weights file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/model.safetensors.index.json
[INFO|modeling_utils.py:3726] 2025-10-18 06:45:07,308 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[INFO|configuration_utils.py:1142] 2025-10-18 06:45:07,321 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645,
"use_cache": false
}
[WARNING|logging.py:328] 2025-10-18 06:45:07,616 >> Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
[2025-10-18 06:45:09,887] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 339, num_elems = 7.62B
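A sketch of what "activating zero.init()" above refers to: constructing the model inside deepspeed.zero.Init so the 7.62B parameters are partitioned across the 8 ranks as modules are created, rather than materialized whole on every GPU. Paths here are placeholders; transformers wires this up internally when ZeRO-3 is detected:

import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "/path/to/checkpoint"       # placeholder for this run's model dir
ds_config = "ds_z3_offload_config.json"  # placeholder for the ZeRO-3 JSON printed below
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_path))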
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00, 8.17s/it]
[INFO|modeling_utils.py:4930] 2025-10-18 06:45:42,606 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4938] 2025-10-18 06:45:42,606 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
/mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(
/mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:64: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx: "torch.autograd.Function", grad_output: "torch.Tensor") -> "torch.Tensor":
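The FutureWarnings above state the migration exactly: the decorators move from torch.cuda.amp to torch.amp with an explicit device_type. A minimal sketch (the Function name and bodies are illustrative):

import torch

class _Checkpointed(torch.autograd.Function):
    @staticmethod
    @torch.amp.custom_fwd(device_type="cuda")
    def forward(ctx, x):
        return x

    @staticmethod
    @torch.amp.custom_bwd(device_type="cuda")
    def backward(ctx, grad_output):
        return grad_output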
[INFO|configuration_utils.py:1095] 2025-10-18 06:45:42,608 >> loading configuration file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/generation_config.json
[INFO|configuration_utils.py:1142] 2025-10-18 06:45:42,608 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.05,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8
}
[INFO|2025-10-18 06:45:42] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-10-18 06:45:42] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-10-18 06:45:42] llamafactory.model.adapter:143 >> DeepSpeed ZeRO3 detected, remaining trainable params in float32.
[INFO|2025-10-18 06:45:42] llamafactory.model.adapter:143 >> Fine-tuning method: Full
[INFO|2025-10-18 06:45:42] llamafactory.model.loader:143 >> trainable params: 7,615,616,512 || all params: 7,615,616,512 || trainable%: 100.0000
[INFO|trainer.py:748] 2025-10-18 06:45:42,648 >> Using auto half precision backend
[INFO|deepspeed.py:380] 2025-10-18 06:45:43,067 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
Installed CUDA version 12.3 does not match the version torch was compiled with 12.4 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu124/cpu_adam/build.ninja...
/mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.425875186920166 seconds
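To silence the TORCH_CUDA_ARCH_LIST warning above, pin the arch list before the extension builds, as the warning itself suggests. The value is an assumption (8.0 targets A100-class GPUs); match it to your hardware:

import os
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"  # assumption: adjust to your GPU architecture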
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
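The Config line above maps onto DeepSpeedCPUAdam's constructor (adam_w=1 selects decoupled weight decay). A minimal sketch with a stand-in parameter list:

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

params = torch.nn.Linear(4, 4).parameters()  # stand-in for the 7.6B model's parameters
optimizer = DeepSpeedCPUAdam(params, lr=5e-6, betas=(0.9, 0.999),
                             weight_decay=0.01, adamw_mode=True)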
[2025-10-18 06:45:47,795] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.7, git-hash=unknown, git-branch=unknown
[2025-10-18 06:45:47,795] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[2025-10-18 06:45:47,804] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-10-18 06:45:47,805] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-10-18 06:45:47,805] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-10-18 06:45:47,818] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2025-10-18 06:45:47,818] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2025-10-18 06:45:47,818] [INFO] [logging.py:107:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-10-18 06:45:47,818] [INFO] [logging.py:107:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2025-10-18 06:45:48,100] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-10-18 06:45:48,101] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 3.05 GB CA 0.0 GB Max_CA 3 GB
[2025-10-18 06:45:48,101] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0%
[2025-10-18 06:45:48,103] [INFO] [stage3.py:170:__init__] Reduce bucket size 12845056
[2025-10-18 06:45:48,103] [INFO] [stage3.py:171:__init__] Prefetch bucket size 11560550
[2025-10-18 06:45:48,355] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-10-18 06:45:48,356] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:48,356] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0%
Parameter Offload: Total persistent parameters: 333312 in 141 params
[2025-10-18 06:45:48,621] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-10-18 06:45:48,622] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:48,622] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0%
[2025-10-18 06:45:48,836] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-10-18 06:45:48,837] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:48,837] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0%
[2025-10-18 06:45:51,184] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2
[2025-10-18 06:45:51,185] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:51,186] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 102.07 GB, percent = 5.1%
[2025-10-18 06:45:51,455] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-10-18 06:45:51,456] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:51,456] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 105.93 GB, percent = 5.3%
[2025-10-18 06:45:54,718] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-10-18 06:45:54,719] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:54,719] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 124.92 GB, percent = 6.2%
[2025-10-18 06:45:54,956] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-10-18 06:45:54,956] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:54,957] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 128.81 GB, percent = 6.4%
[2025-10-18 06:46:01,399] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-10-18 06:46:01,400] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:46:01,400] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 157.56 GB, percent = 7.8%
[2025-10-18 06:46:01,401] [INFO] [stage3.py:534:_setup_for_real_optimizer] optimizer state initialized
[2025-10-18 06:46:04,410] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-10-18 06:46:04,411] [INFO] [utils.py:782:see_memory_usage] MA 0.02 GB Max_MA 2.06 GB CA 2.06 GB Max_CA 2 GB
[2025-10-18 06:46:04,411] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 174.14 GB, percent = 8.6%
[2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2025-10-18 06:46:04,412] [INFO] [config.py:1003:print] DeepSpeedEngine configuration:
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] amp_enabled .................. False
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] amp_params ................... False
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] bfloat16_enabled ............. True
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] bfloat16_immediate_grad_update True
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] checkpoint_parallel_write_pipeline False
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] checkpoint_tag_validation_enabled True
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] checkpoint_tag_validation_fail False
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] comms_config .................
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] communication_data_type ...... None
[2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] compile_config ............... deepcompile=False free_activation=False offload_activation=False offload_opt_states=False double_buffer=True symmetric_memory=False debug_log=False offload_parameters=False sync_before_reduce=False sync_after_reduce=False sync_before_allgather=False sync_after_allgather=False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] curriculum_enabled_legacy .... False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] curriculum_params_legacy ..... False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] data_efficiency_enabled ...... False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] dataloader_drop_last ......... False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] disable_allgather ............ False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] dump_state ................... False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] dynamic_loss_scale_args ...... None
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_enabled ........... False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_gas_boundary_resolution 1
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_layer_name ........ bert.encoder.layer
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_layer_num ......... 0
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_max_iter .......... 100
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_stability ......... 1e-06
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_tol ............... 0.01
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_verbose ........... False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] elasticity_enabled ........... False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] fp16_auto_cast ............... None
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] fp16_enabled ................. False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] fp16_master_weights_and_gradients False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] global_rank .................. 0
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] grad_accum_dtype ............. None
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] gradient_accumulation_steps .. 2
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] gradient_clipping ............ 1.0
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] gradient_predivide_factor .... 1.0
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] graph_harvesting ............. False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] initial_dynamic_scale ........ 1
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] load_universal_checkpoint .... False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] loss_scale ................... 1.0
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] memory_breakdown ............. False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] mics_hierarchial_params_gather False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] mics_shard_size .............. -1
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] optimizer_legacy_fusion ...... False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] optimizer_name ............... None
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] optimizer_params ............. None
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] pld_enabled .................. False
[2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] pld_params ................... False
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] prescale_gradients ........... False
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] scheduler_name ............... None
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] scheduler_params ............. None
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] seq_parallel_communication_data_type torch.float32
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] sparse_attention ............. None
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] sparse_gradients_enabled ..... False
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] steps_per_print .............. inf
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tp_overlap_comm=False tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] timers_config ................ enabled=True synchronized=True
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] train_batch_size ............. 16
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] train_micro_batch_size_per_gpu 1
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] use_data_before_expert_parallel_ False
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] use_node_local_storage ....... False
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] wall_clock_breakdown ......... False
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] weight_quantization_config ... None
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] world_size ................... 8
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_allow_untested_optimizer True
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=12845056 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=11560550 param_persistence_threshold=35840 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_enabled ................. True
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_force_ds_cpu_optimizer .. True
[2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_optimization_stage ...... 3
[2025-10-18 06:46:04,415] [INFO] [config.py:993:print_user_config] json = {
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 2,
"gradient_clipping": 1.0,
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+09,
"reduce_bucket_size": 1.284506e+07,
"stage3_prefetch_bucket_size": 1.156055e+07,
"stage3_param_persistence_threshold": 3.584000e+04,
"stage3_max_live_parameters": 1.000000e+09,
"stage3_max_reuse_distance": 1.000000e+09,
"stage3_gather_16bit_weights_on_model_save": true
},
"steps_per_print": inf
}
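A sketch of how a config like the JSON above is typically handed to the HF Trainer that LLaMA-Factory drives; output_dir and the config path are placeholders, and the numbers mirror the printed values:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                       # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    bf16=True,
    deepspeed="ds_z3_offload_config.json",  # placeholder path to the JSON above
)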
[INFO|trainer.py:2414] 2025-10-18 06:46:04,417 >> ***** Running training *****
[INFO|trainer.py:2415] 2025-10-18 06:46:04,417 >> Num examples = 15,077
[INFO|trainer.py:2416] 2025-10-18 06:46:04,417 >> Num Epochs = 3
[INFO|trainer.py:2417] 2025-10-18 06:46:04,417 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2420] 2025-10-18 06:46:04,417 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2421] 2025-10-18 06:46:04,417 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2422] 2025-10-18 06:46:04,417 >> Total optimization steps = 2,826
[INFO|trainer.py:2423] 2025-10-18 06:46:04,418 >> Number of trainable parameters = 7,615,616,512
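The banner numbers are mutually consistent; a quick arithmetic check (plain Python, values copied from the banner, assuming drop-last rounding on the distributed dataloader):

```python
# 15,077 examples at a global batch of 16 give 942 optimizer steps per epoch,
# hence 2,826 total optimization steps over 3 epochs.
num_examples = 15_077
global_batch = 1 * 8 * 2  # per-device batch x 8 GPUs x 2 grad-accum = 16
steps_per_epoch = num_examples // global_batch
print(global_batch, steps_per_epoch, steps_per_epoch * 3)  # 16 942 2826
```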
[per-step tqdm progress bars elided; loss/grad_norm/learning_rate are logged every 10 optimizer steps, ~5-7 s/it throughout]
{'loss': 0.741, 'grad_norm': 4.634474754333496, 'learning_rate': 1.5901060070671379e-07, 'epoch': 0.01}
{'loss': 0.5551, 'grad_norm': 2.9002726078033447, 'learning_rate': 3.356890459363958e-07, 'epoch': 0.02}
{'loss': 0.6185, 'grad_norm': 4.242003917694092, 'learning_rate': 5.123674911660778e-07, 'epoch': 0.03}
{'loss': 0.6358, 'grad_norm': 3.8156638145446777, 'learning_rate': 6.890459363957598e-07, 'epoch': 0.04}
{'loss': 0.5922, 'grad_norm': 3.047624349594116, 'learning_rate': 8.657243816254418e-07, 'epoch': 0.05}
{'loss': 0.6282, 'grad_norm': 2.2943954467773438, 'learning_rate': 1.0424028268551239e-06, 'epoch': 0.06}
{'loss': 0.5836, 'grad_norm': 2.831937551498413, 'learning_rate': 1.2190812720848057e-06, 'epoch': 0.07}
{'loss': 0.5836, 'grad_norm': 3.941297769546509, 'learning_rate': 1.3957597173144876e-06, 'epoch': 0.08}
{'loss': 0.4983, 'grad_norm': 2.4598379135131836, 'learning_rate': 1.5724381625441699e-06, 'epoch': 0.1}
{'loss': 0.6057, 'grad_norm': 2.533829927444458, 'learning_rate': 1.7491166077738517e-06, 'epoch': 0.11}
{'loss': 0.5135, 'grad_norm': 2.412334442138672, 'learning_rate': 1.925795053003534e-06, 'epoch': 0.12}
{'loss': 0.4844, 'grad_norm': 2.7505877017974854, 'learning_rate': 2.1024734982332157e-06, 'epoch': 0.13}
{'loss': 0.5386, 'grad_norm': 2.701307535171509, 'learning_rate': 2.279151943462898e-06, 'epoch': 0.14}
{'loss': 0.4774, 'grad_norm': 2.8261961936950684, 'learning_rate': 2.45583038869258e-06, 'epoch': 0.15}
{'loss': 0.5035, 'grad_norm': 2.4490256309509277, 'learning_rate': 2.6325088339222617e-06, 'epoch': 0.16}
{'loss': 0.4897, 'grad_norm': 2.418158769607544, 'learning_rate': 2.8091872791519436e-06, 'epoch': 0.17}
{'loss': 0.5196, 'grad_norm': 3.5972161293029785, 'learning_rate': 2.985865724381626e-06, 'epoch': 0.18}
{'loss': 0.4791, 'grad_norm': 2.814927577972412, 'learning_rate': 3.162544169611308e-06, 'epoch': 0.19}
{'loss': 0.5024, 'grad_norm': 2.6151270866394043, 'learning_rate': 3.3392226148409896e-06, 'epoch': 0.2}
{'loss': 0.5781, 'grad_norm': 2.8331387042999268, 'learning_rate': 3.5159010600706715e-06, 'epoch': 0.21}
{'loss': 0.4186, 'grad_norm': 2.433027744293213, 'learning_rate': 3.6925795053003538e-06, 'epoch': 0.22}
{'loss': 0.4819, 'grad_norm': 2.671696186065674, 'learning_rate': 3.869257950530036e-06, 'epoch': 0.23}
{'loss': 0.547, 'grad_norm': 2.5337982177734375, 'learning_rate': 4.045936395759718e-06, 'epoch': 0.24}
{'loss': 0.5603, 'grad_norm': 2.2034990787506104, 'learning_rate': 4.222614840989399e-06, 'epoch': 0.25}
{'loss': 0.4483, 'grad_norm': 2.2893121242523193, 'learning_rate': 4.399293286219082e-06, 'epoch': 0.27}
{'loss': 0.5178, 'grad_norm': 1.8757219314575195, 'learning_rate': 4.575971731448763e-06, 'epoch': 0.28}
{'loss': 0.5264, 'grad_norm': 2.3748602867126465, 'learning_rate': 4.752650176678445e-06, 'epoch': 0.29}
{'loss': 0.5124, 'grad_norm': 3.0481033325195312, 'learning_rate': 4.929328621908128e-06, 'epoch': 0.3}
{'loss': 0.4977, 'grad_norm': 2.682847023010254, 'learning_rate': 4.99993132201408e-06, 'epoch': 0.31}
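The learning-rate column ramps linearly to a ~5e-6 peak by roughly step 283 (10% of the 2,826 total steps) and then decays slowly, which is the signature of a cosine schedule with warmup. A hedged reconstruction with the stock transformers scheduler; the peak LR and the 0.1 warmup ratio are inferred from the trace, not read from the run's arguments:

```python
# Reproducing the logged LR trajectory with cosine-with-warmup.
# Assumed (inferred from the log): peak lr 5e-6, warmup_ratio ~0.1.
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 2826
warmup_steps = int(0.1 * total_steps)  # 282; the logged ramp ends near step 283

opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=5e-6)
sched = get_cosine_schedule_with_warmup(opt, warmup_steps, total_steps)

for step in range(1, 601):
    opt.step()
    sched.step()
    if step in (10, 290, 600):
        # ~1.8e-07, ~5.0e-06, ~4.8e-06: within one step of the logged values
        print(step, sched.get_last_lr()[0])
```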
{'loss': 0.5005, 'grad_norm': 2.472842216491699, 'learning_rate': 4.9995116368759e-06, 'epoch': 0.32}
{'loss': 0.4857, 'grad_norm': 2.582815647125244, 'learning_rate': 4.998710485009401e-06, 'epoch': 0.33}
{'loss': 0.4637, 'grad_norm': 2.3572824001312256, 'learning_rate': 4.99752798868358e-06, 'epoch': 0.34}
{'loss': 0.4775, 'grad_norm': 2.3432295322418213, 'learning_rate': 4.99596432836689e-06, 'epoch': 0.35}
{'loss': 0.5779, 'grad_norm': 2.7486777305603027, 'learning_rate': 4.994019742699705e-06, 'epoch': 0.36}
{'loss': 0.5057, 'grad_norm': 2.3831562995910645, 'learning_rate': 4.991694528457891e-06, 'epoch': 0.37}
{'loss': 0.5313, 'grad_norm': 2.5414721965789795, 'learning_rate': 4.988989040507518e-06, 'epoch': 0.38}
{'loss': 0.4441, 'grad_norm': 2.4140472412109375, 'learning_rate': 4.985903691750697e-06, 'epoch': 0.39}
{'loss': 0.4778, 'grad_norm': 2.4907593727111816, 'learning_rate': 4.982438953062572e-06, 'epoch': 0.4}
{'loss': 0.4848, 'grad_norm': 2.579932928085327, 'learning_rate': 4.978595353219449e-06, 'epoch': 0.41}
{'loss': 0.4891, 'grad_norm': 2.5512266159057617, 'learning_rate': 4.974373478818098e-06, 'epoch': 0.42}
{'loss': 0.4954, 'grad_norm': 2.3293063640594482, 'learning_rate': 4.969773974186235e-06, 'epoch': 0.44}
{'loss': 0.5353, 'grad_norm': 2.6347479820251465, 'learning_rate': 4.964797541284175e-06, 'epoch': 0.45}
{'loss': 0.5726, 'grad_norm': 2.7719151973724365, 'learning_rate': 4.959444939597712e-06, 'epoch': 0.46}
{'loss': 0.5642, 'grad_norm': 2.1757211685180664, 'learning_rate': 4.953716986022204e-06, 'epoch': 0.47}
{'loss': 0.4429, 'grad_norm': 2.432244300842285, 'learning_rate': 4.947614554737904e-06, 'epoch': 0.48}
{'loss': 0.4683, 'grad_norm': 1.972844123840332, 'learning_rate': 4.941138577076538e-06, 'epoch': 0.49}
{'loss': 0.4385, 'grad_norm': 2.484992742538452, 'learning_rate': 4.934290041379182e-06, 'epoch': 0.5}
{'loss': 0.4935, 'grad_norm': 2.0424418449401855, 'learning_rate': 4.92706999284541e-06, 'epoch': 0.51}
{'loss': 0.4548, 'grad_norm': 2.3754308223724365, 'learning_rate': 4.9194795333737925e-06, 'epoch': 0.52}
{'loss': 0.5486, 'grad_norm': 3.0801432132720947, 'learning_rate': 4.911519821393718e-06, 'epoch': 0.53}
{'loss': 0.5121, 'grad_norm': 2.2712507247924805, 'learning_rate': 4.9031920716886035e-06, 'epoch': 0.54}
{'loss': 0.4495, 'grad_norm': 2.0000548362731934, 'learning_rate': 4.894497555210499e-06, 'epoch': 0.55}
{'loss': 0.5028, 'grad_norm': 2.590303897857666, 'learning_rate': 4.8854375988861134e-06, 'epoch': 0.56}
{'loss': 0.5193, 'grad_norm': 2.377298355102539, 'learning_rate': 4.87601358541431e-06, 'epoch': 0.57}
{'loss': 0.545, 'grad_norm': 2.966008186340332, 'learning_rate': 4.8662269530550825e-06, 'epoch': 0.58}
{'loss': 0.5219, 'grad_norm': 2.250293254852295, 'learning_rate': 4.856079195410046e-06, 'epoch': 0.59}
{'loss': 0.4725, 'grad_norm': 2.437361240386963, 'learning_rate': 4.845571861194501e-06, 'epoch': 0.6}
{'loss': 0.4232, 'grad_norm': 2.435994863510132, 'learning_rate': 4.834706554001065e-06, 'epoch': 0.62}
{'loss': 0.4834, 'grad_norm': 2.705902099609375, 'learning_rate': 4.823484932054937e-06, 'epoch': 0.63}
{'loss': 0.5302, 'grad_norm': 2.1471517086029053, 'learning_rate': 4.811908707960832e-06, 'epoch': 0.64}
{'loss': 0.494, 'grad_norm': 2.0760443210601807, 'learning_rate': 4.799979648441602e-06, 'epoch': 0.65}
{'loss': 0.487, 'grad_norm': 2.334944009780884, 'learning_rate': 4.787699574068611e-06, 'epoch': 0.66}
{'loss': 0.4911, 'grad_norm': 2.3444855213165283, 'learning_rate': 4.775070358983881e-06, 'epoch': 0.67}
{'loss': 0.4744, 'grad_norm': 2.127737045288086, 'learning_rate': 4.7620939306140696e-06, 'epoch': 0.68}
{'loss': 0.4789, 'grad_norm': 2.2132568359375, 'learning_rate': 4.748772269376312e-06, 'epoch': 0.69}
{'loss': 0.488, 'grad_norm': 1.9452372789382935, 'learning_rate': 4.735107408375977e-06, 'epoch': 0.7}
{'loss': 0.4462, 'grad_norm': 2.7268893718719482, 'learning_rate': 4.721101433096381e-06, 'epoch': 0.71}
{'loss': 0.5087, 'grad_norm': 2.1095452308654785, 'learning_rate': 4.706756481080511e-06, 'epoch': 0.72}
{'loss': 0.5304, 'grad_norm': 2.278555154800415, 'learning_rate': 4.692074741604795e-06, 'epoch': 0.73}
{'loss': 0.5177, 'grad_norm': 2.455960512161255, 'learning_rate': 4.677058455344989e-06, 'epoch': 0.74}
{'loss': 0.4841, 'grad_norm': 2.1136856079101562, 'learning_rate': 4.661709914034209e-06, 'epoch': 0.75}
{'loss': 0.4544, 'grad_norm': 2.296614646911621, 'learning_rate': 4.646031460113175e-06, 'epoch': 0.76}
{'loss': 0.4715, 'grad_norm': 1.8733782768249512, 'learning_rate': 4.630025486372715e-06, 'epoch': 0.77}
{'loss': 0.4824, 'grad_norm': 2.526837110519409, 'learning_rate': 4.613694435588589e-06, 'epoch': 0.79}
{'loss': 0.4852, 'grad_norm': 2.2026150226593018, 'learning_rate': 4.597040800148679e-06, 'epoch': 0.8}
{'loss': 0.4134, 'grad_norm': 2.214277744293213, 'learning_rate': 4.580067121672607e-06, 'epoch': 0.81}
{'loss': 0.4493, 'grad_norm': 2.623305559158325, 'learning_rate': 4.562775990623847e-06, 'epoch': 0.82}
{'loss': 0.5255, 'grad_norm': 2.9433794021606445, 'learning_rate': 4.5451700459143735e-06, 'epoch': 0.83}
{'loss': 0.4503, 'grad_norm': 2.143739938735962, 'learning_rate': 4.527251974501923e-06, 'epoch': 0.84}
{'loss': 0.4636, 'grad_norm': 2.1592986583709717, 'learning_rate': 4.509024510979917e-06, 'epoch': 0.85}
{'loss': 0.4685, 'grad_norm': 2.2622759342193604, 'learning_rate': 4.4904904371601176e-06, 'epoch': 0.86}
{'loss': 0.5248, 'grad_norm': 2.3408522605895996, 'learning_rate': 4.4716525816480816e-06, 'epoch': 0.87}
{'loss': 0.4747, 'grad_norm': 2.5351459980010986, 'learning_rate': 4.4525138194114644e-06, 'epoch': 0.88}
{'loss': 0.4198, 'grad_norm': 2.4038591384887695, 'learning_rate': 4.4330770713412555e-06, 'epoch': 0.89}
{'loss': 0.4545, 'grad_norm': 2.2719292640686035, 'learning_rate': 4.413345303805996e-06, 'epoch': 0.9}
{'loss': 0.5003, 'grad_norm': 3.1209301948547363, 'learning_rate': 4.393321528199072e-06, 'epoch': 0.91}
{'loss': 0.472, 'grad_norm': 2.414945125579834, 'learning_rate': 4.373008800479118e-06, 'epoch': 0.92}
{'loss': 0.4661, 'grad_norm': 2.21144437789917, 'learning_rate': 4.352410220703629e-06, 'epoch': 0.93}
{'loss': 0.4614, 'grad_norm': 2.210827589035034, 'learning_rate': 4.331528932555844e-06, 'epoch': 0.94}
{'loss': 0.4623, 'grad_norm': 2.403038740158081, 'learning_rate': 4.3103681228649626e-06, 'epoch': 0.95}
{'loss': 0.4902, 'grad_norm': 2.588114023208618, 'learning_rate': 4.288931021119788e-06, 'epoch': 0.97}
{'loss': 0.5047, 'grad_norm': 2.288691997528076, 'learning_rate': 4.267220898975848e-06, 'epoch': 0.98}
{'loss': 0.5358, 'grad_norm': 2.2487804889678955, 'learning_rate': 4.245241069756092e-06, 'epoch': 0.99}
{'loss': 0.4928, 'grad_norm': 2.5266008377075195, 'learning_rate': 4.222994887945219e-06, 'epoch': 1.0}
33%|███▎ | 943/2826 [1:32:25<2:31:12, 4.82s/it]
[INFO|trainer.py:3984] 2025-10-18 08:18:38,867 >> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943
[INFO|configuration_utils.py:419] 2025-10-18 08:18:38,877 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/config.json
[INFO|configuration_utils.py:911] 2025-10-18 08:18:38,879 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 08:18:54,649 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split into 4 checkpoint shards. You can find where each parameter has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 08:18:54,651 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 08:18:54,652 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/special_tokens_map.json
[2025-10-18 08:18:55,344] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step942 is about to be saved!
[2025-10-18 08:18:55,355] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-10-18 08:18:55,355] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-10-18 08:18:55,372] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-10-18 08:18:55,384] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-10-18 08:19:06,711] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-10-18 08:19:06,716] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-10-18 08:19:07,451] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step942 is ready now!
33%|███▎ | 944/2826 [1:33:09<8:41:27, 16.62s/it]
33%|███▎ | 945/2826 [1:33:15<7:01:57, 13.46s/it]
33%|███▎ | 946/2826 [1:33:20<5:46:28, 11.06s/it]
34%|███▎ | 947/2826 [1:33:27<5:04:11, 9.71s/it]
34%|███▎ | 948/2826 [1:33:32<4:23:27, 8.42s/it]
34%|███▎ | 949/2826 [1:33:39<4:01:52, 7.73s/it]
34%|███▎ | 950/2826 [1:33:45<3:49:22, 7.34s/it]
{'loss': 0.3963, 'grad_norm': 2.5962352752685547, 'learning_rate': 4.20048574867773e-06, 'epoch': 1.01}
34%|███▎ | 951/2826 [1:33:52<3:43:24, 7.15s/it]
34%|███▎ | 952/2826 [1:33:57<3:24:25, 6.55s/it]
34%|███▎ | 953/2826 [1:34:04<3:28:22, 6.68s/it]
34%|███▍ | 954/2826 [1:34:10<3:23:13, 6.51s/it]
34%|███▍ | 955/2826 [1:34:15<3:11:55, 6.15s/it]
34%|███▍ | 956/2826 [1:34:21<3:03:10, 5.88s/it]
34%|███▍ | 957/2826 [1:34:26<3:02:49, 5.87s/it]
34%|███▍ | 958/2826 [1:34:32<3:00:15, 5.79s/it]
34%|███▍ | 959/2826 [1:34:38<3:01:04, 5.82s/it]
34%|███▍ | 960/2826 [1:34:43<2:53:58, 5.59s/it]
{'loss': 0.3125, 'grad_norm': 2.707613229751587, 'learning_rate': 4.1777170872197725e-06, 'epoch': 1.02}
34%|███▍ | 961/2826 [1:34:48<2:49:24, 5.45s/it]
34%|███▍ | 962/2826 [1:34:55<2:59:06, 5.77s/it]
34%|███▍ | 963/2826 [1:35:00<2:52:13, 5.55s/it]
34%|███▍ | 964/2826 [1:35:06<2:57:05, 5.71s/it]
34%|███▍ | 965/2826 [1:35:12<3:00:15, 5.81s/it]
34%|███▍ | 966/2826 [1:35:17<2:54:05, 5.62s/it]
34%|███▍ | 967/2826 [1:35:22<2:50:43, 5.51s/it]
34%|███▍ | 968/2826 [1:35:28<2:51:12, 5.53s/it]
34%|███▍ | 969/2826 [1:35:35<3:06:10, 6.02s/it]
34%|███▍ | 970/2826 [1:35:40<3:00:38, 5.84s/it]
{'loss': 0.3457, 'grad_norm': 2.4237964153289795, 'learning_rate': 4.1546923784448646e-06, 'epoch': 1.03}
34%|███▍ | 971/2826 [1:35:46<2:56:18, 5.70s/it]
34%|███▍ | 972/2826 [1:35:52<3:02:30, 5.91s/it]
34%|███▍ | 973/2826 [1:35:57<2:56:13, 5.71s/it]
34%|███▍ | 974/2826 [1:36:02<2:51:26, 5.55s/it]
35%|███▍ | 975/2826 [1:36:09<2:58:15, 5.78s/it]
35%|███▍ | 976/2826 [1:36:15<3:01:07, 5.87s/it]
35%|███▍ | 977/2826 [1:36:21<2:59:05, 5.81s/it]
35%|███▍ | 978/2826 [1:36:27<3:05:22, 6.02s/it]
35%|███▍ | 979/2826 [1:36:33<3:02:09, 5.92s/it]
35%|███▍ | 980/2826 [1:36:38<2:55:14, 5.70s/it]
{'loss': 0.3029, 'grad_norm': 1.6531928777694702, 'learning_rate': 4.1314151363035705e-06, 'epoch': 1.04}
35%|███▍ | 981/2826 [1:36:43<2:49:17, 5.51s/it]
35%|███▍ | 982/2826 [1:36:48<2:46:14, 5.41s/it]
35%|███▍ | 983/2826 [1:36:54<2:50:11, 5.54s/it]
35%|███▍ | 984/2826 [1:37:00<2:56:04, 5.74s/it]
35%|███▍ | 985/2826 [1:37:06<2:56:10, 5.74s/it]
35%|███▍ | 986/2826 [1:37:12<3:02:53, 5.96s/it]
35%|███▍ | 987/2826 [1:37:19<3:06:18, 6.08s/it]
35%|███▍ | 988/2826 [1:37:25<3:05:34, 6.06s/it]
35%|███▍ | 989/2826 [1:37:30<2:58:32, 5.83s/it]
35%|███▌ | 990/2826 [1:37:36<3:02:40, 5.97s/it]
{'loss': 0.3289, 'grad_norm': 2.1669981479644775, 'learning_rate': 4.1078889132872145e-06, 'epoch': 1.05}
35%|███▌ | 991/2826 [1:37:42<2:57:27, 5.80s/it]
35%|███▌ | 992/2826 [1:37:48<2:57:00, 5.79s/it]
35%|███▌ | 993/2826 [1:37:54<2:58:57, 5.86s/it]
35%|███▌ | 994/2826 [1:37:59<2:54:16, 5.71s/it]
35%|███▌ | 995/2826 [1:38:05<3:02:00, 5.96s/it]
35%|███▌ | 996/2826 [1:38:12<3:09:19, 6.21s/it]
35%|███▌ | 997/2826 [1:38:17<2:58:04, 5.84s/it]
35%|███▌ | 998/2826 [1:38:23<2:52:42, 5.67s/it]
35%|███▌ | 999/2826 [1:38:30<3:08:35, 6.19s/it]
35%|███▌ | 1000/2826 [1:38:36<3:09:47, 6.24s/it]
{'loss': 0.3234, 'grad_norm': 2.445012092590332, 'learning_rate': 4.084117299885712e-06, 'epoch': 1.06}
35%|███▌ | 1001/2826 [1:38:42<3:04:16, 6.06s/it]
35%|███▌ | 1002/2826 [1:38:48<3:01:52, 5.98s/it]
35%|███▌ | 1003/2826 [1:38:54<3:00:59, 5.96s/it]
36%|███▌ | 1004/2826 [1:39:00<3:09:10, 6.23s/it]
36%|███▌ | 1005/2826 [1:39:07<3:07:42, 6.18s/it]
36%|███▌ | 1006/2826 [1:39:13<3:11:31, 6.31s/it]
36%|███▌ | 1007/2826 [1:39:18<2:59:50, 5.93s/it]
36%|███▌ | 1008/2826 [1:39:23<2:51:32, 5.66s/it]
36%|███▌ | 1009/2826 [1:39:29<2:49:19, 5.59s/it]
36%|███▌ | 1010/2826 [1:39:35<2:59:22, 5.93s/it]
{'loss': 0.3139, 'grad_norm': 2.0615527629852295, 'learning_rate': 4.060103924039599e-06, 'epoch': 1.07}
36%|███▌ | 1011/2826 [1:39:41<2:59:26, 5.93s/it]
36%|███▌ | 1012/2826 [1:39:47<2:56:02, 5.82s/it]
36%|███▌ | 1013/2826 [1:39:52<2:50:05, 5.63s/it]
36%|███▌ | 1014/2826 [1:39:57<2:47:28, 5.55s/it]
36%|███▌ | 1015/2826 [1:40:03<2:48:04, 5.57s/it]
36%|███▌ | 1016/2826 [1:40:09<2:47:44, 5.56s/it]
36%|███▌ | 1017/2826 [1:40:14<2:46:35, 5.53s/it]
36%|███▌ | 1018/2826 [1:40:20<2:48:22, 5.59s/it]
36%|███▌ | 1019/2826 [1:40:25<2:49:06, 5.62s/it]
36%|███▌ | 1020/2826 [1:40:32<2:55:54, 5.84s/it]
{'loss': 0.3144, 'grad_norm': 1.990400791168213, 'learning_rate': 4.035852450586352e-06, 'epoch': 1.08}
36%|███▌ | 1021/2826 [1:40:37<2:52:24, 5.73s/it]
36%|███▌ | 1022/2826 [1:40:43<2:52:07, 5.73s/it]
36%|███▌ | 1023/2826 [1:40:48<2:49:56, 5.66s/it]
36%|███▌ | 1024/2826 [1:40:54<2:49:48, 5.65s/it]
36%|███▋ | 1025/2826 [1:40:59<2:44:53, 5.49s/it]
36%|███▋ | 1026/2826 [1:41:05<2:46:54, 5.56s/it]
36%|███▋ | 1027/2826 [1:41:10<2:42:46, 5.43s/it]
36%|███▋ | 1028/2826 [1:41:17<2:58:46, 5.97s/it]
36%|███▋ | 1029/2826 [1:41:23<2:59:35, 6.00s/it]
36%|███▋ | 1030/2826 [1:41:29<2:58:41, 5.97s/it]
{'loss': 0.323, 'grad_norm': 2.5510122776031494, 'learning_rate': 4.011366580701073e-06, 'epoch': 1.09}
36%|███▋ | 1031/2826 [1:41:36<3:04:08, 6.16s/it]
37%|███▋ | 1032/2826 [1:41:42<3:07:15, 6.26s/it]
37%|███▋ | 1033/2826 [1:41:48<3:00:04, 6.03s/it]
37%|███▋ | 1034/2826 [1:41:54<3:03:10, 6.13s/it]
37%|███▋ | 1035/2826 [1:42:01<3:10:27, 6.38s/it]
37%|███▋ | 1036/2826 [1:42:07<3:05:23, 6.21s/it]
37%|███▋ | 1037/2826 [1:42:12<2:58:14, 5.98s/it]
37%|███▋ | 1038/2826 [1:42:18<2:50:23, 5.72s/it]
37%|███▋ | 1039/2826 [1:42:23<2:47:49, 5.64s/it]
37%|███▋ | 1040/2826 [1:42:28<2:42:38, 5.46s/it]
{'loss': 0.3694, 'grad_norm': 2.462083101272583, 'learning_rate': 3.9866500513316274e-06, 'epoch': 1.1}
37%|███▋ | 1041/2826 [1:42:34<2:50:41, 5.74s/it]
37%|███▋ | 1042/2826 [1:42:40<2:45:18, 5.56s/it]
37%|███▋ | 1043/2826 [1:42:45<2:45:43, 5.58s/it]
37%|███▋ | 1044/2826 [1:42:53<3:00:56, 6.09s/it]
37%|███▋ | 1045/2826 [1:42:58<2:59:06, 6.03s/it]
37%|███▋ | 1046/2826 [1:43:04<2:58:46, 6.03s/it]
37%|███▋ | 1047/2826 [1:43:10<2:51:13, 5.78s/it]
37%|███▋ | 1048/2826 [1:43:15<2:46:01, 5.60s/it]
37%|███▋ | 1049/2826 [1:43:20<2:45:51, 5.60s/it]
37%|███▋ | 1050/2826 [1:43:26<2:41:47, 5.47s/it]
{'loss': 0.3351, 'grad_norm': 2.4385085105895996, 'learning_rate': 3.961706634628323e-06, 'epoch': 1.11}
37%|███▋ | 1051/2826 [1:43:32<2:46:15, 5.62s/it]
37%|███▋ | 1052/2826 [1:43:37<2:41:15, 5.45s/it]
37%|███▋ | 1053/2826 [1:43:42<2:42:28, 5.50s/it]
37%|███▋ | 1054/2826 [1:43:49<2:51:56, 5.82s/it]
37%|███▋ | 1055/2826 [1:43:55<2:55:01, 5.93s/it]
37%|███▋ | 1056/2826 [1:44:01<2:52:50, 5.86s/it]
37%|███▋ | 1057/2826 [1:44:07<2:58:38, 6.06s/it]
37%|███▋ | 1058/2826 [1:44:14<3:03:47, 6.24s/it]
37%|███▋ | 1059/2826 [1:44:21<3:11:08, 6.49s/it]
38%|███▊ | 1060/2826 [1:44:26<3:00:03, 6.12s/it]
{'loss': 0.3459, 'grad_norm': 1.7553578615188599, 'learning_rate': 3.936540137368222e-06, 'epoch': 1.12}
38%|███▊ | 1061/2826 [1:44:32<3:01:32, 6.17s/it]
38%|███▊ | 1062/2826 [1:44:38<2:54:03, 5.92s/it]
38%|███▊ | 1063/2826 [1:44:45<3:05:06, 6.30s/it]
38%|███▊ | 1064/2826 [1:44:51<2:59:14, 6.10s/it]
38%|███▊ | 1065/2826 [1:44:56<2:50:04, 5.79s/it]
38%|███▊ | 1066/2826 [1:45:01<2:44:29, 5.61s/it]
38%|███▊ | 1067/2826 [1:45:07<2:44:30, 5.61s/it]
38%|███▊ | 1068/2826 [1:45:12<2:40:10, 5.47s/it]
38%|███▊ | 1069/2826 [1:45:18<2:50:12, 5.81s/it]
38%|███▊ | 1070/2826 [1:45:24<2:45:11, 5.64s/it]
{'loss': 0.3186, 'grad_norm': 2.513950824737549, 'learning_rate': 3.911154400374159e-06, 'epoch': 1.13}
38%|███▊ | 1071/2826 [1:45:29<2:44:36, 5.63s/it]
38%|███▊ | 1072/2826 [1:45:35<2:43:08, 5.58s/it]
38%|███▊ | 1073/2826 [1:45:40<2:38:13, 5.42s/it]
38%|███▊ | 1074/2826 [1:45:45<2:39:13, 5.45s/it]
38%|███▊ | 1075/2826 [1:45:50<2:37:30, 5.40s/it]
38%|███▊ | 1076/2826 [1:45:57<2:51:32, 5.88s/it]
38%|███▊ | 1077/2826 [1:46:04<2:56:12, 6.04s/it]
38%|███▊ | 1078/2826 [1:46:09<2:52:41, 5.93s/it]
38%|███▊ | 1079/2826 [1:46:14<2:44:03, 5.63s/it]
38%|███▊ | 1080/2826 [1:46:21<2:48:28, 5.79s/it]
{'loss': 0.3333, 'grad_norm': 2.6273515224456787, 'learning_rate': 3.885553297928573e-06, 'epoch': 1.15}
38%|███▊ | 1081/2826 [1:46:27<2:53:52, 5.98s/it]
38%|███▊ | 1082/2826 [1:46:34<3:02:20, 6.27s/it]
38%|███▊ | 1083/2826 [1:46:40<2:57:07, 6.10s/it]
38%|███▊ | 1084/2826 [1:46:45<2:53:59, 5.99s/it]
38%|███▊ | 1085/2826 [1:46:51<2:46:29, 5.74s/it]
38%|███▊ | 1086/2826 [1:46:56<2:43:57, 5.65s/it]
38%|███▊ | 1087/2826 [1:47:01<2:39:38, 5.51s/it]
38%|███▊ | 1088/2826 [1:47:08<2:50:11, 5.88s/it]
39%|███▊ | 1089/2826 [1:47:13<2:45:55, 5.73s/it]
39%|███▊ | 1090/2826 [1:47:19<2:42:49, 5.63s/it]
{'loss': 0.3137, 'grad_norm': 2.4155592918395996, 'learning_rate': 3.859740737182222e-06, 'epoch': 1.16}
39%|███▊ | 1091/2826 [1:47:26<2:54:06, 6.02s/it]
39%|███▊ | 1092/2826 [1:47:31<2:52:35, 5.97s/it]
39%|███▊ | 1093/2826 [1:47:37<2:45:28, 5.73s/it]
39%|███▊ | 1094/2826 [1:47:42<2:39:51, 5.54s/it]
39%|███▊ | 1095/2826 [1:47:49<2:50:25, 5.91s/it]
39%|███▉ | 1096/2826 [1:47:54<2:47:13, 5.80s/it]
39%|███▉ | 1097/2826 [1:48:00<2:47:03, 5.80s/it]
39%|███▉ | 1098/2826 [1:48:06<2:50:12, 5.91s/it]
39%|███▉ | 1099/2826 [1:48:12<2:52:30, 5.99s/it]
39%|███▉ | 1100/2826 [1:48:18<2:49:45, 5.90s/it]
{'loss': 0.3426, 'grad_norm': 2.719611644744873, 'learning_rate': 3.833720657557894e-06, 'epoch': 1.17}
39%|███▉ | 1101/2826 [1:48:23<2:42:16, 5.64s/it]
39%|███▉ | 1102/2826 [1:48:28<2:38:20, 5.51s/it]
39%|███▉ | 1103/2826 [1:48:33<2:36:09, 5.44s/it]
39%|███▉ | 1104/2826 [1:48:39<2:39:40, 5.56s/it]
39%|███▉ | 1105/2826 [1:48:44<2:36:35, 5.46s/it]
39%|███▉ | 1106/2826 [1:48:50<2:35:30, 5.42s/it]
39%|███▉ | 1107/2826 [1:48:56<2:39:28, 5.57s/it]
39%|███▉ | 1108/2826 [1:49:01<2:36:21, 5.46s/it]
39%|███▉ | 1109/2826 [1:49:07<2:44:47, 5.76s/it]
39%|███▉ | 1110/2826 [1:49:13<2:43:13, 5.71s/it]
{'loss': 0.3709, 'grad_norm': 2.5729358196258545, 'learning_rate': 3.807497030149181e-06, 'epoch': 1.18}
39%|███▉ | 1111/2826 [1:49:18<2:37:52, 5.52s/it]
39%|███▉ | 1112/2826 [1:49:25<2:51:38, 6.01s/it]
39%|███▉ | 1113/2826 [1:49:31<2:48:24, 5.90s/it]
39%|███▉ | 1114/2826 [1:49:36<2:44:09, 5.75s/it]
39%|███▉ | 1115/2826 [1:49:42<2:41:00, 5.65s/it]
39%|███▉ | 1116/2826 [1:49:47<2:36:20, 5.49s/it]
40%|███▉ | 1117/2826 [1:49:52<2:33:20, 5.38s/it]
40%|███▉ | 1118/2826 [1:49:59<2:43:55, 5.76s/it]
40%|███▉ | 1119/2826 [1:50:05<2:50:31, 5.99s/it]
40%|███▉ | 1120/2826 [1:50:12<2:59:47, 6.32s/it]
{'loss': 0.329, 'grad_norm': 1.9626141786575317, 'learning_rate': 3.7810738571144257e-06, 'epoch': 1.19}
40%|███▉ | 1121/2826 [1:50:19<3:00:26, 6.35s/it]
40%|███▉ | 1122/2826 [1:50:25<2:58:29, 6.28s/it]
40%|███▉ | 1123/2826 [1:50:31<3:00:30, 6.36s/it]
40%|███▉ | 1124/2826 [1:50:37<2:54:26, 6.15s/it]
40%|███▉ | 1125/2826 [1:50:43<2:49:42, 5.99s/it]
40%|███▉ | 1126/2826 [1:50:48<2:47:34, 5.91s/it]
40%|███▉ | 1127/2826 [1:50:53<2:40:54, 5.68s/it]
40%|███▉ | 1128/2826 [1:51:00<2:49:57, 6.01s/it]
40%|███▉ | 1129/2826 [1:51:06<2:44:59, 5.83s/it]
40%|███▉ | 1130/2826 [1:51:11<2:43:19, 5.78s/it]
{'loss': 0.305, 'grad_norm': 2.601951837539673, 'learning_rate': 3.7544551710659296e-06, 'epoch': 1.2}
40%|████ | 1131/2826 [1:51:17<2:38:51, 5.62s/it]
40%|████ | 1132/2826 [1:51:22<2:38:00, 5.60s/it]
40%|████ | 1133/2826 [1:51:27<2:33:40, 5.45s/it]
40%|████ | 1134/2826 [1:51:32<2:31:17, 5.37s/it]
40%|████ | 1135/2826 [1:51:38<2:31:27, 5.37s/it]
40%|████ | 1136/2826 [1:51:43<2:32:33, 5.42s/it]
40%|████ | 1137/2826 [1:51:49<2:34:00, 5.47s/it]
40%|████ | 1138/2826 [1:51:55<2:38:03, 5.62s/it]
40%|████ | 1139/2826 [1:52:02<2:54:30, 6.21s/it]
40%|████ | 1140/2826 [1:52:08<2:48:10, 5.98s/it]
{'loss': 0.3449, 'grad_norm': 2.4118540287017822, 'learning_rate': 3.7276450344545024e-06, 'epoch': 1.21}
40%|████ | 1141/2826 [1:52:14<2:49:24, 6.03s/it]
40%|████ | 1142/2826 [1:52:19<2:44:28, 5.86s/it]
40%|████ | 1143/2826 [1:52:26<2:46:48, 5.95s/it]
40%|████ | 1144/2826 [1:52:32<2:49:26, 6.04s/it]
41%|████ | 1145/2826 [1:52:39<2:54:45, 6.24s/it]
41%|████ | 1146/2826 [1:52:44<2:49:05, 6.04s/it]
41%|████ | 1147/2826 [1:52:50<2:44:31, 5.88s/it]
41%|████ | 1148/2826 [1:52:56<2:44:26, 5.88s/it]
41%|████ | 1149/2826 [1:53:01<2:38:38, 5.68s/it]
41%|████ | 1150/2826 [1:53:07<2:41:21, 5.78s/it]
{'loss': 0.3403, 'grad_norm': 2.5080604553222656, 'learning_rate': 3.7006475389494723e-06, 'epoch': 1.22}
41%|████ | 1151/2826 [1:53:13<2:44:09, 5.88s/it]
41%|████ | 1152/2826 [1:53:18<2:40:38, 5.76s/it]
41%|████ | 1153/2826 [1:53:24<2:36:03, 5.60s/it]
41%|████ | 1154/2826 [1:53:29<2:34:55, 5.56s/it]
41%|████ | 1155/2826 [1:53:35<2:34:28, 5.55s/it]
41%|████ | 1156/2826 [1:53:41<2:41:39, 5.81s/it]
41%|████ | 1157/2826 [1:53:48<2:50:40, 6.14s/it]
41%|████ | 1158/2826 [1:53:54<2:49:25, 6.09s/it]
41%|████ | 1159/2826 [1:54:00<2:46:10, 5.98s/it]
41%|████ | 1160/2826 [1:54:06<2:46:48, 6.01s/it]
{'loss': 0.3342, 'grad_norm': 2.6882951259613037, 'learning_rate': 3.6734668048142273e-06, 'epoch': 1.23}
41%|████ | 1161/2826 [1:54:12<2:51:09, 6.17s/it]
41%|████ | 1162/2826 [1:54:18<2:47:06, 6.03s/it]
41%|████ | 1163/2826 [1:54:23<2:39:00, 5.74s/it]
41%|████ | 1164/2826 [1:54:28<2:33:37, 5.55s/it]
41%|████ | 1165/2826 [1:54:33<2:32:15, 5.50s/it]
41%|████▏ | 1166/2826 [1:54:40<2:38:52, 5.74s/it]
41%|████▏ | 1167/2826 [1:54:46<2:41:32, 5.84s/it]
41%|████▏ | 1168/2826 [1:54:52<2:42:24, 5.88s/it]
41%|████▏ | 1169/2826 [1:54:59<2:53:32, 6.28s/it]
41%|████▏ | 1170/2826 [1:55:04<2:46:16, 6.02s/it]
{'loss': 0.3589, 'grad_norm': 2.3755247592926025, 'learning_rate': 3.646106980277394e-06, 'epoch': 1.24}
41%|████▏ | 1171/2826 [1:55:12<3:00:48, 6.56s/it]
41%|████▏ | 1172/2826 [1:55:18<2:53:22, 6.29s/it]
42%|████▏ | 1173/2826 [1:55:23<2:46:19, 6.04s/it]
42%|████▏ | 1174/2826 [1:55:29<2:44:25, 5.97s/it]
42%|████▏ | 1175/2826 [1:55:34<2:37:32, 5.73s/it]
42%|████▏ | 1176/2826 [1:55:39<2:32:24, 5.54s/it]
42%|████▏ | 1177/2826 [1:55:45<2:28:56, 5.42s/it]
42%|████▏ | 1178/2826 [1:55:51<2:37:45, 5.74s/it]
42%|████▏ | 1179/2826 [1:55:57<2:39:37, 5.82s/it]
42%|████▏ | 1180/2826 [1:56:02<2:35:59, 5.69s/it]
{'loss': 0.3447, 'grad_norm': 2.4138166904449463, 'learning_rate': 3.618572240899748e-06, 'epoch': 1.25}
42%|████▏ | 1181/2826 [1:56:08<2:33:22, 5.59s/it]
42%|████▏ | 1182/2826 [1:56:14<2:36:00, 5.69s/it]
42%|████▏ | 1183/2826 [1:56:19<2:33:35, 5.61s/it]
42%|████▏ | 1184/2826 [1:56:25<2:34:16, 5.64s/it]
42%|████▏ | 1185/2826 [1:56:30<2:32:58, 5.59s/it]
42%|████▏ | 1186/2826 [1:56:35<2:28:12, 5.42s/it]
42%|████▏ | 1187/2826 [1:56:41<2:29:00, 5.45s/it]
42%|████▏ | 1188/2826 [1:56:46<2:25:36, 5.33s/it]
42%|████▏ | 1189/2826 [1:56:51<2:25:56, 5.35s/it]
42%|████▏ | 1190/2826 [1:56:58<2:33:41, 5.64s/it]
{'loss': 0.3787, 'grad_norm': 2.6930105686187744, 'learning_rate': 3.5908667889369603e-06, 'epoch': 1.26}
42%|████▏ | 1191/2826 [1:57:03<2:33:44, 5.64s/it]
42%|████▏ | 1192/2826 [1:57:08<2:29:11, 5.48s/it]
42%|████▏ | 1193/2826 [1:57:15<2:35:09, 5.70s/it]
42%|████▏ | 1194/2826 [1:57:20<2:35:46, 5.73s/it]
42%|████▏ | 1195/2826 [1:57:28<2:51:33, 6.31s/it]
42%|████▏ | 1196/2826 [1:57:35<2:55:24, 6.46s/it]
42%|████▏ | 1197/2826 [1:57:41<2:50:09, 6.27s/it]
42%|████▏ | 1198/2826 [1:57:46<2:43:27, 6.02s/it]
42%|████▏ | 1199/2826 [1:57:52<2:39:53, 5.90s/it]
42%|████▏ | 1200/2826 [1:57:58<2:39:54, 5.90s/it]
{'loss': 0.3376, 'grad_norm': 2.732795476913452, 'learning_rate': 3.5629948526982563e-06, 'epoch': 1.27}
42%|████▏ | 1201/2826 [1:58:03<2:39:13, 5.88s/it]
43%|████▎ | 1202/2826 [1:58:09<2:39:19, 5.89s/it]
43%|████▎ | 1203/2826 [1:58:15<2:34:22, 5.71s/it]
43%|████▎ | 1204/2826 [1:58:20<2:29:42, 5.54s/it]
43%|████▎ | 1205/2826 [1:58:25<2:29:21, 5.53s/it]
43%|████▎ | 1206/2826 [1:58:32<2:39:28, 5.91s/it]
43%|████▎ | 1207/2826 [1:58:38<2:40:30, 5.95s/it]
43%|████▎ | 1208/2826 [1:58:44<2:39:40, 5.92s/it]
43%|████▎ | 1209/2826 [1:58:50<2:40:20, 5.95s/it]
43%|████▎ | 1210/2826 [1:58:55<2:33:47, 5.71s/it]
{'loss': 0.3461, 'grad_norm': 1.8468087911605835, 'learning_rate': 3.534960685901111e-06, 'epoch': 1.28}
43%|████▎ | 1211/2826 [1:59:00<2:30:04, 5.58s/it]
43%|████▎ | 1212/2826 [1:59:06<2:29:24, 5.55s/it]
43%|████▎ | 1213/2826 [1:59:12<2:31:21, 5.63s/it]
43%|████▎ | 1214/2826 [1:59:17<2:26:46, 5.46s/it]
43%|████▎ | 1215/2826 [1:59:22<2:24:28, 5.38s/it]
43%|████▎ | 1216/2826 [1:59:27<2:22:41, 5.32s/it]
43%|████▎ | 1217/2826 [1:59:33<2:25:32, 5.43s/it]
43%|████▎ | 1218/2826 [1:59:39<2:29:00, 5.56s/it]
43%|████▎ | 1219/2826 [1:59:44<2:26:55, 5.49s/it]
43%|████▎ | 1220/2826 [1:59:49<2:24:10, 5.39s/it]
{'loss': 0.3396, 'grad_norm': 2.3408284187316895, 'learning_rate': 3.506768567022062e-06, 'epoch': 1.29}
43%|████▎ | 1221/2826 [1:59:57<2:42:47, 6.09s/it]
43%|████▎ | 1222/2826 [2:00:02<2:35:46, 5.83s/it]
43%|████▎ | 1223/2826 [2:00:07<2:29:33, 5.60s/it]
43%|████▎ | 1224/2826 [2:00:13<2:34:39, 5.79s/it]
43%|████▎ | 1225/2826 [2:00:19<2:29:34, 5.61s/it]
43%|████▎ | 1226/2826 [2:00:25<2:34:01, 5.78s/it]
43%|████▎ | 1227/2826 [2:00:31<2:38:56, 5.96s/it]
43%|████▎ | 1228/2826 [2:00:38<2:47:10, 6.28s/it]
43%|████▎ | 1229/2826 [2:00:43<2:37:14, 5.91s/it]
44%|████▎ | 1230/2826 [2:00:49<2:38:19, 5.95s/it]
{'loss': 0.3364, 'grad_norm': 2.7420434951782227, 'learning_rate': 3.478422798643737e-06, 'epoch': 1.3}
44%|████▎ | 1231/2826 [2:00:54<2:31:24, 5.70s/it]
44%|████▎ | 1232/2826 [2:01:00<2:27:28, 5.55s/it]
44%|████▎ | 1233/2826 [2:01:05<2:24:12, 5.43s/it]
44%|████▎ | 1234/2826 [2:01:11<2:26:55, 5.54s/it]
44%|████▎ | 1235/2826 [2:01:16<2:22:12, 5.36s/it]
44%|████▎ | 1236/2826 [2:01:22<2:33:06, 5.78s/it]
44%|████▍ | 1237/2826 [2:01:30<2:45:42, 6.26s/it]
44%|████▍ | 1238/2826 [2:01:35<2:42:04, 6.12s/it]
44%|████▍ | 1239/2826 [2:01:41<2:36:57, 5.93s/it]
44%|████▍ | 1240/2826 [2:01:46<2:29:55, 5.67s/it]
{'loss': 0.3126, 'grad_norm': 2.634403705596924, 'learning_rate': 3.4499277067982177e-06, 'epoch': 1.32}
44%|████▍ | 1241/2826 [2:01:52<2:35:38, 5.89s/it]
44%|████▍ | 1242/2826 [2:01:58<2:30:23, 5.70s/it]
44%|████▍ | 1243/2826 [2:02:04<2:37:16, 5.96s/it]
44%|████▍ | 1244/2826 [2:02:09<2:30:31, 5.71s/it]
44%|████▍ | 1245/2826 [2:02:16<2:33:58, 5.84s/it]
44%|████▍ | 1246/2826 [2:02:23<2:47:57, 6.38s/it]
44%|████▍ | 1247/2826 [2:02:29<2:40:25, 6.10s/it]
44%|████▍ | 1248/2826 [2:02:34<2:33:38, 5.84s/it]
44%|████▍ | 1249/2826 [2:02:39<2:27:36, 5.62s/it]
44%|████▍ | 1250/2826 [2:02:44<2:22:53, 5.44s/it]
{'loss': 0.3092, 'grad_norm': 2.4217336177825928, 'learning_rate': 3.421287640306809e-06, 'epoch': 1.33}
44%|████▍ | 1251/2826 [2:02:49<2:22:20, 5.42s/it]
44%|████▍ | 1252/2826 [2:02:55<2:20:51, 5.37s/it]
44%|████▍ | 1253/2826 [2:03:00<2:20:22, 5.35s/it]
44%|████▍ | 1254/2826 [2:03:07<2:30:44, 5.75s/it]
44%|████▍ | 1255/2826 [2:03:12<2:31:47, 5.80s/it]
44%|████▍ | 1256/2826 [2:03:20<2:43:03, 6.23s/it]
44%|████▍ | 1257/2826 [2:03:25<2:38:33, 6.06s/it]
45%|████▍ | 1258/2826 [2:03:33<2:51:53, 6.58s/it]
45%|████▍ | 1259/2826 [2:03:41<3:04:50, 7.08s/it]
45%|████▍ | 1260/2826 [2:03:48<2:57:03, 6.78s/it]
{'loss': 0.3374, 'grad_norm': 1.7107937335968018, 'learning_rate': 3.3925069701163406e-06, 'epoch': 1.34}
45%|████▍ | 1261/2826 [2:03:55<2:59:04, 6.87s/it]
45%|████▍ | 1262/2826 [2:04:02<3:02:05, 6.99s/it]
45%|████▍ | 1263/2826 [2:04:08<2:54:45, 6.71s/it]
45%|████▍ | 1264/2826 [2:04:13<2:42:59, 6.26s/it]
45%|████▍ | 1265/2826 [2:04:20<2:47:00, 6.42s/it]
45%|████▍ | 1266/2826 [2:04:26<2:48:05, 6.47s/it]
45%|████▍ | 1267/2826 [2:04:32<2:43:49, 6.31s/it]
45%|████▍ | 1268/2826 [2:04:38<2:40:28, 6.18s/it]
45%|████▍ | 1269/2826 [2:04:43<2:32:35, 5.88s/it]
45%|████▍ | 1270/2826 [2:04:49<2:26:33, 5.65s/it]
{'loss': 0.3436, 'grad_norm': 2.1515822410583496, 'learning_rate': 3.363590088632085e-06, 'epoch': 1.35}
45%|████▍ | 1271/2826 [2:04:54<2:21:12, 5.45s/it]
45%|████▌ | 1272/2826 [2:05:01<2:33:43, 5.94s/it]
45%|████▌ | 1273/2826 [2:05:07<2:34:11, 5.96s/it]
45%|████▌ | 1274/2826 [2:05:13<2:40:23, 6.20s/it]
45%|████▌ | 1275/2826 [2:05:20<2:44:42, 6.37s/it]
45%|████▌ | 1276/2826 [2:05:26<2:38:25, 6.13s/it]
45%|████▌ | 1277/2826 [2:05:32<2:36:57, 6.08s/it]
45%|████▌ | 1278/2826 [2:05:38<2:37:00, 6.09s/it]
45%|████▌ | 1279/2826 [2:05:44<2:34:16, 5.98s/it]
45%|████▌ | 1280/2826 [2:05:49<2:28:39, 5.77s/it]
{'loss': 0.3283, 'grad_norm': 2.0105717182159424, 'learning_rate': 3.334541409047408e-06, 'epoch': 1.36}
45%|████▌ | 1281/2826 [2:05:54<2:25:09, 5.64s/it]
45%|████▌ | 1282/2826 [2:06:00<2:29:19, 5.80s/it]
45%|████▌ | 1283/2826 [2:06:07<2:35:08, 6.03s/it]
45%|████▌ | 1284/2826 [2:06:12<2:28:22, 5.77s/it]
45%|████▌ | 1285/2826 [2:06:17<2:23:35, 5.59s/it]
46%|████▌ | 1286/2826 [2:06:22<2:19:50, 5.45s/it]
46%|████▌ | 1287/2826 [2:06:28<2:22:23, 5.55s/it]
46%|████▌ | 1288/2826 [2:06:33<2:18:55, 5.42s/it]
46%|████▌ | 1289/2826 [2:06:40<2:27:16, 5.75s/it]
46%|████▌ | 1290/2826 [2:06:47<2:39:01, 6.21s/it]
{'loss': 0.358, 'grad_norm': 1.8952791690826416, 'learning_rate': 3.3053653646702422e-06, 'epoch': 1.37}
46%|████▌ | 1291/2826 [2:06:53<2:34:47, 6.05s/it]
46%|████▌ | 1292/2826 [2:06:58<2:28:17, 5.80s/it]
46%|████▌ | 1293/2826 [2:07:05<2:37:43, 6.17s/it]
46%|████▌ | 1294/2826 [2:07:10<2:30:34, 5.90s/it]
46%|████▌ | 1295/2826 [2:07:16<2:32:57, 5.99s/it]
46%|████▌ | 1296/2826 [2:07:23<2:35:31, 6.10s/it]
46%|████▌ | 1297/2826 [2:07:28<2:31:19, 5.94s/it]
46%|████▌ | 1298/2826 [2:07:34<2:25:42, 5.72s/it]
46%|████▌ | 1299/2826 [2:07:39<2:26:39, 5.76s/it]
46%|████▌ | 1300/2826 [2:07:45<2:21:48, 5.58s/it]
{'loss': 0.3084, 'grad_norm': 1.8639928102493286, 'learning_rate': 3.276066408246487e-06, 'epoch': 1.38}
46%|████▌ | 1301/2826 [2:07:50<2:19:52, 5.50s/it]
46%|████▌ | 1302/2826 [2:07:55<2:18:49, 5.47s/it]
46%|████▌ | 1303/2826 [2:08:01<2:22:56, 5.63s/it]
46%|████▌ | 1304/2826 [2:08:06<2:17:04, 5.40s/it]
46%|████▌ | 1305/2826 [2:08:12<2:19:13, 5.49s/it]
46%|████▌ | 1306/2826 [2:08:19<2:33:02, 6.04s/it]
46%|████▌ | 1307/2826 [2:08:25<2:30:09, 5.93s/it]
46%|████▋ | 1308/2826 [2:08:31<2:32:11, 6.02s/it]
46%|████▋ | 1309/2826 [2:08:38<2:38:16, 6.26s/it]
46%|████▋ | 1310/2826 [2:08:45<2:43:16, 6.46s/it]
{'loss': 0.3508, 'grad_norm': 2.563251256942749, 'learning_rate': 3.2466490112804484e-06, 'epoch': 1.39}
46%|████▋ | 1311/2826 [2:08:50<2:33:04, 6.06s/it]
46%|████▋ | 1312/2826 [2:08:55<2:24:59, 5.75s/it]
46%|████▋ | 1313/2826 [2:09:02<2:31:01, 5.99s/it]
46%|████▋ | 1314/2826 [2:09:07<2:26:05, 5.80s/it]
47%|████▋ | 1315/2826 [2:09:13<2:30:47, 5.99s/it]
47%|████▋ | 1316/2826 [2:09:19<2:31:49, 6.03s/it]
47%|████▋ | 1317/2826 [2:09:25<2:24:31, 5.75s/it]
47%|████▋ | 1318/2826 [2:09:30<2:19:42, 5.56s/it]
47%|████▋ | 1319/2826 [2:09:35<2:17:57, 5.49s/it]
47%|████▋ | 1320/2826 [2:09:41<2:17:45, 5.49s/it]
{'loss': 0.3215, 'grad_norm': 2.214616060256958, 'learning_rate': 3.217117663352417e-06, 'epoch': 1.4}
47%|████▋ | 1321/2826 [2:09:46<2:16:04, 5.42s/it]
47%|████▋ | 1322/2826 [2:09:51<2:14:27, 5.36s/it]
47%|████▋ | 1323/2826 [2:09:57<2:20:51, 5.62s/it]
47%|████▋ | 1324/2826 [2:10:04<2:25:54, 5.83s/it]
47%|████▋ | 1325/2826 [2:10:10<2:32:08, 6.08s/it]
47%|████▋ | 1326/2826 [2:10:15<2:24:22, 5.78s/it]
47%|████▋ | 1327/2826 [2:10:21<2:24:36, 5.79s/it]
47%|████▋ | 1328/2826 [2:10:27<2:28:57, 5.97s/it]
47%|████▋ | 1329/2826 [2:10:34<2:32:56, 6.13s/it]
47%|████▋ | 1330/2826 [2:10:40<2:29:08, 5.98s/it]
{'loss': 0.3193, 'grad_norm': 1.793468952178955, 'learning_rate': 3.187476871433478e-06, 'epoch': 1.41}
47%|████▋ | 1331/2826 [2:10:45<2:23:33, 5.76s/it]
47%|████▋ | 1332/2826 [2:10:50<2:18:35, 5.57s/it]
47%|████▋ | 1333/2826 [2:10:57<2:26:11, 5.88s/it]
47%|████▋ | 1334/2826 [2:11:02<2:21:18, 5.68s/it]
47%|████▋ | 1335/2826 [2:11:08<2:27:13, 5.92s/it]
47%|████▋ | 1336/2826 [2:11:14<2:24:04, 5.80s/it]
47%|████▋ | 1337/2826 [2:11:21<2:35:42, 6.27s/it]
47%|████▋ | 1338/2826 [2:11:27<2:30:18, 6.06s/it]
47%|████▋ | 1339/2826 [2:11:32<2:23:26, 5.79s/it]
47%|████▋ | 1340/2826 [2:11:37<2:18:12, 5.58s/it]
{'loss': 0.3019, 'grad_norm': 2.204789638519287, 'learning_rate': 3.1577311591976766e-06, 'epoch': 1.42}
47%|████▋ | 1341/2826 [2:11:42<2:14:46, 5.45s/it]
47%|████▋ | 1342/2826 [2:11:49<2:21:40, 5.73s/it]
48%|████▊ | 1343/2826 [2:11:54<2:19:57, 5.66s/it]
48%|████▊ | 1344/2826 [2:11:59<2:18:13, 5.60s/it]
48%|████▊ | 1345/2826 [2:12:05<2:14:23, 5.44s/it]
48%|████▊ | 1346/2826 [2:12:10<2:12:03, 5.35s/it]
48%|████▊ | 1347/2826 [2:12:16<2:17:11, 5.57s/it]
48%|████▊ | 1348/2826 [2:12:23<2:28:48, 6.04s/it]
48%|████▊ | 1349/2826 [2:12:29<2:28:56, 6.05s/it]
48%|████▊ | 1350/2826 [2:12:36<2:37:53, 6.42s/it]
{'loss': 0.3099, 'grad_norm': 2.307568311691284, 'learning_rate': 3.1278850663316307e-06, 'epoch': 1.43}
48%|████▊ | 1351/2826 [2:12:42<2:34:33, 6.29s/it]
48%|████▊ | 1352/2826 [2:12:49<2:36:56, 6.39s/it]
48%|████▊ | 1353/2826 [2:12:55<2:32:30, 6.21s/it]
48%|████▊ | 1354/2826 [2:13:00<2:24:05, 5.87s/it]
48%|████▊ | 1355/2826 [2:13:05<2:22:01, 5.79s/it]
48%|████▊ | 1356/2826 [2:13:11<2:17:15, 5.60s/it]
48%|████▊ | 1357/2826 [2:13:17<2:20:35, 5.74s/it]
48%|████▊ | 1358/2826 [2:13:23<2:24:48, 5.92s/it]
48%|████▊ | 1359/2826 [2:13:29<2:25:00, 5.93s/it]
48%|████▊ | 1360/2826 [2:13:35<2:27:52, 6.05s/it]
{'loss': 0.3085, 'grad_norm': 2.485848903656006, 'learning_rate': 3.0979431478416987e-06, 'epoch': 1.44}
48%|████▊ | 1361/2826 [2:13:42<2:33:04, 6.27s/it]
48%|████▊ | 1362/2826 [2:13:47<2:25:26, 5.96s/it]
48%|████▊ | 1363/2826 [2:13:53<2:21:55, 5.82s/it]
48%|████▊ | 1364/2826 [2:13:59<2:25:53, 5.99s/it]
48%|████▊ | 1365/2826 [2:14:04<2:20:01, 5.75s/it]
48%|████▊ | 1366/2826 [2:14:10<2:19:49, 5.75s/it]
48%|████▊ | 1367/2826 [2:14:15<2:17:32, 5.66s/it]
48%|████▊ | 1368/2826 [2:14:21<2:15:01, 5.56s/it]
48%|████▊ | 1369/2826 [2:14:26<2:11:38, 5.42s/it]
48%|████▊ | 1370/2826 [2:14:32<2:14:16, 5.53s/it]
{'loss': 0.3211, 'grad_norm': 1.953053593635559, 'learning_rate': 3.067909973358811e-06, 'epoch': 1.45}
49%|████▊ | 1371/2826 [2:14:38<2:23:06, 5.90s/it]
49%|████▊ | 1372/2826 [2:14:45<2:24:23, 5.96s/it]
49%|████▊ | 1373/2826 [2:14:50<2:22:01, 5.86s/it]
49%|████▊ | 1374/2826 [2:14:58<2:37:49, 6.52s/it]
49%|████▊ | 1375/2826 [2:15:05<2:40:58, 6.66s/it]
49%|████▊ | 1376/2826 [2:15:11<2:30:58, 6.25s/it]
49%|████▊ | 1377/2826 [2:15:17<2:30:50, 6.25s/it]
49%|████▉ | 1378/2826 [2:15:25<2:46:58, 6.92s/it]
49%|████▉ | 1379/2826 [2:15:32<2:42:59, 6.76s/it]
49%|████▉ | 1380/2826 [2:15:37<2:31:28, 6.29s/it]
{'loss': 0.3329, 'grad_norm': 2.2350101470947266, 'learning_rate': 3.0377901264410673e-06, 'epoch': 1.46}
49%|████▉ | 1381/2826 [2:15:43<2:29:53, 6.22s/it]
49%|████▉ | 1382/2826 [2:15:48<2:22:06, 5.90s/it]
49%|████▉ | 1383/2826 [2:15:54<2:23:53, 5.98s/it]
49%|████▉ | 1384/2826 [2:16:00<2:19:47, 5.82s/it]
49%|████▉ | 1385/2826 [2:16:07<2:34:03, 6.41s/it]
49%|████▉ | 1386/2826 [2:16:13<2:25:59, 6.08s/it]
49%|████▉ | 1387/2826 [2:16:18<2:21:43, 5.91s/it]
49%|████▉ | 1388/2826 [2:16:23<2:16:02, 5.68s/it]
49%|████▉ | 1389/2826 [2:16:29<2:17:27, 5.74s/it]
49%|████▉ | 1390/2826 [2:16:35<2:14:07, 5.60s/it]
{'loss': 0.3376, 'grad_norm': 2.542452335357666, 'learning_rate': 3.0075882038742133e-06, 'epoch': 1.47}
49%|████▉ | 1391/2826 [2:16:41<2:18:52, 5.81s/it]
49%|████▉ | 1392/2826 [2:16:47<2:18:47, 5.81s/it]
49%|████▉ | 1393/2826 [2:16:53<2:21:25, 5.92s/it]
49%|████▉ | 1394/2826 [2:16:59<2:19:25, 5.84s/it]
49%|████▉ | 1395/2826 [2:17:05<2:25:19, 6.09s/it]
49%|████▉ | 1396/2826 [2:17:11<2:23:11, 6.01s/it]
49%|████▉ | 1397/2826 [2:17:17<2:25:45, 6.12s/it]
49%|████▉ | 1398/2826 [2:17:24<2:28:19, 6.23s/it]
50%|████▉ | 1399/2826 [2:17:30<2:24:35, 6.08s/it]
50%|████▉ | 1400/2826 [2:17:35<2:17:12, 5.77s/it]
{'loss': 0.2896, 'grad_norm': 2.3203530311584473, 'learning_rate': 2.9773088149700923e-06, 'epoch': 1.48}
50%|████▉ | 1401/2826 [2:17:40<2:15:09, 5.69s/it]
50%|████▉ | 1402/2826 [2:17:46<2:19:12, 5.87s/it]
50%|████▉ | 1403/2826 [2:17:52<2:18:14, 5.83s/it]
50%|████▉ | 1404/2826 [2:17:59<2:25:04, 6.12s/it]
50%|████▉ | 1405/2826 [2:18:04<2:20:20, 5.93s/it]
50%|████▉ | 1406/2826 [2:18:10<2:14:53, 5.70s/it]
50%|████▉ | 1407/2826 [2:18:15<2:15:57, 5.75s/it]
50%|████▉ | 1408/2826 [2:18:21<2:12:40, 5.61s/it]
50%|████▉ | 1409/2826 [2:18:27<2:15:12, 5.73s/it]
50%|████▉ | 1410/2826 [2:18:32<2:10:58, 5.55s/it]
{'loss': 0.299, 'grad_norm': 1.9708584547042847, 'learning_rate': 2.9469565808631888e-06, 'epoch': 1.5}
50%|████▉ | 1411/2826 [2:18:39<2:24:05, 6.11s/it]
50%|████▉ | 1412/2826 [2:18:45<2:22:02, 6.03s/it]
50%|█████ | 1413/2826 [2:18:50<2:15:37, 5.76s/it]
50%|█████ | 1414/2826 [2:18:56<2:18:27, 5.88s/it]
50%|█████ | 1415/2826 [2:19:02<2:13:23, 5.67s/it]
50%|█████ | 1416/2826 [2:19:07<2:11:13, 5.58s/it]
50%|█████ | 1417/2826 [2:19:12<2:07:42, 5.44s/it]
50%|█████ | 1418/2826 [2:19:17<2:06:28, 5.39s/it]
50%|█████ | 1419/2826 [2:19:23<2:07:08, 5.42s/it]
50%|█████ | 1420/2826 [2:19:29<2:08:46, 5.50s/it]
{'loss': 0.3484, 'grad_norm': 2.63698148727417, 'learning_rate': 2.9165361338053683e-06, 'epoch': 1.51}
50%|█████ | 1421/2826 [2:19:34<2:06:34, 5.41s/it]
50%|█████ | 1422/2826 [2:19:40<2:15:20, 5.78s/it]
50%|█████ | 1423/2826 [2:19:46<2:11:04, 5.61s/it]
50%|█████ | 1424/2826 [2:19:51<2:06:47, 5.43s/it]
50%|█████ | 1425/2826 [2:19:56<2:05:00, 5.35s/it]
50%|█████ | 1426/2826 [2:20:02<2:09:19, 5.54s/it]
50%|█████ | 1427/2826 [2:20:08<2:11:39, 5.65s/it]
51%|█████ | 1428/2826 [2:20:15<2:26:46, 6.30s/it]
51%|█████ | 1429/2826 [2:20:22<2:25:56, 6.27s/it]
51%|█████ | 1430/2826 [2:20:28<2:28:01, 6.36s/it]
{'loss': 0.3316, 'grad_norm': 2.091648578643799, 'learning_rate': 2.886052116458918e-06, 'epoch': 1.52}
51%|█████ | 1431/2826 [2:20:34<2:21:56, 6.11s/it]
51%|█████ | 1432/2826 [2:20:39<2:18:51, 5.98s/it]
51%|█████ | 1433/2826 [2:20:45<2:13:38, 5.76s/it]
51%|█████ | 1434/2826 [2:20:50<2:09:06, 5.56s/it]
51%|█████ | 1435/2826 [2:20:57<2:16:52, 5.90s/it]
51%|█████ | 1436/2826 [2:21:03<2:20:39, 6.07s/it]
51%|█████ | 1437/2826 [2:21:08<2:15:37, 5.86s/it]
51%|█████ | 1438/2826 [2:21:13<2:10:24, 5.64s/it]
51%|█████ | 1439/2826 [2:21:20<2:18:44, 6.00s/it]
51%|█████ | 1440/2826 [2:21:25<2:12:38, 5.74s/it]
{'loss': 0.328, 'grad_norm': 1.955355167388916, 'learning_rate': 2.8555091811880004e-06, 'epoch': 1.53}
51%|█████ | 1441/2826 [2:21:33<2:22:46, 6.18s/it]
51%|█████ | 1442/2826 [2:21:38<2:16:39, 5.92s/it]
51%|█████ | 1443/2826 [2:21:44<2:18:11, 6.00s/it]
51%|█████ | 1444/2826 [2:21:50<2:16:05, 5.91s/it]
51%|█████ | 1445/2826 [2:21:57<2:24:43, 6.29s/it]
51%|█████ | 1446/2826 [2:22:04<2:28:06, 6.44s/it]
51%|█████ | 1447/2826 [2:22:09<2:21:07, 6.14s/it]
51%|█████ | 1448/2826 [2:22:16<2:22:48, 6.22s/it]
51%|█████▏ | 1449/2826 [2:22:22<2:22:37, 6.21s/it]
51%|█████▏ | 1450/2826 [2:22:28<2:24:52, 6.32s/it]
{'loss': 0.3215, 'grad_norm': 1.6724951267242432, 'learning_rate': 2.8249119893486252e-06, 'epoch': 1.54}
51%|█████▏ | 1451/2826 [2:22:34<2:22:48, 6.23s/it]
51%|█████▏ | 1452/2826 [2:22:41<2:24:20, 6.30s/it]
51%|█████▏ | 1453/2826 [2:22:47<2:20:23, 6.14s/it]
51%|█████▏ | 1454/2826 [2:22:52<2:14:36, 5.89s/it]
51%|█████▏ | 1455/2826 [2:22:57<2:10:07, 5.70s/it]
52%|█████▏ | 1456/2826 [2:23:04<2:20:31, 6.15s/it]
52%|█████▏ | 1457/2826 [2:23:11<2:23:45, 6.30s/it]
52%|█████▏ | 1458/2826 [2:23:18<2:25:44, 6.39s/it]
52%|█████▏ | 1459/2826 [2:23:23<2:18:02, 6.06s/it]
52%|█████▏ | 1460/2826 [2:23:28<2:13:18, 5.86s/it]
{'loss': 0.3118, 'grad_norm': 2.1872570514678955, 'learning_rate': 2.7942652105772516e-06, 'epoch': 1.55}
52%|█████▏ | 1461/2826 [2:23:35<2:18:35, 6.09s/it]
52%|█████▏ | 1462/2826 [2:23:41<2:18:42, 6.10s/it]
52%|█████▏ | 1463/2826 [2:23:47<2:14:19, 5.91s/it]
52%|█████▏ | 1464/2826 [2:23:52<2:10:18, 5.74s/it]
52%|█████▏ | 1465/2826 [2:23:59<2:21:58, 6.26s/it]
52%|█████▏ | 1466/2826 [2:24:07<2:31:09, 6.67s/it]
52%|█████▏ | 1467/2826 [2:24:14<2:30:24, 6.64s/it]
52%|█████▏ | 1468/2826 [2:24:19<2:21:01, 6.23s/it]
52%|█████▏ | 1469/2826 [2:24:25<2:18:52, 6.14s/it]
52%|█████▏ | 1470/2826 [2:24:31<2:19:27, 6.17s/it]
{'loss': 0.2973, 'grad_norm': 3.0710208415985107, 'learning_rate': 2.7635735220781214e-06, 'epoch': 1.56}
52%|█████▏ | 1471/2826 [2:24:37<2:15:11, 5.99s/it]
52%|█████▏ | 1472/2826 [2:24:42<2:08:45, 5.71s/it]
52%|█████▏ | 1473/2826 [2:24:47<2:04:36, 5.53s/it]
52%|█████▏ | 1474/2826 [2:24:53<2:09:37, 5.75s/it]
52%|█████▏ | 1475/2826 [2:24:59<2:14:02, 5.95s/it]
52%|█████▏ | 1476/2826 [2:25:05<2:12:32, 5.89s/it]
52%|█████▏ | 1477/2826 [2:25:13<2:23:37, 6.39s/it]
52%|█████▏ | 1478/2826 [2:25:19<2:24:09, 6.42s/it]
52%|█████▏ | 1479/2826 [2:25:24<2:15:55, 6.05s/it]
52%|█████▏ | 1480/2826 [2:25:31<2:16:18, 6.08s/it]
{'loss': 0.3423, 'grad_norm': 2.357663631439209, 'learning_rate': 2.7328416079094412e-06, 'epoch': 1.57}
52%|█████▏ | 1481/2826 [2:25:36<2:12:05, 5.89s/it]
52%|█████▏ | 1482/2826 [2:25:42<2:09:59, 5.80s/it]
52%|█████▏ | 1483/2826 [2:25:48<2:12:18, 5.91s/it]
53%|█████▎ | 1484/2826 [2:25:53<2:09:47, 5.80s/it]
53%|█████▎ | 1485/2826 [2:25:59<2:05:43, 5.63s/it]
53%|█████▎ | 1486/2826 [2:26:04<2:03:09, 5.51s/it]
53%|█████▎ | 1487/2826 [2:26:09<2:02:44, 5.50s/it]
53%|█████▎ | 1488/2826 [2:26:15<2:04:11, 5.57s/it]
53%|█████▎ | 1489/2826 [2:26:21<2:06:54, 5.70s/it]
53%|█████▎ | 1490/2826 [2:26:27<2:11:41, 5.91s/it]
{'loss': 0.3211, 'grad_norm': 2.2559144496917725, 'learning_rate': 2.7020741582685217e-06, 'epoch': 1.58}
53%|█████▎ | 1491/2826 [2:26:33<2:08:57, 5.80s/it]
53%|█████▎ | 1492/2826 [2:26:38<2:05:45, 5.66s/it]
53%|█████▎ | 1493/2826 [2:26:45<2:10:31, 5.88s/it]
53%|█████▎ | 1494/2826 [2:26:50<2:07:31, 5.74s/it]
53%|█████▎ | 1495/2826 [2:26:56<2:05:43, 5.67s/it]
53%|█████▎ | 1496/2826 [2:27:04<2:21:24, 6.38s/it]
53%|█████▎ | 1497/2826 [2:27:09<2:14:05, 6.05s/it]
53%|█████▎ | 1498/2826 [2:27:14<2:08:05, 5.79s/it]
53%|█████▎ | 1499/2826 [2:27:21<2:15:04, 6.11s/it]
53%|█████▎ | 1500/2826 [2:27:26<2:08:59, 5.84s/it]
{'loss': 0.2733, 'grad_norm': 2.0730817317962646, 'learning_rate': 2.6712758687759706e-06, 'epoch': 1.59}
53%|█████▎ | 1501/2826 [2:27:31<2:05:15, 5.67s/it]
53%|█████▎ | 1502/2826 [2:27:37<2:05:15, 5.68s/it]
53%|█████▎ | 1503/2826 [2:27:42<2:02:02, 5.53s/it]
53%|█████▎ | 1504/2826 [2:27:48<2:04:18, 5.64s/it]
53%|█████▎ | 1505/2826 [2:27:53<2:00:14, 5.46s/it]
53%|█████▎ | 1506/2826 [2:28:00<2:09:37, 5.89s/it]
53%|█████▎ | 1507/2826 [2:28:06<2:11:55, 6.00s/it]
53%|█████▎ | 1508/2826 [2:28:13<2:16:18, 6.21s/it]
53%|█████▎ | 1509/2826 [2:28:19<2:15:06, 6.16s/it]
53%|█████▎ | 1510/2826 [2:28:24<2:08:26, 5.86s/it]
{'loss': 0.338, 'grad_norm': 2.6119141578674316, 'learning_rate': 2.6404514397590657e-06, 'epoch': 1.6}
53%|█████▎ | 1511/2826 [2:28:31<2:15:14, 6.17s/it]
54%|█████▎ | 1512/2826 [2:28:37<2:15:48, 6.20s/it]
54%|█████▎ | 1513/2826 [2:28:44<2:20:11, 6.41s/it]
54%|█████▎ | 1514/2826 [2:28:50<2:14:05, 6.13s/it]
54%|█████▎ | 1515/2826 [2:28:55<2:09:32, 5.93s/it]
54%|█████▎ | 1516/2826 [2:29:01<2:09:27, 5.93s/it]
54%|█████▎ | 1517/2826 [2:29:07<2:10:28, 5.98s/it]
54%|█████▎ | 1518/2826 [2:29:14<2:12:44, 6.09s/it]
54%|█████▍ | 1519/2826 [2:29:19<2:07:23, 5.85s/it]
54%|█████▍ | 1520/2826 [2:29:24<2:02:12, 5.61s/it]
{'loss': 0.3124, 'grad_norm': 2.315875768661499, 'learning_rate': 2.6096055755344113e-06, 'epoch': 1.61}
54%|█████▍ | 1521/2826 [2:29:29<2:00:01, 5.52s/it]
54%|█████▍ | 1522/2826 [2:29:35<1:59:29, 5.50s/it]
54%|█████▍ | 1523/2826 [2:29:40<1:56:40, 5.37s/it]
54%|█████▍ | 1524/2826 [2:29:45<1:54:55, 5.30s/it]
54%|█████▍ | 1525/2826 [2:29:50<1:53:48, 5.25s/it]
54%|█████▍ | 1526/2826 [2:29:55<1:54:11, 5.27s/it]
54%|█████▍ | 1527/2826 [2:30:02<1:59:37, 5.53s/it]
54%|█████▍ | 1528/2826 [2:30:07<1:56:35, 5.39s/it]
54%|█████▍ | 1529/2826 [2:30:12<1:57:28, 5.43s/it]
54%|█████▍ | 1530/2826 [2:30:18<1:59:13, 5.52s/it]
{'loss': 0.3538, 'grad_norm': 2.2880892753601074, 'learning_rate': 2.578742983689973e-06, 'epoch': 1.62}
54%|█████▍ | 1531/2826 [2:30:24<1:59:53, 5.55s/it]
54%|█████▍ | 1532/2826 [2:30:30<2:04:08, 5.76s/it]
54%|█████▍ | 1533/2826 [2:30:35<2:00:30, 5.59s/it]
54%|█████▍ | 1534/2826 [2:30:42<2:10:51, 6.08s/it]
54%|█████▍ | 1535/2826 [2:30:49<2:12:32, 6.16s/it]
54%|█████▍ | 1536/2826 [2:30:55<2:14:24, 6.25s/it]
54%|█████▍ | 1537/2826 [2:31:01<2:15:44, 6.32s/it]
54%|█████▍ | 1538/2826 [2:31:07<2:09:14, 6.02s/it]
54%|█████▍ | 1539/2826 [2:31:13<2:09:01, 6.02s/it]
54%|█████▍ | 1540/2826 [2:31:20<2:15:35, 6.33s/it]
{'loss': 0.3353, 'grad_norm': 2.2615041732788086, 'learning_rate': 2.547868374366631e-06, 'epoch': 1.63}
55%|█████▍ | 1541/2826 [2:31:26<2:13:57, 6.25s/it]
55%|█████▍ | 1542/2826 [2:31:32<2:13:36, 6.24s/it]
55%|█████▍ | 1543/2826 [2:31:38<2:09:49, 6.07s/it]
55%|█████▍ | 1544/2826 [2:31:43<2:06:18, 5.91s/it]
55%|█████▍ | 1545/2826 [2:31:49<2:06:29, 5.92s/it]
55%|█████▍ | 1546/2826 [2:31:55<2:04:51, 5.85s/it]
55%|█████▍ | 1547/2826 [2:32:01<2:06:52, 5.95s/it]
55%|█████▍ | 1548/2826 [2:32:06<2:01:18, 5.69s/it]
55%|█████▍ | 1549/2826 [2:32:12<1:59:28, 5.61s/it]
55%|█████▍ | 1550/2826 [2:32:18<2:02:13, 5.75s/it]
{'loss': 0.302, 'grad_norm': 1.9062315225601196, 'learning_rate': 2.5169864595393295e-06, 'epoch': 1.64}
55%|█████▍ | 1551/2826 [2:32:24<2:03:14, 5.80s/it]
55%|█████▍ | 1552/2826 [2:32:29<1:59:34, 5.63s/it]
55%|█████▍ | 1553/2826 [2:32:35<2:03:47, 5.83s/it]
55%|█████▍ | 1554/2826 [2:32:41<2:03:06, 5.81s/it]
55%|█████▌ | 1555/2826 [2:32:48<2:12:37, 6.26s/it]
55%|█████▌ | 1556/2826 [2:32:54<2:09:23, 6.11s/it]
55%|█████▌ | 1557/2826 [2:32:59<2:03:22, 5.83s/it]
55%|█████▌ | 1558/2826 [2:33:05<2:01:59, 5.77s/it]
55%|█████▌ | 1559/2826 [2:33:10<1:58:14, 5.60s/it]
55%|█████▌ | 1560/2826 [2:33:16<1:57:39, 5.58s/it]
{'loss': 0.3124, 'grad_norm': 2.7016942501068115, 'learning_rate': 2.4861019522979537e-06, 'epoch': 1.65}
55%|█████▌ | 1561/2826 [2:33:22<2:02:07, 5.79s/it]
55%|█████▌ | 1562/2826 [2:33:28<2:01:58, 5.79s/it]
55%|█████▌ | 1563/2826 [2:33:34<2:07:10, 6.04s/it]
55%|█████▌ | 1564/2826 [2:33:40<2:05:25, 5.96s/it]
55%|█████▌ | 1565/2826 [2:33:46<2:06:55, 6.04s/it]
55%|█████▌ | 1566/2826 [2:33:52<2:02:21, 5.83s/it]
55%|█████▌ | 1567/2826 [2:33:57<1:58:05, 5.63s/it]
55%|█████▌ | 1568/2826 [2:34:02<1:54:14, 5.45s/it]
56%|█████▌ | 1569/2826 [2:34:07<1:53:19, 5.41s/it]
56%|█████▌ | 1570/2826 [2:34:13<1:53:32, 5.42s/it]
{'loss': 0.3497, 'grad_norm': 2.4618184566497803, 'learning_rate': 2.455219566128034e-06, 'epoch': 1.67}
56%|█████▌ | 1571/2826 [2:34:18<1:53:24, 5.42s/it]
56%|█████▌ | 1572/2826 [2:34:24<1:54:07, 5.46s/it]
56%|█████▌ | 1573/2826 [2:34:29<1:52:16, 5.38s/it]
56%|█████▌ | 1574/2826 [2:34:35<1:56:00, 5.56s/it]
56%|█████▌ | 1575/2826 [2:34:40<1:56:14, 5.58s/it]
56%|█████▌ | 1576/2826 [2:34:46<1:57:30, 5.64s/it]
56%|█████▌ | 1577/2826 [2:34:53<2:05:30, 6.03s/it]
56%|█████▌ | 1578/2826 [2:34:59<2:07:31, 6.13s/it]
56%|█████▌ | 1579/2826 [2:35:05<2:05:00, 6.01s/it]
56%|█████▌ | 1580/2826 [2:35:12<2:07:54, 6.16s/it]
{'loss': 0.3233, 'grad_norm': 2.8924951553344727, 'learning_rate': 2.4243440141913905e-06, 'epoch': 1.68}
56%|█████▌ | 1581/2826 [2:35:18<2:07:28, 6.14s/it]
56%|█████▌ | 1582/2826 [2:35:25<2:11:43, 6.35s/it]
56%|█████▌ | 1583/2826 [2:35:30<2:08:40, 6.21s/it]
56%|█████▌ | 1584/2826 [2:35:36<2:05:36, 6.07s/it]
56%|█████▌ | 1585/2826 [2:35:41<2:00:26, 5.82s/it]
56%|█████▌ | 1586/2826 [2:35:48<2:04:25, 6.02s/it]
56%|█████▌ | 1587/2826 [2:35:53<1:59:19, 5.78s/it]
56%|█████▌ | 1588/2826 [2:35:59<1:58:20, 5.74s/it]
56%|█████▌ | 1589/2826 [2:36:04<1:54:37, 5.56s/it]
56%|█████▋ | 1590/2826 [2:36:10<1:55:40, 5.62s/it]
{'loss': 0.3067, 'grad_norm': 2.32255482673645, 'learning_rate': 2.393480008606825e-06, 'epoch': 1.69}
56%|█████▋ | 1591/2826 [2:36:15<1:53:13, 5.50s/it]
56%|█████▋ | 1592/2826 [2:36:20<1:50:24, 5.37s/it]
56%|█████▋ | 1593/2826 [2:36:26<1:55:33, 5.62s/it]
56%|█████▋ | 1594/2826 [2:36:32<1:57:34, 5.73s/it]
56%|█████▋ | 1595/2826 [2:36:38<1:55:04, 5.61s/it]
56%|█████▋ | 1596/2826 [2:36:44<1:59:10, 5.81s/it]
57%|█████▋ | 1597/2826 [2:36:49<1:55:38, 5.65s/it]
57%|█████▋ | 1598/2826 [2:36:55<1:58:18, 5.78s/it]
57%|█████▋ | 1599/2826 [2:37:02<2:04:41, 6.10s/it]
57%|█████▋ | 1600/2826 [2:37:09<2:11:51, 6.45s/it]
{'loss': 0.2893, 'grad_norm': 1.8984359502792358, 'learning_rate': 2.3626322597309774e-06, 'epoch': 1.7}
57%|█████▋ | 1601/2826 [2:37:15<2:06:46, 6.21s/it]
57%|█████▋ | 1602/2826 [2:37:20<2:02:48, 6.02s/it]
57%|█████▋ | 1603/2826 [2:37:26<1:57:59, 5.79s/it]
57%|█████▋ | 1604/2826 [2:37:31<1:56:41, 5.73s/it]
57%|█████▋ | 1605/2826 [2:37:37<1:58:05, 5.80s/it]
57%|█████▋ | 1606/2826 [2:37:43<1:58:54, 5.85s/it]
57%|█████▋ | 1607/2826 [2:37:50<2:05:37, 6.18s/it]
57%|█████▋ | 1608/2826 [2:37:56<2:01:53, 6.00s/it]
57%|█████▋ | 1609/2826 [2:38:02<2:03:18, 6.08s/it]
57%|█████▋ | 1610/2826 [2:38:07<1:59:12, 5.88s/it]
{'loss': 0.2825, 'grad_norm': 1.8360289335250854, 'learning_rate': 2.331805475439445e-06, 'epoch': 1.71}
57%|█████▋ | 1611/2826 [2:38:13<1:54:49, 5.67s/it]
57%|█████▋ | 1612/2826 [2:38:18<1:51:13, 5.50s/it]
57%|█████▋ | 1613/2826 [2:38:23<1:49:05, 5.40s/it]
57%|█████▋ | 1614/2826 [2:38:28<1:49:12, 5.41s/it]
57%|█████▋ | 1615/2826 [2:38:35<1:55:45, 5.74s/it]
57%|█████▋ | 1616/2826 [2:38:40<1:52:21, 5.57s/it]
57%|█████▋ | 1617/2826 [2:38:45<1:49:27, 5.43s/it]
57%|█████▋ | 1618/2826 [2:38:50<1:48:33, 5.39s/it]
57%|█████▋ | 1619/2826 [2:38:55<1:46:10, 5.28s/it]
57%|█████▋ | 1620/2826 [2:39:01<1:44:54, 5.22s/it]
{'loss': 0.3379, 'grad_norm': 2.331998109817505, 'learning_rate': 2.3010043604082824e-06, 'epoch': 1.72}
57%|█████▋ | 1621/2826 [2:39:06<1:44:19, 5.19s/it]
57%|█████▋ | 1622/2826 [2:39:13<1:54:32, 5.71s/it]
57%|█████▋ | 1623/2826 [2:39:18<1:51:12, 5.55s/it]
57%|█████▋ | 1624/2826 [2:39:25<1:58:44, 5.93s/it]
58%|█████▊ | 1625/2826 [2:39:30<1:55:06, 5.75s/it]
58%|█████▊ | 1626/2826 [2:39:37<2:01:37, 6.08s/it]
58%|█████▊ | 1627/2826 [2:39:44<2:08:46, 6.44s/it]
58%|█████▊ | 1628/2826 [2:39:49<2:02:13, 6.12s/it]
58%|█████▊ | 1629/2826 [2:39:55<1:58:04, 5.92s/it]
58%|█████▊ | 1630/2826 [2:40:01<1:59:03, 5.97s/it]
{'loss': 0.301, 'grad_norm': 2.3304574489593506, 'learning_rate': 2.2702336153959925e-06, 'epoch': 1.73}
58%|█████▊ | 1631/2826 [2:40:07<2:00:10, 6.03s/it]
58%|█████▊ | 1632/2826 [2:40:13<1:56:44, 5.87s/it]
58%|█████▊ | 1633/2826 [2:40:18<1:51:58, 5.63s/it]
58%|█████▊ | 1634/2826 [2:40:23<1:48:38, 5.47s/it]
58%|█████▊ | 1635/2826 [2:40:28<1:49:03, 5.49s/it]
58%|█████▊ | 1636/2826 [2:40:33<1:46:44, 5.38s/it]
58%|█████▊ | 1637/2826 [2:40:39<1:47:59, 5.45s/it]
58%|█████▊ | 1638/2826 [2:40:45<1:48:22, 5.47s/it]
58%|█████▊ | 1639/2826 [2:40:50<1:45:44, 5.34s/it]
58%|█████▊ | 1640/2826 [2:40:58<2:02:47, 6.21s/it]
{'loss': 0.404, 'grad_norm': 2.534090518951416, 'learning_rate': 2.2394979365261134e-06, 'epoch': 1.74}
58%|█████▊ | 1641/2826 [2:41:04<2:01:16, 6.14s/it]
58%|█████▊ | 1642/2826 [2:41:10<2:02:52, 6.23s/it]
58%|█████▊ | 1643/2826 [2:41:16<1:57:24, 5.95s/it]
58%|█████▊ | 1644/2826 [2:41:22<1:57:17, 5.95s/it]
58%|█████▊ | 1645/2826 [2:41:27<1:55:10, 5.85s/it]
58%|█████▊ | 1646/2826 [2:41:33<1:54:44, 5.83s/it]
58%|█████▊ | 1647/2826 [2:41:39<1:53:42, 5.79s/it]
58%|█████▊ | 1648/2826 [2:41:44<1:49:27, 5.58s/it]
58%|█████▊ | 1649/2826 [2:41:51<1:56:58, 5.96s/it]
58%|█████▊ | 1650/2826 [2:41:56<1:52:02, 5.72s/it]
{'loss': 0.3242, 'grad_norm': 2.273122549057007, 'learning_rate': 2.208802014570507e-06, 'epoch': 1.75}
58%|█████▊ | 1651/2826 [2:42:02<1:52:53, 5.76s/it]
58%|█████▊ | 1652/2826 [2:42:07<1:49:24, 5.59s/it]
58%|█████▊ | 1653/2826 [2:42:13<1:50:33, 5.66s/it]
59%|█████▊ | 1654/2826 [2:42:19<1:57:09, 6.00s/it]
59%|█████▊ | 1655/2826 [2:42:25<1:54:50, 5.88s/it]
59%|█████▊ | 1656/2826 [2:42:32<1:58:24, 6.07s/it]
59%|█████▊ | 1657/2826 [2:42:37<1:53:51, 5.84s/it]
59%|█████▊ | 1658/2826 [2:42:42<1:49:11, 5.61s/it]
59%|█████▊ | 1659/2826 [2:42:49<1:56:21, 5.98s/it]
59%|█████▊ | 1660/2826 [2:42:55<2:00:40, 6.21s/it]
{'loss': 0.3152, 'grad_norm': 1.8859643936157227, 'learning_rate': 2.1781505342334775e-06, 'epoch': 1.76}
59%|█████▉ | 1661/2826 [2:43:03<2:06:10, 6.50s/it]
59%|█████▉ | 1662/2826 [2:43:08<2:02:01, 6.29s/it]
59%|█████▉ | 1663/2826 [2:43:16<2:06:22, 6.52s/it]
59%|█████▉ | 1664/2826 [2:43:22<2:05:53, 6.50s/it]
59%|█████▉ | 1665/2826 [2:43:27<1:59:42, 6.19s/it]
59%|█████▉ | 1666/2826 [2:43:34<1:59:38, 6.19s/it]
59%|█████▉ | 1667/2826 [2:43:39<1:54:03, 5.90s/it]
59%|█████▉ | 1668/2826 [2:43:44<1:49:47, 5.69s/it]
59%|█████▉ | 1669/2826 [2:43:51<1:57:06, 6.07s/it]
59%|█████▉ | 1670/2826 [2:43:56<1:51:35, 5.79s/it]
{'loss': 0.3302, 'grad_norm': 2.567715644836426, 'learning_rate': 2.147548173436805e-06, 'epoch': 1.77}
59%|█████▉ | 1671/2826 [2:44:01<1:48:16, 5.62s/it]
59%|█████▉ | 1672/2826 [2:44:07<1:47:12, 5.57s/it]
59%|█████▉ | 1673/2826 [2:44:13<1:50:40, 5.76s/it]
59%|█████▉ | 1674/2826 [2:44:21<2:03:28, 6.43s/it]
59%|█████▉ | 1675/2826 [2:44:27<2:00:11, 6.27s/it]
59%|█████▉ | 1676/2826 [2:44:32<1:54:02, 5.95s/it]
59%|█████▉ | 1677/2826 [2:44:38<1:51:21, 5.81s/it]
59%|█████▉ | 1678/2826 [2:44:44<1:53:40, 5.94s/it]
59%|█████▉ | 1679/2826 [2:44:49<1:47:55, 5.65s/it]
59%|█████▉ | 1680/2826 [2:44:55<1:49:36, 5.74s/it]
{'loss': 0.293, 'grad_norm': 2.7930519580841064, 'learning_rate': 2.116999602605814e-06, 'epoch': 1.78}
59%|█████▉ | 1681/2826 [2:45:00<1:48:18, 5.68s/it]
60%|█████▉ | 1682/2826 [2:45:07<1:54:42, 6.02s/it]
60%|█████▉ | 1683/2826 [2:45:13<1:51:58, 5.88s/it]
60%|█████▉ | 1684/2826 [2:45:18<1:48:19, 5.69s/it]
60%|█████▉ | 1685/2826 [2:45:25<1:57:26, 6.18s/it]
60%|█████▉ | 1686/2826 [2:45:32<1:58:25, 6.23s/it]
60%|█████▉ | 1687/2826 [2:45:38<1:56:40, 6.15s/it]
60%|█████▉ | 1688/2826 [2:45:44<1:57:54, 6.22s/it]
60%|█████▉ | 1689/2826 [2:45:49<1:51:06, 5.86s/it]
60%|█████▉ | 1690/2826 [2:45:54<1:46:55, 5.65s/it]
{'loss': 0.2683, 'grad_norm': 2.646296262741089, 'learning_rate': 2.086509483956594e-06, 'epoch': 1.79}
60%|█████▉ | 1691/2826 [2:46:00<1:47:50, 5.70s/it]
60%|█████▉ | 1692/2826 [2:46:05<1:45:04, 5.56s/it]
60%|█████▉ | 1693/2826 [2:46:11<1:45:16, 5.58s/it]
60%|█████▉ | 1694/2826 [2:46:16<1:42:07, 5.41s/it]
60%|█████▉ | 1695/2826 [2:46:21<1:43:29, 5.49s/it]
60%|██████ | 1696/2826 [2:46:27<1:42:21, 5.44s/it]
60%|██████ | 1697/2826 [2:46:33<1:44:08, 5.53s/it]
60%|██████ | 1698/2826 [2:46:39<1:47:26, 5.71s/it]
60%|██████ | 1699/2826 [2:46:44<1:44:49, 5.58s/it]
60%|██████ | 1700/2826 [2:46:50<1:48:47, 5.80s/it]
{'loss': 0.313, 'grad_norm': 2.3010053634643555, 'learning_rate': 2.056082470784469e-06, 'epoch': 1.8}
60%|██████ | 1701/2826 [2:46:55<1:44:42, 5.58s/it]
60%|██████ | 1702/2826 [2:47:02<1:49:02, 5.82s/it]
60%|██████ | 1703/2826 [2:47:08<1:52:33, 6.01s/it]
60%|██████ | 1704/2826 [2:47:14<1:48:40, 5.81s/it]
60%|██████ | 1705/2826 [2:47:19<1:48:14, 5.79s/it]
60%|██████ | 1706/2826 [2:47:25<1:49:15, 5.85s/it]
60%|██████ | 1707/2826 [2:47:30<1:45:38, 5.66s/it]
60%|██████ | 1708/2826 [2:47:36<1:43:37, 5.56s/it]
60%|██████ | 1709/2826 [2:47:42<1:45:30, 5.67s/it]
61%|██████ | 1710/2826 [2:47:47<1:42:47, 5.53s/it]
{'loss': 0.262, 'grad_norm': 2.3864669799804688, 'learning_rate': 2.0257232067538213e-06, 'epoch': 1.81}
61%|██████ | 1711/2826 [2:47:53<1:43:52, 5.59s/it]
61%|██████ | 1712/2826 [2:47:59<1:45:58, 5.71s/it]
61%|██████ | 1713/2826 [2:48:04<1:42:33, 5.53s/it]
61%|██████ | 1714/2826 [2:48:10<1:46:20, 5.74s/it]
61%|██████ | 1715/2826 [2:48:16<1:46:34, 5.76s/it]
61%|██████ | 1716/2826 [2:48:21<1:43:35, 5.60s/it]
61%|██████ | 1717/2826 [2:48:27<1:43:00, 5.57s/it]
61%|██████ | 1718/2826 [2:48:33<1:45:42, 5.72s/it]
61%|██████ | 1719/2826 [2:48:38<1:44:57, 5.69s/it]
61%|██████ | 1720/2826 [2:48:43<1:42:34, 5.56s/it]
{'loss': 0.3457, 'grad_norm': 2.63028883934021, 'learning_rate': 1.9954363251894007e-06, 'epoch': 1.82}
61%|██████ | 1721/2826 [2:48:50<1:46:39, 5.79s/it]
61%|██████ | 1722/2826 [2:48:56<1:51:04, 6.04s/it]
61%|██████ | 1723/2826 [2:49:02<1:46:15, 5.78s/it]
61%|██████ | 1724/2826 [2:49:07<1:46:08, 5.78s/it]
61%|██████ | 1725/2826 [2:49:13<1:47:50, 5.88s/it]
61%|██████ | 1726/2826 [2:49:19<1:46:57, 5.83s/it]
61%|██████ | 1727/2826 [2:49:24<1:43:12, 5.64s/it]
61%|██████ | 1728/2826 [2:49:31<1:47:26, 5.87s/it]
61%|██████ | 1729/2826 [2:49:38<1:52:35, 6.16s/it]
61%|██████ | 1730/2826 [2:49:43<1:50:19, 6.04s/it]
{'loss': 0.2739, 'grad_norm': 2.0011484622955322, 'learning_rate': 1.9652264483691933e-06, 'epoch': 1.84}
61%|██████▏ | 1731/2826 [2:49:49<1:46:41, 5.85s/it]
61%|██████▏ | 1732/2826 [2:49:54<1:42:24, 5.62s/it]
61%|██████▏ | 1733/2826 [2:49:59<1:40:58, 5.54s/it]
61%|██████▏ | 1734/2826 [2:50:06<1:48:38, 5.97s/it]
61%|██████▏ | 1735/2826 [2:50:12<1:46:45, 5.87s/it]
61%|██████▏ | 1736/2826 [2:50:19<1:52:13, 6.18s/it]
61%|██████▏ | 1737/2826 [2:50:24<1:47:58, 5.95s/it]
62%|██████▏ | 1738/2826 [2:50:30<1:46:00, 5.85s/it]
62%|██████▏ | 1739/2826 [2:50:35<1:41:12, 5.59s/it]
62%|██████▏ | 1740/2826 [2:50:40<1:38:18, 5.43s/it]
{'loss': 0.3109, 'grad_norm': 2.6818690299987793, 'learning_rate': 1.9350981868189944e-06, 'epoch': 1.85}
62%|██████▏ | 1741/2826 [2:50:45<1:35:50, 5.30s/it]
62%|██████▏ | 1742/2826 [2:50:50<1:34:29, 5.23s/it]
62%|██████▏ | 1743/2826 [2:50:55<1:34:51, 5.26s/it]
62%|██████▏ | 1744/2826 [2:51:01<1:37:15, 5.39s/it]
62%|██████▏ | 1745/2826 [2:51:06<1:36:09, 5.34s/it]
62%|██████▏ | 1746/2826 [2:51:13<1:42:39, 5.70s/it]
62%|██████▏ | 1747/2826 [2:51:18<1:39:15, 5.52s/it]
62%|██████▏ | 1748/2826 [2:51:23<1:40:05, 5.57s/it]
62%|██████▏ | 1749/2826 [2:51:30<1:46:27, 5.93s/it]
62%|██████▏ | 1750/2826 [2:51:36<1:47:42, 6.01s/it]
{'loss': 0.3269, 'grad_norm': 2.6978225708007812, 'learning_rate': 1.9050561386087618e-06, 'epoch': 1.86}
62%|██████▏ | 1751/2826 [2:51:41<1:42:26, 5.72s/it]
62%|██████▏ | 1752/2826 [2:51:48<1:45:14, 5.88s/it]
62%|██████▏ | 1753/2826 [2:51:53<1:41:24, 5.67s/it]
62%|██████▏ | 1754/2826 [2:51:58<1:38:46, 5.53s/it]
62%|██████▏ | 1755/2826 [2:52:03<1:36:29, 5.41s/it]
62%|██████▏ | 1756/2826 [2:52:10<1:46:05, 5.95s/it]
62%|██████▏ | 1757/2826 [2:52:17<1:49:18, 6.14s/it]
62%|██████▏ | 1758/2826 [2:52:22<1:44:35, 5.88s/it]
62%|██████▏ | 1759/2826 [2:52:29<1:49:05, 6.13s/it]
62%|██████▏ | 1760/2826 [2:52:34<1:44:20, 5.87s/it]
{'loss': 0.3617, 'grad_norm': 2.578031301498413, 'learning_rate': 1.8751048886508711e-06, 'epoch': 1.87}
62%|██████▏ | 1761/2826 [2:52:41<1:49:08, 6.15s/it]
62%|██████▏ | 1762/2826 [2:52:46<1:45:00, 5.92s/it]
62%|██████▏ | 1763/2826 [2:52:53<1:45:45, 5.97s/it]
62%|██████▏ | 1764/2826 [2:52:58<1:43:38, 5.86s/it]
62%|██████▏ | 1765/2826 [2:53:03<1:39:37, 5.63s/it]
62%|██████▏ | 1766/2826 [2:53:10<1:46:18, 6.02s/it]
63%|██████▎ | 1767/2826 [2:53:15<1:40:50, 5.71s/it]
63%|██████▎ | 1768/2826 [2:53:20<1:38:28, 5.58s/it]
63%|██████▎ | 1769/2826 [2:53:26<1:37:01, 5.51s/it]
63%|██████▎ | 1770/2826 [2:53:33<1:44:39, 5.95s/it]
{'loss': 0.3228, 'grad_norm': 2.5525052547454834, 'learning_rate': 1.8452490080003888e-06, 'epoch': 1.88}
63%|██████▎ | 1771/2826 [2:53:39<1:48:00, 6.14s/it]
63%|██████▎ | 1772/2826 [2:53:45<1:47:00, 6.09s/it]
63%|██████▎ | 1773/2826 [2:53:51<1:42:38, 5.85s/it]
63%|██████▎ | 1774/2826 [2:53:57<1:44:30, 5.96s/it]
63%|██████▎ | 1775/2826 [2:54:02<1:41:07, 5.77s/it]
63%|██████▎ | 1776/2826 [2:54:08<1:41:44, 5.81s/it]
63%|██████▎ | 1777/2826 [2:54:13<1:38:22, 5.63s/it]
63%|██████▎ | 1778/2826 [2:54:21<1:47:25, 6.15s/it]
63%|██████▎ | 1779/2826 [2:54:26<1:43:40, 5.94s/it]
63%|██████▎ | 1780/2826 [2:54:32<1:42:29, 5.88s/it]
{'loss': 0.2857, 'grad_norm': 2.1095635890960693, 'learning_rate': 1.8154930531574521e-06, 'epoch': 1.89}
63%|██████▎ | 1781/2826 [2:54:37<1:40:35, 5.78s/it]
63%|██████▎ | 1782/2826 [2:54:44<1:44:03, 5.98s/it]
63%|██████▎ | 1783/2826 [2:54:49<1:41:01, 5.81s/it]
63%|██████▎ | 1784/2826 [2:54:54<1:36:51, 5.58s/it]
63%|██████▎ | 1785/2826 [2:55:01<1:41:36, 5.86s/it]
63%|██████▎ | 1786/2826 [2:55:06<1:39:35, 5.75s/it]
63%|██████▎ | 1787/2826 [2:55:12<1:37:48, 5.65s/it]
63%|██████▎ | 1788/2826 [2:55:17<1:34:36, 5.47s/it]
63%|██████▎ | 1789/2826 [2:55:22<1:35:07, 5.50s/it]
63%|██████▎ | 1790/2826 [2:55:29<1:42:31, 5.94s/it]
{'loss': 0.3622, 'grad_norm': 2.3965845108032227, 'learning_rate': 1.785841565371868e-06, 'epoch': 1.9}
63%|██████▎ | 1791/2826 [2:55:35<1:40:40, 5.84s/it]
63%|██████▎ | 1792/2826 [2:55:41<1:40:29, 5.83s/it]
63%|██████▎ | 1793/2826 [2:55:46<1:38:35, 5.73s/it]
63%|██████▎ | 1794/2826 [2:55:52<1:37:03, 5.64s/it]
64%|██████▎ | 1795/2826 [2:55:59<1:45:40, 6.15s/it]
64%|██████▎ | 1796/2826 [2:56:04<1:42:18, 5.96s/it]
64%|██████▎ | 1797/2826 [2:56:10<1:37:45, 5.70s/it]
64%|██████▎ | 1798/2826 [2:56:15<1:38:16, 5.74s/it]
64%|██████▎ | 1799/2826 [2:56:20<1:34:52, 5.54s/it]
64%|██████▎ | 1800/2826 [2:56:27<1:38:07, 5.74s/it]
{'loss': 0.3031, 'grad_norm': 2.293715238571167, 'learning_rate': 1.7562990699500482e-06, 'epoch': 1.91}
64%|██████▎ | 1801/2826 [2:56:33<1:42:57, 6.03s/it]
64%|██████▍ | 1802/2826 [2:56:38<1:38:10, 5.75s/it]
64%|██████▍ | 1803/2826 [2:56:44<1:36:22, 5.65s/it]
64%|██████▍ | 1804/2826 [2:56:50<1:40:07, 5.88s/it]
64%|██████▍ | 1805/2826 [2:56:56<1:37:25, 5.73s/it]
64%|██████▍ | 1806/2826 [2:57:01<1:36:31, 5.68s/it]
64%|██████▍ | 1807/2826 [2:57:08<1:43:01, 6.07s/it]
64%|██████▍ | 1808/2826 [2:57:14<1:41:20, 5.97s/it]
64%|██████▍ | 1809/2826 [2:57:20<1:40:21, 5.92s/it]
64%|██████▍ | 1810/2826 [2:57:25<1:36:22, 5.69s/it]
{'loss': 0.3019, 'grad_norm': 2.026015281677246, 'learning_rate': 1.7268700755643708e-06, 'epoch': 1.92}
64%|██████▍ | 1811/2826 [2:57:31<1:36:47, 5.72s/it]
64%|██████▍ | 1812/2826 [2:57:38<1:47:09, 6.34s/it]
64%|██████▍ | 1813/2826 [2:57:45<1:46:12, 6.29s/it]
64%|██████▍ | 1814/2826 [2:57:51<1:44:59, 6.22s/it]
64%|██████▍ | 1815/2826 [2:57:56<1:38:55, 5.87s/it]
64%|██████▍ | 1816/2826 [2:58:01<1:36:12, 5.72s/it]
64%|██████▍ | 1817/2826 [2:58:06<1:32:43, 5.51s/it]
64%|██████▍ | 1818/2826 [2:58:13<1:36:57, 5.77s/it]
64%|██████▍ | 1819/2826 [2:58:19<1:38:58, 5.90s/it]
64%|██████▍ | 1820/2826 [2:58:25<1:39:09, 5.91s/it]
{'loss': 0.3047, 'grad_norm': 1.7175791263580322, 'learning_rate': 1.6975590735650812e-06, 'epoch': 1.93}
64%|██████▍ | 1821/2826 [2:58:30<1:35:36, 5.71s/it]
64%|██████▍ | 1822/2826 [2:58:36<1:36:03, 5.74s/it]
65%|██████▍ | 1823/2826 [2:58:41<1:34:19, 5.64s/it]
65%|██████▍ | 1824/2826 [2:58:46<1:31:50, 5.50s/it]
65%|██████▍ | 1825/2826 [2:58:52<1:31:48, 5.50s/it]
65%|██████▍ | 1826/2826 [2:58:58<1:34:17, 5.66s/it]
65%|██████▍ | 1827/2826 [2:59:04<1:35:07, 5.71s/it]
65%|██████▍ | 1828/2826 [2:59:10<1:39:48, 6.00s/it]
65%|██████▍ | 1829/2826 [2:59:16<1:38:58, 5.96s/it]
65%|██████▍ | 1830/2826 [2:59:22<1:36:23, 5.81s/it]
{'loss': 0.3048, 'grad_norm': 2.0024490356445312, 'learning_rate': 1.668370537294841e-06, 'epoch': 1.94}
65%|██████▍ | 1831/2826 [2:59:27<1:35:03, 5.73s/it]
65%|██████▍ | 1832/2826 [2:59:33<1:34:32, 5.71s/it]
65%|██████▍ | 1833/2826 [2:59:38<1:31:26, 5.52s/it]
65%|██████▍ | 1834/2826 [2:59:44<1:32:25, 5.59s/it]
65%|██████▍ | 1835/2826 [2:59:49<1:29:51, 5.44s/it]
65%|██████▍ | 1836/2826 [2:59:55<1:35:45, 5.80s/it]
65%|██████▌ | 1837/2826 [3:00:02<1:40:11, 6.08s/it]
65%|██████▌ | 1838/2826 [3:00:08<1:39:40, 6.05s/it]
65%|██████▌ | 1839/2826 [3:00:14<1:37:52, 5.95s/it]
65%|██████▌ | 1840/2826 [3:00:20<1:38:25, 5.99s/it]
{'loss': 0.3205, 'grad_norm': 2.8226239681243896, 'learning_rate': 1.6393089214060204e-06, 'epoch': 1.95}
65%|██████▌ | 1841/2826 [3:00:25<1:34:10, 5.74s/it]
65%|██████▌ | 1842/2826 [3:00:30<1:31:41, 5.59s/it]
65%|██████▌ | 1843/2826 [3:00:36<1:32:12, 5.63s/it]
65%|██████▌ | 1844/2826 [3:00:43<1:39:03, 6.05s/it]
65%|██████▌ | 1845/2826 [3:00:48<1:34:42, 5.79s/it]
65%|██████▌ | 1846/2826 [3:00:54<1:33:47, 5.74s/it]
65%|██████▌ | 1847/2826 [3:01:01<1:39:31, 6.10s/it]
65%|██████▌ | 1848/2826 [3:01:08<1:44:20, 6.40s/it]
65%|██████▌ | 1849/2826 [3:01:16<1:51:30, 6.85s/it]
65%|██████▌ | 1850/2826 [3:01:23<1:52:02, 6.89s/it]
{'loss': 0.321, 'grad_norm': 1.9452221393585205, 'learning_rate': 1.6103786611808414e-06, 'epoch': 1.96}
65%|██████▌ | 1851/2826 [3:01:30<1:52:23, 6.92s/it]
66%|██████▌ | 1852/2826 [3:01:35<1:43:37, 6.38s/it]
66%|██████▌ | 1853/2826 [3:01:40<1:36:58, 5.98s/it]
66%|██████▌ | 1854/2826 [3:01:45<1:33:39, 5.78s/it]
66%|██████▌ | 1855/2826 [3:01:51<1:31:43, 5.67s/it]
66%|██████▌ | 1856/2826 [3:01:58<1:38:38, 6.10s/it]
66%|██████▌ | 1857/2826 [3:02:04<1:37:27, 6.03s/it]
66%|██████▌ | 1858/2826 [3:02:09<1:32:53, 5.76s/it]
66%|██████▌ | 1859/2826 [3:02:14<1:29:45, 5.57s/it]
66%|██████▌ | 1860/2826 [3:02:19<1:27:40, 5.45s/it]
{'loss': 0.2954, 'grad_norm': 2.304274320602417, 'learning_rate': 1.5815841718544884e-06, 'epoch': 1.97}
66%|██████▌ | 1861/2826 [3:02:26<1:34:15, 5.86s/it]
66%|██████▌ | 1862/2826 [3:02:32<1:34:22, 5.87s/it]
66%|██████▌ | 1863/2826 [3:02:37<1:29:58, 5.61s/it]
66%|██████▌ | 1864/2826 [3:02:43<1:30:24, 5.64s/it]
66%|██████▌ | 1865/2826 [3:02:48<1:28:03, 5.50s/it]
66%|██████▌ | 1866/2826 [3:02:53<1:28:22, 5.52s/it]
66%|██████▌ | 1867/2826 [3:02:58<1:25:49, 5.37s/it]
66%|██████▌ | 1868/2826 [3:03:03<1:24:40, 5.30s/it]
66%|██████▌ | 1869/2826 [3:03:10<1:28:38, 5.56s/it]
66%|██████▌ | 1870/2826 [3:03:16<1:32:21, 5.80s/it]
{'loss': 0.2945, 'grad_norm': 2.502206802368164, 'learning_rate': 1.5529298479412636e-06, 'epoch': 1.98}
66%|██████▌ | 1871/2826 [3:03:22<1:34:55, 5.96s/it]
66%|██████▌ | 1872/2826 [3:03:29<1:37:43, 6.15s/it]
66%|██████▋ | 1873/2826 [3:03:34<1:34:39, 5.96s/it]
66%|██████▋ | 1874/2826 [3:03:40<1:32:15, 5.81s/it]
66%|██████▋ | 1875/2826 [3:03:45<1:29:33, 5.65s/it]
66%|██████▋ | 1876/2826 [3:03:52<1:35:24, 6.03s/it]
66%|██████▋ | 1877/2826 [3:03:57<1:31:38, 5.79s/it]
66%|██████▋ | 1878/2826 [3:04:03<1:29:25, 5.66s/it]
66%|██████▋ | 1879/2826 [3:04:08<1:26:37, 5.49s/it]
67%|██████▋ | 1880/2826 [3:04:13<1:24:42, 5.37s/it]
{'loss': 0.3291, 'grad_norm': 2.5796189308166504, 'learning_rate': 1.524420062563912e-06, 'epoch': 1.99}
67%|██████▋ | 1881/2826 [3:04:19<1:26:16, 5.48s/it]
67%|██████▋ | 1882/2826 [3:04:25<1:33:05, 5.92s/it]
67%|██████▋ | 1883/2826 [3:04:32<1:35:55, 6.10s/it]
67%|██████▋ | 1884/2826 [3:04:38<1:34:43, 6.03s/it]
67%|██████▋ | 1885/2826 [3:04:43<1:30:52, 5.79s/it]
67%|██████▋ | 1886/2826 [3:04:48<1:26:53, 5.55s/it]
[INFO|trainer.py:3984] 2025-10-18 09:51:01,296 >> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886
[INFO|configuration_utils.py:419] 2025-10-18 09:51:01,303 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/config.json
[INFO|configuration_utils.py:911] 2025-10-18 09:51:01,304 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 09:51:16,354 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split into 4 checkpoint shards. You can find where each parameter has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 09:51:16,359 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 09:51:16,360 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/special_tokens_map.json
[2025-10-18 09:51:16,892] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step1885 is about to be saved!
[2025-10-18 09:51:16,903] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-10-18 09:51:16,903] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-10-18 09:51:16,923] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-10-18 09:51:16,936] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-10-18 09:51:34,927] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-10-18 09:51:34,929] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-10-18 09:51:35,161] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1885 is ready now!
67%|██████▋ | 1887/2826 [3:05:36<4:45:09, 18.22s/it]
67%|██████▋ | 1888/2826 [3:05:42<3:48:33, 14.62s/it]
67%|██████▋ | 1889/2826 [3:05:47<3:04:18, 11.80s/it]
67%|██████▋ | 1890/2826 [3:05:53<2:37:08, 10.07s/it]
{'loss': 0.234, 'grad_norm': 1.9198871850967407, 'learning_rate': 1.4960591667862163e-06, 'epoch': 2.0}
67%|██████▋ | 1891/2826 [3:05:59<2:15:35, 8.70s/it]
67%|██████▋ | 1892/2826 [3:06:05<2:03:47, 7.95s/it]
67%|██████▋ | 1893/2826 [3:06:11<1:54:09, 7.34s/it]
67%|██████▋ | 1894/2826 [3:06:16<1:45:05, 6.77s/it]
67%|██████▋ | 1895/2826 [3:06:22<1:38:19, 6.34s/it]
67%|██████▋ | 1896/2826 [3:06:28<1:35:50, 6.18s/it]
67%|██████▋ | 1897/2826 [3:06:35<1:39:16, 6.41s/it]
67%|██████▋ | 1898/2826 [3:06:41<1:40:07, 6.47s/it]
67%|██████▋ | 1899/2826 [3:06:47<1:36:03, 6.22s/it]
67%|██████▋ | 1900/2826 [3:06:54<1:40:35, 6.52s/it]
{'loss': 0.1943, 'grad_norm': 1.7082706689834595, 'learning_rate': 1.4678514889489464e-06, 'epoch': 2.01}
67%|██████▋ | 1901/2826 [3:07:03<1:51:41, 7.24s/it]
67%|██████▋ | 1902/2826 [3:07:09<1:45:07, 6.83s/it]
67%|██████▋ | 1903/2826 [3:07:14<1:39:02, 6.44s/it]
67%|██████▋ | 1904/2826 [3:07:19<1:32:32, 6.02s/it]
67%|██████▋ | 1905/2826 [3:07:26<1:34:50, 6.18s/it]
67%|██████▋ | 1906/2826 [3:07:31<1:30:04, 5.87s/it]
67%|██████▋ | 1907/2826 [3:07:36<1:27:27, 5.71s/it]
68%|██████▊ | 1908/2826 [3:07:43<1:33:41, 6.12s/it]
68%|██████▊ | 1909/2826 [3:07:49<1:30:38, 5.93s/it]
68%|██████▊ | 1910/2826 [3:07:56<1:35:49, 6.28s/it]
{'loss': 0.1911, 'grad_norm': 1.8571817874908447, 'learning_rate': 1.4398013340092864e-06, 'epoch': 2.03}
68%|██████▊ | 1911/2826 [3:08:02<1:32:27, 6.06s/it]
68%|██████▊ | 1912/2826 [3:08:07<1:29:36, 5.88s/it]
68%|██████▊ | 1913/2826 [3:08:13<1:28:53, 5.84s/it]
68%|██████▊ | 1914/2826 [3:08:19<1:31:11, 6.00s/it]
68%|██████▊ | 1915/2826 [3:08:26<1:33:33, 6.16s/it]
68%|██████▊ | 1916/2826 [3:08:32<1:34:31, 6.23s/it]
68%|██████▊ | 1917/2826 [3:08:37<1:29:10, 5.89s/it]
68%|██████▊ | 1918/2826 [3:08:42<1:25:34, 5.65s/it]
68%|██████▊ | 1919/2826 [3:08:49<1:30:19, 5.98s/it]
68%|██████▊ | 1920/2826 [3:08:56<1:33:49, 6.21s/it]
{'loss': 0.1895, 'grad_norm': 2.454561233520508, 'learning_rate': 1.4119129828838275e-06, 'epoch': 2.04}
68%|██████▊ | 1921/2826 [3:09:02<1:33:13, 6.18s/it]
68%|██████▊ | 1922/2826 [3:09:09<1:37:57, 6.50s/it]
68%|██████▊ | 1923/2826 [3:09:17<1:43:01, 6.84s/it]
68%|██████▊ | 1924/2826 [3:09:22<1:34:50, 6.31s/it]
68%|██████▊ | 1925/2826 [3:09:27<1:30:38, 6.04s/it]
68%|██████▊ | 1926/2826 [3:09:33<1:27:29, 5.83s/it]
68%|██████▊ | 1927/2826 [3:09:39<1:29:13, 5.96s/it]
68%|██████▊ | 1928/2826 [3:09:45<1:28:23, 5.91s/it]
68%|██████▊ | 1929/2826 [3:09:51<1:28:49, 5.94s/it]
68%|██████▊ | 1930/2826 [3:09:56<1:24:37, 5.67s/it]
{'loss': 0.2177, 'grad_norm': 2.3714683055877686, 'learning_rate': 1.384190691795226e-06, 'epoch': 2.05}
68%|██████▊ | 1931/2826 [3:10:01<1:24:10, 5.64s/it]
68%|██████▊ | 1932/2826 [3:10:06<1:21:41, 5.48s/it]
68%|██████▊ | 1933/2826 [3:10:14<1:29:20, 6.00s/it]
68%|██████▊ | 1934/2826 [3:10:19<1:25:39, 5.76s/it]
68%|██████▊ | 1935/2826 [3:10:24<1:22:28, 5.55s/it]
69%|██████▊ | 1936/2826 [3:10:29<1:20:14, 5.41s/it]
69%|██████▊ | 1937/2826 [3:10:36<1:26:10, 5.82s/it]
69%|██████▊ | 1938/2826 [3:10:42<1:29:57, 6.08s/it]
69%|██████▊ | 1939/2826 [3:10:48<1:25:36, 5.79s/it]
69%|██████▊ | 1940/2826 [3:10:53<1:25:20, 5.78s/it]
{'loss': 0.2252, 'grad_norm': 2.1356313228607178, 'learning_rate': 1.3566386916226373e-06, 'epoch': 2.06}
69%|██████▊ | 1941/2826 [3:11:01<1:33:30, 6.34s/it]
69%|██████▊ | 1942/2826 [3:11:06<1:29:11, 6.05s/it]
69%|██████▉ | 1943/2826 [3:11:12<1:25:14, 5.79s/it]
69%|██████▉ | 1944/2826 [3:11:18<1:28:11, 6.00s/it]
69%|██████▉ | 1945/2826 [3:11:25<1:33:26, 6.36s/it]
69%|██████▉ | 1946/2826 [3:11:32<1:37:07, 6.62s/it]
69%|██████▉ | 1947/2826 [3:11:40<1:39:07, 6.77s/it]
69%|██████▉ | 1948/2826 [3:11:45<1:33:28, 6.39s/it]
69%|██████▉ | 1949/2826 [3:11:51<1:29:38, 6.13s/it]
69%|██████▉ | 1950/2826 [3:11:57<1:28:57, 6.09s/it]
{'loss': 0.1982, 'grad_norm': 2.446906089782715, 'learning_rate': 1.3292611872560134e-06, 'epoch': 2.07}
69%|██████▉ | 1951/2826 [3:12:02<1:26:12, 5.91s/it]
69%|██████▉ | 1952/2826 [3:12:07<1:23:51, 5.76s/it]
69%|██████▉ | 1953/2826 [3:12:13<1:24:18, 5.79s/it]
69%|██████▉ | 1954/2826 [3:12:19<1:21:45, 5.63s/it]
69%|██████▉ | 1955/2826 [3:12:26<1:28:01, 6.06s/it]
69%|██████▉ | 1956/2826 [3:12:33<1:32:10, 6.36s/it]
69%|██████▉ | 1957/2826 [3:12:38<1:29:15, 6.16s/it]
69%|██████▉ | 1958/2826 [3:12:44<1:26:07, 5.95s/it]
69%|██████▉ | 1959/2826 [3:12:50<1:28:51, 6.15s/it]
69%|██████▉ | 1960/2826 [3:12:58<1:32:42, 6.42s/it]
{'loss': 0.1696, 'grad_norm': 2.1040875911712646, 'learning_rate': 1.302062356954365e-06, 'epoch': 2.08}
69%|██████▉ | 1961/2826 [3:13:06<1:40:15, 6.95s/it]
69%|██████▉ | 1962/2826 [3:13:11<1:33:04, 6.46s/it]
69%|██████▉ | 1963/2826 [3:13:16<1:27:18, 6.07s/it]
69%|██████▉ | 1964/2826 [3:13:22<1:24:00, 5.85s/it]
70%|██████▉ | 1965/2826 [3:13:29<1:28:56, 6.20s/it]
70%|██████▉ | 1966/2826 [3:13:34<1:26:09, 6.01s/it]
70%|██████▉ | 1967/2826 [3:13:41<1:31:34, 6.40s/it]
70%|██████▉ | 1968/2826 [3:13:47<1:27:26, 6.12s/it]
70%|██████▉ | 1969/2826 [3:13:53<1:26:53, 6.08s/it]
70%|██████▉ | 1970/2826 [3:14:00<1:29:35, 6.28s/it]
{'loss': 0.1936, 'grad_norm': 2.220742702484131, 'learning_rate': 1.2750463517080922e-06, 'epoch': 2.09}
70%|██████▉ | 1971/2826 [3:14:05<1:24:17, 5.91s/it]
70%|██████▉ | 1972/2826 [3:14:10<1:22:25, 5.79s/it]
70%|██████▉ | 1973/2826 [3:14:17<1:26:58, 6.12s/it]
70%|██████▉ | 1974/2826 [3:14:23<1:26:07, 6.07s/it]
70%|██████▉ | 1975/2826 [3:14:28<1:22:19, 5.80s/it]
70%|██████▉ | 1976/2826 [3:14:33<1:19:45, 5.63s/it]
70%|██████▉ | 1977/2826 [3:14:39<1:17:36, 5.48s/it]
70%|██████▉ | 1978/2826 [3:14:45<1:22:04, 5.81s/it]
70%|███████ | 1979/2826 [3:14:50<1:20:02, 5.67s/it]
70%|███████ | 1980/2826 [3:14:56<1:20:16, 5.69s/it]
{'loss': 0.1604, 'grad_norm': 2.7784054279327393, 'learning_rate': 1.2482172946054753e-06, 'epoch': 2.1}
70%|███████ | 1981/2826 [3:15:02<1:18:45, 5.59s/it]
70%|███████ | 1982/2826 [3:15:08<1:20:40, 5.73s/it]
70%|███████ | 1983/2826 [3:15:14<1:24:24, 6.01s/it]
70%|███████ | 1984/2826 [3:15:19<1:20:36, 5.74s/it]
70%|███████ | 1985/2826 [3:15:25<1:20:48, 5.76s/it]
70%|███████ | 1986/2826 [3:15:31<1:20:35, 5.76s/it]
70%|███████ | 1987/2826 [3:15:37<1:19:54, 5.72s/it]
70%|███████ | 1988/2826 [3:15:42<1:18:59, 5.66s/it]
70%|███████ | 1989/2826 [3:15:48<1:18:14, 5.61s/it]
70%|███████ | 1990/2826 [3:15:54<1:22:43, 5.94s/it]
{'loss': 0.2069, 'grad_norm': 2.0539498329162598, 'learning_rate': 1.2215792802034187e-06, 'epoch': 2.11}
70%|███████ | 1991/2826 [3:15:59<1:19:14, 5.69s/it]
70%|███████ | 1992/2826 [3:16:06<1:21:47, 5.88s/it]
71%|███████ | 1993/2826 [3:16:13<1:26:41, 6.24s/it]
71%|███████ | 1994/2826 [3:16:19<1:26:02, 6.20s/it]
71%|███████ | 1995/2826 [3:16:25<1:25:45, 6.19s/it]
71%|███████ | 1996/2826 [3:16:30<1:21:23, 5.88s/it]
71%|███████ | 1997/2826 [3:16:38<1:27:54, 6.36s/it]
71%|███████ | 1998/2826 [3:16:44<1:25:16, 6.18s/it]
71%|███████ | 1999/2826 [3:16:49<1:21:13, 5.89s/it]
71%|███████ | 2000/2826 [3:16:54<1:18:18, 5.69s/it]
{'loss': 0.1964, 'grad_norm': 1.8337138891220093, 'learning_rate': 1.1951363739025618e-06, 'epoch': 2.12}
71%|███████ | 2001/2826 [3:17:01<1:24:31, 6.15s/it]
71%|███████ | 2002/2826 [3:17:08<1:28:55, 6.47s/it]
71%|███████ | 2003/2826 [3:17:14<1:24:46, 6.18s/it]
71%|███████ | 2004/2826 [3:17:20<1:25:04, 6.21s/it]
71%|███████ | 2005/2826 [3:17:26<1:21:26, 5.95s/it]
71%|███████ | 2006/2826 [3:17:31<1:18:48, 5.77s/it]
71%|███████ | 2007/2826 [3:17:36<1:16:02, 5.57s/it]
71%|███████ | 2008/2826 [3:17:42<1:17:17, 5.67s/it]
71%|███████ | 2009/2826 [3:17:48<1:17:53, 5.72s/it]
71%|███████ | 2010/2826 [3:17:53<1:14:54, 5.51s/it]
{'loss': 0.1871, 'grad_norm': 1.7631642818450928, 'learning_rate': 1.168892611326827e-06, 'epoch': 2.13}
71%|███████ | 2011/2826 [3:17:58<1:14:29, 5.48s/it]
71%|███████ | 2012/2826 [3:18:06<1:23:32, 6.16s/it]
71%|███████ | 2013/2826 [3:18:11<1:20:27, 5.94s/it]
71%|███████▏ | 2014/2826 [3:18:17<1:17:52, 5.75s/it]
71%|███████▏ | 2015/2826 [3:18:22<1:15:07, 5.56s/it]
71%|███████▏ | 2016/2826 [3:18:27<1:13:24, 5.44s/it]
71%|███████▏ | 2017/2826 [3:18:33<1:16:56, 5.71s/it]
71%|███████▏ | 2018/2826 [3:18:39<1:18:59, 5.87s/it]
71%|███████▏ | 2019/2826 [3:18:46<1:21:10, 6.04s/it]
71%|███████▏ | 2020/2826 [3:18:51<1:16:58, 5.73s/it]
{'loss': 0.2595, 'grad_norm': 2.386589527130127, 'learning_rate': 1.1428519977075136e-06, 'epoch': 2.14}
72%|███████▏ | 2021/2826 [3:18:57<1:18:28, 5.85s/it]
72%|███████▏ | 2022/2826 [3:19:02<1:15:55, 5.67s/it]
72%|███████▏ | 2023/2826 [3:19:07<1:13:31, 5.49s/it]
72%|███████▏ | 2024/2826 [3:19:14<1:16:15, 5.71s/it]
72%|███████▏ | 2025/2826 [3:19:20<1:19:35, 5.96s/it]
72%|███████▏ | 2026/2826 [3:19:27<1:22:04, 6.16s/it]
72%|███████▏ | 2027/2826 [3:19:32<1:17:51, 5.85s/it]
72%|███████▏ | 2028/2826 [3:19:38<1:19:51, 6.00s/it]
72%|███████▏ | 2029/2826 [3:19:43<1:16:21, 5.75s/it]
72%|███████▏ | 2030/2826 [3:19:49<1:13:51, 5.57s/it]
{'loss': 0.185, 'grad_norm': 2.553382635116577, 'learning_rate': 1.1170185072720434e-06, 'epoch': 2.15}
72%|███████▏ | 2031/2826 [3:19:54<1:12:13, 5.45s/it]
72%|███████▏ | 2032/2826 [3:19:59<1:10:46, 5.35s/it]
72%|███████▏ | 2033/2826 [3:20:04<1:11:20, 5.40s/it]
72%|███████▏ | 2034/2826 [3:20:10<1:13:55, 5.60s/it]
72%|███████▏ | 2035/2826 [3:20:17<1:17:21, 5.87s/it]
72%|███████▏ | 2036/2826 [3:20:23<1:16:08, 5.78s/it]
72%|███████▏ | 2037/2826 [3:20:28<1:16:03, 5.78s/it]
72%|███████▏ | 2038/2826 [3:20:35<1:18:50, 6.00s/it]
72%|███████▏ | 2039/2826 [3:20:41<1:19:04, 6.03s/it]
72%|███████▏ | 2040/2826 [3:20:47<1:20:00, 6.11s/it]
{'loss': 0.228, 'grad_norm': 2.870973825454712, 'learning_rate': 1.091396082637419e-06, 'epoch': 2.16}
72%|███████▏ | 2041/2826 [3:20:53<1:17:39, 5.94s/it]
72%|███████▏ | 2042/2826 [3:20:58<1:15:02, 5.74s/it]
72%|███████▏ | 2043/2826 [3:21:03<1:13:49, 5.66s/it]
72%|███████▏ | 2044/2826 [3:21:10<1:15:49, 5.82s/it]
72%|███████▏ | 2045/2826 [3:21:16<1:18:18, 6.02s/it]
72%|███████▏ | 2046/2826 [3:21:22<1:18:41, 6.05s/it]
72%|███████▏ | 2047/2826 [3:21:28<1:18:47, 6.07s/it]
72%|███████▏ | 2048/2826 [3:21:34<1:18:41, 6.07s/it]
73%|███████▎ | 2049/2826 [3:21:42<1:22:51, 6.40s/it]
73%|███████▎ | 2050/2826 [3:21:48<1:21:13, 6.28s/it]
{'loss': 0.2098, 'grad_norm': 2.643745183944702, 'learning_rate': 1.065988634208516e-06, 'epoch': 2.17}
73%|███████▎ | 2051/2826 [3:21:53<1:18:57, 6.11s/it]
73%|███████▎ | 2052/2826 [3:22:00<1:21:20, 6.31s/it]
73%|███████▎ | 2053/2826 [3:22:07<1:23:59, 6.52s/it]
73%|███████▎ | 2054/2826 [3:22:13<1:20:35, 6.26s/it]
73%|███████▎ | 2055/2826 [3:22:19<1:18:48, 6.13s/it]
73%|███████▎ | 2056/2826 [3:22:24<1:14:42, 5.82s/it]
73%|███████▎ | 2057/2826 [3:22:29<1:11:45, 5.60s/it]
73%|███████▎ | 2058/2826 [3:22:36<1:18:24, 6.13s/it]
73%|███████▎ | 2059/2826 [3:22:42<1:15:54, 5.94s/it]
73%|███████▎ | 2060/2826 [3:22:48<1:15:33, 5.92s/it]
{'loss': 0.1982, 'grad_norm': 2.369596481323242, 'learning_rate': 1.0408000395812961e-06, 'epoch': 2.18}
73%|███████▎ | 2061/2826 [3:22:54<1:16:32, 6.00s/it]
73%|███████▎ | 2062/2826 [3:22:59<1:13:27, 5.77s/it]
73%|███████▎ | 2063/2826 [3:23:04<1:10:57, 5.58s/it]
73%|███████▎ | 2064/2826 [3:23:10<1:11:39, 5.64s/it]
73%|███████▎ | 2065/2826 [3:23:16<1:12:13, 5.69s/it]
73%|███████▎ | 2066/2826 [3:23:22<1:15:30, 5.96s/it]
73%|███████▎ | 2067/2826 [3:23:28<1:14:26, 5.88s/it]
73%|███████▎ | 2068/2826 [3:23:34<1:14:49, 5.92s/it]
73%|███████▎ | 2069/2826 [3:23:39<1:12:33, 5.75s/it]
73%|███████▎ | 2070/2826 [3:23:45<1:11:27, 5.67s/it]
{'loss': 0.1844, 'grad_norm': 2.1093883514404297, 'learning_rate': 1.0158341429510194e-06, 'epoch': 2.2}
73%|███████▎ | 2071/2826 [3:23:52<1:18:35, 6.25s/it]
73%|███████▎ | 2072/2826 [3:23:58<1:14:18, 5.91s/it]
73%|███████▎ | 2073/2826 [3:24:03<1:11:32, 5.70s/it]
73%|███████▎ | 2074/2826 [3:24:10<1:16:00, 6.06s/it]
73%|███████▎ | 2075/2826 [3:24:15<1:12:30, 5.79s/it]
73%|███████▎ | 2076/2826 [3:24:22<1:18:19, 6.27s/it]
73%|███████▎ | 2077/2826 [3:24:27<1:14:25, 5.96s/it]
74%|███████▎ | 2078/2826 [3:24:34<1:16:09, 6.11s/it]
74%|███████▎ | 2079/2826 [3:24:39<1:13:08, 5.87s/it]
74%|███████▎ | 2080/2826 [3:24:45<1:12:01, 5.79s/it]
{'loss': 0.1654, 'grad_norm': 1.951935052871704, 'learning_rate': 9.910947545255523e-07, 'epoch': 2.21}
74%|███████▎ | 2081/2826 [3:24:51<1:12:04, 5.80s/it]
74%|███████▎ | 2082/2826 [3:24:57<1:12:16, 5.83s/it]
74%|███████▎ | 2083/2826 [3:25:02<1:11:05, 5.74s/it]
74%|███████▎ | 2084/2826 [3:25:07<1:09:16, 5.60s/it]
74%|███████▍ | 2085/2826 [3:25:15<1:15:48, 6.14s/it]
74%|███████▍ | 2086/2826 [3:25:20<1:12:21, 5.87s/it]
74%|███████▍ | 2087/2826 [3:25:26<1:12:01, 5.85s/it]
74%|███████▍ | 2088/2826 [3:25:31<1:09:49, 5.68s/it]
74%|███████▍ | 2089/2826 [3:25:37<1:09:15, 5.64s/it]
74%|███████▍ | 2090/2826 [3:25:43<1:11:40, 5.84s/it]
{'loss': 0.2037, 'grad_norm': 2.230781078338623, 'learning_rate': 9.665856499438744e-07, 'epoch': 2.22}
74%|███████▍ | 2091/2826 [3:25:48<1:09:46, 5.70s/it]
74%|███████▍ | 2092/2826 [3:25:55<1:11:46, 5.87s/it]
74%|███████▍ | 2093/2826 [3:26:01<1:12:30, 5.94s/it]
74%|███████▍ | 2094/2826 [3:26:08<1:16:57, 6.31s/it]
74%|███████▍ | 2095/2826 [3:26:13<1:13:42, 6.05s/it]
74%|███████▍ | 2096/2826 [3:26:18<1:10:09, 5.77s/it]
74%|███████▍ | 2097/2826 [3:26:24<1:10:40, 5.82s/it]
74%|███████▍ | 2098/2826 [3:26:29<1:08:00, 5.61s/it]
74%|███████▍ | 2099/2826 [3:26:34<1:05:53, 5.44s/it]
74%|███████▍ | 2100/2826 [3:26:40<1:05:13, 5.39s/it]
{'loss': 0.2087, 'grad_norm': 2.6240904331207275, 'learning_rate': 9.423105696998491e-07, 'epoch': 2.23}
74%|███████▍ | 2101/2826 [3:26:46<1:09:12, 5.73s/it]
74%|███████▍ | 2102/2826 [3:26:52<1:08:27, 5.67s/it]
74%|███████▍ | 2103/2826 [3:26:57<1:08:10, 5.66s/it]
74%|███████▍ | 2104/2826 [3:27:04<1:09:40, 5.79s/it]
74%|███████▍ | 2105/2826 [3:27:09<1:08:16, 5.68s/it]
75%|███████▍ | 2106/2826 [3:27:15<1:10:07, 5.84s/it]
75%|███████▍ | 2107/2826 [3:27:20<1:07:24, 5.62s/it]
75%|███████▍ | 2108/2826 [3:27:25<1:05:28, 5.47s/it]
75%|███████▍ | 2109/2826 [3:27:32<1:08:24, 5.72s/it]
75%|███████▍ | 2110/2826 [3:27:38<1:10:22, 5.90s/it]
{'loss': 0.2105, 'grad_norm': 1.712857723236084, 'learning_rate': 9.182732185713633e-07, 'epoch': 2.24}
75%|███████▍ | 2111/2826 [3:27:44<1:09:00, 5.79s/it]
75%|███████▍ | 2112/2826 [3:27:49<1:06:52, 5.62s/it]
75%|███████▍ | 2113/2826 [3:27:55<1:08:21, 5.75s/it]
75%|███████▍ | 2114/2826 [3:28:01<1:10:52, 5.97s/it]
75%|███████▍ | 2115/2826 [3:28:08<1:14:43, 6.31s/it]
75%|███████▍ | 2116/2826 [3:28:14<1:11:51, 6.07s/it]
75%|███████▍ | 2117/2826 [3:28:20<1:12:30, 6.14s/it]
75%|███████▍ | 2118/2826 [3:28:25<1:09:08, 5.86s/it]
75%|███████▍ | 2119/2826 [3:28:31<1:06:20, 5.63s/it]
75%|███████▌ | 2120/2826 [3:28:36<1:04:27, 5.48s/it]
{'loss': 0.2186, 'grad_norm': 2.036086082458496, 'learning_rate': 8.94477265054918e-07, 'epoch': 2.25}
75%|███████▌ | 2121/2826 [3:28:41<1:05:06, 5.54s/it]
75%|███████▌ | 2122/2826 [3:28:47<1:05:10, 5.55s/it]
75%|███████▌ | 2123/2826 [3:28:52<1:04:29, 5.50s/it]
75%|███████▌ | 2124/2826 [3:28:59<1:08:49, 5.88s/it]
75%|███████▌ | 2125/2826 [3:29:06<1:12:23, 6.20s/it]
75%|███████▌ | 2126/2826 [3:29:11<1:08:44, 5.89s/it]
75%|███████▌ | 2127/2826 [3:29:16<1:06:02, 5.67s/it]
75%|███████▌ | 2128/2826 [3:29:23<1:07:45, 5.82s/it]
75%|███████▌ | 2129/2826 [3:29:28<1:06:46, 5.75s/it]
75%|███████▌ | 2130/2826 [3:29:34<1:05:57, 5.69s/it]
{'loss': 0.1879, 'grad_norm': 2.3545398712158203, 'learning_rate': 8.709263408057522e-07, 'epoch': 2.26}
75%|███████▌ | 2131/2826 [3:29:40<1:09:22, 5.99s/it]
75%|███████▌ | 2132/2826 [3:29:45<1:06:14, 5.73s/it]
75%|███████▌ | 2133/2826 [3:29:50<1:03:38, 5.51s/it]
76%|███████▌ | 2134/2826 [3:29:56<1:03:07, 5.47s/it]
76%|███████▌ | 2135/2826 [3:30:01<1:02:23, 5.42s/it]
76%|███████▌ | 2136/2826 [3:30:06<1:01:41, 5.36s/it]
76%|███████▌ | 2137/2826 [3:30:13<1:05:37, 5.71s/it]
76%|███████▌ | 2138/2826 [3:30:18<1:04:12, 5.60s/it]
76%|███████▌ | 2139/2826 [3:30:23<1:02:18, 5.44s/it]
76%|███████▌ | 2140/2826 [3:30:30<1:07:36, 5.91s/it]
{'loss': 0.2177, 'grad_norm': 1.9098992347717285, 'learning_rate': 8.476240400835972e-07, 'epoch': 2.27}
76%|███████▌ | 2141/2826 [3:30:36<1:06:05, 5.79s/it]
76%|███████▌ | 2142/2826 [3:30:41<1:03:44, 5.59s/it]
76%|███████▌ | 2143/2826 [3:30:47<1:03:51, 5.61s/it]
76%|███████▌ | 2144/2826 [3:30:52<1:04:15, 5.65s/it]
76%|███████▌ | 2145/2826 [3:30:58<1:05:14, 5.75s/it]
76%|███████▌ | 2146/2826 [3:31:04<1:05:48, 5.81s/it]
76%|███████▌ | 2147/2826 [3:31:12<1:11:58, 6.36s/it]
76%|███████▌ | 2148/2826 [3:31:19<1:13:50, 6.53s/it]
76%|███████▌ | 2149/2826 [3:31:25<1:10:42, 6.27s/it]
76%|███████▌ | 2150/2826 [3:31:30<1:07:01, 5.95s/it]
{'loss': 0.165, 'grad_norm': 2.107959270477295, 'learning_rate': 8.245739192041311e-07, 'epoch': 2.28}
76%|███████▌ | 2151/2826 [3:31:36<1:07:15, 5.98s/it]
76%|███████▌ | 2152/2826 [3:31:41<1:05:21, 5.82s/it]
76%|███████▌ | 2153/2826 [3:31:48<1:06:53, 5.96s/it]
76%|███████▌ | 2154/2826 [3:31:53<1:04:08, 5.73s/it]
76%|███████▋ | 2155/2826 [3:31:59<1:05:29, 5.86s/it]
76%|███████▋ | 2156/2826 [3:32:04<1:02:42, 5.62s/it]
76%|███████▋ | 2157/2826 [3:32:09<1:00:49, 5.45s/it]
76%|███████▋ | 2158/2826 [3:32:15<1:01:31, 5.53s/it]
76%|███████▋ | 2159/2826 [3:32:20<1:00:05, 5.41s/it]
76%|███████▋ | 2160/2826 [3:32:27<1:04:54, 5.85s/it]
{'loss': 0.2018, 'grad_norm': 2.550719976425171, 'learning_rate': 8.017794959962225e-07, 'epoch': 2.29}
76%|███████▋ | 2161/2826 [3:32:33<1:06:06, 5.96s/it]
77%|███████▋ | 2162/2826 [3:32:40<1:09:13, 6.26s/it]
77%|███████▋ | 2163/2826 [3:32:47<1:12:06, 6.53s/it]
77%|███████▋ | 2164/2826 [3:32:54<1:13:20, 6.65s/it]
77%|███████▋ | 2165/2826 [3:32:59<1:09:29, 6.31s/it]
77%|███████▋ | 2166/2826 [3:33:06<1:11:42, 6.52s/it]
77%|███████▋ | 2167/2826 [3:33:12<1:08:24, 6.23s/it]
77%|███████▋ | 2168/2826 [3:33:18<1:06:30, 6.06s/it]
77%|███████▋ | 2169/2826 [3:33:23<1:03:08, 5.77s/it]
77%|███████▋ | 2170/2826 [3:33:28<1:00:57, 5.58s/it]
{'loss': 0.1955, 'grad_norm': 2.354701280593872, 'learning_rate': 7.792442492650587e-07, 'epoch': 2.3}
77%|███████▋ | 2171/2826 [3:33:35<1:04:28, 5.91s/it]
77%|███████▋ | 2172/2826 [3:33:41<1:06:40, 6.12s/it]
77%|███████▋ | 2173/2826 [3:33:47<1:05:52, 6.05s/it]
77%|███████▋ | 2174/2826 [3:33:53<1:03:50, 5.88s/it]
77%|███████▋ | 2175/2826 [3:33:58<1:01:10, 5.64s/it]
77%|███████▋ | 2176/2826 [3:34:03<59:25, 5.48s/it]
77%|███████▋ | 2177/2826 [3:34:08<1:00:00, 5.55s/it]
77%|███████▋ | 2178/2826 [3:34:14<59:35, 5.52s/it]
77%|███████▋ | 2179/2826 [3:34:19<59:16, 5.50s/it]
77%|███████▋ | 2180/2826 [3:34:25<58:22, 5.42s/it]
{'loss': 0.1976, 'grad_norm': 2.3547091484069824, 'learning_rate': 7.569716182612177e-07, 'epoch': 2.31}
77%|███████▋ | 2181/2826 [3:34:30<58:29, 5.44s/it]
77%|███████▋ | 2182/2826 [3:34:37<1:02:44, 5.85s/it]
77%|███████▋ | 2183/2826 [3:34:43<1:03:58, 5.97s/it]
77%|███████▋ | 2184/2826 [3:34:50<1:06:22, 6.20s/it]
77%|███████▋ | 2185/2826 [3:34:56<1:04:53, 6.07s/it]
77%|███████▋ | 2186/2826 [3:35:01<1:01:41, 5.78s/it]
77%|███████▋ | 2187/2826 [3:35:06<1:00:31, 5.68s/it]
77%|███████▋ | 2188/2826 [3:35:12<1:01:55, 5.82s/it]
77%|███████▋ | 2189/2826 [3:35:18<1:01:15, 5.77s/it]
77%|███████▋ | 2190/2826 [3:35:25<1:04:03, 6.04s/it]
{'loss': 0.1685, 'grad_norm': 1.4048022031784058, 'learning_rate': 7.349650021557839e-07, 'epoch': 2.32}
78%|███████▊ | 2191/2826 [3:35:30<1:01:02, 5.77s/it]
78%|███████▊ | 2192/2826 [3:35:36<1:01:22, 5.81s/it]
78%|███████▊ | 2193/2826 [3:35:41<1:00:40, 5.75s/it]
78%|███████▊ | 2194/2826 [3:35:47<1:00:55, 5.78s/it]
78%|███████▊ | 2195/2826 [3:35:52<59:07, 5.62s/it]
78%|███████▊ | 2196/2826 [3:35:58<57:46, 5.50s/it]
78%|███████▊ | 2197/2826 [3:36:03<56:47, 5.42s/it]
78%|███████▊ | 2198/2826 [3:36:08<56:36, 5.41s/it]
78%|███████▊ | 2199/2826 [3:36:13<55:36, 5.32s/it]
78%|███████▊ | 2200/2826 [3:36:19<54:53, 5.26s/it]
{'loss': 0.1519, 'grad_norm': 2.568500280380249, 'learning_rate': 7.132277595215773e-07, 'epoch': 2.33}
78%|███████▊ | 2201/2826 [3:36:25<59:00, 5.66s/it]
78%|███████▊ | 2202/2826 [3:36:31<1:00:43, 5.84s/it]
78%|███████▊ | 2203/2826 [3:36:37<1:00:25, 5.82s/it]
78%|███████▊ | 2204/2826 [3:36:44<1:02:22, 6.02s/it]
78%|███████▊ | 2205/2826 [3:36:49<59:23, 5.74s/it]
78%|███████▊ | 2206/2826 [3:36:54<59:15, 5.74s/it]
78%|███████▊ | 2207/2826 [3:37:00<57:33, 5.58s/it]
78%|███████▊ | 2208/2826 [3:37:06<1:00:19, 5.86s/it]
78%|███████▊ | 2209/2826 [3:37:12<1:00:50, 5.92s/it]
78%|███████▊ | 2210/2826 [3:37:18<1:01:06, 5.95s/it]
{'loss': 0.1573, 'grad_norm': 2.205993413925171, 'learning_rate': 6.917632078205805e-07, 'epoch': 2.34}
78%|███████▊ | 2211/2826 [3:37:24<1:01:57, 6.04s/it]
78%|███████▊ | 2212/2826 [3:37:30<1:01:16, 5.99s/it]
78%|███████▊ | 2213/2826 [3:37:37<1:02:16, 6.10s/it]
78%|███████▊ | 2214/2826 [3:37:42<1:00:56, 5.98s/it]
78%|███████▊ | 2215/2826 [3:37:47<58:08, 5.71s/it]
78%|███████▊ | 2216/2826 [3:37:53<56:31, 5.56s/it]
78%|███████▊ | 2217/2826 [3:37:59<57:27, 5.66s/it]
78%|███████▊ | 2218/2826 [3:38:04<55:42, 5.50s/it]
79%|███████▊ | 2219/2826 [3:38:09<54:57, 5.43s/it]
79%|███████▊ | 2220/2826 [3:38:14<54:23, 5.39s/it]
{'loss': 0.184, 'grad_norm': 2.067505121231079, 'learning_rate': 6.705746228976387e-07, 'epoch': 2.35}
79%|███████▊ | 2221/2826 [3:38:21<56:59, 5.65s/it]
79%|███████▊ | 2222/2826 [3:38:26<55:25, 5.51s/it]
79%|███████▊ | 2223/2826 [3:38:31<55:25, 5.51s/it]
79%|███████▊ | 2224/2826 [3:38:37<55:44, 5.55s/it]
79%|███████▊ | 2225/2826 [3:38:42<55:28, 5.54s/it]
79%|███████▉ | 2226/2826 [3:38:49<58:35, 5.86s/it]
79%|███████▉ | 2227/2826 [3:38:54<56:41, 5.68s/it]
79%|███████▉ | 2228/2826 [3:39:00<56:33, 5.67s/it]
79%|███████▉ | 2229/2826 [3:39:05<56:04, 5.64s/it]
79%|███████▉ | 2230/2826 [3:39:11<54:31, 5.49s/it]
{'loss': 0.1968, 'grad_norm': 2.4360201358795166, 'learning_rate': 6.496652384805125e-07, 'epoch': 2.36}
79%|███████▉ | 2231/2826 [3:39:16<53:53, 5.44s/it]
79%|███████▉ | 2232/2826 [3:39:22<54:48, 5.54s/it]
79%|███████▉ | 2233/2826 [3:39:28<57:11, 5.79s/it]
79%|███████▉ | 2234/2826 [3:39:34<58:45, 5.95s/it]
79%|███████▉ | 2235/2826 [3:39:41<59:48, 6.07s/it]
79%|███████▉ | 2236/2826 [3:39:47<59:45, 6.08s/it]
79%|███████▉ | 2237/2826 [3:39:53<1:00:38, 6.18s/it]
79%|███████▉ | 2238/2826 [3:39:59<1:00:35, 6.18s/it]
79%|███████▉ | 2239/2826 [3:40:05<58:35, 5.99s/it]
79%|███████▉ | 2240/2826 [3:40:12<1:01:36, 6.31s/it]
{'loss': 0.1846, 'grad_norm': 2.042179584503174, 'learning_rate': 6.290382456863584e-07, 'epoch': 2.38}
79%|███████▉ | 2241/2826 [3:40:17<59:00, 6.05s/it]
79%|███████▉ | 2242/2826 [3:40:24<58:59, 6.06s/it]
79%|███████▉ | 2243/2826 [3:40:31<1:03:39, 6.55s/it]
79%|███████▉ | 2244/2826 [3:40:38<1:03:01, 6.50s/it]
79%|███████▉ | 2245/2826 [3:40:43<59:06, 6.10s/it]
79%|███████▉ | 2246/2826 [3:40:50<1:02:49, 6.50s/it]
80%|███████▉ | 2247/2826 [3:40:55<58:38, 6.08s/it]
80%|███████▉ | 2248/2826 [3:41:02<1:00:33, 6.29s/it]
80%|███████▉ | 2249/2826 [3:41:08<58:50, 6.12s/it]
80%|███████▉ | 2250/2826 [3:41:13<55:52, 5.82s/it]
{'loss': 0.1858, 'grad_norm': 2.849271535873413, 'learning_rate': 6.086967925347075e-07, 'epoch': 2.39}
80%|███████▉ | 2251/2826 [3:41:19<56:26, 5.89s/it]
80%|███████▉ | 2252/2826 [3:41:24<55:01, 5.75s/it]
80%|███████▉ | 2253/2826 [3:41:30<53:34, 5.61s/it]
80%|███████▉ | 2254/2826 [3:41:36<56:02, 5.88s/it]
80%|███████▉ | 2255/2826 [3:41:42<56:14, 5.91s/it]
80%|███████▉ | 2256/2826 [3:41:49<57:35, 6.06s/it]
80%|███████▉ | 2257/2826 [3:41:54<55:16, 5.83s/it]
80%|███████▉ | 2258/2826 [3:42:01<58:27, 6.17s/it]
80%|███████▉ | 2259/2826 [3:42:06<55:15, 5.85s/it]
80%|███████▉ | 2260/2826 [3:42:12<55:51, 5.92s/it]
{'loss': 0.1837, 'grad_norm': 2.0765082836151123, 'learning_rate': 5.88643983467033e-07, 'epoch': 2.4}
80%|████████ | 2261/2826 [3:42:18<55:54, 5.94s/it]
80%|████████ | 2262/2826 [3:42:23<53:39, 5.71s/it]
80%|████████ | 2263/2826 [3:42:30<56:01, 5.97s/it]
80%|████████ | 2264/2826 [3:42:35<53:51, 5.75s/it]
80%|████████ | 2265/2826 [3:42:41<53:00, 5.67s/it]
80%|████████ | 2266/2826 [3:42:47<54:28, 5.84s/it]
80%|████████ | 2267/2826 [3:42:53<55:09, 5.92s/it]
80%|████████ | 2268/2826 [3:42:58<54:00, 5.81s/it]
80%|████████ | 2269/2826 [3:43:04<53:00, 5.71s/it]
80%|████████ | 2270/2826 [3:43:09<51:56, 5.60s/it]
{'loss': 0.1659, 'grad_norm': 1.9958840608596802, 'learning_rate': 5.688828788729547e-07, 'epoch': 2.41}
80%|████████ | 2271/2826 [3:43:16<54:23, 5.88s/it]
80%|████████ | 2272/2826 [3:43:21<53:29, 5.79s/it]
80%|████████ | 2273/2826 [3:43:27<51:39, 5.61s/it]
80%|████████ | 2274/2826 [3:43:32<49:48, 5.41s/it]
81%|████████ | 2275/2826 [3:43:37<48:50, 5.32s/it]
81%|████████ | 2276/2826 [3:43:42<48:05, 5.25s/it]
81%|████████ | 2277/2826 [3:43:47<48:08, 5.26s/it]
81%|████████ | 2278/2826 [3:43:53<50:46, 5.56s/it]
81%|████████ | 2279/2826 [3:43:59<50:21, 5.52s/it]
81%|████████ | 2280/2826 [3:44:04<50:05, 5.51s/it]
{'loss': 0.2095, 'grad_norm': 2.253602981567383, 'learning_rate': 5.494164946231747e-07, 'epoch': 2.42}
81%|████████ | 2281/2826 [3:44:09<49:02, 5.40s/it]
81%|████████ | 2282/2826 [3:44:15<48:52, 5.39s/it]
81%|████████ | 2283/2826 [3:44:20<49:40, 5.49s/it]
81%|████████ | 2284/2826 [3:44:26<48:51, 5.41s/it]
81%|████████ | 2285/2826 [3:44:32<50:44, 5.63s/it]
81%|████████ | 2286/2826 [3:44:37<49:33, 5.51s/it]
81%|████████ | 2287/2826 [3:44:45<56:11, 6.26s/it]
81%|████████ | 2288/2826 [3:44:52<57:50, 6.45s/it]
81%|████████ | 2289/2826 [3:44:57<54:16, 6.07s/it]
81%|████████ | 2290/2826 [3:45:03<53:50, 6.03s/it]
{'loss': 0.1862, 'grad_norm': 1.5552992820739746, 'learning_rate': 5.302478016092075e-07, 'epoch': 2.43}
81%|████████ | 2291/2826 [3:45:08<51:04, 5.73s/it]
81%|████████ | 2292/2826 [3:45:13<49:29, 5.56s/it]
81%|████████ | 2293/2826 [3:45:18<48:33, 5.47s/it]
81%|████████ | 2294/2826 [3:45:24<47:41, 5.38s/it]
81%|████████ | 2295/2826 [3:45:29<47:22, 5.35s/it]
81%|████████ | 2296/2826 [3:45:35<49:27, 5.60s/it]
81%|████████▏ | 2297/2826 [3:45:42<51:51, 5.88s/it]
81%|████████▏ | 2298/2826 [3:45:47<50:21, 5.72s/it]
81%|████████▏ | 2299/2826 [3:45:56<57:54, 6.59s/it]
81%|████████▏ | 2300/2826 [3:46:01<53:44, 6.13s/it]
{'loss': 0.2085, 'grad_norm': 2.721445322036743, 'learning_rate': 5.113797252899728e-07, 'epoch': 2.44}
81%|████████▏ | 2301/2826 [3:46:06<50:48, 5.81s/it]
81%|████████▏ | 2302/2826 [3:46:12<52:45, 6.04s/it]
81%|████████▏ | 2303/2826 [3:46:17<50:26, 5.79s/it]
82%|████████▏ | 2304/2826 [3:46:23<48:47, 5.61s/it]
82%|████████▏ | 2305/2826 [3:46:28<48:03, 5.53s/it]
82%|████████▏ | 2306/2826 [3:46:33<46:58, 5.42s/it]
82%|████████▏ | 2307/2826 [3:46:41<53:02, 6.13s/it]
82%|████████▏ | 2308/2826 [3:46:46<50:44, 5.88s/it]
82%|████████▏ | 2309/2826 [3:46:51<48:41, 5.65s/it]
82%|████████▏ | 2310/2826 [3:46:58<51:06, 5.94s/it]
{'loss': 0.1914, 'grad_norm': 2.3488707542419434, 'learning_rate': 4.928151452453184e-07, 'epoch': 2.45}
82%|████████▏ | 2311/2826 [3:47:03<48:58, 5.71s/it]
82%|████████▏ | 2312/2826 [3:47:09<48:15, 5.63s/it]
82%|████████▏ | 2313/2826 [3:47:14<46:56, 5.49s/it]
82%|████████▏ | 2314/2826 [3:47:20<49:18, 5.78s/it]
82%|████████▏ | 2315/2826 [3:47:26<48:12, 5.66s/it]
82%|████████▏ | 2316/2826 [3:47:31<47:14, 5.56s/it]
82%|████████▏ | 2317/2826 [3:47:38<49:55, 5.89s/it]
82%|████████▏ | 2318/2826 [3:47:45<53:20, 6.30s/it]
82%|████████▏ | 2319/2826 [3:47:52<56:04, 6.64s/it]
82%|████████▏ | 2320/2826 [3:47:59<55:20, 6.56s/it]
{'loss': 0.1718, 'grad_norm': 2.49068021774292, 'learning_rate': 4.745568947365542e-07, 'epoch': 2.46}
82%|████████▏ | 2321/2826 [3:48:04<51:33, 6.13s/it]
82%|████████▏ | 2322/2826 [3:48:09<49:00, 5.83s/it]
82%|████████▏ | 2323/2826 [3:48:15<50:23, 6.01s/it]
82%|████████▏ | 2324/2826 [3:48:21<49:10, 5.88s/it]
82%|████████▏ | 2325/2826 [3:48:26<48:18, 5.79s/it]
82%|████████▏ | 2326/2826 [3:48:33<49:17, 5.91s/it]
82%|████████▏ | 2327/2826 [3:48:40<52:32, 6.32s/it]
82%|████████▏ | 2328/2826 [3:48:46<50:49, 6.12s/it]
82%|████████▏ | 2329/2826 [3:48:52<50:21, 6.08s/it]
82%|████████▏ | 2330/2826 [3:48:58<50:41, 6.13s/it]
{'loss': 0.1669, 'grad_norm': 1.4638549089431763, 'learning_rate': 4.5660776027404654e-07, 'epoch': 2.47}
82%|████████▏ | 2331/2826 [3:49:04<51:05, 6.19s/it]
83%|████████▎ | 2332/2826 [3:49:11<52:12, 6.34s/it]
83%|████████▎ | 2333/2826 [3:49:19<55:36, 6.77s/it]
83%|████████▎ | 2334/2826 [3:49:24<52:32, 6.41s/it]
83%|████████▎ | 2335/2826 [3:49:29<49:42, 6.07s/it]
83%|████████▎ | 2336/2826 [3:49:36<51:12, 6.27s/it]
83%|████████▎ | 2337/2826 [3:49:43<51:27, 6.31s/it]
83%|████████▎ | 2338/2826 [3:49:48<48:33, 5.97s/it]
83%|████████▎ | 2339/2826 [3:49:53<47:47, 5.89s/it]
83%|████████▎ | 2340/2826 [3:50:00<48:21, 5.97s/it]
{'loss': 0.1731, 'grad_norm': 2.288776159286499, 'learning_rate': 4.389704811919507e-07, 'epoch': 2.48}
83%|████████▎ | 2341/2826 [3:50:05<47:06, 5.83s/it]
83%|████████▎ | 2342/2826 [3:50:10<45:28, 5.64s/it]
83%|████████▎ | 2343/2826 [3:50:16<46:05, 5.73s/it]
83%|████████▎ | 2344/2826 [3:50:23<48:01, 5.98s/it]
83%|████████▎ | 2345/2826 [3:50:28<46:00, 5.74s/it]
83%|████████▎ | 2346/2826 [3:50:33<44:32, 5.57s/it]
83%|████████▎ | 2347/2826 [3:50:39<45:26, 5.69s/it]
83%|████████▎ | 2348/2826 [3:50:45<44:54, 5.64s/it]
83%|████████▎ | 2349/2826 [3:50:51<47:32, 5.98s/it]
83%|████████▎ | 2350/2826 [3:50:57<45:41, 5.76s/it]
{'loss': 0.1802, 'grad_norm': 2.385162115097046, 'learning_rate': 4.216477492301455e-07, 'epoch': 2.49}
83%|████████▎ | 2351/2826 [3:51:03<46:17, 5.85s/it]
83%|████████▎ | 2352/2826 [3:51:08<44:44, 5.66s/it]
83%|████████▎ | 2353/2826 [3:51:14<46:25, 5.89s/it]
83%|████████▎ | 2354/2826 [3:51:20<44:29, 5.66s/it]
83%|████████▎ | 2355/2826 [3:51:25<42:56, 5.47s/it]
83%|████████▎ | 2356/2826 [3:51:30<41:58, 5.36s/it]
83%|████████▎ | 2357/2826 [3:51:35<42:07, 5.39s/it]
83%|████████▎ | 2358/2826 [3:51:42<44:47, 5.74s/it]
83%|████████▎ | 2359/2826 [3:51:47<43:11, 5.55s/it]
84%|████████▎ | 2360/2826 [3:51:53<44:19, 5.71s/it]
{'loss': 0.2232, 'grad_norm': 2.0100815296173096, 'learning_rate': 4.0464220812342526e-07, 'epoch': 2.5}
84%|████████▎ | 2361/2826 [3:51:58<42:59, 5.55s/it]
84%|████████▎ | 2362/2826 [3:52:05<45:05, 5.83s/it]
84%|████████▎ | 2363/2826 [3:52:10<43:23, 5.62s/it]
84%|████████▎ | 2364/2826 [3:52:16<44:01, 5.72s/it]
84%|████████▎ | 2365/2826 [3:52:21<42:47, 5.57s/it]
84%|████████▎ | 2366/2826 [3:52:27<44:40, 5.83s/it]
84%|████████▍ | 2367/2826 [3:52:34<46:48, 6.12s/it]
84%|████████▍ | 2368/2826 [3:52:41<47:55, 6.28s/it]
84%|████████▍ | 2369/2826 [3:52:47<46:53, 6.16s/it]
84%|████████▍ | 2370/2826 [3:52:52<45:14, 5.95s/it]
{'loss': 0.1432, 'grad_norm': 1.8439091444015503, 'learning_rate': 3.87956453198027e-07, 'epoch': 2.51}
84%|████████▍ | 2371/2826 [3:52:57<43:06, 5.68s/it]
84%|████████▍ | 2372/2826 [3:53:03<43:30, 5.75s/it]
84%|████████▍ | 2373/2826 [3:53:09<43:59, 5.83s/it]
84%|████████▍ | 2374/2826 [3:53:16<46:14, 6.14s/it]
84%|████████▍ | 2375/2826 [3:53:22<45:16, 6.02s/it]
84%|████████▍ | 2376/2826 [3:53:27<44:22, 5.92s/it]
84%|████████▍ | 2377/2826 [3:53:33<43:41, 5.84s/it]
84%|████████▍ | 2378/2826 [3:53:39<43:55, 5.88s/it]
84%|████████▍ | 2379/2826 [3:53:45<43:29, 5.84s/it]
84%|████████▍ | 2380/2826 [3:53:51<44:09, 5.94s/it]
{'loss': 0.1834, 'grad_norm': 2.3093338012695312, 'learning_rate': 3.715930309755389e-07, 'epoch': 2.52}
84%|████████▍ | 2381/2826 [3:53:57<44:37, 6.02s/it]
84%|████████▍ | 2382/2826 [3:54:03<43:21, 5.86s/it]
84%|████████▍ | 2383/2826 [3:54:08<42:30, 5.76s/it]
84%|████████▍ | 2384/2826 [3:54:14<42:46, 5.81s/it]
84%|████████▍ | 2385/2826 [3:54:19<41:07, 5.59s/it]
84%|████████▍ | 2386/2826 [3:54:24<39:50, 5.43s/it]
84%|████████▍ | 2387/2826 [3:54:30<39:53, 5.45s/it]
85%|████████▍ | 2388/2826 [3:54:35<40:12, 5.51s/it]
85%|████████▍ | 2389/2826 [3:54:41<40:39, 5.58s/it]
85%|████████▍ | 2390/2826 [3:54:47<40:25, 5.56s/it]
{'loss': 0.2123, 'grad_norm': 2.3250088691711426, 'learning_rate': 3.5555443878425635e-07, 'epoch': 2.53}
85%|████████▍ | 2391/2826 [3:54:52<39:14, 5.41s/it]
85%|████████▍ | 2392/2826 [3:54:57<39:20, 5.44s/it]
85%|████████▍ | 2393/2826 [3:55:03<40:15, 5.58s/it]
85%|████████▍ | 2394/2826 [3:55:09<40:40, 5.65s/it]
85%|████████▍ | 2395/2826 [3:55:14<39:20, 5.48s/it]
85%|████████▍ | 2396/2826 [3:55:19<38:34, 5.38s/it]
85%|████████▍ | 2397/2826 [3:55:24<38:05, 5.33s/it]
85%|████████▍ | 2398/2826 [3:55:30<38:19, 5.37s/it]
85%|████████▍ | 2399/2826 [3:55:35<37:45, 5.31s/it]
85%|████████▍ | 2400/2826 [3:55:41<38:55, 5.48s/it]
{'loss': 0.2034, 'grad_norm': 1.8003133535385132, 'learning_rate': 3.398431243780531e-07, 'epoch': 2.55}
85%|████████▍ | 2401/2826 [3:55:47<40:42, 5.75s/it]
85%|████████▍ | 2402/2826 [3:55:54<42:21, 5.99s/it]
85%|████████▌ | 2403/2826 [3:55:59<40:38, 5.76s/it]
85%|████████▌ | 2404/2826 [3:56:05<40:30, 5.76s/it]
85%|████████▌ | 2405/2826 [3:56:10<39:25, 5.62s/it]
85%|████████▌ | 2406/2826 [3:56:15<38:53, 5.55s/it]
85%|████████▌ | 2407/2826 [3:56:22<40:34, 5.81s/it]
85%|████████▌ | 2408/2826 [3:56:29<42:39, 6.12s/it]
85%|████████▌ | 2409/2826 [3:56:35<42:51, 6.17s/it]
85%|████████▌ | 2410/2826 [3:56:40<40:33, 5.85s/it]
{'loss': 0.1778, 'grad_norm': 2.8948135375976562, 'learning_rate': 3.2446148556281117e-07, 'epoch': 2.56}
85%|████████▌ | 2411/2826 [3:56:45<38:52, 5.62s/it]
85%|████████▌ | 2412/2826 [3:56:51<39:08, 5.67s/it]
85%|████████▌ | 2413/2826 [3:56:57<39:07, 5.68s/it]
85%|████████▌ | 2414/2826 [3:57:02<37:38, 5.48s/it]
85%|████████▌ | 2415/2826 [3:57:07<37:42, 5.50s/it]
85%|████████▌ | 2416/2826 [3:57:12<37:12, 5.45s/it]
86%|████████▌ | 2417/2826 [3:57:20<40:18, 5.91s/it]
86%|████████▌ | 2418/2826 [3:57:25<39:55, 5.87s/it]
86%|████████▌ | 2419/2826 [3:57:30<38:10, 5.63s/it]
86%|████████▌ | 2420/2826 [3:57:36<39:02, 5.77s/it]
{'loss': 0.1892, 'grad_norm': 1.8556360006332397, 'learning_rate': 3.0941186983047543e-07, 'epoch': 2.57}
86%|████████▌ | 2421/2826 [3:57:43<40:13, 5.96s/it]
86%|████████▌ | 2422/2826 [3:57:48<38:29, 5.72s/it]
86%|████████▌ | 2423/2826 [3:57:54<39:12, 5.84s/it]
86%|████████▌ | 2424/2826 [3:57:59<37:43, 5.63s/it]
86%|████████▌ | 2425/2826 [3:58:04<36:16, 5.43s/it]
86%|████████▌ | 2426/2826 [3:58:10<37:36, 5.64s/it]
86%|████████▌ | 2427/2826 [3:58:17<39:01, 5.87s/it]
86%|████████▌ | 2428/2826 [3:58:22<37:58, 5.72s/it]
86%|████████▌ | 2429/2826 [3:58:28<37:57, 5.74s/it]
86%|████████▌ | 2430/2826 [3:58:33<36:23, 5.51s/it]
{'loss': 0.1935, 'grad_norm': 2.771932363510132, 'learning_rate': 2.9469657400078925e-07, 'epoch': 2.58}
86%|████████▌ | 2431/2826 [3:58:38<36:04, 5.48s/it]
86%|████████▌ | 2432/2826 [3:58:45<37:46, 5.75s/it]
86%|████████▌ | 2433/2826 [3:58:51<38:08, 5.82s/it]
86%|████████▌ | 2434/2826 [3:58:56<36:33, 5.60s/it]
86%|████████▌ | 2435/2826 [3:59:01<36:02, 5.53s/it]
86%|████████▌ | 2436/2826 [3:59:06<35:16, 5.43s/it]
86%|████████▌ | 2437/2826 [3:59:12<35:44, 5.51s/it]
86%|████████▋ | 2438/2826 [3:59:19<38:57, 6.02s/it]
86%|████████▋ | 2439/2826 [3:59:24<37:04, 5.75s/it]
86%|████████▋ | 2440/2826 [3:59:29<35:36, 5.54s/it]
{'loss': 0.1858, 'grad_norm': 2.5325114727020264, 'learning_rate': 2.8031784387076186e-07, 'epoch': 2.59}
86%|████████▋ | 2441/2826 [3:59:34<34:37, 5.40s/it]
86%|████████▋ | 2442/2826 [3:59:40<34:14, 5.35s/it]
86%|████████▋ | 2443/2826 [3:59:46<35:04, 5.49s/it]
86%|████████▋ | 2444/2826 [3:59:51<34:15, 5.38s/it]
87%|████████▋ | 2445/2826 [3:59:56<34:37, 5.45s/it]
87%|████████▋ | 2446/2826 [4:00:02<34:34, 5.46s/it]
87%|████████▋ | 2447/2826 [4:00:08<36:16, 5.74s/it]
87%|████████▋ | 2448/2826 [4:00:13<35:17, 5.60s/it]
87%|████████▋ | 2449/2826 [4:00:20<37:44, 6.01s/it]
87%|████████▋ | 2450/2826 [4:00:25<35:57, 5.74s/it]
{'loss': 0.2118, 'grad_norm': 2.4069302082061768, 'learning_rate': 2.6627787387191934e-07, 'epoch': 2.6}
87%|████████▋ | 2451/2826 [4:00:31<34:38, 5.54s/it]
87%|████████▋ | 2452/2826 [4:00:36<34:11, 5.49s/it]
87%|████████▋ | 2453/2826 [4:00:41<34:12, 5.50s/it]
87%|████████▋ | 2454/2826 [4:00:48<36:09, 5.83s/it]
87%|████████▋ | 2455/2826 [4:00:53<34:42, 5.61s/it]
87%|████████▋ | 2456/2826 [4:00:58<33:52, 5.49s/it]
87%|████████▋ | 2457/2826 [4:01:04<33:06, 5.38s/it]
87%|████████▋ | 2458/2826 [4:01:10<34:17, 5.59s/it]
87%|████████▋ | 2459/2826 [4:01:15<33:33, 5.49s/it]
87%|████████▋ | 2460/2826 [4:01:21<34:37, 5.68s/it]
{'loss': 0.1929, 'grad_norm': 2.053656816482544, 'learning_rate': 2.5257880673540376e-07, 'epoch': 2.61}
87%|████████▋ | 2461/2826 [4:01:27<35:43, 5.87s/it]
87%|████████▋ | 2462/2826 [4:01:32<34:20, 5.66s/it]
87%|████████▋ | 2463/2826 [4:01:38<33:20, 5.51s/it]
87%|████████▋ | 2464/2826 [4:01:43<33:13, 5.51s/it]
87%|████████▋ | 2465/2826 [4:01:48<32:47, 5.45s/it]
87%|████████▋ | 2466/2826 [4:01:55<34:01, 5.67s/it]
87%|████████▋ | 2467/2826 [4:02:01<35:15, 5.89s/it]
87%|████████▋ | 2468/2826 [4:02:06<33:49, 5.67s/it]
87%|████████▋ | 2469/2826 [4:02:12<33:38, 5.66s/it]
87%|████████▋ | 2470/2826 [4:02:18<34:17, 5.78s/it]
{'loss': 0.1745, 'grad_norm': 1.8820626735687256, 'learning_rate': 2.392227331649527e-07, 'epoch': 2.62}
87%|████████▋ | 2471/2826 [4:02:25<36:53, 6.23s/it]
87%|████████▋ | 2472/2826 [4:02:31<36:49, 6.24s/it]
88%|████████▊ | 2473/2826 [4:02:37<34:55, 5.94s/it]
88%|████████▊ | 2474/2826 [4:02:42<33:14, 5.67s/it]
88%|████████▊ | 2475/2826 [4:02:47<33:20, 5.70s/it]
88%|████████▊ | 2476/2826 [4:02:52<32:06, 5.50s/it]
88%|████████▊ | 2477/2826 [4:02:58<32:24, 5.57s/it]
88%|████████▊ | 2478/2826 [4:03:04<32:45, 5.65s/it]
88%|████████▊ | 2479/2826 [4:03:09<32:06, 5.55s/it]
88%|████████▊ | 2480/2826 [4:03:15<32:24, 5.62s/it]
{'loss': 0.1823, 'grad_norm': 1.9418586492538452, 'learning_rate': 2.2621169151782417e-07, 'epoch': 2.63}
88%|████████▊ | 2481/2826 [4:03:21<32:21, 5.63s/it]
88%|████████▊ | 2482/2826 [4:03:27<32:35, 5.69s/it]
88%|████████▊ | 2483/2826 [4:03:32<31:38, 5.53s/it]
88%|████████▊ | 2484/2826 [4:03:37<30:42, 5.39s/it]
88%|████████▊ | 2485/2826 [4:03:42<30:08, 5.30s/it]
88%|████████▊ | 2486/2826 [4:03:48<31:41, 5.59s/it]
88%|████████▊ | 2487/2826 [4:03:54<31:32, 5.58s/it]
88%|████████▊ | 2488/2826 [4:04:00<31:43, 5.63s/it]
88%|████████▊ | 2489/2826 [4:04:05<31:34, 5.62s/it]
88%|████████▊ | 2490/2826 [4:04:11<31:21, 5.60s/it]
{'loss': 0.2037, 'grad_norm': 2.519037961959839, 'learning_rate': 2.1354766749371093e-07, 'epoch': 2.64}
88%|████████▊ | 2491/2826 [4:04:17<31:48, 5.70s/it]
88%|████████▊ | 2492/2826 [4:04:22<30:45, 5.52s/it]
88%|████████▊ | 2493/2826 [4:04:27<29:59, 5.40s/it]
88%|████████▊ | 2494/2826 [4:04:33<31:57, 5.78s/it]
88%|████████▊ | 2495/2826 [4:04:39<30:45, 5.58s/it]
88%|████████▊ | 2496/2826 [4:04:45<32:42, 5.95s/it]
88%|████████▊ | 2497/2826 [4:04:52<33:57, 6.19s/it]
88%|████████▊ | 2498/2826 [4:04:58<32:49, 6.00s/it]
88%|████████▊ | 2499/2826 [4:05:03<31:25, 5.77s/it]
88%|████████▊ | 2500/2826 [4:05:10<34:02, 6.27s/it]
{'loss': 0.2196, 'grad_norm': 2.010211944580078, 'learning_rate': 2.0123259383169031e-07, 'epoch': 2.65}
88%|████████▊ | 2501/2826 [4:05:17<33:53, 6.26s/it]
89%|████████▊ | 2502/2826 [4:05:22<31:51, 5.90s/it]
89%|████████▊ | 2503/2826 [4:05:27<30:32, 5.67s/it]
89%|████████▊ | 2504/2826 [4:05:32<30:05, 5.61s/it]
89%|████████▊ | 2505/2826 [4:05:38<29:39, 5.55s/it]
89%|████████▊ | 2506/2826 [4:05:43<29:51, 5.60s/it]
89%|████████▊ | 2507/2826 [4:05:50<30:35, 5.76s/it]
89%|████████▊ | 2508/2826 [4:05:55<29:29, 5.56s/it]
89%|████████▉ | 2509/2826 [4:06:00<29:27, 5.58s/it]
89%|████████▉ | 2510/2826 [4:06:05<28:49, 5.47s/it]
{'loss': 0.1848, 'grad_norm': 1.9838532209396362, 'learning_rate': 1.8926835001525257e-07, 'epoch': 2.66}
89%|████████▉ | 2511/2826 [4:06:11<29:01, 5.53s/it]
89%|████████▉ | 2512/2826 [4:06:16<28:19, 5.41s/it]
89%|████████▉ | 2513/2826 [4:06:23<30:10, 5.78s/it]
89%|████████▉ | 2514/2826 [4:06:28<29:03, 5.59s/it]
89%|████████▉ | 2515/2826 [4:06:33<28:38, 5.53s/it]
89%|████████▉ | 2516/2826 [4:06:39<27:51, 5.39s/it]
89%|████████▉ | 2517/2826 [4:06:44<27:15, 5.29s/it]
89%|████████▉ | 2518/2826 [4:06:49<26:48, 5.22s/it]
89%|████████▉ | 2519/2826 [4:06:55<28:33, 5.58s/it]
89%|████████▉ | 2520/2826 [4:07:00<27:43, 5.44s/it]
{'loss': 0.1823, 'grad_norm': 2.3488149642944336, 'learning_rate': 1.776567619854655e-07, 'epoch': 2.67}
89%|████████▉ | 2521/2826 [4:07:05<27:05, 5.33s/it]
89%|████████▉ | 2522/2826 [4:07:11<28:08, 5.55s/it]
89%|████████▉ | 2523/2826 [4:07:16<27:21, 5.42s/it]
89%|████████▉ | 2524/2826 [4:07:22<26:52, 5.34s/it]
89%|████████▉ | 2525/2826 [4:07:29<29:32, 5.89s/it]
89%|████████▉ | 2526/2826 [4:07:35<29:17, 5.86s/it]
89%|████████▉ | 2527/2826 [4:07:41<30:01, 6.02s/it]
89%|████████▉ | 2528/2826 [4:07:48<31:50, 6.41s/it]
89%|████████▉ | 2529/2826 [4:07:54<30:25, 6.15s/it]
90%|████████▉ | 2530/2826 [4:08:00<30:51, 6.26s/it]
{'loss': 0.2039, 'grad_norm': 2.839651584625244, 'learning_rate': 1.6639960186230293e-07, 'epoch': 2.68}
90%|████████▉ | 2531/2826 [4:08:05<29:05, 5.92s/it]
90%|████████▉ | 2532/2826 [4:08:12<29:48, 6.08s/it]
90%|████████▉ | 2533/2826 [4:08:19<31:31, 6.45s/it]
90%|████████▉ | 2534/2826 [4:08:24<29:25, 6.05s/it]
90%|████████▉ | 2535/2826 [4:08:29<27:43, 5.72s/it]
90%|████████▉ | 2536/2826 [4:08:34<26:45, 5.54s/it]
90%|████████▉ | 2537/2826 [4:08:40<26:13, 5.44s/it]
90%|████████▉ | 2538/2826 [4:08:45<25:45, 5.37s/it]
90%|████████▉ | 2539/2826 [4:08:51<26:14, 5.49s/it]
90%|████████▉ | 2540/2826 [4:08:57<26:51, 5.63s/it]
{'loss': 0.1796, 'grad_norm': 2.050480842590332, 'learning_rate': 1.5549858767419018e-07, 'epoch': 2.69}
90%|████████▉ | 2541/2826 [4:09:03<28:40, 6.04s/it]
90%|████████▉ | 2542/2826 [4:09:09<27:17, 5.77s/it]
90%|████████▉ | 2543/2826 [4:09:15<28:02, 5.95s/it]
90%|█████████ | 2544/2826 [4:09:22<29:02, 6.18s/it]
90%|█████████ | 2545/2826 [4:09:28<28:26, 6.07s/it]
90%|█████████ | 2546/2826 [4:09:33<26:58, 5.78s/it]
90%|█████████ | 2547/2826 [4:09:39<27:19, 5.88s/it]
90%|█████████ | 2548/2826 [4:09:44<26:22, 5.69s/it]
90%|█████████ | 2549/2826 [4:09:51<27:42, 6.00s/it]
90%|█████████ | 2550/2826 [4:09:57<27:41, 6.02s/it]
{'loss': 0.1893, 'grad_norm': 1.2738044261932373, 'learning_rate': 1.449553830958053e-07, 'epoch': 2.7}
90%|█████████ | 2551/2826 [4:10:02<26:16, 5.73s/it]
90%|█████████ | 2552/2826 [4:10:07<25:27, 5.58s/it]
90%|█████████ | 2553/2826 [4:10:14<26:58, 5.93s/it]
90%|█████████ | 2554/2826 [4:10:20<26:41, 5.89s/it]
90%|█████████ | 2555/2826 [4:10:26<26:46, 5.93s/it]
90%|█████████ | 2556/2826 [4:10:33<28:20, 6.30s/it]
90%|█████████ | 2557/2826 [4:10:38<27:09, 6.06s/it]
91%|█████████ | 2558/2826 [4:10:44<27:07, 6.07s/it]
91%|█████████ | 2559/2826 [4:10:50<26:45, 6.01s/it]
91%|█████████ | 2560/2826 [4:10:56<26:51, 6.06s/it]
{'loss': 0.1947, 'grad_norm': 1.8912787437438965, 'learning_rate': 1.347715971941746e-07, 'epoch': 2.72}
91%|█████████ | 2561/2826 [4:11:02<25:46, 5.83s/it]
91%|█████████ | 2562/2826 [4:11:08<26:50, 6.10s/it]
91%|█████████ | 2563/2826 [4:11:14<25:27, 5.81s/it]
91%|█████████ | 2564/2826 [4:11:19<25:23, 5.82s/it]
91%|█████████ | 2565/2826 [4:11:25<24:52, 5.72s/it]
91%|█████████ | 2566/2826 [4:11:30<23:57, 5.53s/it]
91%|█████████ | 2567/2826 [4:11:36<24:41, 5.72s/it]
91%|█████████ | 2568/2826 [4:11:43<26:07, 6.07s/it]
91%|█████████ | 2569/2826 [4:11:49<26:27, 6.18s/it]
91%|█████████ | 2570/2826 [4:11:56<26:21, 6.18s/it]
{'loss': 0.1744, 'grad_norm': 1.8385730981826782, 'learning_rate': 1.2494878418310234e-07, 'epoch': 2.73}
91%|█████████ | 2570/2826 [4:11:56<26:21, 6.18s/it]
91%|█████████ | 2571/2826 [4:12:01<24:59, 5.88s/it]
91%|█████████ | 2572/2826 [4:12:06<24:31, 5.79s/it]
91%|█████████ | 2573/2826 [4:12:13<24:48, 5.89s/it]
91%|█████████ | 2574/2826 [4:12:18<24:31, 5.84s/it]
91%|█████████ | 2575/2826 [4:12:24<23:53, 5.71s/it]
91%|█████████ | 2576/2826 [4:12:30<24:08, 5.79s/it]
91%|█████████ | 2577/2826 [4:12:36<24:20, 5.86s/it]
91%|█████████ | 2578/2826 [4:12:42<24:27, 5.92s/it]
91%|█████████▏| 2579/2826 [4:12:48<24:21, 5.92s/it]
91%|█████████▏| 2580/2826 [4:12:55<25:23, 6.19s/it]
{'loss': 0.2351, 'grad_norm': 2.1071712970733643, 'learning_rate': 1.1548844318597208e-07, 'epoch': 2.74}
91%|█████████▏| 2580/2826 [4:12:55<25:23, 6.19s/it]
91%|█████████▏| 2581/2826 [4:13:01<25:12, 6.17s/it]
91%|█████████▏| 2582/2826 [4:13:06<23:43, 5.83s/it]
91%|█████████▏| 2583/2826 [4:13:12<23:43, 5.86s/it]
91%|█████████▏| 2584/2826 [4:13:17<22:43, 5.64s/it]
91%|█████████▏| 2585/2826 [4:13:22<22:01, 5.48s/it]
92%|█████████▏| 2586/2826 [4:13:29<23:29, 5.87s/it]
92%|█████████▏| 2587/2826 [4:13:34<22:38, 5.68s/it]
92%|█████████▏| 2588/2826 [4:13:39<21:40, 5.46s/it]
92%|█████████▏| 2589/2826 [4:13:46<23:18, 5.90s/it]
92%|█████████▏| 2590/2826 [4:13:51<22:32, 5.73s/it]
{'loss': 0.2245, 'grad_norm': 2.054392099380493, 'learning_rate': 1.0639201800695553e-07, 'epoch': 2.75}
92%|█████████▏| 2590/2826 [4:13:51<22:32, 5.73s/it]
92%|█████████▏| 2591/2826 [4:13:56<22:01, 5.62s/it]
92%|█████████▏| 2592/2826 [4:14:03<23:37, 6.06s/it]
92%|█████████▏| 2593/2826 [4:14:10<23:44, 6.11s/it]
92%|█████████▏| 2594/2826 [4:14:15<22:38, 5.86s/it]
92%|█████████▏| 2595/2826 [4:14:20<21:35, 5.61s/it]
92%|█████████▏| 2596/2826 [4:14:26<21:55, 5.72s/it]
92%|█████████▏| 2597/2826 [4:14:31<21:10, 5.55s/it]
92%|█████████▏| 2598/2826 [4:14:37<21:32, 5.67s/it]
92%|█████████▏| 2599/2826 [4:14:43<21:36, 5.71s/it]
92%|█████████▏| 2600/2826 [4:14:49<21:48, 5.79s/it]
{'loss': 0.2014, 'grad_norm': 1.656562328338623, 'learning_rate': 9.76608969106646e-08, 'epoch': 2.76}
92%|█████████▏| 2600/2826 [4:14:49<21:48, 5.79s/it]
92%|█████████▏| 2601/2826 [4:14:55<22:14, 5.93s/it]
92%|█████████▏| 2602/2826 [4:15:01<21:51, 5.85s/it]
92%|█████████▏| 2603/2826 [4:15:06<21:26, 5.77s/it]
92%|█████████▏| 2604/2826 [4:15:14<22:49, 6.17s/it]
92%|█████████▏| 2605/2826 [4:15:19<21:31, 5.84s/it]
92%|█████████▏| 2606/2826 [4:15:24<21:04, 5.75s/it]
92%|█████████▏| 2607/2826 [4:15:30<20:49, 5.71s/it]
92%|█████████▏| 2608/2826 [4:15:35<20:36, 5.67s/it]
92%|█████████▏| 2609/2826 [4:15:42<21:47, 6.02s/it]
92%|█████████▏| 2610/2826 [4:15:47<20:42, 5.75s/it]
{'loss': 0.1824, 'grad_norm': 2.6887638568878174, 'learning_rate': 8.929641241027937e-08, 'epoch': 2.77}
92%|█████████▏| 2610/2826 [4:15:47<20:42, 5.75s/it]
92%|█████████▏| 2611/2826 [4:15:53<20:26, 5.71s/it]
92%|█████████▏| 2612/2826 [4:15:58<20:04, 5.63s/it]
92%|█████████▏| 2613/2826 [4:16:05<21:18, 6.00s/it]
92%|█████████▏| 2614/2826 [4:16:10<20:12, 5.72s/it]
93%|█████████▎| 2615/2826 [4:16:15<19:34, 5.57s/it]
93%|█████████▎| 2616/2826 [4:16:22<20:17, 5.80s/it]
93%|█████████▎| 2617/2826 [4:16:29<21:27, 6.16s/it]
93%|█████████▎| 2618/2826 [4:16:34<20:18, 5.86s/it]
93%|█████████▎| 2619/2826 [4:16:39<19:32, 5.66s/it]
93%|█████████▎| 2620/2826 [4:16:45<20:03, 5.84s/it]
{'loss': 0.1706, 'grad_norm': 2.4606659412384033, 'learning_rate': 8.129984106418354e-08, 'epoch': 2.78}
93%|█████████▎| 2620/2826 [4:16:45<20:03, 5.84s/it]
93%|█████████▎| 2621/2826 [4:16:51<19:10, 5.61s/it]
93%|█████████▎| 2622/2826 [4:16:56<18:29, 5.44s/it]
93%|█████████▎| 2623/2826 [4:17:02<19:40, 5.81s/it]
93%|█████████▎| 2624/2826 [4:17:08<19:46, 5.88s/it]
93%|█████████▎| 2625/2826 [4:17:14<19:08, 5.72s/it]
93%|█████████▎| 2626/2826 [4:17:19<19:02, 5.71s/it]
93%|█████████▎| 2627/2826 [4:17:25<19:14, 5.80s/it]
93%|█████████▎| 2628/2826 [4:17:33<20:31, 6.22s/it]
93%|█████████▎| 2629/2826 [4:17:39<20:35, 6.27s/it]
93%|█████████▎| 2630/2826 [4:17:44<19:21, 5.93s/it]
{'loss': 0.2195, 'grad_norm': 2.5548455715179443, 'learning_rate': 7.3672403281142e-08, 'epoch': 2.79}
93%|█████████▎| 2630/2826 [4:17:44<19:21, 5.93s/it]
93%|█████████▎| 2631/2826 [4:17:50<19:21, 5.96s/it]
93%|█████████▎| 2632/2826 [4:17:56<19:21, 5.99s/it]
93%|█████████▎| 2633/2826 [4:18:03<20:08, 6.26s/it]
93%|█████████▎| 2634/2826 [4:18:09<19:32, 6.11s/it]
93%|█████████▎| 2635/2826 [4:18:14<18:40, 5.87s/it]
93%|█████████▎| 2636/2826 [4:18:20<18:22, 5.80s/it]
93%|█████████▎| 2637/2826 [4:18:25<17:29, 5.55s/it]
93%|█████████▎| 2638/2826 [4:18:30<17:29, 5.58s/it]
93%|█████████▎| 2639/2826 [4:18:36<17:08, 5.50s/it]
93%|█████████▎| 2640/2826 [4:18:42<18:07, 5.85s/it]
{'loss': 0.1748, 'grad_norm': 1.7952167987823486, 'learning_rate': 6.641526313404534e-08, 'epoch': 2.8}
93%|█████████▎| 2640/2826 [4:18:42<18:07, 5.85s/it]
93%|█████████▎| 2641/2826 [4:18:47<17:17, 5.61s/it]
93%|█████████▎| 2642/2826 [4:18:53<17:28, 5.70s/it]
94%|█████████▎| 2643/2826 [4:18:59<17:45, 5.82s/it]
94%|█████████▎| 2644/2826 [4:19:06<18:19, 6.04s/it]
94%|█████████▎| 2645/2826 [4:19:11<17:25, 5.78s/it]
94%|█████████▎| 2646/2826 [4:19:16<16:43, 5.57s/it]
94%|█████████▎| 2647/2826 [4:19:21<16:19, 5.47s/it]
94%|█████████▎| 2648/2826 [4:19:27<16:34, 5.59s/it]
94%|█████████▎| 2649/2826 [4:19:33<16:24, 5.56s/it]
94%|█████████▍| 2650/2826 [4:19:40<17:37, 6.01s/it]
{'loss': 0.2061, 'grad_norm': 2.376830816268921, 'learning_rate': 5.952952818225416e-08, 'epoch': 2.81}
94%|█████████▍| 2650/2826 [4:19:40<17:37, 6.01s/it]
94%|█████████▍| 2651/2826 [4:19:45<16:43, 5.73s/it]
94%|█████████▍| 2652/2826 [4:19:51<16:39, 5.74s/it]
94%|█████████▍| 2653/2826 [4:19:56<16:36, 5.76s/it]
94%|█████████▍| 2654/2826 [4:20:02<16:01, 5.59s/it]
94%|█████████▍| 2655/2826 [4:20:07<15:57, 5.60s/it]
94%|█████████▍| 2656/2826 [4:20:14<16:59, 5.99s/it]
94%|█████████▍| 2657/2826 [4:20:20<16:48, 5.97s/it]
94%|█████████▍| 2658/2826 [4:20:26<16:15, 5.81s/it]
94%|█████████▍| 2659/2826 [4:20:31<15:27, 5.55s/it]
94%|█████████▍| 2660/2826 [4:20:36<15:22, 5.56s/it]
{'loss': 0.1742, 'grad_norm': 1.7183632850646973, 'learning_rate': 5.3016249302565436e-08, 'epoch': 2.82}
94%|█████████▍| 2660/2826 [4:20:36<15:22, 5.56s/it]
94%|█████████▍| 2661/2826 [4:20:41<14:54, 5.42s/it]
94%|█████████▍| 2662/2826 [4:20:48<15:59, 5.85s/it]
94%|█████████▍| 2663/2826 [4:20:54<15:47, 5.81s/it]
94%|█████████▍| 2664/2826 [4:21:00<16:05, 5.96s/it]
94%|█████████▍| 2665/2826 [4:21:06<15:54, 5.93s/it]
94%|█████████▍| 2666/2826 [4:21:12<15:59, 6.00s/it]
94%|█████████▍| 2667/2826 [4:21:18<15:26, 5.83s/it]
94%|█████████▍| 2668/2826 [4:21:24<16:07, 6.13s/it]
94%|█████████▍| 2669/2826 [4:21:30<15:40, 5.99s/it]
94%|█████████▍| 2670/2826 [4:21:35<14:55, 5.74s/it]
{'loss': 0.2082, 'grad_norm': 2.11011004447937, 'learning_rate': 4.6876420528833014e-08, 'epoch': 2.83}
94%|█████████▍| 2670/2826 [4:21:35<14:55, 5.74s/it]
95%|█████████▍| 2671/2826 [4:21:41<14:59, 5.80s/it]
95%|█████████▍| 2672/2826 [4:21:46<14:25, 5.62s/it]
95%|█████████▍| 2673/2826 [4:21:52<14:39, 5.75s/it]
95%|█████████▍| 2674/2826 [4:21:59<15:02, 5.94s/it]
95%|█████████▍| 2675/2826 [4:22:05<15:10, 6.03s/it]
95%|█████████▍| 2676/2826 [4:22:13<16:34, 6.63s/it]
95%|█████████▍| 2677/2826 [4:22:18<15:22, 6.19s/it]
95%|█████████▍| 2678/2826 [4:22:24<14:48, 6.00s/it]
95%|█████████▍| 2679/2826 [4:22:30<15:03, 6.14s/it]
95%|█████████▍| 2680/2826 [4:22:35<14:11, 5.83s/it]
{'loss': 0.1805, 'grad_norm': 1.8799868822097778, 'learning_rate': 4.111097890026089e-08, 'epoch': 2.84}
95%|█████████▍| 2680/2826 [4:22:35<14:11, 5.83s/it]
95%|█████████▍| 2681/2826 [4:22:40<13:30, 5.59s/it]
95%|█████████▍| 2682/2826 [4:22:46<13:19, 5.55s/it]
95%|█████████▍| 2683/2826 [4:22:51<13:01, 5.46s/it]
95%|█████████▍| 2684/2826 [4:22:57<13:24, 5.66s/it]
95%|█████████▌| 2685/2826 [4:23:04<14:13, 6.06s/it]
95%|█████████▌| 2686/2826 [4:23:10<13:57, 5.98s/it]
95%|█████████▌| 2687/2826 [4:23:17<14:17, 6.17s/it]
95%|█████████▌| 2688/2826 [4:23:22<13:59, 6.08s/it]
95%|█████████▌| 2689/2826 [4:23:29<13:54, 6.09s/it]
95%|█████████▌| 2690/2826 [4:23:34<13:18, 5.87s/it]
{'loss': 0.2058, 'grad_norm': 2.5171291828155518, 'learning_rate': 3.5720804318395976e-08, 'epoch': 2.85}
95%|█████████▌| 2690/2826 [4:23:34<13:18, 5.87s/it]
95%|█████████▌| 2691/2826 [4:23:39<12:41, 5.64s/it]
95%|█████████▌| 2692/2826 [4:23:45<12:54, 5.78s/it]
95%|█████████▌| 2693/2826 [4:23:51<13:09, 5.93s/it]
95%|█████████▌| 2694/2826 [4:23:59<13:56, 6.34s/it]
95%|█████████▌| 2695/2826 [4:24:04<13:13, 6.05s/it]
95%|█████████▌| 2696/2826 [4:24:10<12:43, 5.88s/it]
95%|█████████▌| 2697/2826 [4:24:16<12:52, 5.99s/it]
95%|█████████▌| 2698/2826 [4:24:21<12:31, 5.87s/it]
96%|█████████▌| 2699/2826 [4:24:27<12:23, 5.86s/it]
96%|█████████▌| 2700/2826 [4:24:32<11:51, 5.65s/it]
{'loss': 0.2027, 'grad_norm': 2.142263650894165, 'learning_rate': 3.0706719412839926e-08, 'epoch': 2.86}
96%|█████████▌| 2700/2826 [4:24:32<11:51, 5.65s/it]
96%|█████████▌| 2701/2826 [4:24:39<12:09, 5.84s/it]
96%|█████████▌| 2702/2826 [4:24:44<11:38, 5.64s/it]
96%|█████████▌| 2703/2826 [4:24:49<11:25, 5.58s/it]
96%|█████████▌| 2704/2826 [4:24:55<11:23, 5.60s/it]
96%|█████████▌| 2705/2826 [4:25:02<11:59, 5.94s/it]
96%|█████████▌| 2706/2826 [4:25:08<12:19, 6.17s/it]
96%|█████████▌| 2707/2826 [4:25:14<11:53, 6.00s/it]
96%|█████████▌| 2708/2826 [4:25:19<11:30, 5.86s/it]
96%|█████████▌| 2709/2826 [4:25:26<11:55, 6.12s/it]
96%|█████████▌| 2710/2826 [4:25:32<11:20, 5.87s/it]
{'loss': 0.1941, 'grad_norm': 2.2124040126800537, 'learning_rate': 2.6069489415703197e-08, 'epoch': 2.87}
96%|█████████▌| 2710/2826 [4:25:32<11:20, 5.87s/it]
96%|█████████▌| 2711/2826 [4:25:38<11:24, 5.95s/it]
96%|█████████▌| 2712/2826 [4:25:43<10:59, 5.78s/it]
96%|█████████▌| 2713/2826 [4:25:49<10:52, 5.78s/it]
96%|█████████▌| 2714/2826 [4:25:56<11:48, 6.32s/it]
96%|█████████▌| 2715/2826 [4:26:03<11:52, 6.42s/it]
96%|█████████▌| 2716/2826 [4:26:09<11:15, 6.14s/it]
96%|█████████▌| 2717/2826 [4:26:14<10:35, 5.83s/it]
96%|█████████▌| 2718/2826 [4:26:19<10:11, 5.66s/it]
96%|█████████▌| 2719/2826 [4:26:24<09:45, 5.48s/it]
96%|█████████▌| 2720/2826 [4:26:30<10:07, 5.73s/it]
{'loss': 0.2029, 'grad_norm': 2.033259153366089, 'learning_rate': 2.18098220448168e-08, 'epoch': 2.88}
96%|█████████▌| 2720/2826 [4:26:30<10:07, 5.73s/it]
96%|█████████▋| 2721/2826 [4:26:35<09:43, 5.56s/it]
96%|█████████▋| 2722/2826 [4:26:43<10:46, 6.22s/it]
96%|█████████▋| 2723/2826 [4:26:48<10:03, 5.86s/it]
96%|█████████▋| 2724/2826 [4:26:54<09:49, 5.78s/it]
96%|█████████▋| 2725/2826 [4:26:59<09:22, 5.57s/it]
96%|█████████▋| 2726/2826 [4:27:04<09:15, 5.56s/it]
96%|█████████▋| 2727/2826 [4:27:10<09:25, 5.71s/it]
97%|█████████▋| 2728/2826 [4:27:17<09:38, 5.90s/it]
97%|█████████▋| 2729/2826 [4:27:22<09:10, 5.67s/it]
97%|█████████▋| 2730/2826 [4:27:27<08:54, 5.56s/it]
{'loss': 0.2062, 'grad_norm': 2.416912794113159, 'learning_rate': 1.7928367395725066e-08, 'epoch': 2.9}
97%|█████████▋| 2730/2826 [4:27:27<08:54, 5.56s/it]
97%|█████████▋| 2731/2826 [4:27:33<08:47, 5.55s/it]
97%|█████████▋| 2732/2826 [4:27:39<08:47, 5.61s/it]
97%|█████████▋| 2733/2826 [4:27:46<09:34, 6.17s/it]
97%|█████████▋| 2734/2826 [4:27:53<09:58, 6.50s/it]
97%|█████████▋| 2735/2826 [4:27:58<09:12, 6.08s/it]
97%|█████████▋| 2736/2826 [4:28:05<09:17, 6.19s/it]
97%|█████████▋| 2737/2826 [4:28:11<09:02, 6.10s/it]
97%|█████████▋| 2738/2826 [4:28:16<08:45, 5.97s/it]
97%|█████████▋| 2739/2826 [4:28:21<08:14, 5.69s/it]
97%|█████████▋| 2740/2826 [4:28:27<08:14, 5.75s/it]
{'loss': 0.1873, 'grad_norm': 2.193751096725464, 'learning_rate': 1.442571784246699e-08, 'epoch': 2.91}
97%|█████████▋| 2740/2826 [4:28:27<08:14, 5.75s/it]
97%|█████████▋| 2741/2826 [4:28:35<08:48, 6.22s/it]
97%|█████████▋| 2742/2826 [4:28:42<09:04, 6.48s/it]
97%|█████████▋| 2743/2826 [4:28:47<08:25, 6.09s/it]
97%|█████████▋| 2744/2826 [4:28:53<08:24, 6.16s/it]
97%|█████████▋| 2745/2826 [4:28:59<07:59, 5.92s/it]
97%|█████████▋| 2746/2826 [4:29:04<07:34, 5.68s/it]
97%|█████████▋| 2747/2826 [4:29:10<07:53, 5.99s/it]
97%|█████████▋| 2748/2826 [4:29:16<07:27, 5.74s/it]
97%|█████████▋| 2749/2826 [4:29:21<07:14, 5.65s/it]
97%|█████████▋| 2750/2826 [4:29:27<07:19, 5.78s/it]
{'loss': 0.1653, 'grad_norm': 1.5729731321334839, 'learning_rate': 1.1302407947173522e-08, 'epoch': 2.92}
97%|█████████▋| 2750/2826 [4:29:27<07:19, 5.78s/it]
97%|█████████▋| 2751/2826 [4:29:33<07:26, 5.96s/it]
97%|█████████▋| 2752/2826 [4:29:40<07:35, 6.16s/it]
97%|█████████▋| 2753/2826 [4:29:47<07:56, 6.53s/it]
97%|█████████▋| 2754/2826 [4:29:55<08:02, 6.70s/it]
97%|█████████▋| 2755/2826 [4:30:00<07:26, 6.28s/it]
98%|█████████▊| 2756/2826 [4:30:06<07:10, 6.15s/it]
98%|█████████▊| 2757/2826 [4:30:11<06:44, 5.86s/it]
98%|█████████▊| 2758/2826 [4:30:16<06:21, 5.62s/it]
98%|█████████▊| 2759/2826 [4:30:21<06:13, 5.57s/it]
98%|█████████▊| 2760/2826 [4:30:28<06:28, 5.89s/it]
{'loss': 0.1743, 'grad_norm': 1.7562044858932495, 'learning_rate': 8.558914378481996e-09, 'epoch': 2.93}
98%|█████████▊| 2760/2826 [4:30:28<06:28, 5.89s/it]
98%|█████████▊| 2761/2826 [4:30:34<06:20, 5.86s/it]
98%|█████████▊| 2762/2826 [4:30:39<06:00, 5.63s/it]
98%|█████████▊| 2763/2826 [4:30:44<05:47, 5.52s/it]
98%|█████████▊| 2764/2826 [4:30:49<05:32, 5.36s/it]
98%|█████████▊| 2765/2826 [4:30:56<06:00, 5.91s/it]
98%|█████████▊| 2766/2826 [4:31:03<06:07, 6.12s/it]
98%|█████████▊| 2767/2826 [4:31:09<06:06, 6.21s/it]
98%|█████████▊| 2768/2826 [4:31:15<05:54, 6.12s/it]
98%|█████████▊| 2769/2826 [4:31:21<05:36, 5.90s/it]
98%|█████████▊| 2770/2826 [4:31:26<05:24, 5.80s/it]
{'loss': 0.1821, 'grad_norm': 2.183967351913452, 'learning_rate': 6.195655838790726e-09, 'epoch': 2.94}
98%|█████████▊| 2770/2826 [4:31:26<05:24, 5.80s/it]
98%|█████████▊| 2771/2826 [4:31:32<05:14, 5.73s/it]
98%|█████████▊| 2772/2826 [4:31:37<05:00, 5.56s/it]
98%|█████████▊| 2773/2826 [4:31:44<05:12, 5.90s/it]
98%|█████████▊| 2774/2826 [4:31:49<05:02, 5.81s/it]
98%|█████████▊| 2775/2826 [4:31:55<04:49, 5.67s/it]
98%|█████████▊| 2776/2826 [4:32:00<04:35, 5.50s/it]
98%|█████████▊| 2777/2826 [4:32:07<04:50, 5.92s/it]
98%|█████████▊| 2778/2826 [4:32:14<05:03, 6.33s/it]
98%|█████████▊| 2779/2826 [4:32:20<04:51, 6.20s/it]
98%|█████████▊| 2780/2826 [4:32:26<04:42, 6.13s/it]
{'loss': 0.1954, 'grad_norm': 1.9312433004379272, 'learning_rate': 4.212993000356491e-09, 'epoch': 2.95}
98%|█████████▊| 2780/2826 [4:32:26<04:42, 6.13s/it]
98%|█████████▊| 2781/2826 [4:32:31<04:22, 5.84s/it]
98%|█████████▊| 2782/2826 [4:32:36<04:06, 5.61s/it]
98%|█████████▊| 2783/2826 [4:32:42<04:02, 5.64s/it]
99%|█████████▊| 2784/2826 [4:32:49<04:13, 6.03s/it]
99%|█████████▊| 2785/2826 [4:32:54<03:57, 5.79s/it]
99%|█████████▊| 2786/2826 [4:32:59<03:43, 5.58s/it]
99%|█████████▊| 2787/2826 [4:33:04<03:31, 5.42s/it]
99%|█████████▊| 2788/2826 [4:33:09<03:24, 5.38s/it]
99%|█████████▊| 2789/2826 [4:33:15<03:23, 5.50s/it]
99%|█████████▊| 2790/2826 [4:33:21<03:21, 5.61s/it]
{'loss': 0.1925, 'grad_norm': 2.2055087089538574, 'learning_rate': 2.611228450250802e-09, 'epoch': 2.96}
99%|█████████▊| 2790/2826 [4:33:21<03:21, 5.61s/it]
99%|█████████▉| 2791/2826 [4:33:28<03:29, 6.00s/it]
99%|█████████▉| 2792/2826 [4:33:35<03:30, 6.19s/it]
99%|█████████▉| 2793/2826 [4:33:40<03:19, 6.06s/it]
99%|█████████▉| 2794/2826 [4:33:46<03:12, 6.01s/it]
99%|█████████▉| 2795/2826 [4:33:53<03:14, 6.28s/it]
99%|█████████▉| 2796/2826 [4:33:59<03:01, 6.06s/it]
99%|█████████▉| 2797/2826 [4:34:04<02:46, 5.75s/it]
99%|█████████▉| 2798/2826 [4:34:09<02:37, 5.62s/it]
99%|█████████▉| 2799/2826 [4:34:16<02:41, 5.97s/it]
99%|█████████▉| 2800/2826 [4:34:23<02:48, 6.49s/it]
{'loss': 0.1805, 'grad_norm': 1.6606404781341553, 'learning_rate': 1.3906066441798927e-09, 'epoch': 2.97}
99%|█████████▉| 2800/2826 [4:34:23<02:48, 6.49s/it]
99%|█████████▉| 2801/2826 [4:34:30<02:45, 6.62s/it]
99%|█████████▉| 2802/2826 [4:34:36<02:29, 6.22s/it]
99%|█████████▉| 2803/2826 [4:34:41<02:19, 6.05s/it]
99%|█████████▉| 2804/2826 [4:34:47<02:10, 5.92s/it]
99%|█████████▉| 2805/2826 [4:34:53<02:04, 5.93s/it]
99%|█████████▉| 2806/2826 [4:34:59<02:02, 6.11s/it]
99%|█████████▉| 2807/2826 [4:35:05<01:51, 5.86s/it]
99%|█████████▉| 2808/2826 [4:35:10<01:41, 5.66s/it]
99%|█████████▉| 2809/2826 [4:35:15<01:34, 5.54s/it]
99%|█████████▉| 2810/2826 [4:35:21<01:30, 5.63s/it]
{'loss': 0.2084, 'grad_norm': 2.594404458999634, 'learning_rate': 5.513138691767839e-10, 'epoch': 2.98}
99%|█████████▉| 2810/2826 [4:35:21<01:30, 5.63s/it]
99%|█████████▉| 2811/2826 [4:35:28<01:28, 5.93s/it]
100%|█████████▉| 2812/2826 [4:35:35<01:27, 6.26s/it]
100%|█████████▉| 2813/2826 [4:35:42<01:25, 6.55s/it]
100%|█████████▉| 2814/2826 [4:35:47<01:13, 6.16s/it]
100%|█████████▉| 2815/2826 [4:35:53<01:06, 6.01s/it]
100%|█████████▉| 2816/2826 [4:35:58<00:57, 5.75s/it]
100%|█████████▉| 2817/2826 [4:36:05<00:54, 6.09s/it]
100%|█████████▉| 2818/2826 [4:36:12<00:50, 6.37s/it]
100%|█████████▉| 2819/2826 [4:36:17<00:42, 6.13s/it]
100%|█████████▉| 2820/2826 [4:36:24<00:38, 6.37s/it]
{'loss': 0.2115, 'grad_norm': 2.007861375808716, 'learning_rate': 9.347821517069477e-11, 'epoch': 2.99}
100%|█████████▉| 2820/2826 [4:36:24<00:38, 6.37s/it]
100%|█████████▉| 2821/2826 [4:36:31<00:33, 6.60s/it]
100%|█████████▉| 2822/2826 [4:36:37<00:24, 6.17s/it]
100%|█████████▉| 2823/2826 [4:36:42<00:17, 5.84s/it]
100%|█████████▉| 2824/2826 [4:36:48<00:11, 5.91s/it]
100%|█████████▉| 2825/2826 [4:36:54<00:05, 5.88s/it]
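The logged learning rates shrink from ~1.8e-7 at step 2520 to ~9.3e-11 at step 2820, the tail of a cosine decay reaching zero at step 2826. A minimal sketch of how such a schedule is computed (base_lr and warmup_steps below are illustrative assumptions, not values recovered from this run):

import math

def cosine_lr(step, total_steps=2826, warmup_steps=100, base_lr=7e-6):
    """Cosine decay with linear warmup; all settings here are assumed."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Near total_steps the cosine factor collapses toward zero, which is why
# the logged rates fall from the 1e-7 range into the 1e-10 range.
for step in (2520, 2680, 2820):
    print(step, f"{cosine_lr(step):.3e}")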
100%|██████████| 2826/2826 [4:36:59<00:00, 5.66s/it]
[INFO|trainer.py:3984] 2025-10-18 11:23:15,155 >> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826
[INFO|configuration_utils.py:419] 2025-10-18 11:23:15,160 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/config.json
[INFO|configuration_utils.py:911] 2025-10-18 11:23:15,162 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 11:23:35,979 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split into 4 checkpoint shards. You can find where each parameter has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 11:23:35,982 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 11:23:35,983 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/special_tokens_map.json
[2025-10-18 11:23:36,183] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step2825 is about to be saved!
[2025-10-18 11:23:36,615] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-10-18 11:23:36,615] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-10-18 11:23:36,633] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-10-18 11:23:36,637] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-10-18 11:23:55,201] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-10-18 11:23:55,202] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-10-18 11:23:55,663] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2825 is ready now!
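Two kinds of checkpoints are written here: the consolidated safetensors weights under checkpoint-2826, and the ZeRO-partitioned model/optimizer shards under checkpoint-2826/global_step2825 (one model-states file and one optim-states file per rank; only rank 0 appears in this log). If only the ZeRO shards were at hand, the fp32 weights could be rebuilt with DeepSpeed's zero_to_fp32 utility; a minimal sketch, assuming the standard DeepSpeed checkpoint layout shown above:

from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "/mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826"
# Merge the per-rank ZeRO shards saved under global_step2825 into a single
# fp32 state dict (runs on CPU and needs enough RAM for the full model).
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir, tag="global_step2825")
print(f"recovered {len(state_dict)} tensors")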
[INFO|trainer.py:2681] 2025-10-18 11:23:55,685 >> Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 16671.2674, 'train_samples_per_second': 2.713, 'train_steps_per_second': 0.17, 'train_loss': 0.34044326600333263, 'epoch': 3.0}
100%|██████████| 2826/2826 [4:37:51<00:00, 5.66s/it]
100%|██████████| 2826/2826 [4:37:51<00:00, 5.90s/it]
[INFO|trainer.py:3984] 2025-10-18 11:24:06,471 >> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17
[INFO|configuration_utils.py:419] 2025-10-18 11:24:06,477 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/config.json
[INFO|configuration_utils.py:911] 2025-10-18 11:24:06,480 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 11:24:26,439 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split into 4 checkpoint shards. You can find where each parameter has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 11:24:26,442 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 11:24:26,443 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/special_tokens_map.json
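The final export is a sharded safetensors checkpoint plus tokenizer files, so it loads like any Hugging Face model directory; a quick sanity-check sketch (the path is taken from the log):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

save_dir = "/mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17"
# from_pretrained follows model.safetensors.index.json and loads all 4 shards.
tokenizer = AutoTokenizer.from_pretrained(save_dir)
model = AutoModelForCausalLM.from_pretrained(save_dir, torch_dtype=torch.bfloat16)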
***** train metrics *****
epoch = 2.9973
total_flos = 101656586GF
train_loss = 0.3404
train_runtime = 4:37:51.26
train_samples_per_second = 2.713
train_steps_per_second = 0.17
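These metrics are internally consistent with the progress bar; a short sketch that recomputes them from the raw runtime (the batch-size decomposition in the last comment is an inference, not a logged value):

train_runtime = 16671.2674            # seconds, from the log
steps, epochs = 2826, 2.9973

print(train_runtime / steps)          # ~5.90 s/it, matching the final bar
samples = 2.713 * train_runtime       # ~45,229 samples processed in total
print(samples / epochs)               # ~15,090 samples per epoch
# samples/s divided by steps/s is the effective global batch size:
print(2.713 / 0.17)                   # ~16, e.g. 8 GPUs x micro-batch 1 x grad-accum 2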
Figure saved at: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/training_loss.png
[WARNING|2025-10-18 11:24:27] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[WARNING|2025-10-18 11:24:27] llamafactory.extras.ploting:148 >> No metric eval_accuracy to plot.
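Only the training loss was plotted because no eval dataset was configured, hence the two warnings above. The curve can be re-drawn offline from trainer_state.json, which the Trainer writes inside each checkpoint directory; a minimal sketch assuming the standard log_history layout:

import json
import matplotlib.pyplot as plt

ckpt = "/mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826"
with open(f"{ckpt}/trainer_state.json") as f:
    history = json.load(f)["log_history"]

steps = [e["step"] for e in history if "loss" in e]
losses = [e["loss"] for e in history if "loss" in e]
plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.savefig("training_loss_replot.png")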
[INFO|modelcard.py:450] 2025-10-18 11:24:27,224 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}