W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792]
W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792] *****************************************
W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792] *****************************************
[2025-10-18 06:44:00,629] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-18 06:44:00,969] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-18 06:44:01,009] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-18 06:44:01,026] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-18 06:44:01,031] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-18 06:44:01,033] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-18 06:44:01,039] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-18 06:44:01,040] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-18 06:44:04,890] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-10-18 06:44:05,492] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-10-18 06:44:05,518] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-10-18 06:44:05,602] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-10-18 06:44:05,625] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-10-18 06:44:05,666] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-10-18 06:44:05,666] [INFO] [comm.py:700:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-10-18 06:44:05,673] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-10-18 06:44:05,829] [INFO] [comm.py:669:init_distributed] cdb=None
[INFO|2025-10-18 06:44:06] llamafactory.hparams.parser:406 >> Process rank: 5, world size: 8, device: cuda:5, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:06] llamafactory.hparams.parser:406 >> Process rank: 0, world size: 8, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2323] 2025-10-18 06:44:07,355 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
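The torchrun banner above notes that `OMP_NUM_THREADS` defaults to 1 per worker to avoid oversubscribing the host. A minimal sketch of picking a saner value before launch (the helper name is hypothetical; the env var is the one torchrun warns about):

```python
import os

def omp_threads_per_rank(num_cpus: int, nproc_per_node: int) -> int:
    """Evenly split host cores across local ranks, at least 1 thread each."""
    return max(1, num_cpus // nproc_per_node)

# Export before spawning workers; torchrun will not override an explicit value.
os.environ.setdefault(
    "OMP_NUM_THREADS",
    str(omp_threads_per_rank(os.cpu_count() or 1, nproc_per_node=8)),
)
```

Whether more than one OpenMP thread per rank helps depends on the CPU-side work (here, CPU Adam offload later in this log is a candidate beneficiary).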
[INFO|configuration_utils.py:691] 2025-10-18 06:44:07,357 >> loading configuration file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/config.json
[INFO|configuration_utils.py:765] 2025-10-18 06:44:07,359 >> Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.3",
  "use_cache": false,
  "use_sliding_window": false,
  "vocab_size": 152064
}
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file chat_template.jinja
[rank5]:[W1018 06:44:07.811198098 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
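The Qwen2Config above fully determines the dense parameter count. A sanity-check sketch (hypothetical helper; assumes Qwen2's known layout: q/k/v projections with bias, bias-free o_proj and MLP projections, untied input/output embeddings, two RMSNorms per layer plus a final norm):

```python
def qwen2_param_count(cfg: dict) -> int:
    """Parameter count implied by a Qwen2-style config dict."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim          # GQA key/value width
    embed = cfg["vocab_size"] * h * (1 if cfg["tie_word_embeddings"] else 2)
    attn = (h * h + h) + 2 * (h * kv_dim + kv_dim) + h * h  # q(+bias), k/v(+bias), o
    mlp = 3 * h * cfg["intermediate_size"]                  # gate, up, down
    per_layer = attn + mlp + 2 * h                          # + 2 RMSNorms per layer
    return embed + cfg["num_hidden_layers"] * per_layer + h # + final norm

cfg = dict(hidden_size=3584, intermediate_size=18944, num_hidden_layers=28,
           num_attention_heads=28, num_key_value_heads=4,
           vocab_size=152064, tie_word_embeddings=False)
total = qwen2_param_count(cfg)  # 7,615,616,512 for this config
```

The result matches the `trainable params: 7,615,616,512` line reported later in this log.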
[INFO|tokenization_utils_base.py:2323] 2025-10-18 06:44:07,759 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-10-18 06:44:07] llamafactory.data.loader:143 >> Loading dataset /mmu_nlp_ssd/dongguanting/tool_light_data/method7-qwen2.5-7b-instruct-llama-factory-sft-edition17.json...
[INFO|2025-10-18 06:44:07] llamafactory.hparams.parser:406 >> Process rank: 4, world size: 8, device: cuda:4, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:07] llamafactory.hparams.parser:406 >> Process rank: 1, world size: 8, device: cuda:1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 7, world size: 8, device: cuda:7, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 6, world size: 8, device: cuda:6, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 2, world size: 8, device: cuda:2, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 3, world size: 8, device: cuda:3, distributed training: True, compute dtype: torch.bfloat16
[rank3]:[W1018 06:44:08.022060437 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank4]:[W1018 06:44:08.046568749 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank1]:[W1018 06:44:08.058006051 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank6]:[W1018 06:44:08.087474291 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank7]:[W1018 06:44:08.088871679 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank2]:[W1018 06:44:08.124219695 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 15077 examples [00:01, 11581.17 examples/s]
Generating train split: 15077 examples [00:01, 11559.35 examples/s]
Converting format of dataset (num_proc=16): 0%| | 0/15077 [00:00<?, ? examples/s]
<|im_start|>system
You are Qwen, created by Alibaba Cloud.
You are a helpful assistant.<|im_end|>
<|im_start|>user
You are a helpful assistant that can solve the given question step by step with the help of the wikipedia search tool and python interpreter tool. Given a question, you need to first think about the reasoning process in the mind and then provide the answer. During thinking, you can invoke the wikipedia search tool to search and python interpreter tool to calculate the math problem for fact information about specific topics if needed. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags respectively, and the search query and result are enclosed within <search> </search> and <result> </result> tags respectively. For example, <think> This is the reasoning process. </think> <search> search query here </search> <result> search result here </result> <think> This is the reasoning process. </think> <python> python code here </python> <result> python interpreter result here </result> <think> This is the reasoning process. </think> <answer> The final answer is \[ \boxed{answer here} \] </answer>. In the last part of the answer, the final exact answer is enclosed within \boxed{} with latex format. When did Roy E. Disney's father die?<|im_end|>
<|im_start|>assistant
<think>To determine when Roy E. Disney's father died, I will look up biographical information on him.</think><search>Roy E. Disney father death date</search><result>**Final Information** Based on the provided documents, Roy E. Disney’s father, Roy O. Disney, passed away on December 20, 1971, due to a stroke. This information comes directly from the historical details about Roy O.
Disney within these sources.The final answer is \boxed{December 20, 1971}.\<|im_end|> label_ids: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 13708, 766, 93376, 8253, 979, 11038, 468, 13, 16390, 594, 6981, 8469, 11, 358, 686, 1401, 705, 6032, 31177, 1995, 389, 1435, 3918, 26865, 1784, 1836, 43960, 2253, 468, 13, 16390, 6981, 4545, 
2400, 522, 1836, 1784, 1382, 29, 334, 19357, 8085, 56177, 28715, 389, 279, 3897, 9293, 11, 11038, 468, 13, 16390, 748, 6981, 11, 11038, 506, 13, 16390, 11, 5823, 3123, 389, 6652, 220, 17, 15, 11, 220, 16, 24, 22, 16, 11, 4152, 311, 264, 12654, 13, 1096, 1995, 4041, 5961, 504, 279, 13656, 3565, 911, 11038, 506, 13, 16390, 2878, 1493, 8173, 3918, 1382, 1784, 9217, 16357, 1590, 4226, 374, 1124, 79075, 90, 32146, 220, 17, 15, 11, 220, 16, 24, 22, 16, 92, 7110, 522, 9217, 29, 151645, 198]
labels: To determine when Roy E. Disney's father died, I will look up biographical information on him.</think><search>Roy E. Disney father death date</search><result>**Final Information** Based on the provided documents, Roy E. Disney’s father, Roy O. Disney, passed away on December 20, 1971, due to a stroke. This information comes directly from the historical details about Roy O. Disney within these sources.</result><answer>The final answer is \boxed{December 20, 1971}.\</answer><|im_end|>
[INFO|configuration_utils.py:691] 2025-10-18 06:45:06,145 >> loading configuration file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/config.json
[INFO|configuration_utils.py:765] 2025-10-18 06:45:06,147 >> Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.3",
  "use_cache": false,
  "use_sliding_window": false,
  "vocab_size": 152064
}
[INFO|2025-10-18 06:45:06] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.
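The `label_ids` dump above shows the standard SFT masking scheme: every prompt position is set to -100 (the index ignored by PyTorch's cross-entropy loss), so only the assistant response contributes to training. A minimal sketch:

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def build_sft_labels(prompt_ids: list, response_ids: list):
    """Concatenate prompt and response; mask the prompt out of the loss."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

inp, lab = build_sft_labels([1, 2, 3], [7, 8])
```

This is why the printed `labels:` string contains only the assistant turn while `label_ids` begins with a long run of -100s covering the system and user turns.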
Applied Liger kernels to Qwen2
[INFO|2025-10-18 06:45:06] llamafactory.model.model_utils.liger_kernel:143 >> Liger kernel has been applied to the model.
[INFO|modeling_utils.py:1121] 2025-10-18 06:45:07,293 >> loading weights file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/model.safetensors.index.json
[INFO|modeling_utils.py:3726] 2025-10-18 06:45:07,308 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[INFO|configuration_utils.py:1142] 2025-10-18 06:45:07,321 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "use_cache": false
}
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
[WARNING|logging.py:328] 2025-10-18 06:45:07,616 >> Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
[2025-10-18 06:45:09,887] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 339, num_elems = 7.62B
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4938] 2025-10-18 06:45:42,606 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
/mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:64: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx: "torch.autograd.Function", grad_output: "torch.Tensor") -> "torch.Tensor":
[INFO|configuration_utils.py:1095] 2025-10-18 06:45:42,608 >> loading configuration file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/generation_config.json
[INFO|configuration_utils.py:1142] 2025-10-18 06:45:42,608 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}
[INFO|2025-10-18 06:45:42] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-10-18 06:45:42] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-10-18 06:45:42] llamafactory.model.adapter:143 >> DeepSpeed ZeRO3 detected, remaining trainable params in float32.
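The generation config above enables sampling with `top_k=20` and `top_p=0.8`. A pure-Python sketch of that filtering step (hypothetical helper; returns the indices of surviving tokens, and, like the Transformers implementation, keeps the token whose probability mass crosses the top-p threshold):

```python
import math

def filter_logits(logits: list, top_k: int = 20, top_p: float = 0.8) -> list:
    """Top-k then nucleus (top-p) filtering; returns kept token indices."""
    # Keep the top_k highest-logit tokens, best first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the surviving set only (filtered logits are renormalized).
    exps = [math.exp(logits[i]) for i in order]
    z = sum(exps)
    keep, cum = [], 0.0
    for idx, e in zip(order, exps):
        keep.append(idx)
        cum += e / z
        if cum >= top_p:  # smallest prefix whose mass reaches top_p
            break
    return keep
```

Sampling then draws from the renormalized distribution over `keep`, with `temperature=0.7` applied to the logits beforehand.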
[INFO|2025-10-18 06:45:42] llamafactory.model.adapter:143 >> Fine-tuning method: Full
[INFO|2025-10-18 06:45:42] llamafactory.model.loader:143 >> trainable params: 7,615,616,512 || all params: 7,615,616,512 || trainable%: 100.0000
[INFO|trainer.py:748] 2025-10-18 06:45:42,648 >> Using auto half precision backend
[INFO|deepspeed.py:380] 2025-10-18 06:45:43,067 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
Installed CUDA version 12.3 does not match the version torch was compiled with 12.4 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu124/cpu_adam/build.ninja...
/mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.425875186920166 seconds
Time to load cpu_adam op: 2.4944374561309814 seconds
Time to load cpu_adam op: 2.5600526332855225 seconds
Time to load cpu_adam op: 2.7998712062835693 seconds
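The `trainable params || all params || trainable%` line earlier reports full fine-tuning: every parameter is trainable, hence 100.0000%. A sketch of how such a summary can be computed, using plain `(numel, requires_grad)` pairs in place of real torch parameters:

```python
def trainable_summary(params):
    """params: iterable of (numel, requires_grad) pairs.

    Returns (trainable, total, trainable_percent), mirroring the
    'trainable params || all params || trainable%' log line.
    """
    trainable = sum(n for n, requires_grad in params if requires_grad)
    total = sum(n for n, _ in params)
    return trainable, total, 100.0 * trainable / total
```

With LoRA or other adapter methods the first number would be a small fraction of the second; here both equal 7,615,616,512.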
Time to load cpu_adam op: 2.860788345336914 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2025-10-18 06:45:47,795] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.7, git-hash=unknown, git-branch=unknown
[2025-10-18 06:45:47,795] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8
[2025-10-18 06:45:47,804] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-10-18 06:45:47,805] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-10-18 06:45:47,805] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-10-18 06:45:47,818] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2025-10-18 06:45:47,818] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2025-10-18 06:45:47,818] [INFO] [logging.py:107:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-10-18 06:45:47,818] [INFO] [logging.py:107:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
Time to load cpu_adam op: 2.975823402404785 seconds
Time to load cpu_adam op: 3.004814624786377 seconds
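The `Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1` line above shows DeepSpeedCPUAdam running in AdamW mode, i.e. decoupled weight decay applied directly to the weights rather than folded into the gradient. A scalar sketch of one such update step (the helper is illustrative, not DeepSpeed's implementation):

```python
import math

def adamw_step(p, g, m, v, t, lr=5e-6, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One decoupled-weight-decay Adam update on a scalar parameter p.

    m, v are the first/second moment estimates; t is the 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * weight_decay * p          # adam_w=1: decay the weights directly
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p1, m1, v1 = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)
```

Note `alpha` here is the learning rate (5e-6); with ZeRO Offload these updates run on the CPU, which is why the `cpu_adam` extension was compiled above.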
Time to load cpu_adam op: 3.098824977874756 seconds
[2025-10-18 06:45:48,100] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-10-18 06:45:48,101] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 3.05 GB CA 0.0 GB Max_CA 3 GB
[2025-10-18 06:45:48,101] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0%
[2025-10-18 06:45:48,103] [INFO] [stage3.py:170:__init__] Reduce bucket size 12845056
[2025-10-18 06:45:48,103] [INFO] [stage3.py:171:__init__] Prefetch bucket size 11560550
[2025-10-18 06:45:48,355] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-10-18 06:45:48,356] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:48,356] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0%
Parameter Offload: Total persistent parameters: 333312 in 141 params
[2025-10-18 06:45:48,621] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-10-18 06:45:48,622] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:48,622] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0%
[2025-10-18 06:45:48,836] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-10-18 06:45:48,837] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:48,837] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0%
[2025-10-18 06:45:51,184] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2
[2025-10-18 06:45:51,185] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:51,186] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 102.07 GB, percent = 5.1%
[2025-10-18 06:45:51,455] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-10-18 06:45:51,456] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:51,456] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 105.93 GB, percent = 5.3%
[2025-10-18 06:45:54,718] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-10-18 06:45:54,719] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:54,719] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 124.92 GB, percent = 6.2%
[2025-10-18 06:45:54,956] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-10-18 06:45:54,956] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:45:54,957] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 128.81 GB, percent = 6.4%
[2025-10-18 06:46:01,399] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-10-18 06:46:01,400] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-10-18 06:46:01,400] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 157.56 GB, percent = 7.8%
[2025-10-18 06:46:01,401] [INFO] [stage3.py:534:_setup_for_real_optimizer] optimizer state initialized
[2025-10-18 06:46:04,410] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-10-18 06:46:04,411] [INFO] [utils.py:782:see_memory_usage] MA 0.02 GB Max_MA 2.06 GB CA 2.06 GB Max_CA 2 GB
[2025-10-18 06:46:04,411] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 174.14 GB, percent = 8.6%
[2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed
LR Scheduler = None [2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)] [2025-10-18 06:46:04,412] [INFO] [config.py:1003:print] DeepSpeedEngine configuration: [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] amp_enabled .................. False [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] amp_params ................... False [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] bfloat16_enabled ............. 
True [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] bfloat16_immediate_grad_update True [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] checkpoint_parallel_write_pipeline False [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] checkpoint_tag_validation_enabled True [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] checkpoint_tag_validation_fail False [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] comms_config ................. [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] communication_data_type ...... None [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] compile_config ............... deepcompile=False free_activation=False offload_activation=False offload_opt_states=False double_buffer=True symmetric_memory=False debug_log=False offload_parameters=False sync_before_reduce=False sync_after_reduce=False sync_before_allgather=False sync_after_allgather=False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] curriculum_enabled_legacy .... False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] curriculum_params_legacy ..... False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] data_efficiency_enabled ...... False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] dataloader_drop_last ......... 
False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] disable_allgather ............ False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] dump_state ................... False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] dynamic_loss_scale_args ...... None [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_enabled ........... False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_gas_boundary_resolution 1 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_layer_num ......... 0 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_max_iter .......... 100 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_stability ......... 1e-06 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_tol ............... 0.01 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_verbose ........... False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] elasticity_enabled ........... False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] fp16_auto_cast ............... None [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] fp16_enabled ................. False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] fp16_master_weights_and_gradients False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] global_rank .................. 0 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] grad_accum_dtype ............. None [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] gradient_accumulation_steps .. 2 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] gradient_clipping ............ 
1.0 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] gradient_predivide_factor .... 1.0 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] graph_harvesting ............. False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] initial_dynamic_scale ........ 1 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] load_universal_checkpoint .... False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] loss_scale ................... 1.0 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] memory_breakdown ............. False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] mics_hierarchial_params_gather False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] mics_shard_size .............. -1 [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] optimizer_legacy_fusion ...... False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] optimizer_name ............... None [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] optimizer_params ............. 
None [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] pld_enabled .................. False [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] pld_params ................... False [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] prescale_gradients ........... False [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] scheduler_name ............... None [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] scheduler_params ............. None [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] seq_parallel_communication_data_type torch.float32 [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] sparse_attention ............. None [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] sparse_gradients_enabled ..... False [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] steps_per_print .............. inf [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tp_overlap_comm=False tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] timers_config ................ enabled=True synchronized=True [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] train_batch_size ............. 16 [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] train_micro_batch_size_per_gpu 1 [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] use_data_before_expert_parallel_ False [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] use_node_local_storage ....... False [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] wall_clock_breakdown ......... 
False [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] weight_quantization_config ... None [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] world_size ................... 8 [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_allow_untested_optimizer True [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=12845056 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=11560550 param_persistence_threshold=35840 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_enabled ................. True [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_force_ds_cpu_optimizer .. 
True [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_optimization_stage ...... 3 [2025-10-18 06:46:04,415] [INFO] [config.py:993:print_user_config] json = { "train_batch_size": 16, "train_micro_batch_size_per_gpu": 1, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "zero_allow_untested_optimizer": true, "fp16": { "enabled": false, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1.000000e+09, "reduce_bucket_size": 1.284506e+07, "stage3_prefetch_bucket_size": 1.156055e+07, "stage3_param_persistence_threshold": 3.584000e+04, "stage3_max_live_parameters": 1.000000e+09, "stage3_max_reuse_distance": 1.000000e+09, "stage3_gather_16bit_weights_on_model_save": true }, "steps_per_print": inf } [INFO|trainer.py:2414] 2025-10-18 06:46:04,417 >> ***** Running training ***** [INFO|trainer.py:2415] 2025-10-18 06:46:04,417 >> Num examples = 15,077 [INFO|trainer.py:2416] 2025-10-18 06:46:04,417 >> Num Epochs = 3 [INFO|trainer.py:2417] 2025-10-18 06:46:04,417 >> Instantaneous batch size per device = 1 [INFO|trainer.py:2420] 2025-10-18 06:46:04,417 >> Total train batch size (w. 
parallel, distributed & accumulation) = 16 [INFO|trainer.py:2421] 2025-10-18 06:46:04,417 >> Gradient Accumulation steps = 2 [INFO|trainer.py:2422] 2025-10-18 06:46:04,417 >> Total optimization steps = 2,826 [INFO|trainer.py:2423] 2025-10-18 06:46:04,418 >> Number of trainable parameters = 7,615,616,512 0%| | 0/2826 [00:00> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943 [INFO|configuration_utils.py:419] 2025-10-18 08:18:38,877 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/config.json [INFO|configuration_utils.py:911] 2025-10-18 08:18:38,879 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/generation_config.json [INFO|modeling_utils.py:3580] 2025-10-18 08:18:54,649 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2025-10-18 08:18:54,651 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2025-10-18 08:18:54,652 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/special_tokens_map.json [2025-10-18 08:18:55,344] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step942 is about to be saved! 
[2025-10-18 08:18:55,355] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-10-18 08:18:55,355] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-10-18 08:18:55,372] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-10-18 08:18:55,384] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-10-18 08:19:06,711] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-10-18 08:19:06,716] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-10-18 08:19:07,451] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step942 is ready now!
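A quick way to reconcile the headline numbers in this log: the effective batch size follows from the three parallelism knobs in the config dump (train_micro_batch_size_per_gpu, gradient_accumulation_steps, world_size), and the fp32 master weights plus Adam moments that ZeRO-3 offloads to pinned CPU memory can be sized from the parameter count. A minimal standalone sketch, not part of the training code; all constants are copied from the log above:

```python
# Sanity-check arithmetic for the run logged above (illustrative only).
micro_batch_per_gpu = 1   # train_micro_batch_size_per_gpu
grad_accum_steps = 2      # gradient_accumulation_steps
world_size = 8            # world_size in the config dump

effective_batch = micro_batch_per_gpu * grad_accum_steps * world_size
assert effective_batch == 16  # matches train_batch_size in the log

# With ZeRO-3 + CPU offload, fp32 master weights and the two Adam moment
# buffers live in host memory, partitioned across the 8 ranks.
params = 7_615_616_512    # "Number of trainable parameters" above
fp32_master_gb = params * 4 / 1024**3
adam_states_gb = params * 2 * 4 / 1024**3  # exp_avg + exp_avg_sq, fp32
print(f"fp32 master weights: {fp32_master_gb:.1f} GB")
print(f"Adam m/v states:     {adam_states_gb:.1f} GB")
```

These totals are spread across the ranks' partitions; compare them with the multi-GB jumps in "CPU Virtual Memory" that the log records during fp32-partition creation and optimizer-state initialization.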
33%|███▎ | 944/2826 [1:33:09<8:41:27, 16.62s/it]
34%|███▎ | 950/2826 [1:33:45<3:49:22, 7.34s/it] {'loss': 0.3963, 'grad_norm': 2.5962352752685547, 'learning_rate': 4.20048574867773e-06, 'epoch': 1.01}
34%|███▍ | 960/2826 [1:34:43<2:53:58, 5.59s/it] {'loss': 0.3125, 'grad_norm': 2.707613229751587, 'learning_rate': 4.1777170872197725e-06, 'epoch': 1.02}
34%|███▍ | 970/2826 [1:35:40<3:00:38, 5.84s/it] {'loss': 0.3457, 'grad_norm': 2.4237964153289795, 'learning_rate': 4.1546923784448646e-06, 'epoch': 1.03}
35%|███▍ | 980/2826 [1:36:38<2:55:14, 5.70s/it] {'loss': 0.3029, 'grad_norm': 1.6531928777694702, 'learning_rate': 4.1314151363035705e-06, 'epoch': 1.04}
35%|███▌ | 990/2826 [1:37:36<3:02:40, 5.97s/it] {'loss': 0.3289, 'grad_norm': 2.1669981479644775, 'learning_rate': 4.1078889132872145e-06, 'epoch': 1.05}
35%|███▌ | 1000/2826 [1:38:36<3:09:47, 6.24s/it] {'loss': 0.3234, 'grad_norm': 2.445012092590332, 'learning_rate': 4.084117299885712e-06, 'epoch': 1.06}
36%|███▌ | 1010/2826 [1:39:35<2:59:22, 5.93s/it] {'loss': 0.3139, 'grad_norm': 2.0615527629852295, 'learning_rate': 4.060103924039599e-06, 'epoch': 1.07}
36%|███▌ | 1020/2826 [1:40:32<2:55:54, 5.84s/it] {'loss': 0.3144, 'grad_norm': 1.990400791168213, 'learning_rate': 4.035852450586352e-06, 'epoch': 1.08}
36%|███▋ | 1030/2826 [1:41:29<2:58:41, 5.97s/it] {'loss': 0.323, 'grad_norm': 2.5510122776031494, 'learning_rate': 4.011366580701073e-06, 'epoch': 1.09}
37%|███▋ | 1040/2826 [1:42:28<2:42:38, 5.46s/it] {'loss': 0.3694, 'grad_norm': 2.462083101272583, 'learning_rate': 3.9866500513316274e-06, 'epoch': 1.1}
37%|███▋ | 1050/2826 [1:43:26<2:41:47, 5.47s/it] {'loss': 0.3351, 'grad_norm': 2.4385085105895996, 'learning_rate': 3.961706634628323e-06, 'epoch': 1.11}
38%|███▊ | 1060/2826 [1:44:26<3:00:03, 6.12s/it] {'loss': 0.3459, 'grad_norm': 1.7553578615188599, 'learning_rate': 3.936540137368222e-06, 'epoch': 1.12}
38%|███▊ | 1070/2826 [1:45:24<2:45:11, 5.64s/it] {'loss': 0.3186, 'grad_norm': 2.513950824737549, 'learning_rate': 3.911154400374159e-06, 'epoch': 1.13}
38%|███▊ | 1080/2826 [1:46:21<2:48:28, 5.79s/it] {'loss': 0.3333, 'grad_norm': 2.6273515224456787, 'learning_rate': 3.885553297928573e-06, 'epoch': 1.15}
39%|███▊ | 1090/2826 [1:47:19<2:42:49, 5.63s/it] {'loss': 0.3137, 'grad_norm': 2.4155592918395996, 'learning_rate': 3.859740737182222e-06, 'epoch': 1.16}
39%|███▉ | 1100/2826 [1:48:18<2:49:45, 5.90s/it] {'loss': 0.3426, 'grad_norm': 2.719611644744873, 'learning_rate': 3.833720657557894e-06, 'epoch': 1.17}
39%|███▉ | 1110/2826 [1:49:13<2:43:13, 5.71s/it] {'loss': 0.3709, 'grad_norm': 2.5729358196258545, 'learning_rate': 3.807497030149181e-06, 'epoch': 1.18}
40%|███▉ | 1120/2826 [1:50:12<2:59:47, 6.32s/it] {'loss': 0.329, 'grad_norm': 1.9626141786575317, 'learning_rate': 3.7810738571144257e-06, 'epoch': 1.19}
40%|███▉ | 1130/2826 [1:51:11<2:43:19, 5.78s/it] {'loss': 0.305, 'grad_norm': 2.601951837539673, 'learning_rate': 3.7544551710659296e-06, 'epoch': 1.2}
40%|████ | 1140/2826 [1:52:08<2:48:10, 5.98s/it] {'loss': 0.3449, 'grad_norm': 2.4118540287017822, 'learning_rate': 3.7276450344545024e-06, 'epoch': 1.21}
41%|████ | 1150/2826 [1:53:07<2:41:21, 5.78s/it] {'loss': 0.3403, 'grad_norm': 2.5080604553222656, 'learning_rate': 3.7006475389494723e-06, 'epoch': 1.22}
41%|████ | 1160/2826 [1:54:06<2:46:48, 6.01s/it] {'loss': 0.3342, 'grad_norm': 2.6882951259613037, 'learning_rate': 3.6734668048142273e-06, 'epoch': 1.23}
41%|████▏ | 1170/2826 [1:55:04<2:46:16, 6.02s/it] {'loss': 0.3589, 'grad_norm': 2.3755247592926025, 'learning_rate': 3.646106980277394e-06, 'epoch': 1.24}
42%|████▏ | 1180/2826 [1:56:02<2:35:59, 5.69s/it] {'loss': 0.3447, 'grad_norm': 2.4138166904449463, 'learning_rate': 3.618572240899748e-06, 'epoch': 1.25}
42%|████▏ | 1190/2826 [1:56:58<2:33:41, 5.64s/it] {'loss': 0.3787, 'grad_norm': 2.6930105686187744, 'learning_rate': 3.5908667889369603e-06, 'epoch': 1.26}
42%|████▏ | 1191/2826 [1:57:03<2:33:44, 5.64s/it]
42%|████▏ | 1192/2826 [1:57:08<2:29:11, 5.48s/it] 42%|████▏ | 1193/2826 [1:57:15<2:35:09, 5.70s/it] 42%|████▏ | 1194/2826 [1:57:20<2:35:46, 5.73s/it] 42%|████▏ | 1195/2826 [1:57:28<2:51:33, 6.31s/it] 42%|████▏ | 1196/2826 [1:57:35<2:55:24, 6.46s/it] 42%|████▏ | 1197/2826 [1:57:41<2:50:09, 6.27s/it] 42%|████▏ | 1198/2826 [1:57:46<2:43:27, 6.02s/it] 42%|████▏ | 1199/2826 [1:57:52<2:39:53, 5.90s/it] 42%|████▏ | 1200/2826 [1:57:58<2:39:54, 5.90s/it] {'loss': 0.3376, 'grad_norm': 2.732795476913452, 'learning_rate': 3.5629948526982563e-06, 'epoch': 1.27} 42%|████▏ | 1200/2826 [1:57:58<2:39:54, 5.90s/it] 42%|████▏ | 1201/2826 [1:58:03<2:39:13, 5.88s/it] 43%|████▎ | 1202/2826 [1:58:09<2:39:19, 5.89s/it] 43%|████▎ | 1203/2826 [1:58:15<2:34:22, 5.71s/it] 43%|████▎ | 1204/2826 [1:58:20<2:29:42, 5.54s/it] 43%|████▎ | 1205/2826 [1:58:25<2:29:21, 5.53s/it] 43%|████▎ | 1206/2826 [1:58:32<2:39:28, 5.91s/it] 43%|████▎ | 1207/2826 [1:58:38<2:40:30, 5.95s/it] 43%|████▎ | 1208/2826 [1:58:44<2:39:40, 5.92s/it] 43%|████▎ | 1209/2826 [1:58:50<2:40:20, 5.95s/it] 43%|████▎ | 1210/2826 [1:58:55<2:33:47, 5.71s/it] {'loss': 0.3461, 'grad_norm': 1.8468087911605835, 'learning_rate': 3.534960685901111e-06, 'epoch': 1.28} 43%|████▎ | 1210/2826 [1:58:55<2:33:47, 5.71s/it] 43%|████▎ | 1211/2826 [1:59:00<2:30:04, 5.58s/it] 43%|████▎ | 1212/2826 [1:59:06<2:29:24, 5.55s/it] 43%|████▎ | 1213/2826 [1:59:12<2:31:21, 5.63s/it] 43%|████▎ | 1214/2826 [1:59:17<2:26:46, 5.46s/it] 43%|████▎ | 1215/2826 [1:59:22<2:24:28, 5.38s/it] 43%|████▎ | 1216/2826 [1:59:27<2:22:41, 5.32s/it] 43%|████▎ | 1217/2826 [1:59:33<2:25:32, 5.43s/it] 43%|████▎ | 1218/2826 [1:59:39<2:29:00, 5.56s/it] 43%|████▎ | 1219/2826 [1:59:44<2:26:55, 5.49s/it] 43%|████▎ | 1220/2826 [1:59:49<2:24:10, 5.39s/it] {'loss': 0.3396, 'grad_norm': 2.3408284187316895, 'learning_rate': 3.506768567022062e-06, 'epoch': 1.29} 43%|████▎ | 1220/2826 [1:59:49<2:24:10, 5.39s/it] 43%|████▎ | 1221/2826 [1:59:57<2:42:47, 6.09s/it] 43%|████▎ | 1222/2826 
[2:00:02<2:35:46, 5.83s/it] 43%|████▎ | 1223/2826 [2:00:07<2:29:33, 5.60s/it] 43%|████▎ | 1224/2826 [2:00:13<2:34:39, 5.79s/it] 43%|████▎ | 1225/2826 [2:00:19<2:29:34, 5.61s/it] 43%|████▎ | 1226/2826 [2:00:25<2:34:01, 5.78s/it] 43%|████▎ | 1227/2826 [2:00:31<2:38:56, 5.96s/it] 43%|████▎ | 1228/2826 [2:00:38<2:47:10, 6.28s/it] 43%|████▎ | 1229/2826 [2:00:43<2:37:14, 5.91s/it] 44%|████▎ | 1230/2826 [2:00:49<2:38:19, 5.95s/it] {'loss': 0.3364, 'grad_norm': 2.7420434951782227, 'learning_rate': 3.478422798643737e-06, 'epoch': 1.3} 44%|████▎ | 1230/2826 [2:00:49<2:38:19, 5.95s/it] 44%|████▎ | 1231/2826 [2:00:54<2:31:24, 5.70s/it] 44%|████▎ | 1232/2826 [2:01:00<2:27:28, 5.55s/it] 44%|████▎ | 1233/2826 [2:01:05<2:24:12, 5.43s/it] 44%|████▎ | 1234/2826 [2:01:11<2:26:55, 5.54s/it] 44%|████▎ | 1235/2826 [2:01:16<2:22:12, 5.36s/it] 44%|████▎ | 1236/2826 [2:01:22<2:33:06, 5.78s/it] 44%|████▍ | 1237/2826 [2:01:30<2:45:42, 6.26s/it] 44%|████▍ | 1238/2826 [2:01:35<2:42:04, 6.12s/it] 44%|████▍ | 1239/2826 [2:01:41<2:36:57, 5.93s/it] 44%|████▍ | 1240/2826 [2:01:46<2:29:55, 5.67s/it] {'loss': 0.3126, 'grad_norm': 2.634403705596924, 'learning_rate': 3.4499277067982177e-06, 'epoch': 1.32} 44%|████▍ | 1240/2826 [2:01:46<2:29:55, 5.67s/it] 44%|████▍ | 1241/2826 [2:01:52<2:35:38, 5.89s/it] 44%|████▍ | 1242/2826 [2:01:58<2:30:23, 5.70s/it] 44%|████▍ | 1243/2826 [2:02:04<2:37:16, 5.96s/it] 44%|████▍ | 1244/2826 [2:02:09<2:30:31, 5.71s/it] 44%|████▍ | 1245/2826 [2:02:16<2:33:58, 5.84s/it] 44%|████▍ | 1246/2826 [2:02:23<2:47:57, 6.38s/it] 44%|████▍ | 1247/2826 [2:02:29<2:40:25, 6.10s/it] 44%|████▍ | 1248/2826 [2:02:34<2:33:38, 5.84s/it] 44%|████▍ | 1249/2826 [2:02:39<2:27:36, 5.62s/it] 44%|████▍ | 1250/2826 [2:02:44<2:22:53, 5.44s/it] {'loss': 0.3092, 'grad_norm': 2.4217336177825928, 'learning_rate': 3.421287640306809e-06, 'epoch': 1.33} 44%|████▍ | 1250/2826 [2:02:44<2:22:53, 5.44s/it] 44%|████▍ | 1251/2826 [2:02:49<2:22:20, 5.42s/it] 44%|████▍ | 1252/2826 [2:02:55<2:20:51, 5.37s/it] 
44%|████▍ | 1253/2826 [2:03:00<2:20:22, 5.35s/it] 44%|████▍ | 1254/2826 [2:03:07<2:30:44, 5.75s/it] 44%|████▍ | 1255/2826 [2:03:12<2:31:47, 5.80s/it] 44%|████▍ | 1256/2826 [2:03:20<2:43:03, 6.23s/it] 44%|████▍ | 1257/2826 [2:03:25<2:38:33, 6.06s/it] 45%|████▍ | 1258/2826 [2:03:33<2:51:53, 6.58s/it] 45%|████▍ | 1259/2826 [2:03:41<3:04:50, 7.08s/it] 45%|████▍ | 1260/2826 [2:03:48<2:57:03, 6.78s/it] {'loss': 0.3374, 'grad_norm': 1.7107937335968018, 'learning_rate': 3.3925069701163406e-06, 'epoch': 1.34} 45%|████▍ | 1260/2826 [2:03:48<2:57:03, 6.78s/it] 45%|████▍ | 1261/2826 [2:03:55<2:59:04, 6.87s/it] 45%|████▍ | 1262/2826 [2:04:02<3:02:05, 6.99s/it] 45%|████▍ | 1263/2826 [2:04:08<2:54:45, 6.71s/it] 45%|████▍ | 1264/2826 [2:04:13<2:42:59, 6.26s/it] 45%|████▍ | 1265/2826 [2:04:20<2:47:00, 6.42s/it] 45%|████▍ | 1266/2826 [2:04:26<2:48:05, 6.47s/it] 45%|████▍ | 1267/2826 [2:04:32<2:43:49, 6.31s/it] 45%|████▍ | 1268/2826 [2:04:38<2:40:28, 6.18s/it] 45%|████▍ | 1269/2826 [2:04:43<2:32:35, 5.88s/it] 45%|████▍ | 1270/2826 [2:04:49<2:26:33, 5.65s/it] {'loss': 0.3436, 'grad_norm': 2.1515822410583496, 'learning_rate': 3.363590088632085e-06, 'epoch': 1.35} 45%|████▍ | 1270/2826 [2:04:49<2:26:33, 5.65s/it] 45%|████▍ | 1271/2826 [2:04:54<2:21:12, 5.45s/it] 45%|████▌ | 1272/2826 [2:05:01<2:33:43, 5.94s/it] 45%|████▌ | 1273/2826 [2:05:07<2:34:11, 5.96s/it] 45%|████▌ | 1274/2826 [2:05:13<2:40:23, 6.20s/it] 45%|████▌ | 1275/2826 [2:05:20<2:44:42, 6.37s/it] 45%|████▌ | 1276/2826 [2:05:26<2:38:25, 6.13s/it] 45%|████▌ | 1277/2826 [2:05:32<2:36:57, 6.08s/it] 45%|████▌ | 1278/2826 [2:05:38<2:37:00, 6.09s/it] 45%|████▌ | 1279/2826 [2:05:44<2:34:16, 5.98s/it] 45%|████▌ | 1280/2826 [2:05:49<2:28:39, 5.77s/it] {'loss': 0.3283, 'grad_norm': 2.0105717182159424, 'learning_rate': 3.334541409047408e-06, 'epoch': 1.36} 45%|████▌ | 1280/2826 [2:05:49<2:28:39, 5.77s/it] 45%|████▌ | 1281/2826 [2:05:54<2:25:09, 5.64s/it] 45%|████▌ | 1282/2826 [2:06:00<2:29:19, 5.80s/it] 45%|████▌ | 1283/2826 
[2:06:07<2:35:08, 6.03s/it] 45%|████▌ | 1284/2826 [2:06:12<2:28:22, 5.77s/it] 45%|████▌ | 1285/2826 [2:06:17<2:23:35, 5.59s/it] 46%|████▌ | 1286/2826 [2:06:22<2:19:50, 5.45s/it] 46%|████▌ | 1287/2826 [2:06:28<2:22:23, 5.55s/it] 46%|████▌ | 1288/2826 [2:06:33<2:18:55, 5.42s/it] 46%|████▌ | 1289/2826 [2:06:40<2:27:16, 5.75s/it] 46%|████▌ | 1290/2826 [2:06:47<2:39:01, 6.21s/it] {'loss': 0.358, 'grad_norm': 1.8952791690826416, 'learning_rate': 3.3053653646702422e-06, 'epoch': 1.37} 46%|████▌ | 1290/2826 [2:06:47<2:39:01, 6.21s/it] 46%|████▌ | 1291/2826 [2:06:53<2:34:47, 6.05s/it] 46%|████▌ | 1292/2826 [2:06:58<2:28:17, 5.80s/it] 46%|████▌ | 1293/2826 [2:07:05<2:37:43, 6.17s/it] 46%|████▌ | 1294/2826 [2:07:10<2:30:34, 5.90s/it] 46%|████▌ | 1295/2826 [2:07:16<2:32:57, 5.99s/it] 46%|████▌ | 1296/2826 [2:07:23<2:35:31, 6.10s/it] 46%|████▌ | 1297/2826 [2:07:28<2:31:19, 5.94s/it] 46%|████▌ | 1298/2826 [2:07:34<2:25:42, 5.72s/it] 46%|████▌ | 1299/2826 [2:07:39<2:26:39, 5.76s/it] 46%|████▌ | 1300/2826 [2:07:45<2:21:48, 5.58s/it] {'loss': 0.3084, 'grad_norm': 1.8639928102493286, 'learning_rate': 3.276066408246487e-06, 'epoch': 1.38} 46%|████▌ | 1300/2826 [2:07:45<2:21:48, 5.58s/it] 46%|████▌ | 1301/2826 [2:07:50<2:19:52, 5.50s/it] 46%|████▌ | 1302/2826 [2:07:55<2:18:49, 5.47s/it] 46%|████▌ | 1303/2826 [2:08:01<2:22:56, 5.63s/it] 46%|████▌ | 1304/2826 [2:08:06<2:17:04, 5.40s/it] 46%|████▌ | 1305/2826 [2:08:12<2:19:13, 5.49s/it] 46%|████▌ | 1306/2826 [2:08:19<2:33:02, 6.04s/it] 46%|████▌ | 1307/2826 [2:08:25<2:30:09, 5.93s/it] 46%|████▋ | 1308/2826 [2:08:31<2:32:11, 6.02s/it] 46%|████▋ | 1309/2826 [2:08:38<2:38:16, 6.26s/it] 46%|████▋ | 1310/2826 [2:08:45<2:43:16, 6.46s/it] {'loss': 0.3508, 'grad_norm': 2.563251256942749, 'learning_rate': 3.2466490112804484e-06, 'epoch': 1.39} 46%|████▋ | 1310/2826 [2:08:45<2:43:16, 6.46s/it] 46%|████▋ | 1311/2826 [2:08:50<2:33:04, 6.06s/it] 46%|████▋ | 1312/2826 [2:08:55<2:24:59, 5.75s/it] 46%|████▋ | 1313/2826 [2:09:02<2:31:01, 5.99s/it] 
46%|████▋ | 1314/2826 [2:09:07<2:26:05, 5.80s/it] 47%|████▋ | 1315/2826 [2:09:13<2:30:47, 5.99s/it] 47%|████▋ | 1316/2826 [2:09:19<2:31:49, 6.03s/it] 47%|████▋ | 1317/2826 [2:09:25<2:24:31, 5.75s/it] 47%|████▋ | 1318/2826 [2:09:30<2:19:42, 5.56s/it] 47%|████▋ | 1319/2826 [2:09:35<2:17:57, 5.49s/it] 47%|████▋ | 1320/2826 [2:09:41<2:17:45, 5.49s/it] {'loss': 0.3215, 'grad_norm': 2.214616060256958, 'learning_rate': 3.217117663352417e-06, 'epoch': 1.4} 47%|████▋ | 1320/2826 [2:09:41<2:17:45, 5.49s/it] 47%|████▋ | 1321/2826 [2:09:46<2:16:04, 5.42s/it] 47%|████▋ | 1322/2826 [2:09:51<2:14:27, 5.36s/it] 47%|████▋ | 1323/2826 [2:09:57<2:20:51, 5.62s/it] 47%|████▋ | 1324/2826 [2:10:04<2:25:54, 5.83s/it] 47%|████▋ | 1325/2826 [2:10:10<2:32:08, 6.08s/it] 47%|████▋ | 1326/2826 [2:10:15<2:24:22, 5.78s/it] 47%|████▋ | 1327/2826 [2:10:21<2:24:36, 5.79s/it] 47%|████▋ | 1328/2826 [2:10:27<2:28:57, 5.97s/it] 47%|████▋ | 1329/2826 [2:10:34<2:32:56, 6.13s/it] 47%|████▋ | 1330/2826 [2:10:40<2:29:08, 5.98s/it] {'loss': 0.3193, 'grad_norm': 1.793468952178955, 'learning_rate': 3.187476871433478e-06, 'epoch': 1.41} 47%|████▋ | 1330/2826 [2:10:40<2:29:08, 5.98s/it] 47%|████▋ | 1331/2826 [2:10:45<2:23:33, 5.76s/it] 47%|████▋ | 1332/2826 [2:10:50<2:18:35, 5.57s/it] 47%|████▋ | 1333/2826 [2:10:57<2:26:11, 5.88s/it] 47%|████▋ | 1334/2826 [2:11:02<2:21:18, 5.68s/it] 47%|████▋ | 1335/2826 [2:11:08<2:27:13, 5.92s/it] 47%|████▋ | 1336/2826 [2:11:14<2:24:04, 5.80s/it] 47%|████▋ | 1337/2826 [2:11:21<2:35:42, 6.27s/it] 47%|████▋ | 1338/2826 [2:11:27<2:30:18, 6.06s/it] 47%|████▋ | 1339/2826 [2:11:32<2:23:26, 5.79s/it] 47%|████▋ | 1340/2826 [2:11:37<2:18:12, 5.58s/it] {'loss': 0.3019, 'grad_norm': 2.204789638519287, 'learning_rate': 3.1577311591976766e-06, 'epoch': 1.42} 47%|████▋ | 1340/2826 [2:11:37<2:18:12, 5.58s/it] 47%|████▋ | 1341/2826 [2:11:42<2:14:46, 5.45s/it] 47%|████▋ | 1342/2826 [2:11:49<2:21:40, 5.73s/it] 48%|████▊ | 1343/2826 [2:11:54<2:19:57, 5.66s/it] 48%|████▊ | 1344/2826 
[2:11:59<2:18:13, 5.60s/it] 48%|████▊ | 1345/2826 [2:12:05<2:14:23, 5.44s/it] 48%|████▊ | 1346/2826 [2:12:10<2:12:03, 5.35s/it] 48%|████▊ | 1347/2826 [2:12:16<2:17:11, 5.57s/it] 48%|████▊ | 1348/2826 [2:12:23<2:28:48, 6.04s/it] 48%|████▊ | 1349/2826 [2:12:29<2:28:56, 6.05s/it] 48%|████▊ | 1350/2826 [2:12:36<2:37:53, 6.42s/it] {'loss': 0.3099, 'grad_norm': 2.307568311691284, 'learning_rate': 3.1278850663316307e-06, 'epoch': 1.43} 48%|████▊ | 1350/2826 [2:12:36<2:37:53, 6.42s/it] 48%|████▊ | 1351/2826 [2:12:42<2:34:33, 6.29s/it] 48%|████▊ | 1352/2826 [2:12:49<2:36:56, 6.39s/it] 48%|████▊ | 1353/2826 [2:12:55<2:32:30, 6.21s/it] 48%|████▊ | 1354/2826 [2:13:00<2:24:05, 5.87s/it] 48%|████▊ | 1355/2826 [2:13:05<2:22:01, 5.79s/it] 48%|████▊ | 1356/2826 [2:13:11<2:17:15, 5.60s/it] 48%|████▊ | 1357/2826 [2:13:17<2:20:35, 5.74s/it] 48%|████▊ | 1358/2826 [2:13:23<2:24:48, 5.92s/it] 48%|████▊ | 1359/2826 [2:13:29<2:25:00, 5.93s/it] 48%|████▊ | 1360/2826 [2:13:35<2:27:52, 6.05s/it] {'loss': 0.3085, 'grad_norm': 2.485848903656006, 'learning_rate': 3.0979431478416987e-06, 'epoch': 1.44} 48%|████▊ | 1360/2826 [2:13:35<2:27:52, 6.05s/it] 48%|████▊ | 1361/2826 [2:13:42<2:33:04, 6.27s/it] 48%|████▊ | 1362/2826 [2:13:47<2:25:26, 5.96s/it] 48%|████▊ | 1363/2826 [2:13:53<2:21:55, 5.82s/it] 48%|████▊ | 1364/2826 [2:13:59<2:25:53, 5.99s/it] 48%|████▊ | 1365/2826 [2:14:04<2:20:01, 5.75s/it] 48%|████▊ | 1366/2826 [2:14:10<2:19:49, 5.75s/it] 48%|████▊ | 1367/2826 [2:14:15<2:17:32, 5.66s/it] 48%|████▊ | 1368/2826 [2:14:21<2:15:01, 5.56s/it] 48%|████▊ | 1369/2826 [2:14:26<2:11:38, 5.42s/it] 48%|████▊ | 1370/2826 [2:14:32<2:14:16, 5.53s/it] {'loss': 0.3211, 'grad_norm': 1.953053593635559, 'learning_rate': 3.067909973358811e-06, 'epoch': 1.45} 48%|████▊ | 1370/2826 [2:14:32<2:14:16, 5.53s/it] 49%|████▊ | 1371/2826 [2:14:38<2:23:06, 5.90s/it] 49%|████▊ | 1372/2826 [2:14:45<2:24:23, 5.96s/it] 49%|████▊ | 1373/2826 [2:14:50<2:22:01, 5.86s/it] 49%|████▊ | 1374/2826 [2:14:58<2:37:49, 6.52s/it] 
49%|████▊ | 1375/2826 [2:15:05<2:40:58, 6.66s/it] 49%|████▊ | 1376/2826 [2:15:11<2:30:58, 6.25s/it] 49%|████▊ | 1377/2826 [2:15:17<2:30:50, 6.25s/it] 49%|████▉ | 1378/2826 [2:15:25<2:46:58, 6.92s/it] 49%|████▉ | 1379/2826 [2:15:32<2:42:59, 6.76s/it] 49%|████▉ | 1380/2826 [2:15:37<2:31:28, 6.29s/it] {'loss': 0.3329, 'grad_norm': 2.2350101470947266, 'learning_rate': 3.0377901264410673e-06, 'epoch': 1.46} 49%|████▉ | 1380/2826 [2:15:37<2:31:28, 6.29s/it] 49%|████▉ | 1381/2826 [2:15:43<2:29:53, 6.22s/it] 49%|████▉ | 1382/2826 [2:15:48<2:22:06, 5.90s/it] 49%|████▉ | 1383/2826 [2:15:54<2:23:53, 5.98s/it] 49%|████▉ | 1384/2826 [2:16:00<2:19:47, 5.82s/it] 49%|████▉ | 1385/2826 [2:16:07<2:34:03, 6.41s/it] 49%|████▉ | 1386/2826 [2:16:13<2:25:59, 6.08s/it] 49%|████▉ | 1387/2826 [2:16:18<2:21:43, 5.91s/it] 49%|████▉ | 1388/2826 [2:16:23<2:16:02, 5.68s/it] 49%|████▉ | 1389/2826 [2:16:29<2:17:27, 5.74s/it] 49%|████▉ | 1390/2826 [2:16:35<2:14:07, 5.60s/it] {'loss': 0.3376, 'grad_norm': 2.542452335357666, 'learning_rate': 3.0075882038742133e-06, 'epoch': 1.47} 49%|████▉ | 1390/2826 [2:16:35<2:14:07, 5.60s/it] 49%|████▉ | 1391/2826 [2:16:41<2:18:52, 5.81s/it] 49%|████▉ | 1392/2826 [2:16:47<2:18:47, 5.81s/it] 49%|████▉ | 1393/2826 [2:16:53<2:21:25, 5.92s/it] 49%|████▉ | 1394/2826 [2:16:59<2:19:25, 5.84s/it] 49%|████▉ | 1395/2826 [2:17:05<2:25:19, 6.09s/it] 49%|████▉ | 1396/2826 [2:17:11<2:23:11, 6.01s/it] 49%|████▉ | 1397/2826 [2:17:17<2:25:45, 6.12s/it] 49%|████▉ | 1398/2826 [2:17:24<2:28:19, 6.23s/it] 50%|████▉ | 1399/2826 [2:17:30<2:24:35, 6.08s/it] 50%|████▉ | 1400/2826 [2:17:35<2:17:12, 5.77s/it] {'loss': 0.2896, 'grad_norm': 2.3203530311584473, 'learning_rate': 2.9773088149700923e-06, 'epoch': 1.48} 50%|████▉ | 1400/2826 [2:17:35<2:17:12, 5.77s/it] 50%|████▉ | 1401/2826 [2:17:40<2:15:09, 5.69s/it] 50%|████▉ | 1402/2826 [2:17:46<2:19:12, 5.87s/it] 50%|████▉ | 1403/2826 [2:17:52<2:18:14, 5.83s/it] 50%|████▉ | 1404/2826 [2:17:59<2:25:04, 6.12s/it] 50%|████▉ | 1405/2826 
[2:18:04<2:20:20, 5.93s/it] 50%|████▉ | 1406/2826 [2:18:10<2:14:53, 5.70s/it] 50%|████▉ | 1407/2826 [2:18:15<2:15:57, 5.75s/it] 50%|████▉ | 1408/2826 [2:18:21<2:12:40, 5.61s/it] 50%|████▉ | 1409/2826 [2:18:27<2:15:12, 5.73s/it] 50%|████▉ | 1410/2826 [2:18:32<2:10:58, 5.55s/it] {'loss': 0.299, 'grad_norm': 1.9708584547042847, 'learning_rate': 2.9469565808631888e-06, 'epoch': 1.5} 50%|████▉ | 1410/2826 [2:18:32<2:10:58, 5.55s/it] 50%|████▉ | 1411/2826 [2:18:39<2:24:05, 6.11s/it] 50%|████▉ | 1412/2826 [2:18:45<2:22:02, 6.03s/it] 50%|█████ | 1413/2826 [2:18:50<2:15:37, 5.76s/it] 50%|█████ | 1414/2826 [2:18:56<2:18:27, 5.88s/it] 50%|█████ | 1415/2826 [2:19:02<2:13:23, 5.67s/it] 50%|█████ | 1416/2826 [2:19:07<2:11:13, 5.58s/it] 50%|█████ | 1417/2826 [2:19:12<2:07:42, 5.44s/it] 50%|█████ | 1418/2826 [2:19:17<2:06:28, 5.39s/it] 50%|█████ | 1419/2826 [2:19:23<2:07:08, 5.42s/it] 50%|█████ | 1420/2826 [2:19:29<2:08:46, 5.50s/it] {'loss': 0.3484, 'grad_norm': 2.63698148727417, 'learning_rate': 2.9165361338053683e-06, 'epoch': 1.51} 50%|█████ | 1420/2826 [2:19:29<2:08:46, 5.50s/it] 50%|█████ | 1421/2826 [2:19:34<2:06:34, 5.41s/it] 50%|█████ | 1422/2826 [2:19:40<2:15:20, 5.78s/it] 50%|█████ | 1423/2826 [2:19:46<2:11:04, 5.61s/it] 50%|█████ | 1424/2826 [2:19:51<2:06:47, 5.43s/it] 50%|█████ | 1425/2826 [2:19:56<2:05:00, 5.35s/it] 50%|█████ | 1426/2826 [2:20:02<2:09:19, 5.54s/it] 50%|█████ | 1427/2826 [2:20:08<2:11:39, 5.65s/it] 51%|█████ | 1428/2826 [2:20:15<2:26:46, 6.30s/it] 51%|█████ | 1429/2826 [2:20:22<2:25:56, 6.27s/it] 51%|█████ | 1430/2826 [2:20:28<2:28:01, 6.36s/it] {'loss': 0.3316, 'grad_norm': 2.091648578643799, 'learning_rate': 2.886052116458918e-06, 'epoch': 1.52} 51%|█████ | 1430/2826 [2:20:28<2:28:01, 6.36s/it] 51%|█████ | 1431/2826 [2:20:34<2:21:56, 6.11s/it] 51%|█████ | 1432/2826 [2:20:39<2:18:51, 5.98s/it] 51%|█████ | 1433/2826 [2:20:45<2:13:38, 5.76s/it] 51%|█████ | 1434/2826 [2:20:50<2:09:06, 5.56s/it] 51%|█████ | 1435/2826 [2:20:57<2:16:52, 5.90s/it] 51%|█████ 
| 1436/2826 [2:21:03<2:20:39, 6.07s/it] 51%|█████ | 1437/2826 [2:21:08<2:15:37, 5.86s/it] 51%|█████ | 1438/2826 [2:21:13<2:10:24, 5.64s/it] 51%|█████ | 1439/2826 [2:21:20<2:18:44, 6.00s/it] 51%|█████ | 1440/2826 [2:21:25<2:12:38, 5.74s/it] {'loss': 0.328, 'grad_norm': 1.955355167388916, 'learning_rate': 2.8555091811880004e-06, 'epoch': 1.53} 51%|█████ | 1440/2826 [2:21:25<2:12:38, 5.74s/it] 51%|█████ | 1441/2826 [2:21:33<2:22:46, 6.18s/it] 51%|█████ | 1442/2826 [2:21:38<2:16:39, 5.92s/it] 51%|█████ | 1443/2826 [2:21:44<2:18:11, 6.00s/it] 51%|█████ | 1444/2826 [2:21:50<2:16:05, 5.91s/it] 51%|█████ | 1445/2826 [2:21:57<2:24:43, 6.29s/it] 51%|█████ | 1446/2826 [2:22:04<2:28:06, 6.44s/it] 51%|█████ | 1447/2826 [2:22:09<2:21:07, 6.14s/it] 51%|█████ | 1448/2826 [2:22:16<2:22:48, 6.22s/it] 51%|█████▏ | 1449/2826 [2:22:22<2:22:37, 6.21s/it] 51%|█████▏ | 1450/2826 [2:22:28<2:24:52, 6.32s/it] {'loss': 0.3215, 'grad_norm': 1.6724951267242432, 'learning_rate': 2.8249119893486252e-06, 'epoch': 1.54} 51%|█████▏ | 1450/2826 [2:22:28<2:24:52, 6.32s/it] 51%|█████▏ | 1451/2826 [2:22:34<2:22:48, 6.23s/it] 51%|█████▏ | 1452/2826 [2:22:41<2:24:20, 6.30s/it] 51%|█████▏ | 1453/2826 [2:22:47<2:20:23, 6.14s/it] 51%|█████▏ | 1454/2826 [2:22:52<2:14:36, 5.89s/it] 51%|█████▏ | 1455/2826 [2:22:57<2:10:07, 5.70s/it] 52%|█████▏ | 1456/2826 [2:23:04<2:20:31, 6.15s/it] 52%|█████▏ | 1457/2826 [2:23:11<2:23:45, 6.30s/it] 52%|█████▏ | 1458/2826 [2:23:18<2:25:44, 6.39s/it] 52%|█████▏ | 1459/2826 [2:23:23<2:18:02, 6.06s/it] 52%|█████▏ | 1460/2826 [2:23:28<2:13:18, 5.86s/it] {'loss': 0.3118, 'grad_norm': 2.1872570514678955, 'learning_rate': 2.7942652105772516e-06, 'epoch': 1.55} 52%|█████▏ | 1460/2826 [2:23:28<2:13:18, 5.86s/it] 52%|█████▏ | 1461/2826 [2:23:35<2:18:35, 6.09s/it] 52%|█████▏ | 1462/2826 [2:23:41<2:18:42, 6.10s/it] 52%|█████▏ | 1463/2826 [2:23:47<2:14:19, 5.91s/it] 52%|█████▏ | 1464/2826 [2:23:52<2:10:18, 5.74s/it] 52%|█████▏ | 1465/2826 [2:23:59<2:21:58, 6.26s/it] 52%|█████▏ | 1466/2826 
[2:24:07<2:31:09, 6.67s/it] 52%|█████▏ | 1467/2826 [2:24:14<2:30:24, 6.64s/it] 52%|█████▏ | 1468/2826 [2:24:19<2:21:01, 6.23s/it] 52%|█████▏ | 1469/2826 [2:24:25<2:18:52, 6.14s/it] 52%|█████▏ | 1470/2826 [2:24:31<2:19:27, 6.17s/it] {'loss': 0.2973, 'grad_norm': 3.0710208415985107, 'learning_rate': 2.7635735220781214e-06, 'epoch': 1.56} 52%|█████▏ | 1470/2826 [2:24:31<2:19:27, 6.17s/it] 52%|█████▏ | 1471/2826 [2:24:37<2:15:11, 5.99s/it] 52%|█████▏ | 1472/2826 [2:24:42<2:08:45, 5.71s/it] 52%|█████▏ | 1473/2826 [2:24:47<2:04:36, 5.53s/it] 52%|█████▏ | 1474/2826 [2:24:53<2:09:37, 5.75s/it] 52%|█████▏ | 1475/2826 [2:24:59<2:14:02, 5.95s/it] 52%|█████▏ | 1476/2826 [2:25:05<2:12:32, 5.89s/it] 52%|█████▏ | 1477/2826 [2:25:13<2:23:37, 6.39s/it] 52%|█████▏ | 1478/2826 [2:25:19<2:24:09, 6.42s/it] 52%|█████▏ | 1479/2826 [2:25:24<2:15:55, 6.05s/it] 52%|█████▏ | 1480/2826 [2:25:31<2:16:18, 6.08s/it] {'loss': 0.3423, 'grad_norm': 2.357663631439209, 'learning_rate': 2.7328416079094412e-06, 'epoch': 1.57} 52%|█████▏ | 1480/2826 [2:25:31<2:16:18, 6.08s/it] 52%|█████▏ | 1481/2826 [2:25:36<2:12:05, 5.89s/it] 52%|█████▏ | 1482/2826 [2:25:42<2:09:59, 5.80s/it] 52%|█████▏ | 1483/2826 [2:25:48<2:12:18, 5.91s/it] 53%|█████▎ | 1484/2826 [2:25:53<2:09:47, 5.80s/it] 53%|█████▎ | 1485/2826 [2:25:59<2:05:43, 5.63s/it] 53%|█████▎ | 1486/2826 [2:26:04<2:03:09, 5.51s/it] 53%|█████▎ | 1487/2826 [2:26:09<2:02:44, 5.50s/it] 53%|█████▎ | 1488/2826 [2:26:15<2:04:11, 5.57s/it] 53%|█████▎ | 1489/2826 [2:26:21<2:06:54, 5.70s/it] 53%|█████▎ | 1490/2826 [2:26:27<2:11:41, 5.91s/it] {'loss': 0.3211, 'grad_norm': 2.2559144496917725, 'learning_rate': 2.7020741582685217e-06, 'epoch': 1.58} 53%|█████▎ | 1490/2826 [2:26:27<2:11:41, 5.91s/it] 53%|█████▎ | 1491/2826 [2:26:33<2:08:57, 5.80s/it] 53%|█████▎ | 1492/2826 [2:26:38<2:05:45, 5.66s/it] 53%|█████▎ | 1493/2826 [2:26:45<2:10:31, 5.88s/it] 53%|█████▎ | 1494/2826 [2:26:50<2:07:31, 5.74s/it] 53%|█████▎ | 1495/2826 [2:26:56<2:05:43, 5.67s/it] 53%|█████▎ | 1496/2826 
[2:27:04<2:21:24, 6.38s/it] 53%|█████▎ | 1497/2826 [2:27:09<2:14:05, 6.05s/it] 53%|█████▎ | 1498/2826 [2:27:14<2:08:05, 5.79s/it] 53%|█████▎ | 1499/2826 [2:27:21<2:15:04, 6.11s/it] 53%|█████▎ | 1500/2826 [2:27:26<2:08:59, 5.84s/it] {'loss': 0.2733, 'grad_norm': 2.0730817317962646, 'learning_rate': 2.6712758687759706e-06, 'epoch': 1.59} 53%|█████▎ | 1500/2826 [2:27:26<2:08:59, 5.84s/it] 53%|█████▎ | 1501/2826 [2:27:31<2:05:15, 5.67s/it] 53%|█████▎ | 1502/2826 [2:27:37<2:05:15, 5.68s/it] 53%|█████▎ | 1503/2826 [2:27:42<2:02:02, 5.53s/it] 53%|█████▎ | 1504/2826 [2:27:48<2:04:18, 5.64s/it] 53%|█████▎ | 1505/2826 [2:27:53<2:00:14, 5.46s/it] 53%|█████▎ | 1506/2826 [2:28:00<2:09:37, 5.89s/it] 53%|█████▎ | 1507/2826 [2:28:06<2:11:55, 6.00s/it] 53%|█████▎ | 1508/2826 [2:28:13<2:16:18, 6.21s/it] 53%|█████▎ | 1509/2826 [2:28:19<2:15:06, 6.16s/it] 53%|█████▎ | 1510/2826 [2:28:24<2:08:26, 5.86s/it] {'loss': 0.338, 'grad_norm': 2.6119141578674316, 'learning_rate': 2.6404514397590657e-06, 'epoch': 1.6} 53%|█████▎ | 1510/2826 [2:28:24<2:08:26, 5.86s/it] 53%|█████▎ | 1511/2826 [2:28:31<2:15:14, 6.17s/it] 54%|█████▎ | 1512/2826 [2:28:37<2:15:48, 6.20s/it] 54%|█████▎ | 1513/2826 [2:28:44<2:20:11, 6.41s/it] 54%|█████▎ | 1514/2826 [2:28:50<2:14:05, 6.13s/it] 54%|█████▎ | 1515/2826 [2:28:55<2:09:32, 5.93s/it] 54%|█████▎ | 1516/2826 [2:29:01<2:09:27, 5.93s/it] 54%|█████▎ | 1517/2826 [2:29:07<2:10:28, 5.98s/it] 54%|█████▎ | 1518/2826 [2:29:14<2:12:44, 6.09s/it] 54%|█████▍ | 1519/2826 [2:29:19<2:07:23, 5.85s/it] 54%|█████▍ | 1520/2826 [2:29:24<2:02:12, 5.61s/it] {'loss': 0.3124, 'grad_norm': 2.315875768661499, 'learning_rate': 2.6096055755344113e-06, 'epoch': 1.61} 54%|█████▍ | 1520/2826 [2:29:24<2:02:12, 5.61s/it] 54%|█████▍ | 1521/2826 [2:29:29<2:00:01, 5.52s/it] 54%|█████▍ | 1522/2826 [2:29:35<1:59:29, 5.50s/it] 54%|█████▍ | 1523/2826 [2:29:40<1:56:40, 5.37s/it] 54%|█████▍ | 1524/2826 [2:29:45<1:54:55, 5.30s/it] 54%|█████▍ | 1525/2826 [2:29:50<1:53:48, 5.25s/it] 54%|█████▍ | 1526/2826 
[2:29:55<1:54:11, 5.27s/it] 54%|█████▍ | 1527/2826 [2:30:02<1:59:37, 5.53s/it] 54%|█████▍ | 1528/2826 [2:30:07<1:56:35, 5.39s/it] 54%|█████▍ | 1529/2826 [2:30:12<1:57:28, 5.43s/it] 54%|█████▍ | 1530/2826 [2:30:18<1:59:13, 5.52s/it] {'loss': 0.3538, 'grad_norm': 2.2880892753601074, 'learning_rate': 2.578742983689973e-06, 'epoch': 1.62} 54%|█████▍ | 1530/2826 [2:30:18<1:59:13, 5.52s/it] 54%|█████▍ | 1531/2826 [2:30:24<1:59:53, 5.55s/it] 54%|█████▍ | 1532/2826 [2:30:30<2:04:08, 5.76s/it] 54%|█████▍ | 1533/2826 [2:30:35<2:00:30, 5.59s/it] 54%|█████▍ | 1534/2826 [2:30:42<2:10:51, 6.08s/it] 54%|█████▍ | 1535/2826 [2:30:49<2:12:32, 6.16s/it] 54%|█████▍ | 1536/2826 [2:30:55<2:14:24, 6.25s/it] 54%|█████▍ | 1537/2826 [2:31:01<2:15:44, 6.32s/it] 54%|█████▍ | 1538/2826 [2:31:07<2:09:14, 6.02s/it] 54%|█████▍ | 1539/2826 [2:31:13<2:09:01, 6.02s/it] 54%|█████▍ | 1540/2826 [2:31:20<2:15:35, 6.33s/it] {'loss': 0.3353, 'grad_norm': 2.2615041732788086, 'learning_rate': 2.547868374366631e-06, 'epoch': 1.63} 54%|█████▍ | 1540/2826 [2:31:20<2:15:35, 6.33s/it] 55%|█████▍ | 1541/2826 [2:31:26<2:13:57, 6.25s/it] 55%|█████▍ | 1542/2826 [2:31:32<2:13:36, 6.24s/it] 55%|█████▍ | 1543/2826 [2:31:38<2:09:49, 6.07s/it] 55%|█████▍ | 1544/2826 [2:31:43<2:06:18, 5.91s/it] 55%|█████▍ | 1545/2826 [2:31:49<2:06:29, 5.92s/it] 55%|█████▍ | 1546/2826 [2:31:55<2:04:51, 5.85s/it] 55%|█████▍ | 1547/2826 [2:32:01<2:06:52, 5.95s/it] 55%|█████▍ | 1548/2826 [2:32:06<2:01:18, 5.69s/it] 55%|█████▍ | 1549/2826 [2:32:12<1:59:28, 5.61s/it] 55%|█████▍ | 1550/2826 [2:32:18<2:02:13, 5.75s/it] {'loss': 0.302, 'grad_norm': 1.9062315225601196, 'learning_rate': 2.5169864595393295e-06, 'epoch': 1.64} 55%|█████▍ | 1550/2826 [2:32:18<2:02:13, 5.75s/it] 55%|█████▍ | 1551/2826 [2:32:24<2:03:14, 5.80s/it] 55%|█████▍ | 1552/2826 [2:32:29<1:59:34, 5.63s/it] 55%|█████▍ | 1553/2826 [2:32:35<2:03:47, 5.83s/it] 55%|█████▍ | 1554/2826 [2:32:41<2:03:06, 5.81s/it] 55%|█████▌ | 1555/2826 [2:32:48<2:12:37, 6.26s/it] 55%|█████▌ | 1556/2826 
[2:32:54<2:09:23, 6.11s/it] 55%|█████▌ | 1557/2826 [2:32:59<2:03:22, 5.83s/it] 55%|█████▌ | 1558/2826 [2:33:05<2:01:59, 5.77s/it] 55%|█████▌ | 1559/2826 [2:33:10<1:58:14, 5.60s/it] 55%|█████▌ | 1560/2826 [2:33:16<1:57:39, 5.58s/it] {'loss': 0.3124, 'grad_norm': 2.7016942501068115, 'learning_rate': 2.4861019522979537e-06, 'epoch': 1.65} 55%|█████▌ | 1560/2826 [2:33:16<1:57:39, 5.58s/it] 55%|█████▌ | 1561/2826 [2:33:22<2:02:07, 5.79s/it] 55%|█████▌ | 1562/2826 [2:33:28<2:01:58, 5.79s/it] 55%|█████▌ | 1563/2826 [2:33:34<2:07:10, 6.04s/it] 55%|█████▌ | 1564/2826 [2:33:40<2:05:25, 5.96s/it] 55%|█████▌ | 1565/2826 [2:33:46<2:06:55, 6.04s/it] 55%|█████▌ | 1566/2826 [2:33:52<2:02:21, 5.83s/it] 55%|█████▌ | 1567/2826 [2:33:57<1:58:05, 5.63s/it] 55%|█████▌ | 1568/2826 [2:34:02<1:54:14, 5.45s/it] 56%|█████▌ | 1569/2826 [2:34:07<1:53:19, 5.41s/it] 56%|█████▌ | 1570/2826 [2:34:13<1:53:32, 5.42s/it] {'loss': 0.3497, 'grad_norm': 2.4618184566497803, 'learning_rate': 2.455219566128034e-06, 'epoch': 1.67} 56%|█████▌ | 1570/2826 [2:34:13<1:53:32, 5.42s/it] 56%|█████▌ | 1571/2826 [2:34:18<1:53:24, 5.42s/it] 56%|█████▌ | 1572/2826 [2:34:24<1:54:07, 5.46s/it] 56%|█████▌ | 1573/2826 [2:34:29<1:52:16, 5.38s/it] 56%|█████▌ | 1574/2826 [2:34:35<1:56:00, 5.56s/it] 56%|█████▌ | 1575/2826 [2:34:40<1:56:14, 5.58s/it] 56%|█████▌ | 1576/2826 [2:34:46<1:57:30, 5.64s/it] 56%|█████▌ | 1577/2826 [2:34:53<2:05:30, 6.03s/it] 56%|█████▌ | 1578/2826 [2:34:59<2:07:31, 6.13s/it] 56%|█████▌ | 1579/2826 [2:35:05<2:05:00, 6.01s/it] 56%|█████▌ | 1580/2826 [2:35:12<2:07:54, 6.16s/it] {'loss': 0.3233, 'grad_norm': 2.8924951553344727, 'learning_rate': 2.4243440141913905e-06, 'epoch': 1.68} 56%|█████▌ | 1580/2826 [2:35:12<2:07:54, 6.16s/it] 56%|█████▌ | 1581/2826 [2:35:18<2:07:28, 6.14s/it] 56%|█████▌ | 1582/2826 [2:35:25<2:11:43, 6.35s/it] 56%|█████▌ | 1583/2826 [2:35:30<2:08:40, 6.21s/it] 56%|█████▌ | 1584/2826 [2:35:36<2:05:36, 6.07s/it] 56%|█████▌ | 1585/2826 [2:35:41<2:00:26, 5.82s/it] 56%|█████▌ | 1586/2826 
[2:35:48<2:04:25, 6.02s/it] 56%|█████▌ | 1587/2826 [2:35:53<1:59:19, 5.78s/it] 56%|█████▌ | 1588/2826 [2:35:59<1:58:20, 5.74s/it] 56%|█████▌ | 1589/2826 [2:36:04<1:54:37, 5.56s/it] 56%|█████▋ | 1590/2826 [2:36:10<1:55:40, 5.62s/it] {'loss': 0.3067, 'grad_norm': 2.32255482673645, 'learning_rate': 2.393480008606825e-06, 'epoch': 1.69} 56%|█████▋ | 1590/2826 [2:36:10<1:55:40, 5.62s/it] 56%|█████▋ | 1591/2826 [2:36:15<1:53:13, 5.50s/it] 56%|█████▋ | 1592/2826 [2:36:20<1:50:24, 5.37s/it] 56%|█████▋ | 1593/2826 [2:36:26<1:55:33, 5.62s/it] 56%|█████▋ | 1594/2826 [2:36:32<1:57:34, 5.73s/it] 56%|█████▋ | 1595/2826 [2:36:38<1:55:04, 5.61s/it] 56%|█████▋ | 1596/2826 [2:36:44<1:59:10, 5.81s/it] 57%|█████▋ | 1597/2826 [2:36:49<1:55:38, 5.65s/it] 57%|█████▋ | 1598/2826 [2:36:55<1:58:18, 5.78s/it] 57%|█████▋ | 1599/2826 [2:37:02<2:04:41, 6.10s/it] 57%|█████▋ | 1600/2826 [2:37:09<2:11:51, 6.45s/it] {'loss': 0.2893, 'grad_norm': 1.8984359502792358, 'learning_rate': 2.3626322597309774e-06, 'epoch': 1.7} 57%|█████▋ | 1600/2826 [2:37:09<2:11:51, 6.45s/it] 57%|█████▋ | 1601/2826 [2:37:15<2:06:46, 6.21s/it] 57%|█████▋ | 1602/2826 [2:37:20<2:02:48, 6.02s/it] 57%|█████▋ | 1603/2826 [2:37:26<1:57:59, 5.79s/it] 57%|█████▋ | 1604/2826 [2:37:31<1:56:41, 5.73s/it] 57%|█████▋ | 1605/2826 [2:37:37<1:58:05, 5.80s/it] 57%|█████▋ | 1606/2826 [2:37:43<1:58:54, 5.85s/it] 57%|█████▋ | 1607/2826 [2:37:50<2:05:37, 6.18s/it] 57%|█████▋ | 1608/2826 [2:37:56<2:01:53, 6.00s/it] 57%|█████▋ | 1609/2826 [2:38:02<2:03:18, 6.08s/it] 57%|█████▋ | 1610/2826 [2:38:07<1:59:12, 5.88s/it] {'loss': 0.2825, 'grad_norm': 1.8360289335250854, 'learning_rate': 2.331805475439445e-06, 'epoch': 1.71} 57%|█████▋ | 1610/2826 [2:38:07<1:59:12, 5.88s/it] 57%|█████▋ | 1611/2826 [2:38:13<1:54:49, 5.67s/it] 57%|█████▋ | 1612/2826 [2:38:18<1:51:13, 5.50s/it] 57%|█████▋ | 1613/2826 [2:38:23<1:49:05, 5.40s/it] 57%|█████▋ | 1614/2826 [2:38:28<1:49:12, 5.41s/it] 57%|█████▋ | 1615/2826 [2:38:35<1:55:45, 5.74s/it] 57%|█████▋ | 1616/2826 
1620/2826 [2:39:01] {'loss': 0.3379, 'grad_norm': 2.331998109817505, 'learning_rate': 2.3010043604082824e-06, 'epoch': 1.72}
1630/2826 [2:40:01] {'loss': 0.301, 'grad_norm': 2.3304574489593506, 'learning_rate': 2.2702336153959925e-06, 'epoch': 1.73}
1640/2826 [2:40:58] {'loss': 0.404, 'grad_norm': 2.534090518951416, 'learning_rate': 2.2394979365261134e-06, 'epoch': 1.74}
1650/2826 [2:41:56] {'loss': 0.3242, 'grad_norm': 2.273122549057007, 'learning_rate': 2.208802014570507e-06, 'epoch': 1.75}
1660/2826 [2:42:55] {'loss': 0.3152, 'grad_norm': 1.8859643936157227, 'learning_rate': 2.1781505342334775e-06, 'epoch': 1.76}
1670/2826 [2:43:56] {'loss': 0.3302, 'grad_norm': 2.567715644836426, 'learning_rate': 2.147548173436805e-06, 'epoch': 1.77}
1680/2826 [2:44:55] {'loss': 0.293, 'grad_norm': 2.7930519580841064, 'learning_rate': 2.116999602605814e-06, 'epoch': 1.78}
1690/2826 [2:45:54] {'loss': 0.2683, 'grad_norm': 2.646296262741089, 'learning_rate': 2.086509483956594e-06, 'epoch': 1.79}
1700/2826 [2:46:50] {'loss': 0.313, 'grad_norm': 2.3010053634643555, 'learning_rate': 2.056082470784469e-06, 'epoch': 1.8}
1710/2826 [2:47:47] {'loss': 0.262, 'grad_norm': 2.3864669799804688, 'learning_rate': 2.0257232067538213e-06, 'epoch': 1.81}
1720/2826 [2:48:43] {'loss': 0.3457, 'grad_norm': 2.63028883934021, 'learning_rate': 1.9954363251894007e-06, 'epoch': 1.82}
1730/2826 [2:49:43] {'loss': 0.2739, 'grad_norm': 2.0011484622955322, 'learning_rate': 1.9652264483691933e-06, 'epoch': 1.84}
1740/2826 [2:50:40] {'loss': 0.3109, 'grad_norm': 2.6818690299987793, 'learning_rate': 1.9350981868189944e-06, 'epoch': 1.85}
1750/2826 [2:51:36] {'loss': 0.3269, 'grad_norm': 2.6978225708007812, 'learning_rate': 1.9050561386087618e-06, 'epoch': 1.86}
1760/2826 [2:52:34] {'loss': 0.3617, 'grad_norm': 2.578031301498413, 'learning_rate': 1.8751048886508711e-06, 'epoch': 1.87}
1770/2826 [2:53:33] {'loss': 0.3228, 'grad_norm': 2.5525052547454834, 'learning_rate': 1.8452490080003888e-06, 'epoch': 1.88}
1780/2826 [2:54:32] {'loss': 0.2857, 'grad_norm': 2.1095635890960693, 'learning_rate': 1.8154930531574521e-06, 'epoch': 1.89}
1790/2826 [2:55:29] {'loss': 0.3622, 'grad_norm': 2.3965845108032227, 'learning_rate': 1.785841565371868e-06, 'epoch': 1.9}
1800/2826 [2:56:27] {'loss': 0.3031, 'grad_norm': 2.293715238571167, 'learning_rate': 1.7562990699500482e-06, 'epoch': 1.91}
1810/2826 [2:57:25] {'loss': 0.3019, 'grad_norm': 2.026015281677246, 'learning_rate': 1.7268700755643708e-06, 'epoch': 1.92}
1820/2826 [2:58:25] {'loss': 0.3047, 'grad_norm': 1.7175791263580322, 'learning_rate': 1.6975590735650812e-06, 'epoch': 1.93}
1830/2826 [2:59:22] {'loss': 0.3048, 'grad_norm': 2.0024490356445312, 'learning_rate': 1.668370537294841e-06, 'epoch': 1.94}
1840/2826 [3:00:20] {'loss': 0.3205, 'grad_norm': 2.8226239681243896, 'learning_rate': 1.6393089214060204e-06, 'epoch': 1.95}
1850/2826 [3:01:23] {'loss': 0.321, 'grad_norm': 1.9452221393585205, 'learning_rate': 1.6103786611808414e-06, 'epoch': 1.96}
1860/2826 [3:02:19] {'loss': 0.2954, 'grad_norm': 2.304274320602417, 'learning_rate': 1.5815841718544884e-06, 'epoch': 1.97}
1870/2826 [3:03:16] {'loss': 0.2945, 'grad_norm': 2.502206802368164, 'learning_rate': 1.5529298479412636e-06, 'epoch': 1.98}
1880/2826 [3:04:13] {'loss': 0.3291, 'grad_norm': 2.5796189308166504, 'learning_rate': 1.524420062563912e-06, 'epoch': 1.99}
[INFO|trainer.py:3984] 2025-10-18 09:51:01,296 >> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886
[INFO|configuration_utils.py:419] 2025-10-18 09:51:01,303 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/config.json
[INFO|configuration_utils.py:911] 2025-10-18 09:51:01,304 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 09:51:16,354 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameter has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 09:51:16,359 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 09:51:16,360 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/special_tokens_map.json
[2025-10-18 09:51:16,892] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step1885 is about to be saved!
[2025-10-18 09:51:16,903] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-10-18 09:51:16,903] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-10-18 09:51:16,923] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-10-18 09:51:16,936] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-10-18 09:51:34,927] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-10-18 09:51:34,929] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-10-18 09:51:35,161] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1885 is ready now!
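The save log above reports the model being split into 4 safetensors shards, with a `model.safetensors.index.json` index written next to them. That index maps each parameter name to the shard file holding it under a `weight_map` key; the sketch below summarizes such an index (the tiny index it builds is fabricated for the demo, not taken from this run).

```python
import json
import tempfile

def shard_summary(index_path: str) -> dict[str, int]:
    """Count how many tensors each shard file stores, per the index's weight_map."""
    with open(index_path) as f:
        index = json.load(f)
    counts: dict[str, int] = {}
    for shard in index["weight_map"].values():
        counts[shard] = counts.get(shard, 0) + 1
    return counts

# Demo with a small fabricated index; a real one sits at e.g.
# checkpoint-1886/model.safetensors.index.json.
fake_index = {
    "metadata": {"total_size": 123},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
        "lm_head.weight": "model-00004-of-00004.safetensors",
    },
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(fake_index, f)
summary = shard_summary(f.name)
print(summary)  # {'model-00001-of-00004.safetensors': 2, 'model-00004-of-00004.safetensors': 1}
```

Loading the full model from such a checkpoint only needs the directory path; the index tells the loader which shard to open for each tensor.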
1890/2826 [3:05:53] {'loss': 0.234, 'grad_norm': 1.9198871850967407, 'learning_rate': 1.4960591667862163e-06, 'epoch': 2.0}
1900/2826 [3:06:54] {'loss': 0.1943, 'grad_norm': 1.7082706689834595, 'learning_rate': 1.4678514889489464e-06, 'epoch': 2.01}
1910/2826 [3:07:56] {'loss': 0.1911, 'grad_norm': 1.8571817874908447, 'learning_rate': 1.4398013340092864e-06, 'epoch': 2.03}
1920/2826 [3:08:56] {'loss': 0.1895, 'grad_norm': 2.454561233520508, 'learning_rate': 1.4119129828838275e-06, 'epoch': 2.04}
1930/2826 [3:09:56] {'loss': 0.2177, 'grad_norm': 2.3714683055877686, 'learning_rate': 1.384190691795226e-06, 'epoch': 2.05}
1940/2826 [3:10:53] {'loss': 0.2252, 'grad_norm': 2.1356313228607178, 'learning_rate': 1.3566386916226373e-06, 'epoch': 2.06}
1950/2826 [3:11:57] {'loss': 0.1982, 'grad_norm': 2.446906089782715, 'learning_rate': 1.3292611872560134e-06, 'epoch': 2.07}
1960/2826 [3:12:58] {'loss': 0.1696, 'grad_norm': 2.1040875911712646, 'learning_rate': 1.302062356954365e-06, 'epoch': 2.08}
1970/2826 [3:14:00] {'loss': 0.1936, 'grad_norm': 2.220742702484131, 'learning_rate': 1.2750463517080922e-06, 'epoch': 2.09}
1980/2826 [3:14:56] {'loss': 0.1604, 'grad_norm': 2.7784054279327393, 'learning_rate': 1.2482172946054753e-06, 'epoch': 2.1}
1990/2826 [3:15:54] {'loss': 0.2069, 'grad_norm': 2.0539498329162598, 'learning_rate': 1.2215792802034187e-06, 'epoch': 2.11}
2000/2826 [3:16:54] {'loss': 0.1964, 'grad_norm': 1.8337138891220093, 'learning_rate': 1.1951363739025618e-06, 'epoch': 2.12}
2010/2826 [3:17:53] {'loss': 0.1871, 'grad_norm': 1.7631642818450928, 'learning_rate': 1.168892611326827e-06, 'epoch': 2.13}
2020/2826 [3:18:51] {'loss': 0.2595, 'grad_norm': 2.386589527130127, 'learning_rate': 1.1428519977075136e-06, 'epoch': 2.14}
2030/2826 [3:19:49] {'loss': 0.185, 'grad_norm': 2.553382635116577, 'learning_rate': 1.1170185072720434e-06, 'epoch': 2.15}
2040/2826 [3:20:47] {'loss': 0.228, 'grad_norm': 2.870973825454712, 'learning_rate': 1.091396082637419e-06, 'epoch': 2.16}
2050/2826 [3:21:48] {'loss': 0.2098, 'grad_norm': 2.643745183944702, 'learning_rate': 1.065988634208516e-06, 'epoch': 2.17}
2060/2826 [3:22:48] {'loss': 0.1982, 'grad_norm': 2.369596481323242, 'learning_rate': 1.0408000395812961e-06, 'epoch': 2.18}
2070/2826 [3:23:45] {'loss': 0.1844, 'grad_norm': 2.1093883514404297, 'learning_rate': 1.0158341429510194e-06, 'epoch': 2.2}
2080/2826 [3:24:45] {'loss': 0.1654, 'grad_norm': 1.951935052871704, 'learning_rate': 9.910947545255523e-07, 'epoch': 2.21}
2090/2826 [3:25:43] {'loss': 0.2037, 'grad_norm': 2.230781078338623, 'learning_rate': 9.665856499438744e-07, 'epoch': 2.22}
2100/2826 [3:26:40] {'loss': 0.2087, 'grad_norm': 2.6240904331207275, 'learning_rate': 9.423105696998491e-07, 'epoch': 2.23}
2110/2826 [3:27:38] {'loss': 0.2105, 'grad_norm': 1.712857723236084, 'learning_rate': 9.182732185713633e-07, 'epoch': 2.24}
2120/2826 [3:28:36] {'loss': 0.2186, 'grad_norm': 2.036086082458496, 'learning_rate': 8.94477265054918e-07, 'epoch': 2.25}
2130/2826 [3:29:34] {'loss': 0.1879, 'grad_norm': 2.3545398712158203, 'learning_rate': 8.709263408057522e-07, 'epoch': 2.26}
2140/2826 [3:30:30] {'loss': 0.2177, 'grad_norm': 1.9098992347717285, 'learning_rate': 8.476240400835972e-07, 'epoch': 2.27}
2150/2826 [3:31:30] {'loss': 0.165, 'grad_norm': 2.107959270477295, 'learning_rate': 8.245739192041311e-07, 'epoch': 2.28}
2160/2826 [3:32:27] {'loss': 0.2018, 'grad_norm': 2.550719976425171, 'learning_rate': 8.017794959962225e-07, 'epoch': 2.29}
2170/2826 [3:33:28] {'loss': 0.1955, 'grad_norm': 2.354701280593872, 'learning_rate': 7.792442492650587e-07, 'epoch': 2.3}
77%|███████▋ | 2180/2826 [3:34:25<58:22, 5.42s/it] {'loss': 0.1976, 'grad_norm': 2.3547091484069824, 'learning_rate': 7.569716182612177e-07, 'epoch': 2.31} 77%|███████▋ | 2180/2826 [3:34:25<58:22, 5.42s/it] 77%|███████▋ | 2181/2826 [3:34:30<58:29, 5.44s/it] 77%|███████▋ | 2182/2826 [3:34:37<1:02:44, 5.85s/it] 77%|███████▋ | 2183/2826 [3:34:43<1:03:58, 5.97s/it] 77%|███████▋ | 2184/2826 [3:34:50<1:06:22, 6.20s/it] 77%|███████▋ | 2185/2826 [3:34:56<1:04:53, 6.07s/it] 77%|███████▋ | 2186/2826 [3:35:01<1:01:41, 5.78s/it] 77%|███████▋ | 2187/2826 [3:35:06<1:00:31, 5.68s/it] 77%|███████▋ | 2188/2826 [3:35:12<1:01:55, 5.82s/it] 77%|███████▋ | 2189/2826 [3:35:18<1:01:15, 5.77s/it] 77%|███████▋ | 2190/2826 [3:35:25<1:04:03, 6.04s/it] {'loss': 0.1685, 'grad_norm': 1.4048022031784058, 'learning_rate': 7.349650021557839e-07, 'epoch': 2.32} 77%|███████▋ | 2190/2826 [3:35:25<1:04:03, 6.04s/it] 78%|███████▊ | 2191/2826 [3:35:30<1:01:02, 5.77s/it] 78%|███████▊ | 2192/2826 [3:35:36<1:01:22, 5.81s/it] 78%|███████▊ | 2193/2826 [3:35:41<1:00:40, 5.75s/it] 78%|███████▊ | 2194/2826 [3:35:47<1:00:55, 5.78s/it] 78%|███████▊ | 2195/2826 [3:35:52<59:07, 5.62s/it] 78%|███████▊ | 2196/2826 [3:35:58<57:46, 5.50s/it] 78%|███████▊ | 2197/2826 [3:36:03<56:47, 5.42s/it] 78%|███████▊ | 2198/2826 [3:36:08<56:36, 5.41s/it] 78%|███████▊ | 2199/2826 [3:36:13<55:36, 5.32s/it] 78%|███████▊ | 2200/2826 [3:36:19<54:53, 5.26s/it] {'loss': 0.1519, 'grad_norm': 2.568500280380249, 'learning_rate': 7.132277595215773e-07, 'epoch': 2.33} 78%|███████▊ | 2200/2826 [3:36:19<54:53, 5.26s/it] 78%|███████▊ | 2201/2826 [3:36:25<59:00, 5.66s/it] 78%|███████▊ | 2202/2826 [3:36:31<1:00:43, 5.84s/it] 78%|███████▊ | 2203/2826 [3:36:37<1:00:25, 5.82s/it] 78%|███████▊ | 2204/2826 [3:36:44<1:02:22, 6.02s/it] 78%|███████▊ | 2205/2826 [3:36:49<59:23, 5.74s/it] 78%|███████▊ | 2206/2826 [3:36:54<59:15, 5.74s/it] 78%|███████▊ | 2207/2826 [3:37:00<57:33, 5.58s/it] 78%|███████▊ | 2208/2826 [3:37:06<1:00:19, 5.86s/it] 78%|███████▊ | 
2209/2826 [3:37:12<1:00:50, 5.92s/it] 78%|███████▊ | 2210/2826 [3:37:18<1:01:06, 5.95s/it] {'loss': 0.1573, 'grad_norm': 2.205993413925171, 'learning_rate': 6.917632078205805e-07, 'epoch': 2.34} 78%|███████▊ | 2210/2826 [3:37:18<1:01:06, 5.95s/it] 78%|███████▊ | 2211/2826 [3:37:24<1:01:57, 6.04s/it] 78%|███████▊ | 2212/2826 [3:37:30<1:01:16, 5.99s/it] 78%|███████▊ | 2213/2826 [3:37:37<1:02:16, 6.10s/it] 78%|███████▊ | 2214/2826 [3:37:42<1:00:56, 5.98s/it] 78%|███████▊ | 2215/2826 [3:37:47<58:08, 5.71s/it] 78%|███████▊ | 2216/2826 [3:37:53<56:31, 5.56s/it] 78%|███████▊ | 2217/2826 [3:37:59<57:27, 5.66s/it] 78%|███████▊ | 2218/2826 [3:38:04<55:42, 5.50s/it] 79%|███████▊ | 2219/2826 [3:38:09<54:57, 5.43s/it] 79%|███████▊ | 2220/2826 [3:38:14<54:23, 5.39s/it] {'loss': 0.184, 'grad_norm': 2.067505121231079, 'learning_rate': 6.705746228976387e-07, 'epoch': 2.35} 79%|███████▊ | 2220/2826 [3:38:14<54:23, 5.39s/it] 79%|███████▊ | 2221/2826 [3:38:21<56:59, 5.65s/it] 79%|███████▊ | 2222/2826 [3:38:26<55:25, 5.51s/it] 79%|███████▊ | 2223/2826 [3:38:31<55:25, 5.51s/it] 79%|███████▊ | 2224/2826 [3:38:37<55:44, 5.55s/it] 79%|███████▊ | 2225/2826 [3:38:42<55:28, 5.54s/it] 79%|███████▉ | 2226/2826 [3:38:49<58:35, 5.86s/it] 79%|███████▉ | 2227/2826 [3:38:54<56:41, 5.68s/it] 79%|███████▉ | 2228/2826 [3:39:00<56:33, 5.67s/it] 79%|███████▉ | 2229/2826 [3:39:05<56:04, 5.64s/it] 79%|███████▉ | 2230/2826 [3:39:11<54:31, 5.49s/it] {'loss': 0.1968, 'grad_norm': 2.4360201358795166, 'learning_rate': 6.496652384805125e-07, 'epoch': 2.36} 79%|███████▉ | 2230/2826 [3:39:11<54:31, 5.49s/it] 79%|███████▉ | 2231/2826 [3:39:16<53:53, 5.44s/it] 79%|███████▉ | 2232/2826 [3:39:22<54:48, 5.54s/it] 79%|███████▉ | 2233/2826 [3:39:28<57:11, 5.79s/it] 79%|███████▉ | 2234/2826 [3:39:34<58:45, 5.95s/it] 79%|███████▉ | 2235/2826 [3:39:41<59:48, 6.07s/it] 79%|███████▉ | 2236/2826 [3:39:47<59:45, 6.08s/it] 79%|███████▉ | 2237/2826 [3:39:53<1:00:38, 6.18s/it] 79%|███████▉ | 2238/2826 [3:39:59<1:00:35, 6.18s/it] 
79%|███████▉ | 2239/2826 [3:40:05<58:35, 5.99s/it] 79%|███████▉ | 2240/2826 [3:40:12<1:01:36, 6.31s/it] {'loss': 0.1846, 'grad_norm': 2.042179584503174, 'learning_rate': 6.290382456863584e-07, 'epoch': 2.38} 79%|███████▉ | 2240/2826 [3:40:12<1:01:36, 6.31s/it] 79%|███████▉ | 2241/2826 [3:40:17<59:00, 6.05s/it] 79%|███████▉ | 2242/2826 [3:40:24<58:59, 6.06s/it] 79%|███████▉ | 2243/2826 [3:40:31<1:03:39, 6.55s/it] 79%|███████▉ | 2244/2826 [3:40:38<1:03:01, 6.50s/it] 79%|███████▉ | 2245/2826 [3:40:43<59:06, 6.10s/it] 79%|███████▉ | 2246/2826 [3:40:50<1:02:49, 6.50s/it] 80%|███████▉ | 2247/2826 [3:40:55<58:38, 6.08s/it] 80%|███████▉ | 2248/2826 [3:41:02<1:00:33, 6.29s/it] 80%|███████▉ | 2249/2826 [3:41:08<58:50, 6.12s/it] 80%|███████▉ | 2250/2826 [3:41:13<55:52, 5.82s/it] {'loss': 0.1858, 'grad_norm': 2.849271535873413, 'learning_rate': 6.086967925347075e-07, 'epoch': 2.39} 80%|███████▉ | 2250/2826 [3:41:13<55:52, 5.82s/it] 80%|███████▉ | 2251/2826 [3:41:19<56:26, 5.89s/it] 80%|███████▉ | 2252/2826 [3:41:24<55:01, 5.75s/it] 80%|███████▉ | 2253/2826 [3:41:30<53:34, 5.61s/it] 80%|███████▉ | 2254/2826 [3:41:36<56:02, 5.88s/it] 80%|███████▉ | 2255/2826 [3:41:42<56:14, 5.91s/it] 80%|███████▉ | 2256/2826 [3:41:49<57:35, 6.06s/it] 80%|███████▉ | 2257/2826 [3:41:54<55:16, 5.83s/it] 80%|███████▉ | 2258/2826 [3:42:01<58:27, 6.17s/it] 80%|███████▉ | 2259/2826 [3:42:06<55:15, 5.85s/it] 80%|███████▉ | 2260/2826 [3:42:12<55:51, 5.92s/it] {'loss': 0.1837, 'grad_norm': 2.0765082836151123, 'learning_rate': 5.88643983467033e-07, 'epoch': 2.4} 80%|███████▉ | 2260/2826 [3:42:12<55:51, 5.92s/it] 80%|████████ | 2261/2826 [3:42:18<55:54, 5.94s/it] 80%|████████ | 2262/2826 [3:42:23<53:39, 5.71s/it] 80%|████████ | 2263/2826 [3:42:30<56:01, 5.97s/it] 80%|████████ | 2264/2826 [3:42:35<53:51, 5.75s/it] 80%|████████ | 2265/2826 [3:42:41<53:00, 5.67s/it] 80%|████████ | 2266/2826 [3:42:47<54:28, 5.84s/it] 80%|████████ | 2267/2826 [3:42:53<55:09, 5.92s/it] 80%|████████ | 2268/2826 [3:42:58<54:00, 
5.81s/it] 80%|████████ | 2269/2826 [3:43:04<53:00, 5.71s/it] 80%|████████ | 2270/2826 [3:43:09<51:56, 5.60s/it] {'loss': 0.1659, 'grad_norm': 1.9958840608596802, 'learning_rate': 5.688828788729547e-07, 'epoch': 2.41} 80%|████████ | 2270/2826 [3:43:09<51:56, 5.60s/it] 80%|████████ | 2271/2826 [3:43:16<54:23, 5.88s/it] 80%|████████ | 2272/2826 [3:43:21<53:29, 5.79s/it] 80%|████████ | 2273/2826 [3:43:27<51:39, 5.61s/it] 80%|████████ | 2274/2826 [3:43:32<49:48, 5.41s/it] 81%|████████ | 2275/2826 [3:43:37<48:50, 5.32s/it] 81%|████████ | 2276/2826 [3:43:42<48:05, 5.25s/it] 81%|████████ | 2277/2826 [3:43:47<48:08, 5.26s/it] 81%|████████ | 2278/2826 [3:43:53<50:46, 5.56s/it] 81%|████████ | 2279/2826 [3:43:59<50:21, 5.52s/it] 81%|████████ | 2280/2826 [3:44:04<50:05, 5.51s/it] {'loss': 0.2095, 'grad_norm': 2.253602981567383, 'learning_rate': 5.494164946231747e-07, 'epoch': 2.42} 81%|████████ | 2280/2826 [3:44:04<50:05, 5.51s/it] 81%|████████ | 2281/2826 [3:44:09<49:02, 5.40s/it] 81%|████████ | 2282/2826 [3:44:15<48:52, 5.39s/it] 81%|████████ | 2283/2826 [3:44:20<49:40, 5.49s/it] 81%|████████ | 2284/2826 [3:44:26<48:51, 5.41s/it] 81%|████████ | 2285/2826 [3:44:32<50:44, 5.63s/it] 81%|████████ | 2286/2826 [3:44:37<49:33, 5.51s/it] 81%|████████ | 2287/2826 [3:44:45<56:11, 6.26s/it] 81%|████████ | 2288/2826 [3:44:52<57:50, 6.45s/it] 81%|████████ | 2289/2826 [3:44:57<54:16, 6.07s/it] 81%|████████ | 2290/2826 [3:45:03<53:50, 6.03s/it] {'loss': 0.1862, 'grad_norm': 1.5552992820739746, 'learning_rate': 5.302478016092075e-07, 'epoch': 2.43} 81%|████████ | 2290/2826 [3:45:03<53:50, 6.03s/it] 81%|████████ | 2291/2826 [3:45:08<51:04, 5.73s/it] 81%|████████ | 2292/2826 [3:45:13<49:29, 5.56s/it] 81%|████████ | 2293/2826 [3:45:18<48:33, 5.47s/it] 81%|████████ | 2294/2826 [3:45:24<47:41, 5.38s/it] 81%|████████ | 2295/2826 [3:45:29<47:22, 5.35s/it] 81%|████████ | 2296/2826 [3:45:35<49:27, 5.60s/it] 81%|████████▏ | 2297/2826 [3:45:42<51:51, 5.88s/it] 81%|████████▏ | 2298/2826 [3:45:47<50:21, 
5.72s/it] 81%|████████▏ | 2299/2826 [3:45:56<57:54, 6.59s/it] 81%|████████▏ | 2300/2826 [3:46:01<53:44, 6.13s/it] {'loss': 0.2085, 'grad_norm': 2.721445322036743, 'learning_rate': 5.113797252899728e-07, 'epoch': 2.44} 81%|████████▏ | 2300/2826 [3:46:01<53:44, 6.13s/it] 81%|████████▏ | 2301/2826 [3:46:06<50:48, 5.81s/it] 81%|████████▏ | 2302/2826 [3:46:12<52:45, 6.04s/it] 81%|████████▏ | 2303/2826 [3:46:17<50:26, 5.79s/it] 82%|████████▏ | 2304/2826 [3:46:23<48:47, 5.61s/it] 82%|████████▏ | 2305/2826 [3:46:28<48:03, 5.53s/it] 82%|████████▏ | 2306/2826 [3:46:33<46:58, 5.42s/it] 82%|████████▏ | 2307/2826 [3:46:41<53:02, 6.13s/it] 82%|████████▏ | 2308/2826 [3:46:46<50:44, 5.88s/it] 82%|████████▏ | 2309/2826 [3:46:51<48:41, 5.65s/it] 82%|████████▏ | 2310/2826 [3:46:58<51:06, 5.94s/it] {'loss': 0.1914, 'grad_norm': 2.3488707542419434, 'learning_rate': 4.928151452453184e-07, 'epoch': 2.45} 82%|████████▏ | 2310/2826 [3:46:58<51:06, 5.94s/it] 82%|████████▏ | 2311/2826 [3:47:03<48:58, 5.71s/it] 82%|████████▏ | 2312/2826 [3:47:09<48:15, 5.63s/it] 82%|████████▏ | 2313/2826 [3:47:14<46:56, 5.49s/it] 82%|████████▏ | 2314/2826 [3:47:20<49:18, 5.78s/it] 82%|████████▏ | 2315/2826 [3:47:26<48:12, 5.66s/it] 82%|████████▏ | 2316/2826 [3:47:31<47:14, 5.56s/it] 82%|████████▏ | 2317/2826 [3:47:38<49:55, 5.89s/it] 82%|████████▏ | 2318/2826 [3:47:45<53:20, 6.30s/it] 82%|████████▏ | 2319/2826 [3:47:52<56:04, 6.64s/it] 82%|████████▏ | 2320/2826 [3:47:59<55:20, 6.56s/it] {'loss': 0.1718, 'grad_norm': 2.49068021774292, 'learning_rate': 4.745568947365542e-07, 'epoch': 2.46} 82%|████████▏ | 2320/2826 [3:47:59<55:20, 6.56s/it] 82%|████████▏ | 2321/2826 [3:48:04<51:33, 6.13s/it] 82%|████████▏ | 2322/2826 [3:48:09<49:00, 5.83s/it] 82%|████████▏ | 2323/2826 [3:48:15<50:23, 6.01s/it] 82%|████████▏ | 2324/2826 [3:48:21<49:10, 5.88s/it] 82%|████████▏ | 2325/2826 [3:48:26<48:18, 5.79s/it] 82%|████████▏ | 2326/2826 [3:48:33<49:17, 5.91s/it] 82%|████████▏ | 2327/2826 [3:48:40<52:32, 6.32s/it] 82%|████████▏ 
| 2328/2826 [3:48:46<50:49, 6.12s/it] 82%|████████▏ | 2329/2826 [3:48:52<50:21, 6.08s/it] 82%|████████▏ | 2330/2826 [3:48:58<50:41, 6.13s/it] {'loss': 0.1669, 'grad_norm': 1.4638549089431763, 'learning_rate': 4.5660776027404654e-07, 'epoch': 2.47} 82%|████████▏ | 2330/2826 [3:48:58<50:41, 6.13s/it] 82%|████████▏ | 2331/2826 [3:49:04<51:05, 6.19s/it] 83%|████████▎ | 2332/2826 [3:49:11<52:12, 6.34s/it] 83%|████████▎ | 2333/2826 [3:49:19<55:36, 6.77s/it] 83%|████████▎ | 2334/2826 [3:49:24<52:32, 6.41s/it] 83%|████████▎ | 2335/2826 [3:49:29<49:42, 6.07s/it] 83%|████████▎ | 2336/2826 [3:49:36<51:12, 6.27s/it] 83%|████████▎ | 2337/2826 [3:49:43<51:27, 6.31s/it] 83%|████████▎ | 2338/2826 [3:49:48<48:33, 5.97s/it] 83%|████████▎ | 2339/2826 [3:49:53<47:47, 5.89s/it] 83%|████████▎ | 2340/2826 [3:50:00<48:21, 5.97s/it] {'loss': 0.1731, 'grad_norm': 2.288776159286499, 'learning_rate': 4.389704811919507e-07, 'epoch': 2.48} 83%|████████▎ | 2340/2826 [3:50:00<48:21, 5.97s/it] 83%|████████▎ | 2341/2826 [3:50:05<47:06, 5.83s/it] 83%|████████▎ | 2342/2826 [3:50:10<45:28, 5.64s/it] 83%|████████▎ | 2343/2826 [3:50:16<46:05, 5.73s/it] 83%|████████▎ | 2344/2826 [3:50:23<48:01, 5.98s/it] 83%|████████▎ | 2345/2826 [3:50:28<46:00, 5.74s/it] 83%|████████▎ | 2346/2826 [3:50:33<44:32, 5.57s/it] 83%|████████▎ | 2347/2826 [3:50:39<45:26, 5.69s/it] 83%|████████▎ | 2348/2826 [3:50:45<44:54, 5.64s/it] 83%|████████▎ | 2349/2826 [3:50:51<47:32, 5.98s/it] 83%|████████▎ | 2350/2826 [3:50:57<45:41, 5.76s/it] {'loss': 0.1802, 'grad_norm': 2.385162115097046, 'learning_rate': 4.216477492301455e-07, 'epoch': 2.49} 83%|████████▎ | 2350/2826 [3:50:57<45:41, 5.76s/it] 83%|████████▎ | 2351/2826 [3:51:03<46:17, 5.85s/it] 83%|████████▎ | 2352/2826 [3:51:08<44:44, 5.66s/it] 83%|████████▎ | 2353/2826 [3:51:14<46:25, 5.89s/it] 83%|████████▎ | 2354/2826 [3:51:20<44:29, 5.66s/it] 83%|████████▎ | 2355/2826 [3:51:25<42:56, 5.47s/it] 83%|████████▎ | 2356/2826 [3:51:30<41:58, 5.36s/it] 83%|████████▎ | 2357/2826 
[3:51:35<42:07, 5.39s/it] 83%|████████▎ | 2358/2826 [3:51:42<44:47, 5.74s/it] 83%|████████▎ | 2359/2826 [3:51:47<43:11, 5.55s/it] 84%|████████▎ | 2360/2826 [3:51:53<44:19, 5.71s/it] {'loss': 0.2232, 'grad_norm': 2.0100815296173096, 'learning_rate': 4.0464220812342526e-07, 'epoch': 2.5} 84%|████████▎ | 2360/2826 [3:51:53<44:19, 5.71s/it] 84%|████████▎ | 2361/2826 [3:51:58<42:59, 5.55s/it] 84%|████████▎ | 2362/2826 [3:52:05<45:05, 5.83s/it] 84%|████████▎ | 2363/2826 [3:52:10<43:23, 5.62s/it] 84%|████████▎ | 2364/2826 [3:52:16<44:01, 5.72s/it] 84%|████████▎ | 2365/2826 [3:52:21<42:47, 5.57s/it] 84%|████████▎ | 2366/2826 [3:52:27<44:40, 5.83s/it] 84%|████████▍ | 2367/2826 [3:52:34<46:48, 6.12s/it] 84%|████████▍ | 2368/2826 [3:52:41<47:55, 6.28s/it] 84%|████████▍ | 2369/2826 [3:52:47<46:53, 6.16s/it] 84%|████████▍ | 2370/2826 [3:52:52<45:14, 5.95s/it] {'loss': 0.1432, 'grad_norm': 1.8439091444015503, 'learning_rate': 3.87956453198027e-07, 'epoch': 2.51} 84%|████████▍ | 2370/2826 [3:52:52<45:14, 5.95s/it] 84%|████████▍ | 2371/2826 [3:52:57<43:06, 5.68s/it] 84%|████████▍ | 2372/2826 [3:53:03<43:30, 5.75s/it] 84%|████████▍ | 2373/2826 [3:53:09<43:59, 5.83s/it] 84%|████████▍ | 2374/2826 [3:53:16<46:14, 6.14s/it] 84%|████████▍ | 2375/2826 [3:53:22<45:16, 6.02s/it] 84%|████████▍ | 2376/2826 [3:53:27<44:22, 5.92s/it] 84%|████████▍ | 2377/2826 [3:53:33<43:41, 5.84s/it] 84%|████████▍ | 2378/2826 [3:53:39<43:55, 5.88s/it] 84%|████████▍ | 2379/2826 [3:53:45<43:29, 5.84s/it] 84%|████████▍ | 2380/2826 [3:53:51<44:09, 5.94s/it] {'loss': 0.1834, 'grad_norm': 2.3093338012695312, 'learning_rate': 3.715930309755389e-07, 'epoch': 2.52} 84%|████████▍ | 2380/2826 [3:53:51<44:09, 5.94s/it] 84%|████████▍ | 2381/2826 [3:53:57<44:37, 6.02s/it] 84%|████████▍ | 2382/2826 [3:54:03<43:21, 5.86s/it] 84%|████████▍ | 2383/2826 [3:54:08<42:30, 5.76s/it] 84%|████████▍ | 2384/2826 [3:54:14<42:46, 5.81s/it] 84%|████████▍ | 2385/2826 [3:54:19<41:07, 5.59s/it] 84%|████████▍ | 2386/2826 [3:54:24<39:50, 
5.43s/it] 84%|████████▍ | 2387/2826 [3:54:30<39:53, 5.45s/it] 85%|████████▍ | 2388/2826 [3:54:35<40:12, 5.51s/it] 85%|████████▍ | 2389/2826 [3:54:41<40:39, 5.58s/it] 85%|████████▍ | 2390/2826 [3:54:47<40:25, 5.56s/it] {'loss': 0.2123, 'grad_norm': 2.3250088691711426, 'learning_rate': 3.5555443878425635e-07, 'epoch': 2.53} 85%|████████▍ | 2390/2826 [3:54:47<40:25, 5.56s/it] 85%|████████▍ | 2391/2826 [3:54:52<39:14, 5.41s/it] 85%|████████▍ | 2392/2826 [3:54:57<39:20, 5.44s/it] 85%|████████▍ | 2393/2826 [3:55:03<40:15, 5.58s/it] 85%|████████▍ | 2394/2826 [3:55:09<40:40, 5.65s/it] 85%|████████▍ | 2395/2826 [3:55:14<39:20, 5.48s/it] 85%|████████▍ | 2396/2826 [3:55:19<38:34, 5.38s/it] 85%|████████▍ | 2397/2826 [3:55:24<38:05, 5.33s/it] 85%|████████▍ | 2398/2826 [3:55:30<38:19, 5.37s/it] 85%|████████▍ | 2399/2826 [3:55:35<37:45, 5.31s/it] 85%|████████▍ | 2400/2826 [3:55:41<38:55, 5.48s/it] {'loss': 0.2034, 'grad_norm': 1.8003133535385132, 'learning_rate': 3.398431243780531e-07, 'epoch': 2.55} 85%|████████▍ | 2400/2826 [3:55:41<38:55, 5.48s/it] 85%|████████▍ | 2401/2826 [3:55:47<40:42, 5.75s/it] 85%|████████▍ | 2402/2826 [3:55:54<42:21, 5.99s/it] 85%|████████▌ | 2403/2826 [3:55:59<40:38, 5.76s/it] 85%|████████▌ | 2404/2826 [3:56:05<40:30, 5.76s/it] 85%|████████▌ | 2405/2826 [3:56:10<39:25, 5.62s/it] 85%|████████▌ | 2406/2826 [3:56:15<38:53, 5.55s/it] 85%|████████▌ | 2407/2826 [3:56:22<40:34, 5.81s/it] 85%|████████▌ | 2408/2826 [3:56:29<42:39, 6.12s/it] 85%|████████▌ | 2409/2826 [3:56:35<42:51, 6.17s/it] 85%|████████▌ | 2410/2826 [3:56:40<40:33, 5.85s/it] {'loss': 0.1778, 'grad_norm': 2.8948135375976562, 'learning_rate': 3.2446148556281117e-07, 'epoch': 2.56} 85%|████████▌ | 2410/2826 [3:56:40<40:33, 5.85s/it] 85%|████████▌ | 2411/2826 [3:56:45<38:52, 5.62s/it] 85%|████████▌ | 2412/2826 [3:56:51<39:08, 5.67s/it] 85%|████████▌ | 2413/2826 [3:56:57<39:07, 5.68s/it] 85%|████████▌ | 2414/2826 [3:57:02<37:38, 5.48s/it] 85%|████████▌ | 2415/2826 [3:57:07<37:42, 5.50s/it] 
85%|████████▌ | 2416/2826 [3:57:12<37:12, 5.45s/it] 86%|████████▌ | 2417/2826 [3:57:20<40:18, 5.91s/it] 86%|████████▌ | 2418/2826 [3:57:25<39:55, 5.87s/it] 86%|████████▌ | 2419/2826 [3:57:30<38:10, 5.63s/it] 86%|████████▌ | 2420/2826 [3:57:36<39:02, 5.77s/it] {'loss': 0.1892, 'grad_norm': 1.8556360006332397, 'learning_rate': 3.0941186983047543e-07, 'epoch': 2.57} 86%|████████▌ | 2420/2826 [3:57:36<39:02, 5.77s/it] 86%|████████▌ | 2421/2826 [3:57:43<40:13, 5.96s/it] 86%|████████▌ | 2422/2826 [3:57:48<38:29, 5.72s/it] 86%|████████▌ | 2423/2826 [3:57:54<39:12, 5.84s/it] 86%|████████▌ | 2424/2826 [3:57:59<37:43, 5.63s/it] 86%|████████▌ | 2425/2826 [3:58:04<36:16, 5.43s/it] 86%|████████▌ | 2426/2826 [3:58:10<37:36, 5.64s/it] 86%|████████▌ | 2427/2826 [3:58:17<39:01, 5.87s/it] 86%|████████▌ | 2428/2826 [3:58:22<37:58, 5.72s/it] 86%|████████▌ | 2429/2826 [3:58:28<37:57, 5.74s/it] 86%|████████▌ | 2430/2826 [3:58:33<36:23, 5.51s/it] {'loss': 0.1935, 'grad_norm': 2.771932363510132, 'learning_rate': 2.9469657400078925e-07, 'epoch': 2.58} 86%|████████▌ | 2430/2826 [3:58:33<36:23, 5.51s/it] 86%|████████▌ | 2431/2826 [3:58:38<36:04, 5.48s/it] 86%|████████▌ | 2432/2826 [3:58:45<37:46, 5.75s/it] 86%|████████▌ | 2433/2826 [3:58:51<38:08, 5.82s/it] 86%|████████▌ | 2434/2826 [3:58:56<36:33, 5.60s/it] 86%|████████▌ | 2435/2826 [3:59:01<36:02, 5.53s/it] 86%|████████▌ | 2436/2826 [3:59:06<35:16, 5.43s/it] 86%|████████▌ | 2437/2826 [3:59:12<35:44, 5.51s/it] 86%|████████▋ | 2438/2826 [3:59:19<38:57, 6.02s/it] 86%|████████▋ | 2439/2826 [3:59:24<37:04, 5.75s/it] 86%|████████▋ | 2440/2826 [3:59:29<35:36, 5.54s/it] {'loss': 0.1858, 'grad_norm': 2.5325114727020264, 'learning_rate': 2.8031784387076186e-07, 'epoch': 2.59} 86%|████████▋ | 2440/2826 [3:59:29<35:36, 5.54s/it] 86%|████████▋ | 2441/2826 [3:59:34<34:37, 5.40s/it] 86%|████████▋ | 2442/2826 [3:59:40<34:14, 5.35s/it] 86%|████████▋ | 2443/2826 [3:59:46<35:04, 5.49s/it] 86%|████████▋ | 2444/2826 [3:59:51<34:15, 5.38s/it] 87%|████████▋ | 
2445/2826 [3:59:56<34:37, 5.45s/it] 87%|████████▋ | 2446/2826 [4:00:02<34:34, 5.46s/it] 87%|████████▋ | 2447/2826 [4:00:08<36:16, 5.74s/it] 87%|████████▋ | 2448/2826 [4:00:13<35:17, 5.60s/it] 87%|████████▋ | 2449/2826 [4:00:20<37:44, 6.01s/it] 87%|████████▋ | 2450/2826 [4:00:25<35:57, 5.74s/it] {'loss': 0.2118, 'grad_norm': 2.4069302082061768, 'learning_rate': 2.6627787387191934e-07, 'epoch': 2.6} 87%|████████▋ | 2450/2826 [4:00:25<35:57, 5.74s/it] 87%|████████▋ | 2451/2826 [4:00:31<34:38, 5.54s/it] 87%|████████▋ | 2452/2826 [4:00:36<34:11, 5.49s/it] 87%|████████▋ | 2453/2826 [4:00:41<34:12, 5.50s/it] 87%|████████▋ | 2454/2826 [4:00:48<36:09, 5.83s/it] 87%|████████▋ | 2455/2826 [4:00:53<34:42, 5.61s/it] 87%|████████▋ | 2456/2826 [4:00:58<33:52, 5.49s/it] 87%|████████▋ | 2457/2826 [4:01:04<33:06, 5.38s/it] 87%|████████▋ | 2458/2826 [4:01:10<34:17, 5.59s/it] 87%|████████▋ | 2459/2826 [4:01:15<33:33, 5.49s/it] 87%|████████▋ | 2460/2826 [4:01:21<34:37, 5.68s/it] {'loss': 0.1929, 'grad_norm': 2.053656816482544, 'learning_rate': 2.5257880673540376e-07, 'epoch': 2.61} 87%|████████▋ | 2460/2826 [4:01:21<34:37, 5.68s/it] 87%|████████▋ | 2461/2826 [4:01:27<35:43, 5.87s/it] 87%|████████▋ | 2462/2826 [4:01:32<34:20, 5.66s/it] 87%|████████▋ | 2463/2826 [4:01:38<33:20, 5.51s/it] 87%|████████▋ | 2464/2826 [4:01:43<33:13, 5.51s/it] 87%|████████▋ | 2465/2826 [4:01:48<32:47, 5.45s/it] 87%|████████▋ | 2466/2826 [4:01:55<34:01, 5.67s/it] 87%|████████▋ | 2467/2826 [4:02:01<35:15, 5.89s/it] 87%|████████▋ | 2468/2826 [4:02:06<33:49, 5.67s/it] 87%|████████▋ | 2469/2826 [4:02:12<33:38, 5.66s/it] 87%|████████▋ | 2470/2826 [4:02:18<34:17, 5.78s/it] {'loss': 0.1745, 'grad_norm': 1.8820626735687256, 'learning_rate': 2.392227331649527e-07, 'epoch': 2.62} 87%|████████▋ | 2470/2826 [4:02:18<34:17, 5.78s/it] 87%|████████▋ | 2471/2826 [4:02:25<36:53, 6.23s/it] 87%|████████▋ | 2472/2826 [4:02:31<36:49, 6.24s/it] 88%|████████▊ | 2473/2826 [4:02:37<34:55, 5.94s/it] 88%|████████▊ | 2474/2826 
[4:02:42<33:14, 5.67s/it] 88%|████████▊ | 2475/2826 [4:02:47<33:20, 5.70s/it] 88%|████████▊ | 2476/2826 [4:02:52<32:06, 5.50s/it] 88%|████████▊ | 2477/2826 [4:02:58<32:24, 5.57s/it] 88%|████████▊ | 2478/2826 [4:03:04<32:45, 5.65s/it] 88%|████████▊ | 2479/2826 [4:03:09<32:06, 5.55s/it] 88%|████████▊ | 2480/2826 [4:03:15<32:24, 5.62s/it] {'loss': 0.1823, 'grad_norm': 1.9418586492538452, 'learning_rate': 2.2621169151782417e-07, 'epoch': 2.63} 88%|████████▊ | 2480/2826 [4:03:15<32:24, 5.62s/it] 88%|████████▊ | 2481/2826 [4:03:21<32:21, 5.63s/it] 88%|████████▊ | 2482/2826 [4:03:27<32:35, 5.69s/it] 88%|████████▊ | 2483/2826 [4:03:32<31:38, 5.53s/it] 88%|████████▊ | 2484/2826 [4:03:37<30:42, 5.39s/it] 88%|████████▊ | 2485/2826 [4:03:42<30:08, 5.30s/it] 88%|████████▊ | 2486/2826 [4:03:48<31:41, 5.59s/it] 88%|████████▊ | 2487/2826 [4:03:54<31:32, 5.58s/it] 88%|████████▊ | 2488/2826 [4:04:00<31:43, 5.63s/it] 88%|████████▊ | 2489/2826 [4:04:05<31:34, 5.62s/it] 88%|████████▊ | 2490/2826 [4:04:11<31:21, 5.60s/it] {'loss': 0.2037, 'grad_norm': 2.519037961959839, 'learning_rate': 2.1354766749371093e-07, 'epoch': 2.64} 88%|████████▊ | 2490/2826 [4:04:11<31:21, 5.60s/it] 88%|████████▊ | 2491/2826 [4:04:17<31:48, 5.70s/it] 88%|████████▊ | 2492/2826 [4:04:22<30:45, 5.52s/it] 88%|████████▊ | 2493/2826 [4:04:27<29:59, 5.40s/it] 88%|████████▊ | 2494/2826 [4:04:33<31:57, 5.78s/it] 88%|████████▊ | 2495/2826 [4:04:39<30:45, 5.58s/it] 88%|████████▊ | 2496/2826 [4:04:45<32:42, 5.95s/it] 88%|████████▊ | 2497/2826 [4:04:52<33:57, 6.19s/it] 88%|████████▊ | 2498/2826 [4:04:58<32:49, 6.00s/it] 88%|████████▊ | 2499/2826 [4:05:03<31:25, 5.77s/it] 88%|████████▊ | 2500/2826 [4:05:10<34:02, 6.27s/it] {'loss': 0.2196, 'grad_norm': 2.010211944580078, 'learning_rate': 2.0123259383169031e-07, 'epoch': 2.65} 88%|████████▊ | 2500/2826 [4:05:10<34:02, 6.27s/it] 88%|████████▊ | 2501/2826 [4:05:17<33:53, 6.26s/it] 89%|████████▊ | 2502/2826 [4:05:22<31:51, 5.90s/it] 89%|████████▊ | 2503/2826 [4:05:27<30:32, 
5.67s/it] 89%|████████▊ | 2504/2826 [4:05:32<30:05, 5.61s/it] 89%|████████▊ | 2505/2826 [4:05:38<29:39, 5.55s/it] 89%|████████▊ | 2506/2826 [4:05:43<29:51, 5.60s/it] 89%|████████▊ | 2507/2826 [4:05:50<30:35, 5.76s/it] 89%|████████▊ | 2508/2826 [4:05:55<29:29, 5.56s/it] 89%|████████▉ | 2509/2826 [4:06:00<29:27, 5.58s/it] 89%|████████▉ | 2510/2826 [4:06:05<28:49, 5.47s/it] {'loss': 0.1848, 'grad_norm': 1.9838532209396362, 'learning_rate': 1.8926835001525257e-07, 'epoch': 2.66} 89%|████████▉ | 2510/2826 [4:06:05<28:49, 5.47s/it] 89%|████████▉ | 2511/2826 [4:06:11<29:01, 5.53s/it] 89%|████████▉ | 2512/2826 [4:06:16<28:19, 5.41s/it] 89%|████████▉ | 2513/2826 [4:06:23<30:10, 5.78s/it] 89%|████████▉ | 2514/2826 [4:06:28<29:03, 5.59s/it] 89%|████████▉ | 2515/2826 [4:06:33<28:38, 5.53s/it] 89%|████████▉ | 2516/2826 [4:06:39<27:51, 5.39s/it] 89%|████████▉ | 2517/2826 [4:06:44<27:15, 5.29s/it] 89%|████████▉ | 2518/2826 [4:06:49<26:48, 5.22s/it] 89%|████████▉ | 2519/2826 [4:06:55<28:33, 5.58s/it] 89%|████████▉ | 2520/2826 [4:07:00<27:43, 5.44s/it] {'loss': 0.1823, 'grad_norm': 2.3488149642944336, 'learning_rate': 1.776567619854655e-07, 'epoch': 2.67} 89%|████████▉ | 2520/2826 [4:07:00<27:43, 5.44s/it] 89%|████████▉ | 2521/2826 [4:07:05<27:05, 5.33s/it] 89%|████████▉ | 2522/2826 [4:07:11<28:08, 5.55s/it] 89%|████████▉ | 2523/2826 [4:07:16<27:21, 5.42s/it] 89%|████████▉ | 2524/2826 [4:07:22<26:52, 5.34s/it] 89%|████████▉ | 2525/2826 [4:07:29<29:32, 5.89s/it] 89%|████████▉ | 2526/2826 [4:07:35<29:17, 5.86s/it] 89%|████████▉ | 2527/2826 [4:07:41<30:01, 6.02s/it] 89%|████████▉ | 2528/2826 [4:07:48<31:50, 6.41s/it] 89%|████████▉ | 2529/2826 [4:07:54<30:25, 6.15s/it] 90%|████████▉ | 2530/2826 [4:08:00<30:51, 6.26s/it] {'loss': 0.2039, 'grad_norm': 2.839651584625244, 'learning_rate': 1.6639960186230293e-07, 'epoch': 2.68} 90%|████████▉ | 2530/2826 [4:08:00<30:51, 6.26s/it] 90%|████████▉ | 2531/2826 [4:08:05<29:05, 5.92s/it] 90%|████████▉ | 2532/2826 [4:08:12<29:48, 6.08s/it] 
90%|████████▉ | 2533/2826 [4:08:19<31:31, 6.45s/it] 90%|████████▉ | 2534/2826 [4:08:24<29:25, 6.05s/it] 90%|████████▉ | 2535/2826 [4:08:29<27:43, 5.72s/it] 90%|████████▉ | 2536/2826 [4:08:34<26:45, 5.54s/it] 90%|████████▉ | 2537/2826 [4:08:40<26:13, 5.44s/it] 90%|████████▉ | 2538/2826 [4:08:45<25:45, 5.37s/it] 90%|████████▉ | 2539/2826 [4:08:51<26:14, 5.49s/it] 90%|████████▉ | 2540/2826 [4:08:57<26:51, 5.63s/it] {'loss': 0.1796, 'grad_norm': 2.050480842590332, 'learning_rate': 1.5549858767419018e-07, 'epoch': 2.69} 90%|████████▉ | 2540/2826 [4:08:57<26:51, 5.63s/it] 90%|████████▉ | 2541/2826 [4:09:03<28:40, 6.04s/it] 90%|████████▉ | 2542/2826 [4:09:09<27:17, 5.77s/it] 90%|████████▉ | 2543/2826 [4:09:15<28:02, 5.95s/it] 90%|█████████ | 2544/2826 [4:09:22<29:02, 6.18s/it] 90%|█████████ | 2545/2826 [4:09:28<28:26, 6.07s/it] 90%|█████████ | 2546/2826 [4:09:33<26:58, 5.78s/it] 90%|█████████ | 2547/2826 [4:09:39<27:19, 5.88s/it] 90%|█████████ | 2548/2826 [4:09:44<26:22, 5.69s/it] 90%|█████████ | 2549/2826 [4:09:51<27:42, 6.00s/it] 90%|█████████ | 2550/2826 [4:09:57<27:41, 6.02s/it] {'loss': 0.1893, 'grad_norm': 1.2738044261932373, 'learning_rate': 1.449553830958053e-07, 'epoch': 2.7} 90%|█████████ | 2550/2826 [4:09:57<27:41, 6.02s/it] 90%|█████████ | 2551/2826 [4:10:02<26:16, 5.73s/it] 90%|█████████ | 2552/2826 [4:10:07<25:27, 5.58s/it] 90%|█████████ | 2553/2826 [4:10:14<26:58, 5.93s/it] 90%|█████████ | 2554/2826 [4:10:20<26:41, 5.89s/it] 90%|█████████ | 2555/2826 [4:10:26<26:46, 5.93s/it] 90%|█████████ | 2556/2826 [4:10:33<28:20, 6.30s/it] 90%|█████████ | 2557/2826 [4:10:38<27:09, 6.06s/it] 91%|█████████ | 2558/2826 [4:10:44<27:07, 6.07s/it] 91%|█████████ | 2559/2826 [4:10:50<26:45, 6.01s/it] 91%|█████████ | 2560/2826 [4:10:56<26:51, 6.06s/it] {'loss': 0.1947, 'grad_norm': 1.8912787437438965, 'learning_rate': 1.347715971941746e-07, 'epoch': 2.72} 91%|█████████ | 2560/2826 [4:10:56<26:51, 6.06s/it] 91%|█████████ | 2561/2826 [4:11:02<25:46, 5.83s/it] 91%|█████████ | 
[per-step progress bars omitted; steps 2562-2826 of 2826 completed at ~5.4-6.6 s/it. Metrics logged every 10 steps:]
2570/2826: {'loss': 0.1744, 'grad_norm': 1.8385730981826782, 'learning_rate': 1.2494878418310234e-07, 'epoch': 2.73}
2580/2826: {'loss': 0.2351, 'grad_norm': 2.1071712970733643, 'learning_rate': 1.1548844318597208e-07, 'epoch': 2.74}
2590/2826: {'loss': 0.2245, 'grad_norm': 2.054392099380493, 'learning_rate': 1.0639201800695553e-07, 'epoch': 2.75}
2600/2826: {'loss': 0.2014, 'grad_norm': 1.656562328338623, 'learning_rate': 9.76608969106646e-08, 'epoch': 2.76}
2610/2826: {'loss': 0.1824, 'grad_norm': 2.6887638568878174, 'learning_rate': 8.929641241027937e-08, 'epoch': 2.77}
2620/2826: {'loss': 0.1706, 'grad_norm': 2.4606659412384033, 'learning_rate': 8.129984106418354e-08, 'epoch': 2.78}
2630/2826: {'loss': 0.2195, 'grad_norm': 2.5548455715179443, 'learning_rate': 7.3672403281142e-08, 'epoch': 2.79}
2640/2826: {'loss': 0.1748, 'grad_norm': 1.7952167987823486, 'learning_rate': 6.641526313404534e-08, 'epoch': 2.8}
2650/2826: {'loss': 0.2061, 'grad_norm': 2.376830816268921, 'learning_rate': 5.952952818225416e-08, 'epoch': 2.81}
2660/2826: {'loss': 0.1742, 'grad_norm': 1.7183632850646973, 'learning_rate': 5.3016249302565436e-08, 'epoch': 2.82}
2670/2826: {'loss': 0.2082, 'grad_norm': 2.11011004447937, 'learning_rate': 4.6876420528833014e-08, 'epoch': 2.83}
2680/2826: {'loss': 0.1805, 'grad_norm': 1.8799868822097778, 'learning_rate': 4.111097890026089e-08, 'epoch': 2.84}
2690/2826: {'loss': 0.2058, 'grad_norm': 2.5171291828155518, 'learning_rate': 3.5720804318395976e-08, 'epoch': 2.85}
2700/2826: {'loss': 0.2027, 'grad_norm': 2.142263650894165, 'learning_rate': 3.0706719412839926e-08, 'epoch': 2.86}
2710/2826: {'loss': 0.1941, 'grad_norm': 2.2124040126800537, 'learning_rate': 2.6069489415703197e-08, 'epoch': 2.87}
2720/2826: {'loss': 0.2029, 'grad_norm': 2.033259153366089, 'learning_rate': 2.18098220448168e-08, 'epoch': 2.88}
2730/2826: {'loss': 0.2062, 'grad_norm': 2.416912794113159, 'learning_rate': 1.7928367395725066e-08, 'epoch': 2.9}
2740/2826: {'loss': 0.1873, 'grad_norm': 2.193751096725464, 'learning_rate': 1.442571784246699e-08, 'epoch': 2.91}
2750/2826: {'loss': 0.1653, 'grad_norm': 1.5729731321334839, 'learning_rate': 1.1302407947173522e-08, 'epoch': 2.92}
2760/2826: {'loss': 0.1743, 'grad_norm': 1.7562044858932495, 'learning_rate': 8.558914378481996e-09, 'epoch': 2.93}
2770/2826: {'loss': 0.1821, 'grad_norm': 2.183967351913452, 'learning_rate': 6.195655838790726e-09, 'epoch': 2.94}
2780/2826: {'loss': 0.1954, 'grad_norm': 1.9312433004379272, 'learning_rate': 4.212993000356491e-09, 'epoch': 2.95}
2790/2826: {'loss': 0.1925, 'grad_norm': 2.2055087089538574, 'learning_rate': 2.611228450250802e-09, 'epoch': 2.96}
2800/2826: {'loss': 0.1805, 'grad_norm': 1.6606404781341553, 'learning_rate': 1.3906066441798927e-09, 'epoch': 2.97}
2810/2826: {'loss': 0.2084, 'grad_norm': 2.594404458999634, 'learning_rate': 5.513138691767839e-10, 'epoch': 2.98}
2820/2826: {'loss': 0.2115, 'grad_norm': 2.007861375808716, 'learning_rate': 9.347821517069477e-11, 'epoch': 2.99}
100%|██████████| 2826/2826 [4:36:59<00:00, 5.66s/it]
[INFO|trainer.py:3984] 2025-10-18 11:23:15,155 >> Saving model checkpoint to
/mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826
[INFO|configuration_utils.py:419] 2025-10-18 11:23:15,160 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/config.json
[INFO|configuration_utils.py:911] 2025-10-18 11:23:15,162 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 11:23:35,979 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 11:23:35,982 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 11:23:35,983 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/special_tokens_map.json
[2025-10-18 11:23:36,183] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step2825 is about to be saved!
[2025-10-18 11:23:36,615] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-10-18 11:23:36,615] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-10-18 11:23:36,633] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-10-18 11:23:36,637] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-10-18 11:23:55,201] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-10-18 11:23:55,202] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-10-18 11:23:55,663] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2825 is ready now!
[INFO|trainer.py:2681] 2025-10-18 11:23:55,685 >> Training completed.
Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 16671.2674, 'train_samples_per_second': 2.713, 'train_steps_per_second': 0.17, 'train_loss': 0.34044326600333263, 'epoch': 3.0}
100%|██████████| 2826/2826 [4:37:51<00:00, 5.90s/it]
[INFO|trainer.py:3984] 2025-10-18 11:24:06,471 >> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17
[INFO|configuration_utils.py:419] 2025-10-18 11:24:06,477 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/config.json
[INFO|configuration_utils.py:911] 2025-10-18 11:24:06,480 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 11:24:26,439 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 11:24:26,442 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 11:24:26,443 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/special_tokens_map.json
***** train metrics *****
  epoch                    =     2.9973
  total_flos               = 101656586GF
  train_loss               =     0.3404
  train_runtime            = 4:37:51.26
  train_samples_per_second =      2.713
  train_steps_per_second   =       0.17
Figure saved at: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/training_loss.png
[WARNING|2025-10-18 11:24:27] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[WARNING|2025-10-18 11:24:27] llamafactory.extras.ploting:148 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:450] 2025-10-18 11:24:27,224 >> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
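The `{'loss': ..., 'grad_norm': ..., 'learning_rate': ..., 'epoch': ...}` dicts that Transformers prints every logging step (as seen throughout this log) can be recovered from a raw log dump with a short script. This is a hypothetical helper, not part of the run above; the `parse_metrics` name and the regex are assumptions based on the dict format shown in the log:

```python
import ast
import re

# Match the literal metric dicts that the HF Trainer prints at each logging
# step, e.g. {'loss': 0.1744, 'grad_norm': 1.83, 'learning_rate': 1.2e-07, 'epoch': 2.73}
METRIC_RE = re.compile(r"\{'loss':[^}]*\}")

def parse_metrics(log_text: str) -> list[dict]:
    """Extract logged metric dicts (loss, grad_norm, learning_rate, epoch)."""
    return [ast.literal_eval(match) for match in METRIC_RE.findall(log_text)]

# Example on a line shaped like the ones in this log (values abbreviated):
sample = ("91%| | 2570/2826 [4:11:56<26:21, 6.18s/it] "
          "{'loss': 0.1744, 'grad_norm': 1.8386, "
          "'learning_rate': 1.2495e-07, 'epoch': 2.73}")
records = parse_metrics(sample)
print(records[0]["loss"])  # -> 0.1744
```

The extracted records can then be fed to any plotting or tabulation tool; LLaMA-Factory itself produces the `training_loss.png` figure mentioned above from the same logged history.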