[2025-02-21 03:00:06,780] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 03:00:06,793] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 03:00:06,797] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 03:00:06,797] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 03:00:06,797] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 03:00:06,797] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 03:00:06,797] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 02-21 03:00:12 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 03:00:12 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 03:00:12 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 03:00:12 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 03:00:12 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 03:00:12 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 03:00:12 __init__.py:190] Automatically detected platform cuda.
[2025-02-21 03:00:17,584] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 03:00:17,584] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-02-21 03:00:17,588] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 03:00:17,592] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 03:00:17,595] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 03:00:17,598] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 03:00:17,620] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 03:00:17,624] [INFO] [comm.py:652:init_distributed] cdb=None
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 4508 examples [00:00, 35821.11 examples/s]
Generating train split: 4508 examples [00:00, 35633.78 examples/s]
Map: 0%| | 0/4508 [00:00
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Map: 100%|██████████| 4508/4508 [00:00<00:00, 12559.56 examples/s]
[2025-02-21 03:00:18,995] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Map: 100%|██████████| 4508/4508 [00:00<00:00, 13160.81 examples/s]
Map: 100%|██████████| 4508/4508 [00:00<00:00, 13172.20 examples/s]
[2025-02-21 03:00:19,039] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3650919 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
Map: 100%|██████████| 4508/4508 [00:00<00:00, 11890.32 examples/s]
[2025-02-21 03:00:19,065] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3650924 [5] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3650924 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3650920 [1] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3650920 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3650924 [5] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3650920 [1] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3650925 [6] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3650925 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3650925 [6] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3650922 [3] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3650922 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3650921 [2] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3650922 [3] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3650921 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 
p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3650921 [2] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3650923 [4] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3650923 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3650923 [4] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO P2P plugin IBext_v8 
p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB 
bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO ncclCommInitRank comm 0x55f1432b8ed0 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 59000 commId 0xed83bdb4b6159d06 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO ncclCommInitRank comm 0x563f8ba04490 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2d000 commId 0xed83bdb4b6159d06 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO ncclCommInitRank comm 0x55fcb5df2e20 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 commId 0xed83bdb4b6159d06 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO ncclCommInitRank comm 0x5628d6593600 rank 2 nranks 7 cudaDev 2 nvmlDev 2 
busId 54000 commId 0xed83bdb4b6159d06 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO ncclCommInitRank comm 0x5572b545e210 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId bf000 commId 0xed83bdb4b6159d06 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO ncclCommInitRank comm 0x5646ab5579c0 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 92000 commId 0xed83bdb4b6159d06 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO ncclCommInitRank comm 0x5629d49b4430 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8d000 commId 0xed83bdb4b6159d06 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO NVLS multicast support is not available on dev 4 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO NVLS multicast support is not available on dev 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO NVLS multicast support is not available on dev 3 p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. 
p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO NVLS multicast support is not available on dev 2 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO NVLS multicast support is not available on dev 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO NVLS multicast support is not available on dev 5 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO NVLS multicast support is not available on dev 1 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO comm 0x55fcb5df2e20 rank 0 nRanks 7 nNodes 1 localRanks 7 localRank 0 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO comm 0x5628d6593600 rank 2 nRanks 7 nNodes 1 localRanks 7 localRank 2 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO comm 0x5572b545e210 rank 6 nRanks 7 nNodes 1 localRanks 7 localRank 6 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO comm 0x563f8ba04490 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO comm 0x5629d49b4430 rank 4 
nRanks 7 nNodes 1 localRanks 7 localRank 4 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO comm 0x5646ab5579c0 rank 5 nRanks 7 nNodes 1 localRanks 7 localRank 5 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO comm 0x55f1432b8ed0 rank 3 nRanks 7 nNodes 1 localRanks 7 localRank 3 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 
2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] 
via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 04/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] 
NCCL INFO Channel 05/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 06/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 07/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 08/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 09/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 10/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 
10/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 11/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 12/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 13/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 14/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 15/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 
14/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 
02/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 
11/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer 
p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO 
threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3653273 [1] NCCL INFO ncclCommInitRank comm 0x563f8ba04490 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2d000 commId 0xed83bdb4b6159d06 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3653281 [4] NCCL INFO ncclCommInitRank comm 0x5629d49b4430 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8d000 commId 0xed83bdb4b6159d06 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3653255 [0] NCCL INFO ncclCommInitRank comm 0x55fcb5df2e20 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 commId 0xed83bdb4b6159d06 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3653278 [3] NCCL INFO ncclCommInitRank comm 0x55f1432b8ed0 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 59000 commId 0xed83bdb4b6159d06 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. 
p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3653272 [5] NCCL INFO ncclCommInitRank comm 0x5646ab5579c0 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 92000 commId 0xed83bdb4b6159d06 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3653279 [2] NCCL INFO ncclCommInitRank comm 0x5628d6593600 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 54000 commId 0xed83bdb4b6159d06 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3653275 [6] NCCL INFO ncclCommInitRank comm 0x5572b545e210 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId bf000 commId 0xed83bdb4b6159d06 - Init COMPLETE [2025-02-21 03:00:20,743] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 730, num_elems = 2.44B Loading checkpoint shards: 0%| | 0/2 [00:00 [2025-02-21 03:00:34,562] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-02-21 03:00:34,562] [INFO] [config.py:1003:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-02-21 03:00:34,562] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-02-21 03:00:34,562] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-02-21 03:00:34,562] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] dump_state ................... 
False [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-02-21 03:00:34,563] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] graph_harvesting ............. 
False [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-02-21 03:00:34,564] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] optimizer_name ............... None [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] optimizer_params ............. None [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] pld_params ................... False [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-02-21 03:00:34,565] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] train_batch_size ............. 14 [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1 [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] wall_clock_breakdown ......... False [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] world_size ................... 7 [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-02-21 03:00:34,566] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
3 [2025-02-21 03:00:34,566] [INFO] [config.py:989:print_user_config] json = { "fp16": { "enabled": false, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "none", "pin_memory": true }, "offload_param": { "device": "none", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1.000000e+09, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1.000000e+09, "stage3_max_reuse_distance": 1.000000e+09, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 14, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false, "zero_optimization.reduce_bucket_size": 2.359296e+06, "zero_optimization.stage3_param_persistence_threshold": 1.536000e+04, "zero_optimization.stage3_prefetch_bucket_size": 2.123366e+06 } INFO 02-21 03:00:45 config.py:542] This model supports multiple tasks: {'classify', 'score', 'embed', 'reward', 'generate'}. Defaulting to 'generate'. WARNING 02-21 03:00:45 arg_utils.py:1079] --enable-prefix-caching is currently not supported for multimodal models in v0 and has been disabled. 
INFO 02-21 03:00:45 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/home/vlm/pretrain_model/Qwen2-VL-2B-Instruct', speculative_config=None, tokenizer='/home/vlm/pretrain_model/Qwen2-VL-2B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:7, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/vlm/pretrain_model/Qwen2-VL-2B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, INFO 02-21 03:00:46 cuda.py:230] Using Flash Attention backend. INFO 02-21 03:00:46 model_runner.py:1110] Starting to load model /home/vlm/pretrain_model/Qwen2-VL-2B-Instruct... 
INFO 02-21 03:00:47 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00 32768). Running this sequence through the model will result in indexing errors
WARNING 02-21 03:00:59 profiling.py:187] The context length (32768) of the model is too short to hold the multi-modal embeddings in the worst case (49152 tokens in total, out of which {'image': 16384, 'video': 32768} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
INFO 02-21 03:01:00 worker.py:267] Memory profiling takes 11.57 seconds
INFO 02-21 03:01:00 worker.py:267] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.70) = 55.53GiB
INFO 02-21 03:01:00 worker.py:267] model weights take 0.00GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.00GiB; the rest of the memory reserved for KV Cache is 55.53GiB.
INFO 02-21 03:01:01 executor_base.py:110] # CUDA blocks: 129965, # CPU blocks: 9362
INFO 02-21 03:01:01 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 63.46x
INFO 02-21 03:01:03 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 0%| | 0/35 [00:002->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1
p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO P2P Chunksize set to 524288
p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO P2P Chunksize set to 524288
p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO comm 0x7ef5a8072be0 rank 6 nRanks 7 nNodes 1 localRanks 7 localRank 6 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO comm 0x7fcbb8074260 rank 3 nRanks 7 nNodes 1 localRanks 7 localRank 3 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO comm 0x7f6830072e00 rank 5 nRanks 7 nNodes 1 localRanks 7 localRank 5 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Trees 
[0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via 
P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL 
INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 04/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 05/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 06/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 07/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 
08/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 08/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 09/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 10/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 11/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 12/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 
14/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 13/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 14/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 15/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 
00/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 
09/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO 16 coll channels, 16 collnet 
channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer 
p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512
p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
p-phy-ctyun-gz-a800-node-prod-200-104:3650922:3659553 [3] NCCL INFO ncclCommSplit comm 0x7fcbb8074260 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 59000 parent 0x55f1432b8ed0 color -1326228412 key 3 commId 0x3e3927561a3c993e - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-104:3650920:3659555 [1] NCCL INFO ncclCommSplit comm 0x7f46c0074700 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2d000 parent 0x563f8ba04490 color -1326228412 key 1 commId 0x3e3927561a3c993e - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-104:3650924:3659558 [5] NCCL INFO ncclCommSplit comm 0x7f6830072e00 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 92000 parent 0x5646ab5579c0 color -1326228412 key 5 commId 0x3e3927561a3c993e - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-104:3650925:3659556 [6] NCCL INFO ncclCommSplit comm 0x7ef5a8072be0 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId bf000 parent 0x5572b545e210 color -1326228412 key 6 commId 0x3e3927561a3c993e - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-104:3650921:3659554 [2] NCCL INFO ncclCommSplit comm 0x7f8a00073340 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 54000 parent 0x5628d6593600 color -1326228412 key 2 commId 0x3e3927561a3c993e - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-104:3650923:3659559 [4] NCCL INFO ncclCommSplit comm 0x7fc938073480 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8d000 parent 0x5629d49b4430 color -1326228412 key 4 commId 0x3e3927561a3c993e - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-104:3650919:3659557 [0] NCCL INFO ncclCommSplit comm 0x7f02980731d0 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 parent 0x55fcb5df2e20 color -1326228412 key 0 commId 0x3e3927561a3c993e - Init COMPLETE
0%| | 1/1610 [00:25<11:22:59, 25.47s/it]
{'loss': 0.0, 'grad_norm': 5.395447107108823, 'learning_rate': 9.993788819875776e-07, 'completion_length': 216.2232208251953, 'rewards/accuracy_reward': 0.2500000149011612, 'rewards/format_reward': 0.1875, 'reward': 0.4375000149011612, 'reward_std': 0.5002032816410065, 'kl': 0.0, 'epoch': 0.0}
0%| | 1/1610 [00:25<11:22:59, 25.47s/it]
0%| | 2/1610 [00:39<8:28:41, 18.98s/it]
{'loss': 0.0, 'grad_norm': 12.842537983180788, 'learning_rate': 9.987577639751552e-07, 'completion_length': 164.4196548461914, 'rewards/accuracy_reward': 0.1607142984867096, 'rewards/format_reward': 0.1428571492433548, 'reward': 0.3035714328289032, 'reward_std': 0.4609551876783371, 'kl': 0.00023603439331054688, 'epoch': 0.01}
0%| | 2/1610 [00:39<8:28:41, 18.98s/it]
0%| | 3/1610 [00:55<7:40:55, 17.21s/it]
{'loss': 0.0, 'grad_norm': 1.9496232252368992, 'learning_rate': 9.981366459627329e-07, 'completion_length': 207.56250762939453, 'rewards/accuracy_reward': 0.1160714328289032, 'rewards/format_reward': 0.1517857238650322, 'reward': 0.2678571492433548, 'reward_std': 0.3651883751153946, 'kl': 0.00022125244140625, 'epoch': 0.01}
0%| | 3/1610 [00:55<7:40:55, 17.21s/it]
0%| | 4/1610 [01:09<7:06:46, 15.94s/it]
{'loss': 0.0, 'grad_norm': 3.2687569859906187, 'learning_rate': 9.975155279503105e-07, 'completion_length': 136.08928680419922, 'rewards/accuracy_reward': 0.1517857201397419, 'rewards/format_reward': 0.1785714328289032, 'reward': 0.3303571492433548, 'reward_std': 0.41502685844898224, 'kl': 0.0006580352783203125, 'epoch': 0.01}
0%| | 4/1610 [01:09<7:06:46, 15.94s/it]
0%| | 5/1610 [01:22<6:44:01, 15.10s/it]
{'loss': 0.0, 'grad_norm': 3.2205381816809036, 'learning_rate': 9.968944099378881e-07, 'completion_length': 162.33036041259766, 'rewards/accuracy_reward': 0.2410714402794838, 'rewards/format_reward': 0.196428582072258, 'reward': 0.4375000149011612, 'reward_std': 0.44000063836574554, 'kl': 0.0004892349243164062, 'epoch': 0.02}
0%| | 5/1610 [01:22<6:44:01, 15.10s/it]
0%| | 6/1610 [01:37<6:44:46, 15.14s/it]
{'loss': 0.0001, 'grad_norm': 7.891459754371306, 'learning_rate': 9.962732919254658e-07, 'completion_length': 188.02679443359375, 'rewards/accuracy_reward': 0.1160714365541935, 'rewards/format_reward': 0.2500000074505806, 'reward': 0.3660714477300644, 'reward_std': 0.37571829557418823, 'kl': 0.00272369384765625, 'epoch': 0.02}
0%| | 6/1610 [01:37<6:44:46, 15.14s/it]
0%| | 7/1610 [01:52<6:42:20, 15.06s/it]
{'loss': 0.0001, 'grad_norm': 3.2004353832168633, 'learning_rate': 9.956521739130434e-07, 'completion_length': 205.04464721679688, 'rewards/accuracy_reward': 0.1428571492433548, 'rewards/format_reward': 0.3660714477300644, 'reward': 0.5089285969734192, 'reward_std': 0.6114484965801239, 'kl': 0.00316619873046875, 'epoch': 0.02}
0%| | 7/1610 [01:52<6:42:20, 15.06s/it]
0%| | 8/1610 [02:06<6:29:52, 14.60s/it]
{'loss': 0.0001, 'grad_norm': 2.9088406384029, 'learning_rate': 9.95031055900621e-07, 'completion_length': 178.44644165039062, 'rewards/accuracy_reward': 0.0714285746216774, 'rewards/format_reward': 0.3303571492433548, 'reward': 0.4017857164144516, 'reward_std': 0.4810594916343689, 'kl': 0.00312042236328125, 'epoch': 0.02}
0%| | 8/1610 [02:06<6:29:52, 14.60s/it]
1%| | 9/1610 [02:20<6:22:46, 14.35s/it]
{'loss': 0.0003, 'grad_norm': 3.0524290218545205, 'learning_rate': 9.944099378881986e-07, 'completion_length': 133.6785774230957, 'rewards/accuracy_reward': 0.2410714477300644, 'rewards/format_reward': 0.383928582072258, 'reward': 0.6250000298023224, 'reward_std': 0.566312849521637, 'kl': 0.00732421875, 'epoch': 0.03}
1%| | 9/1610 [02:20<6:22:46, 14.35s/it]
1%| | 10/1610 [02:33<6:15:21, 14.08s/it]
{'loss': 0.0007, 'grad_norm': 2.7526441357534366, 'learning_rate': 9.937888198757763e-07, 'completion_length': 127.84822082519531, 'rewards/accuracy_reward': 0.1071428582072258, 'rewards/format_reward': 0.6607142984867096, 'reward': 0.7678571939468384, 'reward_std': 0.4191117137670517, 'kl': 0.017578125, 'epoch': 0.03}
1%| | 10/1610 [02:33<6:15:21, 14.08s/it]
1%| | 11/1610 
[02:47<6:11:00, 13.92s/it] {'loss': 0.001, 'grad_norm': 2.85111529869872, 'learning_rate': 9.93167701863354e-07, 'completion_length': 121.08036422729492, 'rewards/accuracy_reward': 0.098214291036129, 'rewards/format_reward': 0.8035714626312256, 'reward': 0.9017857611179352, 'reward_std': 0.42253391444683075, 'kl': 0.024658203125, 'epoch': 0.03} 1%| | 11/1610 [02:47<6:11:00, 13.92s/it] 1%| | 12/1610 [03:00<6:03:17, 13.64s/it] {'loss': 0.0011, 'grad_norm': 2.284149935863589, 'learning_rate': 9.925465838509315e-07, 'completion_length': 82.27679061889648, 'rewards/accuracy_reward': 0.1785714402794838, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.0535714626312256, 'reward_std': 0.3913768082857132, 'kl': 0.02703857421875, 'epoch': 0.04} 1%| | 12/1610 [03:00<6:03:17, 13.64s/it] 1%| | 13/1610 [03:12<5:53:43, 13.29s/it] {'loss': 0.0011, 'grad_norm': 2.1739237047536615, 'learning_rate': 9.919254658385092e-07, 'completion_length': 101.08036041259766, 'rewards/accuracy_reward': 0.1785714402794838, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.035714328289032, 'reward_std': 0.38699066638946533, 'kl': 0.02691650390625, 'epoch': 0.04} 1%| | 13/1610 [03:12<5:53:43, 13.29s/it] 1%| | 14/1610 [03:24<5:45:22, 12.98s/it] {'loss': 0.0013, 'grad_norm': 1.9185563709161042, 'learning_rate': 9.91304347826087e-07, 'completion_length': 99.25000381469727, 'rewards/accuracy_reward': 0.07142857648432255, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.0089285969734192, 'reward_std': 0.2628158777952194, 'kl': 0.0316162109375, 'epoch': 0.04} 1%| | 14/1610 [03:24<5:45:22, 12.98s/it] 1%| | 15/1610 [03:35<5:27:07, 12.31s/it] {'loss': 0.0015, 'grad_norm': 3.4147703027355525, 'learning_rate': 9.906832298136647e-07, 'completion_length': 78.33928680419922, 'rewards/accuracy_reward': 0.160714291036129, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.0892857313156128, 'reward_std': 0.3620052635669708, 'kl': 0.0372314453125, 'epoch': 0.05} 1%| | 15/1610 
[03:35<5:27:07, 12.31s/it] 1%| | 16/1610 [03:45<5:06:08, 11.52s/it] {'loss': 0.0016, 'grad_norm': 8.771959233064825, 'learning_rate': 9.900621118012423e-07, 'completion_length': 76.93750381469727, 'rewards/accuracy_reward': 0.1071428656578064, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.0267857909202576, 'reward_std': 0.28765417635440826, 'kl': 0.041015625, 'epoch': 0.05} 1%| | 16/1610 [03:45<5:06:08, 11.52s/it] 1%| | 17/1610 [03:58<5:15:33, 11.89s/it] {'loss': 0.0014, 'grad_norm': 2.4296006474825496, 'learning_rate': 9.8944099378882e-07, 'completion_length': 106.77679443359375, 'rewards/accuracy_reward': 0.0892857164144516, 'rewards/format_reward': 0.9196429252624512, 'reward': 1.008928656578064, 'reward_std': 0.27695459872484207, 'kl': 0.0360107421875, 'epoch': 0.05} 1%| | 17/1610 [03:58<5:15:33, 11.89s/it] 1%| | 18/1610 [04:08<5:01:53, 11.38s/it] {'loss': 0.0016, 'grad_norm': 3.0489705926557527, 'learning_rate': 9.888198757763976e-07, 'completion_length': 82.83036041259766, 'rewards/accuracy_reward': 0.16964286006987095, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.1071429252624512, 'reward_std': 0.22584806382656097, 'kl': 0.0390625, 'epoch': 0.06} 1%| | 18/1610 [04:08<5:01:53, 11.38s/it] 1%| | 19/1610 [04:17<4:47:32, 10.84s/it] {'loss': 0.0015, 'grad_norm': 6.929723362217721, 'learning_rate': 9.881987577639752e-07, 'completion_length': 72.32143211364746, 'rewards/accuracy_reward': 0.1785714365541935, 'rewards/format_reward': 1.0, 'reward': 1.1785715222358704, 'reward_std': 0.24498926103115082, 'kl': 0.03857421875, 'epoch': 0.06} 1%| | 19/1610 [04:17<4:47:32, 10.84s/it] 1%| | 20/1610 [04:28<4:46:14, 10.80s/it] {'loss': 0.0022, 'grad_norm': 1.0365395749604511, 'learning_rate': 9.875776397515528e-07, 'completion_length': 83.81250190734863, 'rewards/accuracy_reward': 0.053571430034935474, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.0446429252624512, 'reward_std': 0.11394162103533745, 'kl': 0.0552978515625, 'epoch': 0.06} 1%| | 
20/1610 [04:28<4:46:14, 10.80s/it] 1%|▏ | 21/1610 [04:38<4:35:52, 10.42s/it] {'loss': 0.0019, 'grad_norm': 1.4161215560299052, 'learning_rate': 9.869565217391304e-07, 'completion_length': 79.62500381469727, 'rewards/accuracy_reward': 0.035714288242161274, 'rewards/format_reward': 1.0, 'reward': 1.035714328289032, 'reward_std': 0.06222161278128624, 'kl': 0.04833984375, 'epoch': 0.07} 1%|▏ | 21/1610 [04:38<4:35:52, 10.42s/it] 1%|▏ | 22/1610 [04:46<4:22:52, 9.93s/it] {'loss': 0.002, 'grad_norm': 14.193851654114837, 'learning_rate': 9.86335403726708e-07, 'completion_length': 67.64286041259766, 'rewards/accuracy_reward': 0.0892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.0892857313156128, 'reward_std': 0.128351628780365, 'kl': 0.0511474609375, 'epoch': 0.07} 1%|▏ | 22/1610 [04:46<4:22:52, 9.93s/it] 1%|▏ | 23/1610 [04:58<4:38:40, 10.54s/it] {'loss': 0.002, 'grad_norm': 2.5444188641379712, 'learning_rate': 9.857142857142857e-07, 'completion_length': 73.80357360839844, 'rewards/accuracy_reward': 0.1785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.1785714626312256, 'reward_std': 0.1575082391500473, 'kl': 0.0504150390625, 'epoch': 0.07} 1%|▏ | 23/1610 [04:58<4:38:40, 10.54s/it] 1%|▏ | 24/1610 [05:09<4:40:21, 10.61s/it] {'loss': 0.0024, 'grad_norm': 2.256300634245686, 'learning_rate': 9.850931677018633e-07, 'completion_length': 77.90179061889648, 'rewards/accuracy_reward': 0.044642859138548374, 'rewards/format_reward': 1.0, 'reward': 1.0446429252624512, 'reward_std': 0.08747543022036552, 'kl': 0.0589599609375, 'epoch': 0.07} 1%|▏ | 24/1610 [05:09<4:40:21, 10.61s/it] 2%|▏ | 25/1610 [05:17<4:21:19, 9.89s/it] {'loss': 0.0026, 'grad_norm': 2.8751716527503923, 'learning_rate': 9.84472049689441e-07, 'completion_length': 69.42857360839844, 'rewards/accuracy_reward': 0.2232142984867096, 'rewards/format_reward': 1.0, 'reward': 1.223214328289032, 'reward_std': 0.2954913079738617, 'kl': 0.064208984375, 'epoch': 0.08} 2%|▏ | 25/1610 [05:17<4:21:19, 9.89s/it] 2%|▏ | 
26/1610 [05:25<4:04:22, 9.26s/it] {'loss': 0.0019, 'grad_norm': 3.086777221856502, 'learning_rate': 9.838509316770186e-07, 'completion_length': 61.160715103149414, 'rewards/accuracy_reward': 0.2053571566939354, 'rewards/format_reward': 1.0, 'reward': 1.2053571939468384, 'reward_std': 0.19239907711744308, 'kl': 0.0474853515625, 'epoch': 0.08} 2%|▏ | 26/1610 [05:25<4:04:22, 9.26s/it] 2%|▏ | 27/1610 [05:35<4:07:57, 9.40s/it] {'loss': 0.0021, 'grad_norm': 2.8724985437837454, 'learning_rate': 9.832298136645962e-07, 'completion_length': 74.27679061889648, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3035714626312256, 'reward_std': 0.3531966358423233, 'kl': 0.052978515625, 'epoch': 0.08} 2%|▏ | 27/1610 [05:35<4:07:57, 9.40s/it] 2%|▏ | 28/1610 [05:44<4:04:38, 9.28s/it] {'loss': 0.0018, 'grad_norm': 8.361710589844822, 'learning_rate': 9.826086956521739e-07, 'completion_length': 74.83929061889648, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.2804734334349632, 'kl': 0.0458984375, 'epoch': 0.09} 2%|▏ | 28/1610 [05:44<4:04:38, 9.28s/it] 2%|▏ | 29/1610 [05:52<3:57:51, 9.03s/it] {'loss': 0.0019, 'grad_norm': 3.6455098769584047, 'learning_rate': 9.819875776397515e-07, 'completion_length': 66.74107360839844, 'rewards/accuracy_reward': 0.2410714477300644, 'rewards/format_reward': 1.0, 'reward': 1.2410715222358704, 'reward_std': 0.24229325354099274, 'kl': 0.0487060546875, 'epoch': 0.09} 2%|▏ | 29/1610 [05:52<3:57:51, 9.03s/it] 2%|▏ | 30/1610 [06:02<3:59:30, 9.10s/it] {'loss': 0.0022, 'grad_norm': 1.771192927203832, 'learning_rate': 9.813664596273291e-07, 'completion_length': 68.96428680419922, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.321428656578064, 'reward_std': 0.1956884115934372, 'kl': 0.053955078125, 'epoch': 0.09} 2%|▏ | 30/1610 [06:02<3:59:30, 9.10s/it] 2%|▏ | 31/1610 
[06:11<4:04:31, 9.29s/it] {'loss': 0.0018, 'grad_norm': 2.8417223169598063, 'learning_rate': 9.807453416149068e-07, 'completion_length': 76.40179061889648, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.3435651957988739, 'kl': 0.044921875, 'epoch': 0.1} 2%|▏ | 31/1610 [06:11<4:04:31, 9.29s/it] 2%|▏ | 32/1610 [06:20<4:01:21, 9.18s/it] {'loss': 0.0022, 'grad_norm': 2.1061989292969674, 'learning_rate': 9.801242236024844e-07, 'completion_length': 80.89286041259766, 'rewards/accuracy_reward': 0.2142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.2142857909202576, 'reward_std': 0.3402758836746216, 'kl': 0.05517578125, 'epoch': 0.1} 2%|▏ | 32/1610 [06:20<4:01:21, 9.18s/it] 2%|▏ | 33/1610 [06:30<4:03:47, 9.28s/it] {'loss': 0.0019, 'grad_norm': 2.395306919645455, 'learning_rate': 9.79503105590062e-07, 'completion_length': 69.0089340209961, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.2857143878936768, 'reward_std': 0.2404673956334591, 'kl': 0.0477294921875, 'epoch': 0.1} 2%|▏ | 33/1610 [06:30<4:03:47, 9.28s/it] 2%|▏ | 34/1610 [06:40<4:14:40, 9.70s/it] {'loss': 0.0019, 'grad_norm': 2.278870675047882, 'learning_rate': 9.788819875776397e-07, 'completion_length': 79.63393020629883, 'rewards/accuracy_reward': 0.3482143059372902, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.12956401705741882, 'kl': 0.047607421875, 'epoch': 0.11} 2%|▏ | 34/1610 [06:40<4:14:40, 9.70s/it] 2%|▏ | 35/1610 [06:50<4:11:27, 9.58s/it] {'loss': 0.0023, 'grad_norm': 2.628423067914875, 'learning_rate': 9.782608695652173e-07, 'completion_length': 74.93750381469727, 'rewards/accuracy_reward': 0.2946428805589676, 'rewards/format_reward': 1.0, 'reward': 1.2946429252624512, 'reward_std': 0.388957679271698, 'kl': 0.056640625, 'epoch': 0.11} 2%|▏ | 35/1610 [06:50<4:11:27, 9.58s/it] 2%|▏ | 36/1610 [06:59<4:09:49, 9.52s/it] {'loss': 0.002, 'grad_norm': 
2.7915761013923466, 'learning_rate': 9.77639751552795e-07, 'completion_length': 79.68750381469727, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3125000596046448, 'reward_std': 0.3778369873762131, 'kl': 0.0511474609375, 'epoch': 0.11} 2%|▏ | 36/1610 [06:59<4:09:49, 9.52s/it] 2%|▏ | 37/1610 [07:10<4:20:37, 9.94s/it] {'loss': 0.0021, 'grad_norm': 1.9342182711946958, 'learning_rate': 9.770186335403726e-07, 'completion_length': 77.83929061889648, 'rewards/accuracy_reward': 0.232142873108387, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.2142857313156128, 'reward_std': 0.36591367423534393, 'kl': 0.0533447265625, 'epoch': 0.11} 2%|▏ | 37/1610 [07:10<4:20:37, 9.94s/it] 2%|▏ | 38/1610 [07:19<4:14:06, 9.70s/it] {'loss': 0.0017, 'grad_norm': 3.7278383268194095, 'learning_rate': 9.763975155279502e-07, 'completion_length': 77.81250381469727, 'rewards/accuracy_reward': 0.196428582072258, 'rewards/format_reward': 1.0, 'reward': 1.196428656578064, 'reward_std': 0.2831694334745407, 'kl': 0.04345703125, 'epoch': 0.12} 2%|▏ | 38/1610 [07:19<4:14:06, 9.70s/it] 2%|▏ | 39/1610 [07:27<4:01:40, 9.23s/it] {'loss': 0.0026, 'grad_norm': 2.7496321818804286, 'learning_rate': 9.757763975155278e-07, 'completion_length': 66.72321891784668, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035715222358704, 'reward_std': 0.26572681963443756, 'kl': 0.065673828125, 'epoch': 0.12} 2%|▏ | 39/1610 [07:27<4:01:40, 9.23s/it] 2%|▏ | 40/1610 [07:37<4:06:55, 9.44s/it] {'loss': 0.0018, 'grad_norm': 9.728693140102614, 'learning_rate': 9.751552795031055e-07, 'completion_length': 88.8660774230957, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035715222358704, 'reward_std': 0.33879223465919495, 'kl': 0.0443115234375, 'epoch': 0.12} 2%|▏ | 40/1610 [07:37<4:06:55, 9.44s/it] 3%|▎ | 41/1610 [07:48<4:19:49, 9.94s/it] {'loss': 0.0016, 'grad_norm': 2.5355068285847375, 
'learning_rate': 9.745341614906833e-07, 'completion_length': 96.54464721679688, 'rewards/accuracy_reward': 0.2589285895228386, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2410715222358704, 'reward_std': 0.3237687796354294, 'kl': 0.041015625, 'epoch': 0.13} 3%|▎ | 41/1610 [07:48<4:19:49, 9.94s/it] 3%|▎ | 42/1610 [08:01<4:38:05, 10.64s/it] {'loss': 0.0018, 'grad_norm': 13.567135234293977, 'learning_rate': 9.73913043478261e-07, 'completion_length': 94.3660774230957, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3125000596046448, 'reward_std': 0.39980147778987885, 'kl': 0.044921875, 'epoch': 0.13} 3%|▎ | 42/1610 [08:01<4:38:05, 10.64s/it] 3%|▎ | 43/1610 [08:09<4:22:53, 10.07s/it] {'loss': 0.0018, 'grad_norm': 2.19475815553279, 'learning_rate': 9.732919254658386e-07, 'completion_length': 90.37500381469727, 'rewards/accuracy_reward': 0.294642873108387, 'rewards/format_reward': 1.0, 'reward': 1.2946429252624512, 'reward_std': 0.35562701523303986, 'kl': 0.0440673828125, 'epoch': 0.13} 3%|▎ | 43/1610 [08:09<4:22:53, 10.07s/it] 3%|▎ | 44/1610 [08:20<4:25:12, 10.16s/it] {'loss': 0.0018, 'grad_norm': 2.636235209366645, 'learning_rate': 9.726708074534162e-07, 'completion_length': 103.98214721679688, 'rewards/accuracy_reward': 0.2767857238650322, 'rewards/format_reward': 1.0, 'reward': 1.2767857909202576, 'reward_std': 0.2987862154841423, 'kl': 0.0458984375, 'epoch': 0.14} 3%|▎ | 44/1610 [08:20<4:25:12, 10.16s/it] 3%|▎ | 45/1610 [08:31<4:32:20, 10.44s/it] {'loss': 0.0018, 'grad_norm': 2.048010418030091, 'learning_rate': 9.720496894409938e-07, 'completion_length': 100.71429061889648, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2946429252624512, 'reward_std': 0.3154004365205765, 'kl': 0.04541015625, 'epoch': 0.14} 3%|▎ | 45/1610 [08:31<4:32:20, 10.44s/it] 3%|▎ | 46/1610 [08:44<4:55:56, 11.35s/it] {'loss': 0.0017, 'grad_norm': 1.7540812095453724, 
'learning_rate': 9.714285714285715e-07, 'completion_length': 107.8214340209961, 'rewards/accuracy_reward': 0.2678571566939354, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2500000596046448, 'reward_std': 0.30880723148584366, 'kl': 0.0423583984375, 'epoch': 0.14} 3%|▎ | 46/1610 [08:44<4:55:56, 11.35s/it] 3%|▎ | 47/1610 [08:55<4:50:58, 11.17s/it] {'loss': 0.002, 'grad_norm': 1.9725119964463087, 'learning_rate': 9.708074534161491e-07, 'completion_length': 96.63393020629883, 'rewards/accuracy_reward': 0.2142857164144516, 'rewards/format_reward': 1.0, 'reward': 1.2142857313156128, 'reward_std': 0.3051137626171112, 'kl': 0.0511474609375, 'epoch': 0.15} 3%|▎ | 47/1610 [08:55<4:50:58, 11.17s/it] 3%|▎ | 48/1610 [09:05<4:41:40, 10.82s/it] {'loss': 0.0018, 'grad_norm': 1.4613109152726167, 'learning_rate': 9.701863354037265e-07, 'completion_length': 101.65178680419922, 'rewards/accuracy_reward': 0.2232142984867096, 'rewards/format_reward': 1.0, 'reward': 1.223214328289032, 'reward_std': 0.2513168230652809, 'kl': 0.045654296875, 'epoch': 0.15} 3%|▎ | 48/1610 [09:05<4:41:40, 10.82s/it] 3%|▎ | 49/1610 [09:14<4:30:25, 10.39s/it] {'loss': 0.002, 'grad_norm': 2.544149695933368, 'learning_rate': 9.695652173913042e-07, 'completion_length': 95.60714721679688, 'rewards/accuracy_reward': 0.2500000149011612, 'rewards/format_reward': 1.0, 'reward': 1.2500000596046448, 'reward_std': 0.25670325756073, 'kl': 0.048828125, 'epoch': 0.15} 3%|▎ | 49/1610 [09:14<4:30:25, 10.39s/it] 3%|▎ | 50/1610 [09:26<4:37:14, 10.66s/it] {'loss': 0.002, 'grad_norm': 3.320621243029186, 'learning_rate': 9.68944099378882e-07, 'completion_length': 104.08036422729492, 'rewards/accuracy_reward': 0.1071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.1071429252624512, 'reward_std': 0.19057324528694153, 'kl': 0.0498046875, 'epoch': 0.16} 3%|▎ | 50/1610 [09:26<4:37:14, 10.66s/it] 3%|▎ | 51/1610 [09:35<4:26:27, 10.26s/it] {'loss': 0.0022, 'grad_norm': 1.5647540837580287, 'learning_rate': 
9.683229813664596e-07, 'completion_length': 93.8214340209961, 'rewards/accuracy_reward': 0.267857164144516, 'rewards/format_reward': 1.0, 'reward': 1.2678572535514832, 'reward_std': 0.26181842386722565, 'kl': 0.05419921875, 'epoch': 0.16} 3%|▎ | 51/1610 [09:35<4:26:27, 10.26s/it] 3%|▎ | 52/1610 [09:46<4:31:41, 10.46s/it] {'loss': 0.0021, 'grad_norm': 3.3718018417638507, 'learning_rate': 9.677018633540373e-07, 'completion_length': 106.06250381469727, 'rewards/accuracy_reward': 0.258928582072258, 'rewards/format_reward': 1.0, 'reward': 1.258928656578064, 'reward_std': 0.2987862080335617, 'kl': 0.0535888671875, 'epoch': 0.16} 3%|▎ | 52/1610 [09:46<4:31:41, 10.46s/it] 3%|▎ | 53/1610 [09:56<4:29:39, 10.39s/it] {'loss': 0.0019, 'grad_norm': 2.3613320175030292, 'learning_rate': 9.67080745341615e-07, 'completion_length': 124.83036041259766, 'rewards/accuracy_reward': 0.1875000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.1785714626312256, 'reward_std': 0.2507179081439972, 'kl': 0.047607421875, 'epoch': 0.16} 3%|▎ | 53/1610 [09:56<4:29:39, 10.39s/it] 3%|▎ | 54/1610 [10:09<4:45:18, 11.00s/it] {'loss': 0.0022, 'grad_norm': 1.529035163841813, 'learning_rate': 9.664596273291925e-07, 'completion_length': 105.26786422729492, 'rewards/accuracy_reward': 0.16964286006987095, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.1517857313156128, 'reward_std': 0.24229323863983154, 'kl': 0.055419921875, 'epoch': 0.17} 3%|▎ | 54/1610 [10:09<4:45:18, 11.00s/it] 3%|▎ | 55/1610 [10:20<4:45:48, 11.03s/it] {'loss': 0.002, 'grad_norm': 2.0586007438732272, 'learning_rate': 9.658385093167702e-07, 'completion_length': 107.04464721679688, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3750000596046448, 'reward_std': 0.4103030860424042, 'kl': 0.049560546875, 'epoch': 0.17} 3%|▎ | 55/1610 [10:20<4:45:48, 11.03s/it] 3%|▎ | 56/1610 [10:32<4:56:09, 11.43s/it] {'loss': 0.0018, 'grad_norm': 2.3569293381687326, 
'learning_rate': 9.652173913043478e-07, 'completion_length': 120.01786422729492, 'rewards/accuracy_reward': 0.267857164144516, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.30330248177051544, 'kl': 0.04443359375, 'epoch': 0.17} 3%|▎ | 56/1610 [10:32<4:56:09, 11.43s/it] 4%|▎ | 57/1610 [10:43<4:49:10, 11.17s/it] {'loss': 0.0017, 'grad_norm': 5.002646834139588, 'learning_rate': 9.645962732919254e-07, 'completion_length': 102.67857360839844, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2767857909202576, 'reward_std': 0.36100782454013824, 'kl': 0.0423583984375, 'epoch': 0.18} 4%|▎ | 57/1610 [10:43<4:49:10, 11.17s/it] 4%|▎ | 58/1610 [10:55<4:56:29, 11.46s/it] {'loss': 0.0017, 'grad_norm': 2.4453827416733027, 'learning_rate': 9.63975155279503e-07, 'completion_length': 118.07143020629883, 'rewards/accuracy_reward': 0.2410714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2053571939468384, 'reward_std': 0.4051344394683838, 'kl': 0.041259765625, 'epoch': 0.18} 4%|▎ | 58/1610 [10:55<4:56:29, 11.46s/it] 4%|▎ | 59/1610 [11:06<4:52:41, 11.32s/it] {'loss': 0.0022, 'grad_norm': 2.8966424112559856, 'learning_rate': 9.633540372670807e-07, 'completion_length': 100.45536041259766, 'rewards/accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3125000596046448, 'reward_std': 0.20020468533039093, 'kl': 0.05517578125, 'epoch': 0.18} 4%|▎ | 59/1610 [11:06<4:52:41, 11.32s/it] 4%|▎ | 60/1610 [11:16<4:43:41, 10.98s/it] {'loss': 0.0018, 'grad_norm': 3.611712567342244, 'learning_rate': 9.627329192546583e-07, 'completion_length': 101.64286041259766, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3035715222358704, 'reward_std': 0.3026890307664871, 'kl': 0.0455322265625, 'epoch': 0.19} 4%|▎ | 60/1610 [11:16<4:43:41, 10.98s/it] 4%|▍ | 61/1610 [11:28<4:48:46, 11.19s/it] {'loss': 0.0015, 
'grad_norm': 1.3890862943771491, 'learning_rate': 9.62111801242236e-07, 'completion_length': 107.10714721679688, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285715222358704, 'reward_std': 0.22606300562620163, 'kl': 0.0384521484375, 'epoch': 0.19} 4%|▍ | 61/1610 [11:28<4:48:46, 11.19s/it] 4%|▍ | 62/1610 [11:42<5:09:56, 12.01s/it] {'loss': 0.0018, 'grad_norm': 1.9088261186439852, 'learning_rate': 9.614906832298136e-07, 'completion_length': 127.85715103149414, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.973214328289032, 'reward': 1.4732143878936768, 'reward_std': 0.42235928773880005, 'kl': 0.0452880859375, 'epoch': 0.19} 4%|▍ | 62/1610 [11:42<5:09:56, 12.01s/it] 4%|▍ | 63/1610 [11:54<5:14:42, 12.21s/it] {'loss': 0.002, 'grad_norm': 2.2625010099191125, 'learning_rate': 9.608695652173912e-07, 'completion_length': 100.15179061889648, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.285714328289032, 'reward_std': 0.22936689853668213, 'kl': 0.05126953125, 'epoch': 0.2} 4%|▍ | 63/1610 [11:54<5:14:42, 12.21s/it] 4%|▍ | 64/1610 [12:06<5:12:51, 12.14s/it] {'loss': 0.0022, 'grad_norm': 2.4126987609986865, 'learning_rate': 9.602484472049689e-07, 'completion_length': 110.6785774230957, 'rewards/accuracy_reward': 0.196428582072258, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.1785715222358704, 'reward_std': 0.2673321068286896, 'kl': 0.054443359375, 'epoch': 0.2} 4%|▍ | 64/1610 [12:06<5:12:51, 12.14s/it] 4%|▍ | 65/1610 [12:18<5:12:04, 12.12s/it] {'loss': 0.0018, 'grad_norm': 2.0715132384167245, 'learning_rate': 9.596273291925465e-07, 'completion_length': 113.58929061889648, 'rewards/accuracy_reward': 0.375, 'rewards/format_reward': 0.973214328289032, 'reward': 1.348214328289032, 'reward_std': 0.309016577899456, 'kl': 0.046142578125, 'epoch': 0.2} 4%|▍ | 65/1610 [12:18<5:12:04, 12.12s/it] 4%|▍ | 66/1610 [12:30<5:11:57, 12.12s/it] {'loss': 
0.0022, 'grad_norm': 2.2684832464277997, 'learning_rate': 9.590062111801241e-07, 'completion_length': 117.20536422729492, 'rewards/accuracy_reward': 0.2767857238650322, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.258928656578064, 'reward_std': 0.2868572920560837, 'kl': 0.0557861328125, 'epoch': 0.2} 4%|▍ | 66/1610 [12:30<5:11:57, 12.12s/it] 4%|▍ | 67/1610 [12:41<5:02:36, 11.77s/it] {'loss': 0.0023, 'grad_norm': 3.913219413551043, 'learning_rate': 9.583850931677018e-07, 'completion_length': 104.64286041259766, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3660714626312256, 'reward_std': 0.4153647869825363, 'kl': 0.0582275390625, 'epoch': 0.21} 4%|▍ | 67/1610 [12:41<5:02:36, 11.77s/it] 4%|▍ | 68/1610 [12:53<5:02:02, 11.75s/it] {'loss': 0.0019, 'grad_norm': 2.02053025436448, 'learning_rate': 9.577639751552796e-07, 'completion_length': 100.7589340209961, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.3750000596046448, 'reward_std': 0.38026735186576843, 'kl': 0.04736328125, 'epoch': 0.21} 4%|▍ | 68/1610 [12:53<5:02:02, 11.75s/it] 4%|▍ | 69/1610 [13:05<5:04:37, 11.86s/it] {'loss': 0.0021, 'grad_norm': 1.8209296562258557, 'learning_rate': 9.571428571428572e-07, 'completion_length': 107.56250762939453, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.29097503423690796, 'kl': 0.053466796875, 'epoch': 0.21} 4%|▍ | 69/1610 [13:05<5:04:37, 11.86s/it] 4%|▍ | 70/1610 [13:19<5:18:45, 12.42s/it] {'loss': 0.0021, 'grad_norm': 3.8559694903465966, 'learning_rate': 9.565217391304349e-07, 'completion_length': 122.69643020629883, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4464285969734192, 'reward_std': 0.2567032501101494, 'kl': 0.0516357421875, 'epoch': 0.22} 4%|▍ | 70/1610 [13:19<5:18:45, 12.42s/it] 4%|▍ | 
71/1610 [13:31<5:12:20, 12.18s/it] {'loss': 0.0022, 'grad_norm': 1.6371901698186229, 'learning_rate': 9.559006211180125e-07, 'completion_length': 121.33036422729492, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3035715222358704, 'reward_std': 0.3174412399530411, 'kl': 0.0550537109375, 'epoch': 0.22} 4%|▍ | 71/1610 [13:31<5:12:20, 12.18s/it] 4%|▍ | 72/1610 [13:42<5:07:23, 11.99s/it] {'loss': 0.0019, 'grad_norm': 2.7666995173046094, 'learning_rate': 9.5527950310559e-07, 'completion_length': 104.34821701049805, 'rewards/accuracy_reward': 0.1517857238650322, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.1339285969734192, 'reward_std': 0.31353841722011566, 'kl': 0.047607421875, 'epoch': 0.22} 4%|▍ | 72/1610 [13:42<5:07:23, 11.99s/it] 5%|▍ | 73/1610 [13:54<5:03:56, 11.86s/it] {'loss': 0.0021, 'grad_norm': 2.5034710592604377, 'learning_rate': 9.546583850931676e-07, 'completion_length': 105.3839340209961, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3392857909202576, 'reward_std': 0.34668102860450745, 'kl': 0.05322265625, 'epoch': 0.23} 5%|▍ | 73/1610 [13:54<5:03:56, 11.86s/it] 5%|▍ | 74/1610 [14:07<5:11:31, 12.17s/it] {'loss': 0.0014, 'grad_norm': 2.438006827725872, 'learning_rate': 9.540372670807452e-07, 'completion_length': 136.43750381469727, 'rewards/accuracy_reward': 0.2857143059372902, 'rewards/format_reward': 0.973214328289032, 'reward': 1.258928656578064, 'reward_std': 0.28707224875688553, 'kl': 0.03619384765625, 'epoch': 0.23} 5%|▍ | 74/1610 [14:07<5:11:31, 12.17s/it] 5%|▍ | 75/1610 [14:15<4:45:21, 11.15s/it] {'loss': 0.0022, 'grad_norm': 1.7910080459052025, 'learning_rate': 9.534161490683229e-07, 'completion_length': 90.34821701049805, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.33675143122673035, 'kl': 0.0543212890625, 'epoch': 0.23} 5%|▍ | 
75/1610 [14:15<4:45:21, 11.15s/it] 5%|▍ | 76/1610 [14:29<5:03:50, 11.88s/it] {'loss': 0.0024, 'grad_norm': 2.052794016857258, 'learning_rate': 9.527950310559006e-07, 'completion_length': 122.58929061889648, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2767857313156128, 'reward_std': 0.4707726240158081, 'kl': 0.0595703125, 'epoch': 0.24} 5%|▍ | 76/1610 [14:29<5:03:50, 11.88s/it] 5%|▍ | 77/1610 [14:40<5:00:03, 11.74s/it] {'loss': 0.0027, 'grad_norm': 1.9007559883518035, 'learning_rate': 9.521739130434783e-07, 'completion_length': 89.66071701049805, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.22363825142383575, 'kl': 0.066650390625, 'epoch': 0.24} 5%|▍ | 77/1610 [14:40<5:00:03, 11.74s/it] 5%|▍ | 78/1610 [14:52<4:56:22, 11.61s/it] {'loss': 0.0023, 'grad_norm': 1.4948241271024265, 'learning_rate': 9.515527950310559e-07, 'completion_length': 99.54464721679688, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.22363825142383575, 'kl': 0.0570068359375, 'epoch': 0.24} 5%|▍ | 78/1610 [14:52<4:56:22, 11.61s/it] 5%|▍ | 79/1610 [15:05<5:08:11, 12.08s/it] {'loss': 0.0019, 'grad_norm': 2.066815938000687, 'learning_rate': 9.509316770186336e-07, 'completion_length': 114.6964340209961, 'rewards/accuracy_reward': 0.267857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.258928656578064, 'reward_std': 0.3336714804172516, 'kl': 0.0474853515625, 'epoch': 0.25} 5%|▍ | 79/1610 [15:05<5:08:11, 12.08s/it] 5%|▍ | 80/1610 [15:17<5:06:33, 12.02s/it] {'loss': 0.0023, 'grad_norm': 2.1903747534694915, 'learning_rate': 9.503105590062112e-07, 'completion_length': 119.45536422729492, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.21973545849323273, 'kl': 0.0562744140625, 'epoch': 0.25} 5%|▍ | 80/1610 
[15:17<5:06:33, 12.02s/it] 5%|▌ | 81/1610 [15:29<5:12:14, 12.25s/it] {'loss': 0.0026, 'grad_norm': 2.1814363425849947, 'learning_rate': 9.496894409937888e-07, 'completion_length': 103.56250381469727, 'rewards/accuracy_reward': 0.25, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2410714626312256, 'reward_std': 0.28707224130630493, 'kl': 0.0657958984375, 'epoch': 0.25} 5%|▌ | 81/1610 [15:29<5:12:14, 12.25s/it] 5%|▌ | 82/1610 [15:40<5:01:44, 11.85s/it] {'loss': 0.0024, 'grad_norm': 1.8902700873466711, 'learning_rate': 9.490683229813665e-07, 'completion_length': 112.52679061889648, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3660715222358704, 'reward_std': 0.37064486742019653, 'kl': 0.061279296875, 'epoch': 0.25} 5%|▌ | 82/1610 [15:40<5:01:44, 11.85s/it] 5%|▌ | 83/1610 [15:52<5:01:49, 11.86s/it] {'loss': 0.0028, 'grad_norm': 1.7790764188037784, 'learning_rate': 9.48447204968944e-07, 'completion_length': 108.83036422729492, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4732143878936768, 'reward_std': 0.24889206886291504, 'kl': 0.070068359375, 'epoch': 0.26} 5%|▌ | 83/1610 [15:52<5:01:49, 11.86s/it] 5%|▌ | 84/1610 [16:06<5:12:35, 12.29s/it] {'loss': 0.0015, 'grad_norm': 2.034365136977003, 'learning_rate': 9.478260869565216e-07, 'completion_length': 119.47321701049805, 'rewards/accuracy_reward': 0.3125000223517418, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3035714626312256, 'reward_std': 0.3246534913778305, 'kl': 0.0380859375, 'epoch': 0.26} 5%|▌ | 84/1610 [16:06<5:12:35, 12.29s/it] 5%|▌ | 85/1610 [16:19<5:21:34, 12.65s/it] {'loss': 0.0022, 'grad_norm': 1.5433763648800238, 'learning_rate': 9.472049689440993e-07, 'completion_length': 128.08036422729492, 'rewards/accuracy_reward': 0.258928582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2500000596046448, 'reward_std': 0.22363825142383575, 'kl': 0.0548095703125, 'epoch': 0.26} 
5%|▌ | 85/1610 [16:19<5:21:34, 12.65s/it] 5%|▌ | 86/1610 [16:31<5:18:42, 12.55s/it] {'loss': 0.0019, 'grad_norm': 2.7089007048277556, 'learning_rate': 9.46583850931677e-07, 'completion_length': 109.90179061889648, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3750000596046448, 'reward_std': 0.2792610377073288, 'kl': 0.0467529296875, 'epoch': 0.27} 5%|▌ | 86/1610 [16:31<5:18:42, 12.55s/it] 5%|▌ | 87/1610 [16:44<5:21:20, 12.66s/it] {'loss': 0.0013, 'grad_norm': 1.4051757520782995, 'learning_rate': 9.459627329192546e-07, 'completion_length': 131.41964721679688, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3125000596046448, 'reward_std': 0.336966410279274, 'kl': 0.03216552734375, 'epoch': 0.27} 5%|▌ | 87/1610 [16:44<5:21:20, 12.66s/it] 5%|▌ | 88/1610 [16:57<5:17:55, 12.53s/it] {'loss': 0.0018, 'grad_norm': 2.3608029626752143, 'learning_rate': 9.453416149068323e-07, 'completion_length': 110.06250381469727, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2767857909202576, 'reward_std': 0.24620164185762405, 'kl': 0.044189453125, 'epoch': 0.27} 5%|▌ | 88/1610 [16:57<5:17:55, 12.53s/it] 6%|▌ | 89/1610 [17:08<5:10:47, 12.26s/it] {'loss': 0.0015, 'grad_norm': 2.0828195951559514, 'learning_rate': 9.447204968944099e-07, 'completion_length': 103.51786041259766, 'rewards/accuracy_reward': 0.1250000074505806, 'rewards/format_reward': 1.0, 'reward': 1.1250000596046448, 'reward_std': 0.18458788841962814, 'kl': 0.03857421875, 'epoch': 0.28} 6%|▌ | 89/1610 [17:08<5:10:47, 12.26s/it] 6%|▌ | 90/1610 [17:21<5:12:29, 12.34s/it] {'loss': 0.0014, 'grad_norm': 1.918375868921056, 'learning_rate': 9.440993788819875e-07, 'completion_length': 142.42858123779297, 'rewards/accuracy_reward': 0.2410714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.223214328289032, 'reward_std': 0.2092282474040985, 'kl': 
0.0340576171875, 'epoch': 0.28} 6%|▌ | 90/1610 [17:21<5:12:29, 12.34s/it] 6%|▌ | 91/1610 [17:33<5:14:41, 12.43s/it] {'loss': 0.0018, 'grad_norm': 1.841879169020255, 'learning_rate': 9.434782608695652e-07, 'completion_length': 128.10714721679688, 'rewards/accuracy_reward': 0.2946428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2857143878936768, 'reward_std': 0.3020365983247757, 'kl': 0.0438232421875, 'epoch': 0.28} 6%|▌ | 91/1610 [17:33<5:14:41, 12.43s/it] 6%|▌ | 92/1610 [17:48<5:28:21, 12.98s/it] {'loss': 0.0016, 'grad_norm': 2.2257227699367204, 'learning_rate': 9.428571428571428e-07, 'completion_length': 146.08929443359375, 'rewards/accuracy_reward': 0.3303571492433548, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.2946428656578064, 'reward_std': 0.3829132318496704, 'kl': 0.0404052734375, 'epoch': 0.29} 6%|▌ | 92/1610 [17:48<5:28:21, 12.98s/it] 6%|▌ | 93/1610 [17:59<5:16:17, 12.51s/it] {'loss': 0.0018, 'grad_norm': 1.5338710487101799, 'learning_rate': 9.422360248447204e-07, 'completion_length': 101.75000381469727, 'rewards/accuracy_reward': 0.2857143059372902, 'rewards/format_reward': 1.0, 'reward': 1.2857143878936768, 'reward_std': 0.17885926365852356, 'kl': 0.044921875, 'epoch': 0.29} 6%|▌ | 93/1610 [17:59<5:16:17, 12.51s/it] 6%|▌ | 94/1610 [18:12<5:19:57, 12.66s/it] {'loss': 0.0018, 'grad_norm': 2.3354158314748603, 'learning_rate': 9.41614906832298e-07, 'completion_length': 112.23214721679688, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.3051137626171112, 'kl': 0.0439453125, 'epoch': 0.29} 6%|▌ | 94/1610 [18:12<5:19:57, 12.66s/it] 6%|▌ | 95/1610 [18:24<5:18:26, 12.61s/it] {'loss': 0.0019, 'grad_norm': 1.7513317104118502, 'learning_rate': 9.409937888198758e-07, 'completion_length': 113.72321701049805, 'rewards/accuracy_reward': 0.3303571492433548, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.3035715222358704, 'reward_std': 
0.36652712523937225, 'kl': 0.047119140625, 'epoch': 0.3} 6%|▌ | 95/1610 [18:24<5:18:26, 12.61s/it] 6%|▌ | 96/1610 [18:36<5:08:09, 12.21s/it] {'loss': 0.0018, 'grad_norm': 1.7495881119492245, 'learning_rate': 9.403726708074534e-07, 'completion_length': 108.41072082519531, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.2248450294137001, 'kl': 0.0443115234375, 'epoch': 0.3} 6%|▌ | 96/1610 [18:36<5:08:09, 12.21s/it] 6%|▌ | 97/1610 [18:50<5:23:10, 12.82s/it] {'loss': 0.0023, 'grad_norm': 2.1056778403200185, 'learning_rate': 9.39751552795031e-07, 'completion_length': 112.44643020629883, 'rewards/accuracy_reward': 0.1964285746216774, 'rewards/format_reward': 0.973214328289032, 'reward': 1.1696428656578064, 'reward_std': 0.2928008586168289, 'kl': 0.0570068359375, 'epoch': 0.3} 6%|▌ | 97/1610 [18:50<5:23:10, 12.82s/it] 6%|▌ | 98/1610 [19:02<5:13:16, 12.43s/it] {'loss': 0.0022, 'grad_norm': 2.429660524265576, 'learning_rate': 9.391304347826087e-07, 'completion_length': 99.18750381469727, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.348214328289032, 'reward_std': 0.34538543224334717, 'kl': 0.0538330078125, 'epoch': 0.3} 6%|▌ | 98/1610 [19:02<5:13:16, 12.43s/it] 6%|▌ | 99/1610 [19:11<4:50:49, 11.55s/it] {'loss': 0.0018, 'grad_norm': 2.5819401749035062, 'learning_rate': 9.385093167701863e-07, 'completion_length': 97.27679061889648, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.26181842386722565, 'kl': 0.045166015625, 'epoch': 0.31} 6%|▌ | 99/1610 [19:11<4:50:49, 11.55s/it] 6%|▌ | 100/1610 [19:23<4:53:30, 11.66s/it] {'loss': 0.002, 'grad_norm': 1.875224193345341, 'learning_rate': 9.37888198757764e-07, 'completion_length': 92.15179061889648, 'rewards/accuracy_reward': 0.258928582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 
1.2410714626312256, 'reward_std': 0.22094221413135529, 'kl': 0.0501708984375, 'epoch': 0.31} 6%|▌ | 100/1610 [19:23<4:53:30, 11.66s/it] 6%|▋ | 101/1610 [20:28<11:36:20, 27.69s/it] {'loss': 0.0014, 'grad_norm': 2.0401349545323857, 'learning_rate': 9.372670807453416e-07, 'completion_length': 135.7589340209961, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 0.973214328289032, 'reward': 1.2678571939468384, 'reward_std': 0.33737052977085114, 'kl': 0.03424072265625, 'epoch': 0.31} 6%|▋ | 101/1610 [20:28<11:36:20, 27.69s/it] 6%|▋ | 102/1610 [20:51<10:59:09, 26.23s/it] {'loss': 0.0018, 'grad_norm': 2.0826292367443835, 'learning_rate': 9.366459627329192e-07, 'completion_length': 114.37500762939453, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.2946429252624512, 'reward_std': 0.2515233904123306, 'kl': 0.0439453125, 'epoch': 0.32} 6%|▋ | 102/1610 [20:51<10:59:09, 26.23s/it] 6%|▋ | 103/1610 [21:11<10:15:15, 24.50s/it] {'loss': 0.0021, 'grad_norm': 1.5741183402339485, 'learning_rate': 9.360248447204968e-07, 'completion_length': 96.83036041259766, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.20021027326583862, 'kl': 0.0535888671875, 'epoch': 0.32} 6%|▋ | 103/1610 [21:11<10:15:15, 24.50s/it] 6%|▋ | 104/1610 [21:33<9:50:40, 23.53s/it] {'loss': 0.002, 'grad_norm': 2.0483535099128125, 'learning_rate': 9.354037267080745e-07, 'completion_length': 90.04464721679688, 'rewards/accuracy_reward': 0.2410714402794838, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2321428656578064, 'reward_std': 0.30330249667167664, 'kl': 0.050048828125, 'epoch': 0.32} 6%|▋ | 104/1610 [21:33<9:50:40, 23.53s/it] 7%|▋ | 105/1610 [21:56<9:49:33, 23.50s/it] {'loss': 0.0018, 'grad_norm': 1.5401064740978474, 'learning_rate': 9.347826086956522e-07, 'completion_length': 100.79464721679688, 'rewards/accuracy_reward': 0.4196428805589676, 
'rewards/format_reward': 0.9821429252624512, 'reward': 1.4017857313156128, 'reward_std': 0.2628158628940582, 'kl': 0.04443359375, 'epoch': 0.33} 7%|▋ | 105/1610 [21:56<9:49:33, 23.50s/it] 7%|▋ | 106/1610 [22:16<9:24:04, 22.50s/it] {'loss': 0.0021, 'grad_norm': 2.044696913328528, 'learning_rate': 9.341614906832299e-07, 'completion_length': 100.09821701049805, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.2804734408855438, 'kl': 0.052001953125, 'epoch': 0.33} 7%|▋ | 106/1610 [22:16<9:24:04, 22.50s/it] 7%|▋ | 107/1610 [22:37<9:07:32, 21.86s/it] {'loss': 0.0023, 'grad_norm': 2.199507079141717, 'learning_rate': 9.335403726708074e-07, 'completion_length': 83.82143020629883, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.2404673844575882, 'kl': 0.0570068359375, 'epoch': 0.33} 7%|▋ | 107/1610 [22:37<9:07:32, 21.86s/it] 7%|▋ | 108/1610 [23:01<9:23:13, 22.50s/it] {'loss': 0.0023, 'grad_norm': 3.2060840961666077, 'learning_rate': 9.32919254658385e-07, 'completion_length': 92.87500381469727, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4107143878936768, 'reward_std': 0.31540603935718536, 'kl': 0.0567626953125, 'epoch': 0.34} 7%|▋ | 108/1610 [23:01<9:23:13, 22.50s/it] 7%|▋ | 109/1610 [23:24<9:31:07, 22.83s/it] {'loss': 0.0024, 'grad_norm': 2.2332540925917344, 'learning_rate': 9.322981366459626e-07, 'completion_length': 93.6160774230957, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3214285969734192, 'reward_std': 0.2605525702238083, 'kl': 0.0594482421875, 'epoch': 0.34} 7%|▋ | 109/1610 [23:24<9:31:07, 22.83s/it] 7%|▋ | 110/1610 [23:46<9:22:45, 22.51s/it] {'loss': 0.0022, 'grad_norm': 2.232382499846331, 'learning_rate': 9.316770186335403e-07, 'completion_length': 93.52679061889648, 'rewards/accuracy_reward': 
0.2410714402794838, 'rewards/format_reward': 1.0, 'reward': 1.2410714626312256, 'reward_std': 0.2993997037410736, 'kl': 0.0556640625, 'epoch': 0.34} 7%|▋ | 110/1610 [23:46<9:22:45, 22.51s/it] 7%|▋ | 111/1610 [24:06<9:01:43, 21.68s/it] {'loss': 0.0029, 'grad_norm': 2.449564100411282, 'learning_rate': 9.310559006211179e-07, 'completion_length': 78.6964340209961, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.3844357877969742, 'kl': 0.07177734375, 'epoch': 0.34} 7%|▋ | 111/1610 [24:06<9:01:43, 21.68s/it] 7%|▋ | 112/1610 [24:29<9:12:55, 22.15s/it] {'loss': 0.0023, 'grad_norm': 1.6656036073136866, 'learning_rate': 9.304347826086955e-07, 'completion_length': 105.58929061889648, 'rewards/accuracy_reward': 0.2500000149011612, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2142857909202576, 'reward_std': 0.25583307445049286, 'kl': 0.0570068359375, 'epoch': 0.35} 7%|▋ | 112/1610 [24:29<9:12:55, 22.15s/it] 7%|▋ | 113/1610 [24:51<9:12:44, 22.15s/it] {'loss': 0.0032, 'grad_norm': 3.0431956203683215, 'learning_rate': 9.298136645962732e-07, 'completion_length': 94.35714721679688, 'rewards/accuracy_reward': 0.2321428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2142857909202576, 'reward_std': 0.2735324054956436, 'kl': 0.080078125, 'epoch': 0.35} 7%|▋ | 113/1610 [24:51<9:12:44, 22.15s/it] 7%|▋ | 114/1610 [25:15<9:22:16, 22.55s/it] {'loss': 0.0027, 'grad_norm': 2.1764601639242334, 'learning_rate': 9.291925465838509e-07, 'completion_length': 84.16964340209961, 'rewards/accuracy_reward': 0.232142873108387, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.2142857313156128, 'reward_std': 0.2907600998878479, 'kl': 0.0673828125, 'epoch': 0.35} 7%|▋ | 114/1610 [25:15<9:22:16, 22.55s/it] 7%|▋ | 115/1610 [25:38<9:27:43, 22.78s/it] {'loss': 0.0019, 'grad_norm': 1.7674296278519532, 'learning_rate': 9.285714285714285e-07, 'completion_length': 109.0089340209961, 
'rewards/accuracy_reward': 0.2142857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2053571939468384, 'reward_std': 0.30390138924121857, 'kl': 0.0479736328125, 'epoch': 0.36} 7%|▋ | 115/1610 [25:38<9:27:43, 22.78s/it] 7%|▋ | 116/1610 [25:59<9:13:30, 22.23s/it] {'loss': 0.0039, 'grad_norm': 1.9966842461637613, 'learning_rate': 9.279503105590062e-07, 'completion_length': 72.66964721679688, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3660715222358704, 'reward_std': 0.3279428333044052, 'kl': 0.0986328125, 'epoch': 0.36} 7%|▋ | 116/1610 [25:59<9:13:30, 22.23s/it] 7%|▋ | 117/1610 [26:22<9:20:06, 22.51s/it] {'loss': 0.0028, 'grad_norm': 2.28734643893794, 'learning_rate': 9.273291925465838e-07, 'completion_length': 84.8839340209961, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.3505062162876129, 'kl': 0.0692138671875, 'epoch': 0.36} 7%|▋ | 117/1610 [26:22<9:20:06, 22.51s/it] 7%|▋ | 118/1610 [26:42<9:00:56, 21.75s/it] {'loss': 0.0029, 'grad_norm': 3.919407379756498, 'learning_rate': 9.267080745341614e-07, 'completion_length': 69.23214721679688, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.2993997037410736, 'kl': 0.073486328125, 'epoch': 0.37} 7%|▋ | 118/1610 [26:42<9:00:56, 21.75s/it] 7%|▋ | 119/1610 [27:01<8:38:05, 20.85s/it] {'loss': 0.004, 'grad_norm': 3.264958208073755, 'learning_rate': 9.260869565217391e-07, 'completion_length': 50.20535850524902, 'rewards/accuracy_reward': 0.473214328289032, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.3273293375968933, 'kl': 0.1005859375, 'epoch': 0.37} 7%|▋ | 119/1610 [27:01<8:38:05, 20.85s/it] 7%|▋ | 120/1610 [27:22<8:37:36, 20.84s/it] {'loss': 0.0036, 'grad_norm': 1.9200387799499836, 'learning_rate': 9.254658385093167e-07, 'completion_length': 
68.61607551574707, 'rewards/accuracy_reward': 0.223214291036129, 'rewards/format_reward': 1.0, 'reward': 1.223214328289032, 'reward_std': 0.20922822505235672, 'kl': 0.08935546875, 'epoch': 0.37} 7%|▋ | 120/1610 [27:22<8:37:36, 20.84s/it] 8%|▊ | 121/1610 [27:44<8:46:52, 21.23s/it] {'loss': 0.0034, 'grad_norm': 3.0818488477261625, 'learning_rate': 9.248447204968943e-07, 'completion_length': 76.78571701049805, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3303571939468384, 'reward_std': 0.33246469497680664, 'kl': 0.084228515625, 'epoch': 0.38} 8%|▊ | 121/1610 [27:44<8:46:52, 21.23s/it] 8%|▊ | 122/1610 [28:04<8:42:23, 21.06s/it] {'loss': 0.0035, 'grad_norm': 2.955188993315093, 'learning_rate': 9.24223602484472e-07, 'completion_length': 66.53571891784668, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.3472169041633606, 'kl': 0.087646484375, 'epoch': 0.38} 8%|▊ | 122/1610 [28:04<8:42:23, 21.06s/it] 8%|▊ | 123/1610 [28:23<8:25:17, 20.39s/it] {'loss': 0.0038, 'grad_norm': 2.365864438915546, 'learning_rate': 9.236024844720497e-07, 'completion_length': 53.455360412597656, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.26450884342193604, 'kl': 0.094970703125, 'epoch': 0.38} 8%|▊ | 123/1610 [28:23<8:25:17, 20.39s/it] 8%|▊ | 124/1610 [28:42<8:13:57, 19.94s/it] {'loss': 0.0041, 'grad_norm': 1.6984216819501539, 'learning_rate': 9.229813664596273e-07, 'completion_length': 57.48214530944824, 'rewards/accuracy_reward': 0.223214291036129, 'rewards/format_reward': 1.0, 'reward': 1.223214328289032, 'reward_std': 0.19178561866283417, 'kl': 0.10302734375, 'epoch': 0.39} 8%|▊ | 124/1610 [28:42<8:13:57, 19.94s/it] 8%|▊ | 125/1610 [29:01<8:06:37, 19.66s/it] {'loss': 0.0027, 'grad_norm': 4.069980133963279, 'learning_rate': 9.22360248447205e-07, 'completion_length': 
65.56250190734863, 'rewards/accuracy_reward': 0.2232142984867096, 'rewards/format_reward': 1.0, 'reward': 1.2232143878936768, 'reward_std': 0.3111136853694916, 'kl': 0.066650390625, 'epoch': 0.39} 8%|▊ | 125/1610 [29:01<8:06:37, 19.66s/it] 8%|▊ | 126/1610 [29:21<8:10:06, 19.82s/it] {'loss': 0.0029, 'grad_norm': 2.864981135812964, 'learning_rate': 9.217391304347826e-07, 'completion_length': 68.67857360839844, 'rewards/accuracy_reward': 0.2142857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2053571939468384, 'reward_std': 0.2987862080335617, 'kl': 0.07177734375, 'epoch': 0.39} 8%|▊ | 126/1610 [29:21<8:10:06, 19.82s/it] 8%|▊ | 127/1610 [29:43<8:20:46, 20.26s/it] {'loss': 0.0032, 'grad_norm': 1.6698727231106796, 'learning_rate': 9.211180124223602e-07, 'completion_length': 80.33929061889648, 'rewards/accuracy_reward': 0.258928582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2500000596046448, 'reward_std': 0.19938187301158905, 'kl': 0.07958984375, 'epoch': 0.39} 8%|▊ | 127/1610 [29:43<8:20:46, 20.26s/it] 8%|▊ | 128/1610 [30:03<8:18:29, 20.18s/it] {'loss': 0.0037, 'grad_norm': 2.5683416053442674, 'learning_rate': 9.204968944099379e-07, 'completion_length': 66.50000381469727, 'rewards/accuracy_reward': 0.196428582072258, 'rewards/format_reward': 1.0, 'reward': 1.1964285969734192, 'reward_std': 0.2741458863019943, 'kl': 0.091552734375, 'epoch': 0.4} 8%|▊ | 128/1610 [30:03<8:18:29, 20.18s/it] 8%|▊ | 129/1610 [30:22<8:11:40, 19.92s/it] {'loss': 0.0027, 'grad_norm': 2.19891386620045, 'learning_rate': 9.198757763975155e-07, 'completion_length': 77.54464721679688, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.3026890158653259, 'kl': 0.0677490234375, 'epoch': 0.4} 8%|▊ | 129/1610 [30:22<8:11:40, 19.92s/it] 8%|▊ | 130/1610 [30:41<8:05:06, 19.67s/it] {'loss': 0.0034, 'grad_norm': 10.72633756795803, 'learning_rate': 9.19254658385093e-07, 'completion_length': 
63.91071701049805, 'rewards/accuracy_reward': 0.4464286118745804, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.2858598530292511, 'kl': 0.0841064453125, 'epoch': 0.4} 8%|▊ | 130/1610 [30:41<8:05:06, 19.67s/it] 8%|▊ | 131/1610 [31:01<8:09:43, 19.87s/it] {'loss': 0.0035, 'grad_norm': 1.7068244563995705, 'learning_rate': 9.186335403726707e-07, 'completion_length': 64.96428680419922, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035714626312256, 'reward_std': 0.17433738708496094, 'kl': 0.08837890625, 'epoch': 0.41} 8%|▊ | 131/1610 [31:01<8:09:43, 19.87s/it] 8%|▊ | 132/1610 [31:23<8:21:03, 20.34s/it] {'loss': 0.0036, 'grad_norm': 2.025104193117115, 'learning_rate': 9.180124223602484e-07, 'completion_length': 77.03571701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.2831694334745407, 'kl': 0.091064453125, 'epoch': 0.41} 8%|▊ | 132/1610 [31:23<8:21:03, 20.34s/it] 8%|▊ | 133/1610 [31:42<8:14:46, 20.10s/it] {'loss': 0.0041, 'grad_norm': 1.6612103461895535, 'learning_rate': 9.17391304347826e-07, 'completion_length': 60.66964530944824, 'rewards/accuracy_reward': 0.2321428656578064, 'rewards/format_reward': 1.0, 'reward': 1.2321429252624512, 'reward_std': 0.11272923648357391, 'kl': 0.1025390625, 'epoch': 0.41} 8%|▊ | 133/1610 [31:42<8:14:46, 20.10s/it] 8%|▊ | 134/1610 [32:05<8:33:39, 20.88s/it] {'loss': 0.0028, 'grad_norm': 1.9587311056159118, 'learning_rate': 9.167701863354037e-07, 'completion_length': 114.18750381469727, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.285714328289032, 'reward_std': 0.35198988020420074, 'kl': 0.069091796875, 'epoch': 0.42} 8%|▊ | 134/1610 [32:05<8:33:39, 20.88s/it] 8%|▊ | 135/1610 [32:24<8:19:50, 20.33s/it] {'loss': 0.0033, 'grad_norm': 2.8612852010819774, 'learning_rate': 9.161490683229813e-07, 'completion_length': 67.13393020629883, 
'rewards/accuracy_reward': 0.3392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.2377769947052002, 'kl': 0.08251953125, 'epoch': 0.42} 8%|▊ | 135/1610 [32:24<8:19:50, 20.33s/it] 8%|▊ | 136/1610 [32:44<8:16:23, 20.21s/it] {'loss': 0.0039, 'grad_norm': 2.2193770385916025, 'learning_rate': 9.155279503105589e-07, 'completion_length': 73.31250381469727, 'rewards/accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.321428656578064, 'reward_std': 0.26243188977241516, 'kl': 0.097900390625, 'epoch': 0.42} 8%|▊ | 136/1610 [32:44<8:16:23, 20.21s/it] 9%|▊ | 137/1610 [33:04<8:13:11, 20.09s/it] {'loss': 0.0031, 'grad_norm': 2.409742822431569, 'learning_rate': 9.149068322981366e-07, 'completion_length': 74.01786041259766, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.321428656578064, 'reward_std': 0.11272924393415451, 'kl': 0.076904296875, 'epoch': 0.43} 9%|▊ | 137/1610 [33:04<8:13:11, 20.09s/it] 9%|▊ | 138/1610 [33:27<8:37:10, 21.08s/it] {'loss': 0.0025, 'grad_norm': 2.2371096393615497, 'learning_rate': 9.142857142857142e-07, 'completion_length': 96.3660774230957, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.2767857909202576, 'reward_std': 0.29098063707351685, 'kl': 0.0616455078125, 'epoch': 0.43} 9%|▊ | 138/1610 [33:27<8:37:10, 21.08s/it] 9%|▊ | 139/1610 [33:49<8:41:32, 21.27s/it] {'loss': 0.003, 'grad_norm': 2.0456457813790774, 'learning_rate': 9.136645962732918e-07, 'completion_length': 100.08036041259766, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910714626312256, 'reward_std': 0.28195707499980927, 'kl': 0.073974609375, 'epoch': 0.43} 9%|▊ | 139/1610 [33:49<8:41:32, 21.27s/it] 9%|▊ | 140/1610 [34:11<8:51:01, 21.67s/it] {'loss': 0.0023, 'grad_norm': 1.8250625535051093, 'learning_rate': 9.130434782608695e-07, 'completion_length': 102.34821701049805, 
'rewards/accuracy_reward': 0.3392857238650322, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.3123260587453842, 'kl': 0.0582275390625, 'epoch': 0.43} 9%|▊ | 140/1610 [34:11<8:51:01, 21.67s/it] 9%|▉ | 141/1610 [34:33<8:49:55, 21.64s/it] {'loss': 0.003, 'grad_norm': 2.8011919674916252, 'learning_rate': 9.124223602484472e-07, 'completion_length': 86.2410774230957, 'rewards/accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.446428656578064, 'reward_std': 0.28437621891498566, 'kl': 0.0738525390625, 'epoch': 0.44} 9%|▉ | 141/1610 [34:33<8:49:55, 21.64s/it] 9%|▉ | 142/1610 [34:54<8:44:14, 21.43s/it] {'loss': 0.0022, 'grad_norm': 2.341276368280309, 'learning_rate': 9.118012422360248e-07, 'completion_length': 87.96429061889648, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.3045148700475693, 'kl': 0.0560302734375, 'epoch': 0.44} 9%|▉ | 142/1610 [34:54<8:44:14, 21.43s/it] 9%|▉ | 143/1610 [35:16<8:52:03, 21.76s/it] {'loss': 0.0027, 'grad_norm': 3.1988856664008347, 'learning_rate': 9.111801242236025e-07, 'completion_length': 93.89286422729492, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.321428656578064, 'reward_std': 0.3381787836551666, 'kl': 0.068359375, 'epoch': 0.44} 9%|▉ | 143/1610 [35:16<8:52:03, 21.76s/it] 9%|▉ | 144/1610 [35:39<8:56:46, 21.97s/it] {'loss': 0.0026, 'grad_norm': 1.9456761953767794, 'learning_rate': 9.105590062111801e-07, 'completion_length': 77.82143020629883, 'rewards/accuracy_reward': 0.3303571715950966, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.22754104435443878, 'kl': 0.06396484375, 'epoch': 0.45} 9%|▉ | 144/1610 [35:39<8:56:46, 21.97s/it] 9%|▉ | 145/1610 [35:59<8:39:05, 21.26s/it] {'loss': 0.0025, 'grad_norm': 1.7183665016524763, 'learning_rate': 9.099378881987577e-07, 'completion_length': 89.54464721679688, 
'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 1.0, 'reward': 1.598214328289032, 'reward_std': 0.2248506247997284, 'kl': 0.0618896484375, 'epoch': 0.45} 9%|▉ | 145/1610 [35:59<8:39:05, 21.26s/it] 9%|▉ | 146/1610 [36:23<9:05:22, 22.35s/it] {'loss': 0.0024, 'grad_norm': 1.4417251950594483, 'learning_rate': 9.093167701863354e-07, 'completion_length': 147.5, 'rewards/accuracy_reward': 0.2589285895228386, 'rewards/format_reward': 0.973214328289032, 'reward': 1.2321428656578064, 'reward_std': 0.28437623381614685, 'kl': 0.059814453125, 'epoch': 0.45} 9%|▉ | 146/1610 [36:23<9:05:22, 22.35s/it] 9%|▉ | 147/1610 [36:48<9:18:13, 22.89s/it] {'loss': 0.002, 'grad_norm': 6.23064088155481, 'learning_rate': 9.08695652173913e-07, 'completion_length': 145.76786422729492, 'rewards/accuracy_reward': 0.294642873108387, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.2589285969734192, 'reward_std': 0.3147451877593994, 'kl': 0.04931640625, 'epoch': 0.46} 9%|▉ | 147/1610 [36:48<9:18:13, 22.89s/it] 9%|▉ | 148/1610 [37:09<9:09:51, 22.57s/it] {'loss': 0.002, 'grad_norm': 1.433902281739846, 'learning_rate': 9.080745341614906e-07, 'completion_length': 106.47321701049805, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.20740239322185516, 'kl': 0.0498046875, 'epoch': 0.46} 9%|▉ | 148/1610 [37:09<9:09:51, 22.57s/it] 9%|▉ | 149/1610 [37:32<9:12:09, 22.68s/it] {'loss': 0.0027, 'grad_norm': 2.557795731363117, 'learning_rate': 9.074534161490683e-07, 'completion_length': 109.2589340209961, 'rewards/accuracy_reward': 0.3660714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.3018188178539276, 'kl': 0.068603515625, 'epoch': 0.46} 9%|▉ | 149/1610 [37:32<9:12:09, 22.68s/it] 9%|▉ | 150/1610 [37:55<9:14:49, 22.80s/it] {'loss': 0.0022, 'grad_norm': 5.917478810177446, 'learning_rate': 9.06832298136646e-07, 'completion_length': 129.50000762939453, 
'rewards/accuracy_reward': 0.3303571715950966, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3125000596046448, 'reward_std': 0.3863392323255539, 'kl': 0.0557861328125, 'epoch': 0.47} 9%|▉ | 150/1610 [37:55<9:14:49, 22.80s/it] 9%|▉ | 151/1610 [38:18<9:15:16, 22.84s/it] {'loss': 0.0025, 'grad_norm': 1.57234530795205, 'learning_rate': 9.062111801242236e-07, 'completion_length': 134.3660774230957, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3035714626312256, 'reward_std': 0.2948834300041199, 'kl': 0.06298828125, 'epoch': 0.47} 9%|▉ | 151/1610 [38:18<9:15:16, 22.84s/it] 9%|▉ | 152/1610 [38:43<9:27:26, 23.35s/it] {'loss': 0.0026, 'grad_norm': 2.182948192292322, 'learning_rate': 9.055900621118013e-07, 'completion_length': 136.77678680419922, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.446428656578064, 'reward_std': 0.3174411952495575, 'kl': 0.0640869140625, 'epoch': 0.47} 9%|▉ | 152/1610 [38:43<9:27:26, 23.35s/it] 10%|▉ | 153/1610 [39:07<9:33:05, 23.60s/it] {'loss': 0.0023, 'grad_norm': 1.432760446891129, 'learning_rate': 9.049689440993789e-07, 'completion_length': 148.9464340209961, 'rewards/accuracy_reward': 0.5178571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.508928656578064, 'reward_std': 0.3348894566297531, 'kl': 0.058349609375, 'epoch': 0.48} 10%|▉ | 153/1610 [39:07<9:33:05, 23.60s/it] 10%|▉ | 154/1610 [39:30<9:31:22, 23.55s/it] {'loss': 0.0021, 'grad_norm': 2.2345373159895483, 'learning_rate': 9.043478260869564e-07, 'completion_length': 140.01786041259766, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3660714626312256, 'reward_std': 0.3342759907245636, 'kl': 0.053466796875, 'epoch': 0.48} 10%|▉ | 154/1610 [39:30<9:31:22, 23.55s/it] 10%|▉ | 155/1610 [39:54<9:32:25, 23.60s/it] {'loss': 0.0021, 'grad_norm': 1.6520490757323136, 'learning_rate': 
9.037267080745341e-07, 'completion_length': 144.5089340209961, 'rewards/accuracy_reward': 0.223214291036129, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2053571939468384, 'reward_std': 0.2597358673810959, 'kl': 0.05224609375, 'epoch': 0.48} 10%|▉ | 155/1610 [39:54<9:32:25, 23.60s/it] 10%|▉ | 156/1610 [40:17<9:25:13, 23.32s/it] {'loss': 0.0023, 'grad_norm': 2.4677939749322335, 'learning_rate': 9.031055900621117e-07, 'completion_length': 132.58929061889648, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3660715222358704, 'reward_std': 0.35893090069293976, 'kl': 0.0579833984375, 'epoch': 0.48} 10%|▉ | 156/1610 [40:17<9:25:13, 23.32s/it] 10%|▉ | 157/1610 [40:41<9:28:25, 23.47s/it] {'loss': 0.0017, 'grad_norm': 2.142196491338247, 'learning_rate': 9.024844720496893e-07, 'completion_length': 143.58929443359375, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3392857313156128, 'reward_std': 0.4347340911626816, 'kl': 0.043212890625, 'epoch': 0.49} 10%|▉ | 157/1610 [40:41<9:28:25, 23.47s/it] 10%|▉ | 158/1610 [41:05<9:32:45, 23.67s/it] {'loss': 0.0024, 'grad_norm': 1.2546496304327637, 'learning_rate': 9.01863354037267e-07, 'completion_length': 156.8928680419922, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.973214328289032, 'reward': 1.3660714626312256, 'reward_std': 0.2993996888399124, 'kl': 0.0596923828125, 'epoch': 0.49} 10%|▉ | 158/1610 [41:05<9:32:45, 23.67s/it] 10%|▉ | 159/1610 [41:28<9:30:02, 23.57s/it] {'loss': 0.0022, 'grad_norm': 3.4341304997333504, 'learning_rate': 9.012422360248447e-07, 'completion_length': 117.33929061889648, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4375000596046448, 'reward_std': 0.3396568149328232, 'kl': 0.0556640625, 'epoch': 0.49} 10%|▉ | 159/1610 [41:28<9:30:02, 23.57s/it] 10%|▉ | 160/1610 [41:52<9:28:59, 23.54s/it] {'loss': 
0.002, 'grad_norm': 1.4204799332499036, 'learning_rate': 9.006211180124223e-07, 'completion_length': 162.17857360839844, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3303571939468384, 'reward_std': 0.2987862229347229, 'kl': 0.049072265625, 'epoch': 0.5} 10%|▉ | 160/1610 [41:52<9:28:59, 23.54s/it] 10%|█ | 161/1610 [42:15<9:29:26, 23.58s/it] {'loss': 0.0019, 'grad_norm': 5.381764057982173, 'learning_rate': 9e-07, 'completion_length': 134.69643783569336, 'rewards/accuracy_reward': 0.2142857201397419, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.1964285969734192, 'reward_std': 0.3219631016254425, 'kl': 0.0462646484375, 'epoch': 0.5} 10%|█ | 161/1610 [42:15<9:29:26, 23.58s/it] 10%|█ | 162/1610 [42:39<9:27:28, 23.51s/it] {'loss': 0.0021, 'grad_norm': 0.8429713347146969, 'learning_rate': 8.993788819875776e-07, 'completion_length': 137.15179061889648, 'rewards/accuracy_reward': 0.1696428656578064, 'rewards/format_reward': 1.0, 'reward': 1.1696429252624512, 'reward_std': 0.13736958801746368, 'kl': 0.0526123046875, 'epoch': 0.5} 10%|█ | 162/1610 [42:39<9:27:28, 23.51s/it] 10%|█ | 163/1610 [43:01<9:19:25, 23.20s/it] {'loss': 0.0023, 'grad_norm': 2.332695857788542, 'learning_rate': 8.987577639751552e-07, 'completion_length': 136.18750381469727, 'rewards/accuracy_reward': 0.473214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.41810867190361023, 'kl': 0.0579833984375, 'epoch': 0.51} 10%|█ | 163/1610 [43:01<9:19:25, 23.20s/it] 10%|█ | 164/1610 [43:24<9:13:45, 22.98s/it] {'loss': 0.0019, 'grad_norm': 2.111250101279829, 'learning_rate': 8.981366459627329e-07, 'completion_length': 108.25000381469727, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4017857909202576, 'reward_std': 0.40307384729385376, 'kl': 0.0479736328125, 'epoch': 0.51} 10%|█ | 164/1610 [43:24<9:13:45, 22.98s/it] 10%|█ | 165/1610 
[43:48<9:25:14, 23.47s/it] {'loss': 0.0027, 'grad_norm': 2.1905106899088986, 'learning_rate': 8.975155279503105e-07, 'completion_length': 152.5089340209961, 'rewards/accuracy_reward': 0.1964285746216774, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.160714328289032, 'reward_std': 0.32976868748664856, 'kl': 0.0675048828125, 'epoch': 0.51} 10%|█ | 165/1610 [43:48<9:25:14, 23.47s/it] 10%|█ | 166/1610 [44:10<9:16:00, 23.10s/it] {'loss': 0.0022, 'grad_norm': 2.671261324818735, 'learning_rate': 8.968944099378881e-07, 'completion_length': 126.48215103149414, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.18397442996501923, 'kl': 0.0540771484375, 'epoch': 0.52} 10%|█ | 166/1610 [44:10<9:16:00, 23.10s/it] 10%|█ | 167/1610 [44:33<9:13:17, 23.01s/it] {'loss': 0.0019, 'grad_norm': 1.5920648137395181, 'learning_rate': 8.962732919254658e-07, 'completion_length': 143.92858123779297, 'rewards/accuracy_reward': 0.2767857313156128, 'rewards/format_reward': 1.0, 'reward': 1.2767857909202576, 'reward_std': 0.32013723254203796, 'kl': 0.04736328125, 'epoch': 0.52} 10%|█ | 167/1610 [44:33<9:13:17, 23.01s/it] 10%|█ | 168/1610 [44:58<9:24:41, 23.50s/it] {'loss': 0.0021, 'grad_norm': 1.3483125081769771, 'learning_rate': 8.956521739130435e-07, 'completion_length': 143.25000381469727, 'rewards/accuracy_reward': 0.258928582072258, 'rewards/format_reward': 0.973214328289032, 'reward': 1.2321429252624512, 'reward_std': 0.28768008947372437, 'kl': 0.052490234375, 'epoch': 0.52} 10%|█ | 168/1610 [44:58<9:24:41, 23.50s/it] 10%|█ | 169/1610 [45:21<9:23:26, 23.46s/it] {'loss': 0.002, 'grad_norm': 1.5135512234235429, 'learning_rate': 8.950310559006211e-07, 'completion_length': 147.21429061889648, 'rewards/accuracy_reward': 0.223214291036129, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2142857909202576, 'reward_std': 0.22875342518091202, 'kl': 0.05078125, 'epoch': 0.52} 10%|█ | 169/1610 [45:21<9:23:26, 
23.46s/it] 11%|█ | 170/1610 [45:46<9:30:55, 23.79s/it] {'loss': 0.0021, 'grad_norm': 4.501180638509113, 'learning_rate': 8.944099378881988e-07, 'completion_length': 145.76786041259766, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.955357164144516, 'reward': 1.4642857909202576, 'reward_std': 0.3932082802057266, 'kl': 0.0513916015625, 'epoch': 0.53} 11%|█ | 170/1610 [45:46<9:30:55, 23.79s/it] 11%|█ | 171/1610 [46:09<9:22:43, 23.46s/it] {'loss': 0.0024, 'grad_norm': 2.3333072643595374, 'learning_rate': 8.937888198757764e-07, 'completion_length': 123.81250381469727, 'rewards/accuracy_reward': 0.3303571492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3125000596046448, 'reward_std': 0.41542385518550873, 'kl': 0.059326171875, 'epoch': 0.53} 11%|█ | 171/1610 [46:09<9:22:43, 23.46s/it] 11%|█ | 172/1610 [46:32<9:21:46, 23.44s/it] {'loss': 0.0024, 'grad_norm': 1.6338717605715392, 'learning_rate': 8.93167701863354e-07, 'completion_length': 136.23214721679688, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3482143878936768, 'reward_std': 0.3246389478445053, 'kl': 0.059326171875, 'epoch': 0.53} 11%|█ | 172/1610 [46:32<9:21:46, 23.44s/it] 11%|█ | 173/1610 [46:55<9:18:37, 23.32s/it] {'loss': 0.0024, 'grad_norm': 2.363891827011979, 'learning_rate': 8.925465838509317e-07, 'completion_length': 118.96429061889648, 'rewards/accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.3174412250518799, 'kl': 0.060302734375, 'epoch': 0.54} 11%|█ | 173/1610 [46:55<9:18:37, 23.32s/it] 11%|█ | 174/1610 [47:18<9:19:16, 23.37s/it] {'loss': 0.0025, 'grad_norm': 1.4511511957086112, 'learning_rate': 8.919254658385093e-07, 'completion_length': 111.6339340209961, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.2630252093076706, 'kl': 0.0618896484375, 
'epoch': 0.54} 11%|█ | 174/1610 [47:18<9:19:16, 23.37s/it] 11%|█ | 175/1610 [47:41<9:13:54, 23.16s/it] {'loss': 0.0026, 'grad_norm': 1.3997887111156346, 'learning_rate': 8.913043478260869e-07, 'completion_length': 119.93750762939453, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.285714328289032, 'reward_std': 0.22875341773033142, 'kl': 0.064453125, 'epoch': 0.54} 11%|█ | 175/1610 [47:41<9:13:54, 23.16s/it] 11%|█ | 176/1610 [48:04<9:11:56, 23.09s/it] {'loss': 0.0026, 'grad_norm': 2.06163215199048, 'learning_rate': 8.906832298136646e-07, 'completion_length': 102.2410774230957, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.3072052597999573, 'kl': 0.06396484375, 'epoch': 0.55} 11%|█ | 176/1610 [48:04<9:11:56, 23.09s/it] 11%|█ | 177/1610 [48:27<9:11:43, 23.10s/it] {'loss': 0.0023, 'grad_norm': 2.709940874275261, 'learning_rate': 8.900621118012423e-07, 'completion_length': 113.97322082519531, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.32976868748664856, 'kl': 0.0574951171875, 'epoch': 0.55} 11%|█ | 177/1610 [48:27<9:11:43, 23.10s/it] 11%|█ | 178/1610 [48:51<9:19:05, 23.43s/it] {'loss': 0.002, 'grad_norm': 2.109574423498264, 'learning_rate': 8.894409937888198e-07, 'completion_length': 144.29464721679688, 'rewards/accuracy_reward': 0.2053571566939354, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.1875, 'reward_std': 0.3032490015029907, 'kl': 0.05126953125, 'epoch': 0.55} 11%|█ | 178/1610 [48:51<9:19:05, 23.43s/it] 11%|█ | 179/1610 [49:14<9:13:34, 23.21s/it] {'loss': 0.0032, 'grad_norm': 11.506845724403451, 'learning_rate': 8.888198757763975e-07, 'completion_length': 98.56250381469727, 'rewards/accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3125000596046448, 'reward_std': 0.2804734334349632, 'kl': 
0.080810546875, 'epoch': 0.56} 11%|█ | 179/1610 [49:14<9:13:34, 23.21s/it] 11%|█ | 180/1610 [49:38<9:16:12, 23.34s/it] {'loss': 0.0023, 'grad_norm': 2.5086180022079305, 'learning_rate': 8.881987577639751e-07, 'completion_length': 136.99107360839844, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 0.973214328289032, 'reward': 1.2410714626312256, 'reward_std': 0.3585097938776016, 'kl': 0.05810546875, 'epoch': 0.56} 11%|█ | 180/1610 [49:38<9:16:12, 23.34s/it] 11%|█ | 181/1610 [50:01<9:14:45, 23.29s/it] {'loss': 0.002, 'grad_norm': 2.1968778300271907, 'learning_rate': 8.875776397515527e-07, 'completion_length': 130.9732208251953, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.3033080995082855, 'kl': 0.0499267578125, 'epoch': 0.56} 11%|█ | 181/1610 [50:01<9:14:45, 23.29s/it] 11%|█▏ | 182/1610 [50:24<9:13:53, 23.27s/it] {'loss': 0.0026, 'grad_norm': 2.8657008696250252, 'learning_rate': 8.869565217391303e-07, 'completion_length': 117.16965103149414, 'rewards/accuracy_reward': 0.285714291036129, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2767857909202576, 'reward_std': 0.2020159661769867, 'kl': 0.063720703125, 'epoch': 0.57} 11%|█▏ | 182/1610 [50:24<9:13:53, 23.27s/it] 11%|█▏ | 183/1610 [50:49<9:22:41, 23.66s/it] {'loss': 0.0021, 'grad_norm': 6.144079586368972, 'learning_rate': 8.86335403726708e-07, 'completion_length': 146.7857208251953, 'rewards/accuracy_reward': 0.267857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2500000596046448, 'reward_std': 0.29670368134975433, 'kl': 0.052490234375, 'epoch': 0.57} 11%|█▏ | 183/1610 [50:49<9:22:41, 23.66s/it] 11%|█▏ | 184/1610 [51:11<9:14:13, 23.32s/it] {'loss': 0.0018, 'grad_norm': 2.296868071351832, 'learning_rate': 8.857142857142856e-07, 'completion_length': 134.58036422729492, 'rewards/accuracy_reward': 0.2857142873108387, 'rewards/format_reward': 1.0, 'reward': 
1.2857143878936768, 'reward_std': 0.1956884115934372, 'kl': 0.0458984375, 'epoch': 0.57} 11%|█▏ | 184/1610 [51:11<9:14:13, 23.32s/it] 11%|█▏ | 185/1610 [51:34<9:13:33, 23.31s/it] {'loss': 0.0021, 'grad_norm': 1.751048632052475, 'learning_rate': 8.850931677018632e-07, 'completion_length': 142.29465103149414, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3660714626312256, 'reward_std': 0.3144085854291916, 'kl': 0.05322265625, 'epoch': 0.57} 11%|█▏ | 185/1610 [51:34<9:13:33, 23.31s/it] 12%|█▏ | 186/1610 [51:57<9:10:14, 23.18s/it] {'loss': 0.002, 'grad_norm': 1.9559076888757012, 'learning_rate': 8.84472049689441e-07, 'completion_length': 125.10715103149414, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.21638144552707672, 'kl': 0.0491943359375, 'epoch': 0.58} 12%|█▏ | 186/1610 [51:57<9:10:14, 23.18s/it] 12%|█▏ | 187/1610 [52:21<9:14:40, 23.39s/it] {'loss': 0.0018, 'grad_norm': 1.8141445261410365, 'learning_rate': 8.838509316770186e-07, 'completion_length': 126.14286041259766, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553572535514832, 'reward_std': 0.26506881415843964, 'kl': 0.0452880859375, 'epoch': 0.58} 12%|█▏ | 187/1610 [52:21<9:14:40, 23.39s/it] 12%|█▏ | 188/1610 [52:44<9:13:00, 23.33s/it] {'loss': 0.0022, 'grad_norm': 2.064421245615263, 'learning_rate': 8.832298136645962e-07, 'completion_length': 149.91965103149414, 'rewards/accuracy_reward': 0.3660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.348214328289032, 'reward_std': 0.2954912930727005, 'kl': 0.05517578125, 'epoch': 0.58} 12%|█▏ | 188/1610 [52:44<9:13:00, 23.33s/it] 12%|█▏ | 189/1610 [53:08<9:17:34, 23.54s/it] {'loss': 0.0017, 'grad_norm': 1.6526223781676774, 'learning_rate': 8.826086956521739e-07, 'completion_length': 121.66072082519531, 'rewards/accuracy_reward': 
0.2857142984867096, 'rewards/format_reward': 0.973214328289032, 'reward': 1.2589285969734192, 'reward_std': 0.29258592426776886, 'kl': 0.041748046875, 'epoch': 0.59} 12%|█▏ | 189/1610 [53:08<9:17:34, 23.54s/it] 12%|█▏ | 190/1610 [53:30<9:02:52, 22.94s/it] {'loss': 0.0026, 'grad_norm': 3.882187979957121, 'learning_rate': 8.819875776397515e-07, 'completion_length': 94.29464721679688, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.244989275932312, 'kl': 0.06396484375, 'epoch': 0.59} 12%|█▏ | 190/1610 [53:30<9:02:52, 22.94s/it] 12%|█▏ | 191/1610 [53:53<9:02:18, 22.93s/it] {'loss': 0.0021, 'grad_norm': 1.6932184829635948, 'learning_rate': 8.813664596273291e-07, 'completion_length': 125.22322463989258, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.535714328289032, 'reward_std': 0.2669336050748825, 'kl': 0.052978515625, 'epoch': 0.59} 12%|█▏ | 191/1610 [53:53<9:02:18, 22.93s/it] 12%|█▏ | 192/1610 [54:16<9:02:38, 22.96s/it] {'loss': 0.0023, 'grad_norm': 1.2715000219474046, 'learning_rate': 8.807453416149068e-07, 'completion_length': 111.26786422729492, 'rewards/accuracy_reward': 0.3660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.3660715222358704, 'reward_std': 0.15360544621944427, 'kl': 0.0577392578125, 'epoch': 0.6} 12%|█▏ | 192/1610 [54:16<9:02:38, 22.96s/it] 12%|█▏ | 193/1610 [54:37<8:48:55, 22.40s/it] {'loss': 0.0019, 'grad_norm': 1.906937755012969, 'learning_rate': 8.801242236024844e-07, 'completion_length': 109.67857360839844, 'rewards/accuracy_reward': 0.348214291036129, 'rewards/format_reward': 1.0, 'reward': 1.3482143878936768, 'reward_std': 0.30390140414237976, 'kl': 0.046875, 'epoch': 0.6} 12%|█▏ | 193/1610 [54:37<8:48:55, 22.40s/it] 12%|█▏ | 194/1610 [55:00<8:55:23, 22.69s/it] {'loss': 0.0018, 'grad_norm': 2.256310937752189, 'learning_rate': 8.79503105590062e-07, 'completion_length': 118.10715103149414, 
'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553572535514832, 'reward_std': 0.26281585544347763, 'kl': 0.0440673828125, 'epoch': 0.6} 12%|█▏ | 194/1610 [55:00<8:55:23, 22.69s/it] 12%|█▏ | 195/1610 [55:23<8:53:01, 22.60s/it] {'loss': 0.002, 'grad_norm': 1.6640189830597518, 'learning_rate': 8.788819875776398e-07, 'completion_length': 117.87500381469727, 'rewards/accuracy_reward': 0.339285746216774, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.17885926365852356, 'kl': 0.0496826171875, 'epoch': 0.61} 12%|█▏ | 195/1610 [55:23<8:53:01, 22.60s/it] 12%|█▏ | 196/1610 [55:45<8:51:50, 22.57s/it] {'loss': 0.0018, 'grad_norm': 1.4237664231166727, 'learning_rate': 8.782608695652174e-07, 'completion_length': 121.25000762939453, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.3402702808380127, 'kl': 0.04541015625, 'epoch': 0.61} 12%|█▏ | 196/1610 [55:45<8:51:50, 22.57s/it] 12%|█▏ | 197/1610 [56:06<8:40:31, 22.10s/it] {'loss': 0.0021, 'grad_norm': 1.669432775257197, 'learning_rate': 8.77639751552795e-07, 'completion_length': 96.41964721679688, 'rewards/accuracy_reward': 0.267857164144516, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.22215460240840912, 'kl': 0.0533447265625, 'epoch': 0.61} 12%|█▏ | 197/1610 [56:06<8:40:31, 22.10s/it] 12%|█▏ | 198/1610 [56:28<8:39:43, 22.08s/it] {'loss': 0.0039, 'grad_norm': 2.391252264955947, 'learning_rate': 8.770186335403727e-07, 'completion_length': 79.36607360839844, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.3525831550359726, 'kl': 0.098876953125, 'epoch': 0.61} 12%|█▏ | 198/1610 [56:28<8:39:43, 22.08s/it] 12%|█▏ | 199/1610 [56:52<8:53:10, 22.67s/it] {'loss': 0.0021, 'grad_norm': 1.3134875928255354, 'learning_rate': 8.763975155279503e-07, 'completion_length': 
124.30357360839844, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3303571939468384, 'reward_std': 0.28707222640514374, 'kl': 0.05322265625, 'epoch': 0.62} 12%|█▏ | 199/1610 [56:52<8:53:10, 22.67s/it] 12%|█▏ | 200/1610 [57:16<9:02:45, 23.10s/it] {'loss': 0.002, 'grad_norm': 1.8277776319242829, 'learning_rate': 8.757763975155279e-07, 'completion_length': 121.81250381469727, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.3375263959169388, 'kl': 0.0506591796875, 'epoch': 0.62} 12%|█▏ | 200/1610 [57:16<9:02:45, 23.10s/it] 12%|█▏ | 201/1610 [58:26<14:29:33, 37.03s/it] {'loss': 0.0027, 'grad_norm': 2.3350753997190536, 'learning_rate': 8.751552795031055e-07, 'completion_length': 97.23214721679688, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 1.0, 'reward': 1.3482143878936768, 'reward_std': 0.26875944435596466, 'kl': 0.06640625, 'epoch': 0.62} 12%|█▏ | 201/1610 [58:26<14:29:33, 37.03s/it] 13%|█▎ | 202/1610 [58:46<12:31:34, 32.03s/it] {'loss': 0.0019, 'grad_norm': 1.2615244384224475, 'learning_rate': 8.745341614906831e-07, 'completion_length': 94.11607360839844, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 1.0, 'reward': 1.3482143878936768, 'reward_std': 0.14700662717223167, 'kl': 0.0474853515625, 'epoch': 0.63} 13%|█▎ | 202/1610 [58:46<12:31:34, 32.03s/it] 13%|█▎ | 203/1610 [59:08<11:14:30, 28.76s/it] {'loss': 0.0024, 'grad_norm': 1.101495003679478, 'learning_rate': 8.739130434782607e-07, 'completion_length': 93.53572082519531, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.1995968148112297, 'kl': 0.0596923828125, 'epoch': 0.63} 13%|█▎ | 203/1610 [59:08<11:14:30, 28.76s/it] 13%|█▎ | 204/1610 [59:28<10:16:27, 26.31s/it] {'loss': 0.0025, 'grad_norm': 1.766238197805906, 'learning_rate': 
8.732919254658385e-07, 'completion_length': 92.7589340209961, 'rewards/accuracy_reward': 0.3660714626312256, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.2564319968223572, 'kl': 0.0633544921875, 'epoch': 0.63} 13%|█▎ | 204/1610 [59:28<10:16:27, 26.31s/it] 13%|█▎ | 205/1610 [59:51<9:54:55, 25.41s/it] {'loss': 0.0025, 'grad_norm': 1.1567318735465795, 'learning_rate': 8.726708074534161e-07, 'completion_length': 102.77679061889648, 'rewards/accuracy_reward': 0.2767857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2678571939468384, 'reward_std': 0.17885924875736237, 'kl': 0.0618896484375, 'epoch': 0.64} 13%|█▎ | 205/1610 [59:51<9:54:55, 25.41s/it] 13%|█▎ | 206/1610 [1:00:13<9:28:41, 24.30s/it] {'loss': 0.002, 'grad_norm': 2.4385401173078396, 'learning_rate': 8.720496894409937e-07, 'completion_length': 98.09821701049805, 'rewards/accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3125000596046448, 'reward_std': 0.2987862080335617, 'kl': 0.0494384765625, 'epoch': 0.64} 13%|█▎ | 206/1610 [1:00:13<9:28:41, 24.30s/it] 13%|█▎ | 207/1610 [1:00:34<9:06:51, 23.39s/it] {'loss': 0.0019, 'grad_norm': 1.8463478755995983, 'learning_rate': 8.714285714285714e-07, 'completion_length': 109.3035774230957, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.2022872269153595, 'kl': 0.0472412109375, 'epoch': 0.64} 13%|█▎ | 207/1610 [1:00:34<9:06:51, 23.39s/it] 13%|█▎ | 208/1610 [1:00:56<8:51:50, 22.76s/it] {'loss': 0.0027, 'grad_norm': 1.4912096506279109, 'learning_rate': 8.70807453416149e-07, 'completion_length': 82.54464721679688, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.20020467042922974, 'kl': 0.0672607421875, 'epoch': 0.65} 13%|█▎ | 208/1610 [1:00:56<8:51:50, 22.76s/it] 13%|█▎ | 209/1610 [1:01:20<9:03:59, 23.30s/it] {'loss': 0.0021, 
'grad_norm': 1.919450414973495, 'learning_rate': 8.701863354037266e-07, 'completion_length': 127.01786041259766, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.410714328289032, 'reward_std': 0.3150164783000946, 'kl': 0.0521240234375, 'epoch': 0.65} 13%|█▎ | 209/1610 [1:01:20<9:03:59, 23.30s/it] 13%|█▎ | 210/1610 [1:01:41<8:42:40, 22.40s/it] {'loss': 0.0021, 'grad_norm': 1.6946720123282495, 'learning_rate': 8.695652173913043e-07, 'completion_length': 96.56250381469727, 'rewards/accuracy_reward': 0.4464286118745804, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4375001192092896, 'reward_std': 0.23265621066093445, 'kl': 0.0526123046875, 'epoch': 0.65} 13%|█▎ | 210/1610 [1:01:41<8:42:40, 22.40s/it] 13%|█▎ | 211/1610 [1:02:02<8:34:35, 22.07s/it] {'loss': 0.0025, 'grad_norm': 1.1499151169513613, 'learning_rate': 8.689440993788819e-07, 'completion_length': 87.12500381469727, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.20020466670393944, 'kl': 0.0626220703125, 'epoch': 0.66} 13%|█▎ | 211/1610 [1:02:02<8:34:35, 22.07s/it] 13%|█▎ | 212/1610 [1:02:25<8:38:50, 22.27s/it] {'loss': 0.0019, 'grad_norm': 1.3537328855203097, 'learning_rate': 8.683229813664595e-07, 'completion_length': 99.3839340209961, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.19838443398475647, 'kl': 0.047607421875, 'epoch': 0.66} 13%|█▎ | 212/1610 [1:02:25<8:38:50, 22.27s/it] 13%|█▎ | 213/1610 [1:02:47<8:39:49, 22.33s/it] {'loss': 0.002, 'grad_norm': 2.388746807493688, 'learning_rate': 8.677018633540373e-07, 'completion_length': 105.30357360839844, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.321428656578064, 'reward_std': 0.32976868748664856, 'kl': 0.0499267578125, 'epoch': 0.66} 13%|█▎ | 213/1610 [1:02:47<8:39:49, 22.33s/it] 13%|█▎ 
| 214/1610 [1:03:10<8:40:43, 22.38s/it] {'loss': 0.0021, 'grad_norm': 5.063317259618089, 'learning_rate': 8.670807453416149e-07, 'completion_length': 101.71429061889648, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.25583307445049286, 'kl': 0.0517578125, 'epoch': 0.66} 13%|█▎ | 214/1610 [1:03:10<8:40:43, 22.38s/it] 13%|█▎ | 215/1610 [1:03:32<8:40:33, 22.39s/it] {'loss': 0.002, 'grad_norm': 1.965726649435631, 'learning_rate': 8.664596273291925e-07, 'completion_length': 118.3035774230957, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.348214328289032, 'reward_std': 0.2597358673810959, 'kl': 0.0509033203125, 'epoch': 0.67} 13%|█▎ | 215/1610 [1:03:32<8:40:33, 22.39s/it] 13%|█▎ | 216/1610 [1:03:53<8:33:52, 22.12s/it] {'loss': 0.0018, 'grad_norm': 1.835028767199392, 'learning_rate': 8.658385093167702e-07, 'completion_length': 95.89286422729492, 'rewards/accuracy_reward': 0.196428582072258, 'rewards/format_reward': 1.0, 'reward': 1.1964285969734192, 'reward_std': 0.22363825142383575, 'kl': 0.0452880859375, 'epoch': 0.67} 13%|█▎ | 216/1610 [1:03:53<8:33:52, 22.12s/it] 13%|█▎ | 217/1610 [1:04:14<8:21:09, 21.59s/it] {'loss': 0.0017, 'grad_norm': 1.5426068671935595, 'learning_rate': 8.652173913043478e-07, 'completion_length': 96.28571701049805, 'rewards/accuracy_reward': 0.258928582072258, 'rewards/format_reward': 1.0, 'reward': 1.258928656578064, 'reward_std': 0.3318512290716171, 'kl': 0.043701171875, 'epoch': 0.67} 13%|█▎ | 217/1610 [1:04:14<8:21:09, 21.59s/it] 14%|█▎ | 218/1610 [1:04:37<8:32:41, 22.10s/it] {'loss': 0.0021, 'grad_norm': 1.007221087808832, 'learning_rate': 8.645962732919254e-07, 'completion_length': 121.21429061889648, 'rewards/accuracy_reward': 0.1428571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.1339285969734192, 'reward_std': 0.17104806005954742, 'kl': 0.051513671875, 'epoch': 0.68} 14%|█▎ | 218/1610 
[1:04:37<8:32:41, 22.10s/it] 14%|█▎ | 219/1610 [1:05:01<8:44:13, 22.61s/it] {'loss': 0.0015, 'grad_norm': 1.2235388997127046, 'learning_rate': 8.639751552795031e-07, 'completion_length': 143.0178680419922, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.20471198856830597, 'kl': 0.03662109375, 'epoch': 0.68} 14%|█▎ | 219/1610 [1:05:01<8:44:13, 22.61s/it] 14%|█▎ | 220/1610 [1:05:23<8:37:40, 22.35s/it] {'loss': 0.0027, 'grad_norm': 1.8094516786661945, 'learning_rate': 8.633540372670807e-07, 'completion_length': 98.28571701049805, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.24889205396175385, 'kl': 0.06640625, 'epoch': 0.68} 14%|█▎ | 220/1610 [1:05:23<8:37:40, 22.35s/it] 14%|█▎ | 221/1610 [1:05:45<8:41:04, 22.51s/it] {'loss': 0.0021, 'grad_norm': 1.028159206347135, 'learning_rate': 8.627329192546583e-07, 'completion_length': 118.18750381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.508928656578064, 'reward_std': 0.20411307364702225, 'kl': 0.053466796875, 'epoch': 0.69} 14%|█▎ | 221/1610 [1:05:45<8:41:04, 22.51s/it] 14%|█▍ | 222/1610 [1:06:08<8:42:37, 22.59s/it] {'loss': 0.0018, 'grad_norm': 1.538787628751029, 'learning_rate': 8.621118012422361e-07, 'completion_length': 114.55357360839844, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3035714626312256, 'reward_std': 0.2663402855396271, 'kl': 0.0447998046875, 'epoch': 0.69} 14%|█▍ | 222/1610 [1:06:08<8:42:37, 22.59s/it] 14%|█▍ | 223/1610 [1:06:30<8:34:33, 22.26s/it] {'loss': 0.0016, 'grad_norm': 1.2032523901593124, 'learning_rate': 8.614906832298137e-07, 'completion_length': 103.47321701049805, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.17226044833660126, 'kl': 
0.0411376953125, 'epoch': 0.69} 14%|█▍ | 223/1610 [1:06:30<8:34:33, 22.26s/it][2025-02-21 04:08:25,839] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 14%|█▍ | 224/1610 [1:06:53<8:37:32, 22.40s/it] {'loss': 0.0022, 'grad_norm': 2.445092647706787, 'learning_rate': 8.608695652173913e-07, 'completion_length': 105.83036041259766, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3928571939468384, 'reward_std': 0.2501044422388077, 'kl': 0.0557861328125, 'epoch': 0.7} 14%|█▍ | 224/1610 [1:06:53<8:37:32, 22.40s/it] 14%|█▍ | 225/1610 [1:07:12<8:19:38, 21.65s/it] {'loss': 0.0019, 'grad_norm': 1.359341910235483, 'learning_rate': 8.60248447204969e-07, 'completion_length': 110.87500381469727, 'rewards/accuracy_reward': 0.4196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.24498365819454193, 'kl': 0.047607421875, 'epoch': 0.7} 14%|█▍ | 225/1610 [1:07:12<8:19:38, 21.65s/it] 14%|█▍ | 226/1610 [1:07:36<8:34:34, 22.31s/it] {'loss': 0.0015, 'grad_norm': 1.4243133125757768, 'learning_rate': 8.596273291925465e-07, 'completion_length': 142.40179443359375, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5000000596046448, 'reward_std': 0.27331747114658356, 'kl': 0.0367431640625, 'epoch': 0.7} 14%|█▍ | 226/1610 [1:07:36<8:34:34, 22.31s/it] 14%|█▍ | 227/1610 [1:08:01<8:49:52, 22.99s/it] {'loss': 0.0019, 'grad_norm': 1.0904456046968758, 'learning_rate': 8.590062111801241e-07, 'completion_length': 150.7589340209961, 'rewards/accuracy_reward': 
0.321428582072258, 'rewards/format_reward': 0.973214328289032, 'reward': 1.2946428656578064, 'reward_std': 0.23057925701141357, 'kl': 0.046630859375, 'epoch': 0.7} 14%|█▍ | 227/1610 [1:08:01<8:49:52, 22.99s/it] 14%|█▍ | 228/1610 [1:08:24<8:48:48, 22.96s/it] {'loss': 0.0016, 'grad_norm': 1.1724857438388971, 'learning_rate': 8.583850931677018e-07, 'completion_length': 121.08929061889648, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857313156128, 'reward_std': 0.16323687136173248, 'kl': 0.041015625, 'epoch': 0.71} 14%|█▍ | 228/1610 [1:08:24<8:48:48, 22.96s/it] 14%|█▍ | 229/1610 [1:08:47<8:50:33, 23.05s/it] {'loss': 0.0019, 'grad_norm': 2.271775316720809, 'learning_rate': 8.577639751552794e-07, 'completion_length': 124.73215103149414, 'rewards/accuracy_reward': 0.5178571939468384, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.5000001192092896, 'reward_std': 0.3123260587453842, 'kl': 0.04638671875, 'epoch': 0.71} 14%|█▍ | 229/1610 [1:08:47<8:50:33, 23.05s/it] 14%|█▍ | 230/1610 [1:09:08<8:38:06, 22.53s/it] {'loss': 0.0016, 'grad_norm': 2.517569073782675, 'learning_rate': 8.57142857142857e-07, 'completion_length': 116.14286041259766, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 1.0, 'reward': 1.2946429252624512, 'reward_std': 0.2696296274662018, 'kl': 0.041015625, 'epoch': 0.71} 14%|█▍ | 230/1610 [1:09:08<8:38:06, 22.53s/it] 14%|█▍ | 231/1610 [1:09:32<8:48:13, 22.98s/it] {'loss': 0.0014, 'grad_norm': 1.4862348844762037, 'learning_rate': 8.565217391304348e-07, 'completion_length': 137.65179443359375, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3660714626312256, 'reward_std': 0.2921873927116394, 'kl': 0.03564453125, 'epoch': 0.72} 14%|█▍ | 231/1610 [1:09:32<8:48:13, 22.98s/it] 14%|█▍ | 232/1610 [1:09:54<8:37:08, 22.52s/it] {'loss': 0.0018, 'grad_norm': 1.6500112047343118, 'learning_rate': 8.559006211180124e-07, 
'completion_length': 120.14286041259766, 'rewards/accuracy_reward': 0.3303571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.260606050491333, 'kl': 0.0440673828125, 'epoch': 0.72} 14%|█▍ | 232/1610 [1:09:54<8:37:08, 22.52s/it] 14%|█▍ | 233/1610 [1:10:17<8:40:14, 22.67s/it] {'loss': 0.0017, 'grad_norm': 1.207329969100882, 'learning_rate': 8.5527950310559e-07, 'completion_length': 132.89286041259766, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4196429252624512, 'reward_std': 0.22996580600738525, 'kl': 0.04248046875, 'epoch': 0.72} 14%|█▍ | 233/1610 [1:10:17<8:40:14, 22.67s/it] 15%|█▍ | 234/1610 [1:10:40<8:45:04, 22.90s/it] {'loss': 0.0018, 'grad_norm': 1.2707537250106644, 'learning_rate': 8.546583850931677e-07, 'completion_length': 129.15179443359375, 'rewards/accuracy_reward': 0.3750000223517418, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3660714626312256, 'reward_std': 0.27743519842624664, 'kl': 0.0445556640625, 'epoch': 0.73} 15%|█▍ | 234/1610 [1:10:40<8:45:04, 22.90s/it] 15%|█▍ | 235/1610 [1:11:03<8:46:14, 22.96s/it] {'loss': 0.0019, 'grad_norm': 1.6438112480899283, 'learning_rate': 8.540372670807453e-07, 'completion_length': 136.48214721679688, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.31622885167598724, 'kl': 0.04638671875, 'epoch': 0.73} 15%|█▍ | 235/1610 [1:11:03<8:46:14, 22.96s/it] 15%|█▍ | 236/1610 [1:11:25<8:36:25, 22.55s/it] {'loss': 0.002, 'grad_norm': 1.525734524866686, 'learning_rate': 8.534161490683229e-07, 'completion_length': 103.33928680419922, 'rewards/accuracy_reward': 0.5625, 'rewards/format_reward': 1.0, 'reward': 1.5625001192092896, 'reward_std': 0.24889206886291504, 'kl': 0.051025390625, 'epoch': 0.73} 15%|█▍ | 236/1610 [1:11:25<8:36:25, 22.55s/it] 15%|█▍ | 237/1610 [1:11:49<8:44:15, 22.91s/it] {'loss': 0.0022, 'grad_norm': 
0.809356305194839, 'learning_rate': 8.527950310559006e-07, 'completion_length': 127.78572082519531, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.14964356273412704, 'kl': 0.054931640625, 'epoch': 0.74} 15%|█▍ | 237/1610 [1:11:49<8:44:15, 22.91s/it] 15%|█▍ | 238/1610 [1:12:13<8:50:46, 23.21s/it] {'loss': 0.002, 'grad_norm': 1.8253939497630554, 'learning_rate': 8.521739130434782e-07, 'completion_length': 122.91072082519531, 'rewards/accuracy_reward': 0.2946428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.285714328289032, 'reward_std': 0.3375742584466934, 'kl': 0.0499267578125, 'epoch': 0.74} 15%|█▍ | 238/1610 [1:12:13<8:50:46, 23.21s/it] 15%|█▍ | 239/1610 [1:12:36<8:51:12, 23.25s/it] {'loss': 0.0017, 'grad_norm': 2.553129186646326, 'learning_rate': 8.515527950310558e-07, 'completion_length': 141.2678680419922, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5089285969734192, 'reward_std': 0.23265621066093445, 'kl': 0.0413818359375, 'epoch': 0.74} 15%|█▍ | 239/1610 [1:12:36<8:51:12, 23.25s/it] 15%|█▍ | 240/1610 [1:12:58<8:43:45, 22.94s/it] {'loss': 0.0018, 'grad_norm': 1.7390763747275142, 'learning_rate': 8.509316770186336e-07, 'completion_length': 118.97322463989258, 'rewards/accuracy_reward': 0.401785746216774, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.3750000596046448, 'reward_std': 0.31418804824352264, 'kl': 0.04443359375, 'epoch': 0.75} 15%|█▍ | 240/1610 [1:12:58<8:43:45, 22.94s/it] 15%|█▍ | 241/1610 [1:13:22<8:49:58, 23.23s/it] {'loss': 0.0019, 'grad_norm': 1.5818428888581593, 'learning_rate': 8.503105590062112e-07, 'completion_length': 106.86607360839844, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3750000596046448, 'reward_std': 0.3506985902786255, 'kl': 0.0479736328125, 'epoch': 0.75} 15%|█▍ | 241/1610 [1:13:22<8:49:58, 23.23s/it] 
15%|█▌ | 242/1610 [1:13:45<8:47:17, 23.13s/it] {'loss': 0.0015, 'grad_norm': 1.2179915353929651, 'learning_rate': 8.496894409937888e-07, 'completion_length': 128.99108123779297, 'rewards/accuracy_reward': 0.3303571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.321428656578064, 'reward_std': 0.22875341773033142, 'kl': 0.0380859375, 'epoch': 0.75} 15%|█▌ | 242/1610 [1:13:45<8:47:17, 23.13s/it] 15%|█▌ | 243/1610 [1:14:07<8:42:37, 22.94s/it] {'loss': 0.002, 'grad_norm': 1.9204919928475004, 'learning_rate': 8.490683229813665e-07, 'completion_length': 122.0714340209961, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.348214328289032, 'reward_std': 0.2513168305158615, 'kl': 0.0496826171875, 'epoch': 0.75} 15%|█▌ | 243/1610 [1:14:07<8:42:37, 22.94s/it] 15%|█▌ | 244/1610 [1:14:30<8:40:42, 22.87s/it] {'loss': 0.0015, 'grad_norm': 2.0371626032723866, 'learning_rate': 8.484472049689441e-07, 'completion_length': 128.34821701049805, 'rewards/accuracy_reward': 0.267857164144516, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.25791003555059433, 'kl': 0.0386962890625, 'epoch': 0.76} 15%|█▌ | 244/1610 [1:14:30<8:40:42, 22.87s/it] 15%|█▌ | 245/1610 [1:14:53<8:37:25, 22.74s/it] {'loss': 0.002, 'grad_norm': 1.8798290940380427, 'learning_rate': 8.478260869565217e-07, 'completion_length': 128.24108123779297, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.294877827167511, 'kl': 0.0489501953125, 'epoch': 0.76} 15%|█▌ | 245/1610 [1:14:53<8:37:25, 22.74s/it] 15%|█▌ | 246/1610 [1:15:13<8:23:07, 22.13s/it] {'loss': 0.0015, 'grad_norm': 0.8762186298836023, 'learning_rate': 8.472049689440994e-07, 'completion_length': 120.66964721679688, 'rewards/accuracy_reward': 0.1696428656578064, 'rewards/format_reward': 1.0, 'reward': 1.1696429252624512, 'reward_std': 0.12565560638904572, 'kl': 0.037109375, 'epoch': 0.76} 15%|█▌ 
| 246/1610 [1:15:13<8:23:07, 22.13s/it] 15%|█▌ | 247/1610 [1:15:36<8:30:12, 22.46s/it] {'loss': 0.0019, 'grad_norm': 1.5898560442300897, 'learning_rate': 8.46583850931677e-07, 'completion_length': 125.26786041259766, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.19447603821754456, 'kl': 0.046875, 'epoch': 0.77} 15%|█▌ | 247/1610 [1:15:37<8:30:12, 22.46s/it] 15%|█▌ | 248/1610 [1:16:01<8:40:54, 22.95s/it] {'loss': 0.0016, 'grad_norm': 1.4785603846874573, 'learning_rate': 8.459627329192546e-07, 'completion_length': 144.6071548461914, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.3240400403738022, 'kl': 0.039794921875, 'epoch': 0.77} 15%|█▌ | 248/1610 [1:16:01<8:40:54, 22.95s/it] 15%|█▌ | 249/1610 [1:16:24<8:44:55, 23.14s/it] {'loss': 0.0018, 'grad_norm': 1.4948574805451125, 'learning_rate': 8.453416149068324e-07, 'completion_length': 130.5178680419922, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3125000596046448, 'reward_std': 0.321561798453331, 'kl': 0.044677734375, 'epoch': 0.77} 15%|█▌ | 249/1610 [1:16:24<8:44:55, 23.14s/it] 16%|█▌ | 250/1610 [1:16:46<8:37:50, 22.85s/it] {'loss': 0.0017, 'grad_norm': 1.1424935617423346, 'learning_rate': 8.447204968944099e-07, 'completion_length': 122.3660774230957, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.18666484951972961, 'kl': 0.042724609375, 'epoch': 0.78} 16%|█▌ | 250/1610 [1:16:46<8:37:50, 22.85s/it] 16%|█▌ | 251/1610 [1:17:08<8:28:41, 22.46s/it] {'loss': 0.0017, 'grad_norm': 1.017982571213339, 'learning_rate': 8.440993788819875e-07, 'completion_length': 97.04464721679688, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.11663763970136642, 'kl': 0.0433349609375, 
'epoch': 0.78} 16%|█▌ | 251/1610 [1:17:08<8:28:41, 22.46s/it] 16%|█▌ | 252/1610 [1:17:28<8:15:27, 21.89s/it] {'loss': 0.0022, 'grad_norm': 3.477230965085254, 'learning_rate': 8.434782608695652e-07, 'completion_length': 93.58929061889648, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.19959120452404022, 'kl': 0.0545654296875, 'epoch': 0.78} 16%|█▌ | 252/1610 [1:17:28<8:15:27, 21.89s/it] 16%|█▌ | 253/1610 [1:17:52<8:24:07, 22.29s/it] {'loss': 0.0019, 'grad_norm': 4.892282473564605, 'learning_rate': 8.428571428571428e-07, 'completion_length': 107.47322082519531, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.473214328289032, 'reward_std': 0.263644278049469, 'kl': 0.046630859375, 'epoch': 0.79} 16%|█▌ | 253/1610 [1:17:52<8:24:07, 22.29s/it] 16%|█▌ | 254/1610 [1:18:13<8:17:18, 22.01s/it] {'loss': 0.0025, 'grad_norm': 1.1209369751822749, 'learning_rate': 8.422360248447204e-07, 'completion_length': 87.25000381469727, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.1800716370344162, 'kl': 0.0616455078125, 'epoch': 0.79} 16%|█▌ | 254/1610 [1:18:13<8:17:18, 22.01s/it] 16%|█▌ | 255/1610 [1:18:34<8:08:11, 21.62s/it] {'loss': 0.0024, 'grad_norm': 2.301814560563317, 'learning_rate': 8.416149068322981e-07, 'completion_length': 86.81250381469727, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.36100782454013824, 'kl': 0.0611572265625, 'epoch': 0.79} 16%|█▌ | 255/1610 [1:18:34<8:08:11, 21.62s/it] 16%|█▌ | 256/1610 [1:18:57<8:17:45, 22.06s/it] {'loss': 0.0014, 'grad_norm': 1.9659349637400694, 'learning_rate': 8.409937888198757e-07, 'completion_length': 136.89286422729492, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2946429252624512, 'reward_std': 
0.29098063707351685, 'kl': 0.0352783203125, 'epoch': 0.8} 16%|█▌ | 256/1610 [1:18:57<8:17:45, 22.06s/it] 16%|█▌ | 257/1610 [1:19:21<8:28:43, 22.56s/it] {'loss': 0.0017, 'grad_norm': 1.8765561378371254, 'learning_rate': 8.403726708074533e-07, 'completion_length': 122.83929061889648, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.321428656578064, 'reward_std': 0.24619603157043457, 'kl': 0.043701171875, 'epoch': 0.8} 16%|█▌ | 257/1610 [1:19:21<8:28:43, 22.56s/it] 16%|█▌ | 258/1610 [1:19:44<8:34:09, 22.82s/it] {'loss': 0.0017, 'grad_norm': 2.726566119126589, 'learning_rate': 8.397515527950311e-07, 'completion_length': 125.88393020629883, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.2696296125650406, 'kl': 0.042236328125, 'epoch': 0.8} 16%|█▌ | 258/1610 [1:19:44<8:34:09, 22.82s/it] 16%|█▌ | 259/1610 [1:20:06<8:27:54, 22.56s/it] {'loss': 0.002, 'grad_norm': 1.9503221799484738, 'learning_rate': 8.391304347826087e-07, 'completion_length': 111.89286422729492, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.15933408588171005, 'kl': 0.0494384765625, 'epoch': 0.8} 16%|█▌ | 259/1610 [1:20:06<8:27:54, 22.56s/it] 16%|█▌ | 260/1610 [1:20:27<8:20:44, 22.26s/it] {'loss': 0.002, 'grad_norm': 1.7324564373785816, 'learning_rate': 8.385093167701863e-07, 'completion_length': 105.90179061889648, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.2657212167978287, 'kl': 0.05029296875, 'epoch': 0.81} 16%|█▌ | 260/1610 [1:20:27<8:20:44, 22.26s/it] 16%|█▌ | 261/1610 [1:20:52<8:34:01, 22.86s/it] {'loss': 0.0016, 'grad_norm': 1.0780647020705498, 'learning_rate': 8.37888198757764e-07, 'completion_length': 133.5535774230957, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 
0.9821429252624512, 'reward': 1.3392857313156128, 'reward_std': 0.23796935379505157, 'kl': 0.0399169921875, 'epoch': 0.81} 16%|█▌ | 261/1610 [1:20:52<8:34:01, 22.86s/it] 16%|█▋ | 262/1610 [1:21:14<8:27:53, 22.61s/it] {'loss': 0.0018, 'grad_norm': 1.657260534750704, 'learning_rate': 8.372670807453416e-07, 'completion_length': 111.39286041259766, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.18666484951972961, 'kl': 0.044921875, 'epoch': 0.81} 16%|█▋ | 262/1610 [1:21:14<8:27:53, 22.61s/it] 16%|█▋ | 263/1610 [1:21:35<8:15:23, 22.07s/it] {'loss': 0.0016, 'grad_norm': 3.078282603075667, 'learning_rate': 8.366459627329192e-07, 'completion_length': 103.22321701049805, 'rewards/accuracy_reward': 0.589285746216774, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.3267304599285126, 'kl': 0.0406494140625, 'epoch': 0.82} 16%|█▋ | 263/1610 [1:21:35<8:15:23, 22.07s/it] 16%|█▋ | 264/1610 [1:21:59<8:27:53, 22.64s/it] {'loss': 0.0019, 'grad_norm': 1.9399054277018986, 'learning_rate': 8.360248447204969e-07, 'completion_length': 108.01786422729492, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5267857909202576, 'reward_std': 0.28544436395168304, 'kl': 0.0482177734375, 'epoch': 0.82} 16%|█▋ | 264/1610 [1:21:59<8:27:53, 22.64s/it] 16%|█▋ | 265/1610 [1:22:21<8:27:47, 22.65s/it] {'loss': 0.0018, 'grad_norm': 1.6190546963275336, 'learning_rate': 8.354037267080745e-07, 'completion_length': 111.78571701049805, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 1.0, 'reward': 1.3482143878936768, 'reward_std': 0.3117181956768036, 'kl': 0.044677734375, 'epoch': 0.82} 16%|█▋ | 265/1610 [1:22:21<8:27:47, 22.65s/it] 17%|█▋ | 266/1610 [1:22:45<8:37:45, 23.11s/it] {'loss': 0.0024, 'grad_norm': 1.3353210066912669, 'learning_rate': 8.347826086956521e-07, 'completion_length': 105.92857360839844, 'rewards/accuracy_reward': 
0.5000000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4910715222358704, 'reward_std': 0.24290670454502106, 'kl': 0.0596923828125, 'epoch': 0.83} 17%|█▋ | 266/1610 [1:22:45<8:37:45, 23.11s/it] 17%|█▋ | 267/1610 [1:23:08<8:33:24, 22.94s/it] {'loss': 0.002, 'grad_norm': 2.3453969373667523, 'learning_rate': 8.341614906832299e-07, 'completion_length': 103.94643020629883, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.383928656578064, 'reward_std': 0.2792666554450989, 'kl': 0.0511474609375, 'epoch': 0.83} 17%|█▋ | 267/1610 [1:23:08<8:33:24, 22.94s/it] 17%|█▋ | 268/1610 [1:23:30<8:27:28, 22.69s/it] {'loss': 0.0016, 'grad_norm': 2.1371915345313512, 'learning_rate': 8.335403726708075e-07, 'completion_length': 107.03572082519531, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.2897626608610153, 'kl': 0.0400390625, 'epoch': 0.83} 17%|█▋ | 268/1610 [1:23:30<8:27:28, 22.69s/it] 17%|█▋ | 269/1610 [1:23:53<8:30:36, 22.85s/it] {'loss': 0.0019, 'grad_norm': 1.1684277345082754, 'learning_rate': 8.329192546583851e-07, 'completion_length': 110.47322082519531, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.258928656578064, 'reward_std': 0.18584203720092773, 'kl': 0.0478515625, 'epoch': 0.84} 17%|█▋ | 269/1610 [1:23:53<8:30:36, 22.85s/it] 17%|█▋ | 270/1610 [1:24:16<8:31:09, 22.89s/it] {'loss': 0.0017, 'grad_norm': 1.9550990449576373, 'learning_rate': 8.322981366459628e-07, 'completion_length': 128.15179443359375, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.32013164460659027, 'kl': 0.043212890625, 'epoch': 0.84} 17%|█▋ | 270/1610 [1:24:16<8:31:09, 22.89s/it] 17%|█▋ | 271/1610 [1:24:39<8:29:06, 22.81s/it] {'loss': 0.0019, 'grad_norm': 1.21705948806368, 'learning_rate': 8.316770186335404e-07, 
'completion_length': 131.23215103149414, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.2915885001420975, 'kl': 0.0487060546875, 'epoch': 0.84} 17%|█▋ | 271/1610 [1:24:39<8:29:06, 22.81s/it] 17%|█▋ | 272/1610 [1:25:02<8:33:52, 23.04s/it] {'loss': 0.0017, 'grad_norm': 3.4499074092420874, 'learning_rate': 8.31055900621118e-07, 'completion_length': 114.10714721679688, 'rewards/accuracy_reward': 0.3303571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.3090165704488754, 'kl': 0.04296875, 'epoch': 0.84} 17%|█▋ | 272/1610 [1:25:02<8:33:52, 23.04s/it] 17%|█▋ | 273/1610 [1:25:25<8:30:54, 22.93s/it] {'loss': 0.0017, 'grad_norm': 2.3901679172353734, 'learning_rate': 8.304347826086955e-07, 'completion_length': 111.4910774230957, 'rewards/accuracy_reward': 0.3660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.24498365819454193, 'kl': 0.0433349609375, 'epoch': 0.85} 17%|█▋ | 273/1610 [1:25:25<8:30:54, 22.93s/it] 17%|█▋ | 274/1610 [1:25:45<8:13:24, 22.16s/it] {'loss': 0.0019, 'grad_norm': 1.5552509026229284, 'learning_rate': 8.298136645962732e-07, 'completion_length': 94.94643020629883, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.2765706330537796, 'kl': 0.0472412109375, 'epoch': 0.85} 17%|█▋ | 274/1610 [1:25:45<8:13:24, 22.16s/it] 17%|█▋ | 275/1610 [1:26:06<8:00:46, 21.61s/it] {'loss': 0.0023, 'grad_norm': 1.2624038238989426, 'learning_rate': 8.291925465838508e-07, 'completion_length': 81.69643211364746, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.15360544621944427, 'kl': 0.056640625, 'epoch': 0.85} 17%|█▋ | 275/1610 [1:26:06<8:00:46, 21.61s/it] 17%|█▋ | 276/1610 [1:26:27<7:54:29, 21.34s/it] {'loss': 0.002, 'grad_norm': 2.6028345226936787, 'learning_rate': 
8.285714285714285e-07, 'completion_length': 104.90179061889648, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.2501044273376465, 'kl': 0.04931640625, 'epoch': 0.86} 17%|█▋ | 276/1610 [1:26:27<7:54:29, 21.34s/it] 17%|█▋ | 277/1610 [1:26:48<7:54:38, 21.36s/it] {'loss': 0.0021, 'grad_norm': 2.5957955920763967, 'learning_rate': 8.279503105590062e-07, 'completion_length': 101.8839340209961, 'rewards/accuracy_reward': 0.3660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.29367104172706604, 'kl': 0.0537109375, 'epoch': 0.86} 17%|█▋ | 277/1610 [1:26:48<7:54:38, 21.36s/it] 17%|█▋ | 278/1610 [1:27:10<8:01:59, 21.71s/it] {'loss': 0.002, 'grad_norm': 1.3718755212062472, 'learning_rate': 8.273291925465838e-07, 'completion_length': 123.33036422729492, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.2248450145125389, 'kl': 0.05126953125, 'epoch': 0.86} 17%|█▋ | 278/1610 [1:27:10<8:01:59, 21.71s/it] 17%|█▋ | 279/1610 [1:27:32<8:02:24, 21.75s/it] {'loss': 0.0024, 'grad_norm': 1.8509617514716084, 'learning_rate': 8.267080745341614e-07, 'completion_length': 103.54464721679688, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.208614781498909, 'kl': 0.059814453125, 'epoch': 0.87} 17%|█▋ | 279/1610 [1:27:32<8:02:24, 21.75s/it] 17%|█▋ | 280/1610 [1:27:55<8:07:10, 21.98s/it] {'loss': 0.0019, 'grad_norm': 1.6652337576805174, 'learning_rate': 8.260869565217391e-07, 'completion_length': 99.68750762939453, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.22875341773033142, 'kl': 0.0482177734375, 'epoch': 0.87} 17%|█▋ | 280/1610 [1:27:55<8:07:10, 21.98s/it] 17%|█▋ | 281/1610 [1:28:17<8:10:40, 22.15s/it] {'loss': 0.0019, 'grad_norm': 0.7854034682152443, 
'learning_rate': 8.254658385093167e-07, 'completion_length': 114.30357360839844, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.10040179640054703, 'kl': 0.0482177734375, 'epoch': 0.87} 17%|█▋ | 281/1610 [1:28:17<8:10:40, 22.15s/it] 18%|█▊ | 282/1610 [1:28:41<8:20:08, 22.60s/it] {'loss': 0.0018, 'grad_norm': 1.4086676053820069, 'learning_rate': 8.248447204968943e-07, 'completion_length': 112.89286041259766, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.128351628780365, 'kl': 0.0458984375, 'epoch': 0.88} 18%|█▊ | 282/1610 [1:28:41<8:20:08, 22.60s/it] 18%|█▊ | 283/1610 [1:29:04<8:22:13, 22.71s/it] {'loss': 0.0021, 'grad_norm': 1.7689120419150617, 'learning_rate': 8.24223602484472e-07, 'completion_length': 98.08928680419922, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.20984171330928802, 'kl': 0.05322265625, 'epoch': 0.88} 18%|█▊ | 283/1610 [1:29:04<8:22:13, 22.71s/it] 18%|█▊ | 284/1610 [1:29:25<8:13:48, 22.34s/it] {'loss': 0.002, 'grad_norm': 2.3290332462618495, 'learning_rate': 8.236024844720496e-07, 'completion_length': 100.20536041259766, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.3532022535800934, 'kl': 0.049560546875, 'epoch': 0.88} 18%|█▊ | 284/1610 [1:29:25<8:13:48, 22.34s/it] 18%|█▊ | 285/1610 [1:29:47<8:05:23, 21.98s/it] {'loss': 0.0026, 'grad_norm': 1.5722166410838352, 'learning_rate': 8.229813664596273e-07, 'completion_length': 87.77679061889648, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.1995968073606491, 'kl': 0.0648193359375, 'epoch': 0.89} 18%|█▊ | 285/1610 [1:29:47<8:05:23, 21.98s/it] 18%|█▊ | 286/1610 [1:30:10<8:17:15, 22.53s/it] {'loss': 0.0022, 
'grad_norm': 1.0026907730528791, 'learning_rate': 8.22360248447205e-07, 'completion_length': 115.45536041259766, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.6160715222358704, 'reward_std': 0.2314494401216507, 'kl': 0.0552978515625, 'epoch': 0.89} 18%|█▊ | 286/1610 [1:30:10<8:17:15, 22.53s/it] 18%|█▊ | 287/1610 [1:30:33<8:16:25, 22.51s/it] {'loss': 0.0025, 'grad_norm': 1.9533608814096963, 'learning_rate': 8.217391304347826e-07, 'completion_length': 111.14286041259766, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.27926105260849, 'kl': 0.0618896484375, 'epoch': 0.89} 18%|█▊ | 287/1610 [1:30:33<8:16:25, 22.51s/it] 18%|█▊ | 288/1610 [1:30:55<8:11:14, 22.30s/it] {'loss': 0.0019, 'grad_norm': 1.4972360048643423, 'learning_rate': 8.211180124223602e-07, 'completion_length': 92.5089340209961, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 1.0, 'reward': 1.2946429252624512, 'reward_std': 0.24498367309570312, 'kl': 0.04638671875, 'epoch': 0.89} 18%|█▊ | 288/1610 [1:30:55<8:11:14, 22.30s/it] 18%|█▊ | 289/1610 [1:31:16<8:04:37, 22.01s/it] {'loss': 0.0018, 'grad_norm': 2.1095997035728735, 'learning_rate': 8.204968944099379e-07, 'completion_length': 96.03572082519531, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.2960958033800125, 'kl': 0.04541015625, 'epoch': 0.9} 18%|█▊ | 289/1610 [1:31:16<8:04:37, 22.01s/it] 18%|█▊ | 290/1610 [1:31:39<8:10:18, 22.29s/it] {'loss': 0.0018, 'grad_norm': 2.137673146239301, 'learning_rate': 8.198757763975155e-07, 'completion_length': 103.9910774230957, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.285714328289032, 'reward_std': 0.3057272136211395, 'kl': 0.0450439453125, 'epoch': 0.9} 18%|█▊ | 290/1610 [1:31:39<8:10:18, 22.29s/it] 18%|█▊ | 291/1610 
[1:32:02<8:13:01, 22.43s/it] {'loss': 0.0022, 'grad_norm': 1.5940885489773193, 'learning_rate': 8.192546583850931e-07, 'completion_length': 101.66964721679688, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.24889205396175385, 'kl': 0.0537109375, 'epoch': 0.9} 18%|█▊ | 291/1610 [1:32:02<8:13:01, 22.43s/it] 18%|█▊ | 292/1610 [1:32:22<8:01:45, 21.93s/it] {'loss': 0.002, 'grad_norm': 1.8333037742885747, 'learning_rate': 8.186335403726708e-07, 'completion_length': 104.22321701049805, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.32976867258548737, 'kl': 0.0511474609375, 'epoch': 0.91} 18%|█▊ | 292/1610 [1:32:22<8:01:45, 21.93s/it] 18%|█▊ | 293/1610 [1:32:44<8:01:10, 21.92s/it] {'loss': 0.0024, 'grad_norm': 1.141757818354662, 'learning_rate': 8.180124223602484e-07, 'completion_length': 96.58036041259766, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.20021027326583862, 'kl': 0.06005859375, 'epoch': 0.91} 18%|█▊ | 293/1610 [1:32:44<8:01:10, 21.92s/it] 18%|█▊ | 294/1610 [1:33:05<7:50:10, 21.44s/it] {'loss': 0.0017, 'grad_norm': 6.554692901928579, 'learning_rate': 8.173913043478261e-07, 'completion_length': 100.58928680419922, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5089285969734192, 'reward_std': 0.31292495131492615, 'kl': 0.04345703125, 'epoch': 0.91} 18%|█▊ | 294/1610 [1:33:05<7:50:10, 21.44s/it] 18%|█▊ | 295/1610 [1:33:27<7:52:31, 21.56s/it] {'loss': 0.0021, 'grad_norm': 1.900240171760579, 'learning_rate': 8.167701863354038e-07, 'completion_length': 112.60715103149414, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.285714328289032, 'reward_std': 0.10431019216775894, 'kl': 0.0516357421875, 'epoch': 0.92} 18%|█▊ | 295/1610 [1:33:27<7:52:31, 
21.56s/it] 18%|█▊ | 296/1610 [1:33:49<7:58:46, 21.86s/it] {'loss': 0.002, 'grad_norm': 1.6257548216921889, 'learning_rate': 8.161490683229814e-07, 'completion_length': 103.34822082519531, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.2344820648431778, 'kl': 0.0511474609375, 'epoch': 0.92} 18%|█▊ | 296/1610 [1:33:49<7:58:46, 21.86s/it] 18%|█▊ | 297/1610 [1:34:10<7:50:37, 21.51s/it] {'loss': 0.0031, 'grad_norm': 1.415116784444893, 'learning_rate': 8.155279503105589e-07, 'completion_length': 83.02679061889648, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.10882645100355148, 'kl': 0.07861328125, 'epoch': 0.92} 18%|█▊ | 297/1610 [1:34:10<7:50:37, 21.51s/it] 19%|█▊ | 298/1610 [1:34:33<7:59:40, 21.94s/it] {'loss': 0.002, 'grad_norm': 1.4125918080355464, 'learning_rate': 8.149068322981366e-07, 'completion_length': 105.26786041259766, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5714285969734192, 'reward_std': 0.28680098056793213, 'kl': 0.051025390625, 'epoch': 0.93} 19%|█▊ | 298/1610 [1:34:33<7:59:40, 21.94s/it] 19%|█▊ | 299/1610 [1:34:55<8:04:26, 22.17s/it] {'loss': 0.0024, 'grad_norm': 1.5506893076797335, 'learning_rate': 8.142857142857142e-07, 'completion_length': 104.66964721679688, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.22094222903251648, 'kl': 0.0596923828125, 'epoch': 0.93} 19%|█▊ | 299/1610 [1:34:55<8:04:26, 22.17s/it] 19%|█▊ | 300/1610 [1:35:15<7:49:06, 21.49s/it] {'loss': 0.0026, 'grad_norm': 1.8490529598047392, 'learning_rate': 8.136645962732918e-07, 'completion_length': 71.42857360839844, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.26841723918914795, 'kl': 0.064208984375, 'epoch': 0.93} 19%|█▊ 
| 300/1610 [1:35:15<7:49:06, 21.49s/it] 19%|█▊ | 301/1610 [1:36:09<11:20:17, 31.18s/it] {'loss': 0.0024, 'grad_norm': 2.187459688536623, 'learning_rate': 8.130434782608695e-07, 'completion_length': 87.9464340209961, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.2987862229347229, 'kl': 0.059326171875, 'epoch': 0.93} 19%|█▊ | 301/1610 [1:36:09<11:20:17, 31.18s/it] 19%|█▉ | 302/1610 [1:36:18<8:54:23, 24.51s/it] {'loss': 0.0027, 'grad_norm': 1.6572913524307022, 'learning_rate': 8.124223602484471e-07, 'completion_length': 73.49107360839844, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.19057324528694153, 'kl': 0.0682373046875, 'epoch': 0.94} 19%|█▉ | 302/1610 [1:36:18<8:54:23, 24.51s/it] 19%|█▉ | 303/1610 [1:36:29<7:22:43, 20.32s/it] {'loss': 0.0027, 'grad_norm': 0.8741458639355193, 'learning_rate': 8.118012422360247e-07, 'completion_length': 94.50000381469727, 'rewards/accuracy_reward': 0.375, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.09528662264347076, 'kl': 0.0672607421875, 'epoch': 0.94} 19%|█▉ | 303/1610 [1:36:29<7:22:43, 20.32s/it] 19%|█▉ | 304/1610 [1:36:40<6:23:58, 17.64s/it] {'loss': 0.0023, 'grad_norm': 2.534625738944371, 'learning_rate': 8.111801242236025e-07, 'completion_length': 105.39286041259766, 'rewards/accuracy_reward': 0.3660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.27535825967788696, 'kl': 0.0570068359375, 'epoch': 0.94} 19%|█▉ | 304/1610 [1:36:40<6:23:58, 17.64s/it] 19%|█▉ | 305/1610 [1:36:50<5:32:00, 15.26s/it] {'loss': 0.0021, 'grad_norm': 1.1499354759911886, 'learning_rate': 8.105590062111801e-07, 'completion_length': 86.90179061889648, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 1.0, 'reward': 1.3482143878936768, 'reward_std': 0.13736958801746368, 'kl': 0.0535888671875, 'epoch': 0.95} 
19%|█▉ | 305/1610 [1:36:50<5:32:00, 15.26s/it] 19%|█▉ | 306/1610 [1:36:59<4:53:26, 13.50s/it] {'loss': 0.0026, 'grad_norm': 1.7883940421373112, 'learning_rate': 8.099378881987577e-07, 'completion_length': 78.55357360839844, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.14579425752162933, 'kl': 0.0654296875, 'epoch': 0.95} 19%|█▉ | 306/1610 [1:36:59<4:53:26, 13.50s/it] 19%|█▉ | 307/1610 [1:37:12<4:49:50, 13.35s/it] {'loss': 0.0019, 'grad_norm': 1.7275459652427119, 'learning_rate': 8.093167701863354e-07, 'completion_length': 123.8214340209961, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.2888980954885483, 'kl': 0.0469970703125, 'epoch': 0.95} 19%|█▉ | 307/1610 [1:37:12<4:49:50, 13.35s/it] 19%|█▉ | 308/1610 [1:37:24<4:41:47, 12.99s/it] {'loss': 0.0021, 'grad_norm': 1.6411649664469794, 'learning_rate': 8.08695652173913e-07, 'completion_length': 109.62500381469727, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.16262340545654297, 'kl': 0.05224609375, 'epoch': 0.96} 19%|█▉ | 308/1610 [1:37:24<4:41:47, 12.99s/it] 19%|█▉ | 309/1610 [1:37:36<4:30:59, 12.50s/it] {'loss': 0.0018, 'grad_norm': 1.4501535383362671, 'learning_rate': 8.080745341614906e-07, 'completion_length': 109.31250381469727, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.2254640981554985, 'kl': 0.0452880859375, 'epoch': 0.96} 19%|█▉ | 309/1610 [1:37:36<4:30:59, 12.50s/it] 19%|█▉ | 310/1610 [1:37:46<4:16:38, 11.85s/it] {'loss': 0.0023, 'grad_norm': 1.790291889402319, 'learning_rate': 8.074534161490683e-07, 'completion_length': 101.93750381469727, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.2831694334745407, 'kl': 
0.0587158203125, 'epoch': 0.96} 19%|█▉ | 310/1610 [1:37:46<4:16:38, 11.85s/it] 19%|█▉ | 311/1610 [1:37:57<4:09:56, 11.54s/it] {'loss': 0.0018, 'grad_norm': 1.6079739465613891, 'learning_rate': 8.068322981366459e-07, 'completion_length': 127.41964721679688, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285715222358704, 'reward_std': 0.29999858140945435, 'kl': 0.0457763671875, 'epoch': 0.97} 19%|█▉ | 311/1610 [1:37:57<4:09:56, 11.54s/it] 19%|█▉ | 312/1610 [1:38:08<4:10:11, 11.57s/it] {'loss': 0.0025, 'grad_norm': 1.7875194846748896, 'learning_rate': 8.062111801242235e-07, 'completion_length': 109.83036422729492, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3035714626312256, 'reward_std': 0.24619603157043457, 'kl': 0.0626220703125, 'epoch': 0.97} 19%|█▉ | 312/1610 [1:38:08<4:10:11, 11.57s/it] 19%|█▉ | 313/1610 [1:38:21<4:13:32, 11.73s/it] {'loss': 0.0024, 'grad_norm': 1.6437650177601755, 'learning_rate': 8.055900621118013e-07, 'completion_length': 89.34821701049805, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.12054044008255005, 'kl': 0.0596923828125, 'epoch': 0.97} 19%|█▉ | 313/1610 [1:38:21<4:13:32, 11.73s/it] 20%|█▉ | 314/1610 [1:38:33<4:21:19, 12.10s/it] {'loss': 0.0016, 'grad_norm': 1.541145326609009, 'learning_rate': 8.049689440993789e-07, 'completion_length': 140.06250762939453, 'rewards/accuracy_reward': 0.4553571790456772, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.3078097999095917, 'kl': 0.0390625, 'epoch': 0.98} 20%|█▉ | 314/1610 [1:38:33<4:21:19, 12.10s/it] 20%|█▉ | 315/1610 [1:38:46<4:21:56, 12.14s/it] {'loss': 0.0019, 'grad_norm': 1.450235521873338, 'learning_rate': 8.043478260869565e-07, 'completion_length': 115.51786041259766, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 
1.3125000596046448, 'reward_std': 0.3111136704683304, 'kl': 0.047119140625, 'epoch': 0.98} 20%|█▉ | 315/1610 [1:38:46<4:21:56, 12.14s/it] 20%|█▉ | 316/1610 [1:38:58<4:26:07, 12.34s/it] {'loss': 0.0018, 'grad_norm': 2.150013099821445, 'learning_rate': 8.037267080745342e-07, 'completion_length': 128.5535774230957, 'rewards/accuracy_reward': 0.3303571566939354, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.31622883677482605, 'kl': 0.0455322265625, 'epoch': 0.98} 20%|█▉ | 316/1610 [1:38:59<4:26:07, 12.34s/it] 20%|█▉ | 317/1610 [1:39:11<4:26:11, 12.35s/it] {'loss': 0.0021, 'grad_norm': 2.7125263503351023, 'learning_rate': 8.031055900621118e-07, 'completion_length': 131.1607208251953, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.3784560561180115, 'kl': 0.052490234375, 'epoch': 0.98} 20%|█▉ | 317/1610 [1:39:11<4:26:11, 12.35s/it] 20%|█▉ | 318/1610 [1:39:22<4:18:44, 12.02s/it] {'loss': 0.0023, 'grad_norm': 1.3443493984389003, 'learning_rate': 8.024844720496894e-07, 'completion_length': 114.46429061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.20801587402820587, 'kl': 0.0565185546875, 'epoch': 0.99} 20%|█▉ | 318/1610 [1:39:22<4:18:44, 12.02s/it] 20%|█▉ | 319/1610 [1:39:32<4:02:39, 11.28s/it] {'loss': 0.0017, 'grad_norm': 2.119390174128572, 'learning_rate': 8.018633540372671e-07, 'completion_length': 93.25000381469727, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.27926105260849, 'kl': 0.0423583984375, 'epoch': 0.99} 20%|█▉ | 319/1610 [1:39:32<4:02:39, 11.28s/it] 20%|█▉ | 320/1610 [1:39:43<4:02:31, 11.28s/it] {'loss': 0.0023, 'grad_norm': 2.703501732021391, 'learning_rate': 8.012422360248446e-07, 'completion_length': 111.31250381469727, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 1.0, 
'reward': 1.348214328289032, 'reward_std': 0.31353843212127686, 'kl': 0.0584716796875, 'epoch': 0.99} 20%|█▉ | 320/1610 [1:39:43<4:02:31, 11.28s/it] 20%|█▉ | 321/1610 [1:39:53<3:54:32, 10.92s/it] {'loss': 0.0022, 'grad_norm': 2.030351620188695, 'learning_rate': 8.006211180124222e-07, 'completion_length': 88.91964721679688, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.16504815965890884, 'kl': 0.0538330078125, 'epoch': 1.0} 20%|█▉ | 321/1610 [1:39:53<3:54:32, 10.92s/it] 20%|██ | 322/1610 [1:40:04<3:57:41, 11.07s/it] {'loss': 0.0019, 'grad_norm': 1.399592115238615, 'learning_rate': 8e-07, 'completion_length': 105.11607360839844, 'rewards/accuracy_reward': 0.4732143208384514, 'rewards/format_reward': 1.0, 'reward': 1.4732143878936768, 'reward_std': 0.20984169840812683, 'kl': 0.04638671875, 'epoch': 1.0} 20%|██ | 322/1610 [1:40:04<3:57:41, 11.07s/it] 20%|██ | 323/1610 [1:40:18<4:12:38, 11.78s/it] {'loss': 0.0024, 'grad_norm': 1.1371284336828402, 'learning_rate': 7.993788819875776e-07, 'completion_length': 128.10714721679688, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3571429252624512, 'reward_std': 0.2377769947052002, 'kl': 0.061279296875, 'epoch': 1.0} 20%|██ | 323/1610 [1:40:18<4:12:38, 11.78s/it] 20%|██ | 324/1610 [1:40:30<4:15:58, 11.94s/it] {'loss': 0.0023, 'grad_norm': 2.213319251082988, 'learning_rate': 7.987577639751552e-07, 'completion_length': 105.91072082519531, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 1.0, 'reward': 1.5982143878936768, 'reward_std': 0.26181282103061676, 'kl': 0.0584716796875, 'epoch': 1.01} 20%|██ | 324/1610 [1:40:30<4:15:58, 11.94s/it] 20%|██ | 325/1610 [1:40:42<4:16:24, 11.97s/it] {'loss': 0.0025, 'grad_norm': 1.9063000457038715, 'learning_rate': 7.981366459627329e-07, 'completion_length': 109.5714340209961, 'rewards/accuracy_reward': 0.5714285969734192, 
'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.20801587402820587, 'kl': 0.0616455078125, 'epoch': 1.01} 20%|██ | 325/1610 [1:40:42<4:16:24, 11.97s/it] 20%|██ | 326/1610 [1:40:55<4:21:31, 12.22s/it] {'loss': 0.0029, 'grad_norm': 0.9548939277968392, 'learning_rate': 7.975155279503105e-07, 'completion_length': 130.91964721679688, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.17495085299015045, 'kl': 0.07177734375, 'epoch': 1.01} 20%|██ | 326/1610 [1:40:55<4:21:31, 12.22s/it] 20%|██ | 327/1610 [1:41:08<4:24:31, 12.37s/it] {'loss': 0.0019, 'grad_norm': 2.079924485518292, 'learning_rate': 7.968944099378881e-07, 'completion_length': 128.9107208251953, 'rewards/accuracy_reward': 0.473214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4553571939468384, 'reward_std': 0.3296493589878082, 'kl': 0.0465087890625, 'epoch': 1.02} 20%|██ | 327/1610 [1:41:08<4:24:31, 12.37s/it] 20%|██ | 328/1610 [1:41:20<4:22:09, 12.27s/it] {'loss': 0.0023, 'grad_norm': 1.8907411054331382, 'learning_rate': 7.962732919254658e-07, 'completion_length': 110.66072082519531, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803571939468384, 'reward_std': 0.27474477887153625, 'kl': 0.056640625, 'epoch': 1.02} 20%|██ | 328/1610 [1:41:20<4:22:09, 12.27s/it] 20%|██ | 329/1610 [1:41:34<4:32:47, 12.78s/it] {'loss': 0.0022, 'grad_norm': 1.0500699984035697, 'learning_rate': 7.956521739130434e-07, 'completion_length': 120.30357360839844, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.348214328289032, 'reward_std': 0.17489738017320633, 'kl': 0.0556640625, 'epoch': 1.02} 20%|██ | 329/1610 [1:41:34<4:32:47, 12.78s/it] 20%|██ | 330/1610 [1:41:48<4:39:50, 13.12s/it] {'loss': 0.0021, 'grad_norm': 0.9594865303302759, 'learning_rate': 7.95031055900621e-07, 'completion_length': 135.41964721679688, 
'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3392857313156128, 'reward_std': 0.1963018849492073, 'kl': 0.052490234375, 'epoch': 1.02} 20%|██ | 330/1610 [1:41:48<4:39:50, 13.12s/it] 21%|██ | 331/1610 [1:41:58<4:20:45, 12.23s/it] {'loss': 0.0028, 'grad_norm': 2.252197740968094, 'learning_rate': 7.944099378881988e-07, 'completion_length': 81.56250381469727, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.1418914645910263, 'kl': 0.069091796875, 'epoch': 1.03} 21%|██ | 331/1610 [1:41:58<4:20:45, 12.23s/it] 21%|██ | 332/1610 [1:42:09<4:10:39, 11.77s/it] {'loss': 0.0024, 'grad_norm': 1.9311783986884121, 'learning_rate': 7.937888198757764e-07, 'completion_length': 104.42857360839844, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.28828462213277817, 'kl': 0.059814453125, 'epoch': 1.03} 21%|██ | 332/1610 [1:42:09<4:10:39, 11.77s/it] 21%|██ | 333/1610 [1:42:19<4:02:35, 11.40s/it] {'loss': 0.0032, 'grad_norm': 1.1337388104546802, 'learning_rate': 7.93167701863354e-07, 'completion_length': 83.53571701049805, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.2022872269153595, 'kl': 0.0799560546875, 'epoch': 1.03} 21%|██ | 333/1610 [1:42:19<4:02:35, 11.40s/it] 21%|██ | 334/1610 [1:42:29<3:53:26, 10.98s/it] {'loss': 0.0024, 'grad_norm': 0.8721098005191713, 'learning_rate': 7.925465838509317e-07, 'completion_length': 82.83929061889648, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.12444322928786278, 'kl': 0.0604248046875, 'epoch': 1.04} 21%|██ | 334/1610 [1:42:29<3:53:26, 10.98s/it] 21%|██ | 335/1610 [1:42:38<3:42:56, 10.49s/it] {'loss': 0.0022, 'grad_norm': 1.4872061836609125, 'learning_rate': 7.919254658385093e-07, 
'completion_length': 86.23214721679688, 'rewards/accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.16262901574373245, 'kl': 0.0550537109375, 'epoch': 1.04} 21%|██ | 335/1610 [1:42:38<3:42:56, 10.49s/it] 21%|██ | 336/1610 [1:42:48<3:36:20, 10.19s/it] {'loss': 0.003, 'grad_norm': 2.232414387415316, 'learning_rate': 7.913043478260869e-07, 'completion_length': 68.86607360839844, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.15481781959533691, 'kl': 0.075439453125, 'epoch': 1.04} 21%|██ | 336/1610 [1:42:48<3:36:20, 10.19s/it] 21%|██ | 337/1610 [1:42:58<3:32:28, 10.01s/it] {'loss': 0.0037, 'grad_norm': 2.1572912811517546, 'learning_rate': 7.906832298136646e-07, 'completion_length': 78.97321891784668, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.20532545447349548, 'kl': 0.09130859375, 'epoch': 1.05} 21%|██ | 337/1610 [1:42:58<3:32:28, 10.01s/it] 21%|██ | 338/1610 [1:43:07<3:31:10, 9.96s/it] {'loss': 0.0034, 'grad_norm': 0.8360965961959002, 'learning_rate': 7.900621118012422e-07, 'completion_length': 72.17857360839844, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.09138382971286774, 'kl': 0.085693359375, 'epoch': 1.05} 21%|██ | 338/1610 [1:43:07<3:31:10, 9.96s/it] 21%|██ | 339/1610 [1:43:18<3:33:07, 10.06s/it] {'loss': 0.0021, 'grad_norm': 1.443670518206481, 'learning_rate': 7.894409937888198e-07, 'completion_length': 91.03572082519531, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660715222358704, 'reward_std': 0.260606050491333, 'kl': 0.05126953125, 'epoch': 1.05} 21%|██ | 339/1610 [1:43:18<3:33:07, 10.06s/it] 21%|██ | 340/1610 [1:43:28<3:34:03, 10.11s/it] {'loss': 0.0018, 'grad_norm': 1.4805094645248535, 'learning_rate': 
7.888198757763976e-07, 'completion_length': 101.20536041259766, 'rewards/accuracy_reward': 0.2053571492433548, 'rewards/format_reward': 1.0, 'reward': 1.2053571939468384, 'reward_std': 0.2540072351694107, 'kl': 0.0455322265625, 'epoch': 1.06} 21%|██ | 340/1610 [1:43:28<3:34:03, 10.11s/it] 21%|██ | 341/1610 [1:43:38<3:36:15, 10.22s/it] {'loss': 0.0031, 'grad_norm': 3.1348358127206586, 'learning_rate': 7.881987577639752e-07, 'completion_length': 71.87500381469727, 'rewards/accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875001192092896, 'reward_std': 0.18787721544504166, 'kl': 0.07861328125, 'epoch': 1.06} 21%|██ | 341/1610 [1:43:38<3:36:15, 10.22s/it] 21%|██ | 342/1610 [1:43:48<3:32:47, 10.07s/it] {'loss': 0.0032, 'grad_norm': 1.1677782102597125, 'learning_rate': 7.875776397515528e-07, 'completion_length': 86.04464530944824, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.16262340545654297, 'kl': 0.078857421875, 'epoch': 1.06} 21%|██ | 342/1610 [1:43:48<3:32:47, 10.07s/it] 21%|██▏ | 343/1610 [1:43:59<3:38:52, 10.37s/it] {'loss': 0.0028, 'grad_norm': 0.97621832554335, 'learning_rate': 7.869565217391305e-07, 'completion_length': 89.59821701049805, 'rewards/accuracy_reward': 0.2678571566939354, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.09528662264347076, 'kl': 0.0692138671875, 'epoch': 1.07} 21%|██▏ | 343/1610 [1:43:59<3:38:52, 10.37s/it] 21%|██▏ | 344/1610 [1:44:09<3:33:41, 10.13s/it] {'loss': 0.0021, 'grad_norm': 2.648801414232936, 'learning_rate': 7.86335403726708e-07, 'completion_length': 72.02679061889648, 'rewards/accuracy_reward': 0.2053571566939354, 'rewards/format_reward': 1.0, 'reward': 1.2053571939468384, 'reward_std': 0.1866704449057579, 'kl': 0.0538330078125, 'epoch': 1.07} 21%|██▏ | 344/1610 [1:44:09<3:33:41, 10.13s/it] 21%|██▏ | 345/1610 [1:44:20<3:41:20, 10.50s/it] {'loss': 0.0025, 'grad_norm': 1.287293478560293, 
'learning_rate': 7.857142857142856e-07, 'completion_length': 91.90179061889648, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857313156128, 'reward_std': 0.1671452820301056, 'kl': 0.0615234375, 'epoch': 1.07} 21%|██▏ | 345/1610 [1:44:20<3:41:20, 10.50s/it] 21%|██▏ | 346/1610 [1:44:29<3:28:10, 9.88s/it] {'loss': 0.0026, 'grad_norm': 19.02826181829015, 'learning_rate': 7.850931677018633e-07, 'completion_length': 83.08928680419922, 'rewards/accuracy_reward': 0.4553571790456772, 'rewards/format_reward': 1.0, 'reward': 1.4553572535514832, 'reward_std': 0.24229323863983154, 'kl': 0.06396484375, 'epoch': 1.07} 21%|██▏ | 346/1610 [1:44:29<3:28:10, 9.88s/it] 22%|██▏ | 347/1610 [1:44:39<3:32:43, 10.11s/it] {'loss': 0.0022, 'grad_norm': 2.101758502062246, 'learning_rate': 7.844720496894409e-07, 'completion_length': 83.49107360839844, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.28707224130630493, 'kl': 0.0540771484375, 'epoch': 1.08} 22%|██▏ | 347/1610 [1:44:39<3:32:43, 10.11s/it] 22%|██▏ | 348/1610 [1:44:51<3:43:57, 10.65s/it] {'loss': 0.003, 'grad_norm': 1.2020465538196057, 'learning_rate': 7.838509316770185e-07, 'completion_length': 86.64286231994629, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.20157574117183685, 'kl': 0.075439453125, 'epoch': 1.08} 22%|██▏ | 348/1610 [1:44:51<3:43:57, 10.65s/it] 22%|██▏ | 349/1610 [1:45:00<3:35:59, 10.28s/it] {'loss': 0.003, 'grad_norm': 1.7087955191095396, 'learning_rate': 7.832298136645963e-07, 'completion_length': 82.3839340209961, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.2344820499420166, 'kl': 0.075927734375, 'epoch': 1.08} 22%|██▏ | 349/1610 [1:45:00<3:35:59, 10.28s/it] 22%|██▏ | 350/1610 [1:45:11<3:38:59, 10.43s/it] {'loss': 0.0023, 
'grad_norm': 1.8277119749581405, 'learning_rate': 7.826086956521739e-07, 'completion_length': 94.40178680419922, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 1.0, 'reward': 1.2946429252624512, 'reward_std': 0.2059243619441986, 'kl': 0.056396484375, 'epoch': 1.09} 22%|██▏ | 350/1610 [1:45:11<3:38:59, 10.43s/it] 22%|██▏ | 351/1610 [1:45:22<3:43:49, 10.67s/it] {'loss': 0.0023, 'grad_norm': 0.6942308761408723, 'learning_rate': 7.819875776397515e-07, 'completion_length': 110.04464721679688, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5535715222358704, 'reward_std': 0.11657854914665222, 'kl': 0.0565185546875, 'epoch': 1.09} 22%|██▏ | 351/1610 [1:45:22<3:43:49, 10.67s/it] 22%|██▏ | 352/1610 [1:45:33<3:40:12, 10.50s/it] {'loss': 0.0024, 'grad_norm': 0.7005969626560297, 'learning_rate': 7.813664596273292e-07, 'completion_length': 74.6964340209961, 'rewards/accuracy_reward': 0.4553571715950966, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.07003280520439148, 'kl': 0.0596923828125, 'epoch': 1.09} 22%|██▏ | 352/1610 [1:45:33<3:40:12, 10.50s/it] 22%|██▏ | 353/1610 [1:45:44<3:44:08, 10.70s/it] {'loss': 0.0029, 'grad_norm': 1.1090580379953499, 'learning_rate': 7.807453416149068e-07, 'completion_length': 76.77679061889648, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4107143878936768, 'reward_std': 0.15993298217654228, 'kl': 0.0712890625, 'epoch': 1.1} 22%|██▏ | 353/1610 [1:45:44<3:44:08, 10.70s/it] 22%|██▏ | 354/1610 [1:45:53<3:35:47, 10.31s/it] {'loss': 0.002, 'grad_norm': 1.1162058447247225, 'learning_rate': 7.801242236024844e-07, 'completion_length': 93.54464721679688, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3303572535514832, 'reward_std': 0.15933408588171005, 'kl': 0.05078125, 'epoch': 1.1} 22%|██▏ | 354/1610 [1:45:53<3:35:47, 10.31s/it] 
22%|██▏ | 355/1610 [1:46:03<3:31:29, 10.11s/it] {'loss': 0.0022, 'grad_norm': 1.944126008673611, 'learning_rate': 7.79503105590062e-07, 'completion_length': 75.08928871154785, 'rewards/accuracy_reward': 0.3928571790456772, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.22875341773033142, 'kl': 0.0550537109375, 'epoch': 1.1} 22%|██▏ | 355/1610 [1:46:03<3:31:29, 10.11s/it] 22%|██▏ | 356/1610 [1:46:14<3:40:54, 10.57s/it] {'loss': 0.0025, 'grad_norm': 5.601992180097085, 'learning_rate': 7.788819875776397e-07, 'completion_length': 86.16964721679688, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.973214328289032, 'reward': 1.3035714626312256, 'reward_std': 0.25131121277809143, 'kl': 0.0628662109375, 'epoch': 1.11} 22%|██▏ | 356/1610 [1:46:14<3:40:54, 10.57s/it] 22%|██▏ | 357/1610 [1:46:26<3:45:04, 10.78s/it] {'loss': 0.0029, 'grad_norm': 1.6530821703036822, 'learning_rate': 7.782608695652173e-07, 'completion_length': 96.23214721679688, 'rewards/accuracy_reward': 0.1696428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.1517857313156128, 'reward_std': 0.25770068168640137, 'kl': 0.0731201171875, 'epoch': 1.11} 22%|██▏ | 357/1610 [1:46:26<3:45:04, 10.78s/it] 22%|██▏ | 358/1610 [1:46:37<3:48:28, 10.95s/it] {'loss': 0.0028, 'grad_norm': 1.5316226784203613, 'learning_rate': 7.776397515527951e-07, 'completion_length': 104.60714721679688, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.2663346827030182, 'kl': 0.0712890625, 'epoch': 1.11} 22%|██▏ | 358/1610 [1:46:37<3:48:28, 10.95s/it] 22%|██▏ | 359/1610 [1:46:47<3:42:25, 10.67s/it] {'loss': 0.004, 'grad_norm': 2.8034855160506007, 'learning_rate': 7.770186335403727e-07, 'completion_length': 64.56250381469727, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.18788283318281174, 'kl': 0.10107421875, 'epoch': 1.11} 
22%|██▏ | 359/1610 [1:46:47<3:42:25, 10.67s/it] 22%|██▏ | 360/1610 [1:46:58<3:45:34, 10.83s/it] {'loss': 0.0027, 'grad_norm': 2.9430332781236763, 'learning_rate': 7.763975155279503e-07, 'completion_length': 95.91964721679688, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.285714328289032, 'reward_std': 0.2798745036125183, 'kl': 0.0673828125, 'epoch': 1.12} 22%|██▏ | 360/1610 [1:46:58<3:45:34, 10.83s/it] 22%|██▏ | 361/1610 [1:47:09<3:47:15, 10.92s/it] {'loss': 0.0023, 'grad_norm': 1.4726317420194028, 'learning_rate': 7.75776397515528e-07, 'completion_length': 98.51786422729492, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.17104806005954742, 'kl': 0.0584716796875, 'epoch': 1.12} 22%|██▏ | 361/1610 [1:47:09<3:47:15, 10.92s/it] 22%|██▏ | 362/1610 [1:47:19<3:38:13, 10.49s/it] {'loss': 0.0019, 'grad_norm': 2.015832738190011, 'learning_rate': 7.751552795031056e-07, 'completion_length': 87.36607360839844, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660715222358704, 'reward_std': 0.1704346016049385, 'kl': 0.047119140625, 'epoch': 1.12} 22%|██▏ | 362/1610 [1:47:19<3:38:13, 10.49s/it] 23%|██▎ | 363/1610 [1:47:31<3:51:15, 11.13s/it] {'loss': 0.0017, 'grad_norm': 1.5954804648527998, 'learning_rate': 7.745341614906832e-07, 'completion_length': 119.16072082519531, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4464285969734192, 'reward_std': 0.34148265421390533, 'kl': 0.042236328125, 'epoch': 1.13} 23%|██▎ | 363/1610 [1:47:31<3:51:15, 11.13s/it] 23%|██▎ | 364/1610 [1:47:43<3:54:15, 11.28s/it] {'loss': 0.0024, 'grad_norm': 1.43926144047621, 'learning_rate': 7.739130434782608e-07, 'completion_length': 99.4285774230957, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4910714626312256, 
'reward_std': 0.20037931203842163, 'kl': 0.0594482421875, 'epoch': 1.13} 23%|██▎ | 364/1610 [1:47:43<3:54:15, 11.28s/it] 23%|██▎ | 365/1610 [1:47:55<3:57:41, 11.46s/it] {'loss': 0.0023, 'grad_norm': 2.1130489431578705, 'learning_rate': 7.732919254658385e-07, 'completion_length': 99.91964721679688, 'rewards/accuracy_reward': 0.3571428880095482, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.09528662264347076, 'kl': 0.056884765625, 'epoch': 1.13} 23%|██▎ | 365/1610 [1:47:55<3:57:41, 11.46s/it] 23%|██▎ | 366/1610 [1:48:06<3:53:41, 11.27s/it] {'loss': 0.0024, 'grad_norm': 2.4070033069541226, 'learning_rate': 7.726708074534161e-07, 'completion_length': 100.46429061889648, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.17495086789131165, 'kl': 0.05908203125, 'epoch': 1.14} 23%|██▎ | 366/1610 [1:48:06<3:53:41, 11.27s/it] 23%|██▎ | 367/1610 [1:48:17<3:55:02, 11.35s/it] {'loss': 0.0024, 'grad_norm': 0.8826109732935501, 'learning_rate': 7.720496894409939e-07, 'completion_length': 126.2410774230957, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5267857909202576, 'reward_std': 0.13225442171096802, 'kl': 0.0592041015625, 'epoch': 1.14} 23%|██▎ | 367/1610 [1:48:17<3:55:02, 11.35s/it] 23%|██▎ | 368/1610 [1:48:30<4:05:21, 11.85s/it] {'loss': 0.0026, 'grad_norm': 1.472782875172759, 'learning_rate': 7.714285714285714e-07, 'completion_length': 98.9910774230957, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.508928656578064, 'reward_std': 0.2636442929506302, 'kl': 0.06494140625, 'epoch': 1.14} 23%|██▎ | 368/1610 [1:48:30<4:05:21, 11.85s/it] 23%|██▎ | 369/1610 [1:48:41<3:59:15, 11.57s/it] {'loss': 0.0039, 'grad_norm': 1.687798416988034, 'learning_rate': 7.70807453416149e-07, 'completion_length': 69.01786041259766, 'rewards/accuracy_reward': 0.6160714328289032, 
'rewards/format_reward': 1.0, 'reward': 1.6160715222358704, 'reward_std': 0.15872061997652054, 'kl': 0.098388671875, 'epoch': 1.15} 23%|██▎ | 369/1610 [1:48:41<3:59:15, 11.57s/it] 23%|██▎ | 370/1610 [1:48:52<3:55:19, 11.39s/it] {'loss': 0.0021, 'grad_norm': 1.7765004769857538, 'learning_rate': 7.701863354037266e-07, 'completion_length': 108.21429061889648, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.383928656578064, 'reward_std': 0.2287590205669403, 'kl': 0.05224609375, 'epoch': 1.15} 23%|██▎ | 370/1610 [1:48:52<3:55:19, 11.39s/it] 23%|██▎ | 371/1610 [1:49:03<3:50:09, 11.15s/it] {'loss': 0.0022, 'grad_norm': 1.7027393301227214, 'learning_rate': 7.695652173913043e-07, 'completion_length': 85.81250381469727, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.2410808801651001, 'kl': 0.0556640625, 'epoch': 1.15} 23%|██▎ | 371/1610 [1:49:03<3:50:09, 11.15s/it] 23%|██▎ | 372/1610 [1:49:15<3:54:00, 11.34s/it] {'loss': 0.0025, 'grad_norm': 1.1989003640576403, 'learning_rate': 7.689440993788819e-07, 'completion_length': 107.71429061889648, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3571429252624512, 'reward_std': 0.21044061332941055, 'kl': 0.0628662109375, 'epoch': 1.16} 23%|██▎ | 372/1610 [1:49:15<3:54:00, 11.34s/it] 23%|██▎ | 373/1610 [1:49:25<3:45:01, 10.91s/it] {'loss': 0.002, 'grad_norm': 2.0403469135724794, 'learning_rate': 7.683229813664595e-07, 'completion_length': 81.97321891784668, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.1704345941543579, 'kl': 0.0506591796875, 'epoch': 1.16} 23%|██▎ | 373/1610 [1:49:25<3:45:01, 10.91s/it] 23%|██▎ | 374/1610 [1:49:35<3:44:53, 10.92s/it] {'loss': 0.0047, 'grad_norm': 0.8273943917993306, 'learning_rate': 7.677018633540372e-07, 'completion_length': 71.85714721679688, 
'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910714626312256, 'reward_std': 0.0964989960193634, 'kl': 0.11767578125, 'epoch': 1.16} 23%|██▎ | 374/1610 [1:49:35<3:44:53, 10.92s/it] 23%|██▎ | 375/1610 [1:49:49<4:00:26, 11.68s/it] {'loss': 0.0029, 'grad_norm': 1.618946467653902, 'learning_rate': 7.670807453416148e-07, 'completion_length': 124.91964721679688, 'rewards/accuracy_reward': 0.2321428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2232143878936768, 'reward_std': 0.14969704672694206, 'kl': 0.0731201171875, 'epoch': 1.16} 23%|██▎ | 375/1610 [1:49:49<4:00:26, 11.68s/it] 23%|██▎ | 376/1610 [1:50:00<3:55:05, 11.43s/it] {'loss': 0.0019, 'grad_norm': 2.4307840107707057, 'learning_rate': 7.664596273291925e-07, 'completion_length': 110.28572082519531, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4375000596046448, 'reward_std': 0.32282766699790955, 'kl': 0.048095703125, 'epoch': 1.17} 23%|██▎ | 376/1610 [1:50:00<3:55:05, 11.43s/it] 23%|██▎ | 377/1610 [1:50:11<3:51:18, 11.26s/it] {'loss': 0.0022, 'grad_norm': 1.6589845967685184, 'learning_rate': 7.658385093167702e-07, 'completion_length': 98.7589340209961, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.2936766594648361, 'kl': 0.0562744140625, 'epoch': 1.17} 23%|██▎ | 377/1610 [1:50:11<3:51:18, 11.26s/it] 23%|██▎ | 378/1610 [1:50:21<3:43:26, 10.88s/it] {'loss': 0.003, 'grad_norm': 1.0309541168258, 'learning_rate': 7.652173913043478e-07, 'completion_length': 88.2410774230957, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.1509094163775444, 'kl': 0.074462890625, 'epoch': 1.17} 23%|██▎ | 378/1610 [1:50:21<3:43:26, 10.88s/it] 24%|██▎ | 379/1610 [1:50:34<3:55:58, 11.50s/it] {'loss': 0.0022, 'grad_norm': 1.9849780146193203, 'learning_rate': 
7.645962732919254e-07, 'completion_length': 102.66071701049805, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4285715222358704, 'reward_std': 0.32586027681827545, 'kl': 0.0538330078125, 'epoch': 1.18} 24%|██▎ | 379/1610 [1:50:34<3:55:58, 11.50s/it] 24%|██▎ | 380/1610 [1:50:45<3:57:39, 11.59s/it] {'loss': 0.0024, 'grad_norm': 2.4060089882143667, 'learning_rate': 7.639751552795031e-07, 'completion_length': 93.64286231994629, 'rewards/accuracy_reward': 0.3035714402794838, 'rewards/format_reward': 1.0, 'reward': 1.3035714626312256, 'reward_std': 0.167145274579525, 'kl': 0.0595703125, 'epoch': 1.18} 24%|██▎ | 380/1610 [1:50:45<3:57:39, 11.59s/it] 24%|██▎ | 381/1610 [1:50:57<3:59:23, 11.69s/it] {'loss': 0.0018, 'grad_norm': 4.008778372450152, 'learning_rate': 7.633540372670807e-07, 'completion_length': 121.16072463989258, 'rewards/accuracy_reward': 0.2678571566939354, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.1575082316994667, 'kl': 0.046142578125, 'epoch': 1.18} 24%|██▎ | 381/1610 [1:50:57<3:59:23, 11.69s/it] 24%|██▎ | 382/1610 [1:51:10<4:03:35, 11.90s/it] {'loss': 0.0022, 'grad_norm': 1.8167977266525088, 'learning_rate': 7.627329192546583e-07, 'completion_length': 118.3035774230957, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.4017857909202576, 'reward_std': 0.27206334471702576, 'kl': 0.0543212890625, 'epoch': 1.19} 24%|██▎ | 382/1610 [1:51:10<4:03:35, 11.90s/it] 24%|██▍ | 383/1610 [1:51:21<3:58:54, 11.68s/it] {'loss': 0.0036, 'grad_norm': 1.953017348917588, 'learning_rate': 7.62111801242236e-07, 'completion_length': 80.8660774230957, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.18397442996501923, 'kl': 0.0888671875, 'epoch': 1.19} 24%|██▍ | 383/1610 [1:51:21<3:58:54, 11.68s/it] 24%|██▍ | 384/1610 [1:51:33<3:59:35, 11.73s/it] {'loss': 0.0022, 
'grad_norm': 1.8193479443343354, 'learning_rate': 7.614906832298136e-07, 'completion_length': 108.04464721679688, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3660715222358704, 'reward_std': 0.23711898922920227, 'kl': 0.0538330078125, 'epoch': 1.19} 24%|██▍ | 384/1610 [1:51:33<3:59:35, 11.73s/it] 24%|██▍ | 385/1610 [1:51:46<4:06:21, 12.07s/it] {'loss': 0.0031, 'grad_norm': 2.9700386717207574, 'learning_rate': 7.608695652173913e-07, 'completion_length': 114.68750381469727, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4821429252624512, 'reward_std': 0.3135328143835068, 'kl': 0.076416015625, 'epoch': 1.2} 24%|██▍ | 385/1610 [1:51:46<4:06:21, 12.07s/it] 24%|██▍ | 386/1610 [1:51:56<3:56:19, 11.58s/it] {'loss': 0.0041, 'grad_norm': 1.5437429993040288, 'learning_rate': 7.60248447204969e-07, 'completion_length': 85.20536041259766, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660715222358704, 'reward_std': 0.1704346090555191, 'kl': 0.1015625, 'epoch': 1.2} 24%|██▍ | 386/1610 [1:51:56<3:56:19, 11.58s/it] 24%|██▍ | 387/1610 [1:52:08<4:00:15, 11.79s/it] {'loss': 0.0028, 'grad_norm': 2.3639446652769758, 'learning_rate': 7.596273291925466e-07, 'completion_length': 106.45536041259766, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4017857909202576, 'reward_std': 0.19178562611341476, 'kl': 0.071044921875, 'epoch': 1.2} 24%|██▍ | 387/1610 [1:52:08<4:00:15, 11.79s/it] 24%|██▍ | 388/1610 [1:52:19<3:51:24, 11.36s/it] {'loss': 0.0024, 'grad_norm': 1.8789410575119596, 'learning_rate': 7.590062111801242e-07, 'completion_length': 107.8660774230957, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.27535824477672577, 'kl': 0.060302734375, 'epoch': 1.2} 24%|██▍ | 388/1610 [1:52:19<3:51:24, 11.36s/it] 
24%|██▍ | 389/1610 [1:52:32<4:06:13, 12.10s/it] {'loss': 0.0019, 'grad_norm': 1.683695005588531, 'learning_rate': 7.583850931677019e-07, 'completion_length': 148.12500762939453, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 0.973214328289032, 'reward': 1.2410714626312256, 'reward_std': 0.2924629747867584, 'kl': 0.0467529296875, 'epoch': 1.21} 24%|██▍ | 389/1610 [1:52:32<4:06:13, 12.10s/it] 24%|██▍ | 390/1610 [1:52:44<4:01:45, 11.89s/it] {'loss': 0.0023, 'grad_norm': 0.9028087147814698, 'learning_rate': 7.577639751552795e-07, 'completion_length': 112.68750381469727, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.12175281345844269, 'kl': 0.056640625, 'epoch': 1.21} 24%|██▍ | 390/1610 [1:52:44<4:01:45, 11.89s/it] 24%|██▍ | 391/1610 [1:52:56<4:05:23, 12.08s/it] {'loss': 0.0028, 'grad_norm': 1.4791933993086506, 'learning_rate': 7.57142857142857e-07, 'completion_length': 89.14286422729492, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5178571939468384, 'reward_std': 0.1995968073606491, 'kl': 0.071044921875, 'epoch': 1.21} 24%|██▍ | 391/1610 [1:52:56<4:05:23, 12.08s/it] 24%|██▍ | 392/1610 [1:53:08<4:02:57, 11.97s/it] {'loss': 0.0031, 'grad_norm': 2.5957760579656277, 'learning_rate': 7.565217391304347e-07, 'completion_length': 93.43750381469727, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.2500988394021988, 'kl': 0.077880859375, 'epoch': 1.22} 24%|██▍ | 392/1610 [1:53:08<4:02:57, 11.97s/it] 24%|██▍ | 393/1610 [1:53:20<4:00:46, 11.87s/it] {'loss': 0.0022, 'grad_norm': 0.8168755779905102, 'learning_rate': 7.559006211180123e-07, 'completion_length': 110.1875, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.1575082391500473, 'kl': 0.0560302734375, 'epoch': 1.22} 24%|██▍ | 
393/1610 [1:53:20<4:00:46, 11.87s/it] 24%|██▍ | 394/1610 [1:53:31<3:59:13, 11.80s/it] {'loss': 0.0026, 'grad_norm': 0.6832658199478693, 'learning_rate': 7.5527950310559e-07, 'completion_length': 107.7589340209961, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.0835726372897625, 'kl': 0.0638427734375, 'epoch': 1.22} 24%|██▍ | 394/1610 [1:53:31<3:59:13, 11.80s/it] 25%|██▍ | 395/1610 [1:53:45<4:11:23, 12.41s/it] {'loss': 0.0023, 'grad_norm': 1.6105558388628152, 'learning_rate': 7.546583850931677e-07, 'completion_length': 110.41964721679688, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.508928656578064, 'reward_std': 0.2149568870663643, 'kl': 0.0584716796875, 'epoch': 1.23} 25%|██▍ | 395/1610 [1:53:45<4:11:23, 12.41s/it] 25%|██▍ | 396/1610 [1:53:55<3:57:42, 11.75s/it] {'loss': 0.0031, 'grad_norm': 1.930262172410025, 'learning_rate': 7.540372670807453e-07, 'completion_length': 89.61607360839844, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.23959724605083466, 'kl': 0.076416015625, 'epoch': 1.23} 25%|██▍ | 396/1610 [1:53:55<3:57:42, 11.75s/it] 25%|██▍ | 397/1610 [1:54:08<4:03:29, 12.04s/it] {'loss': 0.0025, 'grad_norm': 0.8883415642057975, 'learning_rate': 7.534161490683229e-07, 'completion_length': 110.03571701049805, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3303572535514832, 'reward_std': 0.1379830613732338, 'kl': 0.0615234375, 'epoch': 1.23} 25%|██▍ | 397/1610 [1:54:08<4:03:29, 12.04s/it] 25%|██▍ | 398/1610 [1:54:21<4:09:08, 12.33s/it] {'loss': 0.0023, 'grad_norm': 1.7686429973347675, 'learning_rate': 7.527950310559006e-07, 'completion_length': 109.29464721679688, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6250000596046448, 'reward_std': 
0.29010486602783203, 'kl': 0.05810546875, 'epoch': 1.24} 25%|██▍ | 398/1610 [1:54:21<4:09:08, 12.33s/it] 25%|██▍ | 399/1610 [1:54:33<4:05:11, 12.15s/it] {'loss': 0.0023, 'grad_norm': 1.2901983741750196, 'learning_rate': 7.521739130434782e-07, 'completion_length': 87.24107360839844, 'rewards/accuracy_reward': 0.6160714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6071429252624512, 'reward_std': 0.2053254395723343, 'kl': 0.057373046875, 'epoch': 1.24} 25%|██▍ | 399/1610 [1:54:33<4:05:11, 12.15s/it] 25%|██▍ | 400/1610 [1:54:44<3:56:24, 11.72s/it] {'loss': 0.0027, 'grad_norm': 2.1080787288851557, 'learning_rate': 7.515527950310558e-07, 'completion_length': 89.68750381469727, 'rewards/accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.321428656578064, 'reward_std': 0.22936689853668213, 'kl': 0.068115234375, 'epoch': 1.24} 25%|██▍ | 400/1610 [1:54:44<3:56:24, 11.72s/it] 25%|██▍ | 401/1610 [1:55:35<7:53:21, 23.49s/it] {'loss': 0.0028, 'grad_norm': 1.7573033704158207, 'learning_rate': 7.509316770186335e-07, 'completion_length': 99.04464721679688, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.20021027326583862, 'kl': 0.0701904296875, 'epoch': 1.25} 25%|██▍ | 401/1610 [1:55:35<7:53:21, 23.49s/it] 25%|██▍ | 402/1610 [1:55:46<6:41:24, 19.94s/it] {'loss': 0.0018, 'grad_norm': 1.5311060936683472, 'learning_rate': 7.503105590062111e-07, 'completion_length': 111.83036041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6517857909202576, 'reward_std': 0.260606050491333, 'kl': 0.0445556640625, 'epoch': 1.25} 25%|██▍ | 402/1610 [1:55:46<6:41:24, 19.94s/it] 25%|██▌ | 403/1610 [1:55:55<5:36:33, 16.73s/it] {'loss': 0.0029, 'grad_norm': 1.5680837226663458, 'learning_rate': 7.496894409937888e-07, 'completion_length': 85.30357360839844, 'rewards/accuracy_reward': 0.4285714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.23327527940273285, 'kl': 0.072021484375, 'epoch': 1.25} 25%|██▌ | 403/1610 [1:55:55<5:36:33, 16.73s/it] 25%|██▌ | 404/1610 [1:56:06<4:58:29, 14.85s/it] {'loss': 0.0021, 'grad_norm': 1.1427559417741586, 'learning_rate': 7.490683229813665e-07, 'completion_length': 93.08929061889648, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.09528662264347076, 'kl': 0.05322265625, 'epoch': 1.25} 25%|██▌ | 404/1610 [1:56:06<4:58:29, 14.85s/it] 25%|██▌ | 405/1610 [1:56:17<4:37:25, 13.81s/it] {'loss': 0.0026, 'grad_norm': 1.4771664226046441, 'learning_rate': 7.484472049689441e-07, 'completion_length': 107.60714721679688, 'rewards/accuracy_reward': 0.2500000149011612, 'rewards/format_reward': 1.0, 'reward': 1.2500000596046448, 'reward_std': 0.18397442996501923, 'kl': 0.065673828125, 'epoch': 1.26} 25%|██▌ | 405/1610 [1:56:17<4:37:25, 13.81s/it] 25%|██▌ | 406/1610 [1:56:28<4:21:03, 13.01s/it] {'loss': 0.0022, 'grad_norm': 0.9094158572206374, 'learning_rate': 7.478260869565217e-07, 'completion_length': 110.34822082519531, 'rewards/accuracy_reward': 0.4642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.16922222077846527, 'kl': 0.0560302734375, 'epoch': 1.26} 25%|██▌ | 406/1610 [1:56:28<4:21:03, 13.01s/it] 25%|██▌ | 407/1610 [1:56:39<4:04:44, 12.21s/it] {'loss': 0.0018, 'grad_norm': 1.3450567561902145, 'learning_rate': 7.472049689440994e-07, 'completion_length': 113.9464340209961, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.260606050491333, 'kl': 0.0455322265625, 'epoch': 1.26} 25%|██▌ | 407/1610 [1:56:39<4:04:44, 12.21s/it] 25%|██▌ | 408/1610 [1:56:52<4:09:08, 12.44s/it] {'loss': 0.0019, 'grad_norm': 1.0443046025660774, 'learning_rate': 7.46583850931677e-07, 'completion_length': 140.96429443359375, 
'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3928571939468384, 'reward_std': 0.21044061332941055, 'kl': 0.0482177734375, 'epoch': 1.27} 25%|██▌ | 408/1610 [1:56:52<4:09:08, 12.44s/it] 25%|██▌ | 409/1610 [1:57:03<4:01:06, 12.05s/it] {'loss': 0.0023, 'grad_norm': 1.1719564907599933, 'learning_rate': 7.459627329192546e-07, 'completion_length': 97.33928680419922, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.2092282474040985, 'kl': 0.05810546875, 'epoch': 1.27} 25%|██▌ | 409/1610 [1:57:03<4:01:06, 12.05s/it] 25%|██▌ | 410/1610 [1:57:16<4:05:16, 12.26s/it] {'loss': 0.0022, 'grad_norm': 1.5538235145458623, 'learning_rate': 7.453416149068323e-07, 'completion_length': 140.19643783569336, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.2765706181526184, 'kl': 0.0540771484375, 'epoch': 1.27} 25%|██▌ | 410/1610 [1:57:16<4:05:16, 12.26s/it] 26%|██▌ | 411/1610 [1:57:28<4:05:01, 12.26s/it] {'loss': 0.0024, 'grad_norm': 3.735022252215363, 'learning_rate': 7.447204968944099e-07, 'completion_length': 113.87500381469727, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.23838484287261963, 'kl': 0.0596923828125, 'epoch': 1.28} 26%|██▌ | 411/1610 [1:57:28<4:05:01, 12.26s/it] 26%|██▌ | 412/1610 [1:57:40<4:01:04, 12.07s/it] {'loss': 0.0021, 'grad_norm': 1.881523100793052, 'learning_rate': 7.440993788819876e-07, 'completion_length': 116.1964340209961, 'rewards/accuracy_reward': 0.2410714477300644, 'rewards/format_reward': 1.0, 'reward': 1.2410715222358704, 'reward_std': 0.2663346827030182, 'kl': 0.0513916015625, 'epoch': 1.28} 26%|██▌ | 412/1610 [1:57:40<4:01:04, 12.07s/it] 26%|██▌ | 413/1610 [1:57:52<4:02:12, 12.14s/it] {'loss': 0.0019, 'grad_norm': 2.1375013410051054, 
'learning_rate': 7.434782608695653e-07, 'completion_length': 123.59821701049805, 'rewards/accuracy_reward': 0.3660714626312256, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.3700314164161682, 'kl': 0.048583984375, 'epoch': 1.28} 26%|██▌ | 413/1610 [1:57:52<4:02:12, 12.14s/it] 26%|██▌ | 414/1610 [1:58:04<3:59:13, 12.00s/it] {'loss': 0.0023, 'grad_norm': 1.7489112948931838, 'learning_rate': 7.428571428571429e-07, 'completion_length': 112.5535774230957, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.2443758025765419, 'kl': 0.0576171875, 'epoch': 1.29} 26%|██▌ | 414/1610 [1:58:04<3:59:13, 12.00s/it] 26%|██▌ | 415/1610 [1:58:14<3:51:47, 11.64s/it] {'loss': 0.0021, 'grad_norm': 1.5126037797089995, 'learning_rate': 7.422360248447204e-07, 'completion_length': 111.34822082519531, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.343907430768013, 'kl': 0.05322265625, 'epoch': 1.29} 26%|██▌ | 415/1610 [1:58:14<3:51:47, 11.64s/it] 26%|██▌ | 416/1610 [1:58:26<3:50:37, 11.59s/it] {'loss': 0.0019, 'grad_norm': 1.0982666382630208, 'learning_rate': 7.416149068322981e-07, 'completion_length': 117.2410774230957, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.17104807496070862, 'kl': 0.04736328125, 'epoch': 1.29} 26%|██▌ | 416/1610 [1:58:26<3:50:37, 11.59s/it] 26%|██▌ | 417/1610 [1:58:39<3:59:04, 12.02s/it] {'loss': 0.0022, 'grad_norm': 1.4322347994374651, 'learning_rate': 7.409937888198757e-07, 'completion_length': 133.00000762939453, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.23838484287261963, 'kl': 0.0543212890625, 'epoch': 1.3} 26%|██▌ | 417/1610 [1:58:39<3:59:04, 12.02s/it] 26%|██▌ | 418/1610 [1:58:50<3:56:02, 11.88s/it] {'loss': 0.0014, 'grad_norm': 
1.706989535529418, 'learning_rate': 7.403726708074533e-07, 'completion_length': 117.1160774230957, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.2696240097284317, 'kl': 0.0340576171875, 'epoch': 1.3} 26%|██▌ | 418/1610 [1:58:50<3:56:02, 11.88s/it] 26%|██▌ | 419/1610 [1:59:04<4:06:59, 12.44s/it] {'loss': 0.002, 'grad_norm': 1.4645646881839551, 'learning_rate': 7.39751552795031e-07, 'completion_length': 135.32143783569336, 'rewards/accuracy_reward': 0.2142857313156128, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.196428656578064, 'reward_std': 0.2576950713992119, 'kl': 0.049072265625, 'epoch': 1.3} 26%|██▌ | 419/1610 [1:59:04<4:06:59, 12.44s/it] 26%|██▌ | 420/1610 [1:59:17<4:12:15, 12.72s/it] {'loss': 0.0022, 'grad_norm': 2.578537848133944, 'learning_rate': 7.391304347826086e-07, 'completion_length': 133.87500762939453, 'rewards/accuracy_reward': 0.3035714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2946429252624512, 'reward_std': 0.19197798520326614, 'kl': 0.0546875, 'epoch': 1.3} 26%|██▌ | 420/1610 [1:59:17<4:12:15, 12.72s/it] 26%|██▌ | 421/1610 [1:59:30<4:09:37, 12.60s/it] {'loss': 0.0018, 'grad_norm': 1.592151657042735, 'learning_rate': 7.385093167701863e-07, 'completion_length': 132.28571701049805, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.26423758268356323, 'kl': 0.04541015625, 'epoch': 1.31} 26%|██▌ | 421/1610 [1:59:30<4:09:37, 12.60s/it] 26%|██▌ | 422/1610 [1:59:43<4:13:06, 12.78s/it] {'loss': 0.0027, 'grad_norm': 1.2429057119414377, 'learning_rate': 7.37888198757764e-07, 'completion_length': 117.66072082519531, 'rewards/accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.446428656578064, 'reward_std': 0.17885926365852356, 'kl': 0.066650390625, 'epoch': 1.31} 26%|██▌ | 422/1610 [1:59:43<4:13:06, 12.78s/it] 26%|██▋ | 423/1610 
[1:59:56<4:12:37, 12.77s/it] {'loss': 0.0021, 'grad_norm': 1.4377240529145707, 'learning_rate': 7.372670807453416e-07, 'completion_length': 138.1339340209961, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.598214328289032, 'reward_std': 0.2512577176094055, 'kl': 0.0523681640625, 'epoch': 1.31} 26%|██▋ | 423/1610 [1:59:56<4:12:37, 12.77s/it] 26%|██▋ | 424/1610 [2:00:07<4:02:13, 12.25s/it] {'loss': 0.0026, 'grad_norm': 2.1026365259053774, 'learning_rate': 7.366459627329192e-07, 'completion_length': 121.6964340209961, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3303571939468384, 'reward_std': 0.2954912781715393, 'kl': 0.066162109375, 'epoch': 1.32} 26%|██▋ | 424/1610 [2:00:07<4:02:13, 12.25s/it] 26%|██▋ | 425/1610 [2:00:20<4:06:41, 12.49s/it] {'loss': 0.0019, 'grad_norm': 1.5466011312115922, 'learning_rate': 7.360248447204969e-07, 'completion_length': 150.0357208251953, 'rewards/accuracy_reward': 0.348214291036129, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.321428656578064, 'reward_std': 0.20150023698806763, 'kl': 0.0478515625, 'epoch': 1.32} 26%|██▋ | 425/1610 [2:00:20<4:06:41, 12.49s/it] 26%|██▋ | 426/1610 [2:00:33<4:08:04, 12.57s/it] {'loss': 0.0017, 'grad_norm': 1.7048526185218602, 'learning_rate': 7.354037267080745e-07, 'completion_length': 139.86607360839844, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.33306360244750977, 'kl': 0.0413818359375, 'epoch': 1.32} 26%|██▋ | 426/1610 [2:00:33<4:08:04, 12.57s/it] 27%|██▋ | 427/1610 [2:00:46<4:12:17, 12.80s/it] {'loss': 0.0022, 'grad_norm': 3.009409286287458, 'learning_rate': 7.347826086956521e-07, 'completion_length': 130.31250762939453, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3928571939468384, 'reward_std': 0.2501044422388077, 'kl': 0.0555419921875, 
'epoch': 1.33} 27%|██▋ | 427/1610 [2:00:46<4:12:17, 12.80s/it] 27%|██▋ | 428/1610 [2:00:58<4:10:37, 12.72s/it] {'loss': 0.0027, 'grad_norm': 1.2443863734333698, 'learning_rate': 7.341614906832298e-07, 'completion_length': 137.20536422729492, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.3334621340036392, 'kl': 0.06640625, 'epoch': 1.33} 27%|██▋ | 428/1610 [2:00:58<4:10:37, 12.72s/it] 27%|██▋ | 429/1610 [2:01:11<4:09:29, 12.68s/it] {'loss': 0.002, 'grad_norm': 1.4061720796837736, 'learning_rate': 7.335403726708074e-07, 'completion_length': 144.5714340209961, 'rewards/accuracy_reward': 0.232142873108387, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.223214328289032, 'reward_std': 0.2540072351694107, 'kl': 0.0494384765625, 'epoch': 1.33} 27%|██▋ | 429/1610 [2:01:11<4:09:29, 12.68s/it] 27%|██▋ | 430/1610 [2:01:24<4:13:42, 12.90s/it] {'loss': 0.002, 'grad_norm': 1.3811056915100712, 'learning_rate': 7.329192546583851e-07, 'completion_length': 123.58929443359375, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.5446429252624512, 'reward_std': 0.2884967774152756, 'kl': 0.0499267578125, 'epoch': 1.34} 27%|██▋ | 430/1610 [2:01:24<4:13:42, 12.90s/it] 27%|██▋ | 431/1610 [2:01:38<4:19:30, 13.21s/it] {'loss': 0.0016, 'grad_norm': 1.1881201519105857, 'learning_rate': 7.322981366459628e-07, 'completion_length': 141.5982208251953, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.2710343599319458, 'kl': 0.04052734375, 'epoch': 1.34} 27%|██▋ | 431/1610 [2:01:38<4:19:30, 13.21s/it] 27%|██▋ | 432/1610 [2:01:51<4:12:54, 12.88s/it] {'loss': 0.0021, 'grad_norm': 1.6550717148077347, 'learning_rate': 7.316770186335404e-07, 'completion_length': 123.5089340209961, 'rewards/accuracy_reward': 0.4910714328289032, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.4821429252624512, 'reward_std': 0.3180546760559082, 'kl': 0.05224609375, 'epoch': 1.34} 27%|██▋ | 432/1610 [2:01:51<4:12:54, 12.88s/it] 27%|██▋ | 433/1610 [2:02:02<4:07:18, 12.61s/it] {'loss': 0.003, 'grad_norm': 1.9105235392568416, 'learning_rate': 7.31055900621118e-07, 'completion_length': 103.12500381469727, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.2022872269153595, 'kl': 0.07421875, 'epoch': 1.34} 27%|██▋ | 433/1610 [2:02:02<4:07:18, 12.61s/it] 27%|██▋ | 434/1610 [2:02:15<4:03:48, 12.44s/it] {'loss': 0.0021, 'grad_norm': 4.050111902767282, 'learning_rate': 7.304347826086957e-07, 'completion_length': 112.03572082519531, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4107143878936768, 'reward_std': 0.2726622223854065, 'kl': 0.0531005859375, 'epoch': 1.35} 27%|██▋ | 434/1610 [2:02:15<4:03:48, 12.44s/it] 27%|██▋ | 435/1610 [2:02:27<4:06:35, 12.59s/it] {'loss': 0.0029, 'grad_norm': 1.2020222389506543, 'learning_rate': 7.298136645962733e-07, 'completion_length': 96.22322082519531, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4375000596046448, 'reward_std': 0.2119186520576477, 'kl': 0.072265625, 'epoch': 1.35} 27%|██▋ | 435/1610 [2:02:27<4:06:35, 12.59s/it] 27%|██▋ | 436/1610 [2:02:39<3:59:16, 12.23s/it] {'loss': 0.002, 'grad_norm': 2.4395499319655327, 'learning_rate': 7.291925465838509e-07, 'completion_length': 135.49107360839844, 'rewards/accuracy_reward': 0.2142857238650322, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2053571939468384, 'reward_std': 0.22605740278959274, 'kl': 0.049560546875, 'epoch': 1.35} 27%|██▋ | 436/1610 [2:02:39<3:59:16, 12.23s/it] 27%|██▋ | 437/1610 [2:02:51<3:56:07, 12.08s/it] {'loss': 0.002, 'grad_norm': 0.9205596145709571, 'learning_rate': 7.285714285714286e-07, 'completion_length': 112.29464721679688, 
'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.16653180867433548, 'kl': 0.050537109375, 'epoch': 1.36} 27%|██▋ | 437/1610 [2:02:51<3:56:07, 12.08s/it] 27%|██▋ | 438/1610 [2:03:03<3:56:53, 12.13s/it] {'loss': 0.0021, 'grad_norm': 1.8061417653525718, 'learning_rate': 7.279503105590061e-07, 'completion_length': 95.47321701049805, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.2383904606103897, 'kl': 0.05224609375, 'epoch': 1.36} 27%|██▋ | 438/1610 [2:03:03<3:56:53, 12.13s/it] 27%|██▋ | 439/1610 [2:03:14<3:49:03, 11.74s/it] {'loss': 0.0022, 'grad_norm': 1.5221501094677123, 'learning_rate': 7.273291925465838e-07, 'completion_length': 107.39286041259766, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.23959723114967346, 'kl': 0.055419921875, 'epoch': 1.36} 27%|██▋ | 439/1610 [2:03:14<3:49:03, 11.74s/it] 27%|██▋ | 440/1610 [2:03:26<3:50:25, 11.82s/it] {'loss': 0.0032, 'grad_norm': 1.6831217279519983, 'learning_rate': 7.267080745341615e-07, 'completion_length': 102.41072082519531, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.17885926365852356, 'kl': 0.080078125, 'epoch': 1.37} 27%|██▋ | 440/1610 [2:03:26<3:50:25, 11.82s/it] 27%|██▋ | 441/1610 [2:03:37<3:49:41, 11.79s/it] {'loss': 0.0023, 'grad_norm': 2.059690499159865, 'learning_rate': 7.260869565217391e-07, 'completion_length': 112.81250381469727, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5982143878936768, 'reward_std': 0.2314494401216507, 'kl': 0.05859375, 'epoch': 1.37} 27%|██▋ | 441/1610 [2:03:37<3:49:41, 11.79s/it] 27%|██▋ | 442/1610 [2:03:49<3:49:38, 11.80s/it] {'loss': 0.0021, 'grad_norm': 1.6725518940951167, 'learning_rate': 7.254658385093167e-07, 
'completion_length': 110.64286041259766, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.17885925620794296, 'kl': 0.052978515625, 'epoch': 1.37} 27%|██▋ | 442/1610 [2:03:49<3:49:38, 11.80s/it] 28%|██▊ | 443/1610 [2:04:03<3:58:41, 12.27s/it] {'loss': 0.0027, 'grad_norm': 2.20894661085748, 'learning_rate': 7.248447204968943e-07, 'completion_length': 119.59822082519531, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3750000596046448, 'reward_std': 0.27414587140083313, 'kl': 0.067626953125, 'epoch': 1.38} 28%|██▊ | 443/1610 [2:04:03<3:58:41, 12.27s/it] 28%|██▊ | 444/1610 [2:04:13<3:46:02, 11.63s/it] {'loss': 0.0022, 'grad_norm': 3.491291123898361, 'learning_rate': 7.24223602484472e-07, 'completion_length': 96.03572082519531, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 1.0, 'reward': 1.633928656578064, 'reward_std': 0.283163845539093, 'kl': 0.0545654296875, 'epoch': 1.38} 28%|██▊ | 444/1610 [2:04:13<3:46:02, 11.63s/it] 28%|██▊ | 445/1610 [2:04:21<3:29:10, 10.77s/it] {'loss': 0.0032, 'grad_norm': 1.4748653077392544, 'learning_rate': 7.236024844720496e-07, 'completion_length': 75.96429061889648, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.19447603821754456, 'kl': 0.078857421875, 'epoch': 1.38} 28%|██▊ | 445/1610 [2:04:21<3:29:10, 10.77s/it] 28%|██▊ | 446/1610 [2:04:32<3:25:04, 10.57s/it] {'loss': 0.0032, 'grad_norm': 5.733009611534221, 'learning_rate': 7.229813664596272e-07, 'completion_length': 86.95536041259766, 'rewards/accuracy_reward': 0.3660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.3660715222358704, 'reward_std': 0.24229325354099274, 'kl': 0.080322265625, 'epoch': 1.39} 28%|██▊ | 446/1610 [2:04:32<3:25:04, 10.57s/it] 28%|██▊ | 447/1610 [2:04:44<3:35:04, 11.10s/it] {'loss': 0.0023, 'grad_norm': 1.3977606618635414, 
'learning_rate': 7.223602484472049e-07, 'completion_length': 102.78571701049805, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.20922823250293732, 'kl': 0.057373046875, 'epoch': 1.39} 28%|██▊ | 447/1610 [2:04:44<3:35:04, 11.10s/it] 28%|██▊ | 448/1610 [2:04:54<3:30:22, 10.86s/it] {'loss': 0.0029, 'grad_norm': 1.4731471689744167, 'learning_rate': 7.217391304347826e-07, 'completion_length': 78.56250381469727, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5089285969734192, 'reward_std': 0.18214857578277588, 'kl': 0.072509765625, 'epoch': 1.39} 28%|██▊ | 448/1610 [2:04:54<3:30:22, 10.86s/it] 28%|██▊ | 449/1610 [2:05:04<3:23:09, 10.50s/it] {'loss': 0.0036, 'grad_norm': 1.938190383705347, 'learning_rate': 7.211180124223603e-07, 'completion_length': 79.40179061889648, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.22276806831359863, 'kl': 0.08935546875, 'epoch': 1.39} 28%|██▊ | 449/1610 [2:05:04<3:23:09, 10.50s/it] 28%|██▊ | 450/1610 [2:05:13<3:15:37, 10.12s/it] {'loss': 0.0019, 'grad_norm': 2.500650804936078, 'learning_rate': 7.204968944099379e-07, 'completion_length': 80.5089340209961, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.1860513761639595, 'kl': 0.0472412109375, 'epoch': 1.4} 28%|██▊ | 450/1610 [2:05:13<3:15:37, 10.12s/it] 28%|██▊ | 451/1610 [2:05:23<3:16:38, 10.18s/it] {'loss': 0.0021, 'grad_norm': 1.2581235494214524, 'learning_rate': 7.198757763975155e-07, 'completion_length': 96.52679061889648, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.18458788096904755, 'kl': 0.052978515625, 'epoch': 1.4} 28%|██▊ | 451/1610 [2:05:23<3:16:38, 10.18s/it] 28%|██▊ | 452/1610 [2:05:33<3:14:51, 10.10s/it] {'loss': 0.0024, 'grad_norm': 
3.958285243132307, 'learning_rate': 7.192546583850931e-07, 'completion_length': 90.62500381469727, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.2314494252204895, 'kl': 0.0609130859375, 'epoch': 1.4} 28%|██▊ | 452/1610 [2:05:33<3:14:51, 10.10s/it] 28%|██▊ | 453/1610 [2:05:43<3:10:52, 9.90s/it] {'loss': 0.0028, 'grad_norm': 1.0679241251795701, 'learning_rate': 7.186335403726708e-07, 'completion_length': 76.40178680419922, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.14970265328884125, 'kl': 0.0704345703125, 'epoch': 1.41} 28%|██▊ | 453/1610 [2:05:43<3:10:52, 9.90s/it] 28%|██▊ | 454/1610 [2:05:53<3:12:37, 10.00s/it] {'loss': 0.0027, 'grad_norm': 2.028044480416528, 'learning_rate': 7.180124223602484e-07, 'completion_length': 86.84821701049805, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.32915520668029785, 'kl': 0.06884765625, 'epoch': 1.41} 28%|██▊ | 454/1610 [2:05:53<3:12:37, 10.00s/it] 28%|██▊ | 455/1610 [2:06:04<3:16:40, 10.22s/it] {'loss': 0.0022, 'grad_norm': 2.1118953141654475, 'learning_rate': 7.17391304347826e-07, 'completion_length': 97.47321701049805, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.2344820499420166, 'kl': 0.0545654296875, 'epoch': 1.41} 28%|██▊ | 455/1610 [2:06:04<3:16:40, 10.22s/it] 28%|██▊ | 456/1610 [2:06:14<3:15:04, 10.14s/it] {'loss': 0.0033, 'grad_norm': 1.4945716338512496, 'learning_rate': 7.167701863354037e-07, 'completion_length': 80.74107360839844, 'rewards/accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.15360544621944427, 'kl': 0.082763671875, 'epoch': 1.42} 28%|██▊ | 456/1610 [2:06:14<3:15:04, 10.14s/it] 28%|██▊ | 457/1610 [2:06:25<3:21:00, 10.46s/it] {'loss': 
0.0027, 'grad_norm': 2.007366853686808, 'learning_rate': 7.161490683229814e-07, 'completion_length': 80.39286041259766, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.16141103208065033, 'kl': 0.067138671875, 'epoch': 1.42} 28%|██▊ | 457/1610 [2:06:25<3:21:00, 10.46s/it] 28%|██▊ | 458/1610 [2:06:38<3:34:41, 11.18s/it] {'loss': 0.0027, 'grad_norm': 1.7729121570995752, 'learning_rate': 7.15527950310559e-07, 'completion_length': 92.72321701049805, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4375000596046448, 'reward_std': 0.3499073088169098, 'kl': 0.06640625, 'epoch': 1.42} 28%|██▊ | 458/1610 [2:06:38<3:34:41, 11.18s/it] 29%|██▊ | 459/1610 [2:06:49<3:33:40, 11.14s/it] {'loss': 0.0043, 'grad_norm': 1.470230429497017, 'learning_rate': 7.149068322981367e-07, 'completion_length': 76.46428680419922, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.22754664719104767, 'kl': 0.107177734375, 'epoch': 1.43} 29%|██▊ | 459/1610 [2:06:49<3:33:40, 11.14s/it] 29%|██▊ | 460/1610 [2:06:59<3:28:15, 10.87s/it] {'loss': 0.0032, 'grad_norm': 2.992256023719593, 'learning_rate': 7.142857142857143e-07, 'completion_length': 87.9464340209961, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5714285969734192, 'reward_std': 0.29339978098869324, 'kl': 0.0789794921875, 'epoch': 1.43} 29%|██▊ | 460/1610 [2:06:59<3:28:15, 10.87s/it] 29%|██▊ | 461/1610 [2:07:09<3:21:17, 10.51s/it] {'loss': 0.003, 'grad_norm': 1.8052406600925615, 'learning_rate': 7.136645962732919e-07, 'completion_length': 76.08929061889648, 'rewards/accuracy_reward': 0.4196428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4107143878936768, 'reward_std': 0.21313104778528214, 'kl': 0.076171875, 'epoch': 1.43} 29%|██▊ | 461/1610 [2:07:09<3:21:17, 10.51s/it] 29%|██▊ | 
462/1610 [2:07:20<3:23:21, 10.63s/it] {'loss': 0.0025, 'grad_norm': 2.5687519397127834, 'learning_rate': 7.130434782608695e-07, 'completion_length': 101.41964721679688, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.2540072351694107, 'kl': 0.063720703125, 'epoch': 1.43} 29%|██▊ | 462/1610 [2:07:20<3:23:21, 10.63s/it] 29%|██▉ | 463/1610 [2:07:31<3:27:42, 10.87s/it] {'loss': 0.0029, 'grad_norm': 1.4804359658515591, 'learning_rate': 7.124223602484471e-07, 'completion_length': 91.87500381469727, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660715222358704, 'reward_std': 0.05831881985068321, 'kl': 0.0726318359375, 'epoch': 1.44} 29%|██▉ | 463/1610 [2:07:31<3:27:42, 10.87s/it] 29%|██▉ | 464/1610 [2:07:41<3:22:41, 10.61s/it] {'loss': 0.0021, 'grad_norm': 5.369109998876274, 'learning_rate': 7.118012422360247e-07, 'completion_length': 96.75000381469727, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.33184562623500824, 'kl': 0.051513671875, 'epoch': 1.44} 29%|██▉ | 464/1610 [2:07:41<3:22:41, 10.61s/it] 29%|██▉ | 465/1610 [2:07:50<3:15:47, 10.26s/it] {'loss': 0.0026, 'grad_norm': 1.9816099364891349, 'learning_rate': 7.111801242236024e-07, 'completion_length': 87.08929061889648, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.24315781891345978, 'kl': 0.06396484375, 'epoch': 1.44} 29%|██▉ | 465/1610 [2:07:50<3:15:47, 10.26s/it] 29%|██▉ | 466/1610 [2:08:02<3:22:53, 10.64s/it] {'loss': 0.0029, 'grad_norm': 1.7665565134003918, 'learning_rate': 7.105590062111801e-07, 'completion_length': 100.58036041259766, 'rewards/accuracy_reward': 0.1785714365541935, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.1696429252624512, 'reward_std': 0.10882645472884178, 'kl': 0.0712890625, 'epoch': 1.45} 29%|██▉ | 466/1610 
[2:08:02<3:22:53, 10.64s/it] 29%|██▉ | 467/1610 [2:08:12<3:19:26, 10.47s/it] {'loss': 0.0036, 'grad_norm': 1.1186594890856698, 'learning_rate': 7.099378881987577e-07, 'completion_length': 78.85714721679688, 'rewards/accuracy_reward': 0.401785746216774, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.10821297764778137, 'kl': 0.0888671875, 'epoch': 1.45} 29%|██▉ | 467/1610 [2:08:12<3:19:26, 10.47s/it] 29%|██▉ | 468/1610 [2:08:23<3:24:29, 10.74s/it] {'loss': 0.0021, 'grad_norm': 1.7138751720912715, 'learning_rate': 7.093167701863354e-07, 'completion_length': 91.2589340209961, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5178571939468384, 'reward_std': 0.16631686687469482, 'kl': 0.0533447265625, 'epoch': 1.45} 29%|██▉ | 468/1610 [2:08:23<3:24:29, 10.74s/it] 29%|██▉ | 469/1610 [2:08:32<3:11:28, 10.07s/it] {'loss': 0.0036, 'grad_norm': 2.6003854422866906, 'learning_rate': 7.08695652173913e-07, 'completion_length': 65.91071701049805, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.14518077671527863, 'kl': 0.0908203125, 'epoch': 1.46} 29%|██▉ | 469/1610 [2:08:32<3:11:28, 10.07s/it] 29%|██▉ | 470/1610 [2:08:43<3:15:10, 10.27s/it] {'loss': 0.0022, 'grad_norm': 2.3861448243800716, 'learning_rate': 7.080745341614906e-07, 'completion_length': 103.29464721679688, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.17164697498083115, 'kl': 0.0540771484375, 'epoch': 1.46} 29%|██▉ | 470/1610 [2:08:43<3:15:10, 10.27s/it] 29%|██▉ | 471/1610 [2:08:53<3:13:35, 10.20s/it] {'loss': 0.0039, 'grad_norm': 1.345848801842966, 'learning_rate': 7.074534161490683e-07, 'completion_length': 66.94643020629883, 'rewards/accuracy_reward': 0.294642873108387, 'rewards/format_reward': 1.0, 'reward': 1.2946429252624512, 'reward_std': 0.05831882357597351, 'kl': 0.097412109375, 
'epoch': 1.46} 29%|██▉ | 471/1610 [2:08:53<3:13:35, 10.20s/it] 29%|██▉ | 472/1610 [2:09:02<3:10:52, 10.06s/it] {'loss': 0.0022, 'grad_norm': 1.9222479496698797, 'learning_rate': 7.068322981366459e-07, 'completion_length': 83.41071701049805, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.22936688363552094, 'kl': 0.0550537109375, 'epoch': 1.47} 29%|██▉ | 472/1610 [2:09:02<3:10:52, 10.06s/it] 29%|██▉ | 473/1610 [2:09:13<3:13:32, 10.21s/it] {'loss': 0.0023, 'grad_norm': 0.6288146353702012, 'learning_rate': 7.062111801242235e-07, 'completion_length': 92.41072082519531, 'rewards/accuracy_reward': 0.21428572572767735, 'rewards/format_reward': 1.0, 'reward': 1.2142857909202576, 'reward_std': 0.06613001227378845, 'kl': 0.0582275390625, 'epoch': 1.47} 29%|██▉ | 473/1610 [2:09:13<3:13:32, 10.21s/it] 29%|██▉ | 474/1610 [2:09:25<3:20:57, 10.61s/it] {'loss': 0.0027, 'grad_norm': 1.3991965513751927, 'learning_rate': 7.055900621118012e-07, 'completion_length': 84.57143020629883, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.16922221705317497, 'kl': 0.06689453125, 'epoch': 1.47} 29%|██▉ | 474/1610 [2:09:25<3:20:57, 10.61s/it] 30%|██▉ | 475/1610 [2:09:37<3:28:06, 11.00s/it] {'loss': 0.0028, 'grad_norm': 2.5715542586251727, 'learning_rate': 7.049689440993789e-07, 'completion_length': 91.8214340209961, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.2819514572620392, 'kl': 0.0704345703125, 'epoch': 1.48} 30%|██▉ | 475/1610 [2:09:37<3:28:06, 11.00s/it] 30%|██▉ | 476/1610 [2:09:47<3:23:26, 10.76s/it] {'loss': 0.0022, 'grad_norm': 1.7422924719871618, 'learning_rate': 7.043478260869565e-07, 'completion_length': 81.35714721679688, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 
0.21972984075546265, 'kl': 0.0557861328125, 'epoch': 1.48} 30%|██▉ | 476/1610 [2:09:47<3:23:26, 10.76s/it] 30%|██▉ | 477/1610 [2:10:02<3:47:46, 12.06s/it] {'loss': 0.002, 'grad_norm': 1.7303993202675563, 'learning_rate': 7.037267080745342e-07, 'completion_length': 119.3035774230957, 'rewards/accuracy_reward': 0.2232142984867096, 'rewards/format_reward': 1.0, 'reward': 1.223214328289032, 'reward_std': 0.2579156309366226, 'kl': 0.049560546875, 'epoch': 1.48} 30%|██▉ | 477/1610 [2:10:02<3:47:46, 12.06s/it] 30%|██▉ | 478/1610 [2:10:14<3:50:02, 12.19s/it] {'loss': 0.0021, 'grad_norm': 1.6037519947553374, 'learning_rate': 7.031055900621118e-07, 'completion_length': 116.00893020629883, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.32646480202674866, 'kl': 0.0516357421875, 'epoch': 1.48} 30%|██▉ | 478/1610 [2:10:14<3:50:02, 12.19s/it] 30%|██▉ | 479/1610 [2:10:26<3:47:46, 12.08s/it] {'loss': 0.003, 'grad_norm': 0.5980978560092152, 'learning_rate': 7.024844720496894e-07, 'completion_length': 80.63393020629883, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.05831881985068321, 'kl': 0.07470703125, 'epoch': 1.49} 30%|██▉ | 479/1610 [2:10:26<3:47:46, 12.08s/it] 30%|██▉ | 480/1610 [2:10:38<3:43:42, 11.88s/it] {'loss': 0.0021, 'grad_norm': 1.7664021917070467, 'learning_rate': 7.018633540372671e-07, 'completion_length': 112.2410774230957, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.33966241776943207, 'kl': 0.052490234375, 'epoch': 1.49} 30%|██▉ | 480/1610 [2:10:38<3:43:42, 11.88s/it] 30%|██▉ | 481/1610 [2:10:48<3:38:11, 11.60s/it] {'loss': 0.0026, 'grad_norm': 1.5752083512808852, 'learning_rate': 7.012422360248447e-07, 'completion_length': 90.66964721679688, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.4910714626312256, 'reward_std': 0.07003280520439148, 'kl': 0.0657958984375, 'epoch': 1.49} 30%|██▉ | 481/1610 [2:10:48<3:38:11, 11.60s/it] 30%|██▉ | 482/1610 [2:11:01<3:45:51, 12.01s/it] {'loss': 0.0024, 'grad_norm': 1.7019873384490867, 'learning_rate': 7.006211180124223e-07, 'completion_length': 96.73214721679688, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.1575082391500473, 'kl': 0.0604248046875, 'epoch': 1.5} 30%|██▉ | 482/1610 [2:11:01<3:45:51, 12.01s/it] 30%|███ | 483/1610 [2:11:13<3:42:40, 11.85s/it] {'loss': 0.0021, 'grad_norm': 1.7876380815667352, 'learning_rate': 7e-07, 'completion_length': 113.53571701049805, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.1704346016049385, 'kl': 0.052001953125, 'epoch': 1.5} 30%|███ | 483/1610 [2:11:13<3:42:40, 11.85s/it] 30%|███ | 484/1610 [2:11:25<3:41:10, 11.79s/it] {'loss': 0.0024, 'grad_norm': 12.36931370630113, 'learning_rate': 6.993788819875777e-07, 'completion_length': 107.11607360839844, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4732143878936768, 'reward_std': 0.24229323863983154, 'kl': 0.060302734375, 'epoch': 1.5} 30%|███ | 484/1610 [2:11:25<3:41:10, 11.79s/it] 30%|███ | 485/1610 [2:11:37<3:46:02, 12.06s/it] {'loss': 0.0029, 'grad_norm': 0.9708878301945298, 'learning_rate': 6.987577639751553e-07, 'completion_length': 94.66071701049805, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.10040178894996643, 'kl': 0.072265625, 'epoch': 1.51} 30%|███ | 485/1610 [2:11:37<3:46:02, 12.06s/it] 30%|███ | 486/1610 [2:11:50<3:49:20, 12.24s/it] {'loss': 0.0016, 'grad_norm': 1.88515612278829, 'learning_rate': 6.981366459627329e-07, 'completion_length': 137.38393783569336, 'rewards/accuracy_reward': 0.5625000149011612, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.5535715222358704, 'reward_std': 0.28437620401382446, 'kl': 0.03900146484375, 'epoch': 1.51} 30%|███ | 486/1610 [2:11:50<3:49:20, 12.24s/it] 30%|███ | 487/1610 [2:12:02<3:50:42, 12.33s/it] {'loss': 0.0019, 'grad_norm': 1.5253350459740576, 'learning_rate': 6.975155279503105e-07, 'completion_length': 114.73215103149414, 'rewards/accuracy_reward': 0.258928582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2500000596046448, 'reward_std': 0.2410808801651001, 'kl': 0.0484619140625, 'epoch': 1.51} 30%|███ | 487/1610 [2:12:02<3:50:42, 12.33s/it] 30%|███ | 488/1610 [2:12:14<3:48:27, 12.22s/it] {'loss': 0.0028, 'grad_norm': 1.3406835393540752, 'learning_rate': 6.968944099378881e-07, 'completion_length': 110.96429061889648, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.1704345941543579, 'kl': 0.070068359375, 'epoch': 1.52} 30%|███ | 488/1610 [2:12:14<3:48:27, 12.22s/it] 30%|███ | 489/1610 [2:12:25<3:38:17, 11.68s/it] {'loss': 0.0028, 'grad_norm': 1.3961787943134192, 'learning_rate': 6.962732919254658e-07, 'completion_length': 90.16072082519531, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.18666484206914902, 'kl': 0.07080078125, 'epoch': 1.52} 30%|███ | 489/1610 [2:12:25<3:38:17, 11.68s/it] 30%|███ | 490/1610 [2:12:37<3:42:24, 11.91s/it] {'loss': 0.0024, 'grad_norm': 1.0349053040130343, 'learning_rate': 6.956521739130434e-07, 'completion_length': 132.2589340209961, 'rewards/accuracy_reward': 0.4017857238650322, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3928571939468384, 'reward_std': 0.18458788841962814, 'kl': 0.0589599609375, 'epoch': 1.52} 30%|███ | 490/1610 [2:12:37<3:42:24, 11.91s/it] 30%|███ | 491/1610 [2:12:50<3:46:45, 12.16s/it] {'loss': 0.0019, 'grad_norm': 1.546406031434387, 'learning_rate': 6.95031055900621e-07, 'completion_length': 132.99107360839844, 
'rewards/accuracy_reward': 0.3660714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3571428656578064, 'reward_std': 0.16653180867433548, 'kl': 0.0478515625, 'epoch': 1.52} 30%|███ | 491/1610 [2:12:50<3:46:45, 12.16s/it] 31%|███ | 492/1610 [2:13:01<3:41:09, 11.87s/it] {'loss': 0.002, 'grad_norm': 1.377497325599626, 'learning_rate': 6.944099378881987e-07, 'completion_length': 129.41964721679688, 'rewards/accuracy_reward': 0.3125000223517418, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.19838443398475647, 'kl': 0.04931640625, 'epoch': 1.53} 31%|███ | 492/1610 [2:13:01<3:41:09, 11.87s/it] 31%|███ | 493/1610 [2:13:12<3:37:33, 11.69s/it] {'loss': 0.0023, 'grad_norm': 1.0327239174321627, 'learning_rate': 6.937888198757764e-07, 'completion_length': 104.45536041259766, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.157508235424757, 'kl': 0.05712890625, 'epoch': 1.53} 31%|███ | 493/1610 [2:13:13<3:37:33, 11.69s/it] 31%|███ | 494/1610 [2:13:23<3:31:10, 11.35s/it] {'loss': 0.0021, 'grad_norm': 1.3551842779225993, 'learning_rate': 6.93167701863354e-07, 'completion_length': 117.8660774230957, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 1.0, 'reward': 1.3482143878936768, 'reward_std': 0.23717807233333588, 'kl': 0.0516357421875, 'epoch': 1.53} 31%|███ | 494/1610 [2:13:23<3:31:10, 11.35s/it] 31%|███ | 495/1610 [2:13:36<3:42:32, 11.98s/it] {'loss': 0.0023, 'grad_norm': 1.2639486500615678, 'learning_rate': 6.925465838509317e-07, 'completion_length': 124.25000381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.508928656578064, 'reward_std': 0.27206332981586456, 'kl': 0.0565185546875, 'epoch': 1.54} 31%|███ | 495/1610 [2:13:37<3:42:32, 11.98s/it] 31%|███ | 496/1610 [2:13:48<3:42:10, 11.97s/it] {'loss': 0.0021, 'grad_norm': 0.9963581092787416, 'learning_rate': 
6.919254658385093e-07, 'completion_length': 121.65179443359375, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.2059243693947792, 'kl': 0.051513671875, 'epoch': 1.54} 31%|███ | 496/1610 [2:13:48<3:42:10, 11.97s/it] 31%|███ | 497/1610 [2:14:01<3:43:35, 12.05s/it] {'loss': 0.0024, 'grad_norm': 1.8294517408081263, 'learning_rate': 6.913043478260869e-07, 'completion_length': 119.23214721679688, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4553571939468384, 'reward_std': 0.2513168156147003, 'kl': 0.05914306640625, 'epoch': 1.54} 31%|███ | 497/1610 [2:14:01<3:43:35, 12.05s/it] 31%|███ | 498/1610 [2:14:12<3:41:49, 11.97s/it] {'loss': 0.002, 'grad_norm': 1.3733316204807824, 'learning_rate': 6.906832298136646e-07, 'completion_length': 117.58929061889648, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.227541022002697, 'kl': 0.050048828125, 'epoch': 1.55} 31%|███ | 498/1610 [2:14:12<3:41:49, 11.97s/it] 31%|███ | 499/1610 [2:14:24<3:37:29, 11.75s/it] {'loss': 0.0022, 'grad_norm': 1.0376963121529101, 'learning_rate': 6.900621118012422e-07, 'completion_length': 104.1785774230957, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160715222358704, 'reward_std': 0.1704345941543579, 'kl': 0.0548095703125, 'epoch': 1.55} 31%|███ | 499/1610 [2:14:24<3:37:29, 11.75s/it] 31%|███ | 500/1610 [2:14:35<3:32:05, 11.46s/it] {'loss': 0.0018, 'grad_norm': 1.9227278979096873, 'learning_rate': 6.894409937888198e-07, 'completion_length': 105.3839340209961, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.383928656578064, 'reward_std': 0.3111136704683304, 'kl': 0.0455322265625, 'epoch': 1.55} 31%|███ | 500/1610 [2:14:35<3:32:05, 11.46s/it] 31%|███ | 501/1610 [2:15:27<7:21:40, 23.90s/it] {'loss': 
0.0021, 'grad_norm': 1.707791409545432, 'learning_rate': 6.888198757763975e-07, 'completion_length': 111.20536422729492, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.25791002810001373, 'kl': 0.0531005859375, 'epoch': 1.56} 31%|███ | 501/1610 [2:15:27<7:21:40, 23.90s/it] 31%|███ | 502/1610 [2:15:39<6:14:04, 20.26s/it] {'loss': 0.0021, 'grad_norm': 2.282827618661661, 'learning_rate': 6.881987577639752e-07, 'completion_length': 108.21429061889648, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 1.0, 'reward': 1.598214328289032, 'reward_std': 0.23057925701141357, 'kl': 0.052734375, 'epoch': 1.56} 31%|███ | 502/1610 [2:15:39<6:14:04, 20.26s/it] 31%|███ | 503/1610 [2:15:51<5:29:04, 17.84s/it] {'loss': 0.0019, 'grad_norm': 2.9579599200859037, 'learning_rate': 6.875776397515528e-07, 'completion_length': 123.61607360839844, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.2804734408855438, 'kl': 0.0484619140625, 'epoch': 1.56} 31%|███ | 503/1610 [2:15:51<5:29:04, 17.84s/it] 31%|███▏ | 504/1610 [2:16:05<5:03:56, 16.49s/it] {'loss': 0.0021, 'grad_norm': 1.7042419093826189, 'learning_rate': 6.869565217391305e-07, 'completion_length': 125.08929061889648, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285714626312256, 'reward_std': 0.3123260587453842, 'kl': 0.0521240234375, 'epoch': 1.57} 31%|███▏ | 504/1610 [2:16:05<5:03:56, 16.49s/it] 31%|███▏ | 505/1610 [2:16:15<4:31:01, 14.72s/it] {'loss': 0.0025, 'grad_norm': 1.3819378320192621, 'learning_rate': 6.863354037267081e-07, 'completion_length': 94.8839340209961, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.22667087614536285, 'kl': 0.0633544921875, 'epoch': 1.57} 31%|███▏ | 505/1610 [2:16:15<4:31:01, 14.72s/it] 31%|███▏ | 
506/1610 [2:16:27<4:12:59, 13.75s/it] {'loss': 0.0025, 'grad_norm': 2.229479662012147, 'learning_rate': 6.857142857142857e-07, 'completion_length': 90.91071701049805, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.2567032426595688, 'kl': 0.06201171875, 'epoch': 1.57} 31%|███▏ | 506/1610 [2:16:27<4:12:59, 13.75s/it] 31%|███▏ | 507/1610 [2:16:37<3:55:56, 12.83s/it] {'loss': 0.0022, 'grad_norm': 1.280332658938439, 'learning_rate': 6.850931677018634e-07, 'completion_length': 115.69643020629883, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.2248450219631195, 'kl': 0.0552978515625, 'epoch': 1.57} 31%|███▏ | 507/1610 [2:16:37<3:55:56, 12.83s/it] 32%|███▏ | 508/1610 [2:16:48<3:43:47, 12.18s/it] {'loss': 0.0021, 'grad_norm': 1.6918380958852448, 'learning_rate': 6.84472049689441e-07, 'completion_length': 102.16964721679688, 'rewards/accuracy_reward': 0.3571428880095482, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.26181840896606445, 'kl': 0.0533447265625, 'epoch': 1.58} 32%|███▏ | 508/1610 [2:16:48<3:43:47, 12.18s/it] 32%|███▏ | 509/1610 [2:16:59<3:35:41, 11.75s/it] {'loss': 0.0026, 'grad_norm': 1.4128498559079714, 'learning_rate': 6.838509316770185e-07, 'completion_length': 99.4285774230957, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.19057324528694153, 'kl': 0.065673828125, 'epoch': 1.58} 32%|███▏ | 509/1610 [2:16:59<3:35:41, 11.75s/it] 32%|███▏ | 510/1610 [2:17:10<3:31:25, 11.53s/it] {'loss': 0.0024, 'grad_norm': 1.8451701468248483, 'learning_rate': 6.832298136645962e-07, 'completion_length': 102.29464721679688, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.25912240147590637, 'kl': 0.0594482421875, 'epoch': 1.58} 32%|███▏ | 510/1610 
[2:17:10<3:31:25, 11.53s/it] 32%|███▏ | 511/1610 [2:17:21<3:27:07, 11.31s/it] {'loss': 0.0022, 'grad_norm': 1.3548447214485804, 'learning_rate': 6.826086956521738e-07, 'completion_length': 106.99107360839844, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 1.0, 'reward': 1.633928656578064, 'reward_std': 0.23656460642814636, 'kl': 0.05419921875, 'epoch': 1.59} 32%|███▏ | 511/1610 [2:17:21<3:27:07, 11.31s/it] 32%|███▏ | 512/1610 [2:17:34<3:37:22, 11.88s/it] {'loss': 0.002, 'grad_norm': 1.3847985609482931, 'learning_rate': 6.819875776397515e-07, 'completion_length': 117.95536041259766, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.18397442251443863, 'kl': 0.04931640625, 'epoch': 1.59} 32%|███▏ | 512/1610 [2:17:34<3:37:22, 11.88s/it] 32%|███▏ | 513/1610 [2:17:47<3:44:07, 12.26s/it] {'loss': 0.0022, 'grad_norm': 1.1859384091737286, 'learning_rate': 6.813664596273292e-07, 'completion_length': 119.02679443359375, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5000001192092896, 'reward_std': 0.23676259070634842, 'kl': 0.0538330078125, 'epoch': 1.59} 32%|███▏ | 513/1610 [2:17:47<3:44:07, 12.26s/it] 32%|███▏ | 514/1610 [2:18:00<3:48:02, 12.48s/it] {'loss': 0.0022, 'grad_norm': 2.0221137012903254, 'learning_rate': 6.807453416149068e-07, 'completion_length': 124.91965103149414, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.5178572535514832, 'reward_std': 0.2931848466396332, 'kl': 0.0543212890625, 'epoch': 1.6} 32%|███▏ | 514/1610 [2:18:00<3:48:02, 12.48s/it] 32%|███▏ | 515/1610 [2:18:12<3:46:07, 12.39s/it] {'loss': 0.0017, 'grad_norm': 2.0459637568805107, 'learning_rate': 6.801242236024844e-07, 'completion_length': 132.83036041259766, 'rewards/accuracy_reward': 0.508928582072258, 'rewards/format_reward': 1.0, 'reward': 1.508928656578064, 'reward_std': 
0.2837773263454437, 'kl': 0.0435791015625, 'epoch': 1.6} 32%|███▏ | 515/1610 [2:18:12<3:46:07, 12.39s/it] 32%|███▏ | 516/1610 [2:18:23<3:39:26, 12.04s/it] {'loss': 0.0019, 'grad_norm': 1.91830144015684, 'learning_rate': 6.795031055900621e-07, 'completion_length': 106.0535774230957, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.2792610377073288, 'kl': 0.048583984375, 'epoch': 1.6} 32%|███▏ | 516/1610 [2:18:23<3:39:26, 12.04s/it] 32%|███▏ | 517/1610 [2:18:35<3:35:22, 11.82s/it] {'loss': 0.003, 'grad_norm': 1.3154363056234673, 'learning_rate': 6.788819875776397e-07, 'completion_length': 110.71429061889648, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.1827620565891266, 'kl': 0.073974609375, 'epoch': 1.61} 32%|███▏ | 517/1610 [2:18:35<3:35:22, 11.82s/it] 32%|███▏ | 518/1610 [2:18:46<3:30:30, 11.57s/it] {'loss': 0.003, 'grad_norm': 1.8514668242307206, 'learning_rate': 6.782608695652173e-07, 'completion_length': 110.35714721679688, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.508928656578064, 'reward_std': 0.27535825967788696, 'kl': 0.07568359375, 'epoch': 1.61} 32%|███▏ | 518/1610 [2:18:46<3:30:30, 11.57s/it] 32%|███▏ | 519/1610 [2:18:56<3:23:34, 11.20s/it] {'loss': 0.0024, 'grad_norm': 1.0101775826308619, 'learning_rate': 6.77639751552795e-07, 'completion_length': 109.4285774230957, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.14970264956355095, 'kl': 0.058837890625, 'epoch': 1.61} 32%|███▏ | 519/1610 [2:18:56<3:23:34, 11.20s/it] 32%|███▏ | 520/1610 [2:19:09<3:34:08, 11.79s/it] {'loss': 0.0026, 'grad_norm': 0.8129371054147774, 'learning_rate': 6.770186335403726e-07, 'completion_length': 128.4196548461914, 'rewards/accuracy_reward': 0.5089286118745804, 'rewards/format_reward': 0.9910714626312256, 
'reward': 1.5000000596046448, 'reward_std': 0.1509094163775444, 'kl': 0.0640869140625, 'epoch': 1.61} 32%|███▏ | 520/1610 [2:19:09<3:34:08, 11.79s/it] 32%|███▏ | 521/1610 [2:19:23<3:42:38, 12.27s/it] {'loss': 0.0021, 'grad_norm': 1.754720025989457, 'learning_rate': 6.763975155279503e-07, 'completion_length': 129.68750762939453, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.2320038229227066, 'kl': 0.0513916015625, 'epoch': 1.62} 32%|███▏ | 521/1610 [2:19:23<3:42:38, 12.27s/it] 32%|███▏ | 522/1610 [2:19:34<3:37:08, 11.97s/it] {'loss': 0.0021, 'grad_norm': 1.6297040383875048, 'learning_rate': 6.75776397515528e-07, 'completion_length': 118.08929061889648, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.2657212167978287, 'kl': 0.0523681640625, 'epoch': 1.62} 32%|███▏ | 522/1610 [2:19:34<3:37:08, 11.97s/it] 32%|███▏ | 523/1610 [2:19:44<3:25:59, 11.37s/it] {'loss': 0.0023, 'grad_norm': 1.6052672956018514, 'learning_rate': 6.751552795031056e-07, 'completion_length': 97.86607360839844, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4732143878936768, 'reward_std': 0.32013726234436035, 'kl': 0.0584716796875, 'epoch': 1.62} 32%|███▏ | 523/1610 [2:19:44<3:25:59, 11.37s/it] 33%|███▎ | 524/1610 [2:19:54<3:18:06, 10.95s/it] {'loss': 0.0023, 'grad_norm': 1.3488273543114049, 'learning_rate': 6.745341614906832e-07, 'completion_length': 95.58036422729492, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.1963018923997879, 'kl': 0.05859375, 'epoch': 1.63} 33%|███▎ | 524/1610 [2:19:54<3:18:06, 10.95s/it] 33%|███▎ | 525/1610 [2:20:07<3:27:46, 11.49s/it] {'loss': 0.0026, 'grad_norm': 1.8043043209097152, 'learning_rate': 6.739130434782609e-07, 'completion_length': 116.47322082519531, 'rewards/accuracy_reward': 
0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.32013165950775146, 'kl': 0.066162109375, 'epoch': 1.63} 33%|███▎ | 525/1610 [2:20:07<3:27:46, 11.49s/it] 33%|███▎ | 526/1610 [2:20:19<3:32:21, 11.75s/it] {'loss': 0.0021, 'grad_norm': 1.9826280865655646, 'learning_rate': 6.732919254658385e-07, 'completion_length': 128.93750381469727, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803571939468384, 'reward_std': 0.23569443076848984, 'kl': 0.0521240234375, 'epoch': 1.63} 33%|███▎ | 526/1610 [2:20:19<3:32:21, 11.75s/it] 33%|███▎ | 527/1610 [2:20:32<3:38:58, 12.13s/it] {'loss': 0.0032, 'grad_norm': 2.3833862788739526, 'learning_rate': 6.726708074534161e-07, 'completion_length': 111.0714340209961, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5089285969734192, 'reward_std': 0.07576144114136696, 'kl': 0.0810546875, 'epoch': 1.64} 33%|███▎ | 527/1610 [2:20:32<3:38:58, 12.13s/it] 33%|███▎ | 528/1610 [2:20:44<3:36:58, 12.03s/it] {'loss': 0.0025, 'grad_norm': 1.5827383124551144, 'learning_rate': 6.720496894409938e-07, 'completion_length': 100.33929061889648, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.2597358673810959, 'kl': 0.063720703125, 'epoch': 1.64} 33%|███▎ | 528/1610 [2:20:44<3:36:58, 12.03s/it] 33%|███▎ | 529/1610 [2:20:56<3:36:33, 12.02s/it] {'loss': 0.002, 'grad_norm': 1.2045913903470047, 'learning_rate': 6.714285714285714e-07, 'completion_length': 116.8660774230957, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.128351628780365, 'kl': 0.04931640625, 'epoch': 1.64} 33%|███▎ | 529/1610 [2:20:56<3:36:33, 12.02s/it] 33%|███▎ | 530/1610 [2:21:09<3:42:03, 12.34s/it] {'loss': 0.0031, 'grad_norm': 1.5126140327366715, 'learning_rate': 6.708074534161491e-07, 'completion_length': 
106.38393020629883, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6071429252624512, 'reward_std': 0.16262340545654297, 'kl': 0.078369140625, 'epoch': 1.65} 33%|███▎ | 530/1610 [2:21:09<3:42:03, 12.34s/it] 33%|███▎ | 531/1610 [2:21:20<3:33:23, 11.87s/it] {'loss': 0.0027, 'grad_norm': 3.1547070668282067, 'learning_rate': 6.701863354037268e-07, 'completion_length': 90.91071701049805, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.21434339880943298, 'kl': 0.06787109375, 'epoch': 1.65} 33%|███▎ | 531/1610 [2:21:20<3:33:23, 11.87s/it] 33%|███▎ | 532/1610 [2:21:32<3:36:19, 12.04s/it] {'loss': 0.0018, 'grad_norm': 0.8934969109398644, 'learning_rate': 6.695652173913044e-07, 'completion_length': 122.14286422729492, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.383928656578064, 'reward_std': 0.15872061252593994, 'kl': 0.046142578125, 'epoch': 1.65} 33%|███▎ | 532/1610 [2:21:32<3:36:19, 12.04s/it] 33%|███▎ | 533/1610 [2:21:44<3:34:51, 11.97s/it] {'loss': 0.0023, 'grad_norm': 1.0151703751277206, 'learning_rate': 6.689440993788819e-07, 'completion_length': 115.8035774230957, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.12686797976493835, 'kl': 0.0570068359375, 'epoch': 1.66} 33%|███▎ | 533/1610 [2:21:44<3:34:51, 11.97s/it] 33%|███▎ | 534/1610 [2:21:55<3:28:17, 11.61s/it] {'loss': 0.002, 'grad_norm': 4.592704114970942, 'learning_rate': 6.683229813664595e-07, 'completion_length': 104.4464340209961, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.2858598679304123, 'kl': 0.0499267578125, 'epoch': 1.66} 33%|███▎ | 534/1610 [2:21:55<3:28:17, 11.61s/it] 33%|███▎ | 535/1610 [2:22:06<3:26:47, 11.54s/it] {'loss': 0.0028, 'grad_norm': 2.2571235116234747, 
'learning_rate': 6.677018633540372e-07, 'completion_length': 110.12500381469727, 'rewards/accuracy_reward': 0.4553571790456772, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.3033080995082855, 'kl': 0.06982421875, 'epoch': 1.66} 33%|███▎ | 535/1610 [2:22:06<3:26:47, 11.54s/it] 33%|███▎ | 536/1610 [2:22:18<3:27:19, 11.58s/it] {'loss': 0.0029, 'grad_norm': 1.805753097741709, 'learning_rate': 6.670807453416148e-07, 'completion_length': 98.34822082519531, 'rewards/accuracy_reward': 0.6071428805589676, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.23898376524448395, 'kl': 0.0718994140625, 'epoch': 1.66} 33%|███▎ | 536/1610 [2:22:18<3:27:19, 11.58s/it] 33%|███▎ | 537/1610 [2:22:30<3:30:15, 11.76s/it] {'loss': 0.0023, 'grad_norm': 1.2220759339061362, 'learning_rate': 6.664596273291924e-07, 'completion_length': 94.73214721679688, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3660714626312256, 'reward_std': 0.23769798129796982, 'kl': 0.056640625, 'epoch': 1.67} 33%|███▎ | 537/1610 [2:22:30<3:30:15, 11.76s/it] 33%|███▎ | 538/1610 [2:22:43<3:35:05, 12.04s/it] {'loss': 0.0026, 'grad_norm': 1.2737383974439969, 'learning_rate': 6.658385093167701e-07, 'completion_length': 107.5714340209961, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.633928656578064, 'reward_std': 0.23508097231388092, 'kl': 0.065185546875, 'epoch': 1.67} 33%|███▎ | 538/1610 [2:22:43<3:35:05, 12.04s/it] 33%|███▎ | 539/1610 [2:22:54<3:31:30, 11.85s/it] {'loss': 0.0034, 'grad_norm': 2.2998789057642437, 'learning_rate': 6.652173913043478e-07, 'completion_length': 84.53572082519531, 'rewards/accuracy_reward': 0.4553571790456772, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.19838443398475647, 'kl': 0.085205078125, 'epoch': 1.67} 33%|███▎ | 539/1610 [2:22:54<3:31:30, 11.85s/it] 34%|███▎ | 540/1610 [2:23:04<3:23:23, 
11.41s/it] {'loss': 0.0026, 'grad_norm': 2.195234112786202, 'learning_rate': 6.645962732919254e-07, 'completion_length': 102.12500762939453, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.18397442251443863, 'kl': 0.06494140625, 'epoch': 1.68} 34%|███▎ | 540/1610 [2:23:04<3:23:23, 11.41s/it] 34%|███▎ | 541/1610 [2:23:18<3:33:15, 11.97s/it] {'loss': 0.002, 'grad_norm': 1.8319378862893219, 'learning_rate': 6.639751552795031e-07, 'completion_length': 130.60714721679688, 'rewards/accuracy_reward': 0.4196428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4017857909202576, 'reward_std': 0.3741321712732315, 'kl': 0.0506591796875, 'epoch': 1.68} 34%|███▎ | 541/1610 [2:23:18<3:33:15, 11.97s/it] 34%|███▎ | 542/1610 [2:23:29<3:30:02, 11.80s/it] {'loss': 0.0026, 'grad_norm': 1.3720988484228116, 'learning_rate': 6.633540372670807e-07, 'completion_length': 98.68750381469727, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.21703943610191345, 'kl': 0.06591796875, 'epoch': 1.68} 34%|███▎ | 542/1610 [2:23:29<3:30:02, 11.80s/it] 34%|███▎ | 543/1610 [2:23:41<3:29:25, 11.78s/it] {'loss': 0.0016, 'grad_norm': 1.6113652701039403, 'learning_rate': 6.627329192546583e-07, 'completion_length': 128.66965103149414, 'rewards/accuracy_reward': 0.1517857238650322, 'rewards/format_reward': 1.0, 'reward': 1.1517857909202576, 'reward_std': 0.2928008586168289, 'kl': 0.040283203125, 'epoch': 1.69} 34%|███▎ | 543/1610 [2:23:41<3:29:25, 11.78s/it] 34%|███▍ | 544/1610 [2:23:51<3:21:46, 11.36s/it] {'loss': 0.0021, 'grad_norm': 3.033859181672958, 'learning_rate': 6.62111801242236e-07, 'completion_length': 100.20536041259766, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.2948778420686722, 'kl': 0.0531005859375, 'epoch': 1.69} 34%|███▍ | 544/1610 [2:23:51<3:21:46, 
11.36s/it] 34%|███▍ | 545/1610 [2:24:02<3:19:31, 11.24s/it] {'loss': 0.002, 'grad_norm': 1.6052724768014135, 'learning_rate': 6.614906832298136e-07, 'completion_length': 109.65179061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.2061956226825714, 'kl': 0.05126953125, 'epoch': 1.69} 34%|███▍ | 545/1610 [2:24:02<3:19:31, 11.24s/it] 34%|███▍ | 546/1610 [2:24:15<3:30:22, 11.86s/it] {'loss': 0.002, 'grad_norm': 1.6095278593709335, 'learning_rate': 6.608695652173912e-07, 'completion_length': 115.14286422729492, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.22754104435443878, 'kl': 0.0496826171875, 'epoch': 1.7} 34%|███▍ | 546/1610 [2:24:15<3:30:22, 11.86s/it] 34%|███▍ | 547/1610 [2:24:27<3:30:46, 11.90s/it] {'loss': 0.0032, 'grad_norm': 1.7727164099950115, 'learning_rate': 6.602484472049689e-07, 'completion_length': 82.52678680419922, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.14970265328884125, 'kl': 0.080322265625, 'epoch': 1.7} 34%|███▍ | 547/1610 [2:24:27<3:30:46, 11.90s/it] 34%|███▍ | 548/1610 [2:24:39<3:28:49, 11.80s/it] {'loss': 0.0022, 'grad_norm': 0.869213865705563, 'learning_rate': 6.596273291925466e-07, 'completion_length': 108.19643020629883, 'rewards/accuracy_reward': 0.3482143059372902, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.17495646327733994, 'kl': 0.054443359375, 'epoch': 1.7} 34%|███▍ | 548/1610 [2:24:39<3:28:49, 11.80s/it] 34%|███▍ | 549/1610 [2:24:51<3:30:33, 11.91s/it] {'loss': 0.002, 'grad_norm': 1.6943609937908548, 'learning_rate': 6.590062111801242e-07, 'completion_length': 111.65179061889648, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.19057884812355042, 'kl': 0.0498046875, 'epoch': 1.7} 34%|███▍ | 
549/1610 [2:24:51<3:30:33, 11.91s/it] 34%|███▍ | 550/1610 [2:25:03<3:31:36, 11.98s/it] {'loss': 0.0027, 'grad_norm': 1.121744224313958, 'learning_rate': 6.583850931677019e-07, 'completion_length': 108.1964340209961, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.7410714626312256, 'reward_std': 0.15933407843112946, 'kl': 0.067626953125, 'epoch': 1.71} 34%|███▍ | 550/1610 [2:25:03<3:31:36, 11.98s/it] 34%|███▍ | 551/1610 [2:25:15<3:27:39, 11.77s/it] {'loss': 0.0019, 'grad_norm': 0.994420721792473, 'learning_rate': 6.577639751552795e-07, 'completion_length': 123.14286041259766, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.21911637485027313, 'kl': 0.046630859375, 'epoch': 1.71} 34%|███▍ | 551/1610 [2:25:15<3:27:39, 11.77s/it] 34%|███▍ | 552/1610 [2:25:27<3:31:57, 12.02s/it] {'loss': 0.0027, 'grad_norm': 1.552188012674004, 'learning_rate': 6.571428571428571e-07, 'completion_length': 132.3482208251953, 'rewards/accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6517857909202576, 'reward_std': 0.2987862229347229, 'kl': 0.068603515625, 'epoch': 1.71} 34%|███▍ | 552/1610 [2:25:27<3:31:57, 12.02s/it] 34%|███▍ | 553/1610 [2:25:38<3:24:53, 11.63s/it] {'loss': 0.0015, 'grad_norm': 1.3225465006852666, 'learning_rate': 6.565217391304348e-07, 'completion_length': 111.4910774230957, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.19837883114814758, 'kl': 0.0377197265625, 'epoch': 1.72} 34%|███▍ | 553/1610 [2:25:38<3:24:53, 11.63s/it] 34%|███▍ | 554/1610 [2:25:51<3:30:06, 11.94s/it] {'loss': 0.0028, 'grad_norm': 1.9518442331893961, 'learning_rate': 6.559006211180124e-07, 'completion_length': 114.68750381469727, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3035715222358704, 
'reward_std': 0.167145274579525, 'kl': 0.071044921875, 'epoch': 1.72} 34%|███▍ | 554/1610 [2:25:51<3:30:06, 11.94s/it] 34%|███▍ | 555/1610 [2:26:04<3:35:28, 12.25s/it] {'loss': 0.002, 'grad_norm': 2.0943317019168446, 'learning_rate': 6.5527950310559e-07, 'completion_length': 152.27679061889648, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.2946429252624512, 'reward_std': 0.2888924777507782, 'kl': 0.0499267578125, 'epoch': 1.72} 34%|███▍ | 555/1610 [2:26:04<3:35:28, 12.25s/it] 35%|███▍ | 556/1610 [2:26:16<3:36:51, 12.34s/it] {'loss': 0.002, 'grad_norm': 1.9758848175684294, 'learning_rate': 6.546583850931676e-07, 'completion_length': 123.83036041259766, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.20411308109760284, 'kl': 0.0498046875, 'epoch': 1.73} 35%|███▍ | 556/1610 [2:26:16<3:36:51, 12.34s/it] 35%|███▍ | 557/1610 [2:26:27<3:29:40, 11.95s/it] {'loss': 0.0022, 'grad_norm': 1.5841851253863393, 'learning_rate': 6.540372670807453e-07, 'completion_length': 100.31250762939453, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6071429252624512, 'reward_std': 0.2371724545955658, 'kl': 0.0537109375, 'epoch': 1.73} 35%|███▍ | 557/1610 [2:26:27<3:29:40, 11.95s/it] 35%|███▍ | 558/1610 [2:26:38<3:26:04, 11.75s/it] {'loss': 0.0022, 'grad_norm': 1.5232786272950387, 'learning_rate': 6.534161490683229e-07, 'completion_length': 122.22322082519531, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.383928656578064, 'reward_std': 0.19447603821754456, 'kl': 0.0550537109375, 'epoch': 1.73} 35%|███▍ | 558/1610 [2:26:38<3:26:04, 11.75s/it] 35%|███▍ | 559/1610 [2:26:49<3:22:11, 11.54s/it] {'loss': 0.0023, 'grad_norm': 2.6006031065204827, 'learning_rate': 6.527950310559006e-07, 'completion_length': 106.51786041259766, 'rewards/accuracy_reward': 0.5000000298023224, 
'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.1575082316994667, 'kl': 0.0577392578125, 'epoch': 1.74} 35%|███▍ | 559/1610 [2:26:49<3:22:11, 11.54s/it] 35%|███▍ | 560/1610 [2:27:00<3:17:46, 11.30s/it] {'loss': 0.0019, 'grad_norm': 1.8860686984149124, 'learning_rate': 6.521739130434782e-07, 'completion_length': 115.75000762939453, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.32013724744319916, 'kl': 0.046875, 'epoch': 1.74} 35%|███▍ | 560/1610 [2:27:00<3:17:46, 11.30s/it] 35%|███▍ | 561/1610 [2:27:14<3:31:23, 12.09s/it] {'loss': 0.0022, 'grad_norm': 1.7317729192489086, 'learning_rate': 6.515527950310558e-07, 'completion_length': 124.46429061889648, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.2092282399535179, 'kl': 0.0557861328125, 'epoch': 1.74} 35%|███▍ | 561/1610 [2:27:14<3:31:23, 12.09s/it] 35%|███▍ | 562/1610 [2:27:27<3:37:56, 12.48s/it] {'loss': 0.003, 'grad_norm': 1.4950346765352873, 'learning_rate': 6.509316770186335e-07, 'completion_length': 124.4910774230957, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.17226043343544006, 'kl': 0.075927734375, 'epoch': 1.75} 35%|███▍ | 562/1610 [2:27:27<3:37:56, 12.48s/it] 35%|███▍ | 563/1610 [2:27:39<3:33:53, 12.26s/it] {'loss': 0.0022, 'grad_norm': 1.2591797163349878, 'learning_rate': 6.503105590062111e-07, 'completion_length': 116.31250381469727, 'rewards/accuracy_reward': 0.4464286118745804, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.20021028816699982, 'kl': 0.0550537109375, 'epoch': 1.75} 35%|███▍ | 563/1610 [2:27:39<3:33:53, 12.26s/it] 35%|███▌ | 564/1610 [2:27:52<3:36:59, 12.45s/it] {'loss': 0.0023, 'grad_norm': 1.988956019311098, 'learning_rate': 6.496894409937887e-07, 'completion_length': 114.2589340209961, 
'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285714626312256, 'reward_std': 0.244989275932312, 'kl': 0.05810546875, 'epoch': 1.75} 35%|███▌ | 564/1610 [2:27:52<3:36:59, 12.45s/it] 35%|███▌ | 565/1610 [2:28:04<3:34:12, 12.30s/it] {'loss': 0.002, 'grad_norm': 2.6824779807338124, 'learning_rate': 6.490683229813664e-07, 'completion_length': 148.11607360839844, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.20411308109760284, 'kl': 0.04931640625, 'epoch': 1.75} 35%|███▌ | 565/1610 [2:28:04<3:34:12, 12.30s/it] 35%|███▌ | 566/1610 [2:28:16<3:31:12, 12.14s/it] {'loss': 0.0018, 'grad_norm': 2.5835343844334018, 'learning_rate': 6.484472049689441e-07, 'completion_length': 124.60715103149414, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5178571939468384, 'reward_std': 0.34539104998111725, 'kl': 0.0440673828125, 'epoch': 1.76} 35%|███▌ | 566/1610 [2:28:16<3:31:12, 12.14s/it] 35%|███▌ | 567/1610 [2:28:28<3:32:05, 12.20s/it] {'loss': 0.0021, 'grad_norm': 0.797456440143346, 'learning_rate': 6.478260869565217e-07, 'completion_length': 123.22322463989258, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.508928656578064, 'reward_std': 0.13225442171096802, 'kl': 0.0528564453125, 'epoch': 1.76} 35%|███▌ | 567/1610 [2:28:28<3:32:05, 12.20s/it] 35%|███▌ | 568/1610 [2:28:41<3:35:24, 12.40s/it] {'loss': 0.0018, 'grad_norm': 1.3629132505178876, 'learning_rate': 6.472049689440994e-07, 'completion_length': 137.08036041259766, 'rewards/accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4375000596046448, 'reward_std': 0.26982197165489197, 'kl': 0.0443115234375, 'epoch': 1.76} 35%|███▌ | 568/1610 [2:28:41<3:35:24, 12.40s/it] 35%|███▌ | 569/1610 [2:28:53<3:34:21, 12.36s/it] {'loss': 0.0022, 'grad_norm': 1.9436489956474472, 
'learning_rate': 6.46583850931677e-07, 'completion_length': 115.9910774230957, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.304299920797348, 'kl': 0.0545654296875, 'epoch': 1.77} 35%|███▌ | 569/1610 [2:28:53<3:34:21, 12.36s/it] 35%|███▌ | 570/1610 [2:29:06<3:34:41, 12.39s/it] {'loss': 0.0019, 'grad_norm': 0.9135920420364023, 'learning_rate': 6.459627329192546e-07, 'completion_length': 138.4732208251953, 'rewards/accuracy_reward': 0.2589285895228386, 'rewards/format_reward': 1.0, 'reward': 1.258928656578064, 'reward_std': 0.1638357937335968, 'kl': 0.0469970703125, 'epoch': 1.77} 35%|███▌ | 570/1610 [2:29:06<3:34:41, 12.39s/it] 35%|███▌ | 571/1610 [2:29:17<3:29:38, 12.11s/it] {'loss': 0.0022, 'grad_norm': 1.9312287416785534, 'learning_rate': 6.453416149068323e-07, 'completion_length': 124.58036041259766, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.508928656578064, 'reward_std': 0.30693960189819336, 'kl': 0.05419921875, 'epoch': 1.77} 35%|███▌ | 571/1610 [2:29:17<3:29:38, 12.11s/it] 36%|███▌ | 572/1610 [2:29:31<3:40:31, 12.75s/it] {'loss': 0.0025, 'grad_norm': 1.2279902801979123, 'learning_rate': 6.447204968944099e-07, 'completion_length': 140.2232208251953, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4017857909202576, 'reward_std': 0.17489738389849663, 'kl': 0.0616455078125, 'epoch': 1.78} 36%|███▌ | 572/1610 [2:29:31<3:40:31, 12.75s/it] 36%|███▌ | 573/1610 [2:29:45<3:44:10, 12.97s/it] {'loss': 0.002, 'grad_norm': 1.3903225578704956, 'learning_rate': 6.440993788819875e-07, 'completion_length': 137.9821548461914, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.2314494401216507, 'kl': 0.0494384765625, 'epoch': 1.78} 36%|███▌ | 573/1610 [2:29:45<3:44:10, 12.97s/it] 36%|███▌ | 574/1610 [2:29:59<3:48:38, 
13.24s/it] {'loss': 0.0018, 'grad_norm': 0.6883193214689548, 'learning_rate': 6.434782608695652e-07, 'completion_length': 152.92858123779297, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3035714626312256, 'reward_std': 0.176762156188488, 'kl': 0.046142578125, 'epoch': 1.78} 36%|███▌ | 574/1610 [2:29:59<3:48:38, 13.24s/it] 36%|███▌ | 575/1610 [2:30:12<3:49:59, 13.33s/it] {'loss': 0.0021, 'grad_norm': 1.9118225362991892, 'learning_rate': 6.428571428571429e-07, 'completion_length': 119.43750381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5089285969734192, 'reward_std': 0.19751426577568054, 'kl': 0.0517578125, 'epoch': 1.79} 36%|███▌ | 575/1610 [2:30:12<3:49:59, 13.33s/it] 36%|███▌ | 576/1610 [2:30:25<3:46:41, 13.15s/it] {'loss': 0.0017, 'grad_norm': 7.831399521857581, 'learning_rate': 6.422360248447205e-07, 'completion_length': 153.86607360839844, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.1800716444849968, 'kl': 0.0421142578125, 'epoch': 1.79} 36%|███▌ | 576/1610 [2:30:25<3:46:41, 13.15s/it] 36%|███▌ | 577/1610 [2:30:38<3:45:01, 13.07s/it] {'loss': 0.0024, 'grad_norm': 0.9309308402379477, 'learning_rate': 6.416149068322982e-07, 'completion_length': 138.73215103149414, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.571428656578064, 'reward_std': 0.24619604647159576, 'kl': 0.0599365234375, 'epoch': 1.79} 36%|███▌ | 577/1610 [2:30:38<3:45:01, 13.07s/it] 36%|███▌ | 578/1610 [2:30:51<3:42:38, 12.94s/it] {'loss': 0.0019, 'grad_norm': 2.132196223598397, 'learning_rate': 6.409937888198758e-07, 'completion_length': 144.08929061889648, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.2960902005434036, 'kl': 0.047607421875, 'epoch': 
1.8} 36%|███▌ | 578/1610 [2:30:51<3:42:38, 12.94s/it] 36%|███▌ | 579/1610 [2:31:04<3:44:53, 13.09s/it] {'loss': 0.0019, 'grad_norm': 1.291446821023317, 'learning_rate': 6.403726708074534e-07, 'completion_length': 118.34821701049805, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.4017857909202576, 'reward_std': 0.3256509304046631, 'kl': 0.046875, 'epoch': 1.8} 36%|███▌ | 579/1610 [2:31:04<3:44:53, 13.09s/it] 36%|███▌ | 580/1610 [2:31:16<3:38:39, 12.74s/it] {'loss': 0.0019, 'grad_norm': 4.4075803284831805, 'learning_rate': 6.39751552795031e-07, 'completion_length': 124.19643020629883, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.15360544621944427, 'kl': 0.0484619140625, 'epoch': 1.8} 36%|███▌ | 580/1610 [2:31:16<3:38:39, 12.74s/it] 36%|███▌ | 581/1610 [2:31:28<3:37:07, 12.66s/it] {'loss': 0.0019, 'grad_norm': 1.62561003442188, 'learning_rate': 6.391304347826086e-07, 'completion_length': 127.68750381469727, 'rewards/accuracy_reward': 0.2857143059372902, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2767857909202576, 'reward_std': 0.24889206886291504, 'kl': 0.047119140625, 'epoch': 1.8} 36%|███▌ | 581/1610 [2:31:28<3:37:07, 12.66s/it] 36%|███▌ | 582/1610 [2:31:42<3:39:23, 12.80s/it] {'loss': 0.0023, 'grad_norm': 1.557345795086232, 'learning_rate': 6.385093167701862e-07, 'completion_length': 135.8839340209961, 'rewards/accuracy_reward': 0.4375, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285715222358704, 'reward_std': 0.272048756480217, 'kl': 0.056640625, 'epoch': 1.81} 36%|███▌ | 582/1610 [2:31:42<3:39:23, 12.80s/it] 36%|███▌ | 583/1610 [2:31:56<3:45:02, 13.15s/it] {'loss': 0.0021, 'grad_norm': 2.076901838613889, 'learning_rate': 6.378881987577639e-07, 'completion_length': 161.52678680419922, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5267857909202576, 
'reward_std': 0.3045148551464081, 'kl': 0.0535888671875, 'epoch': 1.81} 36%|███▌ | 583/1610 [2:31:56<3:45:02, 13.15s/it] 36%|███▋ | 584/1610 [2:32:09<3:48:34, 13.37s/it] {'loss': 0.0019, 'grad_norm': 1.3581401181111947, 'learning_rate': 6.372670807453416e-07, 'completion_length': 134.08036422729492, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4196428656578064, 'reward_std': 0.22094222530722618, 'kl': 0.0479736328125, 'epoch': 1.81} 36%|███▋ | 584/1610 [2:32:09<3:48:34, 13.37s/it] 36%|███▋ | 585/1610 [2:32:22<3:46:43, 13.27s/it] {'loss': 0.0026, 'grad_norm': 2.2135588103184665, 'learning_rate': 6.366459627329192e-07, 'completion_length': 97.18750762939453, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6517857909202576, 'reward_std': 0.37123818695545197, 'kl': 0.0640869140625, 'epoch': 1.82} 36%|███▋ | 585/1610 [2:32:22<3:46:43, 13.27s/it] 36%|███▋ | 586/1610 [2:32:35<3:40:11, 12.90s/it] {'loss': 0.0015, 'grad_norm': 1.2592981635181717, 'learning_rate': 6.360248447204969e-07, 'completion_length': 131.08929443359375, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.26181842386722565, 'kl': 0.0374755859375, 'epoch': 1.82} 36%|███▋ | 586/1610 [2:32:35<3:40:11, 12.90s/it] 36%|███▋ | 587/1610 [2:32:46<3:33:58, 12.55s/it] {'loss': 0.0024, 'grad_norm': 2.3326443589483077, 'learning_rate': 6.354037267080745e-07, 'completion_length': 111.78571701049805, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.19447603821754456, 'kl': 0.059814453125, 'epoch': 1.82} 36%|███▋ | 587/1610 [2:32:46<3:33:58, 12.55s/it] 37%|███▋ | 588/1610 [2:32:59<3:37:01, 12.74s/it] {'loss': 0.0024, 'grad_norm': 2.1320204493660104, 'learning_rate': 6.347826086956521e-07, 'completion_length': 138.23215103149414, 'rewards/accuracy_reward': 
0.6160714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6071429252624512, 'reward_std': 0.2946684956550598, 'kl': 0.0604248046875, 'epoch': 1.83} 37%|███▋ | 588/1610 [2:32:59<3:37:01, 12.74s/it] 37%|███▋ | 589/1610 [2:33:13<3:42:10, 13.06s/it] {'loss': 0.0025, 'grad_norm': 1.6991123212803854, 'learning_rate': 6.341614906832298e-07, 'completion_length': 144.6071548461914, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.535714328289032, 'reward_std': 0.25791000574827194, 'kl': 0.061279296875, 'epoch': 1.83} 37%|███▋ | 589/1610 [2:33:13<3:42:10, 13.06s/it] 37%|███▋ | 590/1610 [2:33:26<3:42:14, 13.07s/it] {'loss': 0.0022, 'grad_norm': 2.55659510362515, 'learning_rate': 6.335403726708074e-07, 'completion_length': 133.2678680419922, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5982143878936768, 'reward_std': 0.20411308109760284, 'kl': 0.0543212890625, 'epoch': 1.83} 37%|███▋ | 590/1610 [2:33:26<3:42:14, 13.07s/it] 37%|███▋ | 591/1610 [2:33:40<3:43:32, 13.16s/it] {'loss': 0.0018, 'grad_norm': 1.578127222080786, 'learning_rate': 6.32919254658385e-07, 'completion_length': 126.37500381469727, 'rewards/accuracy_reward': 0.4464286118745804, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4375001192092896, 'reward_std': 0.30060645937919617, 'kl': 0.0438232421875, 'epoch': 1.84} 37%|███▋ | 591/1610 [2:33:40<3:43:32, 13.16s/it] 37%|███▋ | 592/1610 [2:33:52<3:37:31, 12.82s/it] {'loss': 0.0026, 'grad_norm': 1.085092614353533, 'learning_rate': 6.322981366459627e-07, 'completion_length': 122.68750762939453, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.17433738708496094, 'kl': 0.066162109375, 'epoch': 1.84} 37%|███▋ | 592/1610 [2:33:52<3:37:31, 12.82s/it] 37%|███▋ | 593/1610 [2:34:03<3:31:48, 12.50s/it] {'loss': 0.0024, 'grad_norm': 1.6950944639260166, 
'learning_rate': 6.316770186335404e-07, 'completion_length': 110.25000381469727, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.19178561866283417, 'kl': 0.060302734375, 'epoch': 1.84} 37%|███▋ | 593/1610 [2:34:03<3:31:48, 12.50s/it] 37%|███▋ | 594/1610 [2:34:16<3:33:24, 12.60s/it] {'loss': 0.0021, 'grad_norm': 2.3268691398391033, 'learning_rate': 6.31055900621118e-07, 'completion_length': 106.59821701049805, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.3498927503824234, 'kl': 0.053466796875, 'epoch': 1.84} 37%|███▋ | 594/1610 [2:34:16<3:33:24, 12.60s/it] 37%|███▋ | 595/1610 [2:34:29<3:35:46, 12.75s/it] {'loss': 0.0023, 'grad_norm': 1.1470510751322955, 'learning_rate': 6.304347826086957e-07, 'completion_length': 134.4464340209961, 'rewards/accuracy_reward': 0.375, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3660715222358704, 'reward_std': 0.17495647072792053, 'kl': 0.0577392578125, 'epoch': 1.85} 37%|███▋ | 595/1610 [2:34:29<3:35:46, 12.75s/it] 37%|███▋ | 596/1610 [2:34:41<3:29:49, 12.42s/it] {'loss': 0.0018, 'grad_norm': 1.742537177010586, 'learning_rate': 6.298136645962733e-07, 'completion_length': 119.08036041259766, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.23057925701141357, 'kl': 0.04443359375, 'epoch': 1.85} 37%|███▋ | 596/1610 [2:34:41<3:29:49, 12.42s/it] 37%|███▋ | 597/1610 [2:34:53<3:25:12, 12.15s/it] {'loss': 0.0018, 'grad_norm': 1.1740429273874984, 'learning_rate': 6.291925465838509e-07, 'completion_length': 108.35715103149414, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.13225441798567772, 'kl': 0.0457763671875, 'epoch': 1.85} 37%|███▋ | 597/1610 [2:34:53<3:25:12, 12.15s/it] 37%|███▋ | 598/1610 [2:35:04<3:21:06, 11.92s/it] {'loss': 
0.0018, 'grad_norm': 1.7140725764367781, 'learning_rate': 6.285714285714286e-07, 'completion_length': 123.9285774230957, 'rewards/accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.21192426979541779, 'kl': 0.04443359375, 'epoch': 1.86} 37%|███▋ | 598/1610 [2:35:04<3:21:06, 11.92s/it] 37%|███▋ | 599/1610 [2:35:15<3:15:30, 11.60s/it] {'loss': 0.002, 'grad_norm': 1.164800062564214, 'learning_rate': 6.279503105590062e-07, 'completion_length': 109.2589340209961, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.20411308109760284, 'kl': 0.05029296875, 'epoch': 1.86} 37%|███▋ | 599/1610 [2:35:15<3:15:30, 11.60s/it] 37%|███▋ | 600/1610 [2:35:28<3:20:54, 11.94s/it] {'loss': 0.0021, 'grad_norm': 1.6524828384946333, 'learning_rate': 6.273291925465838e-07, 'completion_length': 121.27679061889648, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.16922222077846527, 'kl': 0.0517578125, 'epoch': 1.86} 37%|███▋ | 600/1610 [2:35:28<3:20:54, 11.94s/it] 37%|███▋ | 601/1610 [2:36:28<7:26:42, 26.56s/it] {'loss': 0.0023, 'grad_norm': 1.9133683467813005, 'learning_rate': 6.267080745341615e-07, 'completion_length': 111.76786041259766, 'rewards/accuracy_reward': 0.6160714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6160715222358704, 'reward_std': 0.2630307972431183, 'kl': 0.057373046875, 'epoch': 1.87} 37%|███▋ | 601/1610 [2:36:28<7:26:42, 26.56s/it] 37%|███▋ | 602/1610 [2:36:39<6:06:45, 21.83s/it] {'loss': 0.0021, 'grad_norm': 1.361998720369169, 'learning_rate': 6.260869565217392e-07, 'completion_length': 119.65179061889648, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.24316342175006866, 'kl': 0.0526123046875, 'epoch': 1.87} 37%|███▋ | 602/1610 [2:36:39<6:06:45, 21.83s/it] 37%|███▋ | 603/1610 
[2:36:54<5:30:36, 19.70s/it] {'loss': 0.0024, 'grad_norm': 2.1118332437971907, 'learning_rate': 6.254658385093168e-07, 'completion_length': 151.3482208251953, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.973214328289032, 'reward': 1.383928656578064, 'reward_std': 0.26572122424840927, 'kl': 0.0595703125, 'epoch': 1.87} 37%|███▋ | 603/1610 [2:36:54<5:30:36, 19.70s/it] 38%|███▊ | 604/1610 [2:37:06<4:52:27, 17.44s/it] {'loss': 0.0026, 'grad_norm': 5.808453234904357, 'learning_rate': 6.248447204968945e-07, 'completion_length': 105.70536041259766, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.2897626608610153, 'kl': 0.0654296875, 'epoch': 1.88} 38%|███▊ | 604/1610 [2:37:06<4:52:27, 17.44s/it] 38%|███▊ | 605/1610 [2:37:17<4:22:01, 15.64s/it] {'loss': 0.0021, 'grad_norm': 1.4014266054536972, 'learning_rate': 6.24223602484472e-07, 'completion_length': 124.39286041259766, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.3111136704683304, 'kl': 0.0513916015625, 'epoch': 1.88} 38%|███▊ | 605/1610 [2:37:17<4:22:01, 15.64s/it] 38%|███▊ | 606/1610 [2:37:30<4:05:21, 14.66s/it] {'loss': 0.0019, 'grad_norm': 1.354012373751942, 'learning_rate': 6.236024844720496e-07, 'completion_length': 128.9464340209961, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5267857909202576, 'reward_std': 0.24558819085359573, 'kl': 0.047119140625, 'epoch': 1.88} 38%|███▊ | 606/1610 [2:37:30<4:05:21, 14.66s/it] 38%|███▊ | 607/1610 [2:37:42<3:51:36, 13.85s/it] {'loss': 0.0018, 'grad_norm': 5.065969787238105, 'learning_rate': 6.229813664596273e-07, 'completion_length': 129.56250762939453, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857313156128, 'reward_std': 0.23327529430389404, 'kl': 0.0443115234375, 'epoch': 1.89} 38%|███▊ | 
607/1610 [2:37:42<3:51:36, 13.85s/it] 38%|███▊ | 608/1610 [2:37:53<3:39:55, 13.17s/it] {'loss': 0.0019, 'grad_norm': 1.3152824612456482, 'learning_rate': 6.223602484472049e-07, 'completion_length': 121.5535774230957, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.2410808801651001, 'kl': 0.0474853515625, 'epoch': 1.89} 38%|███▊ | 608/1610 [2:37:53<3:39:55, 13.17s/it] 38%|███▊ | 609/1610 [2:38:05<3:31:16, 12.66s/it] {'loss': 0.0024, 'grad_norm': 2.142874673169164, 'learning_rate': 6.217391304347825e-07, 'completion_length': 120.46429061889648, 'rewards/accuracy_reward': 0.401785746216774, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.21825182437896729, 'kl': 0.060791015625, 'epoch': 1.89} 38%|███▊ | 609/1610 [2:38:05<3:31:16, 12.66s/it] 38%|███▊ | 610/1610 [2:38:16<3:25:18, 12.32s/it] {'loss': 0.0019, 'grad_norm': 1.1034781699925142, 'learning_rate': 6.211180124223601e-07, 'completion_length': 113.8660774230957, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.24889206886291504, 'kl': 0.048095703125, 'epoch': 1.89} 38%|███▊ | 610/1610 [2:38:16<3:25:18, 12.32s/it] 38%|███▊ | 611/1610 [2:38:29<3:25:14, 12.33s/it] {'loss': 0.0029, 'grad_norm': 1.071255067247148, 'learning_rate': 6.204968944099379e-07, 'completion_length': 123.18750762939453, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4375000596046448, 'reward_std': 0.2092282474040985, 'kl': 0.071533203125, 'epoch': 1.9} 38%|███▊ | 611/1610 [2:38:29<3:25:14, 12.33s/it] 38%|███▊ | 612/1610 [2:38:41<3:23:42, 12.25s/it] {'loss': 0.0023, 'grad_norm': 1.2339655262581162, 'learning_rate': 6.198757763975155e-07, 'completion_length': 101.50000381469727, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4196429252624512, 'reward_std': 
0.11536617204546928, 'kl': 0.057373046875, 'epoch': 1.9} 38%|███▊ | 612/1610 [2:38:41<3:23:42, 12.25s/it] 38%|███▊ | 613/1610 [2:38:53<3:23:37, 12.25s/it] {'loss': 0.0021, 'grad_norm': 0.9486471099386323, 'learning_rate': 6.192546583850932e-07, 'completion_length': 113.56250762939453, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803572535514832, 'reward_std': 0.13225442171096802, 'kl': 0.05224609375, 'epoch': 1.9} 38%|███▊ | 613/1610 [2:38:53<3:23:37, 12.25s/it] 38%|███▊ | 614/1610 [2:39:05<3:22:58, 12.23s/it] {'loss': 0.0025, 'grad_norm': 1.9098467264778565, 'learning_rate': 6.186335403726708e-07, 'completion_length': 122.64286422729492, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035714626312256, 'reward_std': 0.21313104033470154, 'kl': 0.0614013671875, 'epoch': 1.91} 38%|███▊ | 614/1610 [2:39:05<3:22:58, 12.23s/it] 38%|███▊ | 615/1610 [2:39:18<3:24:55, 12.36s/it] {'loss': 0.0021, 'grad_norm': 1.432257360163851, 'learning_rate': 6.180124223602484e-07, 'completion_length': 131.30358123779297, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.348214328289032, 'reward_std': 0.20349960029125214, 'kl': 0.052490234375, 'epoch': 1.91} 38%|███▊ | 615/1610 [2:39:18<3:24:55, 12.36s/it] 38%|███▊ | 616/1610 [2:39:30<3:26:09, 12.44s/it] {'loss': 0.0022, 'grad_norm': 1.4153186818583858, 'learning_rate': 6.17391304347826e-07, 'completion_length': 131.96429061889648, 'rewards/accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.208614781498909, 'kl': 0.055908203125, 'epoch': 1.91} 38%|███▊ | 616/1610 [2:39:30<3:26:09, 12.44s/it] 38%|███▊ | 617/1610 [2:39:40<3:14:02, 11.72s/it] {'loss': 0.0026, 'grad_norm': 2.1951365971853187, 'learning_rate': 6.167701863354037e-07, 'completion_length': 96.70536041259766, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 
1.0, 'reward': 1.5625000596046448, 'reward_std': 0.14969704300165176, 'kl': 0.06396484375, 'epoch': 1.92} 38%|███▊ | 617/1610 [2:39:40<3:14:02, 11.72s/it] 38%|███▊ | 618/1610 [2:39:54<3:23:27, 12.31s/it] {'loss': 0.0024, 'grad_norm': 2.08647624159517, 'learning_rate': 6.161490683229813e-07, 'completion_length': 106.55357360839844, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4732143878936768, 'reward_std': 0.2979160398244858, 'kl': 0.0594482421875, 'epoch': 1.92} 38%|███▊ | 618/1610 [2:39:54<3:23:27, 12.31s/it] 38%|███▊ | 619/1610 [2:40:08<3:31:11, 12.79s/it] {'loss': 0.0027, 'grad_norm': 0.9983532607035948, 'learning_rate': 6.15527950310559e-07, 'completion_length': 120.53572082519531, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3750000596046448, 'reward_std': 0.2344820648431778, 'kl': 0.06787109375, 'epoch': 1.92} 38%|███▊ | 619/1610 [2:40:08<3:31:11, 12.79s/it] 39%|███▊ | 620/1610 [2:40:20<3:28:45, 12.65s/it] {'loss': 0.0027, 'grad_norm': 0.9973561436821169, 'learning_rate': 6.149068322981367e-07, 'completion_length': 107.7410774230957, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.0964989997446537, 'kl': 0.066650390625, 'epoch': 1.93} 39%|███▊ | 620/1610 [2:40:20<3:28:45, 12.65s/it] 39%|███▊ | 621/1610 [2:40:32<3:23:46, 12.36s/it] {'loss': 0.003, 'grad_norm': 0.7963014872797404, 'learning_rate': 6.142857142857143e-07, 'completion_length': 95.18750381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.07124518603086472, 'kl': 0.073974609375, 'epoch': 1.93} 39%|███▊ | 621/1610 [2:40:32<3:23:46, 12.36s/it] 39%|███▊ | 622/1610 [2:40:44<3:22:44, 12.31s/it] {'loss': 0.002, 'grad_norm': 1.8634410242644082, 'learning_rate': 6.13664596273292e-07, 'completion_length': 127.02679061889648, 
'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.2696240097284317, 'kl': 0.0501708984375, 'epoch': 1.93} 39%|███▊ | 622/1610 [2:40:44<3:22:44, 12.31s/it] 39%|███▊ | 623/1610 [2:40:56<3:20:10, 12.17s/it] {'loss': 0.0022, 'grad_norm': 1.8409552383795025, 'learning_rate': 6.130434782608696e-07, 'completion_length': 130.56250381469727, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.2501044422388077, 'kl': 0.054443359375, 'epoch': 1.93} 39%|███▊ | 623/1610 [2:40:56<3:20:10, 12.17s/it] 39%|███▉ | 624/1610 [2:41:09<3:23:27, 12.38s/it] {'loss': 0.0019, 'grad_norm': 2.6246481533266417, 'learning_rate': 6.124223602484472e-07, 'completion_length': 136.17857360839844, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5535715222358704, 'reward_std': 0.2675470560789108, 'kl': 0.0487060546875, 'epoch': 1.94} 39%|███▉ | 624/1610 [2:41:09<3:23:27, 12.38s/it] 39%|███▉ | 625/1610 [2:41:22<3:28:07, 12.68s/it] {'loss': 0.0034, 'grad_norm': 0.5048509819040515, 'learning_rate': 6.118012422360248e-07, 'completion_length': 129.52679443359375, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.06222161278128624, 'kl': 0.0845947265625, 'epoch': 1.94} 39%|███▉ | 625/1610 [2:41:22<3:28:07, 12.68s/it] 39%|███▉ | 626/1610 [2:41:34<3:23:28, 12.41s/it] {'loss': 0.002, 'grad_norm': 1.223513940764701, 'learning_rate': 6.111801242236025e-07, 'completion_length': 105.62500762939453, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.18787721544504166, 'kl': 0.050537109375, 'epoch': 1.94} 39%|███▉ | 626/1610 [2:41:34<3:23:28, 12.41s/it] 39%|███▉ | 627/1610 [2:41:45<3:17:52, 12.08s/it] {'loss': 0.0023, 'grad_norm': 2.050687790712586, 'learning_rate': 
6.105590062111801e-07, 'completion_length': 118.03572082519531, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 1.0, 'reward': 1.3839285969734192, 'reward_std': 0.2540072351694107, 'kl': 0.058349609375, 'epoch': 1.95} 39%|███▉ | 627/1610 [2:41:45<3:17:52, 12.08s/it] 39%|███▉ | 628/1610 [2:41:57<3:17:19, 12.06s/it] {'loss': 0.0022, 'grad_norm': 1.2833932772307817, 'learning_rate': 6.099378881987576e-07, 'completion_length': 110.21429443359375, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.2119242548942566, 'kl': 0.0537109375, 'epoch': 1.95} 39%|███▉ | 628/1610 [2:41:57<3:17:19, 12.06s/it] 39%|███▉ | 629/1610 [2:42:09<3:15:31, 11.96s/it] {'loss': 0.0029, 'grad_norm': 0.9761279538694521, 'learning_rate': 6.093167701863354e-07, 'completion_length': 107.77679061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.17226044833660126, 'kl': 0.071533203125, 'epoch': 1.95} 39%|███▉ | 629/1610 [2:42:09<3:15:31, 11.96s/it] 39%|███▉ | 630/1610 [2:42:22<3:17:21, 12.08s/it] {'loss': 0.0021, 'grad_norm': 1.3627326061453549, 'learning_rate': 6.08695652173913e-07, 'completion_length': 125.84822463989258, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.23656460642814636, 'kl': 0.0523681640625, 'epoch': 1.96} 39%|███▉ | 630/1610 [2:42:22<3:17:21, 12.08s/it] 39%|███▉ | 631/1610 [2:42:35<3:24:49, 12.55s/it] {'loss': 0.0025, 'grad_norm': 1.0742891788856832, 'learning_rate': 6.080745341614906e-07, 'completion_length': 134.69643783569336, 'rewards/accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5357143878936768, 'reward_std': 0.15233397483825684, 'kl': 0.0611572265625, 'epoch': 1.96} 39%|███▉ | 631/1610 [2:42:35<3:24:49, 12.55s/it] 39%|███▉ | 632/1610 [2:42:47<3:22:42, 12.44s/it] {'loss': 0.0019, 
'grad_norm': 1.3527597865006176, 'learning_rate': 6.074534161490683e-07, 'completion_length': 122.75893783569336, 'rewards/accuracy_reward': 0.196428582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.1875000596046448, 'reward_std': 0.15360543876886368, 'kl': 0.0469970703125, 'epoch': 1.96} 39%|███▉ | 632/1610 [2:42:47<3:22:42, 12.44s/it] 39%|███▉ | 633/1610 [2:42:59<3:17:58, 12.16s/it] {'loss': 0.0022, 'grad_norm': 1.4587543531772027, 'learning_rate': 6.068322981366459e-07, 'completion_length': 119.63393783569336, 'rewards/accuracy_reward': 0.508928582072258, 'rewards/format_reward': 1.0, 'reward': 1.508928656578064, 'reward_std': 0.1866704523563385, 'kl': 0.054931640625, 'epoch': 1.97} 39%|███▉ | 633/1610 [2:42:59<3:17:58, 12.16s/it] 39%|███▉ | 634/1610 [2:43:11<3:17:56, 12.17s/it] {'loss': 0.0021, 'grad_norm': 1.6366187761000728, 'learning_rate': 6.062111801242235e-07, 'completion_length': 131.12500762939453, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035714626312256, 'reward_std': 0.13346679508686066, 'kl': 0.0523681640625, 'epoch': 1.97} 39%|███▉ | 634/1610 [2:43:11<3:17:56, 12.17s/it] 39%|███▉ | 635/1610 [2:43:24<3:19:40, 12.29s/it] {'loss': 0.002, 'grad_norm': 1.3202646819071548, 'learning_rate': 6.055900621118012e-07, 'completion_length': 120.7589340209961, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.157508235424757, 'kl': 0.0498046875, 'epoch': 1.97} 39%|███▉ | 635/1610 [2:43:24<3:19:40, 12.29s/it] 40%|███▉ | 636/1610 [2:43:34<3:10:24, 11.73s/it] {'loss': 0.0021, 'grad_norm': 0.6956531380565273, 'learning_rate': 6.049689440993788e-07, 'completion_length': 113.29464721679688, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.09528661891818047, 'kl': 0.0516357421875, 'epoch': 1.98} 40%|███▉ | 636/1610 [2:43:34<3:10:24, 11.73s/it] 40%|███▉ | 
637/1610 [2:43:47<3:16:07, 12.09s/it] {'loss': 0.0023, 'grad_norm': 1.814210498865245, 'learning_rate': 6.043478260869564e-07, 'completion_length': 116.00000381469727, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.16141663491725922, 'kl': 0.056396484375, 'epoch': 1.98} 40%|███▉ | 637/1610 [2:43:47<3:16:07, 12.09s/it] 40%|███▉ | 638/1610 [2:43:59<3:17:45, 12.21s/it] {'loss': 0.002, 'grad_norm': 1.8015434579943728, 'learning_rate': 6.037267080745342e-07, 'completion_length': 131.93750762939453, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.3505062460899353, 'kl': 0.0489501953125, 'epoch': 1.98} 40%|███▉ | 638/1610 [2:43:59<3:17:45, 12.21s/it] 40%|███▉ | 639/1610 [2:44:11<3:14:04, 11.99s/it] {'loss': 0.0017, 'grad_norm': 1.0909117344475292, 'learning_rate': 6.031055900621118e-07, 'completion_length': 131.95536041259766, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160715222358704, 'reward_std': 0.2275410294532776, 'kl': 0.0419921875, 'epoch': 1.98} 40%|███▉ | 639/1610 [2:44:11<3:14:04, 11.99s/it] 40%|███▉ | 640/1610 [2:44:22<3:09:19, 11.71s/it] {'loss': 0.0023, 'grad_norm': 1.4721311245633544, 'learning_rate': 6.024844720496894e-07, 'completion_length': 120.02679061889648, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.19837883114814758, 'kl': 0.0565185546875, 'epoch': 1.99} 40%|███▉ | 640/1610 [2:44:22<3:09:19, 11.71s/it] 40%|███▉ | 641/1610 [2:44:34<3:10:09, 11.77s/it] {'loss': 0.0025, 'grad_norm': 1.1927316122875895, 'learning_rate': 6.018633540372671e-07, 'completion_length': 108.05357360839844, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.18093620240688324, 'kl': 0.0634765625, 'epoch': 1.99} 40%|███▉ | 
641/1610 [2:44:34<3:10:09, 11.77s/it] 40%|███▉ | 642/1610 [2:44:47<3:17:46, 12.26s/it] {'loss': 0.002, 'grad_norm': 1.5269714738369609, 'learning_rate': 6.012422360248447e-07, 'completion_length': 150.6339340209961, 'rewards/accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3035714626312256, 'reward_std': 0.20801585167646408, 'kl': 0.04931640625, 'epoch': 1.99} 40%|███▉ | 642/1610 [2:44:47<3:17:46, 12.26s/it] 40%|███▉ | 643/1610 [2:45:00<3:17:59, 12.28s/it] {'loss': 0.0022, 'grad_norm': 1.1766139380460068, 'learning_rate': 6.006211180124223e-07, 'completion_length': 134.33036041259766, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.19178561866283417, 'kl': 0.054931640625, 'epoch': 2.0} 40%|███▉ | 643/1610 [2:45:00<3:17:59, 12.28s/it] 40%|████ | 644/1610 [2:45:12<3:17:28, 12.27s/it] {'loss': 0.0023, 'grad_norm': 1.3792601348946338, 'learning_rate': 6e-07, 'completion_length': 136.52679443359375, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.22754664719104767, 'kl': 0.056396484375, 'epoch': 2.0} 40%|████ | 644/1610 [2:45:12<3:17:28, 12.27s/it] 40%|████ | 645/1610 [2:45:25<3:22:22, 12.58s/it] {'loss': 0.0021, 'grad_norm': 2.015918275782598, 'learning_rate': 5.993788819875776e-07, 'completion_length': 136.17857360839844, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2946429252624512, 'reward_std': 0.3213440179824829, 'kl': 0.0533447265625, 'epoch': 2.0} 40%|████ | 645/1610 [2:45:25<3:22:22, 12.58s/it] 40%|████ | 646/1610 [2:45:38<3:24:20, 12.72s/it] {'loss': 0.0016, 'grad_norm': 1.7611165254628605, 'learning_rate': 5.987577639751552e-07, 'completion_length': 149.3303680419922, 'rewards/accuracy_reward': 0.2053571566939354, 'rewards/format_reward': 1.0, 'reward': 1.2053571939468384, 'reward_std': 0.18849068135023117, 
'kl': 0.0394287109375, 'epoch': 2.01} 40%|████ | 646/1610 [2:45:38<3:24:20, 12.72s/it] 40%|████ | 647/1610 [2:45:50<3:20:10, 12.47s/it] {'loss': 0.0023, 'grad_norm': 1.4237261698799135, 'learning_rate': 5.98136645962733e-07, 'completion_length': 117.1160774230957, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285715222358704, 'reward_std': 0.1575082391500473, 'kl': 0.05810546875, 'epoch': 2.01} 40%|████ | 647/1610 [2:45:50<3:20:10, 12.47s/it] 40%|████ | 648/1610 [2:46:02<3:17:29, 12.32s/it] {'loss': 0.0021, 'grad_norm': 1.533698962012887, 'learning_rate': 5.975155279503106e-07, 'completion_length': 128.2232208251953, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.348214328289032, 'reward_std': 0.2540072202682495, 'kl': 0.052001953125, 'epoch': 2.01} 40%|████ | 648/1610 [2:46:02<3:17:29, 12.32s/it] 40%|████ | 649/1610 [2:46:14<3:15:13, 12.19s/it] {'loss': 0.0025, 'grad_norm': 1.9414610814557842, 'learning_rate': 5.968944099378882e-07, 'completion_length': 121.47322082519531, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6696429252624512, 'reward_std': 0.1379830539226532, 'kl': 0.062744140625, 'epoch': 2.02} 40%|████ | 649/1610 [2:46:14<3:15:13, 12.19s/it] 40%|████ | 650/1610 [2:46:27<3:17:40, 12.36s/it] {'loss': 0.0023, 'grad_norm': 3.54816344310991, 'learning_rate': 5.962732919254659e-07, 'completion_length': 134.25000381469727, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.25670325756073, 'kl': 0.0565185546875, 'epoch': 2.02} 40%|████ | 650/1610 [2:46:27<3:17:40, 12.36s/it] 40%|████ | 651/1610 [2:46:39<3:17:17, 12.34s/it] {'loss': 0.002, 'grad_norm': 1.071810341678755, 'learning_rate': 5.956521739130435e-07, 'completion_length': 132.96429061889648, 'rewards/accuracy_reward': 0.2500000186264515, 
'rewards/format_reward': 1.0, 'reward': 1.2500000596046448, 'reward_std': 0.19448164105415344, 'kl': 0.0511474609375, 'epoch': 2.02} 40%|████ | 651/1610 [2:46:39<3:17:17, 12.34s/it] 40%|████ | 652/1610 [2:46:52<3:21:37, 12.63s/it] {'loss': 0.0023, 'grad_norm': 1.5847801867695666, 'learning_rate': 5.95031055900621e-07, 'completion_length': 140.9732208251953, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3035714626312256, 'reward_std': 0.2792610228061676, 'kl': 0.056884765625, 'epoch': 2.02} 40%|████ | 652/1610 [2:46:52<3:21:37, 12.63s/it] 41%|████ | 653/1610 [2:47:05<3:20:47, 12.59s/it] {'loss': 0.0021, 'grad_norm': 1.5853414943344306, 'learning_rate': 5.944099378881987e-07, 'completion_length': 128.41071701049805, 'rewards/accuracy_reward': 0.473214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.2410808652639389, 'kl': 0.0513916015625, 'epoch': 2.03} 41%|████ | 653/1610 [2:47:05<3:20:47, 12.59s/it] 41%|████ | 654/1610 [2:47:17<3:19:32, 12.52s/it] {'loss': 0.0019, 'grad_norm': 1.2208778747858584, 'learning_rate': 5.937888198757763e-07, 'completion_length': 128.01786041259766, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.1956884190440178, 'kl': 0.0482177734375, 'epoch': 2.03} 41%|████ | 654/1610 [2:47:17<3:19:32, 12.52s/it] 41%|████ | 655/1610 [2:47:30<3:20:49, 12.62s/it] {'loss': 0.0019, 'grad_norm': 1.1514577456976367, 'learning_rate': 5.931677018633539e-07, 'completion_length': 160.61607360839844, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285714626312256, 'reward_std': 0.2314438298344612, 'kl': 0.0484619140625, 'epoch': 2.03} 41%|████ | 655/1610 [2:47:30<3:20:49, 12.62s/it] 41%|████ | 656/1610 [2:47:42<3:18:24, 12.48s/it] {'loss': 0.0018, 'grad_norm': 1.6689813827476394, 'learning_rate': 5.925465838509317e-07, 
'completion_length': 115.6160774230957, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5267857909202576, 'reward_std': 0.2767828106880188, 'kl': 0.045654296875, 'epoch': 2.04} 41%|████ | 656/1610 [2:47:42<3:18:24, 12.48s/it] 41%|████ | 657/1610 [2:47:54<3:12:52, 12.14s/it] {'loss': 0.0017, 'grad_norm': 7.857425692893853, 'learning_rate': 5.919254658385093e-07, 'completion_length': 118.95536422729492, 'rewards/accuracy_reward': 0.5178571939468384, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.18397442996501923, 'kl': 0.0435791015625, 'epoch': 2.04} 41%|████ | 657/1610 [2:47:54<3:12:52, 12.14s/it] 41%|████ | 658/1610 [2:48:08<3:21:24, 12.69s/it] {'loss': 0.0024, 'grad_norm': 0.651439470728017, 'learning_rate': 5.913043478260869e-07, 'completion_length': 149.8214340209961, 'rewards/accuracy_reward': 0.401785746216774, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3928571939468384, 'reward_std': 0.14579425752162933, 'kl': 0.059814453125, 'epoch': 2.04} 41%|████ | 658/1610 [2:48:08<3:21:24, 12.69s/it] 41%|████ | 659/1610 [2:48:22<3:27:42, 13.10s/it] {'loss': 0.0021, 'grad_norm': 1.0085490637708647, 'learning_rate': 5.906832298136646e-07, 'completion_length': 112.5089340209961, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.6160715222358704, 'reward_std': 0.19751425087451935, 'kl': 0.0528564453125, 'epoch': 2.05} 41%|████ | 659/1610 [2:48:22<3:27:42, 13.10s/it] 41%|████ | 660/1610 [2:48:35<3:26:54, 13.07s/it] {'loss': 0.0024, 'grad_norm': 0.5459469347946824, 'learning_rate': 5.900621118012422e-07, 'completion_length': 125.23215103149414, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857313156128, 'reward_std': 0.0835726335644722, 'kl': 0.060302734375, 'epoch': 2.05} 41%|████ | 660/1610 [2:48:35<3:26:54, 13.07s/it] 41%|████ | 661/1610 [2:48:47<3:25:31, 
12.99s/it] {'loss': 0.0018, 'grad_norm': 3.469622228365689, 'learning_rate': 5.894409937888198e-07, 'completion_length': 145.1428680419922, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4196428656578064, 'reward_std': 0.2831638306379318, 'kl': 0.0462646484375, 'epoch': 2.05} 41%|████ | 661/1610 [2:48:47<3:25:31, 12.99s/it] 41%|████ | 662/1610 [2:48:59<3:19:33, 12.63s/it] {'loss': 0.0024, 'grad_norm': 1.3908472482327299, 'learning_rate': 5.888198757763975e-07, 'completion_length': 107.60714721679688, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5625000596046448, 'reward_std': 0.24620164930820465, 'kl': 0.0594482421875, 'epoch': 2.06} 41%|████ | 662/1610 [2:48:59<3:19:33, 12.63s/it] 41%|████ | 663/1610 [2:49:10<3:10:17, 12.06s/it] {'loss': 0.0024, 'grad_norm': 1.412917580428107, 'learning_rate': 5.881987577639751e-07, 'completion_length': 99.16964721679688, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.17495086789131165, 'kl': 0.060546875, 'epoch': 2.06} 41%|████ | 663/1610 [2:49:10<3:10:17, 12.06s/it] 41%|████ | 664/1610 [2:49:21<3:03:17, 11.63s/it] {'loss': 0.0027, 'grad_norm': 2.210575995898596, 'learning_rate': 5.875776397515527e-07, 'completion_length': 96.50000381469727, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.26450882852077484, 'kl': 0.06689453125, 'epoch': 2.06} 41%|████ | 664/1610 [2:49:21<3:03:17, 11.63s/it] 41%|████▏ | 665/1610 [2:49:31<2:56:30, 11.21s/it] {'loss': 0.0029, 'grad_norm': 1.0471683390514812, 'learning_rate': 5.869565217391305e-07, 'completion_length': 88.0535774230957, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.05831881985068321, 'kl': 0.0712890625, 'epoch': 2.07} 41%|████▏ | 665/1610 
[2:49:31<2:56:30, 11.21s/it] 41%|████▏ | 666/1610 [2:49:42<2:58:24, 11.34s/it] {'loss': 0.0028, 'grad_norm': 1.1313988107327544, 'learning_rate': 5.863354037267081e-07, 'completion_length': 109.25000762939453, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4732143878936768, 'reward_std': 0.1379830613732338, 'kl': 0.06982421875, 'epoch': 2.07} 41%|████▏ | 666/1610 [2:49:42<2:58:24, 11.34s/it] 41%|████▏ | 667/1610 [2:49:54<2:59:04, 11.39s/it] {'loss': 0.0025, 'grad_norm': 1.654433474893647, 'learning_rate': 5.857142857142857e-07, 'completion_length': 102.01786041259766, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803572535514832, 'reward_std': 0.26572123169898987, 'kl': 0.0628662109375, 'epoch': 2.07} 41%|████▏ | 667/1610 [2:49:54<2:59:04, 11.39s/it] 41%|████▏ | 668/1610 [2:50:05<2:55:01, 11.15s/it] {'loss': 0.003, 'grad_norm': 1.4491508076841366, 'learning_rate': 5.850931677018634e-07, 'completion_length': 104.9464340209961, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5000000596046448, 'reward_std': 0.18397442251443863, 'kl': 0.074951171875, 'epoch': 2.07} 41%|████▏ | 668/1610 [2:50:05<2:55:01, 11.15s/it] 42%|████▏ | 669/1610 [2:50:16<2:58:03, 11.35s/it] {'loss': 0.0023, 'grad_norm': 1.9332940340407647, 'learning_rate': 5.84472049689441e-07, 'completion_length': 104.76786422729492, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6696429252624512, 'reward_std': 0.2921874076128006, 'kl': 0.057373046875, 'epoch': 2.08} 42%|████▏ | 669/1610 [2:50:16<2:58:03, 11.35s/it] 42%|████▏ | 670/1610 [2:50:30<3:06:39, 11.91s/it] {'loss': 0.0022, 'grad_norm': 3.237583119787372, 'learning_rate': 5.838509316770186e-07, 'completion_length': 101.78572082519531, 'rewards/accuracy_reward': 0.375, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3660715222358704, 'reward_std': 
0.12054044008255005, 'kl': 0.0543212890625, 'epoch': 2.08} 42%|████▏ | 670/1610 [2:50:30<3:06:39, 11.91s/it] 42%|████▏ | 671/1610 [2:50:41<3:02:33, 11.66s/it] {'loss': 0.0028, 'grad_norm': 2.369739464152371, 'learning_rate': 5.832298136645963e-07, 'completion_length': 89.46428680419922, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5089285969734192, 'reward_std': 0.14969705045223236, 'kl': 0.071044921875, 'epoch': 2.08} 42%|████▏ | 671/1610 [2:50:41<3:02:33, 11.66s/it] 42%|████▏ | 672/1610 [2:50:51<2:55:28, 11.22s/it] {'loss': 0.0032, 'grad_norm': 0.7788155993812133, 'learning_rate': 5.826086956521739e-07, 'completion_length': 103.41964721679688, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 1.0, 'reward': 1.5982143878936768, 'reward_std': 0.1827620565891266, 'kl': 0.0810546875, 'epoch': 2.09} 42%|████▏ | 672/1610 [2:50:51<2:55:28, 11.22s/it] 42%|████▏ | 673/1610 [2:51:02<2:57:16, 11.35s/it] {'loss': 0.0032, 'grad_norm': 1.2526392508023625, 'learning_rate': 5.819875776397515e-07, 'completion_length': 83.45536041259766, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910714626312256, 'reward_std': 0.10821298509836197, 'kl': 0.08056640625, 'epoch': 2.09} 42%|████▏ | 673/1610 [2:51:02<2:57:16, 11.35s/it] 42%|████▏ | 674/1610 [2:51:14<2:57:17, 11.36s/it] {'loss': 0.0025, 'grad_norm': 1.781045716229595, 'learning_rate': 5.813664596273293e-07, 'completion_length': 104.45536422729492, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.23266181349754333, 'kl': 0.0628662109375, 'epoch': 2.09} 42%|████▏ | 674/1610 [2:51:14<2:57:17, 11.36s/it] 42%|████▏ | 675/1610 [2:51:24<2:51:47, 11.02s/it] {'loss': 0.0022, 'grad_norm': 1.308097069345323, 'learning_rate': 5.807453416149069e-07, 'completion_length': 87.06250381469727, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 1.0, 
'reward': 1.2946429252624512, 'reward_std': 0.17495647072792053, 'kl': 0.0538330078125, 'epoch': 2.1} 42%|████▏ | 675/1610 [2:51:24<2:51:47, 11.02s/it] 42%|████▏ | 676/1610 [2:51:36<2:55:33, 11.28s/it] {'loss': 0.0023, 'grad_norm': 0.9287294935564608, 'learning_rate': 5.801242236024844e-07, 'completion_length': 101.66964721679688, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4375000596046448, 'reward_std': 0.1800716295838356, 'kl': 0.0584716796875, 'epoch': 2.1} 42%|████▏ | 676/1610 [2:51:36<2:55:33, 11.28s/it] 42%|████▏ | 677/1610 [2:51:47<2:53:52, 11.18s/it] {'loss': 0.0025, 'grad_norm': 1.0490446830881082, 'learning_rate': 5.795031055900621e-07, 'completion_length': 92.34821701049805, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.16531942784786224, 'kl': 0.0626220703125, 'epoch': 2.1} 42%|████▏ | 677/1610 [2:51:47<2:53:52, 11.18s/it] 42%|████▏ | 678/1610 [2:51:58<2:52:42, 11.12s/it] {'loss': 0.0025, 'grad_norm': 2.535007210445705, 'learning_rate': 5.788819875776397e-07, 'completion_length': 104.08036422729492, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.1929979920387268, 'kl': 0.0621337890625, 'epoch': 2.11} 42%|████▏ | 678/1610 [2:51:58<2:52:42, 11.12s/it] 42%|████▏ | 679/1610 [2:52:09<2:50:39, 11.00s/it] {'loss': 0.0026, 'grad_norm': 1.4106250054178566, 'learning_rate': 5.782608695652173e-07, 'completion_length': 91.41071701049805, 'rewards/accuracy_reward': 0.6696428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.27743519842624664, 'kl': 0.0640869140625, 'epoch': 2.11} 42%|████▏ | 679/1610 [2:52:09<2:50:39, 11.00s/it] 42%|████▏ | 680/1610 [2:52:21<2:58:12, 11.50s/it] {'loss': 0.0034, 'grad_norm': 1.6622077470807668, 'learning_rate': 5.77639751552795e-07, 'completion_length': 100.6160774230957, 
'rewards/accuracy_reward': 0.5803571939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5446429252624512, 'reward_std': 0.18457334488630295, 'kl': 0.084228515625, 'epoch': 2.11} 42%|████▏ | 680/1610 [2:52:21<2:58:12, 11.50s/it] 42%|████▏ | 681/1610 [2:52:33<3:01:01, 11.69s/it] {'loss': 0.0029, 'grad_norm': 1.0570051004792707, 'learning_rate': 5.770186335403726e-07, 'completion_length': 100.74107360839844, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.13225442171096802, 'kl': 0.072265625, 'epoch': 2.11} 42%|████▏ | 681/1610 [2:52:33<3:01:01, 11.69s/it] 42%|████▏ | 682/1610 [2:52:45<2:57:59, 11.51s/it] {'loss': 0.0022, 'grad_norm': 1.5556936215293287, 'learning_rate': 5.763975155279502e-07, 'completion_length': 88.32143020629883, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285715222358704, 'reward_std': 0.2278832346200943, 'kl': 0.0556640625, 'epoch': 2.12} 42%|████▏ | 682/1610 [2:52:45<2:57:59, 11.51s/it] 42%|████▏ | 683/1610 [2:52:56<2:56:11, 11.40s/it] {'loss': 0.0032, 'grad_norm': 0.5150284163667778, 'learning_rate': 5.75776397515528e-07, 'completion_length': 95.70536041259766, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.025253813713788986, 'kl': 0.078857421875, 'epoch': 2.12} 42%|████▏ | 683/1610 [2:52:56<2:56:11, 11.40s/it] 42%|████▏ | 684/1610 [2:53:07<2:53:53, 11.27s/it] {'loss': 0.0026, 'grad_norm': 1.7663496651949218, 'learning_rate': 5.751552795031056e-07, 'completion_length': 84.9285774230957, 'rewards/accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.23326969146728516, 'kl': 0.06494140625, 'epoch': 2.12} 42%|████▏ | 684/1610 [2:53:07<2:53:53, 11.27s/it] 43%|████▎ | 685/1610 [2:53:17<2:51:46, 11.14s/it] {'loss': 0.003, 'grad_norm': 1.7366104196010035, 
'learning_rate': 5.745341614906832e-07, 'completion_length': 93.8839340209961, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4464285969734192, 'reward_std': 0.2675470560789108, 'kl': 0.074951171875, 'epoch': 2.13} 43%|████▎ | 685/1610 [2:53:17<2:51:46, 11.14s/it] 43%|████▎ | 686/1610 [2:53:29<2:52:22, 11.19s/it] {'loss': 0.0021, 'grad_norm': 1.2399835647541844, 'learning_rate': 5.739130434782609e-07, 'completion_length': 102.78571701049805, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.17885926365852356, 'kl': 0.05322265625, 'epoch': 2.13} 43%|████▎ | 686/1610 [2:53:29<2:52:22, 11.19s/it] 43%|████▎ | 687/1610 [2:53:40<2:51:01, 11.12s/it] {'loss': 0.0025, 'grad_norm': 2.69119918207675, 'learning_rate': 5.732919254658385e-07, 'completion_length': 111.52679061889648, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4375000596046448, 'reward_std': 0.34538544714450836, 'kl': 0.06201171875, 'epoch': 2.13} 43%|████▎ | 687/1610 [2:53:40<2:51:01, 11.12s/it] 43%|████▎ | 688/1610 [2:53:50<2:48:49, 10.99s/it] {'loss': 0.0027, 'grad_norm': 1.1378769347586595, 'learning_rate': 5.726708074534161e-07, 'completion_length': 95.6160774230957, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.15212179720401764, 'kl': 0.06640625, 'epoch': 2.14} 43%|████▎ | 688/1610 [2:53:50<2:48:49, 10.99s/it] 43%|████▎ | 689/1610 [2:54:03<2:56:08, 11.48s/it] {'loss': 0.0024, 'grad_norm': 0.9491417585536397, 'learning_rate': 5.720496894409938e-07, 'completion_length': 114.75893783569336, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4910714626312256, 'reward_std': 0.16587380319833755, 'kl': 0.060546875, 'epoch': 2.14} 43%|████▎ | 689/1610 [2:54:03<2:56:08, 11.48s/it] 43%|████▎ | 690/1610 
[2:54:14<2:54:42, 11.39s/it] {'loss': 0.0031, 'grad_norm': 1.762853560524542, 'learning_rate': 5.714285714285714e-07, 'completion_length': 87.3660774230957, 'rewards/accuracy_reward': 0.428571455180645, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.128351628780365, 'kl': 0.0771484375, 'epoch': 2.14} 43%|████▎ | 690/1610 [2:54:14<2:54:42, 11.39s/it] 43%|████▎ | 691/1610 [2:54:26<2:56:32, 11.53s/it] {'loss': 0.0029, 'grad_norm': 1.6878065112461134, 'learning_rate': 5.70807453416149e-07, 'completion_length': 105.82143020629883, 'rewards/accuracy_reward': 0.4553571790456772, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.446428656578064, 'reward_std': 0.26243188977241516, 'kl': 0.0736083984375, 'epoch': 2.15} 43%|████▎ | 691/1610 [2:54:26<2:56:32, 11.53s/it] 43%|████▎ | 692/1610 [2:54:38<2:57:10, 11.58s/it] {'loss': 0.0027, 'grad_norm': 1.5596002800063615, 'learning_rate': 5.701863354037268e-07, 'completion_length': 89.55357360839844, 'rewards/accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.19178561866283417, 'kl': 0.06640625, 'epoch': 2.15} 43%|████▎ | 692/1610 [2:54:38<2:57:10, 11.58s/it] 43%|████▎ | 693/1610 [2:54:50<2:59:20, 11.73s/it] {'loss': 0.0026, 'grad_norm': 2.825683384083851, 'learning_rate': 5.695652173913044e-07, 'completion_length': 126.51786041259766, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.20411308109760284, 'kl': 0.06396484375, 'epoch': 2.15} 43%|████▎ | 693/1610 [2:54:50<2:59:20, 11.73s/it] 43%|████▎ | 694/1610 [2:55:02<2:59:50, 11.78s/it] {'loss': 0.0024, 'grad_norm': 1.2311088038943894, 'learning_rate': 5.68944099378882e-07, 'completion_length': 108.89286041259766, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.14639315754175186, 'kl': 0.058837890625, 'epoch': 2.16} 43%|████▎ | 694/1610 
[2:55:02<2:59:50, 11.78s/it] 43%|████▎ | 695/1610 [2:55:14<3:00:12, 11.82s/it] {'loss': 0.0026, 'grad_norm': 1.2555137248465138, 'learning_rate': 5.683229813664597e-07, 'completion_length': 93.81250381469727, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3660714626312256, 'reward_std': 0.2275410294532776, 'kl': 0.0654296875, 'epoch': 2.16} 43%|████▎ | 695/1610 [2:55:14<3:00:12, 11.82s/it] 43%|████▎ | 696/1610 [2:55:26<3:02:17, 11.97s/it] {'loss': 0.0029, 'grad_norm': 1.4494291479040038, 'learning_rate': 5.677018633540373e-07, 'completion_length': 102.46429061889648, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 1.0, 'reward': 1.598214328289032, 'reward_std': 0.23656461387872696, 'kl': 0.0712890625, 'epoch': 2.16} 43%|████▎ | 696/1610 [2:55:26<3:02:17, 11.97s/it] 43%|████▎ | 697/1610 [2:55:38<3:02:55, 12.02s/it] {'loss': 0.0025, 'grad_norm': 1.3724516372587647, 'learning_rate': 5.670807453416149e-07, 'completion_length': 114.91072082519531, 'rewards/accuracy_reward': 0.2589285895228386, 'rewards/format_reward': 1.0, 'reward': 1.258928656578064, 'reward_std': 0.20411306619644165, 'kl': 0.0615234375, 'epoch': 2.16} 43%|████▎ | 697/1610 [2:55:38<3:02:55, 12.02s/it] 43%|████▎ | 698/1610 [2:55:51<3:07:49, 12.36s/it] {'loss': 0.0025, 'grad_norm': 2.6862303537742602, 'learning_rate': 5.664596273291926e-07, 'completion_length': 108.0089340209961, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3035714626312256, 'reward_std': 0.2540128380060196, 'kl': 0.0635986328125, 'epoch': 2.17} 43%|████▎ | 698/1610 [2:55:51<3:07:49, 12.36s/it] 43%|████▎ | 699/1610 [2:56:02<3:02:12, 12.00s/it] {'loss': 0.0032, 'grad_norm': 9.398520116342302, 'learning_rate': 5.658385093167701e-07, 'completion_length': 99.62500381469727, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 1.0, 'reward': 1.598214328289032, 'reward_std': 
0.2086147665977478, 'kl': 0.08056640625, 'epoch': 2.17} 43%|████▎ | 699/1610 [2:56:02<3:02:12, 12.00s/it] 43%|████▎ | 700/1610 [2:56:14<2:58:52, 11.79s/it] {'loss': 0.0021, 'grad_norm': 1.7659305440061492, 'learning_rate': 5.652173913043477e-07, 'completion_length': 107.80357360839844, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5089285969734192, 'reward_std': 0.20411306619644165, 'kl': 0.0533447265625, 'epoch': 2.17} 43%|████▎ | 700/1610 [2:56:14<2:58:52, 11.79s/it] 44%|████▎ | 701/1610 [2:57:13<6:33:15, 25.96s/it] {'loss': 0.003, 'grad_norm': 0.8699314417466667, 'learning_rate': 5.645962732919255e-07, 'completion_length': 85.87500381469727, 'rewards/accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.12054044008255005, 'kl': 0.073974609375, 'epoch': 2.18} 44%|████▎ | 701/1610 [2:57:13<6:33:15, 25.96s/it] 44%|████▎ | 702/1610 [2:57:25<5:29:58, 21.81s/it] {'loss': 0.0028, 'grad_norm': 1.6184358020141028, 'learning_rate': 5.639751552795031e-07, 'completion_length': 90.48214721679688, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4910715222358704, 'reward_std': 0.2092282474040985, 'kl': 0.069091796875, 'epoch': 2.18} 44%|████▎ | 702/1610 [2:57:25<5:29:58, 21.81s/it] 44%|████▎ | 703/1610 [2:57:36<4:39:30, 18.49s/it] {'loss': 0.0025, 'grad_norm': 1.96565230297625, 'learning_rate': 5.633540372670807e-07, 'completion_length': 88.91071701049805, 'rewards/accuracy_reward': 0.571428582072258, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.1509094201028347, 'kl': 0.0628662109375, 'epoch': 2.18} 44%|████▎ | 703/1610 [2:57:36<4:39:30, 18.49s/it] 44%|████▎ | 704/1610 [2:57:47<4:07:49, 16.41s/it] {'loss': 0.0022, 'grad_norm': 2.040711029939877, 'learning_rate': 5.627329192546583e-07, 'completion_length': 116.39286041259766, 'rewards/accuracy_reward': 0.598214328289032, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.5892857909202576, 'reward_std': 0.2119242660701275, 'kl': 0.0545654296875, 'epoch': 2.19} 44%|████▎ | 704/1610 [2:57:47<4:07:49, 16.41s/it] 44%|████▍ | 705/1610 [2:58:00<3:51:08, 15.32s/it] {'loss': 0.003, 'grad_norm': 1.6322488430570745, 'learning_rate': 5.62111801242236e-07, 'completion_length': 112.54465103149414, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.508928656578064, 'reward_std': 0.19838443398475647, 'kl': 0.075439453125, 'epoch': 2.19} 44%|████▍ | 705/1610 [2:58:00<3:51:08, 15.32s/it] 44%|████▍ | 706/1610 [2:58:13<3:40:01, 14.60s/it] {'loss': 0.0025, 'grad_norm': 1.6998816714892577, 'learning_rate': 5.614906832298136e-07, 'completion_length': 129.76786422729492, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3125000596046448, 'reward_std': 0.2636442631483078, 'kl': 0.0616455078125, 'epoch': 2.19} 44%|████▍ | 706/1610 [2:58:13<3:40:01, 14.60s/it] 44%|████▍ | 707/1610 [2:58:24<3:26:02, 13.69s/it] {'loss': 0.0023, 'grad_norm': 9.117622550211008, 'learning_rate': 5.608695652173912e-07, 'completion_length': 99.77679061889648, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5535714626312256, 'reward_std': 0.26243188977241516, 'kl': 0.058837890625, 'epoch': 2.2} 44%|████▍ | 707/1610 [2:58:24<3:26:02, 13.69s/it] 44%|████▍ | 708/1610 [2:58:37<3:20:18, 13.32s/it] {'loss': 0.0029, 'grad_norm': 1.023040782140674, 'learning_rate': 5.602484472049689e-07, 'completion_length': 87.08036041259766, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.10101525485515594, 'kl': 0.07177734375, 'epoch': 2.2} 44%|████▍ | 708/1610 [2:58:37<3:20:18, 13.32s/it] 44%|████▍ | 709/1610 [2:58:48<3:11:57, 12.78s/it] {'loss': 0.0027, 'grad_norm': 3.488490678818448, 'learning_rate': 5.596273291925465e-07, 
'completion_length': 99.66964721679688, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.3839285969734192, 'reward_std': 0.15872059762477875, 'kl': 0.06689453125, 'epoch': 2.2} 44%|████▍ | 709/1610 [2:58:48<3:11:57, 12.78s/it] 44%|████▍ | 710/1610 [2:58:59<3:00:25, 12.03s/it] {'loss': 0.0031, 'grad_norm': 1.125306944623417, 'learning_rate': 5.590062111801241e-07, 'completion_length': 83.09821701049805, 'rewards/accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.7232143878936768, 'reward_std': 0.1379830539226532, 'kl': 0.076171875, 'epoch': 2.2} 44%|████▍ | 710/1610 [2:58:59<3:00:25, 12.03s/it] 44%|████▍ | 711/1610 [2:59:09<2:51:39, 11.46s/it] {'loss': 0.0025, 'grad_norm': 1.279517068689923, 'learning_rate': 5.583850931677019e-07, 'completion_length': 82.9910774230957, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.383928656578064, 'reward_std': 0.1866704523563385, 'kl': 0.0626220703125, 'epoch': 2.21} 44%|████▍ | 711/1610 [2:59:09<2:51:39, 11.46s/it] 44%|████▍ | 712/1610 [2:59:21<2:55:16, 11.71s/it] {'loss': 0.0029, 'grad_norm': 1.7669320064668534, 'learning_rate': 5.577639751552795e-07, 'completion_length': 104.8035774230957, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1963018774986267, 'kl': 0.07275390625, 'epoch': 2.21} 44%|████▍ | 712/1610 [2:59:21<2:55:16, 11.71s/it] 44%|████▍ | 713/1610 [2:59:33<2:53:55, 11.63s/it] {'loss': 0.0027, 'grad_norm': 1.2935206859942716, 'learning_rate': 5.571428571428571e-07, 'completion_length': 110.77679061889648, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3392857909202576, 'reward_std': 0.17885926365852356, 'kl': 0.06640625, 'epoch': 2.21} 44%|████▍ | 713/1610 [2:59:33<2:53:55, 11.63s/it] 44%|████▍ | 714/1610 [2:59:44<2:54:17, 11.67s/it] {'loss': 0.0025, 'grad_norm': 
2.3061624710632205, 'learning_rate': 5.565217391304348e-07, 'completion_length': 108.85714721679688, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.22875341773033142, 'kl': 0.0614013671875, 'epoch': 2.22} 44%|████▍ | 714/1610 [2:59:44<2:54:17, 11.67s/it] 44%|████▍ | 715/1610 [2:59:56<2:53:17, 11.62s/it] {'loss': 0.0025, 'grad_norm': 1.3962939206618519, 'learning_rate': 5.559006211180124e-07, 'completion_length': 104.54464721679688, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285715222358704, 'reward_std': 0.2948834300041199, 'kl': 0.061767578125, 'epoch': 2.22} 44%|████▍ | 715/1610 [2:59:56<2:53:17, 11.62s/it] 44%|████▍ | 716/1610 [3:00:06<2:46:34, 11.18s/it] {'loss': 0.0028, 'grad_norm': 2.047300093951089, 'learning_rate': 5.5527950310559e-07, 'completion_length': 89.85714340209961, 'rewards/accuracy_reward': 0.6785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.2540128231048584, 'kl': 0.071044921875, 'epoch': 2.22} 44%|████▍ | 716/1610 [3:00:06<2:46:34, 11.18s/it] 45%|████▍ | 717/1610 [3:00:18<2:51:59, 11.56s/it] {'loss': 0.0023, 'grad_norm': 1.0540664368062005, 'learning_rate': 5.546583850931677e-07, 'completion_length': 113.7410774230957, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5267857909202576, 'reward_std': 0.09138382971286774, 'kl': 0.0574951171875, 'epoch': 2.23} 45%|████▍ | 717/1610 [3:00:18<2:51:59, 11.56s/it] 45%|████▍ | 718/1610 [3:00:32<2:59:20, 12.06s/it] {'loss': 0.0029, 'grad_norm': 1.076404301036689, 'learning_rate': 5.540372670807453e-07, 'completion_length': 111.7410774230957, 'rewards/accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5267857909202576, 'reward_std': 0.1866704486310482, 'kl': 0.072509765625, 'epoch': 2.23} 45%|████▍ | 718/1610 [3:00:32<2:59:20, 
12.06s/it] 45%|████▍ | 719/1610 [3:00:43<2:53:53, 11.71s/it] {'loss': 0.0028, 'grad_norm': 2.114302845400904, 'learning_rate': 5.534161490683229e-07, 'completion_length': 93.59822082519531, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910714626312256, 'reward_std': 0.12956400215625763, 'kl': 0.070556640625, 'epoch': 2.23} 45%|████▍ | 719/1610 [3:00:43<2:53:53, 11.71s/it] 45%|████▍ | 720/1610 [3:00:53<2:49:18, 11.41s/it] {'loss': 0.003, 'grad_norm': 1.8424012727451458, 'learning_rate': 5.527950310559007e-07, 'completion_length': 88.1339340209961, 'rewards/accuracy_reward': 0.5178571939468384, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.18397442996501923, 'kl': 0.074462890625, 'epoch': 2.24} 45%|████▍ | 720/1610 [3:00:53<2:49:18, 11.41s/it] 45%|████▍ | 721/1610 [3:01:04<2:44:21, 11.09s/it] {'loss': 0.0022, 'grad_norm': 1.3071718915160602, 'learning_rate': 5.521739130434783e-07, 'completion_length': 92.0714340209961, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4732143878936768, 'reward_std': 0.20922823250293732, 'kl': 0.05615234375, 'epoch': 2.24} 45%|████▍ | 721/1610 [3:01:04<2:44:21, 11.09s/it] 45%|████▍ | 722/1610 [3:01:17<2:52:13, 11.64s/it] {'loss': 0.0026, 'grad_norm': 0.7067236339899698, 'learning_rate': 5.515527950310559e-07, 'completion_length': 129.4464340209961, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4107143878936768, 'reward_std': 0.10040179640054703, 'kl': 0.0655517578125, 'epoch': 2.24} 45%|████▍ | 722/1610 [3:01:17<2:52:13, 11.64s/it] 45%|████▍ | 723/1610 [3:01:27<2:48:06, 11.37s/it] {'loss': 0.0026, 'grad_norm': 1.7218301454843803, 'learning_rate': 5.509316770186335e-07, 'completion_length': 96.35714721679688, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.14969705045223236, 'kl': 
0.06591796875, 'epoch': 2.25} 45%|████▍ | 723/1610 [3:01:27<2:48:06, 11.37s/it] 45%|████▍ | 724/1610 [3:01:38<2:46:18, 11.26s/it] {'loss': 0.0023, 'grad_norm': 1.3907798720828743, 'learning_rate': 5.503105590062111e-07, 'completion_length': 97.31250381469727, 'rewards/accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.12054044008255005, 'kl': 0.0574951171875, 'epoch': 2.25} 45%|████▍ | 724/1610 [3:01:38<2:46:18, 11.26s/it] 45%|████▌ | 725/1610 [3:01:51<2:52:20, 11.68s/it] {'loss': 0.0027, 'grad_norm': 1.3565025443266667, 'learning_rate': 5.496894409937887e-07, 'completion_length': 101.71429061889648, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160714626312256, 'reward_std': 0.15360544621944427, 'kl': 0.068603515625, 'epoch': 2.25} 45%|████▌ | 725/1610 [3:01:51<2:52:20, 11.68s/it] 45%|████▌ | 726/1610 [3:02:01<2:43:34, 11.10s/it] {'loss': 0.0022, 'grad_norm': 1.6674476309122543, 'learning_rate': 5.490683229813664e-07, 'completion_length': 101.58929061889648, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.17885926365852356, 'kl': 0.0543212890625, 'epoch': 2.25} 45%|████▌ | 726/1610 [3:02:01<2:43:34, 11.10s/it] 45%|████▌ | 727/1610 [3:02:11<2:39:30, 10.84s/it] {'loss': 0.0024, 'grad_norm': 1.3133762810518823, 'learning_rate': 5.48447204968944e-07, 'completion_length': 104.14286422729492, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5982143878936768, 'reward_std': 0.1575138419866562, 'kl': 0.0589599609375, 'epoch': 2.26} 45%|████▌ | 727/1610 [3:02:11<2:39:30, 10.84s/it] 45%|████▌ | 728/1610 [3:02:23<2:43:53, 11.15s/it] {'loss': 0.0025, 'grad_norm': 1.016311459547209, 'learning_rate': 5.478260869565216e-07, 'completion_length': 123.83036422729492, 'rewards/accuracy_reward': 0.5625000149011612, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.5535715222358704, 'reward_std': 0.19112762063741684, 'kl': 0.0628662109375, 'epoch': 2.26} 45%|████▌ | 728/1610 [3:02:23<2:43:53, 11.15s/it] 45%|████▌ | 729/1610 [3:02:35<2:48:43, 11.49s/it] {'loss': 0.002, 'grad_norm': 0.9776510847940741, 'learning_rate': 5.472049689440994e-07, 'completion_length': 97.45536041259766, 'rewards/accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.7678572535514832, 'reward_std': 0.09919501841068268, 'kl': 0.050537109375, 'epoch': 2.26} 45%|████▌ | 729/1610 [3:02:35<2:48:43, 11.49s/it] 45%|████▌ | 730/1610 [3:02:47<2:50:33, 11.63s/it] {'loss': 0.0032, 'grad_norm': 6.954493095909912, 'learning_rate': 5.46583850931677e-07, 'completion_length': 121.93750381469727, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.955357164144516, 'reward': 1.321428656578064, 'reward_std': 0.2119242623448372, 'kl': 0.079833984375, 'epoch': 2.27} 45%|████▌ | 730/1610 [3:02:47<2:50:33, 11.63s/it] 45%|████▌ | 731/1610 [3:02:59<2:52:16, 11.76s/it] {'loss': 0.003, 'grad_norm': 8.12460907968498, 'learning_rate': 5.459627329192546e-07, 'completion_length': 112.00000762939453, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.5000000596046448, 'reward_std': 0.267630260437727, 'kl': 0.07568359375, 'epoch': 2.27} 45%|████▌ | 731/1610 [3:02:59<2:52:16, 11.76s/it] 45%|████▌ | 732/1610 [3:03:10<2:49:52, 11.61s/it] {'loss': 0.0025, 'grad_norm': 1.9881932635167652, 'learning_rate': 5.453416149068323e-07, 'completion_length': 109.0714340209961, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160714626312256, 'reward_std': 0.1800716444849968, 'kl': 0.063232421875, 'epoch': 2.27} 45%|████▌ | 732/1610 [3:03:10<2:49:52, 11.61s/it] 46%|████▌ | 733/1610 [3:03:23<2:54:48, 11.96s/it] {'loss': 0.0032, 'grad_norm': 2.152903987167819, 'learning_rate': 5.447204968944099e-07, 
'completion_length': 98.82143020629883, 'rewards/accuracy_reward': 0.5267857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.23717807233333588, 'kl': 0.0799560546875, 'epoch': 2.28} 46%|████▌ | 733/1610 [3:03:23<2:54:48, 11.96s/it] 46%|████▌ | 734/1610 [3:03:34<2:51:38, 11.76s/it] {'loss': 0.0026, 'grad_norm': 3.8038629995242195, 'learning_rate': 5.440993788819875e-07, 'completion_length': 104.22321701049805, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4821429252624512, 'reward_std': 0.16653179749846458, 'kl': 0.0640869140625, 'epoch': 2.28} 46%|████▌ | 734/1610 [3:03:34<2:51:38, 11.76s/it] 46%|████▌ | 735/1610 [3:03:47<2:54:15, 11.95s/it] {'loss': 0.0026, 'grad_norm': 0.4679580698249381, 'learning_rate': 5.434782608695652e-07, 'completion_length': 136.50000381469727, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3750001192092896, 'reward_std': 0.0835726410150528, 'kl': 0.06396484375, 'epoch': 2.28} 46%|████▌ | 735/1610 [3:03:47<2:54:15, 11.95s/it] 46%|████▌ | 736/1610 [3:03:58<2:52:39, 11.85s/it] {'loss': 0.0025, 'grad_norm': 1.5593693504991184, 'learning_rate': 5.428571428571428e-07, 'completion_length': 119.80358123779297, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.3018244504928589, 'kl': 0.0615234375, 'epoch': 2.29} 46%|████▌ | 736/1610 [3:03:58<2:52:39, 11.85s/it] 46%|████▌ | 737/1610 [3:04:09<2:44:56, 11.34s/it] {'loss': 0.0032, 'grad_norm': 2.735880547680639, 'learning_rate': 5.422360248447204e-07, 'completion_length': 87.51786041259766, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.20080359280109406, 'kl': 0.079833984375, 'epoch': 2.29} 46%|████▌ | 737/1610 [3:04:09<2:44:56, 11.34s/it] 46%|████▌ | 738/1610 [3:04:22<2:52:40, 11.88s/it] {'loss': 0.0028, 
'grad_norm': 1.4015407979287242, 'learning_rate': 5.416149068322982e-07, 'completion_length': 109.52679061889648, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5446429252624512, 'reward_std': 0.23711898922920227, 'kl': 0.0699462890625, 'epoch': 2.29} 46%|████▌ | 738/1610 [3:04:22<2:52:40, 11.88s/it] 46%|████▌ | 739/1610 [3:04:34<2:55:36, 12.10s/it] {'loss': 0.0022, 'grad_norm': 2.2002814707809395, 'learning_rate': 5.409937888198758e-07, 'completion_length': 114.2589340209961, 'rewards/accuracy_reward': 0.473214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.30390700697898865, 'kl': 0.054443359375, 'epoch': 2.3} 46%|████▌ | 739/1610 [3:04:34<2:55:36, 12.10s/it] 46%|████▌ | 740/1610 [3:04:46<2:54:13, 12.02s/it] {'loss': 0.0025, 'grad_norm': 1.5429791459141147, 'learning_rate': 5.403726708074534e-07, 'completion_length': 117.17857360839844, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.22094221413135529, 'kl': 0.062255859375, 'epoch': 2.3} 46%|████▌ | 740/1610 [3:04:46<2:54:13, 12.02s/it] 46%|████▌ | 741/1610 [3:04:59<2:56:59, 12.22s/it] {'loss': 0.002, 'grad_norm': 2.223221809203951, 'learning_rate': 5.397515527950311e-07, 'completion_length': 120.29464721679688, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.3252524137496948, 'kl': 0.0511474609375, 'epoch': 2.3} 46%|████▌ | 741/1610 [3:04:59<2:56:59, 12.22s/it] 46%|████▌ | 742/1610 [3:05:11<2:58:20, 12.33s/it] {'loss': 0.0027, 'grad_norm': 1.3022408813664517, 'learning_rate': 5.391304347826087e-07, 'completion_length': 123.60714721679688, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.446428656578064, 'reward_std': 0.17737561464309692, 'kl': 0.0672607421875, 'epoch': 2.3} 46%|████▌ | 742/1610 
[3:05:11<2:58:20, 12.33s/it] 46%|████▌ | 743/1610 [3:05:24<2:57:45, 12.30s/it] {'loss': 0.0029, 'grad_norm': 1.6318238221850636, 'learning_rate': 5.385093167701863e-07, 'completion_length': 113.78572082519531, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5000000596046448, 'reward_std': 0.22875341027975082, 'kl': 0.072509765625, 'epoch': 2.31} 46%|████▌ | 743/1610 [3:05:24<2:57:45, 12.30s/it] 46%|████▌ | 744/1610 [3:05:35<2:54:33, 12.09s/it] {'loss': 0.0023, 'grad_norm': 1.355416200774909, 'learning_rate': 5.37888198757764e-07, 'completion_length': 113.6785774230957, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.14579425007104874, 'kl': 0.058349609375, 'epoch': 2.31} 46%|████▌ | 744/1610 [3:05:35<2:54:33, 12.09s/it] 46%|████▋ | 745/1610 [3:05:48<2:55:56, 12.20s/it] {'loss': 0.0026, 'grad_norm': 1.0992070012711965, 'learning_rate': 5.372670807453416e-07, 'completion_length': 123.64286041259766, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.633928656578064, 'reward_std': 0.15360543876886368, 'kl': 0.0648193359375, 'epoch': 2.31} 46%|████▋ | 745/1610 [3:05:48<2:55:56, 12.20s/it] 46%|████▋ | 746/1610 [3:06:00<2:54:25, 12.11s/it] {'loss': 0.0022, 'grad_norm': 1.5754706724950307, 'learning_rate': 5.366459627329191e-07, 'completion_length': 127.90178680419922, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5982143878936768, 'reward_std': 0.19447603821754456, 'kl': 0.0555419921875, 'epoch': 2.32} 46%|████▋ | 746/1610 [3:06:00<2:54:25, 12.11s/it] 46%|████▋ | 747/1610 [3:06:12<2:54:45, 12.15s/it] {'loss': 0.0024, 'grad_norm': 2.0390088992623627, 'learning_rate': 5.360248447204969e-07, 'completion_length': 119.04464721679688, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 
0.3111136853694916, 'kl': 0.06005859375, 'epoch': 2.32} 46%|████▋ | 747/1610 [3:06:12<2:54:45, 12.15s/it] 46%|████▋ | 748/1610 [3:06:25<3:00:03, 12.53s/it] {'loss': 0.0025, 'grad_norm': 1.4661065222739866, 'learning_rate': 5.354037267080745e-07, 'completion_length': 107.43750381469727, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.20100155472755432, 'kl': 0.062744140625, 'epoch': 2.32} 46%|████▋ | 748/1610 [3:06:25<3:00:03, 12.53s/it] 47%|████▋ | 749/1610 [3:06:37<2:56:23, 12.29s/it] {'loss': 0.0023, 'grad_norm': 1.292952422225747, 'learning_rate': 5.347826086956521e-07, 'completion_length': 124.69643783569336, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.1704345941543579, 'kl': 0.056396484375, 'epoch': 2.33} 47%|████▋ | 749/1610 [3:06:37<2:56:23, 12.29s/it] 47%|████▋ | 750/1610 [3:06:50<2:58:17, 12.44s/it] {'loss': 0.0026, 'grad_norm': 1.1756877964300023, 'learning_rate': 5.341614906832298e-07, 'completion_length': 128.2857208251953, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750000596046448, 'reward_std': 0.15481781959533691, 'kl': 0.064697265625, 'epoch': 2.33} 47%|████▋ | 750/1610 [3:06:50<2:58:17, 12.44s/it] 47%|████▋ | 751/1610 [3:07:00<2:46:49, 11.65s/it] {'loss': 0.002, 'grad_norm': 1.5460655151775533, 'learning_rate': 5.335403726708074e-07, 'completion_length': 113.83929061889648, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.15481781959533691, 'kl': 0.0509033203125, 'epoch': 2.33} 47%|████▋ | 751/1610 [3:07:00<2:46:49, 11.65s/it] 47%|████▋ | 752/1610 [3:07:13<2:52:25, 12.06s/it] {'loss': 0.0022, 'grad_norm': 1.2385161726764853, 'learning_rate': 5.32919254658385e-07, 'completion_length': 117.72322082519531, 'rewards/accuracy_reward': 0.6250000298023224, 
'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.13408027216792107, 'kl': 0.0548095703125, 'epoch': 2.34} 47%|████▋ | 752/1610 [3:07:13<2:52:25, 12.06s/it] 47%|████▋ | 753/1610 [3:07:24<2:47:59, 11.76s/it] {'loss': 0.0022, 'grad_norm': 1.283808975611293, 'learning_rate': 5.322981366459627e-07, 'completion_length': 129.46429443359375, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.18214857578277588, 'kl': 0.05615234375, 'epoch': 2.34} 47%|████▋ | 753/1610 [3:07:24<2:47:59, 11.76s/it] 47%|████▋ | 754/1610 [3:07:37<2:54:38, 12.24s/it] {'loss': 0.0024, 'grad_norm': 1.1949971253952252, 'learning_rate': 5.316770186335403e-07, 'completion_length': 147.6071548461914, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.16262340173125267, 'kl': 0.058837890625, 'epoch': 2.34} 47%|████▋ | 754/1610 [3:07:37<2:54:38, 12.24s/it] 47%|████▋ | 755/1610 [3:07:51<3:01:44, 12.75s/it] {'loss': 0.0025, 'grad_norm': 0.6464051330145315, 'learning_rate': 5.310559006211179e-07, 'completion_length': 139.4732208251953, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.723214328289032, 'reward_std': 0.0964989997446537, 'kl': 0.063232421875, 'epoch': 2.34} 47%|████▋ | 755/1610 [3:07:51<3:01:44, 12.75s/it] 47%|████▋ | 756/1610 [3:08:03<2:57:04, 12.44s/it] {'loss': 0.0025, 'grad_norm': 1.5370436155234586, 'learning_rate': 5.304347826086957e-07, 'completion_length': 118.95536041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.21972985565662384, 'kl': 0.0621337890625, 'epoch': 2.35} 47%|████▋ | 756/1610 [3:08:03<2:57:04, 12.44s/it] 47%|████▋ | 757/1610 [3:08:17<3:03:48, 12.93s/it] {'loss': 0.0019, 'grad_norm': 1.3815887783264984, 'learning_rate': 5.298136645962733e-07, 'completion_length': 
150.2857208251953, 'rewards/accuracy_reward': 0.2500000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2321429252624512, 'reward_std': 0.22094783931970596, 'kl': 0.048583984375, 'epoch': 2.35} 47%|████▋ | 757/1610 [3:08:17<3:03:48, 12.93s/it] 47%|████▋ | 758/1610 [3:08:29<3:02:25, 12.85s/it] {'loss': 0.0029, 'grad_norm': 1.8904225663758052, 'learning_rate': 5.291925465838509e-07, 'completion_length': 152.7321548461914, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.12565559893846512, 'kl': 0.07177734375, 'epoch': 2.35} 47%|████▋ | 758/1610 [3:08:29<3:02:25, 12.85s/it] 47%|████▋ | 759/1610 [3:08:44<3:09:36, 13.37s/it] {'loss': 0.0024, 'grad_norm': 2.2900809392793713, 'learning_rate': 5.285714285714286e-07, 'completion_length': 147.33036041259766, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285714626312256, 'reward_std': 0.2410808801651001, 'kl': 0.060302734375, 'epoch': 2.36} 47%|████▋ | 759/1610 [3:08:44<3:09:36, 13.37s/it] 47%|████▋ | 760/1610 [3:08:56<3:03:11, 12.93s/it] {'loss': 0.0023, 'grad_norm': 2.3679813873043956, 'learning_rate': 5.279503105590062e-07, 'completion_length': 120.66072082519531, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 1.0, 'reward': 1.3482143878936768, 'reward_std': 0.16531942412257195, 'kl': 0.0572509765625, 'epoch': 2.36} 47%|████▋ | 760/1610 [3:08:56<3:03:11, 12.93s/it] 47%|████▋ | 761/1610 [3:09:09<3:05:31, 13.11s/it] {'loss': 0.0026, 'grad_norm': 1.1284539454053264, 'learning_rate': 5.273291925465838e-07, 'completion_length': 146.5803680419922, 'rewards/accuracy_reward': 0.5625, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5535715222358704, 'reward_std': 0.29097501933574677, 'kl': 0.0640869140625, 'epoch': 2.36} 47%|████▋ | 761/1610 [3:09:09<3:05:31, 13.11s/it] 47%|████▋ | 762/1610 [3:09:21<3:00:09, 12.75s/it] {'loss': 0.0021, 'grad_norm': 
1.6457256192866, 'learning_rate': 5.267080745341615e-07, 'completion_length': 130.61608123779297, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.2675470560789108, 'kl': 0.0516357421875, 'epoch': 2.37} 47%|████▋ | 762/1610 [3:09:21<3:00:09, 12.75s/it] 47%|████▋ | 763/1610 [3:09:35<3:04:41, 13.08s/it] {'loss': 0.0024, 'grad_norm': 1.3629343650360266, 'learning_rate': 5.260869565217391e-07, 'completion_length': 145.62500762939453, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4375000596046448, 'reward_std': 0.21803686022758484, 'kl': 0.06103515625, 'epoch': 2.37} 47%|████▋ | 763/1610 [3:09:35<3:04:41, 13.08s/it] 47%|████▋ | 764/1610 [3:09:49<3:06:14, 13.21s/it] {'loss': 0.0027, 'grad_norm': 1.0740278954168971, 'learning_rate': 5.254658385093167e-07, 'completion_length': 134.19643020629883, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.473214328289032, 'reward_std': 0.15933407843112946, 'kl': 0.06689453125, 'epoch': 2.37} 47%|████▋ | 764/1610 [3:09:49<3:06:14, 13.21s/it] 48%|████▊ | 765/1610 [3:10:01<3:03:35, 13.04s/it] {'loss': 0.0027, 'grad_norm': 1.153361866770614, 'learning_rate': 5.248447204968945e-07, 'completion_length': 134.31250762939453, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.17495085299015045, 'kl': 0.067626953125, 'epoch': 2.38} 48%|████▊ | 765/1610 [3:10:01<3:03:35, 13.04s/it] 48%|████▊ | 766/1610 [3:10:15<3:05:00, 13.15s/it] {'loss': 0.0024, 'grad_norm': 2.728915733229204, 'learning_rate': 5.242236024844721e-07, 'completion_length': 134.99107360839844, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6696429252624512, 'reward_std': 0.20020467042922974, 'kl': 0.058837890625, 'epoch': 2.38} 48%|████▊ | 766/1610 [3:10:15<3:05:00, 13.15s/it] 
48%|████▊ | 767/1610 [3:10:27<2:59:59, 12.81s/it] {'loss': 0.0029, 'grad_norm': 1.454500991675721, 'learning_rate': 5.236024844720497e-07, 'completion_length': 102.72322082519531, 'rewards/accuracy_reward': 0.625, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.17885926365852356, 'kl': 0.072265625, 'epoch': 2.38} 48%|████▊ | 767/1610 [3:10:27<2:59:59, 12.81s/it] 48%|████▊ | 768/1610 [3:10:39<2:58:49, 12.74s/it] {'loss': 0.0024, 'grad_norm': 1.2290974288195016, 'learning_rate': 5.229813664596274e-07, 'completion_length': 119.35714721679688, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.24289214611053467, 'kl': 0.0614013671875, 'epoch': 2.39} 48%|████▊ | 768/1610 [3:10:39<2:58:49, 12.74s/it] 48%|████▊ | 769/1610 [3:10:52<2:58:07, 12.71s/it] {'loss': 0.0023, 'grad_norm': 1.0405735148194242, 'learning_rate': 5.22360248447205e-07, 'completion_length': 140.71429443359375, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 1.0, 'reward': 1.2946429252624512, 'reward_std': 0.15872062370181084, 'kl': 0.0576171875, 'epoch': 2.39} 48%|████▊ | 769/1610 [3:10:52<2:58:07, 12.71s/it] 48%|████▊ | 770/1610 [3:11:05<2:58:05, 12.72s/it] {'loss': 0.0019, 'grad_norm': 2.460963679637301, 'learning_rate': 5.217391304347825e-07, 'completion_length': 119.16071701049805, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3928571939468384, 'reward_std': 0.4067251831293106, 'kl': 0.0478515625, 'epoch': 2.39} 48%|████▊ | 770/1610 [3:11:05<2:58:05, 12.72s/it] 48%|████▊ | 771/1610 [3:11:17<2:55:05, 12.52s/it] {'loss': 0.0023, 'grad_norm': 2.824495192677231, 'learning_rate': 5.211180124223602e-07, 'completion_length': 136.3303680419922, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.24620164930820465, 'kl': 0.0587158203125, 'epoch': 2.39} 
48%|████▊ | 771/1610 [3:11:17<2:55:05, 12.52s/it] 48%|████▊ | 772/1610 [3:11:28<2:49:51, 12.16s/it] {'loss': 0.0022, 'grad_norm': 1.126387935252328, 'learning_rate': 5.204968944099378e-07, 'completion_length': 129.68750381469727, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160714626312256, 'reward_std': 0.15360544621944427, 'kl': 0.0557861328125, 'epoch': 2.4} 48%|████▊ | 772/1610 [3:11:28<2:49:51, 12.16s/it] 48%|████▊ | 773/1610 [3:11:39<2:44:39, 11.80s/it] {'loss': 0.0022, 'grad_norm': 1.3949433349632938, 'learning_rate': 5.198757763975154e-07, 'completion_length': 104.41071701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.21192426979541779, 'kl': 0.0543212890625, 'epoch': 2.4} 48%|████▊ | 773/1610 [3:11:39<2:44:39, 11.80s/it] 48%|████▊ | 774/1610 [3:11:50<2:40:36, 11.53s/it] {'loss': 0.0029, 'grad_norm': 1.814406591779959, 'learning_rate': 5.192546583850932e-07, 'completion_length': 93.87500381469727, 'rewards/accuracy_reward': 0.5803571939468384, 'rewards/format_reward': 1.0, 'reward': 1.5803571939468384, 'reward_std': 0.31292495131492615, 'kl': 0.072509765625, 'epoch': 2.4} 48%|████▊ | 774/1610 [3:11:50<2:40:36, 11.53s/it] 48%|████▊ | 775/1610 [3:12:00<2:34:58, 11.14s/it] {'loss': 0.0025, 'grad_norm': 1.2887552060188436, 'learning_rate': 5.186335403726708e-07, 'completion_length': 105.83036422729492, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.19690078496932983, 'kl': 0.0631103515625, 'epoch': 2.41} 48%|████▊ | 775/1610 [3:12:00<2:34:58, 11.14s/it] 48%|████▊ | 776/1610 [3:12:10<2:30:42, 10.84s/it] {'loss': 0.0019, 'grad_norm': 1.1695995578497809, 'learning_rate': 5.180124223602484e-07, 'completion_length': 114.16071701049805, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 
0.17495647072792053, 'kl': 0.04833984375, 'epoch': 2.41} 48%|████▊ | 776/1610 [3:12:10<2:30:42, 10.84s/it] 48%|████▊ | 777/1610 [3:12:22<2:32:06, 10.96s/it] {'loss': 0.0023, 'grad_norm': 1.3134755439075918, 'learning_rate': 5.173913043478261e-07, 'completion_length': 115.90179443359375, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.16504816338419914, 'kl': 0.0587158203125, 'epoch': 2.41} 48%|████▊ | 777/1610 [3:12:22<2:32:06, 10.96s/it] 48%|████▊ | 778/1610 [3:12:34<2:37:27, 11.36s/it] {'loss': 0.0026, 'grad_norm': 1.5325153036408354, 'learning_rate': 5.167701863354037e-07, 'completion_length': 130.30358123779297, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553572535514832, 'reward_std': 0.2540072351694107, 'kl': 0.0654296875, 'epoch': 2.42} 48%|████▊ | 778/1610 [3:12:34<2:37:27, 11.36s/it] 48%|████▊ | 779/1610 [3:12:44<2:32:00, 10.98s/it] {'loss': 0.0025, 'grad_norm': 1.983888153255637, 'learning_rate': 5.161490683229813e-07, 'completion_length': 84.34821701049805, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.14969705045223236, 'kl': 0.062744140625, 'epoch': 2.42} 48%|████▊ | 779/1610 [3:12:44<2:32:00, 10.98s/it] 48%|████▊ | 780/1610 [3:12:55<2:31:26, 10.95s/it] {'loss': 0.0022, 'grad_norm': 1.713422722037486, 'learning_rate': 5.15527950310559e-07, 'completion_length': 92.98214721679688, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.15481781959533691, 'kl': 0.0560302734375, 'epoch': 2.42} 48%|████▊ | 780/1610 [3:12:55<2:31:26, 10.95s/it] 49%|████▊ | 781/1610 [3:13:06<2:30:14, 10.87s/it] {'loss': 0.0027, 'grad_norm': 2.008690242623735, 'learning_rate': 5.149068322981366e-07, 'completion_length': 102.83929443359375, 'rewards/accuracy_reward': 0.4821428656578064, 
'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.18397442996501923, 'kl': 0.066650390625, 'epoch': 2.43} 49%|████▊ | 781/1610 [3:13:06<2:30:14, 10.87s/it] 49%|████▊ | 782/1610 [3:13:16<2:28:54, 10.79s/it] {'loss': 0.0022, 'grad_norm': 1.5177924365420246, 'learning_rate': 5.142857142857142e-07, 'completion_length': 99.09821701049805, 'rewards/accuracy_reward': 0.6696428656578064, 'rewards/format_reward': 1.0, 'reward': 1.669642984867096, 'reward_std': 0.2540072277188301, 'kl': 0.0555419921875, 'epoch': 2.43} 49%|████▊ | 782/1610 [3:13:16<2:28:54, 10.79s/it] 49%|████▊ | 783/1610 [3:13:26<2:26:29, 10.63s/it] {'loss': 0.003, 'grad_norm': 3.0218247972543035, 'learning_rate': 5.13664596273292e-07, 'completion_length': 79.50000381469727, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.08747543022036552, 'kl': 0.07373046875, 'epoch': 2.43} 49%|████▊ | 783/1610 [3:13:26<2:26:29, 10.63s/it] 49%|████▊ | 784/1610 [3:13:38<2:29:19, 10.85s/it] {'loss': 0.0032, 'grad_norm': 0.9260444887085337, 'learning_rate': 5.130434782608696e-07, 'completion_length': 84.3214340209961, 'rewards/accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.08747543022036552, 'kl': 0.080078125, 'epoch': 2.43} 49%|████▊ | 784/1610 [3:13:38<2:29:19, 10.85s/it] 49%|████▉ | 785/1610 [3:13:49<2:31:02, 10.98s/it] {'loss': 0.0025, 'grad_norm': 1.2408862365896616, 'learning_rate': 5.124223602484472e-07, 'completion_length': 100.38393020629883, 'rewards/accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.12054044008255005, 'kl': 0.0626220703125, 'epoch': 2.44} 49%|████▉ | 785/1610 [3:13:49<2:31:02, 10.98s/it] 49%|████▉ | 786/1610 [3:14:02<2:38:11, 11.52s/it] {'loss': 0.0031, 'grad_norm': 1.0840534889358455, 'learning_rate': 5.118012422360249e-07, 'completion_length': 113.58929061889648, 
'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3392857909202576, 'reward_std': 0.1963018849492073, 'kl': 0.077392578125, 'epoch': 2.44} 49%|████▉ | 786/1610 [3:14:02<2:38:11, 11.52s/it] 49%|████▉ | 787/1610 [3:14:14<2:38:43, 11.57s/it] {'loss': 0.0025, 'grad_norm': 1.2363962766789816, 'learning_rate': 5.111801242236025e-07, 'completion_length': 98.89286041259766, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.17885925620794296, 'kl': 0.0631103515625, 'epoch': 2.44} 49%|████▉ | 787/1610 [3:14:14<2:38:43, 11.57s/it] 49%|████▉ | 788/1610 [3:14:25<2:36:17, 11.41s/it] {'loss': 0.003, 'grad_norm': 1.240059237884866, 'learning_rate': 5.105590062111801e-07, 'completion_length': 92.9910774230957, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.17226043343544006, 'kl': 0.074462890625, 'epoch': 2.45} 49%|████▉ | 788/1610 [3:14:25<2:36:17, 11.41s/it] 49%|████▉ | 789/1610 [3:14:34<2:28:58, 10.89s/it] {'loss': 0.0024, 'grad_norm': 1.0677288533766964, 'learning_rate': 5.099378881987578e-07, 'completion_length': 92.1339340209961, 'rewards/accuracy_reward': 0.4910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.15360544621944427, 'kl': 0.059814453125, 'epoch': 2.45} 49%|████▉ | 789/1610 [3:14:34<2:28:58, 10.89s/it] 49%|████▉ | 790/1610 [3:14:46<2:31:57, 11.12s/it] {'loss': 0.0027, 'grad_norm': 1.9324912488255568, 'learning_rate': 5.093167701863354e-07, 'completion_length': 102.12500381469727, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.25670325756073, 'kl': 0.0667724609375, 'epoch': 2.45} 49%|████▉ | 790/1610 [3:14:46<2:31:57, 11.12s/it] 49%|████▉ | 791/1610 [3:14:57<2:30:41, 11.04s/it] {'loss': 0.0022, 'grad_norm': 4.132344529092845, 'learning_rate': 
5.08695652173913e-07, 'completion_length': 111.65179061889648, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.22936688363552094, 'kl': 0.0560302734375, 'epoch': 2.46} 49%|████▉ | 791/1610 [3:14:57<2:30:41, 11.04s/it] 49%|████▉ | 792/1610 [3:15:08<2:30:48, 11.06s/it] {'loss': 0.0026, 'grad_norm': 1.0931542223435189, 'learning_rate': 5.080745341614908e-07, 'completion_length': 115.25893020629883, 'rewards/accuracy_reward': 0.294642873108387, 'rewards/format_reward': 1.0, 'reward': 1.2946429252624512, 'reward_std': 0.19838443398475647, 'kl': 0.0643310546875, 'epoch': 2.46} 49%|████▉ | 792/1610 [3:15:08<2:30:48, 11.06s/it] 49%|████▉ | 793/1610 [3:15:19<2:32:48, 11.22s/it] {'loss': 0.0027, 'grad_norm': 1.1168386296174468, 'learning_rate': 5.074534161490684e-07, 'completion_length': 100.56250381469727, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4017857909202576, 'reward_std': 0.21313665062189102, 'kl': 0.066650390625, 'epoch': 2.46} 49%|████▉ | 793/1610 [3:15:19<2:32:48, 11.22s/it] 49%|████▉ | 794/1610 [3:15:31<2:32:03, 11.18s/it] {'loss': 0.0025, 'grad_norm': 2.951552783570242, 'learning_rate': 5.068322981366459e-07, 'completion_length': 108.7410774230957, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4375000596046448, 'reward_std': 0.27804866433143616, 'kl': 0.0623779296875, 'epoch': 2.47} 49%|████▉ | 794/1610 [3:15:31<2:32:03, 11.18s/it] 49%|████▉ | 795/1610 [3:15:44<2:40:21, 11.81s/it] {'loss': 0.0031, 'grad_norm': 1.3431752167203002, 'learning_rate': 5.062111801242235e-07, 'completion_length': 127.70536041259766, 'rewards/accuracy_reward': 0.258928582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2500000596046448, 'reward_std': 0.18397442996501923, 'kl': 0.07763671875, 'epoch': 2.47} 49%|████▉ | 795/1610 [3:15:44<2:40:21, 11.81s/it] 49%|████▉ | 796/1610 
[3:15:56<2:39:42, 11.77s/it] {'loss': 0.0022, 'grad_norm': 0.9535123059650653, 'learning_rate': 5.055900621118012e-07, 'completion_length': 125.07143020629883, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857313156128, 'reward_std': 0.16141102463006973, 'kl': 0.0550537109375, 'epoch': 2.47} 49%|████▉ | 796/1610 [3:15:56<2:39:42, 11.77s/it] 50%|████▉ | 797/1610 [3:16:07<2:37:19, 11.61s/it] {'loss': 0.003, 'grad_norm': 1.1650565774208654, 'learning_rate': 5.049689440993788e-07, 'completion_length': 99.18750762939453, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.15481781959533691, 'kl': 0.073974609375, 'epoch': 2.48} 50%|████▉ | 797/1610 [3:16:07<2:37:19, 11.61s/it] 50%|████▉ | 798/1610 [3:16:17<2:33:35, 11.35s/it] {'loss': 0.003, 'grad_norm': 1.1398745708196716, 'learning_rate': 5.043478260869564e-07, 'completion_length': 108.40179061889648, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.14700662344694138, 'kl': 0.07470703125, 'epoch': 2.48} 50%|████▉ | 798/1610 [3:16:17<2:33:35, 11.35s/it] 50%|████▉ | 799/1610 [3:16:28<2:31:36, 11.22s/it] {'loss': 0.0032, 'grad_norm': 1.8980386623688648, 'learning_rate': 5.037267080745341e-07, 'completion_length': 96.47322082519531, 'rewards/accuracy_reward': 0.2767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.2767857313156128, 'reward_std': 0.11394162103533745, 'kl': 0.07958984375, 'epoch': 2.48} 50%|████▉ | 799/1610 [3:16:28<2:31:36, 11.22s/it] 50%|████▉ | 800/1610 [3:16:42<2:40:52, 11.92s/it] {'loss': 0.0029, 'grad_norm': 1.301598536358594, 'learning_rate': 5.031055900621117e-07, 'completion_length': 139.29464721679688, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.446428656578064, 'reward_std': 0.3020366281270981, 'kl': 0.073486328125, 'epoch': 
2.48} 50%|████▉ | 800/1610 [3:16:42<2:40:52, 11.92s/it] 50%|████▉ | 801/1610 [3:17:41<5:51:25, 26.06s/it] {'loss': 0.0023, 'grad_norm': 2.2215469382099364, 'learning_rate': 5.024844720496894e-07, 'completion_length': 136.1964340209961, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3750000596046448, 'reward_std': 0.28295449912548065, 'kl': 0.0576171875, 'epoch': 2.49} 50%|████▉ | 801/1610 [3:17:41<5:51:25, 26.06s/it] 50%|████▉ | 802/1610 [3:17:52<4:50:29, 21.57s/it] {'loss': 0.0021, 'grad_norm': 6.719792895924208, 'learning_rate': 5.018633540372671e-07, 'completion_length': 122.67857360839844, 'rewards/accuracy_reward': 0.3928571790456772, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.17885926365852356, 'kl': 0.0531005859375, 'epoch': 2.49} 50%|████▉ | 802/1610 [3:17:52<4:50:29, 21.57s/it] 50%|████▉ | 803/1610 [3:18:05<4:14:18, 18.91s/it] {'loss': 0.0027, 'grad_norm': 0.6974302860833719, 'learning_rate': 5.012422360248447e-07, 'completion_length': 120.76786422729492, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.10431019216775894, 'kl': 0.0667724609375, 'epoch': 2.49} 50%|████▉ | 803/1610 [3:18:05<4:14:18, 18.91s/it] 50%|████▉ | 804/1610 [3:18:18<3:50:08, 17.13s/it] {'loss': 0.0029, 'grad_norm': 1.0238225331766513, 'learning_rate': 5.006211180124223e-07, 'completion_length': 124.64286422729492, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5892857909202576, 'reward_std': 0.21765290200710297, 'kl': 0.0716552734375, 'epoch': 2.5} 50%|████▉ | 804/1610 [3:18:18<3:50:08, 17.13s/it] 50%|█████ | 805/1610 [3:18:29<3:26:07, 15.36s/it] {'loss': 0.0022, 'grad_norm': 1.4159671106021998, 'learning_rate': 5e-07, 'completion_length': 127.46428680419922, 'rewards/accuracy_reward': 0.3214285895228386, 'rewards/format_reward': 0.9910714626312256, 'reward': 
1.3125000596046448, 'reward_std': 0.266334667801857, 'kl': 0.0560302734375, 'epoch': 2.5} 50%|█████ | 805/1610 [3:18:29<3:26:07, 15.36s/it] 50%|█████ | 806/1610 [3:18:41<3:13:02, 14.41s/it] {'loss': 0.0024, 'grad_norm': 1.21107729413834, 'learning_rate': 4.993788819875776e-07, 'completion_length': 127.8035774230957, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4910714626312256, 'reward_std': 0.27144984900951385, 'kl': 0.0609130859375, 'epoch': 2.5} 50%|█████ | 806/1610 [3:18:41<3:13:02, 14.41s/it] 50%|█████ | 807/1610 [3:18:54<3:05:50, 13.89s/it] {'loss': 0.0023, 'grad_norm': 2.9115810211694777, 'learning_rate': 4.987577639751552e-07, 'completion_length': 133.06250762939453, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.3189248591661453, 'kl': 0.056640625, 'epoch': 2.51} 50%|█████ | 807/1610 [3:18:54<3:05:50, 13.89s/it] 50%|█████ | 808/1610 [3:19:08<3:04:39, 13.81s/it] {'loss': 0.0021, 'grad_norm': 1.0530057284034233, 'learning_rate': 4.981366459627329e-07, 'completion_length': 161.00000762939453, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.348214328289032, 'reward_std': 0.20411306619644165, 'kl': 0.0518798828125, 'epoch': 2.51} 50%|█████ | 808/1610 [3:19:08<3:04:39, 13.81s/it] 50%|█████ | 809/1610 [3:19:19<2:53:52, 13.02s/it] {'loss': 0.0027, 'grad_norm': 2.439019701615549, 'learning_rate': 4.975155279503105e-07, 'completion_length': 132.69643783569336, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160715222358704, 'reward_std': 0.24948537349700928, 'kl': 0.067138671875, 'epoch': 2.51} 50%|█████ | 809/1610 [3:19:19<2:53:52, 13.02s/it] 50%|█████ | 810/1610 [3:19:32<2:54:01, 13.05s/it] {'loss': 0.0019, 'grad_norm': 1.4654013153406942, 'learning_rate': 4.968944099378881e-07, 'completion_length': 151.7589340209961, 
'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3482143878936768, 'reward_std': 0.19239908456802368, 'kl': 0.048583984375, 'epoch': 2.52} 50%|█████ | 810/1610 [3:19:32<2:54:01, 13.05s/it] 50%|█████ | 811/1610 [3:19:45<2:55:28, 13.18s/it] {'loss': 0.0023, 'grad_norm': 11.129191878727546, 'learning_rate': 4.962732919254658e-07, 'completion_length': 135.62500381469727, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.5803572535514832, 'reward_std': 0.36547060310840607, 'kl': 0.056884765625, 'epoch': 2.52} 50%|█████ | 811/1610 [3:19:45<2:55:28, 13.18s/it] 50%|█████ | 812/1610 [3:19:58<2:53:05, 13.01s/it] {'loss': 0.0025, 'grad_norm': 0.9450129365839642, 'learning_rate': 4.956521739130435e-07, 'completion_length': 124.9464340209961, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.14006561040878296, 'kl': 0.0628662109375, 'epoch': 2.52} 50%|█████ | 812/1610 [3:19:58<2:53:05, 13.01s/it] 50%|█████ | 813/1610 [3:20:12<2:55:30, 13.21s/it] {'loss': 0.0022, 'grad_norm': 1.3330875329889698, 'learning_rate': 4.950310559006211e-07, 'completion_length': 160.7589340209961, 'rewards/accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.23925502598285675, 'kl': 0.0550537109375, 'epoch': 2.52} 50%|█████ | 813/1610 [3:20:12<2:55:30, 13.21s/it] 51%|█████ | 814/1610 [3:20:26<2:58:22, 13.45s/it] {'loss': 0.0023, 'grad_norm': 1.1130424532781542, 'learning_rate': 4.944099378881988e-07, 'completion_length': 141.8928680419922, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.973214328289032, 'reward': 1.4196428656578064, 'reward_std': 0.21543646603822708, 'kl': 0.0577392578125, 'epoch': 2.53} 51%|█████ | 814/1610 [3:20:26<2:58:22, 13.45s/it] 51%|█████ | 815/1610 [3:20:39<2:59:01, 13.51s/it] {'loss': 0.0029, 'grad_norm': 
1.210430601866997, 'learning_rate': 4.937888198757764e-07, 'completion_length': 158.6339340209961, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.973214328289032, 'reward': 1.6517857909202576, 'reward_std': 0.2702430933713913, 'kl': 0.072265625, 'epoch': 2.53} 51%|█████ | 815/1610 [3:20:39<2:59:01, 13.51s/it] 51%|█████ | 816/1610 [3:20:51<2:52:33, 13.04s/it] {'loss': 0.0023, 'grad_norm': 1.042936182620325, 'learning_rate': 4.93167701863354e-07, 'completion_length': 121.98214721679688, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.14518077671527863, 'kl': 0.0572509765625, 'epoch': 2.53} 51%|█████ | 816/1610 [3:20:51<2:52:33, 13.04s/it] 51%|█████ | 817/1610 [3:21:02<2:43:42, 12.39s/it] {'loss': 0.002, 'grad_norm': 1.7463727251525036, 'learning_rate': 4.925465838509317e-07, 'completion_length': 126.76786041259766, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.473214328289032, 'reward_std': 0.12956400960683823, 'kl': 0.0509033203125, 'epoch': 2.54} 51%|█████ | 817/1610 [3:21:02<2:43:42, 12.39s/it] 51%|█████ | 818/1610 [3:21:14<2:40:30, 12.16s/it] {'loss': 0.0024, 'grad_norm': 1.4785771039867208, 'learning_rate': 4.919254658385093e-07, 'completion_length': 113.2410774230957, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.26181841641664505, 'kl': 0.05908203125, 'epoch': 2.54} 51%|█████ | 818/1610 [3:21:14<2:40:30, 12.16s/it] 51%|█████ | 819/1610 [3:21:26<2:40:03, 12.14s/it] {'loss': 0.0026, 'grad_norm': 1.23778696948479, 'learning_rate': 4.913043478260869e-07, 'completion_length': 122.15179061889648, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.12626906484365463, 'kl': 0.065673828125, 'epoch': 2.54} 51%|█████ | 819/1610 [3:21:26<2:40:03, 12.14s/it] 51%|█████ | 
820/1610 [3:21:38<2:39:41, 12.13s/it] {'loss': 0.0027, 'grad_norm': 1.246182957686771, 'learning_rate': 4.906832298136646e-07, 'completion_length': 124.84822082519531, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.508928656578064, 'reward_std': 0.24883297085762024, 'kl': 0.0667724609375, 'epoch': 2.55} 51%|█████ | 820/1610 [3:21:38<2:39:41, 12.13s/it] 51%|█████ | 821/1610 [3:21:50<2:39:33, 12.13s/it] {'loss': 0.003, 'grad_norm': 1.628809476615648, 'learning_rate': 4.900621118012422e-07, 'completion_length': 111.50000381469727, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.147871196269989, 'kl': 0.0751953125, 'epoch': 2.55} 51%|█████ | 821/1610 [3:21:50<2:39:33, 12.13s/it] 51%|█████ | 822/1610 [3:22:03<2:43:04, 12.42s/it] {'loss': 0.0028, 'grad_norm': 1.210360764417995, 'learning_rate': 4.894409937888198e-07, 'completion_length': 118.9910774230957, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803571939468384, 'reward_std': 0.1827620565891266, 'kl': 0.0693359375, 'epoch': 2.55} 51%|█████ | 822/1610 [3:22:03<2:43:04, 12.42s/it] 51%|█████ | 823/1610 [3:22:15<2:40:25, 12.23s/it] {'loss': 0.0023, 'grad_norm': 1.4990954572747994, 'learning_rate': 4.888198757763975e-07, 'completion_length': 111.04465103149414, 'rewards/accuracy_reward': 0.4910714328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.4642857313156128, 'reward_std': 0.32665716111660004, 'kl': 0.05859375, 'epoch': 2.56} 51%|█████ | 823/1610 [3:22:15<2:40:25, 12.23s/it] 51%|█████ | 824/1610 [3:22:25<2:33:27, 11.71s/it] {'loss': 0.0027, 'grad_norm': 0.36994245894514793, 'learning_rate': 4.881987577639751e-07, 'completion_length': 94.3660774230957, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.033065006136894226, 'kl': 0.0677490234375, 'epoch': 2.56} 
51%|█████ | 824/1610 [3:22:25<2:33:27, 11.71s/it] 51%|█████ | 825/1610 [3:22:36<2:30:42, 11.52s/it] {'loss': 0.0031, 'grad_norm': 1.9338728720009768, 'learning_rate': 4.875776397515527e-07, 'completion_length': 96.59822082519531, 'rewards/accuracy_reward': 0.6160714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6071429252624512, 'reward_std': 0.3320491760969162, 'kl': 0.07666015625, 'epoch': 2.56} 51%|█████ | 825/1610 [3:22:36<2:30:42, 11.52s/it] 51%|█████▏ | 826/1610 [3:22:48<2:28:47, 11.39s/it] {'loss': 0.002, 'grad_norm': 2.8477501568672148, 'learning_rate': 4.869565217391305e-07, 'completion_length': 117.16964721679688, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.22094222903251648, 'kl': 0.0491943359375, 'epoch': 2.57} 51%|█████▏ | 826/1610 [3:22:48<2:28:47, 11.39s/it] 51%|█████▏ | 827/1610 [3:22:58<2:26:16, 11.21s/it] {'loss': 0.0023, 'grad_norm': 1.3400540724806596, 'learning_rate': 4.863354037267081e-07, 'completion_length': 110.58929061889648, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5267857909202576, 'reward_std': 0.15872060880064964, 'kl': 0.0576171875, 'epoch': 2.57} 51%|█████▏ | 827/1610 [3:22:58<2:26:16, 11.21s/it] 51%|█████▏ | 828/1610 [3:23:10<2:28:26, 11.39s/it] {'loss': 0.0023, 'grad_norm': 1.110488028737309, 'learning_rate': 4.857142857142857e-07, 'completion_length': 115.2589340209961, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.383928656578064, 'reward_std': 0.12626906484365463, 'kl': 0.0567626953125, 'epoch': 2.57} 51%|█████▏ | 828/1610 [3:23:10<2:28:26, 11.39s/it] 51%|█████▏ | 829/1610 [3:23:23<2:33:19, 11.78s/it] {'loss': 0.0021, 'grad_norm': 1.166661567977886, 'learning_rate': 4.850931677018633e-07, 'completion_length': 129.98215103149414, 'rewards/accuracy_reward': 0.4642857164144516, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.15933407470583916, 'kl': 0.0537109375, 'epoch': 2.57} 51%|█████▏ | 829/1610 [3:23:23<2:33:19, 11.78s/it] 52%|█████▏ | 830/1610 [3:23:35<2:36:02, 12.00s/it] {'loss': 0.0024, 'grad_norm': 1.7421248661338617, 'learning_rate': 4.84472049689441e-07, 'completion_length': 117.68750381469727, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3125000596046448, 'reward_std': 0.19751426577568054, 'kl': 0.0604248046875, 'epoch': 2.58} 52%|█████▏ | 830/1610 [3:23:35<2:36:02, 12.00s/it] 52%|█████▏ | 831/1610 [3:23:48<2:38:36, 12.22s/it] {'loss': 0.0024, 'grad_norm': 1.6791682512185486, 'learning_rate': 4.838509316770186e-07, 'completion_length': 136.2946548461914, 'rewards/accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3035715222358704, 'reward_std': 0.33488383889198303, 'kl': 0.0594482421875, 'epoch': 2.58} 52%|█████▏ | 831/1610 [3:23:48<2:38:36, 12.22s/it] 52%|█████▏ | 832/1610 [3:23:58<2:29:52, 11.56s/it] {'loss': 0.0031, 'grad_norm': 1.3245300584279758, 'learning_rate': 4.832298136645963e-07, 'completion_length': 100.5089340209961, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.17616324126720428, 'kl': 0.077392578125, 'epoch': 2.58} 52%|█████▏ | 832/1610 [3:23:58<2:29:52, 11.56s/it] 52%|█████▏ | 833/1610 [3:24:10<2:31:56, 11.73s/it] {'loss': 0.003, 'grad_norm': 1.2258844258945751, 'learning_rate': 4.826086956521739e-07, 'completion_length': 94.1339340209961, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.18849068880081177, 'kl': 0.0755615234375, 'epoch': 2.59} 52%|█████▏ | 833/1610 [3:24:10<2:31:56, 11.73s/it] 52%|█████▏ | 834/1610 [3:24:20<2:25:55, 11.28s/it] {'loss': 0.0022, 'grad_norm': 1.8128643561202733, 'learning_rate': 4.819875776397515e-07, 
'completion_length': 97.29464721679688, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910714626312256, 'reward_std': 0.12054043263196945, 'kl': 0.0543212890625, 'epoch': 2.59} 52%|█████▏ | 834/1610 [3:24:20<2:25:55, 11.28s/it] 52%|█████▏ | 835/1610 [3:24:32<2:27:45, 11.44s/it] {'loss': 0.0028, 'grad_norm': 0.8648232002315184, 'learning_rate': 4.813664596273292e-07, 'completion_length': 100.8214340209961, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3125000596046448, 'reward_std': 0.07514797151088715, 'kl': 0.07080078125, 'epoch': 2.59} 52%|█████▏ | 835/1610 [3:24:32<2:27:45, 11.44s/it] 52%|█████▏ | 836/1610 [3:24:45<2:33:24, 11.89s/it] {'loss': 0.0029, 'grad_norm': 1.8671556446361766, 'learning_rate': 4.807453416149068e-07, 'completion_length': 101.16964721679688, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.16323686763644218, 'kl': 0.072021484375, 'epoch': 2.6} 52%|█████▏ | 836/1610 [3:24:45<2:33:24, 11.89s/it] 52%|█████▏ | 837/1610 [3:24:56<2:30:11, 11.66s/it] {'loss': 0.0025, 'grad_norm': 2.131971982368521, 'learning_rate': 4.801242236024844e-07, 'completion_length': 99.5714340209961, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.19057324528694153, 'kl': 0.0634765625, 'epoch': 2.6} 52%|█████▏ | 837/1610 [3:24:56<2:30:11, 11.66s/it] 52%|█████▏ | 838/1610 [3:25:10<2:38:36, 12.33s/it] {'loss': 0.0032, 'grad_norm': 2.71096824412628, 'learning_rate': 4.795031055900621e-07, 'completion_length': 102.29464721679688, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3928572535514832, 'reward_std': 0.25521961599588394, 'kl': 0.0791015625, 'epoch': 2.6} 52%|█████▏ | 838/1610 [3:25:10<2:38:36, 12.33s/it] 52%|█████▏ | 839/1610 [3:25:23<2:40:17, 12.47s/it] 
{'loss': 0.0025, 'grad_norm': 1.701514197673171, 'learning_rate': 4.788819875776398e-07, 'completion_length': 111.50000381469727, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4910714626312256, 'reward_std': 0.29367105662822723, 'kl': 0.061767578125, 'epoch': 2.61} 52%|█████▏ | 839/1610 [3:25:23<2:40:17, 12.47s/it] 52%|█████▏ | 840/1610 [3:25:35<2:37:35, 12.28s/it] {'loss': 0.0027, 'grad_norm': 2.819125668465511, 'learning_rate': 4.782608695652174e-07, 'completion_length': 113.59822082519531, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.2053254395723343, 'kl': 0.06787109375, 'epoch': 2.61} 52%|█████▏ | 840/1610 [3:25:35<2:37:35, 12.28s/it] 52%|█████▏ | 841/1610 [3:25:48<2:42:29, 12.68s/it] {'loss': 0.0029, 'grad_norm': 2.863243269568059, 'learning_rate': 4.77639751552795e-07, 'completion_length': 113.5089340209961, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.696428656578064, 'reward_std': 0.3219630718231201, 'kl': 0.072265625, 'epoch': 2.61} 52%|█████▏ | 841/1610 [3:25:48<2:42:29, 12.68s/it] 52%|█████▏ | 842/1610 [3:26:00<2:38:47, 12.41s/it] {'loss': 0.0026, 'grad_norm': 2.2764394401549297, 'learning_rate': 4.770186335403726e-07, 'completion_length': 104.93750381469727, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.2338685840368271, 'kl': 0.0640869140625, 'epoch': 2.61} 52%|█████▏ | 842/1610 [3:26:00<2:38:47, 12.41s/it] 52%|█████▏ | 843/1610 [3:26:11<2:34:00, 12.05s/it] {'loss': 0.0033, 'grad_norm': 1.3855307274030784, 'learning_rate': 4.763975155279503e-07, 'completion_length': 96.29464721679688, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660715222358704, 'reward_std': 0.182762049138546, 'kl': 0.083740234375, 'epoch': 2.62} 52%|█████▏ | 843/1610 
[3:26:11<2:34:00, 12.05s/it] 52%|█████▏ | 844/1610 [3:26:23<2:30:43, 11.81s/it] {'loss': 0.0027, 'grad_norm': 1.2605883509382614, 'learning_rate': 4.7577639751552796e-07, 'completion_length': 99.06250381469727, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.15360544621944427, 'kl': 0.068115234375, 'epoch': 2.62} 52%|█████▏ | 844/1610 [3:26:23<2:30:43, 11.81s/it] 52%|█████▏ | 845/1610 [3:26:33<2:23:01, 11.22s/it] {'loss': 0.0023, 'grad_norm': 1.267290767317729, 'learning_rate': 4.751552795031056e-07, 'completion_length': 97.10714721679688, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910714626312256, 'reward_std': 0.18787721544504166, 'kl': 0.056396484375, 'epoch': 2.62} 52%|█████▏ | 845/1610 [3:26:33<2:23:01, 11.22s/it] 53%|█████▎ | 846/1610 [3:26:43<2:19:30, 10.96s/it] {'loss': 0.0031, 'grad_norm': 1.5092802553257032, 'learning_rate': 4.7453416149068323e-07, 'completion_length': 91.18750381469727, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.18787721544504166, 'kl': 0.078369140625, 'epoch': 2.63} 53%|█████▎ | 846/1610 [3:26:43<2:19:30, 10.96s/it] 53%|█████▎ | 847/1610 [3:26:54<2:20:31, 11.05s/it] {'loss': 0.0025, 'grad_norm': 1.1642922358299341, 'learning_rate': 4.739130434782608e-07, 'completion_length': 87.64286041259766, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4910714626312256, 'reward_std': 0.18214857578277588, 'kl': 0.0618896484375, 'epoch': 2.63} 53%|█████▎ | 847/1610 [3:26:54<2:20:31, 11.05s/it] 53%|█████▎ | 848/1610 [3:27:05<2:18:38, 10.92s/it] {'loss': 0.0034, 'grad_norm': 1.5555328022206927, 'learning_rate': 4.732919254658385e-07, 'completion_length': 88.31250381469727, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 
'reward_std': 0.25583308935165405, 'kl': 0.08447265625, 'epoch': 2.63} 53%|█████▎ | 848/1610 [3:27:05<2:18:38, 10.92s/it] 53%|█████▎ | 849/1610 [3:27:15<2:17:32, 10.84s/it] {'loss': 0.0029, 'grad_norm': 2.0869909873809287, 'learning_rate': 4.7267080745341613e-07, 'completion_length': 100.36607360839844, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 1.0, 'reward': 1.508928656578064, 'reward_std': 0.1866704523563385, 'kl': 0.071533203125, 'epoch': 2.64} 53%|█████▎ | 849/1610 [3:27:15<2:17:32, 10.84s/it] 53%|█████▎ | 850/1610 [3:27:27<2:19:24, 11.01s/it] {'loss': 0.0029, 'grad_norm': 1.2170406991468503, 'learning_rate': 4.7204968944099376e-07, 'completion_length': 104.71429061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5625000596046448, 'reward_std': 0.24229325354099274, 'kl': 0.07275390625, 'epoch': 2.64} 53%|█████▎ | 850/1610 [3:27:27<2:19:24, 11.01s/it] 53%|█████▎ | 851/1610 [3:27:37<2:16:50, 10.82s/it] {'loss': 0.0027, 'grad_norm': 0.9791331733428708, 'learning_rate': 4.714285714285714e-07, 'completion_length': 87.16071701049805, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.14970264956355095, 'kl': 0.0665283203125, 'epoch': 2.64} 53%|█████▎ | 851/1610 [3:27:37<2:16:50, 10.82s/it] 53%|█████▎ | 852/1610 [3:27:48<2:17:25, 10.88s/it] {'loss': 0.0025, 'grad_norm': 0.936141673921781, 'learning_rate': 4.70807453416149e-07, 'completion_length': 93.62500381469727, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910714626312256, 'reward_std': 0.1030978113412857, 'kl': 0.063232421875, 'epoch': 2.65} 53%|█████▎ | 852/1610 [3:27:48<2:17:25, 10.88s/it] 53%|█████▎ | 853/1610 [3:27:58<2:14:08, 10.63s/it] {'loss': 0.0029, 'grad_norm': 1.929820745328979, 'learning_rate': 4.701863354037267e-07, 'completion_length': 93.4285774230957, 'rewards/accuracy_reward': 
0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.17226044833660126, 'kl': 0.072265625, 'epoch': 2.65} 53%|█████▎ | 853/1610 [3:27:58<2:14:08, 10.63s/it] 53%|█████▎ | 854/1610 [3:28:11<2:22:43, 11.33s/it] {'loss': 0.0029, 'grad_norm': 0.8114687221486471, 'learning_rate': 4.6956521739130434e-07, 'completion_length': 103.41964721679688, 'rewards/accuracy_reward': 0.2321428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.223214328289032, 'reward_std': 0.12565560638904572, 'kl': 0.0733642578125, 'epoch': 2.65} 53%|█████▎ | 854/1610 [3:28:11<2:22:43, 11.33s/it] 53%|█████▎ | 855/1610 [3:28:23<2:23:26, 11.40s/it] {'loss': 0.0023, 'grad_norm': 1.4263700232113379, 'learning_rate': 4.68944099378882e-07, 'completion_length': 115.52679443359375, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.2501044347882271, 'kl': 0.05859375, 'epoch': 2.66} 53%|█████▎ | 855/1610 [3:28:23<2:23:26, 11.40s/it] 53%|█████▎ | 856/1610 [3:28:34<2:23:44, 11.44s/it] {'loss': 0.0028, 'grad_norm': 1.1168389690584306, 'learning_rate': 4.683229813664596e-07, 'completion_length': 126.59822082519531, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.21313104033470154, 'kl': 0.0693359375, 'epoch': 2.66} 53%|█████▎ | 856/1610 [3:28:34<2:23:44, 11.44s/it] 53%|█████▎ | 857/1610 [3:28:46<2:23:07, 11.40s/it] {'loss': 0.0033, 'grad_norm': 1.7063935263026373, 'learning_rate': 4.6770186335403724e-07, 'completion_length': 113.91964721679688, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.19178562611341476, 'kl': 0.083251953125, 'epoch': 2.66} 53%|█████▎ | 857/1610 [3:28:46<2:23:07, 11.40s/it] 53%|█████▎ | 858/1610 [3:28:58<2:24:42, 11.55s/it] {'loss': 0.003, 'grad_norm': 1.4713616055886067, 'learning_rate': 
4.670807453416149e-07, 'completion_length': 116.9464340209961, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.1418914571404457, 'kl': 0.074462890625, 'epoch': 2.66} 53%|█████▎ | 858/1610 [3:28:58<2:24:42, 11.55s/it] 53%|█████▎ | 859/1610 [3:29:11<2:30:17, 12.01s/it] {'loss': 0.0027, 'grad_norm': 1.1143024994015223, 'learning_rate': 4.664596273291925e-07, 'completion_length': 132.3482208251953, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3928571939468384, 'reward_std': 0.14579424262046814, 'kl': 0.068359375, 'epoch': 2.67} 53%|█████▎ | 859/1610 [3:29:11<2:30:17, 12.01s/it] 53%|█████▎ | 860/1610 [3:29:23<2:33:08, 12.25s/it] {'loss': 0.0022, 'grad_norm': 2.0971556495069743, 'learning_rate': 4.6583850931677014e-07, 'completion_length': 138.63394165039062, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5178572535514832, 'reward_std': 0.3123260587453842, 'kl': 0.0537109375, 'epoch': 2.67} 53%|█████▎ | 860/1610 [3:29:23<2:33:08, 12.25s/it] 53%|█████▎ | 861/1610 [3:29:36<2:33:17, 12.28s/it] {'loss': 0.0027, 'grad_norm': 1.5809336082857142, 'learning_rate': 4.6521739130434777e-07, 'completion_length': 123.89286041259766, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.19057324528694153, 'kl': 0.0675048828125, 'epoch': 2.67} 53%|█████▎ | 861/1610 [3:29:36<2:33:17, 12.28s/it] 54%|█████▎ | 862/1610 [3:29:48<2:32:14, 12.21s/it] {'loss': 0.0025, 'grad_norm': 1.4179169095393285, 'learning_rate': 4.6459627329192546e-07, 'completion_length': 115.99107360839844, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5178571939468384, 'reward_std': 0.20532545447349548, 'kl': 0.0635986328125, 'epoch': 2.68} 54%|█████▎ | 862/1610 [3:29:48<2:32:14, 12.21s/it] 54%|█████▎ | 
863/1610 [3:30:01<2:36:44, 12.59s/it] {'loss': 0.003, 'grad_norm': 0.8763223583341853, 'learning_rate': 4.639751552795031e-07, 'completion_length': 107.23214721679688, 'rewards/accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.535714328289032, 'reward_std': 0.07839837297797203, 'kl': 0.07421875, 'epoch': 2.68} 54%|█████▎ | 863/1610 [3:30:01<2:36:44, 12.59s/it] 54%|█████▎ | 864/1610 [3:30:16<2:43:21, 13.14s/it] {'loss': 0.0026, 'grad_norm': 24.848975734692413, 'learning_rate': 4.633540372670807e-07, 'completion_length': 135.3482208251953, 'rewards/accuracy_reward': 0.4285714402794838, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.410714328289032, 'reward_std': 0.1929979920387268, 'kl': 0.065185546875, 'epoch': 2.68} 54%|█████▎ | 864/1610 [3:30:16<2:43:21, 13.14s/it] 54%|█████▎ | 865/1610 [3:30:28<2:39:46, 12.87s/it] {'loss': 0.0026, 'grad_norm': 2.229290066337504, 'learning_rate': 4.6273291925465835e-07, 'completion_length': 112.83929061889648, 'rewards/accuracy_reward': 0.3660714626312256, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.2059243693947792, 'kl': 0.0645751953125, 'epoch': 2.69} 54%|█████▎ | 865/1610 [3:30:28<2:39:46, 12.87s/it] 54%|█████▍ | 866/1610 [3:30:42<2:42:56, 13.14s/it] {'loss': 0.0034, 'grad_norm': 6.070569887699327, 'learning_rate': 4.62111801242236e-07, 'completion_length': 115.17857360839844, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3571428656578064, 'reward_std': 0.1780308485031128, 'kl': 0.0859375, 'epoch': 2.69} 54%|█████▍ | 866/1610 [3:30:42<2:42:56, 13.14s/it] 54%|█████▍ | 867/1610 [3:30:54<2:38:51, 12.83s/it] {'loss': 0.0027, 'grad_norm': 1.625004460630735, 'learning_rate': 4.6149068322981367e-07, 'completion_length': 121.40179061889648, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5892857909202576, 'reward_std': 
0.26897160708904266, 'kl': 0.067138671875, 'epoch': 2.69} 54%|█████▍ | 867/1610 [3:30:54<2:38:51, 12.83s/it] 54%|█████▍ | 868/1610 [3:31:06<2:38:01, 12.78s/it] {'loss': 0.0033, 'grad_norm': 0.8115205695525619, 'learning_rate': 4.608695652173913e-07, 'completion_length': 116.57143783569336, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.16141663491725922, 'kl': 0.083740234375, 'epoch': 2.7} 54%|█████▍ | 868/1610 [3:31:06<2:38:01, 12.78s/it] 54%|█████▍ | 869/1610 [3:31:17<2:30:59, 12.23s/it] {'loss': 0.003, 'grad_norm': 1.2756197863935148, 'learning_rate': 4.6024844720496894e-07, 'completion_length': 106.6964340209961, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.17164696753025055, 'kl': 0.0743408203125, 'epoch': 2.7} 54%|█████▍ | 869/1610 [3:31:17<2:30:59, 12.23s/it] 54%|█████▍ | 870/1610 [3:31:29<2:28:43, 12.06s/it] {'loss': 0.0023, 'grad_norm': 1.5030083177258795, 'learning_rate': 4.596273291925465e-07, 'completion_length': 131.64286041259766, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.7053571939468384, 'reward_std': 0.3336714655160904, 'kl': 0.056884765625, 'epoch': 2.7} 54%|█████▍ | 870/1610 [3:31:29<2:28:43, 12.06s/it] 54%|█████▍ | 871/1610 [3:31:43<2:34:36, 12.55s/it] {'loss': 0.0024, 'grad_norm': 1.8427910189108854, 'learning_rate': 4.590062111801242e-07, 'completion_length': 145.0535774230957, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.446428656578064, 'reward_std': 0.2880696803331375, 'kl': 0.06005859375, 'epoch': 2.7} 54%|█████▍ | 871/1610 [3:31:43<2:34:36, 12.55s/it] 54%|█████▍ | 872/1610 [3:31:55<2:33:08, 12.45s/it] {'loss': 0.0023, 'grad_norm': 1.9125446376669668, 'learning_rate': 4.5838509316770183e-07, 'completion_length': 134.75000381469727, 
'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6250000596046448, 'reward_std': 0.3985890746116638, 'kl': 0.0572509765625, 'epoch': 2.71} 54%|█████▍ | 872/1610 [3:31:55<2:33:08, 12.45s/it] 54%|█████▍ | 873/1610 [3:32:08<2:36:32, 12.74s/it] {'loss': 0.0027, 'grad_norm': 1.2795867950737092, 'learning_rate': 4.5776397515527947e-07, 'completion_length': 130.55357360839844, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.30080442130565643, 'kl': 0.067626953125, 'epoch': 2.71} 54%|█████▍ | 873/1610 [3:32:08<2:36:32, 12.74s/it] 54%|█████▍ | 874/1610 [3:32:20<2:30:36, 12.28s/it] {'loss': 0.0022, 'grad_norm': 1.3850839170936793, 'learning_rate': 4.571428571428571e-07, 'completion_length': 107.31250381469727, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.14579425007104874, 'kl': 0.053955078125, 'epoch': 2.71} 54%|█████▍ | 874/1610 [3:32:20<2:30:36, 12.28s/it] 54%|█████▍ | 875/1610 [3:32:33<2:32:50, 12.48s/it] {'loss': 0.0026, 'grad_norm': 2.2789377014646566, 'learning_rate': 4.5652173913043473e-07, 'completion_length': 119.27679061889648, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.473214328289032, 'reward_std': 0.23711898922920227, 'kl': 0.06591796875, 'epoch': 2.72} 54%|█████▍ | 875/1610 [3:32:33<2:32:50, 12.48s/it] 54%|█████▍ | 876/1610 [3:32:46<2:35:33, 12.72s/it] {'loss': 0.0028, 'grad_norm': 0.5174913602702943, 'learning_rate': 4.559006211180124e-07, 'completion_length': 108.64286041259766, 'rewards/accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6428571939468384, 'reward_std': 0.05050762742757797, 'kl': 0.0701904296875, 'epoch': 2.72} 54%|█████▍ | 876/1610 [3:32:46<2:35:33, 12.72s/it] 54%|█████▍ | 877/1610 [3:32:57<2:29:00, 12.20s/it] {'loss': 
0.0022, 'grad_norm': 1.7061037600071864, 'learning_rate': 4.5527950310559005e-07, 'completion_length': 112.1160774230957, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4910714626312256, 'reward_std': 0.23717807233333588, 'kl': 0.05615234375, 'epoch': 2.72} 54%|█████▍ | 877/1610 [3:32:57<2:29:00, 12.20s/it] 55%|█████▍ | 878/1610 [3:33:10<2:31:18, 12.40s/it] {'loss': 0.0034, 'grad_norm': 2.034137550434251, 'learning_rate': 4.546583850931677e-07, 'completion_length': 116.6160774230957, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5267857909202576, 'reward_std': 0.2247915416955948, 'kl': 0.0859375, 'epoch': 2.73} 55%|█████▍ | 878/1610 [3:33:10<2:31:18, 12.40s/it] 55%|█████▍ | 879/1610 [3:33:21<2:28:46, 12.21s/it] {'loss': 0.0028, 'grad_norm': 1.332728798944579, 'learning_rate': 4.540372670807453e-07, 'completion_length': 113.56250762939453, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.18397442996501923, 'kl': 0.071044921875, 'epoch': 2.73} 55%|█████▍ | 879/1610 [3:33:21<2:28:46, 12.21s/it] 55%|█████▍ | 880/1610 [3:33:33<2:26:48, 12.07s/it] {'loss': 0.0023, 'grad_norm': 1.0227043226031372, 'learning_rate': 4.53416149068323e-07, 'completion_length': 124.15178680419922, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3392857909202576, 'reward_std': 0.20021027326583862, 'kl': 0.057861328125, 'epoch': 2.73} 55%|█████▍ | 880/1610 [3:33:33<2:26:48, 12.07s/it] 55%|█████▍ | 881/1610 [3:33:45<2:25:49, 12.00s/it] {'loss': 0.0031, 'grad_norm': 1.58865430994703, 'learning_rate': 4.5279503105590063e-07, 'completion_length': 117.68750381469727, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.11663763970136642, 'kl': 0.0771484375, 'epoch': 2.74} 55%|█████▍ | 881/1610 
[3:33:45<2:25:49, 12.00s/it] 55%|█████▍ | 882/1610 [3:33:57<2:24:03, 11.87s/it] {'loss': 0.0027, 'grad_norm': 1.3973581798701005, 'learning_rate': 4.521739130434782e-07, 'completion_length': 104.21429061889648, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.535714328289032, 'reward_std': 0.22875342518091202, 'kl': 0.068359375, 'epoch': 2.74} 55%|█████▍ | 882/1610 [3:33:57<2:24:03, 11.87s/it] 55%|█████▍ | 883/1610 [3:34:09<2:25:45, 12.03s/it] {'loss': 0.0032, 'grad_norm': 1.4418123722748666, 'learning_rate': 4.5155279503105585e-07, 'completion_length': 113.51786041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5625001192092896, 'reward_std': 0.19239908456802368, 'kl': 0.0791015625, 'epoch': 2.74} 55%|█████▍ | 883/1610 [3:34:09<2:25:45, 12.03s/it] 55%|█████▍ | 884/1610 [3:34:20<2:22:50, 11.81s/it] {'loss': 0.0037, 'grad_norm': 1.3864189185180902, 'learning_rate': 4.509316770186335e-07, 'completion_length': 106.45536041259766, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.16444924473762512, 'kl': 0.09326171875, 'epoch': 2.75} 55%|█████▍ | 884/1610 [3:34:20<2:22:50, 11.81s/it] 55%|█████▍ | 885/1610 [3:34:33<2:24:47, 11.98s/it] {'loss': 0.0027, 'grad_norm': 1.2786851155206511, 'learning_rate': 4.5031055900621116e-07, 'completion_length': 114.41964721679688, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.410714328289032, 'reward_std': 0.21703943610191345, 'kl': 0.06884765625, 'epoch': 2.75} 55%|█████▍ | 885/1610 [3:34:33<2:24:47, 11.98s/it] 55%|█████▌ | 886/1610 [3:34:44<2:23:07, 11.86s/it] {'loss': 0.0026, 'grad_norm': 1.0924839621212772, 'learning_rate': 4.496894409937888e-07, 'completion_length': 127.18750762939453, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 
1.6160715222358704, 'reward_std': 0.22954470664262772, 'kl': 0.065673828125, 'epoch': 2.75} 55%|█████▌ | 886/1610 [3:34:44<2:23:07, 11.86s/it] 55%|█████▌ | 887/1610 [3:34:56<2:23:09, 11.88s/it] {'loss': 0.0028, 'grad_norm': 1.327128562021753, 'learning_rate': 4.4906832298136643e-07, 'completion_length': 118.16964721679688, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3839285969734192, 'reward_std': 0.21434341371059418, 'kl': 0.0697021484375, 'epoch': 2.75} 55%|█████▌ | 887/1610 [3:34:56<2:23:09, 11.88s/it] 55%|█████▌ | 888/1610 [3:35:09<2:27:06, 12.23s/it] {'loss': 0.003, 'grad_norm': 3.585452925726725, 'learning_rate': 4.4844720496894406e-07, 'completion_length': 118.73215103149414, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3392857909202576, 'reward_std': 0.22094783186912537, 'kl': 0.0751953125, 'epoch': 2.76} 55%|█████▌ | 888/1610 [3:35:09<2:27:06, 12.23s/it] 55%|█████▌ | 889/1610 [3:35:20<2:23:07, 11.91s/it] {'loss': 0.0045, 'grad_norm': 1.8739380805417851, 'learning_rate': 4.4782608695652175e-07, 'completion_length': 83.84822082519531, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.18788282573223114, 'kl': 0.1123046875, 'epoch': 2.76} 55%|█████▌ | 889/1610 [3:35:20<2:23:07, 11.91s/it] 55%|█████▌ | 890/1610 [3:35:34<2:28:49, 12.40s/it] {'loss': 0.0033, 'grad_norm': 1.6383075662028683, 'learning_rate': 4.472049689440994e-07, 'completion_length': 123.41965103149414, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4910715222358704, 'reward_std': 0.2589074522256851, 'kl': 0.083251953125, 'epoch': 2.76} 55%|█████▌ | 890/1610 [3:35:34<2:28:49, 12.40s/it] 55%|█████▌ | 891/1610 [3:35:45<2:24:33, 12.06s/it] {'loss': 0.0027, 'grad_norm': 1.4044756719467497, 'learning_rate': 4.46583850931677e-07, 'completion_length': 
87.91072082519531, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.05050762742757797, 'kl': 0.0675048828125, 'epoch': 2.77} 55%|█████▌ | 891/1610 [3:35:45<2:24:33, 12.06s/it] 55%|█████▌ | 892/1610 [3:35:59<2:29:47, 12.52s/it] {'loss': 0.0038, 'grad_norm': 1.5586259180491013, 'learning_rate': 4.4596273291925464e-07, 'completion_length': 132.06250762939453, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.5267857909202576, 'reward_std': 0.32069161534309387, 'kl': 0.0947265625, 'epoch': 2.77} 55%|█████▌ | 892/1610 [3:35:59<2:29:47, 12.52s/it] 55%|█████▌ | 893/1610 [3:36:11<2:26:52, 12.29s/it] {'loss': 0.0036, 'grad_norm': 2.399642684715284, 'learning_rate': 4.453416149068323e-07, 'completion_length': 115.28572082519531, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5982143878936768, 'reward_std': 0.24167978018522263, 'kl': 0.091064453125, 'epoch': 2.77} 55%|█████▌ | 893/1610 [3:36:11<2:26:52, 12.29s/it] 56%|█████▌ | 894/1610 [3:36:22<2:23:58, 12.07s/it] {'loss': 0.0029, 'grad_norm': 1.7228499270648983, 'learning_rate': 4.447204968944099e-07, 'completion_length': 104.29464721679688, 'rewards/accuracy_reward': 0.3035714328289032, 'rewards/format_reward': 1.0, 'reward': 1.3035714626312256, 'reward_std': 0.2501044422388077, 'kl': 0.072265625, 'epoch': 2.78} 56%|█████▌ | 894/1610 [3:36:22<2:23:58, 12.07s/it] 56%|█████▌ | 895/1610 [3:36:32<2:16:27, 11.45s/it] {'loss': 0.0019, 'grad_norm': 1.1268560652525763, 'learning_rate': 4.4409937888198754e-07, 'completion_length': 94.6964340209961, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.17885926365852356, 'kl': 0.0487060546875, 'epoch': 2.78} 56%|█████▌ | 895/1610 [3:36:32<2:16:27, 11.45s/it] 56%|█████▌ | 896/1610 [3:36:43<2:12:23, 11.13s/it] {'loss': 0.0034, 'grad_norm': 
1.9891600437205015, 'learning_rate': 4.434782608695652e-07, 'completion_length': 83.31250381469727, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.14700662717223167, 'kl': 0.083984375, 'epoch': 2.78} 56%|█████▌ | 896/1610 [3:36:43<2:12:23, 11.13s/it] 56%|█████▌ | 897/1610 [3:36:55<2:16:57, 11.53s/it] {'loss': 0.0039, 'grad_norm': 0.9944688054920783, 'learning_rate': 4.428571428571428e-07, 'completion_length': 79.72321701049805, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5178571939468384, 'reward_std': 0.15481781959533691, 'kl': 0.0986328125, 'epoch': 2.79} 56%|█████▌ | 897/1610 [3:36:55<2:16:57, 11.53s/it] 56%|█████▌ | 898/1610 [3:37:06<2:13:46, 11.27s/it] {'loss': 0.0026, 'grad_norm': 1.5449653544446982, 'learning_rate': 4.422360248447205e-07, 'completion_length': 98.58036422729492, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.23386860638856888, 'kl': 0.0654296875, 'epoch': 2.79} 56%|█████▌ | 898/1610 [3:37:06<2:13:46, 11.27s/it] 56%|█████▌ | 899/1610 [3:37:18<2:16:03, 11.48s/it] {'loss': 0.0038, 'grad_norm': 1.3702450008236124, 'learning_rate': 4.416149068322981e-07, 'completion_length': 97.49107360839844, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4642857313156128, 'reward_std': 0.21192426979541779, 'kl': 0.09375, 'epoch': 2.79} 56%|█████▌ | 899/1610 [3:37:18<2:16:03, 11.48s/it] 56%|█████▌ | 900/1610 [3:37:30<2:17:59, 11.66s/it] {'loss': 0.0039, 'grad_norm': 1.4246351697397386, 'learning_rate': 4.4099378881987576e-07, 'completion_length': 98.13393020629883, 'rewards/accuracy_reward': 0.6250000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6160715222358704, 'reward_std': 0.1866704523563385, 'kl': 0.096435546875, 'epoch': 2.8} 56%|█████▌ | 900/1610 [3:37:30<2:17:59, 
11.66s/it] 56%|█████▌ | 901/1610 [3:38:32<5:15:57, 26.74s/it] {'loss': 0.0033, 'grad_norm': 1.448842750025834, 'learning_rate': 4.403726708074534e-07, 'completion_length': 79.78571701049805, 'rewards/accuracy_reward': 0.133928582072258, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.125, 'reward_std': 0.14970263838768005, 'kl': 0.08251953125, 'epoch': 2.8} 56%|█████▌ | 901/1610 [3:38:32<5:15:57, 26.74s/it] 56%|█████▌ | 902/1610 [3:38:42<4:16:12, 21.71s/it] {'loss': 0.0028, 'grad_norm': 1.3018084960367486, 'learning_rate': 4.39751552795031e-07, 'completion_length': 87.13393020629883, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.2022872269153595, 'kl': 0.06884765625, 'epoch': 2.8} 56%|█████▌ | 902/1610 [3:38:42<4:16:12, 21.71s/it] 56%|█████▌ | 903/1610 [3:38:54<3:42:23, 18.87s/it] {'loss': 0.0027, 'grad_norm': 4.575455370453518, 'learning_rate': 4.391304347826087e-07, 'completion_length': 112.66964721679688, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4196428656578064, 'reward_std': 0.15872061252593994, 'kl': 0.068603515625, 'epoch': 2.8} 56%|█████▌ | 903/1610 [3:38:54<3:42:23, 18.87s/it] 56%|█████▌ | 904/1610 [3:39:05<3:16:20, 16.69s/it] {'loss': 0.0029, 'grad_norm': 1.074041840442988, 'learning_rate': 4.3850931677018634e-07, 'completion_length': 100.5535774230957, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.16653180867433548, 'kl': 0.072265625, 'epoch': 2.81} 56%|█████▌ | 904/1610 [3:39:05<3:16:20, 16.69s/it] 56%|█████▌ | 905/1610 [3:39:17<2:57:54, 15.14s/it] {'loss': 0.0028, 'grad_norm': 1.5209097828490867, 'learning_rate': 4.3788819875776397e-07, 'completion_length': 93.06250381469727, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4375000596046448, 'reward_std': 0.2811286449432373, 
'kl': 0.07080078125, 'epoch': 2.81} 56%|█████▌ | 905/1610 [3:39:17<2:57:54, 15.14s/it] 56%|█████▋ | 906/1610 [3:39:28<2:41:27, 13.76s/it] {'loss': 0.0035, 'grad_norm': 2.0866882026945, 'learning_rate': 4.3726708074534155e-07, 'completion_length': 82.74107360839844, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.18847612291574478, 'kl': 0.087890625, 'epoch': 2.81} 56%|█████▋ | 906/1610 [3:39:28<2:41:27, 13.76s/it] 56%|█████▋ | 907/1610 [3:39:38<2:30:23, 12.84s/it] {'loss': 0.0037, 'grad_norm': 1.4757879437842751, 'learning_rate': 4.3664596273291924e-07, 'completion_length': 89.01786422729492, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.143968403339386, 'kl': 0.091796875, 'epoch': 2.82} 56%|█████▋ | 907/1610 [3:39:38<2:30:23, 12.84s/it] 56%|█████▋ | 908/1610 [3:39:48<2:20:53, 12.04s/it] {'loss': 0.003, 'grad_norm': 2.391860307622498, 'learning_rate': 4.3602484472049687e-07, 'completion_length': 102.10714721679688, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.2546207159757614, 'kl': 0.07421875, 'epoch': 2.82} 56%|█████▋ | 908/1610 [3:39:48<2:20:53, 12.04s/it] 56%|█████▋ | 909/1610 [3:39:58<2:13:27, 11.42s/it] {'loss': 0.0033, 'grad_norm': 2.2045566743846643, 'learning_rate': 4.354037267080745e-07, 'completion_length': 81.8839340209961, 'rewards/accuracy_reward': 0.3125000223517418, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.16531942784786224, 'kl': 0.081298828125, 'epoch': 2.82} 56%|█████▋ | 909/1610 [3:39:58<2:13:27, 11.42s/it] 57%|█████▋ | 910/1610 [3:40:09<2:09:15, 11.08s/it] {'loss': 0.0037, 'grad_norm': 1.0997354210982035, 'learning_rate': 4.3478260869565214e-07, 'completion_length': 76.13393020629883, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 
1.4107143878936768, 'reward_std': 0.128351628780365, 'kl': 0.092529296875, 'epoch': 2.83} 57%|█████▋ | 910/1610 [3:40:09<2:09:15, 11.08s/it] 57%|█████▋ | 911/1610 [3:40:19<2:06:17, 10.84s/it] {'loss': 0.0031, 'grad_norm': 0.951798535334548, 'learning_rate': 4.3416149068322977e-07, 'completion_length': 84.97321701049805, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.08747543022036552, 'kl': 0.0772705078125, 'epoch': 2.83} 57%|█████▋ | 911/1610 [3:40:19<2:06:17, 10.84s/it] 57%|█████▋ | 912/1610 [3:40:32<2:13:53, 11.51s/it] {'loss': 0.0043, 'grad_norm': 1.8801789633855914, 'learning_rate': 4.3354037267080745e-07, 'completion_length': 97.26786422729492, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5982143878936768, 'reward_std': 0.2247915416955948, 'kl': 0.10693359375, 'epoch': 2.83} 57%|█████▋ | 912/1610 [3:40:32<2:13:53, 11.51s/it] 57%|█████▋ | 913/1610 [3:40:43<2:12:07, 11.37s/it] {'loss': 0.0025, 'grad_norm': 2.8932294736550146, 'learning_rate': 4.329192546583851e-07, 'completion_length': 99.92857360839844, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3750000596046448, 'reward_std': 0.26450884342193604, 'kl': 0.06201171875, 'epoch': 2.84} 57%|█████▋ | 913/1610 [3:40:43<2:12:07, 11.37s/it] 57%|█████▋ | 914/1610 [3:40:54<2:11:42, 11.35s/it] {'loss': 0.0042, 'grad_norm': 1.9430717274548654, 'learning_rate': 4.322981366459627e-07, 'completion_length': 95.90179061889648, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.23959723114967346, 'kl': 0.104736328125, 'epoch': 2.84} 57%|█████▋ | 914/1610 [3:40:54<2:11:42, 11.35s/it] 57%|█████▋ | 915/1610 [3:41:04<2:05:31, 10.84s/it] {'loss': 0.0028, 'grad_norm': 1.392370155409333, 'learning_rate': 4.3167701863354035e-07, 'completion_length': 96.23214721679688, 
'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.16653180122375488, 'kl': 0.0712890625, 'epoch': 2.84} 57%|█████▋ | 915/1610 [3:41:04<2:05:31, 10.84s/it] 57%|█████▋ | 916/1610 [3:41:15<2:05:00, 10.81s/it] {'loss': 0.0022, 'grad_norm': 16.442232988399, 'learning_rate': 4.3105590062111804e-07, 'completion_length': 90.16072082519531, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.32915520668029785, 'kl': 0.0555419921875, 'epoch': 2.84} 57%|█████▋ | 916/1610 [3:41:15<2:05:00, 10.81s/it] 57%|█████▋ | 917/1610 [3:41:26<2:04:45, 10.80s/it] {'loss': 0.0031, 'grad_norm': 1.8695252373298226, 'learning_rate': 4.3043478260869567e-07, 'completion_length': 102.14286041259766, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.26841723918914795, 'kl': 0.07861328125, 'epoch': 2.85} 57%|█████▋ | 917/1610 [3:41:26<2:04:45, 10.80s/it] 57%|█████▋ | 918/1610 [3:41:37<2:05:34, 10.89s/it] {'loss': 0.0034, 'grad_norm': 1.5127071876838574, 'learning_rate': 4.2981366459627325e-07, 'completion_length': 105.70536041259766, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.5625000596046448, 'reward_std': 0.26672425121068954, 'kl': 0.0849609375, 'epoch': 2.85} 57%|█████▋ | 918/1610 [3:41:37<2:05:34, 10.89s/it] 57%|█████▋ | 919/1610 [3:41:48<2:08:05, 11.12s/it] {'loss': 0.0034, 'grad_norm': 2.68781840115549, 'learning_rate': 4.291925465838509e-07, 'completion_length': 84.55357360839844, 'rewards/accuracy_reward': 0.294642873108387, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.285714328289032, 'reward_std': 0.21044062077999115, 'kl': 0.08544921875, 'epoch': 2.85} 57%|█████▋ | 919/1610 [3:41:48<2:08:05, 11.12s/it] 57%|█████▋ | 920/1610 [3:41:59<2:05:55, 10.95s/it] {'loss': 0.0026, 'grad_norm': 4.103835467337731, 
'learning_rate': 4.285714285714285e-07, 'completion_length': 95.62500381469727, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.2882790118455887, 'kl': 0.065185546875, 'epoch': 2.86} 57%|█████▋ | 920/1610 [3:41:59<2:05:55, 10.95s/it] 57%|█████▋ | 921/1610 [3:42:10<2:07:28, 11.10s/it] {'loss': 0.0029, 'grad_norm': 2.16172064934102, 'learning_rate': 4.279503105590062e-07, 'completion_length': 106.46429061889648, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.20922823250293732, 'kl': 0.072265625, 'epoch': 2.86} 57%|█████▋ | 921/1610 [3:42:10<2:07:28, 11.10s/it] 57%|█████▋ | 922/1610 [3:42:21<2:07:07, 11.09s/it] {'loss': 0.0037, 'grad_norm': 1.4991291788635712, 'learning_rate': 4.2732919254658383e-07, 'completion_length': 86.74107360839844, 'rewards/accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.7053571939468384, 'reward_std': 0.12054044008255005, 'kl': 0.092041015625, 'epoch': 2.86} 57%|█████▋ | 922/1610 [3:42:21<2:07:07, 11.09s/it] 57%|█████▋ | 923/1610 [3:42:31<2:02:46, 10.72s/it] {'loss': 0.0033, 'grad_norm': 0.9632625106105845, 'learning_rate': 4.2670807453416146e-07, 'completion_length': 95.26786422729492, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.1833609640598297, 'kl': 0.0828857421875, 'epoch': 2.87} 57%|█████▋ | 923/1610 [3:42:31<2:02:46, 10.72s/it] 57%|█████▋ | 924/1610 [3:42:42<2:02:19, 10.70s/it] {'loss': 0.0035, 'grad_norm': 0.9953763753191293, 'learning_rate': 4.260869565217391e-07, 'completion_length': 88.90179061889648, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.06222161650657654, 'kl': 0.087158203125, 'epoch': 2.87} 57%|█████▋ | 924/1610 [3:42:42<2:02:19, 10.70s/it] 57%|█████▋ | 925/1610 
[3:42:52<2:01:59, 10.68s/it] {'loss': 0.003, 'grad_norm': 1.4916562165828848, 'learning_rate': 4.254658385093168e-07, 'completion_length': 104.1160774230957, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.21973544359207153, 'kl': 0.07373046875, 'epoch': 2.87} 57%|█████▋ | 925/1610 [3:42:52<2:01:59, 10.68s/it] 58%|█████▊ | 926/1610 [3:43:04<2:03:27, 10.83s/it] {'loss': 0.003, 'grad_norm': 1.383458944321362, 'learning_rate': 4.248447204968944e-07, 'completion_length': 113.5535774230957, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3750000596046448, 'reward_std': 0.20740239322185516, 'kl': 0.0758056640625, 'epoch': 2.88} 58%|█████▊ | 926/1610 [3:43:04<2:03:27, 10.83s/it] 58%|█████▊ | 927/1610 [3:43:14<2:02:15, 10.74s/it] {'loss': 0.0031, 'grad_norm': 2.0655610229600554, 'learning_rate': 4.2422360248447205e-07, 'completion_length': 94.40178680419922, 'rewards/accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.24167978763580322, 'kl': 0.0777587890625, 'epoch': 2.88} 58%|█████▊ | 927/1610 [3:43:14<2:02:15, 10.74s/it] 58%|█████▊ | 928/1610 [3:43:24<1:59:05, 10.48s/it] {'loss': 0.0036, 'grad_norm': 1.3223842339311118, 'learning_rate': 4.236024844720497e-07, 'completion_length': 91.50893020629883, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6160715222358704, 'reward_std': 0.1866704523563385, 'kl': 0.0908203125, 'epoch': 2.88} 58%|█████▊ | 928/1610 [3:43:24<1:59:05, 10.48s/it] 58%|█████▊ | 929/1610 [3:43:34<1:56:17, 10.25s/it] {'loss': 0.0026, 'grad_norm': 2.1303005459136664, 'learning_rate': 4.229813664596273e-07, 'completion_length': 90.26786422729492, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.10700060427188873, 'kl': 0.0655517578125, 
'epoch': 2.89} 58%|█████▊ | 929/1610 [3:43:34<1:56:17, 10.25s/it] 58%|█████▊ | 930/1610 [3:43:46<2:01:45, 10.74s/it] {'loss': 0.0031, 'grad_norm': 1.963412151296795, 'learning_rate': 4.2236024844720495e-07, 'completion_length': 108.8035774230957, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035714626312256, 'reward_std': 0.22545847296714783, 'kl': 0.07763671875, 'epoch': 2.89} 58%|█████▊ | 930/1610 [3:43:46<2:01:45, 10.74s/it] 58%|█████▊ | 931/1610 [3:43:58<2:06:42, 11.20s/it] {'loss': 0.0036, 'grad_norm': 0.5407647589967557, 'learning_rate': 4.217391304347826e-07, 'completion_length': 95.54464721679688, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6250001192092896, 'reward_std': 0.08274422585964203, 'kl': 0.089111328125, 'epoch': 2.89} 58%|█████▊ | 931/1610 [3:43:58<2:06:42, 11.20s/it] 58%|█████▊ | 932/1610 [3:44:08<2:01:06, 10.72s/it] {'loss': 0.0029, 'grad_norm': 1.9028408545199313, 'learning_rate': 4.211180124223602e-07, 'completion_length': 83.04464721679688, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.09528661891818047, 'kl': 0.0733642578125, 'epoch': 2.89} 58%|█████▊ | 932/1610 [3:44:08<2:01:06, 10.72s/it] 58%|█████▊ | 933/1610 [3:44:20<2:05:31, 11.12s/it] {'loss': 0.0037, 'grad_norm': 1.705184824713344, 'learning_rate': 4.2049689440993784e-07, 'completion_length': 105.36607360839844, 'rewards/accuracy_reward': 0.6160714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6071429252624512, 'reward_std': 0.19057324528694153, 'kl': 0.09130859375, 'epoch': 2.9} 58%|█████▊ | 933/1610 [3:44:20<2:05:31, 11.12s/it] 58%|█████▊ | 934/1610 [3:44:30<2:03:50, 10.99s/it] {'loss': 0.0038, 'grad_norm': 0.9739407694898848, 'learning_rate': 4.1987577639751553e-07, 'completion_length': 90.84821701049805, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 1.0, 
'reward': 1.4910714626312256, 'reward_std': 0.12054043635725975, 'kl': 0.09375, 'epoch': 2.9} 58%|█████▊ | 934/1610 [3:44:30<2:03:50, 10.99s/it] 58%|█████▊ | 935/1610 [3:44:42<2:07:03, 11.29s/it] {'loss': 0.0037, 'grad_norm': 2.404099008475663, 'learning_rate': 4.1925465838509316e-07, 'completion_length': 94.68750381469727, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.22875341773033142, 'kl': 0.091796875, 'epoch': 2.9} 58%|█████▊ | 935/1610 [3:44:42<2:07:03, 11.29s/it] 58%|█████▊ | 936/1610 [3:44:54<2:10:00, 11.57s/it] {'loss': 0.0035, 'grad_norm': 1.7101028570142698, 'learning_rate': 4.186335403726708e-07, 'completion_length': 121.68750762939453, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.32673604786396027, 'kl': 0.08740234375, 'epoch': 2.91} 58%|█████▊ | 936/1610 [3:44:54<2:10:00, 11.57s/it] 58%|█████▊ | 937/1610 [3:45:08<2:14:40, 12.01s/it] {'loss': 0.0039, 'grad_norm': 2.80754045283281, 'learning_rate': 4.180124223602484e-07, 'completion_length': 106.2410774230957, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3750000596046448, 'reward_std': 0.09528662264347076, 'kl': 0.097412109375, 'epoch': 2.91} 58%|█████▊ | 937/1610 [3:45:08<2:14:40, 12.01s/it] 58%|█████▊ | 938/1610 [3:45:19<2:14:15, 11.99s/it] {'loss': 0.0041, 'grad_norm': 1.9337418168283633, 'learning_rate': 4.1739130434782606e-07, 'completion_length': 94.09821701049805, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4732143878936768, 'reward_std': 0.1800716370344162, 'kl': 0.10302734375, 'epoch': 2.91} 58%|█████▊ | 938/1610 [3:45:19<2:14:15, 11.99s/it] 58%|█████▊ | 939/1610 [3:45:32<2:15:19, 12.10s/it] {'loss': 0.0034, 'grad_norm': 1.0640371081193523, 'learning_rate': 4.1677018633540374e-07, 'completion_length': 104.34822082519531, 
'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.15481781959533691, 'kl': 0.085205078125, 'epoch': 2.92} 58%|█████▊ | 939/1610 [3:45:32<2:15:19, 12.10s/it] 58%|█████▊ | 940/1610 [3:45:44<2:16:36, 12.23s/it] {'loss': 0.0038, 'grad_norm': 5.717125018220124, 'learning_rate': 4.161490683229814e-07, 'completion_length': 120.6964340209961, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.2726622372865677, 'kl': 0.094970703125, 'epoch': 2.92} 58%|█████▊ | 940/1610 [3:45:44<2:16:36, 12.23s/it] 58%|█████▊ | 941/1610 [3:45:57<2:18:49, 12.45s/it] {'loss': 0.0035, 'grad_norm': 2.7387766894651193, 'learning_rate': 4.15527950310559e-07, 'completion_length': 101.52679061889648, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.473214328289032, 'reward_std': 0.27144984900951385, 'kl': 0.086669921875, 'epoch': 2.92} 58%|█████▊ | 941/1610 [3:45:57<2:18:49, 12.45s/it] 59%|█████▊ | 942/1610 [3:46:10<2:19:47, 12.56s/it] {'loss': 0.004, 'grad_norm': 2.0965477806758184, 'learning_rate': 4.149068322981366e-07, 'completion_length': 123.78571701049805, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.23656460642814636, 'kl': 0.098876953125, 'epoch': 2.93} 59%|█████▊ | 942/1610 [3:46:10<2:19:47, 12.56s/it] 59%|█████▊ | 943/1610 [3:46:23<2:20:19, 12.62s/it] {'loss': 0.0035, 'grad_norm': 1.230710126779974, 'learning_rate': 4.142857142857143e-07, 'completion_length': 120.9285774230957, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3035714626312256, 'reward_std': 0.20961221307516098, 'kl': 0.087158203125, 'epoch': 2.93} 59%|█████▊ | 943/1610 [3:46:23<2:20:19, 12.62s/it] 59%|█████▊ | 944/1610 [3:46:34<2:16:03, 12.26s/it] {'loss': 0.004, 'grad_norm': 1.4240754533852755, 
'learning_rate': 4.136645962732919e-07, 'completion_length': 104.3035774230957, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 1.0, 'reward': 1.383928656578064, 'reward_std': 0.2113051861524582, 'kl': 0.099365234375, 'epoch': 2.93} 59%|█████▊ | 944/1610 [3:46:34<2:16:03, 12.26s/it] 59%|█████▊ | 945/1610 [3:46:48<2:20:21, 12.66s/it] {'loss': 0.0039, 'grad_norm': 1.913254763025028, 'learning_rate': 4.1304347826086954e-07, 'completion_length': 114.12500381469727, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571429252624512, 'reward_std': 0.28350888192653656, 'kl': 0.0966796875, 'epoch': 2.93} 59%|█████▊ | 945/1610 [3:46:48<2:20:21, 12.66s/it] 59%|█████▉ | 946/1610 [3:46:59<2:13:21, 12.05s/it] {'loss': 0.0037, 'grad_norm': 2.032817204093572, 'learning_rate': 4.1242236024844717e-07, 'completion_length': 97.59822082519531, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5714285969734192, 'reward_std': 0.24289215356111526, 'kl': 0.0926513671875, 'epoch': 2.94} 59%|█████▉ | 946/1610 [3:46:59<2:13:21, 12.05s/it] 59%|█████▉ | 947/1610 [3:47:09<2:09:22, 11.71s/it] {'loss': 0.0036, 'grad_norm': 2.104529874552913, 'learning_rate': 4.118012422360248e-07, 'completion_length': 125.26786422729492, 'rewards/accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.446428656578064, 'reward_std': 0.244989275932312, 'kl': 0.09033203125, 'epoch': 2.94} 59%|█████▉ | 947/1610 [3:47:09<2:09:22, 11.71s/it] 59%|█████▉ | 948/1610 [3:47:19<2:03:13, 11.17s/it] {'loss': 0.0033, 'grad_norm': 1.1790463984999584, 'learning_rate': 4.111801242236025e-07, 'completion_length': 84.3214340209961, 'rewards/accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.13616281747817993, 'kl': 0.08154296875, 'epoch': 2.94} 59%|█████▉ | 948/1610 [3:47:19<2:03:13, 11.17s/it] 59%|█████▉ 
| 949/1610 [3:47:31<2:04:09, 11.27s/it] {'loss': 0.0042, 'grad_norm': 1.8538100740664336, 'learning_rate': 4.105590062111801e-07, 'completion_length': 112.98214721679688, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.4196428656578064, 'reward_std': 0.3486414700746536, 'kl': 0.10546875, 'epoch': 2.95} 59%|█████▉ | 949/1610 [3:47:31<2:04:09, 11.27s/it] 59%|█████▉ | 950/1610 [3:47:43<2:06:09, 11.47s/it] {'loss': 0.0043, 'grad_norm': 1.062288839512064, 'learning_rate': 4.0993788819875776e-07, 'completion_length': 103.39286041259766, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4910715222358704, 'reward_std': 0.21065278351306915, 'kl': 0.108154296875, 'epoch': 2.95} 59%|█████▉ | 950/1610 [3:47:43<2:06:09, 11.47s/it] 59%|█████▉ | 951/1610 [3:47:55<2:09:38, 11.80s/it] {'loss': 0.0045, 'grad_norm': 2.5527418535423925, 'learning_rate': 4.093167701863354e-07, 'completion_length': 125.54464721679688, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285714626312256, 'reward_std': 0.1995968148112297, 'kl': 0.112060546875, 'epoch': 2.95} 59%|█████▉ | 951/1610 [3:47:55<2:09:38, 11.80s/it] 59%|█████▉ | 952/1610 [3:48:08<2:13:18, 12.16s/it] {'loss': 0.0043, 'grad_norm': 0.7730708930111273, 'learning_rate': 4.0869565217391307e-07, 'completion_length': 119.82143020629883, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6517857909202576, 'reward_std': 0.12054043263196945, 'kl': 0.107421875, 'epoch': 2.96} 59%|█████▉ | 952/1610 [3:48:08<2:13:18, 12.16s/it] 59%|█████▉ | 953/1610 [3:48:21<2:13:25, 12.18s/it] {'loss': 0.0039, 'grad_norm': 1.6661922664392406, 'learning_rate': 4.080745341614907e-07, 'completion_length': 104.70536041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5625000596046448, 
'reward_std': 0.2759717255830765, 'kl': 0.09765625, 'epoch': 2.96} 59%|█████▉ | 953/1610 [3:48:21<2:13:25, 12.18s/it] 59%|█████▉ | 954/1610 [3:48:34<2:16:39, 12.50s/it] {'loss': 0.0047, 'grad_norm': 1.5714875117171936, 'learning_rate': 4.074534161490683e-07, 'completion_length': 114.1964340209961, 'rewards/accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 0.973214328289032, 'reward': 1.5178571939468384, 'reward_std': 0.2607449144124985, 'kl': 0.117431640625, 'epoch': 2.96} 59%|█████▉ | 954/1610 [3:48:34<2:16:39, 12.50s/it] 59%|█████▉ | 955/1610 [3:48:46<2:14:59, 12.37s/it] {'loss': 0.004, 'grad_norm': 1.5352979701577154, 'learning_rate': 4.068322981366459e-07, 'completion_length': 114.7589340209961, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.2850314527750015, 'kl': 0.100341796875, 'epoch': 2.97} 59%|█████▉ | 955/1610 [3:48:46<2:14:59, 12.37s/it] 59%|█████▉ | 956/1610 [3:48:59<2:17:18, 12.60s/it] {'loss': 0.0045, 'grad_norm': 1.206570321953466, 'learning_rate': 4.0621118012422355e-07, 'completion_length': 110.59822082519531, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4642857909202576, 'reward_std': 0.18397442996501923, 'kl': 0.11376953125, 'epoch': 2.97} 59%|█████▉ | 956/1610 [3:48:59<2:17:18, 12.60s/it] 59%|█████▉ | 957/1610 [3:49:09<2:09:54, 11.94s/it] {'loss': 0.0039, 'grad_norm': 1.37800720421806, 'learning_rate': 4.0559006211180124e-07, 'completion_length': 80.91964721679688, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.19057323783636093, 'kl': 0.097412109375, 'epoch': 2.97} 59%|█████▉ | 957/1610 [3:49:09<2:09:54, 11.94s/it] 60%|█████▉ | 958/1610 [3:49:22<2:10:55, 12.05s/it] {'loss': 0.0038, 'grad_norm': 5.56221505632245, 'learning_rate': 4.0496894409937887e-07, 'completion_length': 108.02679061889648, 
'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5625000596046448, 'reward_std': 0.22605739533901215, 'kl': 0.094482421875, 'epoch': 2.98} 60%|█████▉ | 958/1610 [3:49:22<2:10:55, 12.05s/it] 60%|█████▉ | 959/1610 [3:49:33<2:08:56, 11.88s/it] {'loss': 0.0036, 'grad_norm': 2.99732152060076, 'learning_rate': 4.043478260869565e-07, 'completion_length': 116.65179061889648, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2678571939468384, 'reward_std': 0.22875341773033142, 'kl': 0.08935546875, 'epoch': 2.98} 60%|█████▉ | 959/1610 [3:49:33<2:08:56, 11.88s/it] 60%|█████▉ | 960/1610 [3:49:43<2:03:14, 11.38s/it] {'loss': 0.0036, 'grad_norm': 1.643548106110312, 'learning_rate': 4.0372670807453413e-07, 'completion_length': 87.97321701049805, 'rewards/accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.1418914571404457, 'kl': 0.09033203125, 'epoch': 2.98} 60%|█████▉ | 960/1610 [3:49:43<2:03:14, 11.38s/it] 60%|█████▉ | 961/1610 [3:49:56<2:06:10, 11.66s/it] {'loss': 0.0038, 'grad_norm': 1.8647464317975373, 'learning_rate': 4.0310559006211177e-07, 'completion_length': 113.08929443359375, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.2579156383872032, 'kl': 0.09375, 'epoch': 2.98} 60%|█████▉ | 961/1610 [3:49:56<2:06:10, 11.66s/it] 60%|█████▉ | 962/1610 [3:50:09<2:09:39, 12.01s/it] {'loss': 0.0039, 'grad_norm': 1.4770098650104722, 'learning_rate': 4.0248447204968945e-07, 'completion_length': 80.01786041259766, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.607142984867096, 'reward_std': 0.17885925620794296, 'kl': 0.09814453125, 'epoch': 2.99} 60%|█████▉ | 962/1610 [3:50:09<2:09:39, 12.01s/it] 60%|█████▉ | 963/1610 [3:50:20<2:08:59, 11.96s/it] {'loss': 0.0034, 'grad_norm': 
1.3386923228619885, 'learning_rate': 4.018633540372671e-07, 'completion_length': 99.01786041259766, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4196429252624512, 'reward_std': 0.15933407843112946, 'kl': 0.084716796875, 'epoch': 2.99} 60%|█████▉ | 963/1610 [3:50:20<2:08:59, 11.96s/it] 60%|█████▉ | 964/1610 [3:50:33<2:09:48, 12.06s/it] {'loss': 0.0048, 'grad_norm': 3.1279182828458842, 'learning_rate': 4.012422360248447e-07, 'completion_length': 110.72322082519531, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5714285969734192, 'reward_std': 0.17226044088602066, 'kl': 0.120361328125, 'epoch': 2.99} 60%|█████▉ | 964/1610 [3:50:33<2:09:48, 12.06s/it] 60%|█████▉ | 965/1610 [3:50:46<2:12:59, 12.37s/it] {'loss': 0.005, 'grad_norm': 2.702839794843325, 'learning_rate': 4.006211180124223e-07, 'completion_length': 96.56250381469727, 'rewards/accuracy_reward': 0.2946428805589676, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.2678571939468384, 'reward_std': 0.21044062077999115, 'kl': 0.123779296875, 'epoch': 3.0} 60%|█████▉ | 965/1610 [3:50:46<2:12:59, 12.37s/it] 60%|██████ | 966/1610 [3:50:58<2:12:04, 12.31s/it] {'loss': 0.0048, 'grad_norm': 2.005636834755035, 'learning_rate': 4e-07, 'completion_length': 92.47322082519531, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4196429252624512, 'reward_std': 0.15872061997652054, 'kl': 0.119140625, 'epoch': 3.0} 60%|██████ | 966/1610 [3:50:58<2:12:04, 12.31s/it] 60%|██████ | 967/1610 [3:51:10<2:11:05, 12.23s/it] {'loss': 0.0044, 'grad_norm': 1.3270718393790744, 'learning_rate': 3.993788819875776e-07, 'completion_length': 101.4464340209961, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.16653180867433548, 'kl': 0.110107421875, 'epoch': 3.0} 60%|██████ | 967/1610 
[3:51:10<2:11:05, 12.23s/it] 60%|██████ | 968/1610 [3:51:20<2:04:01, 11.59s/it] {'loss': 0.0028, 'grad_norm': 1.195112557881788, 'learning_rate': 3.9875776397515525e-07, 'completion_length': 89.71428680419922, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.1418914571404457, 'kl': 0.071044921875, 'epoch': 3.01} 60%|██████ | 968/1610 [3:51:20<2:04:01, 11.59s/it] 60%|██████ | 969/1610 [3:51:33<2:07:45, 11.96s/it] {'loss': 0.0055, 'grad_norm': 4.345554019137067, 'learning_rate': 3.981366459627329e-07, 'completion_length': 104.3660774230957, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5089285969734192, 'reward_std': 0.17965053021907806, 'kl': 0.138671875, 'epoch': 3.01} 60%|██████ | 969/1610 [3:51:33<2:07:45, 11.96s/it] 60%|██████ | 970/1610 [3:51:43<2:01:13, 11.36s/it] {'loss': 0.0037, 'grad_norm': 1.6914539523184688, 'learning_rate': 3.975155279503105e-07, 'completion_length': 92.43750381469727, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3035715222358704, 'reward_std': 0.2669335976243019, 'kl': 0.093505859375, 'epoch': 3.01} 60%|██████ | 970/1610 [3:51:43<2:01:13, 11.36s/it] 60%|██████ | 971/1610 [3:51:54<2:00:56, 11.36s/it] {'loss': 0.0034, 'grad_norm': 1.2911611584759566, 'learning_rate': 3.968944099378882e-07, 'completion_length': 80.81250381469727, 'rewards/accuracy_reward': 0.2767857313156128, 'rewards/format_reward': 1.0, 'reward': 1.2767857909202576, 'reward_std': 0.13616281747817993, 'kl': 0.0859375, 'epoch': 3.02} 60%|██████ | 971/1610 [3:51:54<2:00:56, 11.36s/it] 60%|██████ | 972/1610 [3:52:06<2:01:05, 11.39s/it] {'loss': 0.0036, 'grad_norm': 1.407868107561538, 'learning_rate': 3.9627329192546583e-07, 'completion_length': 80.03572082519531, 'rewards/accuracy_reward': 0.7232142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.7142857909202576, 
'reward_std': 0.17885924503207207, 'kl': 0.089599609375, 'epoch': 3.02} 60%|██████ | 972/1610 [3:52:06<2:01:05, 11.39s/it] 60%|██████ | 973/1610 [3:52:16<1:55:58, 10.92s/it] {'loss': 0.0044, 'grad_norm': 1.0180438440891582, 'learning_rate': 3.9565217391304346e-07, 'completion_length': 69.85714530944824, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.09528662264347076, 'kl': 0.11083984375, 'epoch': 3.02} 60%|██████ | 973/1610 [3:52:16<1:55:58, 10.92s/it] 60%|██████ | 974/1610 [3:52:27<1:57:43, 11.11s/it] {'loss': 0.0049, 'grad_norm': 1.441067538900903, 'learning_rate': 3.950310559006211e-07, 'completion_length': 96.07143020629883, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4017857909202576, 'reward_std': 0.22605740278959274, 'kl': 0.12353515625, 'epoch': 3.02} 60%|██████ | 974/1610 [3:52:27<1:57:43, 11.11s/it] 61%|██████ | 975/1610 [3:52:40<2:04:10, 11.73s/it] {'loss': 0.0036, 'grad_norm': 3.1953497182388344, 'learning_rate': 3.944099378881988e-07, 'completion_length': 104.54464721679688, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.973214328289032, 'reward': 1.6071429252624512, 'reward_std': 0.20801585912704468, 'kl': 0.08935546875, 'epoch': 3.03} 61%|██████ | 975/1610 [3:52:40<2:04:10, 11.73s/it] 61%|██████ | 976/1610 [3:52:51<2:00:07, 11.37s/it] {'loss': 0.0031, 'grad_norm': 1.848129870250399, 'learning_rate': 3.937888198757764e-07, 'completion_length': 97.0714340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.167145274579525, 'kl': 0.078125, 'epoch': 3.03} 61%|██████ | 976/1610 [3:52:51<2:00:07, 11.37s/it] 61%|██████ | 977/1610 [3:53:03<2:02:51, 11.65s/it] {'loss': 0.0046, 'grad_norm': 1.0254283187192326, 'learning_rate': 3.93167701863354e-07, 'completion_length': 95.68750762939453, 'rewards/accuracy_reward': 
0.5714285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5625000596046448, 'reward_std': 0.12956400960683823, 'kl': 0.1142578125, 'epoch': 3.03} 61%|██████ | 977/1610 [3:53:03<2:02:51, 11.65s/it] 61%|██████ | 978/1610 [3:53:13<1:58:39, 11.27s/it] {'loss': 0.0034, 'grad_norm': 2.0468832870558797, 'learning_rate': 3.925465838509316e-07, 'completion_length': 93.04464721679688, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.23535223305225372, 'kl': 0.0863037109375, 'epoch': 3.04} 61%|██████ | 978/1610 [3:53:13<1:58:39, 11.27s/it] 61%|██████ | 979/1610 [3:53:26<2:02:06, 11.61s/it] {'loss': 0.0047, 'grad_norm': 1.5012843845522141, 'learning_rate': 3.9192546583850926e-07, 'completion_length': 95.46429061889648, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.16922222077846527, 'kl': 0.1171875, 'epoch': 3.04} 61%|██████ | 979/1610 [3:53:26<2:02:06, 11.61s/it] 61%|██████ | 980/1610 [3:53:38<2:04:56, 11.90s/it] {'loss': 0.0038, 'grad_norm': 2.032004013390779, 'learning_rate': 3.9130434782608694e-07, 'completion_length': 99.03572082519531, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5178571939468384, 'reward_std': 0.22363825142383575, 'kl': 0.095458984375, 'epoch': 3.04} 61%|██████ | 980/1610 [3:53:38<2:04:56, 11.90s/it] 61%|██████ | 981/1610 [3:53:51<2:05:59, 12.02s/it] {'loss': 0.006, 'grad_norm': 1.310868986785517, 'learning_rate': 3.906832298136646e-07, 'completion_length': 110.23214721679688, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5625000596046448, 'reward_std': 0.17104807496070862, 'kl': 0.14990234375, 'epoch': 3.05} 61%|██████ | 981/1610 [3:53:51<2:05:59, 12.02s/it] 61%|██████ | 982/1610 [3:54:02<2:04:12, 11.87s/it] {'loss': 0.0046, 'grad_norm': 1.0034877301979521, 
'learning_rate': 3.900621118012422e-07, 'completion_length': 96.10714721679688, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857313156128, 'reward_std': 0.16262900456786156, 'kl': 0.1142578125, 'epoch': 3.05} 61%|██████ | 982/1610 [3:54:02<2:04:12, 11.87s/it] 61%|██████ | 983/1610 [3:54:12<1:58:24, 11.33s/it] {'loss': 0.0041, 'grad_norm': 0.6875157259043131, 'learning_rate': 3.8944099378881984e-07, 'completion_length': 72.02679061889648, 'rewards/accuracy_reward': 0.232142873108387, 'rewards/format_reward': 1.0, 'reward': 1.2321429252624512, 'reward_std': 0.033065006136894226, 'kl': 0.1015625, 'epoch': 3.05} 61%|██████ | 983/1610 [3:54:12<1:58:24, 11.33s/it] 61%|██████ | 984/1610 [3:54:23<1:57:31, 11.26s/it] {'loss': 0.0049, 'grad_norm': 1.9030101162307218, 'learning_rate': 3.8881987577639753e-07, 'completion_length': 97.08929061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6160715222358704, 'reward_std': 0.2960958182811737, 'kl': 0.123779296875, 'epoch': 3.06} 61%|██████ | 984/1610 [3:54:23<1:57:31, 11.26s/it] 61%|██████ | 985/1610 [3:54:34<1:55:06, 11.05s/it] {'loss': 0.0042, 'grad_norm': 0.8997800479464168, 'learning_rate': 3.8819875776397516e-07, 'completion_length': 73.9285774230957, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3928571939468384, 'reward_std': 0.11272924393415451, 'kl': 0.10595703125, 'epoch': 3.06} 61%|██████ | 985/1610 [3:54:34<1:55:06, 11.05s/it] 61%|██████ | 986/1610 [3:54:47<2:02:29, 11.78s/it] {'loss': 0.0042, 'grad_norm': 1.3552937821120565, 'learning_rate': 3.875776397515528e-07, 'completion_length': 93.09821701049805, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 0.973214328289032, 'reward': 1.5000001192092896, 'reward_std': 0.19503602385520935, 'kl': 0.10546875, 'epoch': 3.06} 61%|██████ | 986/1610 [3:54:47<2:02:29, 
11.78s/it] 61%|██████▏ | 987/1610 [3:54:58<1:57:48, 11.35s/it] {'loss': 0.0047, 'grad_norm': 5.572502023365795, 'learning_rate': 3.869565217391304e-07, 'completion_length': 81.71429061889648, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5535714626312256, 'reward_std': 0.17885926365852356, 'kl': 0.1171875, 'epoch': 3.07} 61%|██████▏ | 987/1610 [3:54:58<1:57:48, 11.35s/it] 61%|██████▏ | 988/1610 [3:55:08<1:55:10, 11.11s/it] {'loss': 0.0057, 'grad_norm': 0.9085120067555361, 'learning_rate': 3.8633540372670806e-07, 'completion_length': 84.15179061889648, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.05050762742757797, 'kl': 0.14208984375, 'epoch': 3.07} 61%|██████▏ | 988/1610 [3:55:09<1:55:10, 11.11s/it] 61%|██████▏ | 989/1610 [3:55:19<1:52:30, 10.87s/it] {'loss': 0.0042, 'grad_norm': 1.6741765876124741, 'learning_rate': 3.857142857142857e-07, 'completion_length': 85.61607360839844, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803571939468384, 'reward_std': 0.17104806005954742, 'kl': 0.104248046875, 'epoch': 3.07} 61%|██████▏ | 989/1610 [3:55:19<1:52:30, 10.87s/it] 61%|██████▏ | 990/1610 [3:55:32<1:59:33, 11.57s/it] {'loss': 0.007, 'grad_norm': 2.610471438926522, 'learning_rate': 3.850931677018633e-07, 'completion_length': 113.96429061889648, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.4196429252624512, 'reward_std': 0.2646476998925209, 'kl': 0.1748046875, 'epoch': 3.07} 61%|██████▏ | 990/1610 [3:55:32<1:59:33, 11.57s/it] 62%|██████▏ | 991/1610 [3:55:44<2:01:40, 11.79s/it] {'loss': 0.0049, 'grad_norm': 0.5581954610912429, 'learning_rate': 3.8447204968944095e-07, 'completion_length': 87.25000381469727, 'rewards/accuracy_reward': 0.508928582072258, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4910715222358704, 'reward_std': 
0.06343399360775948, 'kl': 0.123291015625, 'epoch': 3.08} 62%|██████▏ | 991/1610 [3:55:44<2:01:40, 11.79s/it] 62%|██████▏ | 992/1610 [3:55:57<2:03:37, 12.00s/it] {'loss': 0.0031, 'grad_norm': 1.2596813703636227, 'learning_rate': 3.838509316770186e-07, 'completion_length': 70.26786041259766, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.05050762742757797, 'kl': 0.078369140625, 'epoch': 3.08} 62%|██████▏ | 992/1610 [3:55:57<2:03:37, 12.00s/it] 62%|██████▏ | 993/1610 [3:56:08<2:01:14, 11.79s/it] {'loss': 0.0043, 'grad_norm': 1.1333770139398887, 'learning_rate': 3.8322981366459627e-07, 'completion_length': 95.66071701049805, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5892857909202576, 'reward_std': 0.13919543474912643, 'kl': 0.1064453125, 'epoch': 3.08} 62%|██████▏ | 993/1610 [3:56:08<2:01:14, 11.79s/it] 62%|██████▏ | 994/1610 [3:56:20<2:00:13, 11.71s/it] {'loss': 0.0051, 'grad_norm': 1.072714474837714, 'learning_rate': 3.826086956521739e-07, 'completion_length': 98.66071701049805, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4910714626312256, 'reward_std': 0.1412779837846756, 'kl': 0.1279296875, 'epoch': 3.09} 62%|██████▏ | 994/1610 [3:56:20<2:00:13, 11.71s/it] 62%|██████▏ | 995/1610 [3:56:32<2:03:28, 12.05s/it] {'loss': 0.0069, 'grad_norm': 2.766566150772289, 'learning_rate': 3.8198757763975154e-07, 'completion_length': 122.97321701049805, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.5357143878936768, 'reward_std': 0.19057324528694153, 'kl': 0.172607421875, 'epoch': 3.09} 62%|██████▏ | 995/1610 [3:56:32<2:03:28, 12.05s/it] 62%|██████▏ | 996/1610 [3:56:43<1:59:07, 11.64s/it] {'loss': 0.0031, 'grad_norm': 2.427599384110767, 'learning_rate': 3.8136645962732917e-07, 'completion_length': 83.46429061889648, 
'rewards/accuracy_reward': 0.5625000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5535715222358704, 'reward_std': 0.19057324528694153, 'kl': 0.078369140625, 'epoch': 3.09} 62%|██████▏ | 996/1610 [3:56:43<1:59:07, 11.64s/it] 62%|██████▏ | 997/1610 [3:56:56<2:01:58, 11.94s/it] {'loss': 0.0055, 'grad_norm': 2.731776244754488, 'learning_rate': 3.807453416149068e-07, 'completion_length': 82.66072082519531, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.6160714626312256, 'reward_std': 0.2254640907049179, 'kl': 0.13720703125, 'epoch': 3.1} 62%|██████▏ | 997/1610 [3:56:56<2:01:58, 11.94s/it] 62%|██████▏ | 998/1610 [3:57:09<2:07:19, 12.48s/it] {'loss': 0.0057, 'grad_norm': 1.3074671034026892, 'learning_rate': 3.801242236024845e-07, 'completion_length': 140.33036422729492, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.3035714626312256, 'reward_std': 0.25279486179351807, 'kl': 0.141357421875, 'epoch': 3.1} 62%|██████▏ | 998/1610 [3:57:09<2:07:19, 12.48s/it] 62%|██████▏ | 999/1610 [3:57:21<2:03:42, 12.15s/it] {'loss': 0.0053, 'grad_norm': 1.328384179886455, 'learning_rate': 3.795031055900621e-07, 'completion_length': 104.02678871154785, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5625000596046448, 'reward_std': 0.13671719655394554, 'kl': 0.133544921875, 'epoch': 3.1} 62%|██████▏ | 999/1610 [3:57:21<2:03:42, 12.15s/it] 62%|██████▏ | 1000/1610 [3:57:34<2:05:36, 12.36s/it] {'loss': 0.0059, 'grad_norm': 1.6029073288203064, 'learning_rate': 3.7888198757763975e-07, 'completion_length': 101.6160774230957, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178572535514832, 'reward_std': 0.10101525112986565, 'kl': 0.146484375, 'epoch': 3.11} 62%|██████▏ | 1000/1610 [3:57:34<2:05:36, 12.36s/it] 62%|██████▏ | 1001/1610 [3:58:35<4:33:21, 
26.93s/it] {'loss': 0.0096, 'grad_norm': 1.6752205412256118, 'learning_rate': 3.7826086956521733e-07, 'completion_length': 121.1160774230957, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5625001192092896, 'reward_std': 0.31878969073295593, 'kl': 0.23974609375, 'epoch': 3.11} 62%|██████▏ | 1001/1610 [3:58:35<4:33:21, 26.93s/it] 62%|██████▏ | 1002/1610 [3:58:46<3:44:29, 22.15s/it] {'loss': 0.0036, 'grad_norm': 1.8573555036742275, 'learning_rate': 3.77639751552795e-07, 'completion_length': 99.41071701049805, 'rewards/accuracy_reward': 0.258928582072258, 'rewards/format_reward': 1.0, 'reward': 1.258928656578064, 'reward_std': 0.18787722289562225, 'kl': 0.08984375, 'epoch': 3.11} 62%|██████▏ | 1002/1610 [3:58:46<3:44:29, 22.15s/it] 62%|██████▏ | 1003/1610 [3:58:58<3:14:19, 19.21s/it] {'loss': 0.0051, 'grad_norm': 11.551544755670038, 'learning_rate': 3.7701863354037265e-07, 'completion_length': 94.15179061889648, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3392857909202576, 'reward_std': 0.13858197629451752, 'kl': 0.128173828125, 'epoch': 3.11} 62%|██████▏ | 1003/1610 [3:58:58<3:14:19, 19.21s/it] 62%|██████▏ | 1004/1610 [3:59:09<2:50:45, 16.91s/it] {'loss': 0.0062, 'grad_norm': 1.834471773814077, 'learning_rate': 3.763975155279503e-07, 'completion_length': 109.16071701049805, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5357143878936768, 'reward_std': 0.24558258056640625, 'kl': 0.15478515625, 'epoch': 3.12} 62%|██████▏ | 1004/1610 [3:59:09<2:50:45, 16.91s/it] 62%|██████▏ | 1005/1610 [3:59:22<2:36:30, 15.52s/it] {'loss': 0.0057, 'grad_norm': 1.9152679564566877, 'learning_rate': 3.757763975155279e-07, 'completion_length': 104.87500381469727, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3214285969734192, 'reward_std': 0.27244730293750763, 
'kl': 0.14306640625, 'epoch': 3.12} 62%|██████▏ | 1005/1610 [3:59:22<2:36:30, 15.52s/it] 62%|██████▏ | 1006/1610 [3:59:35<2:29:24, 14.84s/it] {'loss': 0.0066, 'grad_norm': 2.0407634161061647, 'learning_rate': 3.7515527950310555e-07, 'completion_length': 101.54464721679688, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.5000000596046448, 'reward_std': 0.24619603157043457, 'kl': 0.1640625, 'epoch': 3.12} 62%|██████▏ | 1006/1610 [3:59:35<2:29:24, 14.84s/it] 63%|██████▎ | 1007/1610 [3:59:47<2:21:02, 14.03s/it] {'loss': 0.005, 'grad_norm': 2.5530203138478558, 'learning_rate': 3.7453416149068323e-07, 'completion_length': 95.04464721679688, 'rewards/accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.633928656578064, 'reward_std': 0.2987271398305893, 'kl': 0.125, 'epoch': 3.13} 63%|██████▎ | 1007/1610 [3:59:47<2:21:02, 14.03s/it] 63%|██████▎ | 1008/1610 [3:59:59<2:12:52, 13.24s/it] {'loss': 0.0062, 'grad_norm': 1.2872258711238034, 'learning_rate': 3.7391304347826087e-07, 'completion_length': 98.06250381469727, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3928571939468384, 'reward_std': 0.20801585912704468, 'kl': 0.15625, 'epoch': 3.13} 63%|██████▎ | 1008/1610 [3:59:59<2:12:52, 13.24s/it] 63%|██████▎ | 1009/1610 [4:00:11<2:09:00, 12.88s/it] {'loss': 0.0056, 'grad_norm': 1.8011983212442733, 'learning_rate': 3.732919254658385e-07, 'completion_length': 97.68750381469727, 'rewards/accuracy_reward': 0.5625000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5446429252624512, 'reward_std': 0.26322315633296967, 'kl': 0.13916015625, 'epoch': 3.13} 63%|██████▎ | 1009/1610 [4:00:11<2:09:00, 12.88s/it] 63%|██████▎ | 1010/1610 [4:00:23<2:05:59, 12.60s/it] {'loss': 0.0056, 'grad_norm': 1.3267137310235106, 'learning_rate': 3.7267080745341613e-07, 'completion_length': 94.22321701049805, 'rewards/accuracy_reward': 
0.6607142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6517857909202576, 'reward_std': 0.15933407843112946, 'kl': 0.14013671875, 'epoch': 3.14} 63%|██████▎ | 1010/1610 [4:00:23<2:05:59, 12.60s/it] 63%|██████▎ | 1011/1610 [4:00:34<2:03:43, 12.39s/it] {'loss': 0.0073, 'grad_norm': 1.5956609691906722, 'learning_rate': 3.720496894409938e-07, 'completion_length': 103.6964340209961, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8035715222358704, 'reward_std': 0.15152287483215332, 'kl': 0.1826171875, 'epoch': 3.14} 63%|██████▎ | 1011/1610 [4:00:34<2:03:43, 12.39s/it] 63%|██████▎ | 1012/1610 [4:00:46<2:02:20, 12.27s/it] {'loss': 0.0053, 'grad_norm': 1.6294206285169845, 'learning_rate': 3.7142857142857145e-07, 'completion_length': 92.09821701049805, 'rewards/accuracy_reward': 0.508928582072258, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4910715222358704, 'reward_std': 0.182762049138546, 'kl': 0.132080078125, 'epoch': 3.14} 63%|██████▎ | 1012/1610 [4:00:46<2:02:20, 12.27s/it] 63%|██████▎ | 1013/1610 [4:00:58<1:59:01, 11.96s/it] {'loss': 0.0055, 'grad_norm': 0.5334629574050553, 'learning_rate': 3.7080745341614903e-07, 'completion_length': 92.03571701049805, 'rewards/accuracy_reward': 0.3303571715950966, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.138671875, 'epoch': 3.15} 63%|██████▎ | 1013/1610 [4:00:58<1:59:01, 11.96s/it] 63%|██████▎ | 1014/1610 [4:01:11<2:03:29, 12.43s/it] {'loss': 0.004, 'grad_norm': 1.34107594935809, 'learning_rate': 3.7018633540372666e-07, 'completion_length': 86.54464340209961, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4375000596046448, 'reward_std': 0.1418914571404457, 'kl': 0.100830078125, 'epoch': 3.15} 63%|██████▎ | 1014/1610 [4:01:11<2:03:29, 12.43s/it] 63%|██████▎ | 1015/1610 [4:01:22<1:57:52, 11.89s/it] {'loss': 0.0056, 'grad_norm': 
1.4687694389298616, 'learning_rate': 3.695652173913043e-07, 'completion_length': 84.87500381469727, 'rewards/accuracy_reward': 0.4196428880095482, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.15360544621944427, 'kl': 0.138671875, 'epoch': 3.15} 63%|██████▎ | 1015/1610 [4:01:22<1:57:52, 11.89s/it] 63%|██████▎ | 1016/1610 [4:01:32<1:52:54, 11.41s/it] {'loss': 0.0051, 'grad_norm': 5.3963407294877825, 'learning_rate': 3.68944099378882e-07, 'completion_length': 73.33036041259766, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.2092282474040985, 'kl': 0.12646484375, 'epoch': 3.16} 63%|██████▎ | 1016/1610 [4:01:32<1:52:54, 11.41s/it] 63%|██████▎ | 1017/1610 [4:01:44<1:53:06, 11.44s/it] {'loss': 0.0042, 'grad_norm': 0.800013050715321, 'learning_rate': 3.683229813664596e-07, 'completion_length': 80.43750381469727, 'rewards/accuracy_reward': 0.4910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.07514797896146774, 'kl': 0.1044921875, 'epoch': 3.16} 63%|██████▎ | 1017/1610 [4:01:44<1:53:06, 11.44s/it] 63%|██████▎ | 1018/1610 [4:01:56<1:54:32, 11.61s/it] {'loss': 0.0036, 'grad_norm': 1.8113062906228767, 'learning_rate': 3.6770186335403724e-07, 'completion_length': 97.95536041259766, 'rewards/accuracy_reward': 0.4910714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4821429252624512, 'reward_std': 0.17164696753025055, 'kl': 0.090087890625, 'epoch': 3.16} 63%|██████▎ | 1018/1610 [4:01:56<1:54:32, 11.61s/it] 63%|██████▎ | 1019/1610 [4:02:05<1:48:36, 11.03s/it] {'loss': 0.0039, 'grad_norm': 2.4395528237586324, 'learning_rate': 3.670807453416149e-07, 'completion_length': 68.74107360839844, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.16653179749846458, 'kl': 0.0986328125, 'epoch': 3.16} 63%|██████▎ | 1019/1610 [4:02:05<1:48:36, 11.03s/it] 
63%|██████▎ | 1020/1610 [4:02:17<1:49:11, 11.10s/it] {'loss': 0.0044, 'grad_norm': 1.4260447775517144, 'learning_rate': 3.6645962732919256e-07, 'completion_length': 70.1785774230957, 'rewards/accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 1.0, 'reward': 1.4553572535514832, 'reward_std': 0.12054044008255005, 'kl': 0.1103515625, 'epoch': 3.17} 63%|██████▎ | 1020/1610 [4:02:17<1:49:11, 11.10s/it] 63%|██████▎ | 1021/1610 [4:02:30<1:54:49, 11.70s/it] {'loss': 0.0055, 'grad_norm': 5.461894669367778, 'learning_rate': 3.658385093167702e-07, 'completion_length': 86.50000762939453, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.598214328289032, 'reward_std': 0.20984169840812683, 'kl': 0.138427734375, 'epoch': 3.17} 63%|██████▎ | 1021/1610 [4:02:30<1:54:49, 11.70s/it] 63%|██████▎ | 1022/1610 [4:02:40<1:49:46, 11.20s/it] {'loss': 0.0035, 'grad_norm': 1.239177986157669, 'learning_rate': 3.6521739130434783e-07, 'completion_length': 78.17857360839844, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.128351628780365, 'kl': 0.088134765625, 'epoch': 3.17} 63%|██████▎ | 1022/1610 [4:02:40<1:49:46, 11.20s/it] 64%|██████▎ | 1023/1610 [4:02:50<1:45:49, 10.82s/it] {'loss': 0.0034, 'grad_norm': 1.626847928441539, 'learning_rate': 3.6459627329192546e-07, 'completion_length': 78.08929061889648, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.15872061997652054, 'kl': 0.085205078125, 'epoch': 3.18} 64%|██████▎ | 1023/1610 [4:02:50<1:45:49, 10.82s/it] 64%|██████▎ | 1024/1610 [4:03:01<1:47:49, 11.04s/it] {'loss': 0.007, 'grad_norm': 3.00287807891388, 'learning_rate': 3.6397515527950304e-07, 'completion_length': 96.60714721679688, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4553571939468384, 'reward_std': 
0.2718394249677658, 'kl': 0.174560546875, 'epoch': 3.18} 64%|██████▎ | 1024/1610 [4:03:01<1:47:49, 11.04s/it] 64%|██████▎ | 1025/1610 [4:03:11<1:45:24, 10.81s/it] {'loss': 0.0054, 'grad_norm': 1.3824293513193404, 'learning_rate': 3.633540372670807e-07, 'completion_length': 83.83036422729492, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.18666484951972961, 'kl': 0.135009765625, 'epoch': 3.18} 64%|██████▎ | 1025/1610 [4:03:11<1:45:24, 10.81s/it] 64%|██████▎ | 1026/1610 [4:03:24<1:49:09, 11.21s/it] {'loss': 0.005, 'grad_norm': 3.3845548357215898, 'learning_rate': 3.6273291925465836e-07, 'completion_length': 93.0714340209961, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.272048756480217, 'kl': 0.125244140625, 'epoch': 3.19} 64%|██████▎ | 1026/1610 [4:03:24<1:49:09, 11.21s/it] 64%|██████▍ | 1027/1610 [4:03:35<1:48:37, 11.18s/it] {'loss': 0.0054, 'grad_norm': 3.6168866772878254, 'learning_rate': 3.62111801242236e-07, 'completion_length': 81.47321891784668, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.571428656578064, 'reward_std': 0.24680952727794647, 'kl': 0.134765625, 'epoch': 3.19} 64%|██████▍ | 1027/1610 [4:03:35<1:48:37, 11.18s/it] 64%|██████▍ | 1028/1610 [4:03:45<1:45:48, 10.91s/it] {'loss': 0.0055, 'grad_norm': 1.5967516413401608, 'learning_rate': 3.614906832298136e-07, 'completion_length': 84.05357360839844, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.383928656578064, 'reward_std': 0.15933408588171005, 'kl': 0.137451171875, 'epoch': 3.19} 64%|██████▍ | 1028/1610 [4:03:45<1:45:48, 10.91s/it] 64%|██████▍ | 1029/1610 [4:03:56<1:46:59, 11.05s/it] {'loss': 0.0039, 'grad_norm': 0.6444755041702673, 'learning_rate': 3.608695652173913e-07, 'completion_length': 80.5089340209961, 
'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5892857909202576, 'reward_std': 0.07103024423122406, 'kl': 0.096923828125, 'epoch': 3.2} 64%|██████▍ | 1029/1610 [4:03:56<1:46:59, 11.05s/it] 64%|██████▍ | 1030/1610 [4:04:07<1:46:12, 10.99s/it] {'loss': 0.0036, 'grad_norm': 7.010980333740868, 'learning_rate': 3.6024844720496894e-07, 'completion_length': 95.08036041259766, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.23838485777378082, 'kl': 0.089599609375, 'epoch': 3.2} 64%|██████▍ | 1030/1610 [4:04:07<1:46:12, 10.99s/it] 64%|██████▍ | 1031/1610 [4:04:18<1:45:38, 10.95s/it] {'loss': 0.0042, 'grad_norm': 2.9747128381152885, 'learning_rate': 3.596273291925466e-07, 'completion_length': 74.57143020629883, 'rewards/accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.2158270627260208, 'kl': 0.105712890625, 'epoch': 3.2} 64%|██████▍ | 1031/1610 [4:04:18<1:45:38, 10.95s/it] 64%|██████▍ | 1032/1610 [4:04:31<1:50:16, 11.45s/it] {'loss': 0.0076, 'grad_norm': 1.1611282543811163, 'learning_rate': 3.590062111801242e-07, 'completion_length': 99.8839340209961, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803571939468384, 'reward_std': 0.1623549722135067, 'kl': 0.18896484375, 'epoch': 3.2} 64%|██████▍ | 1032/1610 [4:04:31<1:50:16, 11.45s/it] 64%|██████▍ | 1033/1610 [4:04:41<1:47:15, 11.15s/it] {'loss': 0.0037, 'grad_norm': 1.4360384302843505, 'learning_rate': 3.5838509316770184e-07, 'completion_length': 72.83928680419922, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.383928656578064, 'reward_std': 0.12054044008255005, 'kl': 0.09326171875, 'epoch': 3.21} 64%|██████▍ | 1033/1610 [4:04:41<1:47:15, 11.15s/it] 64%|██████▍ | 1034/1610 [4:04:53<1:47:41, 11.22s/it] {'loss': 0.0042, 'grad_norm': 
1.5692479354335538, 'learning_rate': 3.577639751552795e-07, 'completion_length': 86.4464340209961, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4821429252624512, 'reward_std': 0.20141704380512238, 'kl': 0.105224609375, 'epoch': 3.21} 64%|██████▍ | 1034/1610 [4:04:53<1:47:41, 11.22s/it] 64%|██████▍ | 1035/1610 [4:05:03<1:45:03, 10.96s/it] {'loss': 0.0046, 'grad_norm': 1.1595638514097926, 'learning_rate': 3.5714285714285716e-07, 'completion_length': 80.84821701049805, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.12054043635725975, 'kl': 0.115234375, 'epoch': 3.21} 64%|██████▍ | 1035/1610 [4:05:03<1:45:03, 10.96s/it] 64%|██████▍ | 1036/1610 [4:05:13<1:43:47, 10.85s/it] {'loss': 0.0073, 'grad_norm': 1.357688972123626, 'learning_rate': 3.5652173913043474e-07, 'completion_length': 80.90179061889648, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.128351628780365, 'kl': 0.18359375, 'epoch': 3.22} 64%|██████▍ | 1036/1610 [4:05:13<1:43:47, 10.85s/it] 64%|██████▍ | 1037/1610 [4:05:26<1:47:04, 11.21s/it] {'loss': 0.0044, 'grad_norm': 2.306348425189829, 'learning_rate': 3.5590062111801237e-07, 'completion_length': 101.70536422729492, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.5803571939468384, 'reward_std': 0.2897626608610153, 'kl': 0.110107421875, 'epoch': 3.22} 64%|██████▍ | 1037/1610 [4:05:26<1:47:04, 11.21s/it] 64%|██████▍ | 1038/1610 [4:05:35<1:41:54, 10.69s/it] {'loss': 0.0036, 'grad_norm': 1.921727256020997, 'learning_rate': 3.5527950310559005e-07, 'completion_length': 85.29464721679688, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.2911729961633682, 'kl': 0.089111328125, 'epoch': 3.22} 64%|██████▍ | 1038/1610 
[4:05:35<1:41:54, 10.69s/it] 65%|██████▍ | 1039/1610 [4:05:46<1:42:25, 10.76s/it] {'loss': 0.0044, 'grad_norm': 2.8059254599725834, 'learning_rate': 3.546583850931677e-07, 'completion_length': 83.91071701049805, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.1548178270459175, 'kl': 0.109375, 'epoch': 3.23} 65%|██████▍ | 1039/1610 [4:05:46<1:42:25, 10.76s/it] 65%|██████▍ | 1040/1610 [4:05:57<1:43:36, 10.91s/it] {'loss': 0.0036, 'grad_norm': 2.205459468478053, 'learning_rate': 3.540372670807453e-07, 'completion_length': 77.46429061889648, 'rewards/accuracy_reward': 0.5178571939468384, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.508928656578064, 'reward_std': 0.15872061252593994, 'kl': 0.0888671875, 'epoch': 3.23} 65%|██████▍ | 1040/1610 [4:05:57<1:43:36, 10.91s/it] 65%|██████▍ | 1041/1610 [4:06:10<1:47:43, 11.36s/it] {'loss': 0.0057, 'grad_norm': 2.570683506551273, 'learning_rate': 3.5341614906832295e-07, 'completion_length': 86.50000381469727, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4910715222358704, 'reward_std': 0.32855626940727234, 'kl': 0.14208984375, 'epoch': 3.23} 65%|██████▍ | 1041/1610 [4:06:10<1:47:43, 11.36s/it] 65%|██████▍ | 1042/1610 [4:06:20<1:46:04, 11.21s/it] {'loss': 0.0045, 'grad_norm': 2.3559590037274094, 'learning_rate': 3.527950310559006e-07, 'completion_length': 70.58036041259766, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1121157705783844, 'kl': 0.1123046875, 'epoch': 3.24} 65%|██████▍ | 1042/1610 [4:06:20<1:46:04, 11.21s/it] 65%|██████▍ | 1043/1610 [4:06:33<1:48:40, 11.50s/it] {'loss': 0.0045, 'grad_norm': 1.2363118157427477, 'learning_rate': 3.5217391304347827e-07, 'completion_length': 73.76786041259766, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 
1.5714285969734192, 'reward_std': 0.13408026099205017, 'kl': 0.111328125, 'epoch': 3.24} 65%|██████▍ | 1043/1610 [4:06:33<1:48:40, 11.50s/it] 65%|██████▍ | 1044/1610 [4:06:42<1:41:49, 10.79s/it] {'loss': 0.0047, 'grad_norm': 2.882837495439166, 'learning_rate': 3.515527950310559e-07, 'completion_length': 78.14286041259766, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803571939468384, 'reward_std': 0.17434299737215042, 'kl': 0.117919921875, 'epoch': 3.24} 65%|██████▍ | 1044/1610 [4:06:42<1:41:49, 10.79s/it] 65%|██████▍ | 1045/1610 [4:06:52<1:39:31, 10.57s/it] {'loss': 0.0046, 'grad_norm': 1.242571383400811, 'learning_rate': 3.5093167701863354e-07, 'completion_length': 76.9285774230957, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.16262339800596237, 'kl': 0.11572265625, 'epoch': 3.25} 65%|██████▍ | 1045/1610 [4:06:52<1:39:31, 10.57s/it] 65%|██████▍ | 1046/1610 [4:07:01<1:35:54, 10.20s/it] {'loss': 0.0043, 'grad_norm': 0.9300153491100144, 'learning_rate': 3.5031055900621117e-07, 'completion_length': 54.41964530944824, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.033065006136894226, 'kl': 0.107666015625, 'epoch': 3.25} 65%|██████▍ | 1046/1610 [4:07:01<1:35:54, 10.20s/it] 65%|██████▌ | 1047/1610 [4:07:12<1:37:12, 10.36s/it] {'loss': 0.0061, 'grad_norm': 1.8458419600048666, 'learning_rate': 3.4968944099378885e-07, 'completion_length': 92.31250381469727, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2946429252624512, 'reward_std': 0.22667086124420166, 'kl': 0.1513671875, 'epoch': 3.25} 65%|██████▌ | 1047/1610 [4:07:12<1:37:12, 10.36s/it] 65%|██████▌ | 1048/1610 [4:07:25<1:43:24, 11.04s/it] {'loss': 0.0048, 'grad_norm': 0.9848711174139086, 'learning_rate': 3.4906832298136643e-07, 'completion_length': 75.18750381469727, 
'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4464285969734192, 'reward_std': 0.08868780359625816, 'kl': 0.120361328125, 'epoch': 3.25} 65%|██████▌ | 1048/1610 [4:07:25<1:43:24, 11.04s/it] 65%|██████▌ | 1049/1610 [4:07:35<1:40:48, 10.78s/it] {'loss': 0.0061, 'grad_norm': 1.041004502350663, 'learning_rate': 3.4844720496894407e-07, 'completion_length': 73.70536041259766, 'rewards/accuracy_reward': 0.6160714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6071429252624512, 'reward_std': 0.11580923572182655, 'kl': 0.1513671875, 'epoch': 3.26} 65%|██████▌ | 1049/1610 [4:07:35<1:40:48, 10.78s/it] 65%|██████▌ | 1050/1610 [4:07:46<1:41:14, 10.85s/it] {'loss': 0.005, 'grad_norm': 1.4043536034462263, 'learning_rate': 3.478260869565217e-07, 'completion_length': 73.75000381469727, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803572535514832, 'reward_std': 0.19233998656272888, 'kl': 0.124267578125, 'epoch': 3.26} 65%|██████▌ | 1050/1610 [4:07:46<1:41:14, 10.85s/it] 65%|██████▌ | 1051/1610 [4:07:56<1:39:51, 10.72s/it] {'loss': 0.004, 'grad_norm': 1.3062376451857078, 'learning_rate': 3.4720496894409933e-07, 'completion_length': 77.5714340209961, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.16262341290712357, 'kl': 0.10009765625, 'epoch': 3.26} 65%|██████▌ | 1051/1610 [4:07:56<1:39:51, 10.72s/it] 65%|██████▌ | 1052/1610 [4:08:07<1:39:52, 10.74s/it] {'loss': 0.0039, 'grad_norm': 1.7551571923577138, 'learning_rate': 3.46583850931677e-07, 'completion_length': 82.19643211364746, 'rewards/accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.7500001192092896, 'reward_std': 0.25237376242876053, 'kl': 0.097900390625, 'epoch': 3.27} 65%|██████▌ | 1052/1610 [4:08:07<1:39:52, 10.74s/it] 65%|██████▌ | 1053/1610 [4:08:19<1:43:04, 11.10s/it] 
{'loss': 0.0054, 'grad_norm': 2.8316929152483117, 'learning_rate': 3.4596273291925465e-07, 'completion_length': 76.45536041259766, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.473214328289032, 'reward_std': 0.23838484287261963, 'kl': 0.135009765625, 'epoch': 3.27} 65%|██████▌ | 1053/1610 [4:08:19<1:43:04, 11.10s/it] 65%|██████▌ | 1054/1610 [4:08:31<1:44:56, 11.32s/it] {'loss': 0.0049, 'grad_norm': 2.133632597969048, 'learning_rate': 3.453416149068323e-07, 'completion_length': 88.39286041259766, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.973214328289032, 'reward': 1.4375000596046448, 'reward_std': 0.25252360105514526, 'kl': 0.1220703125, 'epoch': 3.27} 65%|██████▌ | 1054/1610 [4:08:31<1:44:56, 11.32s/it] 66%|██████▌ | 1055/1610 [4:08:41<1:41:24, 10.96s/it] {'loss': 0.0069, 'grad_norm': 1.9845864403744786, 'learning_rate': 3.447204968944099e-07, 'completion_length': 88.79464721679688, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.21972984820604324, 'kl': 0.172607421875, 'epoch': 3.28} 66%|██████▌ | 1055/1610 [4:08:41<1:41:24, 10.96s/it] 66%|██████▌ | 1056/1610 [4:08:51<1:38:10, 10.63s/it] {'loss': 0.0043, 'grad_norm': 1.5032930441608623, 'learning_rate': 3.440993788819876e-07, 'completion_length': 82.16071701049805, 'rewards/accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.723214328289032, 'reward_std': 0.1704346016049385, 'kl': 0.10693359375, 'epoch': 3.28} 66%|██████▌ | 1056/1610 [4:08:51<1:38:10, 10.63s/it] 66%|██████▌ | 1057/1610 [4:08:59<1:30:38, 9.83s/it] {'loss': 0.0032, 'grad_norm': 2.708525253479944, 'learning_rate': 3.4347826086956523e-07, 'completion_length': 58.33928871154785, 'rewards/accuracy_reward': 0.7321428656578064, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.18276765942573547, 'kl': 0.078857421875, 'epoch': 3.28} 66%|██████▌ 
| 1057/1610 [4:08:59<1:30:38, 9.83s/it] 66%|██████▌ | 1058/1610 [4:09:09<1:32:49, 10.09s/it] {'loss': 0.0055, 'grad_norm': 1.754759195623575, 'learning_rate': 3.4285714285714286e-07, 'completion_length': 96.15179061889648, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5446429252624512, 'reward_std': 0.1800716295838356, 'kl': 0.137451171875, 'epoch': 3.29} 66%|██████▌ | 1058/1610 [4:09:09<1:32:49, 10.09s/it] 66%|██████▌ | 1059/1610 [4:09:20<1:34:33, 10.30s/it] {'loss': 0.0048, 'grad_norm': 2.4881405543239326, 'learning_rate': 3.422360248447205e-07, 'completion_length': 80.58928680419922, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3660714626312256, 'reward_std': 0.24889206886291504, 'kl': 0.12060546875, 'epoch': 3.29} 66%|██████▌ | 1059/1610 [4:09:20<1:34:33, 10.30s/it] 66%|██████▌ | 1060/1610 [4:09:31<1:34:56, 10.36s/it] {'loss': 0.0067, 'grad_norm': 0.9956081387405178, 'learning_rate': 3.416149068322981e-07, 'completion_length': 69.00893020629883, 'rewards/accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.09138382598757744, 'kl': 0.16845703125, 'epoch': 3.29} 66%|██████▌ | 1060/1610 [4:09:31<1:34:56, 10.36s/it] 66%|██████▌ | 1061/1610 [4:09:41<1:34:26, 10.32s/it] {'loss': 0.005, 'grad_norm': 1.661463889552401, 'learning_rate': 3.4099378881987576e-07, 'completion_length': 70.1875, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.473214328289032, 'reward_std': 0.15872061252593994, 'kl': 0.12548828125, 'epoch': 3.3} 66%|██████▌ | 1061/1610 [4:09:41<1:34:26, 10.32s/it] 66%|██████▌ | 1062/1610 [4:09:51<1:35:11, 10.42s/it] {'loss': 0.0084, 'grad_norm': 1.2289103297873352, 'learning_rate': 3.403726708074534e-07, 'completion_length': 79.36607360839844, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.13225442171096802, 'kl': 0.2109375, 'epoch': 3.3} 66%|██████▌ | 1062/1610 [4:09:52<1:35:11, 10.42s/it] 66%|██████▌ | 1063/1610 [4:10:02<1:34:04, 10.32s/it] {'loss': 0.005, 'grad_norm': 2.546238352456347, 'learning_rate': 3.3975155279503103e-07, 'completion_length': 73.3660774230957, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.10882645472884178, 'kl': 0.12548828125, 'epoch': 3.3} 66%|██████▌ | 1063/1610 [4:10:02<1:34:04, 10.32s/it] 66%|██████▌ | 1064/1610 [4:10:12<1:34:30, 10.39s/it] {'loss': 0.0047, 'grad_norm': 2.0759529412456574, 'learning_rate': 3.3913043478260866e-07, 'completion_length': 77.44643020629883, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910714626312256, 'reward_std': 0.19178561866283417, 'kl': 0.117431640625, 'epoch': 3.3} 66%|██████▌ | 1064/1610 [4:10:12<1:34:30, 10.39s/it] 66%|██████▌ | 1065/1610 [4:10:23<1:36:33, 10.63s/it] {'loss': 0.0073, 'grad_norm': 1.3293275240771194, 'learning_rate': 3.385093167701863e-07, 'completion_length': 93.16964721679688, 'rewards/accuracy_reward': 0.267857164144516, 'rewards/format_reward': 1.0, 'reward': 1.2678571939468384, 'reward_std': 0.16323687136173248, 'kl': 0.1826171875, 'epoch': 3.31} 66%|██████▌ | 1065/1610 [4:10:23<1:36:33, 10.63s/it] 66%|██████▌ | 1066/1610 [4:10:32<1:32:09, 10.16s/it] {'loss': 0.0046, 'grad_norm': 1.2570750797088655, 'learning_rate': 3.37888198757764e-07, 'completion_length': 85.80357360839844, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.2338685691356659, 'kl': 0.115966796875, 'epoch': 3.31} 66%|██████▌ | 1066/1610 [4:10:32<1:32:09, 10.16s/it] 66%|██████▋ | 1067/1610 [4:10:42<1:30:57, 10.05s/it] {'loss': 0.0068, 'grad_norm': 1.6286290293511396, 'learning_rate': 3.372670807453416e-07, 'completion_length': 
79.20536041259766, 'rewards/accuracy_reward': 0.294642873108387, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.285714328289032, 'reward_std': 0.14579425007104874, 'kl': 0.1708984375, 'epoch': 3.31} 66%|██████▋ | 1067/1610 [4:10:42<1:30:57, 10.05s/it] 66%|██████▋ | 1068/1610 [4:10:54<1:36:07, 10.64s/it] {'loss': 0.0059, 'grad_norm': 0.7848596896841324, 'learning_rate': 3.3664596273291924e-07, 'completion_length': 78.6964340209961, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803571939468384, 'reward_std': 0.1379830539226532, 'kl': 0.14794921875, 'epoch': 3.32} 66%|██████▋ | 1068/1610 [4:10:54<1:36:07, 10.64s/it] 66%|██████▋ | 1069/1610 [4:11:03<1:30:03, 9.99s/it] {'loss': 0.0038, 'grad_norm': 0.7774256854011623, 'learning_rate': 3.360248447204969e-07, 'completion_length': 68.8839340209961, 'rewards/accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.07003280520439148, 'kl': 0.09423828125, 'epoch': 3.32} 66%|██████▋ | 1069/1610 [4:11:03<1:30:03, 9.99s/it] 66%|██████▋ | 1070/1610 [4:11:12<1:29:27, 9.94s/it] {'loss': 0.0055, 'grad_norm': 2.0477838518887794, 'learning_rate': 3.3540372670807456e-07, 'completion_length': 97.89286041259766, 'rewards/accuracy_reward': 0.339285746216774, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.1575082391500473, 'kl': 0.137939453125, 'epoch': 3.32} 66%|██████▋ | 1070/1610 [4:11:12<1:29:27, 9.94s/it] 67%|██████▋ | 1071/1610 [4:11:22<1:26:49, 9.67s/it] {'loss': 0.0058, 'grad_norm': 1.0038957006077955, 'learning_rate': 3.347826086956522e-07, 'completion_length': 70.94643020629883, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857313156128, 'reward_std': 0.05050762742757797, 'kl': 0.145751953125, 'epoch': 3.33} 67%|██████▋ | 1071/1610 [4:11:22<1:26:49, 9.67s/it] 67%|██████▋ | 1072/1610 [4:11:31<1:27:18, 9.74s/it] {'loss': 0.0052, 
'grad_norm': 2.095387406187745, 'learning_rate': 3.3416149068322977e-07, 'completion_length': 74.83036041259766, 'rewards/accuracy_reward': 0.723214328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.7142857909202576, 'reward_std': 0.19448164105415344, 'kl': 0.128662109375, 'epoch': 3.33} 67%|██████▋ | 1072/1610 [4:11:31<1:27:18, 9.74s/it] 67%|██████▋ | 1073/1610 [4:11:42<1:30:40, 10.13s/it] {'loss': 0.0071, 'grad_norm': 3.3916563605099337, 'learning_rate': 3.335403726708074e-07, 'completion_length': 79.78571891784668, 'rewards/accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.6428571939468384, 'reward_std': 0.19503602385520935, 'kl': 0.177490234375, 'epoch': 3.33} 67%|██████▋ | 1073/1610 [4:11:42<1:30:40, 10.13s/it] 67%|██████▋ | 1074/1610 [4:11:53<1:32:19, 10.34s/it] {'loss': 0.0075, 'grad_norm': 1.1066116656033789, 'learning_rate': 3.3291925465838504e-07, 'completion_length': 85.55357360839844, 'rewards/accuracy_reward': 0.2589285895228386, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2410715222358704, 'reward_std': 0.1379830539226532, 'kl': 0.189208984375, 'epoch': 3.34} 67%|██████▋ | 1074/1610 [4:11:53<1:32:19, 10.34s/it] 67%|██████▋ | 1075/1610 [4:12:05<1:36:28, 10.82s/it] {'loss': 0.0078, 'grad_norm': 2.5412028726438174, 'learning_rate': 3.322981366459627e-07, 'completion_length': 79.89286041259766, 'rewards/accuracy_reward': 0.3660714328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.3392857909202576, 'reward_std': 0.2168245017528534, 'kl': 0.19384765625, 'epoch': 3.34} 67%|██████▋ | 1075/1610 [4:12:05<1:36:28, 10.82s/it] 67%|██████▋ | 1076/1610 [4:12:17<1:39:58, 11.23s/it] {'loss': 0.0087, 'grad_norm': 2.3692393235864206, 'learning_rate': 3.3167701863354036e-07, 'completion_length': 87.88393020629883, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.973214328289032, 'reward': 1.6875001192092896, 'reward_std': 0.32729044556617737, 'kl': 
0.21826171875, 'epoch': 3.34} 67%|██████▋ | 1076/1610 [4:12:17<1:39:58, 11.23s/it] 67%|██████▋ | 1077/1610 [4:12:26<1:32:45, 10.44s/it] {'loss': 0.0034, 'grad_norm': 1.372259025852884, 'learning_rate': 3.31055900621118e-07, 'completion_length': 69.18750190734863, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.05050762742757797, 'kl': 0.083740234375, 'epoch': 3.34} 67%|██████▋ | 1077/1610 [4:12:26<1:32:45, 10.44s/it] 67%|██████▋ | 1078/1610 [4:12:39<1:39:04, 11.17s/it] {'loss': 0.0059, 'grad_norm': 7.705075463921386, 'learning_rate': 3.304347826086956e-07, 'completion_length': 84.61607360839844, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3839285969734192, 'reward_std': 0.22667087614536285, 'kl': 0.1474609375, 'epoch': 3.35} 67%|██████▋ | 1078/1610 [4:12:39<1:39:04, 11.17s/it] 67%|██████▋ | 1079/1610 [4:12:50<1:38:46, 11.16s/it] {'loss': 0.0062, 'grad_norm': 1.1021545683513938, 'learning_rate': 3.298136645962733e-07, 'completion_length': 72.55357360839844, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 1.0, 'reward': 1.598214328289032, 'reward_std': 0.096499003469944, 'kl': 0.154296875, 'epoch': 3.35} 67%|██████▋ | 1079/1610 [4:12:50<1:38:46, 11.16s/it] 67%|██████▋ | 1080/1610 [4:13:02<1:39:24, 11.25s/it] {'loss': 0.0098, 'grad_norm': 1.8165406587008683, 'learning_rate': 3.2919254658385094e-07, 'completion_length': 92.40179061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.329827144742012, 'kl': 0.24560546875, 'epoch': 3.35} 67%|██████▋ | 1080/1610 [4:13:02<1:39:24, 11.25s/it] 67%|██████▋ | 1081/1610 [4:13:11<1:34:45, 10.75s/it] {'loss': 0.0052, 'grad_norm': 1.3896695562866075, 'learning_rate': 3.2857142857142857e-07, 'completion_length': 72.41071701049805, 'rewards/accuracy_reward': 0.4107142984867096, 
'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.2022872269153595, 'kl': 0.12890625, 'epoch': 3.36} 67%|██████▋ | 1081/1610 [4:13:11<1:34:45, 10.75s/it] 67%|██████▋ | 1082/1610 [4:13:22<1:36:03, 10.92s/it] {'loss': 0.0107, 'grad_norm': 1.7977239507424139, 'learning_rate': 3.279503105590062e-07, 'completion_length': 90.80357360839844, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4196429252624512, 'reward_std': 0.3084232807159424, 'kl': 0.26708984375, 'epoch': 3.36} 67%|██████▋ | 1082/1610 [4:13:22<1:36:03, 10.92s/it] 67%|██████▋ | 1083/1610 [4:13:32<1:33:31, 10.65s/it] {'loss': 0.0041, 'grad_norm': 2.386685061400328, 'learning_rate': 3.273291925465838e-07, 'completion_length': 67.17857360839844, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.12175280973315239, 'kl': 0.102783203125, 'epoch': 3.36} 67%|██████▋ | 1083/1610 [4:13:32<1:33:31, 10.65s/it] 67%|██████▋ | 1084/1610 [4:13:43<1:32:00, 10.50s/it] {'loss': 0.0056, 'grad_norm': 1.195041722499004, 'learning_rate': 3.2670807453416147e-07, 'completion_length': 70.64286041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6160714626312256, 'reward_std': 0.12054043263196945, 'kl': 0.14013671875, 'epoch': 3.37} 67%|██████▋ | 1084/1610 [4:13:43<1:32:00, 10.50s/it] 67%|██████▋ | 1085/1610 [4:13:55<1:38:06, 11.21s/it] {'loss': 0.0073, 'grad_norm': 0.9297082704630993, 'learning_rate': 3.260869565217391e-07, 'completion_length': 77.84821701049805, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5267857909202576, 'reward_std': 0.07924874126911163, 'kl': 0.18310546875, 'epoch': 3.37} 67%|██████▋ | 1085/1610 [4:13:55<1:38:06, 11.21s/it] 67%|██████▋ | 1086/1610 [4:14:06<1:36:16, 11.02s/it] {'loss': 0.0066, 'grad_norm': 1.210823540695952, 'learning_rate': 
3.2546583850931673e-07, 'completion_length': 85.25000381469727, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.14579425007104874, 'kl': 0.166015625, 'epoch': 3.37} 67%|██████▋ | 1086/1610 [4:14:06<1:36:16, 11.02s/it] 68%|██████▊ | 1087/1610 [4:14:17<1:37:11, 11.15s/it] {'loss': 0.008, 'grad_norm': 1.1747404698498818, 'learning_rate': 3.2484472049689437e-07, 'completion_length': 82.68750381469727, 'rewards/accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.5178572535514832, 'reward_std': 0.17824579030275345, 'kl': 0.20068359375, 'epoch': 3.38} 68%|██████▊ | 1087/1610 [4:14:17<1:37:11, 11.15s/it] 68%|██████▊ | 1088/1610 [4:14:28<1:35:46, 11.01s/it] {'loss': 0.0071, 'grad_norm': 2.0962469795689174, 'learning_rate': 3.2422360248447205e-07, 'completion_length': 76.19643020629883, 'rewards/accuracy_reward': 0.473214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4553571939468384, 'reward_std': 0.21456117928028107, 'kl': 0.17626953125, 'epoch': 3.38} 68%|██████▊ | 1088/1610 [4:14:28<1:35:46, 11.01s/it] 68%|██████▊ | 1089/1610 [4:14:38<1:33:09, 10.73s/it] {'loss': 0.0046, 'grad_norm': 1.5054220920218513, 'learning_rate': 3.236024844720497e-07, 'completion_length': 78.93750381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.508928656578064, 'reward_std': 0.19690079987049103, 'kl': 0.11376953125, 'epoch': 3.38} 68%|██████▊ | 1089/1610 [4:14:38<1:33:09, 10.73s/it] 68%|██████▊ | 1090/1610 [4:14:47<1:28:51, 10.25s/it] {'loss': 0.0055, 'grad_norm': 2.167180571019216, 'learning_rate': 3.229813664596273e-07, 'completion_length': 75.77679061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.973214328289032, 'reward': 1.598214328289032, 'reward_std': 0.19838443398475647, 'kl': 0.136962890625, 'epoch': 3.39} 68%|██████▊ | 1090/1610 
[4:14:47<1:28:51, 10.25s/it] 68%|██████▊ | 1091/1610 [4:14:58<1:30:25, 10.45s/it] {'loss': 0.0069, 'grad_norm': 1.2705387181558216, 'learning_rate': 3.2236024844720495e-07, 'completion_length': 81.48214340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000001192092896, 'reward_std': 0.16141663491725922, 'kl': 0.1728515625, 'epoch': 3.39} 68%|██████▊ | 1091/1610 [4:14:58<1:30:25, 10.45s/it] 68%|██████▊ | 1092/1610 [4:15:11<1:35:03, 11.01s/it] {'loss': 0.0089, 'grad_norm': 1.55609957607401, 'learning_rate': 3.217391304347826e-07, 'completion_length': 101.58036422729492, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.5714285969734192, 'reward_std': 0.21621102094650269, 'kl': 0.22314453125, 'epoch': 3.39} 68%|██████▊ | 1092/1610 [4:15:11<1:35:03, 11.01s/it] 68%|██████▊ | 1093/1610 [4:15:19<1:28:34, 10.28s/it] {'loss': 0.0043, 'grad_norm': 1.8271362902574, 'learning_rate': 3.2111801242236027e-07, 'completion_length': 70.74107360839844, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.10040178522467613, 'kl': 0.10693359375, 'epoch': 3.39} 68%|██████▊ | 1093/1610 [4:15:19<1:28:34, 10.28s/it] 68%|██████▊ | 1094/1610 [4:15:28<1:23:49, 9.75s/it] {'loss': 0.0054, 'grad_norm': 1.525629542517755, 'learning_rate': 3.204968944099379e-07, 'completion_length': 67.14286231994629, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.18397442996501923, 'kl': 0.135986328125, 'epoch': 3.4} 68%|██████▊ | 1094/1610 [4:15:28<1:23:49, 9.75s/it] 68%|██████▊ | 1095/1610 [4:15:38<1:26:22, 10.06s/it] {'loss': 0.0107, 'grad_norm': 2.4714399868484174, 'learning_rate': 3.198757763975155e-07, 'completion_length': 83.40178680419922, 'rewards/accuracy_reward': 0.625, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 
'reward_std': 0.17063257098197937, 'kl': 0.26611328125, 'epoch': 3.4} 68%|██████▊ | 1095/1610 [4:15:38<1:26:22, 10.06s/it] 68%|██████▊ | 1096/1610 [4:15:47<1:23:13, 9.72s/it] {'loss': 0.0036, 'grad_norm': 1.7516526690643024, 'learning_rate': 3.192546583850931e-07, 'completion_length': 79.1964340209961, 'rewards/accuracy_reward': 0.3303571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3303571939468384, 'reward_std': 0.10882645472884178, 'kl': 0.08984375, 'epoch': 3.4} 68%|██████▊ | 1096/1610 [4:15:47<1:23:13, 9.72s/it] 68%|██████▊ | 1097/1610 [4:15:59<1:26:49, 10.15s/it] {'loss': 0.0078, 'grad_norm': 1.7772438162565656, 'learning_rate': 3.186335403726708e-07, 'completion_length': 90.84821701049805, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.383928656578064, 'reward_std': 0.1827620528638363, 'kl': 0.19580078125, 'epoch': 3.41} 68%|██████▊ | 1097/1610 [4:15:59<1:26:49, 10.15s/it] 68%|██████▊ | 1098/1610 [4:16:08<1:24:45, 9.93s/it] {'loss': 0.0068, 'grad_norm': 0.8279799937198656, 'learning_rate': 3.1801242236024843e-07, 'completion_length': 80.62500381469727, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.633928656578064, 'reward_std': 0.09138382598757744, 'kl': 0.16943359375, 'epoch': 3.41} 68%|██████▊ | 1098/1610 [4:16:08<1:24:45, 9.93s/it] 68%|██████▊ | 1099/1610 [4:16:17<1:22:13, 9.66s/it] {'loss': 0.0041, 'grad_norm': 2.281298322785382, 'learning_rate': 3.1739130434782606e-07, 'completion_length': 80.33036041259766, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.2767857909202576, 'reward_std': 0.19178561866283417, 'kl': 0.101806640625, 'epoch': 3.41} 68%|██████▊ | 1099/1610 [4:16:17<1:22:13, 9.66s/it] 68%|██████▊ | 1100/1610 [4:16:27<1:23:51, 9.86s/it] {'loss': 0.0044, 'grad_norm': 2.1551534259175265, 'learning_rate': 3.167701863354037e-07, 'completion_length': 74.08036231994629, 
'rewards/accuracy_reward': 0.366071455180645, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.28767116367816925, 'kl': 0.111328125, 'epoch': 3.42} 68%|██████▊ | 1100/1610 [4:16:27<1:23:51, 9.86s/it] 68%|██████▊ | 1101/1610 [4:17:19<3:11:01, 22.52s/it] {'loss': 0.0068, 'grad_norm': 1.5204569536266284, 'learning_rate': 3.1614906832298133e-07, 'completion_length': 70.54464721679688, 'rewards/accuracy_reward': 0.357142873108387, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3392857313156128, 'reward_std': 0.15481781587004662, 'kl': 0.16943359375, 'epoch': 3.42} 68%|██████▊ | 1101/1610 [4:17:19<3:11:01, 22.52s/it] 68%|██████▊ | 1102/1610 [4:17:32<2:45:54, 19.59s/it] {'loss': 0.0054, 'grad_norm': 1.4395607955086251, 'learning_rate': 3.15527950310559e-07, 'completion_length': 86.80357360839844, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3392857909202576, 'reward_std': 0.1575082391500473, 'kl': 0.135498046875, 'epoch': 3.42} 68%|██████▊ | 1102/1610 [4:17:32<2:45:54, 19.59s/it] 69%|██████▊ | 1103/1610 [4:17:42<2:21:41, 16.77s/it] {'loss': 0.0084, 'grad_norm': 1.5467226238656142, 'learning_rate': 3.1490683229813665e-07, 'completion_length': 71.70536231994629, 'rewards/accuracy_reward': 0.705357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6875000596046448, 'reward_std': 0.18518681079149246, 'kl': 0.2099609375, 'epoch': 3.43} 69%|██████▊ | 1103/1610 [4:17:42<2:21:41, 16.77s/it] 69%|██████▊ | 1104/1610 [4:17:52<2:02:14, 14.50s/it] {'loss': 0.0058, 'grad_norm': 0.798885132011226, 'learning_rate': 3.142857142857143e-07, 'completion_length': 78.67857360839844, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4732143878936768, 'reward_std': 0.03696779906749725, 'kl': 0.145751953125, 'epoch': 3.43} 69%|██████▊ | 1104/1610 [4:17:52<2:02:14, 14.50s/it] 69%|██████▊ | 1105/1610 [4:18:04<1:56:03, 13.79s/it] 
{'loss': 0.0092, 'grad_norm': 1.1502407276419602, 'learning_rate': 3.136645962732919e-07, 'completion_length': 79.63393020629883, 'rewards/accuracy_reward': 0.2321428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.223214328289032, 'reward_std': 0.07576144114136696, 'kl': 0.23046875, 'epoch': 3.43} 69%|██████▊ | 1105/1610 [4:18:04<1:56:03, 13.79s/it] 69%|██████▊ | 1106/1610 [4:18:13<1:45:07, 12.51s/it] {'loss': 0.0055, 'grad_norm': 1.9234354598724392, 'learning_rate': 3.130434782608696e-07, 'completion_length': 82.23214721679688, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5446429252624512, 'reward_std': 0.18518681079149246, 'kl': 0.136474609375, 'epoch': 3.43} 69%|██████▊ | 1106/1610 [4:18:13<1:45:07, 12.51s/it] 69%|██████▉ | 1107/1610 [4:18:26<1:46:44, 12.73s/it] {'loss': 0.0097, 'grad_norm': 2.6360471843800295, 'learning_rate': 3.1242236024844723e-07, 'completion_length': 90.91964721679688, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5625000596046448, 'reward_std': 0.35668954253196716, 'kl': 0.2412109375, 'epoch': 3.44} 69%|██████▉ | 1107/1610 [4:18:26<1:46:44, 12.73s/it] 69%|██████▉ | 1108/1610 [4:18:37<1:40:57, 12.07s/it] {'loss': 0.0071, 'grad_norm': 1.2854004388887572, 'learning_rate': 3.118012422360248e-07, 'completion_length': 72.43750381469727, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3214285969734192, 'reward_std': 0.11272924020886421, 'kl': 0.1767578125, 'epoch': 3.44} 69%|██████▉ | 1108/1610 [4:18:37<1:40:57, 12.07s/it] 69%|██████▉ | 1109/1610 [4:18:48<1:38:30, 11.80s/it] {'loss': 0.0083, 'grad_norm': 2.4398384332314316, 'learning_rate': 3.1118012422360244e-07, 'completion_length': 79.91964340209961, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4910715222358704, 'reward_std': 0.11332815885543823, 
'kl': 0.20654296875, 'epoch': 3.44} 69%|██████▉ | 1109/1610 [4:18:48<1:38:30, 11.80s/it] 69%|██████▉ | 1110/1610 [4:19:00<1:37:32, 11.70s/it] {'loss': 0.0078, 'grad_norm': 1.1268392310156492, 'learning_rate': 3.105590062111801e-07, 'completion_length': 78.17857360839844, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6517857909202576, 'reward_std': 0.1418914571404457, 'kl': 0.19580078125, 'epoch': 3.45} 69%|██████▉ | 1110/1610 [4:19:00<1:37:32, 11.70s/it] 69%|██████▉ | 1111/1610 [4:19:14<1:43:00, 12.39s/it] {'loss': 0.0174, 'grad_norm': 2.647115748302458, 'learning_rate': 3.0993788819875776e-07, 'completion_length': 118.52679443359375, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.9196429252624512, 'reward': 1.3035714626312256, 'reward_std': 0.4017099142074585, 'kl': 0.4345703125, 'epoch': 3.45} 69%|██████▉ | 1111/1610 [4:19:14<1:43:00, 12.39s/it] 69%|██████▉ | 1112/1610 [4:19:25<1:40:12, 12.07s/it] {'loss': 0.006, 'grad_norm': 3.372930552404311, 'learning_rate': 3.093167701863354e-07, 'completion_length': 102.2589340209961, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4464285969734192, 'reward_std': 0.2606116533279419, 'kl': 0.14990234375, 'epoch': 3.45} 69%|██████▉ | 1112/1610 [4:19:25<1:40:12, 12.07s/it] 69%|██████▉ | 1113/1610 [4:19:34<1:33:30, 11.29s/it] {'loss': 0.0042, 'grad_norm': 1.2161434800652073, 'learning_rate': 3.08695652173913e-07, 'completion_length': 87.77679061889648, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803571939468384, 'reward_std': 0.08747543022036552, 'kl': 0.10400390625, 'epoch': 3.46} 69%|██████▉ | 1113/1610 [4:19:34<1:33:30, 11.29s/it] 69%|██████▉ | 1114/1610 [4:19:47<1:37:42, 11.82s/it] {'loss': 0.0085, 'grad_norm': 1.5722579167426605, 'learning_rate': 3.0807453416149066e-07, 'completion_length': 82.51786041259766, 'rewards/accuracy_reward': 
0.3571428805589676, 'rewards/format_reward': 0.973214328289032, 'reward': 1.3303572535514832, 'reward_std': 0.19234000146389008, 'kl': 0.21240234375, 'epoch': 3.46} 69%|██████▉ | 1114/1610 [4:19:47<1:37:42, 11.82s/it] 69%|██████▉ | 1115/1610 [4:19:59<1:37:22, 11.80s/it] {'loss': 0.0072, 'grad_norm': 1.6328063927398269, 'learning_rate': 3.0745341614906834e-07, 'completion_length': 74.01786041259766, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.973214328289032, 'reward': 1.6517857909202576, 'reward_std': 0.15360543876886368, 'kl': 0.18115234375, 'epoch': 3.46} 69%|██████▉ | 1115/1610 [4:19:59<1:37:22, 11.80s/it] 69%|██████▉ | 1116/1610 [4:20:11<1:37:32, 11.85s/it] {'loss': 0.0135, 'grad_norm': 1.9021992932660878, 'learning_rate': 3.06832298136646e-07, 'completion_length': 91.83036041259766, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3928571939468384, 'reward_std': 0.26647355407476425, 'kl': 0.33642578125, 'epoch': 3.47} 69%|██████▉ | 1116/1610 [4:20:11<1:37:32, 11.85s/it] 69%|██████▉ | 1117/1610 [4:20:24<1:40:11, 12.19s/it] {'loss': 0.0096, 'grad_norm': 4.152746609301304, 'learning_rate': 3.062111801242236e-07, 'completion_length': 94.0714340209961, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571429252624512, 'reward_std': 0.2207328975200653, 'kl': 0.240234375, 'epoch': 3.47} 69%|██████▉ | 1117/1610 [4:20:24<1:40:11, 12.19s/it] 69%|██████▉ | 1118/1610 [4:20:35<1:37:00, 11.83s/it] {'loss': 0.0098, 'grad_norm': 1.703624556364044, 'learning_rate': 3.0559006211180124e-07, 'completion_length': 80.20536041259766, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.3928571939468384, 'reward_std': 0.17885925620794296, 'kl': 0.24560546875, 'epoch': 3.47} 69%|██████▉ | 1118/1610 [4:20:35<1:37:00, 11.83s/it] 70%|██████▉ | 1119/1610 [4:20:48<1:40:20, 12.26s/it] {'loss': 0.017, 
'grad_norm': 3.1386105292784237, 'learning_rate': 3.049689440993788e-07, 'completion_length': 105.28571701049805, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.21765289455652237, 'kl': 0.42578125, 'epoch': 3.48} 70%|██████▉ | 1119/1610 [4:20:48<1:40:20, 12.26s/it] 70%|██████▉ | 1120/1610 [4:20:59<1:36:48, 11.85s/it] {'loss': 0.0094, 'grad_norm': 1.4709692037250681, 'learning_rate': 3.043478260869565e-07, 'completion_length': 74.83036041259766, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6875000596046448, 'reward_std': 0.12054044008255005, 'kl': 0.23486328125, 'epoch': 3.48} 70%|██████▉ | 1120/1610 [4:20:59<1:36:48, 11.85s/it] 70%|██████▉ | 1121/1610 [4:21:12<1:39:06, 12.16s/it] {'loss': 0.0166, 'grad_norm': 4.572289725269342, 'learning_rate': 3.0372670807453414e-07, 'completion_length': 100.7589340209961, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.955357164144516, 'reward': 1.2946429252624512, 'reward_std': 0.27493715286254883, 'kl': 0.4140625, 'epoch': 3.48} 70%|██████▉ | 1121/1610 [4:21:12<1:39:06, 12.16s/it] 70%|██████▉ | 1122/1610 [4:21:26<1:41:57, 12.54s/it] {'loss': 0.0118, 'grad_norm': 2.821161871643887, 'learning_rate': 3.0310559006211177e-07, 'completion_length': 90.31250381469727, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.3928571939468384, 'reward_std': 0.2579100430011749, 'kl': 0.2939453125, 'epoch': 3.48} 70%|██████▉ | 1122/1610 [4:21:26<1:41:57, 12.54s/it] 70%|██████▉ | 1123/1610 [4:21:36<1:37:19, 11.99s/it] {'loss': 0.009, 'grad_norm': 2.197050033396061, 'learning_rate': 3.024844720496894e-07, 'completion_length': 77.66071701049805, 'rewards/accuracy_reward': 0.5803571790456772, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.5535714626312256, 'reward_std': 0.19407302141189575, 'kl': 0.225341796875, 
'epoch': 3.49} 70%|██████▉ | 1123/1610 [4:21:36<1:37:19, 11.99s/it] 70%|██████▉ | 1124/1610 [4:21:48<1:36:39, 11.93s/it] {'loss': 0.0089, 'grad_norm': 1.209024598846721, 'learning_rate': 3.018633540372671e-07, 'completion_length': 75.00893020629883, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6517857909202576, 'reward_std': 0.17495645582675934, 'kl': 0.22216796875, 'epoch': 3.49} 70%|██████▉ | 1124/1610 [4:21:48<1:36:39, 11.93s/it] 70%|██████▉ | 1125/1610 [4:21:59<1:34:12, 11.65s/it] {'loss': 0.0108, 'grad_norm': 1.0048667834428955, 'learning_rate': 3.012422360248447e-07, 'completion_length': 79.01786041259766, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4732143878936768, 'reward_std': 0.10799804702401161, 'kl': 0.26953125, 'epoch': 3.49} 70%|██████▉ | 1125/1610 [4:21:59<1:34:12, 11.65s/it] 70%|██████▉ | 1126/1610 [4:22:12<1:37:31, 12.09s/it] {'loss': 0.0126, 'grad_norm': 5.623040975334638, 'learning_rate': 3.0062111801242235e-07, 'completion_length': 87.37500381469727, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.973214328289032, 'reward': 1.5625000596046448, 'reward_std': 0.2989785820245743, 'kl': 0.31640625, 'epoch': 3.5} 70%|██████▉ | 1126/1610 [4:22:12<1:37:31, 12.09s/it] 70%|███████ | 1127/1610 [4:22:25<1:39:44, 12.39s/it] {'loss': 0.0126, 'grad_norm': 3.045791896457133, 'learning_rate': 3e-07, 'completion_length': 109.97322082519531, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3660715222358704, 'reward_std': 0.4050913155078888, 'kl': 0.31494140625, 'epoch': 3.5} 70%|███████ | 1127/1610 [4:22:25<1:39:44, 12.39s/it] 70%|███████ | 1128/1610 [4:22:37<1:37:38, 12.15s/it] {'loss': 0.0108, 'grad_norm': 2.8142485058264306, 'learning_rate': 2.993788819875776e-07, 'completion_length': 86.25000381469727, 'rewards/accuracy_reward': 0.339285746216774, 
'rewards/format_reward': 0.9732142984867096, 'reward': 1.3125000596046448, 'reward_std': 0.14700662717223167, 'kl': 0.26904296875, 'epoch': 3.5} 70%|███████ | 1128/1610 [4:22:37<1:37:38, 12.15s/it] 70%|███████ | 1129/1610 [4:22:48<1:35:32, 11.92s/it] {'loss': 0.0078, 'grad_norm': 1.3937030425566568, 'learning_rate': 2.987577639751553e-07, 'completion_length': 83.63393020629883, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3660714626312256, 'reward_std': 0.15150833129882812, 'kl': 0.19482421875, 'epoch': 3.51} 70%|███████ | 1129/1610 [4:22:48<1:35:32, 11.92s/it] 70%|███████ | 1130/1610 [4:22:57<1:27:42, 10.96s/it] {'loss': 0.0055, 'grad_norm': 1.6130291770138196, 'learning_rate': 2.9813664596273294e-07, 'completion_length': 59.16964530944824, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 1.0, 'reward': 1.5267857909202576, 'reward_std': 0.08747543022036552, 'kl': 0.13720703125, 'epoch': 3.51} 70%|███████ | 1130/1610 [4:22:57<1:27:42, 10.96s/it] 70%|███████ | 1131/1610 [4:23:08<1:27:20, 10.94s/it] {'loss': 0.0102, 'grad_norm': 9.450409719865426, 'learning_rate': 2.975155279503105e-07, 'completion_length': 83.68750381469727, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4732143878936768, 'reward_std': 0.16444925218820572, 'kl': 0.2548828125, 'epoch': 3.51} 70%|███████ | 1131/1610 [4:23:08<1:27:20, 10.94s/it] 70%|███████ | 1132/1610 [4:23:20<1:29:30, 11.24s/it] {'loss': 0.0132, 'grad_norm': 2.275113390610501, 'learning_rate': 2.9689440993788815e-07, 'completion_length': 81.08929061889648, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.321428656578064, 'reward_std': 0.17426397651433945, 'kl': 0.3310546875, 'epoch': 3.52} 70%|███████ | 1132/1610 [4:23:20<1:29:30, 11.24s/it] 70%|███████ | 1133/1610 [4:23:33<1:33:26, 11.75s/it] {'loss': 0.0166, 'grad_norm': 1.5672176546962246, 
'learning_rate': 2.9627329192546583e-07, 'completion_length': 94.73214721679688, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2410715222358704, 'reward_std': 0.20349961519241333, 'kl': 0.416015625, 'epoch': 3.52} 70%|███████ | 1133/1610 [4:23:33<1:33:26, 11.75s/it] 70%|███████ | 1134/1610 [4:23:46<1:36:27, 12.16s/it] {'loss': 0.0105, 'grad_norm': 1.5634926667365545, 'learning_rate': 2.9565217391304347e-07, 'completion_length': 84.58036041259766, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.473214328289032, 'reward_std': 0.17264440655708313, 'kl': 0.26220703125, 'epoch': 3.52} 70%|███████ | 1134/1610 [4:23:46<1:36:27, 12.16s/it] 70%|███████ | 1135/1610 [4:23:57<1:34:17, 11.91s/it] {'loss': 0.0173, 'grad_norm': 2.254051341927566, 'learning_rate': 2.950310559006211e-07, 'completion_length': 81.40179061889648, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.4285715222358704, 'reward_std': 0.1548178270459175, 'kl': 0.431640625, 'epoch': 3.52} 70%|███████ | 1135/1610 [4:23:57<1:34:17, 11.91s/it] 71%|███████ | 1136/1610 [4:24:08<1:31:48, 11.62s/it] {'loss': 0.0139, 'grad_norm': 1.311175839049226, 'learning_rate': 2.9440993788819873e-07, 'completion_length': 72.44643020629883, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 0.973214328289032, 'reward': 1.321428656578064, 'reward_std': 0.20511049777269363, 'kl': 0.345703125, 'epoch': 3.53} 71%|███████ | 1136/1610 [4:24:08<1:31:48, 11.62s/it] 71%|███████ | 1137/1610 [4:24:22<1:36:35, 12.25s/it] {'loss': 0.0192, 'grad_norm': 2.8977519257970936, 'learning_rate': 2.9378881987577636e-07, 'completion_length': 117.6964340209961, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.2946429252624512, 'reward_std': 0.31738631427288055, 'kl': 0.48046875, 'epoch': 3.53} 71%|███████ | 
1137/1610 [4:24:22<1:36:35, 12.25s/it] 71%|███████ | 1138/1610 [4:24:33<1:34:20, 11.99s/it] {'loss': 0.0091, 'grad_norm': 1.597945788117461, 'learning_rate': 2.9316770186335405e-07, 'completion_length': 65.96428680419922, 'rewards/accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.758928656578064, 'reward_std': 0.1379830539226532, 'kl': 0.22802734375, 'epoch': 3.53} 71%|███████ | 1138/1610 [4:24:33<1:34:20, 11.99s/it] 71%|███████ | 1139/1610 [4:24:42<1:26:45, 11.05s/it] {'loss': 0.0094, 'grad_norm': 1.5938391890749246, 'learning_rate': 2.925465838509317e-07, 'completion_length': 58.58928680419922, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.16323687136173248, 'kl': 0.234375, 'epoch': 3.54} 71%|███████ | 1139/1610 [4:24:42<1:26:45, 11.05s/it] 71%|███████ | 1140/1610 [4:24:56<1:32:11, 11.77s/it] {'loss': 0.0235, 'grad_norm': 4.467369253203156, 'learning_rate': 2.919254658385093e-07, 'completion_length': 99.85714721679688, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9464286267757416, 'reward': 1.4821429252624512, 'reward_std': 0.2377769872546196, 'kl': 0.587890625, 'epoch': 3.54} 71%|███████ | 1140/1610 [4:24:56<1:32:11, 11.77s/it] 71%|███████ | 1141/1610 [4:25:09<1:35:13, 12.18s/it] {'loss': 0.0173, 'grad_norm': 3.3497174545536, 'learning_rate': 2.9130434782608695e-07, 'completion_length': 92.95536041259766, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.571428656578064, 'reward_std': 0.2300388216972351, 'kl': 0.4326171875, 'epoch': 3.54} 71%|███████ | 1141/1610 [4:25:09<1:35:13, 12.18s/it] 71%|███████ | 1142/1610 [4:25:22<1:37:44, 12.53s/it] {'loss': 0.0229, 'grad_norm': 2.6242700325038357, 'learning_rate': 2.9068322981366463e-07, 'completion_length': 110.98214721679688, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 
0.9017857611179352, 'reward': 1.508928656578064, 'reward_std': 0.3045148700475693, 'kl': 0.572265625, 'epoch': 3.55} 71%|███████ | 1142/1610 [4:25:22<1:37:44, 12.53s/it] 71%|███████ | 1143/1610 [4:25:33<1:34:01, 12.08s/it] {'loss': 0.0105, 'grad_norm': 2.064977196208422, 'learning_rate': 2.900621118012422e-07, 'completion_length': 64.43750381469727, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.25029680132865906, 'kl': 0.26220703125, 'epoch': 3.55} 71%|███████ | 1143/1610 [4:25:33<1:34:01, 12.08s/it] 71%|███████ | 1144/1610 [4:25:43<1:29:45, 11.56s/it] {'loss': 0.0071, 'grad_norm': 1.7354685864716, 'learning_rate': 2.8944099378881985e-07, 'completion_length': 71.28571701049805, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3125000596046448, 'reward_std': 0.20020467042922974, 'kl': 0.176513671875, 'epoch': 3.55} 71%|███████ | 1144/1610 [4:25:43<1:29:45, 11.56s/it] 71%|███████ | 1145/1610 [4:25:56<1:32:32, 11.94s/it] {'loss': 0.0126, 'grad_norm': 1.7691604165950092, 'learning_rate': 2.888198757763975e-07, 'completion_length': 79.66071701049805, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 0.973214328289032, 'reward': 1.3928571939468384, 'reward_std': 0.13035528734326363, 'kl': 0.31396484375, 'epoch': 3.56} 71%|███████ | 1145/1610 [4:25:56<1:32:32, 11.94s/it] 71%|███████ | 1146/1610 [4:26:09<1:34:50, 12.26s/it] {'loss': 0.0117, 'grad_norm': 1.7808019217044042, 'learning_rate': 2.881987577639751e-07, 'completion_length': 76.96428680419922, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.973214328289032, 'reward': 1.4821429252624512, 'reward_std': 0.17226044088602066, 'kl': 0.29345703125, 'epoch': 3.56} 71%|███████ | 1146/1610 [4:26:09<1:34:50, 12.26s/it] 71%|███████ | 1147/1610 [4:26:21<1:34:22, 12.23s/it] {'loss': 0.0075, 'grad_norm': 1.4598163441715684, 
'learning_rate': 2.875776397515528e-07, 'completion_length': 75.79464721679688, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5357143878936768, 'reward_std': 0.12444322556257248, 'kl': 0.187255859375, 'epoch': 3.56} 71%|███████ | 1147/1610 [4:26:21<1:34:22, 12.23s/it] 71%|███████▏ | 1148/1610 [4:26:32<1:31:11, 11.84s/it] {'loss': 0.0072, 'grad_norm': 1.28151230632944, 'learning_rate': 2.8695652173913043e-07, 'completion_length': 69.10714721679688, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803571939468384, 'reward_std': 0.12054044008255005, 'kl': 0.18017578125, 'epoch': 3.57} 71%|███████▏ | 1148/1610 [4:26:32<1:31:11, 11.84s/it] 71%|███████▏ | 1149/1610 [4:26:45<1:31:56, 11.97s/it] {'loss': 0.0193, 'grad_norm': 1.7921560162853953, 'learning_rate': 2.8633540372670806e-07, 'completion_length': 75.91964721679688, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6250000596046448, 'reward_std': 0.19112761318683624, 'kl': 0.4814453125, 'epoch': 3.57} 71%|███████▏ | 1149/1610 [4:26:45<1:31:56, 11.97s/it] 71%|███████▏ | 1150/1610 [4:26:58<1:34:44, 12.36s/it] {'loss': 0.0256, 'grad_norm': 1.7042746917364078, 'learning_rate': 2.857142857142857e-07, 'completion_length': 96.35714721679688, 'rewards/accuracy_reward': 0.4464286118745804, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3571429252624512, 'reward_std': 0.23360932245850563, 'kl': 0.6416015625, 'epoch': 3.57} 71%|███████▏ | 1150/1610 [4:26:58<1:34:44, 12.36s/it] 71%|███████▏ | 1151/1610 [4:27:11<1:35:08, 12.44s/it] {'loss': 0.0116, 'grad_norm': 3.141332347059437, 'learning_rate': 2.850931677018634e-07, 'completion_length': 86.41964721679688, 'rewards/accuracy_reward': 0.3928571790456772, 'rewards/format_reward': 0.955357164144516, 'reward': 1.348214328289032, 'reward_std': 0.20411308109760284, 'kl': 0.29052734375, 'epoch': 3.57} 
71%|███████▏ | 1151/1610 [4:27:11<1:35:08, 12.44s/it] 72%|███████▏ | 1152/1610 [4:27:24<1:36:41, 12.67s/it] {'loss': 0.0219, 'grad_norm': 2.6778144914333435, 'learning_rate': 2.84472049689441e-07, 'completion_length': 97.02679061889648, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.2500000596046448, 'reward_std': 0.3334621340036392, 'kl': 0.546875, 'epoch': 3.58} 72%|███████▏ | 1152/1610 [4:27:24<1:36:41, 12.67s/it] 72%|███████▏ | 1153/1610 [4:27:34<1:31:13, 11.98s/it] {'loss': 0.0122, 'grad_norm': 2.557994565031611, 'learning_rate': 2.8385093167701864e-07, 'completion_length': 72.7589340209961, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5803571939468384, 'reward_std': 0.19751426577568054, 'kl': 0.30419921875, 'epoch': 3.58} 72%|███████▏ | 1153/1610 [4:27:34<1:31:13, 11.98s/it] 72%|███████▏ | 1154/1610 [4:27:46<1:31:53, 12.09s/it] {'loss': 0.0166, 'grad_norm': 2.7621433076904647, 'learning_rate': 2.832298136645963e-07, 'completion_length': 78.89286041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5178572535514832, 'reward_std': 0.11663764715194702, 'kl': 0.4150390625, 'epoch': 3.58} 72%|███████▏ | 1154/1610 [4:27:46<1:31:53, 12.09s/it] 72%|███████▏ | 1155/1610 [4:27:57<1:28:32, 11.68s/it] {'loss': 0.0085, 'grad_norm': 2.44807147685025, 'learning_rate': 2.8260869565217386e-07, 'completion_length': 61.24107551574707, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535714626312256, 'reward_std': 0.1890895813703537, 'kl': 0.212890625, 'epoch': 3.59} 72%|███████▏ | 1155/1610 [4:27:57<1:28:32, 11.68s/it] 72%|███████▏ | 1156/1610 [4:28:08<1:27:31, 11.57s/it] {'loss': 0.0091, 'grad_norm': 3.7366991638322165, 'learning_rate': 2.8198757763975154e-07, 'completion_length': 67.89286041259766, 'rewards/accuracy_reward': 0.2857143059372902, 
'rewards/format_reward': 0.9910714626312256, 'reward': 1.2767857909202576, 'reward_std': 0.13736959546804428, 'kl': 0.2265625, 'epoch': 3.59} 72%|███████▏ | 1156/1610 [4:28:08<1:27:31, 11.57s/it] 72%|███████▏ | 1157/1610 [4:28:21<1:30:16, 11.96s/it] {'loss': 0.0081, 'grad_norm': 3.27332170039315, 'learning_rate': 2.813664596273292e-07, 'completion_length': 80.51786041259766, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.5267857313156128, 'reward_std': 0.23534663021564484, 'kl': 0.20263671875, 'epoch': 3.59} 72%|███████▏ | 1157/1610 [4:28:21<1:30:16, 11.96s/it] 72%|███████▏ | 1158/1610 [4:28:34<1:32:25, 12.27s/it] {'loss': 0.0179, 'grad_norm': 2.0741550317359714, 'learning_rate': 2.807453416149068e-07, 'completion_length': 77.02679061889648, 'rewards/accuracy_reward': 0.3660714328289032, 'rewards/format_reward': 0.955357164144516, 'reward': 1.3214285969734192, 'reward_std': 0.2798429876565933, 'kl': 0.447265625, 'epoch': 3.6} 72%|███████▏ | 1158/1610 [4:28:34<1:32:25, 12.27s/it] 72%|███████▏ | 1159/1610 [4:28:47<1:34:05, 12.52s/it] {'loss': 0.0184, 'grad_norm': 2.2833868577276633, 'learning_rate': 2.8012422360248444e-07, 'completion_length': 73.5, 'rewards/accuracy_reward': 0.4017857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3660714626312256, 'reward_std': 0.16531942784786224, 'kl': 0.4619140625, 'epoch': 3.6} 72%|███████▏ | 1159/1610 [4:28:47<1:34:05, 12.52s/it] 72%|███████▏ | 1160/1610 [4:28:58<1:28:58, 11.86s/it] {'loss': 0.0076, 'grad_norm': 1.9159099515685414, 'learning_rate': 2.7950310559006207e-07, 'completion_length': 65.80357551574707, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4285714626312256, 'reward_std': 0.22276806831359863, 'kl': 0.189697265625, 'epoch': 3.6} 72%|███████▏ | 1160/1610 [4:28:58<1:28:58, 11.86s/it] 72%|███████▏ | 1161/1610 [4:29:11<1:30:42, 12.12s/it] {'loss': 0.0133, 'grad_norm': 
2.262328138469998, 'learning_rate': 2.7888198757763976e-07, 'completion_length': 71.19643020629883, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.732142984867096, 'reward_std': 0.19978917390108109, 'kl': 0.33349609375, 'epoch': 3.61} 72%|███████▏ | 1161/1610 [4:29:11<1:30:42, 12.12s/it] 72%|███████▏ | 1162/1610 [4:29:21<1:27:02, 11.66s/it] {'loss': 0.01, 'grad_norm': 2.3554014499178604, 'learning_rate': 2.782608695652174e-07, 'completion_length': 60.83035850524902, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4196429252624512, 'reward_std': 0.17104807496070862, 'kl': 0.24951171875, 'epoch': 3.61} 72%|███████▏ | 1162/1610 [4:29:21<1:27:02, 11.66s/it] 72%|███████▏ | 1163/1610 [4:29:31<1:24:00, 11.28s/it] {'loss': 0.0125, 'grad_norm': 1.3641962431543848, 'learning_rate': 2.77639751552795e-07, 'completion_length': 71.90178871154785, 'rewards/accuracy_reward': 0.508928582072258, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.4821429252624512, 'reward_std': 0.11792761087417603, 'kl': 0.31103515625, 'epoch': 3.61} 72%|███████▏ | 1163/1610 [4:29:31<1:24:00, 11.28s/it] 72%|███████▏ | 1164/1610 [4:29:42<1:21:25, 10.95s/it] {'loss': 0.0106, 'grad_norm': 2.0647651553058566, 'learning_rate': 2.7701863354037266e-07, 'completion_length': 69.73214530944824, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.7053572535514832, 'reward_std': 0.17468241974711418, 'kl': 0.26513671875, 'epoch': 3.61} 72%|███████▏ | 1164/1610 [4:29:42<1:21:25, 10.95s/it] 72%|███████▏ | 1165/1610 [4:29:53<1:21:32, 10.99s/it] {'loss': 0.0097, 'grad_norm': 2.3142390798038446, 'learning_rate': 2.7639751552795034e-07, 'completion_length': 64.72321891784668, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.383928656578064, 'reward_std': 0.19668584316968918, 'kl': 0.24267578125, 
'epoch': 3.62} 72%|███████▏ | 1165/1610 [4:29:53<1:21:32, 10.99s/it] 72%|███████▏ | 1166/1610 [4:30:05<1:25:01, 11.49s/it] {'loss': 0.0132, 'grad_norm': 3.604066807507998, 'learning_rate': 2.7577639751552797e-07, 'completion_length': 72.30357360839844, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.5535714626312256, 'reward_std': 0.2858007550239563, 'kl': 0.3291015625, 'epoch': 3.62} 72%|███████▏ | 1166/1610 [4:30:05<1:25:01, 11.49s/it] 72%|███████▏ | 1167/1610 [4:30:16<1:22:35, 11.19s/it] {'loss': 0.014, 'grad_norm': 2.6486352782195905, 'learning_rate': 2.7515527950310555e-07, 'completion_length': 66.94643020629883, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.19938186556100845, 'kl': 0.34912109375, 'epoch': 3.62} 72%|███████▏ | 1167/1610 [4:30:16<1:22:35, 11.19s/it] 73%|███████▎ | 1168/1610 [4:30:29<1:26:53, 11.80s/it] {'loss': 0.0188, 'grad_norm': 2.1781344448521582, 'learning_rate': 2.745341614906832e-07, 'completion_length': 82.67857360839844, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2857143878936768, 'reward_std': 0.24809947609901428, 'kl': 0.46875, 'epoch': 3.63} 73%|███████▎ | 1168/1610 [4:30:29<1:26:53, 11.80s/it] 73%|███████▎ | 1169/1610 [4:30:39<1:22:37, 11.24s/it] {'loss': 0.0162, 'grad_norm': 2.4090896311414958, 'learning_rate': 2.739130434782608e-07, 'completion_length': 74.12500190734863, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 0.9464286267757416, 'reward': 1.4375000596046448, 'reward_std': 0.2833384945988655, 'kl': 0.4052734375, 'epoch': 3.63} 73%|███████▎ | 1169/1610 [4:30:39<1:22:37, 11.24s/it] 73%|███████▎ | 1170/1610 [4:30:49<1:20:17, 10.95s/it] {'loss': 0.0181, 'grad_norm': 2.680890440506886, 'learning_rate': 2.732919254658385e-07, 'completion_length': 82.08929061889648, 'rewards/accuracy_reward': 
0.4821428954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4285714626312256, 'reward_std': 0.2351585552096367, 'kl': 0.453125, 'epoch': 3.63} 73%|███████▎ | 1170/1610 [4:30:49<1:20:17, 10.95s/it] 73%|███████▎ | 1171/1610 [4:31:02<1:24:45, 11.59s/it] {'loss': 0.0237, 'grad_norm': 3.5783947395577087, 'learning_rate': 2.7267080745341614e-07, 'completion_length': 84.5089340209961, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.3839285969734192, 'reward_std': 0.2831638306379318, 'kl': 0.591796875, 'epoch': 3.64} 73%|███████▎ | 1171/1610 [4:31:02<1:24:45, 11.59s/it] 73%|███████▎ | 1172/1610 [4:31:11<1:17:00, 10.55s/it] {'loss': 0.0056, 'grad_norm': 2.2883300398986535, 'learning_rate': 2.7204968944099377e-07, 'completion_length': 57.34821701049805, 'rewards/accuracy_reward': 0.7232142984867096, 'rewards/format_reward': 1.0, 'reward': 1.7232143878936768, 'reward_std': 0.14700661599636078, 'kl': 0.138671875, 'epoch': 3.64} 73%|███████▎ | 1172/1610 [4:31:11<1:17:00, 10.55s/it] 73%|███████▎ | 1173/1610 [4:31:22<1:18:26, 10.77s/it] {'loss': 0.01, 'grad_norm': 1.6570317143346114, 'learning_rate': 2.714285714285714e-07, 'completion_length': 67.28571701049805, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.973214328289032, 'reward': 1.535714328289032, 'reward_std': 0.14970265328884125, 'kl': 0.25, 'epoch': 3.64} 73%|███████▎ | 1173/1610 [4:31:22<1:18:26, 10.77s/it] 73%|███████▎ | 1174/1610 [4:31:30<1:13:29, 10.11s/it] {'loss': 0.0062, 'grad_norm': 1.034576752870539, 'learning_rate': 2.708074534161491e-07, 'completion_length': 55.41071701049805, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.08747542649507523, 'kl': 0.15478515625, 'epoch': 3.65} 73%|███████▎ | 1174/1610 [4:31:30<1:13:29, 10.11s/it] 73%|███████▎ | 1175/1610 [4:31:40<1:13:06, 10.08s/it] {'loss': 0.0135, 'grad_norm': 2.1332007678463536, 
'learning_rate': 2.701863354037267e-07, 'completion_length': 68.00893211364746, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.508928656578064, 'reward_std': 0.18334399163722992, 'kl': 0.3388671875, 'epoch': 3.65} 73%|███████▎ | 1175/1610 [4:31:40<1:13:06, 10.08s/it] 73%|███████▎ | 1176/1610 [4:31:51<1:14:29, 10.30s/it] {'loss': 0.0103, 'grad_norm': 2.2847212186385515, 'learning_rate': 2.6956521739130435e-07, 'completion_length': 66.73214721679688, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.973214328289032, 'reward': 1.6339285969734192, 'reward_std': 0.24419668316841125, 'kl': 0.25830078125, 'epoch': 3.65} 73%|███████▎ | 1176/1610 [4:31:51<1:14:29, 10.30s/it] 73%|███████▎ | 1177/1610 [4:31:59<1:09:28, 9.63s/it] {'loss': 0.0072, 'grad_norm': 1.2401507592108043, 'learning_rate': 2.68944099378882e-07, 'completion_length': 56.875003814697266, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.17919921875, 'epoch': 3.66} 73%|███████▎ | 1177/1610 [4:31:59<1:09:28, 9.63s/it] 73%|███████▎ | 1178/1610 [4:32:09<1:10:19, 9.77s/it] {'loss': 0.0067, 'grad_norm': 2.6406428714814667, 'learning_rate': 2.6832298136645956e-07, 'completion_length': 66.89286041259766, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.4285714626312256, 'reward_std': 0.16323686763644218, 'kl': 0.16796875, 'epoch': 3.66} 73%|███████▎ | 1178/1610 [4:32:09<1:10:19, 9.77s/it] 73%|███████▎ | 1179/1610 [4:32:20<1:11:12, 9.91s/it] {'loss': 0.0196, 'grad_norm': 1.239595223992422, 'learning_rate': 2.6770186335403725e-07, 'completion_length': 73.39285850524902, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.955357164144516, 'reward': 1.4285715222358704, 'reward_std': 0.07103024423122406, 'kl': 0.48974609375, 'epoch': 3.66} 73%|███████▎ | 1179/1610 
[4:32:20<1:11:12, 9.91s/it] 73%|███████▎ | 1180/1610 [4:32:32<1:17:05, 10.76s/it] {'loss': 0.0259, 'grad_norm': 5.5915917136034885, 'learning_rate': 2.670807453416149e-07, 'completion_length': 80.70536041259766, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.2767857909202576, 'reward_std': 0.3490789085626602, 'kl': 0.650390625, 'epoch': 3.66} 73%|███████▎ | 1180/1610 [4:32:32<1:17:05, 10.76s/it] 73%|███████▎ | 1181/1610 [4:32:43<1:16:38, 10.72s/it] {'loss': 0.0108, 'grad_norm': 4.242370133568407, 'learning_rate': 2.664596273291925e-07, 'completion_length': 56.73214530944824, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4285714626312256, 'reward_std': 0.11272924020886421, 'kl': 0.26953125, 'epoch': 3.67} 73%|███████▎ | 1181/1610 [4:32:43<1:16:38, 10.72s/it] 73%|███████▎ | 1182/1610 [4:32:56<1:21:27, 11.42s/it] {'loss': 0.0244, 'grad_norm': 2.868114677850159, 'learning_rate': 2.6583850931677015e-07, 'completion_length': 78.83929061889648, 'rewards/accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 0.928571492433548, 'reward': 1.6875000596046448, 'reward_std': 0.2562864422798157, 'kl': 0.609375, 'epoch': 3.67} 73%|███████▎ | 1182/1610 [4:32:56<1:21:27, 11.42s/it] 73%|███████▎ | 1183/1610 [4:33:06<1:18:18, 11.00s/it] {'loss': 0.0055, 'grad_norm': 2.8219036553553023, 'learning_rate': 2.6521739130434783e-07, 'completion_length': 59.78571701049805, 'rewards/accuracy_reward': 0.6339286118745804, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.6160715222358704, 'reward_std': 0.20020467042922974, 'kl': 0.13671875, 'epoch': 3.67} 73%|███████▎ | 1183/1610 [4:33:06<1:18:18, 11.00s/it] 74%|███████▎ | 1184/1610 [4:33:19<1:21:26, 11.47s/it] {'loss': 0.0186, 'grad_norm': 2.3980665306531024, 'learning_rate': 2.6459627329192547e-07, 'completion_length': 69.52679061889648, 'rewards/accuracy_reward': 0.375, 'rewards/format_reward': 0.9642857611179352, 
'reward': 1.3392857909202576, 'reward_std': 0.20141705870628357, 'kl': 0.4658203125, 'epoch': 3.68} 74%|███████▎ | 1184/1610 [4:33:19<1:21:26, 11.47s/it] 74%|███████▎ | 1185/1610 [4:33:32<1:24:20, 11.91s/it] {'loss': 0.0126, 'grad_norm': 3.3553775458272916, 'learning_rate': 2.639751552795031e-07, 'completion_length': 71.91964721679688, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.973214328289032, 'reward': 1.4017857909202576, 'reward_std': 0.20922823250293732, 'kl': 0.31396484375, 'epoch': 3.68} 74%|███████▎ | 1185/1610 [4:33:32<1:24:20, 11.91s/it] 74%|███████▎ | 1186/1610 [4:33:42<1:20:04, 11.33s/it] {'loss': 0.0094, 'grad_norm': 4.278512104331532, 'learning_rate': 2.6335403726708073e-07, 'completion_length': 56.43750190734863, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3214285969734192, 'reward_std': 0.19503602385520935, 'kl': 0.236328125, 'epoch': 3.68} 74%|███████▎ | 1186/1610 [4:33:42<1:20:04, 11.33s/it] 74%|███████▎ | 1187/1610 [4:33:51<1:16:50, 10.90s/it] {'loss': 0.0078, 'grad_norm': 4.986868492191444, 'learning_rate': 2.6273291925465836e-07, 'completion_length': 60.70535850524902, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.2383313775062561, 'kl': 0.196044921875, 'epoch': 3.69} 74%|███████▎ | 1187/1610 [4:33:51<1:16:50, 10.90s/it] 74%|███████▍ | 1188/1610 [4:34:02<1:15:32, 10.74s/it] {'loss': 0.0114, 'grad_norm': 3.7441600218971476, 'learning_rate': 2.6211180124223605e-07, 'completion_length': 66.66071701049805, 'rewards/accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6785715222358704, 'reward_std': 0.21541155874729156, 'kl': 0.28369140625, 'epoch': 3.69} 74%|███████▍ | 1188/1610 [4:34:02<1:15:32, 10.74s/it] 74%|███████▍ | 1189/1610 [4:34:14<1:17:30, 11.05s/it] {'loss': 0.0116, 'grad_norm': 1.7867403111613316, 'learning_rate': 
2.614906832298137e-07, 'completion_length': 60.72321701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.973214328289032, 'reward': 1.5803572535514832, 'reward_std': 0.18849068880081177, 'kl': 0.291015625, 'epoch': 3.69} 74%|███████▍ | 1189/1610 [4:34:14<1:17:30, 11.05s/it] 74%|███████▍ | 1190/1610 [4:34:26<1:20:14, 11.46s/it] {'loss': 0.0175, 'grad_norm': 4.456957791702991, 'learning_rate': 2.6086956521739126e-07, 'completion_length': 71.29464340209961, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 0.955357164144516, 'reward': 1.446428656578064, 'reward_std': 0.2734343856573105, 'kl': 0.4375, 'epoch': 3.7} 74%|███████▍ | 1190/1610 [4:34:26<1:20:14, 11.46s/it] 74%|███████▍ | 1191/1610 [4:34:36<1:17:06, 11.04s/it] {'loss': 0.0118, 'grad_norm': 2.24175396944388, 'learning_rate': 2.602484472049689e-07, 'completion_length': 60.25000190734863, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.20080918818712234, 'kl': 0.29541015625, 'epoch': 3.7} 74%|███████▍ | 1191/1610 [4:34:36<1:17:06, 11.04s/it] 74%|███████▍ | 1192/1610 [4:34:49<1:20:03, 11.49s/it] {'loss': 0.0231, 'grad_norm': 3.204574358123133, 'learning_rate': 2.596273291925466e-07, 'completion_length': 70.19643020629883, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.3392857909202576, 'reward_std': 0.27139637619256973, 'kl': 0.5771484375, 'epoch': 3.7} 74%|███████▍ | 1192/1610 [4:34:49<1:20:03, 11.49s/it] 74%|███████▍ | 1193/1610 [4:34:59<1:17:31, 11.15s/it] {'loss': 0.0211, 'grad_norm': 8.578005711544694, 'learning_rate': 2.590062111801242e-07, 'completion_length': 65.53571510314941, 'rewards/accuracy_reward': 0.705357164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6517857909202576, 'reward_std': 0.25180534273386, 'kl': 0.5283203125, 'epoch': 3.7} 74%|███████▍ | 1193/1610 [4:34:59<1:17:31, 
11.15s/it] 74%|███████▍ | 1194/1610 [4:35:06<1:09:50, 10.07s/it] {'loss': 0.0086, 'grad_norm': 1.1320626513303975, 'learning_rate': 2.5838509316770184e-07, 'completion_length': 54.75893020629883, 'rewards/accuracy_reward': 0.5625000447034836, 'rewards/format_reward': 1.0, 'reward': 1.5625001192092896, 'reward_std': 0.07003280520439148, 'kl': 0.21484375, 'epoch': 3.71} 74%|███████▍ | 1194/1610 [4:35:06<1:09:50, 10.07s/it] 74%|███████▍ | 1195/1610 [4:35:17<1:11:03, 10.27s/it] {'loss': 0.011, 'grad_norm': 1.2854009751415207, 'learning_rate': 2.577639751552795e-07, 'completion_length': 62.36607551574707, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.410714328289032, 'reward_std': 0.12444322556257248, 'kl': 0.27490234375, 'epoch': 3.71} 74%|███████▍ | 1195/1610 [4:35:17<1:11:03, 10.27s/it] 74%|███████▍ | 1196/1610 [4:35:30<1:16:07, 11.03s/it] {'loss': 0.0372, 'grad_norm': 2.4499983083071557, 'learning_rate': 2.571428571428571e-07, 'completion_length': 84.8214340209961, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.9464286267757416, 'reward': 1.5000001192092896, 'reward_std': 0.30465375632047653, 'kl': 0.931640625, 'epoch': 3.71} 74%|███████▍ | 1196/1610 [4:35:30<1:16:07, 11.03s/it] 74%|███████▍ | 1197/1610 [4:35:42<1:17:21, 11.24s/it] {'loss': 0.0295, 'grad_norm': 6.068853814770092, 'learning_rate': 2.565217391304348e-07, 'completion_length': 84.33928871154785, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9196429252624512, 'reward': 1.4642857909202576, 'reward_std': 0.3423607647418976, 'kl': 0.73828125, 'epoch': 3.72} 74%|███████▍ | 1197/1610 [4:35:42<1:17:21, 11.24s/it] 74%|███████▍ | 1198/1610 [4:35:52<1:15:23, 10.98s/it] {'loss': 0.0098, 'grad_norm': 2.7690442255983743, 'learning_rate': 2.5590062111801243e-07, 'completion_length': 68.12500190734863, 'rewards/accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 0.9732142984867096, 'reward': 
1.4285715222358704, 'reward_std': 0.16141663491725922, 'kl': 0.24462890625, 'epoch': 3.72} 74%|███████▍ | 1198/1610 [4:35:52<1:15:23, 10.98s/it] 74%|███████▍ | 1199/1610 [4:36:05<1:18:41, 11.49s/it] {'loss': 0.0113, 'grad_norm': 2.7028438306465166, 'learning_rate': 2.5527950310559006e-07, 'completion_length': 63.776790618896484, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4285715222358704, 'reward_std': 0.25420519709587097, 'kl': 0.28369140625, 'epoch': 3.72} 74%|███████▍ | 1199/1610 [4:36:05<1:18:41, 11.49s/it] 75%|███████▍ | 1200/1610 [4:36:18<1:21:03, 11.86s/it] {'loss': 0.0205, 'grad_norm': 2.569694795823898, 'learning_rate': 2.546583850931677e-07, 'completion_length': 73.59821701049805, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.955357164144516, 'reward': 1.473214328289032, 'reward_std': 0.20901328325271606, 'kl': 0.51171875, 'epoch': 3.73} 75%|███████▍ | 1200/1610 [4:36:18<1:21:03, 11.86s/it] 75%|███████▍ | 1201/1610 [4:37:16<2:56:05, 25.83s/it] {'loss': 0.0173, 'grad_norm': 3.5666846358677766, 'learning_rate': 2.540372670807454e-07, 'completion_length': 69.60714340209961, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.5803572535514832, 'reward_std': 0.31515534222126007, 'kl': 0.431640625, 'epoch': 3.73} 75%|███████▍ | 1201/1610 [4:37:16<2:56:05, 25.83s/it] 75%|███████▍ | 1202/1610 [4:37:29<2:29:34, 22.00s/it] {'loss': 0.022, 'grad_norm': 1.858700880360028, 'learning_rate': 2.5341614906832296e-07, 'completion_length': 80.12500381469727, 'rewards/accuracy_reward': 0.2410714477300644, 'rewards/format_reward': 0.9464286267757416, 'reward': 1.1875, 'reward_std': 0.20478501170873642, 'kl': 0.548828125, 'epoch': 3.73} 75%|███████▍ | 1202/1610 [4:37:29<2:29:34, 22.00s/it] 75%|███████▍ | 1203/1610 [4:37:41<2:09:26, 19.08s/it] {'loss': 0.0145, 'grad_norm': 1.8597501161413228, 'learning_rate': 2.527950310559006e-07, 
'completion_length': 65.36607551574707, 'rewards/accuracy_reward': 0.5089286118745804, 'rewards/format_reward': 0.955357164144516, 'reward': 1.4642857313156128, 'reward_std': 0.18506748229265213, 'kl': 0.361328125, 'epoch': 3.74} 75%|███████▍ | 1203/1610 [4:37:41<2:09:26, 19.08s/it] 75%|███████▍ | 1204/1610 [4:37:54<1:55:41, 17.10s/it] {'loss': 0.012, 'grad_norm': 2.097315249748622, 'learning_rate': 2.521739130434782e-07, 'completion_length': 60.47321701049805, 'rewards/accuracy_reward': 0.223214291036129, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.2053571939468384, 'reward_std': 0.20411306619644165, 'kl': 0.298828125, 'epoch': 3.74} 75%|███████▍ | 1204/1610 [4:37:54<1:55:41, 17.10s/it] 75%|███████▍ | 1205/1610 [4:38:07<1:47:04, 15.86s/it] {'loss': 0.0353, 'grad_norm': 2.74415500097876, 'learning_rate': 2.5155279503105585e-07, 'completion_length': 84.83928680419922, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3660715222358704, 'reward_std': 0.22025534510612488, 'kl': 0.884765625, 'epoch': 3.74} 75%|███████▍ | 1205/1610 [4:38:07<1:47:04, 15.86s/it] 75%|███████▍ | 1206/1610 [4:38:17<1:34:38, 14.05s/it] {'loss': 0.009, 'grad_norm': 1.620267414291647, 'learning_rate': 2.5093167701863354e-07, 'completion_length': 52.85714530944824, 'rewards/accuracy_reward': 0.7946428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7767858505249023, 'reward_std': 0.11536617204546928, 'kl': 0.223388671875, 'epoch': 3.75} 75%|███████▍ | 1206/1610 [4:38:17<1:34:38, 14.05s/it] 75%|███████▍ | 1207/1610 [4:38:27<1:27:58, 13.10s/it] {'loss': 0.0186, 'grad_norm': 3.219488441011074, 'learning_rate': 2.5031055900621117e-07, 'completion_length': 69.17857360839844, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.473214328289032, 'reward_std': 0.2630308121442795, 'kl': 0.464111328125, 'epoch': 3.75} 75%|███████▍ | 1207/1610 [4:38:27<1:27:58, 13.10s/it] 
75%|███████▌ | 1208/1610 [4:38:37<1:21:35, 12.18s/it] {'loss': 0.0334, 'grad_norm': 1.9929327874247484, 'learning_rate': 2.496894409937888e-07, 'completion_length': 83.09821701049805, 'rewards/accuracy_reward': 0.25, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.1875000596046448, 'reward_std': 0.19588638842105865, 'kl': 0.8369140625, 'epoch': 3.75} 75%|███████▌ | 1208/1610 [4:38:37<1:21:35, 12.18s/it] 75%|███████▌ | 1209/1610 [4:38:51<1:23:34, 12.50s/it] {'loss': 0.0246, 'grad_norm': 2.3963570487857875, 'learning_rate': 2.4906832298136644e-07, 'completion_length': 71.70536041259766, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.955357164144516, 'reward': 1.5625001192092896, 'reward_std': 0.24146483093500137, 'kl': 0.615234375, 'epoch': 3.75} 75%|███████▌ | 1209/1610 [4:38:51<1:23:34, 12.50s/it] 75%|███████▌ | 1210/1610 [4:39:04<1:24:32, 12.68s/it] {'loss': 0.0487, 'grad_norm': 2.691345803186223, 'learning_rate': 2.4844720496894407e-07, 'completion_length': 92.39286041259766, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4642857909202576, 'reward_std': 0.29615846276283264, 'kl': 1.216796875, 'epoch': 3.76} 75%|███████▌ | 1210/1610 [4:39:04<1:24:32, 12.68s/it] 75%|███████▌ | 1211/1610 [4:39:17<1:24:46, 12.75s/it] {'loss': 0.0321, 'grad_norm': 2.5817804947577527, 'learning_rate': 2.4782608695652176e-07, 'completion_length': 75.16071701049805, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5625001192092896, 'reward_std': 0.16587380319833755, 'kl': 0.8046875, 'epoch': 3.76} 75%|███████▌ | 1211/1610 [4:39:17<1:24:46, 12.75s/it] 75%|███████▌ | 1212/1610 [4:39:29<1:24:19, 12.71s/it] {'loss': 0.0247, 'grad_norm': 3.0815478647941883, 'learning_rate': 2.472049689440994e-07, 'completion_length': 68.40178871154785, 'rewards/accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 0.9553571939468384, 'reward': 
1.6428571939468384, 'reward_std': 0.29119790345430374, 'kl': 0.6171875, 'epoch': 3.76} 75%|███████▌ | 1212/1610 [4:39:29<1:24:19, 12.71s/it] 75%|███████▌ | 1213/1610 [4:39:40<1:20:01, 12.09s/it] {'loss': 0.0318, 'grad_norm': 2.94813869909509, 'learning_rate': 2.46583850931677e-07, 'completion_length': 64.24107551574707, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 0.955357164144516, 'reward': 1.7053571939468384, 'reward_std': 0.15212181210517883, 'kl': 0.7939453125, 'epoch': 3.77} 75%|███████▌ | 1213/1610 [4:39:40<1:20:01, 12.09s/it] 75%|███████▌ | 1214/1610 [4:39:53<1:22:20, 12.47s/it] {'loss': 0.064, 'grad_norm': 3.0866185250476015, 'learning_rate': 2.4596273291925465e-07, 'completion_length': 111.22321701049805, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.8571429252624512, 'reward': 1.3571429252624512, 'reward_std': 0.36404046416282654, 'kl': 1.59375, 'epoch': 3.77} 75%|███████▌ | 1214/1610 [4:39:53<1:22:20, 12.47s/it] 75%|███████▌ | 1215/1610 [4:40:04<1:18:02, 11.86s/it] {'loss': 0.0428, 'grad_norm': 4.103838968645473, 'learning_rate': 2.453416149068323e-07, 'completion_length': 81.91964721679688, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.383928656578064, 'reward_std': 0.28782738745212555, 'kl': 1.067626953125, 'epoch': 3.77} 75%|███████▌ | 1215/1610 [4:40:04<1:18:02, 11.86s/it] 76%|███████▌ | 1216/1610 [4:40:17<1:20:24, 12.24s/it] {'loss': 0.0398, 'grad_norm': 4.728380143400901, 'learning_rate': 2.447204968944099e-07, 'completion_length': 79.25893020629883, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.3928571939468384, 'reward_std': 0.39645305275917053, 'kl': 0.994140625, 'epoch': 3.78} 76%|███████▌ | 1216/1610 [4:40:17<1:20:24, 12.24s/it] 76%|███████▌ | 1217/1610 [4:40:30<1:20:50, 12.34s/it] {'loss': 0.0484, 'grad_norm': 3.266055131255479, 'learning_rate': 2.4409937888198755e-07, 
'completion_length': 86.68750381469727, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.910714328289032, 'reward': 1.348214328289032, 'reward_std': 0.347409263253212, 'kl': 1.2109375, 'epoch': 3.78} 76%|███████▌ | 1217/1610 [4:40:30<1:20:50, 12.34s/it] 76%|███████▌ | 1218/1610 [4:40:43<1:22:25, 12.61s/it] {'loss': 0.038, 'grad_norm': 4.054748177498233, 'learning_rate': 2.4347826086956524e-07, 'completion_length': 78.38393020629883, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.3571429252624512, 'reward_std': 0.2770591676235199, 'kl': 0.94921875, 'epoch': 3.78} 76%|███████▌ | 1218/1610 [4:40:43<1:22:25, 12.61s/it] 76%|███████▌ | 1219/1610 [4:40:55<1:22:20, 12.64s/it] {'loss': 0.0344, 'grad_norm': 2.516237475199758, 'learning_rate': 2.4285714285714287e-07, 'completion_length': 71.36607360839844, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.2946428656578064, 'reward_std': 0.28108689188957214, 'kl': 0.8583984375, 'epoch': 3.79} 76%|███████▌ | 1219/1610 [4:40:55<1:22:20, 12.64s/it] 76%|███████▌ | 1220/1610 [4:41:08<1:21:39, 12.56s/it] {'loss': 0.0317, 'grad_norm': 2.336842507222085, 'learning_rate': 2.422360248447205e-07, 'completion_length': 62.05357551574707, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.4553571939468384, 'reward_std': 0.15360543876886368, 'kl': 0.791015625, 'epoch': 3.79} 76%|███████▌ | 1220/1610 [4:41:08<1:21:39, 12.56s/it] 76%|███████▌ | 1221/1610 [4:41:20<1:21:33, 12.58s/it] {'loss': 0.0377, 'grad_norm': 2.1725459588087555, 'learning_rate': 2.4161490683229813e-07, 'completion_length': 71.29464721679688, 'rewards/accuracy_reward': 0.339285746216774, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.2946429252624512, 'reward_std': 0.18722482025623322, 'kl': 0.94140625, 'epoch': 3.79} 76%|███████▌ | 1221/1610 [4:41:20<1:21:33, 12.58s/it] 
76%|███████▌ | 1222/1610 [4:41:33<1:21:42, 12.64s/it] {'loss': 0.0267, 'grad_norm': 2.9370796521447637, 'learning_rate': 2.4099378881987577e-07, 'completion_length': 64.72321701049805, 'rewards/accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 0.973214328289032, 'reward': 1.6428572535514832, 'reward_std': 0.17063257098197937, 'kl': 0.6650390625, 'epoch': 3.8} 76%|███████▌ | 1222/1610 [4:41:33<1:21:42, 12.64s/it] 76%|███████▌ | 1223/1610 [4:41:46<1:22:38, 12.81s/it] {'loss': 0.0398, 'grad_norm': 2.0181559329539493, 'learning_rate': 2.403726708074534e-07, 'completion_length': 73.75893020629883, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.955357164144516, 'reward': 1.5982143878936768, 'reward_std': 0.298794150352478, 'kl': 0.994140625, 'epoch': 3.8} 76%|███████▌ | 1223/1610 [4:41:46<1:22:38, 12.81s/it] 76%|███████▌ | 1224/1610 [4:41:59<1:22:15, 12.79s/it] {'loss': 0.0374, 'grad_norm': 2.204657690315157, 'learning_rate': 2.3975155279503103e-07, 'completion_length': 72.18750381469727, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5803572535514832, 'reward_std': 0.25129425525665283, 'kl': 0.935546875, 'epoch': 3.8} 76%|███████▌ | 1224/1610 [4:41:59<1:22:15, 12.79s/it] 76%|███████▌ | 1225/1610 [4:42:12<1:21:31, 12.71s/it] {'loss': 0.0167, 'grad_norm': 3.6266681964033762, 'learning_rate': 2.391304347826087e-07, 'completion_length': 54.705360412597656, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.973214328289032, 'reward': 1.669642984867096, 'reward_std': 0.23057925701141357, 'kl': 0.4169921875, 'epoch': 3.8} 76%|███████▌ | 1225/1610 [4:42:12<1:21:31, 12.71s/it] 76%|███████▌ | 1226/1610 [4:42:24<1:20:19, 12.55s/it] {'loss': 0.0304, 'grad_norm': 1.7427174205842124, 'learning_rate': 2.385093167701863e-07, 'completion_length': 70.9375, 'rewards/accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 0.955357164144516, 'reward': 
1.607142984867096, 'reward_std': 0.168435238301754, 'kl': 0.759765625, 'epoch': 3.81} 76%|███████▌ | 1226/1610 [4:42:24<1:20:19, 12.55s/it] 76%|███████▌ | 1227/1610 [4:42:36<1:20:01, 12.54s/it] {'loss': 0.0375, 'grad_norm': 2.4195233791626514, 'learning_rate': 2.3788819875776398e-07, 'completion_length': 69.91964721679688, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9375000596046448, 'reward': 1.6160714626312256, 'reward_std': 0.24620163440704346, 'kl': 0.93798828125, 'epoch': 3.81} 76%|███████▌ | 1227/1610 [4:42:36<1:20:01, 12.54s/it] 76%|███████▋ | 1228/1610 [4:42:46<1:15:08, 11.80s/it] {'loss': 0.0358, 'grad_norm': 3.021732334597263, 'learning_rate': 2.3726708074534161e-07, 'completion_length': 67.66071510314941, 'rewards/accuracy_reward': 0.5625, 'rewards/format_reward': 0.955357164144516, 'reward': 1.5178572535514832, 'reward_std': 0.1890895962715149, 'kl': 0.892822265625, 'epoch': 3.81} 76%|███████▋ | 1228/1610 [4:42:46<1:15:08, 11.80s/it] 76%|███████▋ | 1229/1610 [4:42:59<1:17:00, 12.13s/it] {'loss': 0.0383, 'grad_norm': 2.271199596769912, 'learning_rate': 2.3664596273291925e-07, 'completion_length': 69.66071701049805, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.955357164144516, 'reward': 1.6517857909202576, 'reward_std': 0.24229324609041214, 'kl': 0.958984375, 'epoch': 3.82} 76%|███████▋ | 1229/1610 [4:42:59<1:17:00, 12.13s/it] 76%|███████▋ | 1230/1610 [4:43:14<1:21:08, 12.81s/it] {'loss': 0.0578, 'grad_norm': 3.717416948797325, 'learning_rate': 2.3602484472049688e-07, 'completion_length': 80.75893020629883, 'rewards/accuracy_reward': 0.4910714328289032, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.410714328289032, 'reward_std': 0.21703942120075226, 'kl': 1.4453125, 'epoch': 3.82} 76%|███████▋ | 1230/1610 [4:43:14<1:21:08, 12.81s/it] 76%|███████▋ | 1231/1610 [4:43:24<1:15:41, 11.98s/it] {'loss': 0.0355, 'grad_norm': 2.905600217066292, 'learning_rate': 2.354037267080745e-07, 
'completion_length': 59.84821701049805, 'rewards/accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.3125000596046448, 'reward_std': 0.22728431224822998, 'kl': 0.888671875, 'epoch': 3.82} 76%|███████▋ | 1231/1610 [4:43:24<1:15:41, 11.98s/it] 77%|███████▋ | 1232/1610 [4:43:36<1:15:49, 12.04s/it] {'loss': 0.0257, 'grad_norm': 2.229514502811787, 'learning_rate': 2.3478260869565217e-07, 'completion_length': 61.58928871154785, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.3303571939468384, 'reward_std': 0.1664527915418148, 'kl': 0.6435546875, 'epoch': 3.83} 77%|███████▋ | 1232/1610 [4:43:36<1:15:49, 12.04s/it] 77%|███████▋ | 1233/1610 [4:43:46<1:12:05, 11.47s/it] {'loss': 0.061, 'grad_norm': 3.820980944471744, 'learning_rate': 2.341614906832298e-07, 'completion_length': 76.00893020629883, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4910715222358704, 'reward_std': 0.2140055149793625, 'kl': 1.52685546875, 'epoch': 3.83} 77%|███████▋ | 1233/1610 [4:43:46<1:12:05, 11.47s/it] 77%|███████▋ | 1234/1610 [4:43:57<1:10:08, 11.19s/it] {'loss': 0.0238, 'grad_norm': 2.1065018845284573, 'learning_rate': 2.3354037267080746e-07, 'completion_length': 69.14286231994629, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.18683947622776031, 'kl': 0.59423828125, 'epoch': 3.83} 77%|███████▋ | 1234/1610 [4:43:57<1:10:08, 11.19s/it] 77%|███████▋ | 1235/1610 [4:44:07<1:07:28, 10.80s/it] {'loss': 0.0088, 'grad_norm': 2.2365399268665582, 'learning_rate': 2.3291925465838507e-07, 'completion_length': 56.29464530944824, 'rewards/accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.7500000596046448, 'reward_std': 0.2053254470229149, 'kl': 0.220703125, 'epoch': 3.84} 77%|███████▋ | 1235/1610 [4:44:07<1:07:28, 
10.80s/it] 77%|███████▋ | 1236/1610 [4:44:16<1:05:31, 10.51s/it] {'loss': 0.0254, 'grad_norm': 2.5137243872055786, 'learning_rate': 2.3229813664596273e-07, 'completion_length': 57.410715103149414, 'rewards/accuracy_reward': 0.4910714328289032, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.4642857909202576, 'reward_std': 0.19319036602973938, 'kl': 0.63525390625, 'epoch': 3.84} 77%|███████▋ | 1236/1610 [4:44:16<1:05:31, 10.51s/it] 77%|███████▋ | 1237/1610 [4:44:29<1:08:53, 11.08s/it] {'loss': 0.0328, 'grad_norm': 2.018635994826644, 'learning_rate': 2.3167701863354036e-07, 'completion_length': 64.05357360839844, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3392857909202576, 'reward_std': 0.22936686873435974, 'kl': 0.818359375, 'epoch': 3.84} 77%|███████▋ | 1237/1610 [4:44:29<1:08:53, 11.08s/it] 77%|███████▋ | 1238/1610 [4:44:42<1:11:56, 11.60s/it] {'loss': 0.0605, 'grad_norm': 2.6336351806161, 'learning_rate': 2.31055900621118e-07, 'completion_length': 71.53571701049805, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4196429252624512, 'reward_std': 0.29912567138671875, 'kl': 1.51171875, 'epoch': 3.84} 77%|███████▋ | 1238/1610 [4:44:42<1:11:56, 11.60s/it] 77%|███████▋ | 1239/1610 [4:44:51<1:08:19, 11.05s/it] {'loss': 0.0195, 'grad_norm': 2.5630103232973562, 'learning_rate': 2.3043478260869565e-07, 'completion_length': 63.86607360839844, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.5446429252624512, 'reward_std': 0.15933407843112946, 'kl': 0.486328125, 'epoch': 3.85} 77%|███████▋ | 1239/1610 [4:44:51<1:08:19, 11.05s/it] 77%|███████▋ | 1240/1610 [4:45:04<1:11:22, 11.57s/it] {'loss': 0.0288, 'grad_norm': 2.272132675371321, 'learning_rate': 2.2981366459627326e-07, 'completion_length': 69.86607360839844, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 
0.955357164144516, 'reward': 1.3750000596046448, 'reward_std': 0.21596594154834747, 'kl': 0.720703125, 'epoch': 3.85} 77%|███████▋ | 1240/1610 [4:45:04<1:11:22, 11.57s/it] 77%|███████▋ | 1241/1610 [4:45:17<1:12:38, 11.81s/it] {'loss': 0.043, 'grad_norm': 2.9377031209697324, 'learning_rate': 2.2919254658385092e-07, 'completion_length': 72.37500190734863, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.5446429252624512, 'reward_std': 0.28355342149734497, 'kl': 1.072265625, 'epoch': 3.85} 77%|███████▋ | 1241/1610 [4:45:17<1:12:38, 11.81s/it] 77%|███████▋ | 1242/1610 [4:45:29<1:14:03, 12.08s/it] {'loss': 0.0247, 'grad_norm': 2.8880044322216385, 'learning_rate': 2.2857142857142855e-07, 'completion_length': 58.82143020629883, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.973214328289032, 'reward': 1.5089285969734192, 'reward_std': 0.17776241898536682, 'kl': 0.6171875, 'epoch': 3.86} 77%|███████▋ | 1242/1610 [4:45:29<1:14:03, 12.08s/it] 77%|███████▋ | 1243/1610 [4:45:42<1:15:31, 12.35s/it] {'loss': 0.0552, 'grad_norm': 8.701279787573153, 'learning_rate': 2.279503105590062e-07, 'completion_length': 69.13393020629883, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.223214328289032, 'reward_std': 0.38759300112724304, 'kl': 1.3828125, 'epoch': 3.86} 77%|███████▋ | 1243/1610 [4:45:42<1:15:31, 12.35s/it] 77%|███████▋ | 1244/1610 [4:45:54<1:14:33, 12.22s/it] {'loss': 0.0498, 'grad_norm': 2.4882846096986433, 'learning_rate': 2.2732919254658384e-07, 'completion_length': 67.91964530944824, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.2946429252624512, 'reward_std': 0.18849068880081177, 'kl': 1.240234375, 'epoch': 3.86} 77%|███████▋ | 1244/1610 [4:45:54<1:14:33, 12.22s/it] 77%|███████▋ | 1245/1610 [4:46:03<1:07:55, 11.16s/it] {'loss': 0.0133, 'grad_norm': 2.6913286937094183, 
'learning_rate': 2.267080745341615e-07, 'completion_length': 53.57143020629883, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.06222161650657654, 'kl': 0.333251953125, 'epoch': 3.87} 77%|███████▋ | 1245/1610 [4:46:03<1:07:55, 11.16s/it] 77%|███████▋ | 1246/1610 [4:46:16<1:10:50, 11.68s/it] {'loss': 0.0301, 'grad_norm': 5.469236283767878, 'learning_rate': 2.260869565217391e-07, 'completion_length': 59.23214530944824, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.973214328289032, 'reward': 1.383928656578064, 'reward_std': 0.2512577399611473, 'kl': 0.75390625, 'epoch': 3.87} 77%|███████▋ | 1246/1610 [4:46:16<1:10:50, 11.68s/it] 77%|███████▋ | 1247/1610 [4:46:25<1:07:05, 11.09s/it] {'loss': 0.0205, 'grad_norm': 2.1462991161440654, 'learning_rate': 2.2546583850931674e-07, 'completion_length': 58.080360412597656, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4375000596046448, 'reward_std': 0.1800125390291214, 'kl': 0.51025390625, 'epoch': 3.87} 77%|███████▋ | 1247/1610 [4:46:25<1:07:05, 11.09s/it] 78%|███████▊ | 1248/1610 [4:46:33<1:00:33, 10.04s/it] {'loss': 0.005, 'grad_norm': 2.386808710800321, 'learning_rate': 2.248447204968944e-07, 'completion_length': 53.49107360839844, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.17824578285217285, 'kl': 0.123779296875, 'epoch': 3.88} 78%|███████▊ | 1248/1610 [4:46:33<1:00:33, 10.04s/it] 78%|███████▊ | 1249/1610 [4:46:45<1:04:20, 10.69s/it] {'loss': 0.0251, 'grad_norm': 2.2349176094120558, 'learning_rate': 2.2422360248447203e-07, 'completion_length': 58.67857360839844, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.3035714626312256, 'reward_std': 0.1509094163775444, 'kl': 0.626953125, 'epoch': 3.88} 78%|███████▊ | 1249/1610 
[4:46:45<1:04:20, 10.69s/it] 78%|███████▊ | 1250/1610 [4:46:58<1:07:17, 11.22s/it] {'loss': 0.0494, 'grad_norm': 4.966807545216073, 'learning_rate': 2.236024844720497e-07, 'completion_length': 71.39286231994629, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3928571939468384, 'reward_std': 0.280967578291893, 'kl': 1.234375, 'epoch': 3.88} 78%|███████▊ | 1250/1610 [4:46:58<1:07:17, 11.22s/it] 78%|███████▊ | 1251/1610 [4:47:10<1:09:52, 11.68s/it] {'loss': 0.0631, 'grad_norm': 5.0599770579209356, 'learning_rate': 2.2298136645962732e-07, 'completion_length': 65.40178871154785, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4821429252624512, 'reward_std': 0.20740239322185516, 'kl': 1.58203125, 'epoch': 3.89} 78%|███████▊ | 1251/1610 [4:47:10<1:09:52, 11.68s/it] 78%|███████▊ | 1252/1610 [4:47:18<1:01:50, 10.37s/it] {'loss': 0.0051, 'grad_norm': 1.669121194894985, 'learning_rate': 2.2236024844720495e-07, 'completion_length': 48.40178680419922, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.0964989997446537, 'kl': 0.12646484375, 'epoch': 3.89} 78%|███████▊ | 1252/1610 [4:47:18<1:01:50, 10.37s/it] 78%|███████▊ | 1253/1610 [4:47:28<1:00:41, 10.20s/it] {'loss': 0.0223, 'grad_norm': 4.85461302434469, 'learning_rate': 2.217391304347826e-07, 'completion_length': 54.89285850524902, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.571428656578064, 'reward_std': 0.14579425007104874, 'kl': 0.5576171875, 'epoch': 3.89} 78%|███████▊ | 1253/1610 [4:47:28<1:00:41, 10.20s/it] 78%|███████▊ | 1254/1610 [4:47:39<1:03:18, 10.67s/it] {'loss': 0.0641, 'grad_norm': 3.626968401495913, 'learning_rate': 2.2111801242236025e-07, 'completion_length': 66.12500190734863, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9375000596046448, 
'reward': 1.4375000596046448, 'reward_std': 0.17233803123235703, 'kl': 1.607421875, 'epoch': 3.89} 78%|███████▊ | 1254/1610 [4:47:39<1:03:18, 10.67s/it] 78%|███████▊ | 1255/1610 [4:47:52<1:06:02, 11.16s/it] {'loss': 0.0261, 'grad_norm': 6.683669802839779, 'learning_rate': 2.2049689440993788e-07, 'completion_length': 58.08928871154785, 'rewards/accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.633928656578064, 'reward_std': 0.19447603821754456, 'kl': 0.65234375, 'epoch': 3.9} 78%|███████▊ | 1255/1610 [4:47:52<1:06:02, 11.16s/it] 78%|███████▊ | 1256/1610 [4:48:02<1:03:34, 10.78s/it] {'loss': 0.0095, 'grad_norm': 3.3586839259248378, 'learning_rate': 2.198757763975155e-07, 'completion_length': 52.91964530944824, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6875001192092896, 'reward_std': 0.1827620528638363, 'kl': 0.2373046875, 'epoch': 3.9} 78%|███████▊ | 1256/1610 [4:48:02<1:03:34, 10.78s/it] 78%|███████▊ | 1257/1610 [4:48:12<1:02:14, 10.58s/it] {'loss': 0.031, 'grad_norm': 2.5394962355903674, 'learning_rate': 2.1925465838509317e-07, 'completion_length': 62.00000190734863, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.5267857909202576, 'reward_std': 0.15933407843112946, 'kl': 0.779296875, 'epoch': 3.9} 78%|███████▊ | 1257/1610 [4:48:12<1:02:14, 10.58s/it] 78%|███████▊ | 1258/1610 [4:48:22<1:00:51, 10.37s/it] {'loss': 0.0224, 'grad_norm': 2.225538133764564, 'learning_rate': 2.1863354037267078e-07, 'completion_length': 53.09821701049805, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5446429252624512, 'reward_std': 0.10799804329872131, 'kl': 0.55810546875, 'epoch': 3.91} 78%|███████▊ | 1258/1610 [4:48:22<1:00:51, 10.37s/it] 78%|███████▊ | 1259/1610 [4:48:29<55:17, 9.45s/it] {'loss': 0.0062, 'grad_norm': 3.2850798377420207, 'learning_rate': 
2.1801242236024844e-07, 'completion_length': 46.81250190734863, 'rewards/accuracy_reward': 0.705357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.696428656578064, 'reward_std': 0.16141663491725922, 'kl': 0.15625, 'epoch': 3.91} 78%|███████▊ | 1259/1610 [4:48:29<55:17, 9.45s/it] 78%|███████▊ | 1260/1610 [4:48:36<50:45, 8.70s/it] {'loss': 0.0052, 'grad_norm': 1.306288867887687, 'learning_rate': 2.1739130434782607e-07, 'completion_length': 46.86607360839844, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.03696779906749725, 'kl': 0.130859375, 'epoch': 3.91} 78%|███████▊ | 1260/1610 [4:48:36<50:45, 8.70s/it] 78%|███████▊ | 1261/1610 [4:48:43<47:50, 8.23s/it] {'loss': 0.0047, 'grad_norm': 3.5426102745013486, 'learning_rate': 2.1677018633540373e-07, 'completion_length': 49.57143020629883, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.07124518603086472, 'kl': 0.11767578125, 'epoch': 3.92} 78%|███████▊ | 1261/1610 [4:48:43<47:50, 8.23s/it] 78%|███████▊ | 1262/1610 [4:48:51<47:40, 8.22s/it] {'loss': 0.0105, 'grad_norm': 1.8859119249547143, 'learning_rate': 2.1614906832298136e-07, 'completion_length': 50.88393020629883, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.508928656578064, 'reward_std': 0.0964989960193634, 'kl': 0.26220703125, 'epoch': 3.92} 78%|███████▊ | 1262/1610 [4:48:51<47:40, 8.22s/it] 78%|███████▊ | 1263/1610 [4:49:01<50:48, 8.79s/it] {'loss': 0.0117, 'grad_norm': 2.4207089987296624, 'learning_rate': 2.1552795031055902e-07, 'completion_length': 50.79464530944824, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.14700663089752197, 'kl': 0.29150390625, 'epoch': 3.92} 78%|███████▊ | 1263/1610 [4:49:01<50:48, 8.79s/it] 79%|███████▊ | 1264/1610 
[4:49:08<48:04, 8.34s/it] {'loss': 0.0048, 'grad_norm': 2.375267974319018, 'learning_rate': 2.1490683229813662e-07, 'completion_length': 48.80357360839844, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857313156128, 'reward_std': 0.07576144114136696, 'kl': 0.119140625, 'epoch': 3.93} 79%|███████▊ | 1264/1610 [4:49:09<48:04, 8.34s/it] 79%|███████▊ | 1265/1610 [4:49:20<52:52, 9.20s/it] {'loss': 0.0357, 'grad_norm': 3.0788994299328847, 'learning_rate': 2.1428571428571426e-07, 'completion_length': 58.392860412597656, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.5000000596046448, 'reward_std': 0.2085978090763092, 'kl': 0.892578125, 'epoch': 3.93} 79%|███████▊ | 1265/1610 [4:49:20<52:52, 9.20s/it] 79%|███████▊ | 1266/1610 [4:49:27<50:19, 8.78s/it] {'loss': 0.0095, 'grad_norm': 10.250788438050268, 'learning_rate': 2.1366459627329192e-07, 'completion_length': 47.500003814697266, 'rewards/accuracy_reward': 0.4553571790456772, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4375000596046448, 'reward_std': 0.18149618804454803, 'kl': 0.2373046875, 'epoch': 3.93} 79%|███████▊ | 1266/1610 [4:49:27<50:19, 8.78s/it] 79%|███████▊ | 1267/1610 [4:49:36<50:12, 8.78s/it] {'loss': 0.0218, 'grad_norm': 2.9942043734822037, 'learning_rate': 2.1304347826086955e-07, 'completion_length': 59.16964530944824, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.410714328289032, 'reward_std': 0.2746657580137253, 'kl': 0.546875, 'epoch': 3.93} 79%|███████▊ | 1267/1610 [4:49:36<50:12, 8.78s/it] 79%|███████▉ | 1268/1610 [4:49:49<56:16, 9.87s/it] {'loss': 0.0588, 'grad_norm': 2.115883355682617, 'learning_rate': 2.124223602484472e-07, 'completion_length': 63.02678680419922, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.535714328289032, 'reward_std': 0.18766789138317108, 'kl': 
1.46875, 'epoch': 3.94} 79%|███████▉ | 1268/1610 [4:49:49<56:16, 9.87s/it] 79%|███████▉ | 1269/1610 [4:49:58<55:44, 9.81s/it] {'loss': 0.0277, 'grad_norm': 7.745357950511789, 'learning_rate': 2.1180124223602484e-07, 'completion_length': 63.95535850524902, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.4821429252624512, 'reward_std': 0.3164268136024475, 'kl': 0.69287109375, 'epoch': 3.94} 79%|███████▉ | 1269/1610 [4:49:58<55:44, 9.81s/it] 79%|███████▉ | 1270/1610 [4:50:07<54:00, 9.53s/it] {'loss': 0.0158, 'grad_norm': 1.1937820711515452, 'learning_rate': 2.1118012422360247e-07, 'completion_length': 49.49107360839844, 'rewards/accuracy_reward': 0.8303571939468384, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.8125001192092896, 'reward_std': 0.10365218669176102, 'kl': 0.39453125, 'epoch': 3.94} 79%|███████▉ | 1270/1610 [4:50:07<54:00, 9.53s/it] 79%|███████▉ | 1271/1610 [4:50:19<57:35, 10.19s/it] {'loss': 0.0387, 'grad_norm': 2.548834134378419, 'learning_rate': 2.105590062111801e-07, 'completion_length': 62.54464530944824, 'rewards/accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6160715222358704, 'reward_std': 0.23711898922920227, 'kl': 0.966796875, 'epoch': 3.95} 79%|███████▉ | 1271/1610 [4:50:19<57:35, 10.19s/it] 79%|███████▉ | 1272/1610 [4:50:29<56:47, 10.08s/it] {'loss': 0.0246, 'grad_norm': 3.028900697325006, 'learning_rate': 2.0993788819875776e-07, 'completion_length': 56.96428871154785, 'rewards/accuracy_reward': 0.508928582072258, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.473214328289032, 'reward_std': 0.1379830539226532, 'kl': 0.611328125, 'epoch': 3.95} 79%|███████▉ | 1272/1610 [4:50:29<56:47, 10.08s/it] 79%|███████▉ | 1273/1610 [4:50:42<1:01:09, 10.89s/it] {'loss': 0.0415, 'grad_norm': 3.397183517272342, 'learning_rate': 2.093167701863354e-07, 'completion_length': 59.83928871154785, 'rewards/accuracy_reward': 0.4821428954601288, 
'rewards/format_reward': 0.955357164144516, 'reward': 1.4375000596046448, 'reward_std': 0.19690078496932983, 'kl': 1.037109375, 'epoch': 3.95} 79%|███████▉ | 1273/1610 [4:50:42<1:01:09, 10.89s/it] 79%|███████▉ | 1274/1610 [4:50:51<58:07, 10.38s/it] {'loss': 0.0186, 'grad_norm': 1.9121482952399482, 'learning_rate': 2.0869565217391303e-07, 'completion_length': 54.24107360839844, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.4821429252624512, 'reward_std': 0.10101525485515594, 'kl': 0.46533203125, 'epoch': 3.96} 79%|███████▉ | 1274/1610 [4:50:51<58:07, 10.38s/it] 79%|███████▉ | 1275/1610 [4:50:59<54:03, 9.68s/it] {'loss': 0.0126, 'grad_norm': 2.190904451914343, 'learning_rate': 2.080745341614907e-07, 'completion_length': 51.04464340209961, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.6428572535514832, 'reward_std': 0.16197100281715393, 'kl': 0.31591796875, 'epoch': 3.96} 79%|███████▉ | 1275/1610 [4:50:59<54:03, 9.68s/it] 79%|███████▉ | 1276/1610 [4:51:09<54:32, 9.80s/it] {'loss': 0.0181, 'grad_norm': 2.5878683844387917, 'learning_rate': 2.074534161490683e-07, 'completion_length': 51.83928871154785, 'rewards/accuracy_reward': 0.5803571939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5446429252624512, 'reward_std': 0.24743251502513885, 'kl': 0.45361328125, 'epoch': 3.96} 79%|███████▉ | 1276/1610 [4:51:09<54:32, 9.80s/it] 79%|███████▉ | 1277/1610 [4:51:17<51:46, 9.33s/it] {'loss': 0.0121, 'grad_norm': 1.7844587959430733, 'learning_rate': 2.0683229813664595e-07, 'completion_length': 51.250003814697266, 'rewards/accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4375000596046448, 'reward_std': 0.09628405794501305, 'kl': 0.302734375, 'epoch': 3.97} 79%|███████▉ | 1277/1610 [4:51:17<51:46, 9.33s/it] 79%|███████▉ | 1278/1610 [4:51:25<49:57, 9.03s/it] {'loss': 0.039, 'grad_norm': 4.194414784842366, 
'learning_rate': 2.0621118012422359e-07, 'completion_length': 58.58928871154785, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9464286267757416, 'reward': 1.3839285969734192, 'reward_std': 0.33631402254104614, 'kl': 0.974609375, 'epoch': 3.97} 79%|███████▉ | 1278/1610 [4:51:25<49:57, 9.03s/it] 79%|███████▉ | 1279/1610 [4:51:37<54:17, 9.84s/it] {'loss': 0.0355, 'grad_norm': 2.366414283588355, 'learning_rate': 2.0559006211180125e-07, 'completion_length': 54.65178680419922, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 0.955357164144516, 'reward': 1.2410714626312256, 'reward_std': 0.18205057084560394, 'kl': 0.88671875, 'epoch': 3.97} 79%|███████▉ | 1279/1610 [4:51:37<54:17, 9.84s/it] 80%|███████▉ | 1280/1610 [4:51:47<53:45, 9.78s/it] {'loss': 0.022, 'grad_norm': 3.130484048221779, 'learning_rate': 2.0496894409937888e-07, 'completion_length': 49.37500190734863, 'rewards/accuracy_reward': 0.723214328289032, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.7053571939468384, 'reward_std': 0.18849068135023117, 'kl': 0.548828125, 'epoch': 3.98} 80%|███████▉ | 1280/1610 [4:51:47<53:45, 9.78s/it] 80%|███████▉ | 1281/1610 [4:51:56<53:19, 9.72s/it] {'loss': 0.0389, 'grad_norm': 5.902615812240136, 'learning_rate': 2.0434782608695654e-07, 'completion_length': 50.59821701049805, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.383928656578064, 'reward_std': 0.2742847502231598, 'kl': 0.974609375, 'epoch': 3.98} 80%|███████▉ | 1281/1610 [4:51:56<53:19, 9.72s/it] 80%|███████▉ | 1282/1610 [4:52:05<51:13, 9.37s/it] {'loss': 0.0452, 'grad_norm': 3.9175461418282027, 'learning_rate': 2.0372670807453414e-07, 'completion_length': 55.28571701049805, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4642857909202576, 'reward_std': 0.17737560719251633, 'kl': 1.131103515625, 'epoch': 3.98} 80%|███████▉ | 1282/1610 
[4:52:05<51:13, 9.37s/it] 80%|███████▉ | 1283/1610 [4:52:18<56:24, 10.35s/it] {'loss': 0.0583, 'grad_norm': 2.4850049942703807, 'learning_rate': 2.0310559006211178e-07, 'completion_length': 64.75000190734863, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.7053571939468384, 'reward_std': 0.2760959267616272, 'kl': 1.453125, 'epoch': 3.98} 80%|███████▉ | 1283/1610 [4:52:18<56:24, 10.35s/it] 80%|███████▉ | 1284/1610 [4:52:28<56:59, 10.49s/it] {'loss': 0.051, 'grad_norm': 3.3417451622020966, 'learning_rate': 2.0248447204968943e-07, 'completion_length': 56.27678680419922, 'rewards/accuracy_reward': 0.3303571492433548, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.258928656578064, 'reward_std': 0.2513168305158615, 'kl': 1.2734375, 'epoch': 3.99} 80%|███████▉ | 1284/1610 [4:52:28<56:59, 10.49s/it] 80%|███████▉ | 1285/1610 [4:52:36<52:09, 9.63s/it] {'loss': 0.028, 'grad_norm': 3.9843360627008466, 'learning_rate': 2.0186335403726707e-07, 'completion_length': 48.187503814697266, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3928571939468384, 'reward_std': 0.28907015919685364, 'kl': 0.6962890625, 'epoch': 3.99} 80%|███████▉ | 1285/1610 [4:52:36<52:09, 9.63s/it] 80%|███████▉ | 1286/1610 [4:52:48<55:15, 10.23s/it] {'loss': 0.1003, 'grad_norm': 7.3284968831794535, 'learning_rate': 2.0124223602484473e-07, 'completion_length': 72.33928680419922, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.3750000596046448, 'reward_std': 0.43386392295360565, 'kl': 2.5078125, 'epoch': 3.99} 80%|███████▉ | 1286/1610 [4:52:48<55:15, 10.23s/it] 80%|███████▉ | 1287/1610 [4:53:00<57:40, 10.71s/it] {'loss': 0.0794, 'grad_norm': 6.353954377730052, 'learning_rate': 2.0062111801242236e-07, 'completion_length': 62.562503814697266, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.910714328289032, 'reward': 
1.3750000596046448, 'reward_std': 0.28209593892097473, 'kl': 1.984375, 'epoch': 4.0} 80%|███████▉ | 1287/1610 [4:53:00<57:40, 10.71s/it] 80%|████████ | 1288/1610 [4:53:07<52:48, 9.84s/it] {'loss': 0.026, 'grad_norm': 7.738090534194152, 'learning_rate': 2e-07, 'completion_length': 47.63393211364746, 'rewards/accuracy_reward': 0.803571492433548, 'rewards/format_reward': 0.955357164144516, 'reward': 1.758928656578064, 'reward_std': 0.14679168164730072, 'kl': 0.650390625, 'epoch': 4.0} 80%|████████ | 1288/1610 [4:53:07<52:48, 9.84s/it] 80%|████████ | 1289/1610 [4:53:16<50:33, 9.45s/it] {'loss': 0.0257, 'grad_norm': 4.988950394773518, 'learning_rate': 1.9937888198757762e-07, 'completion_length': 51.25000190734863, 'rewards/accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 0.955357164144516, 'reward': 1.6071429252624512, 'reward_std': 0.16323686763644218, 'kl': 0.642578125, 'epoch': 4.0} 80%|████████ | 1289/1610 [4:53:16<50:33, 9.45s/it] 80%|████████ | 1290/1610 [4:53:24<47:36, 8.93s/it] {'loss': 0.0443, 'grad_norm': 6.786119083534958, 'learning_rate': 1.9875776397515526e-07, 'completion_length': 49.18750190734863, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9196429252624512, 'reward': 1.5625000596046448, 'reward_std': 0.2502015680074692, 'kl': 1.1083984375, 'epoch': 4.01} 80%|████████ | 1290/1610 [4:53:24<47:36, 8.93s/it] 80%|████████ | 1291/1610 [4:53:33<48:06, 9.05s/it] {'loss': 0.0463, 'grad_norm': 2.399485260873983, 'learning_rate': 1.9813664596273292e-07, 'completion_length': 52.75893020629883, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5446429252624512, 'reward_std': 0.27154944837093353, 'kl': 1.16015625, 'epoch': 4.01} 80%|████████ | 1291/1610 [4:53:33<48:06, 9.05s/it] 80%|████████ | 1292/1610 [4:53:40<44:51, 8.46s/it] {'loss': 0.0064, 'grad_norm': 2.542576299557618, 'learning_rate': 1.9751552795031055e-07, 'completion_length': 43.38393020629883, 
'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.5000000596046448, 'reward_std': 0.11272924393415451, 'kl': 0.1591796875, 'epoch': 4.01} 80%|████████ | 1292/1610 [4:53:40<44:51, 8.46s/it] 80%|████████ | 1293/1610 [4:53:47<42:53, 8.12s/it] {'loss': 0.0083, 'grad_norm': 2.1384893224231174, 'learning_rate': 1.968944099378882e-07, 'completion_length': 50.41071701049805, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.4285714626312256, 'reward_std': 0.14579425379633904, 'kl': 0.208984375, 'epoch': 4.02} 80%|████████ | 1293/1610 [4:53:47<42:53, 8.12s/it] 80%|████████ | 1294/1610 [4:53:55<41:40, 7.91s/it] {'loss': 0.0185, 'grad_norm': 3.354450747973548, 'learning_rate': 1.962732919254658e-07, 'completion_length': 48.09821701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.955357164144516, 'reward': 1.5625001192092896, 'reward_std': 0.2702430784702301, 'kl': 0.462890625, 'epoch': 4.02} 80%|████████ | 1294/1610 [4:53:55<41:40, 7.91s/it] 80%|████████ | 1295/1610 [4:54:02<40:56, 7.80s/it] {'loss': 0.0151, 'grad_norm': 2.9042318176507713, 'learning_rate': 1.9565217391304347e-07, 'completion_length': 47.00000190734863, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.5892857909202576, 'reward_std': 0.22255312651395798, 'kl': 0.3779296875, 'epoch': 4.02} 80%|████████ | 1295/1610 [4:54:02<40:56, 7.80s/it] 80%|████████ | 1296/1610 [4:54:09<39:34, 7.56s/it] {'loss': 0.0067, 'grad_norm': 3.1183605992400976, 'learning_rate': 1.950310559006211e-07, 'completion_length': 43.366071701049805, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.571428656578064, 'reward_std': 0.14964355528354645, 'kl': 0.16796875, 'epoch': 4.02} 80%|████████ | 1296/1610 [4:54:09<39:34, 7.56s/it] 81%|████████ | 1297/1610 [4:54:17<39:08, 7.50s/it] {'loss': 0.0071, 
'grad_norm': 2.0530125736765803, 'learning_rate': 1.9440993788819876e-07, 'completion_length': 47.08928871154785, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6696429252624512, 'reward_std': 0.10882644355297089, 'kl': 0.177490234375, 'epoch': 4.03} 81%|████████ | 1297/1610 [4:54:17<39:08, 7.50s/it] 81%|████████ | 1298/1610 [4:54:25<41:02, 7.89s/it] {'loss': 0.0255, 'grad_norm': 3.4944179486224605, 'learning_rate': 1.937888198757764e-07, 'completion_length': 51.437503814697266, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.955357164144516, 'reward': 1.4732143878936768, 'reward_std': 0.30261000990867615, 'kl': 0.63671875, 'epoch': 4.03} 81%|████████ | 1298/1610 [4:54:25<41:02, 7.89s/it] 81%|████████ | 1299/1610 [4:54:34<41:19, 7.97s/it] {'loss': 0.0126, 'grad_norm': 2.101561171864086, 'learning_rate': 1.9316770186335403e-07, 'completion_length': 51.06250190734863, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4464285969734192, 'reward_std': 0.2119242623448372, 'kl': 0.31640625, 'epoch': 4.03} 81%|████████ | 1299/1610 [4:54:34<41:19, 7.97s/it] 81%|████████ | 1300/1610 [4:54:43<43:41, 8.46s/it] {'loss': 0.0374, 'grad_norm': 3.78163406713862, 'learning_rate': 1.9254658385093166e-07, 'completion_length': 50.33928680419922, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.4910715222358704, 'reward_std': 0.22983404994010925, 'kl': 0.939453125, 'epoch': 4.04} 81%|████████ | 1300/1610 [4:54:43<43:41, 8.46s/it] 81%|████████ | 1301/1610 [4:55:32<1:45:08, 20.41s/it] {'loss': 0.0349, 'grad_norm': 3.7006767337560516, 'learning_rate': 1.919254658385093e-07, 'completion_length': 50.910715103149414, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.4375001192092896, 'reward_std': 0.43280982971191406, 'kl': 0.873046875, 'epoch': 4.04} 
81%|████████ | 1301/1610 [4:55:32<1:45:08, 20.41s/it] 81%|████████ | 1302/1610 [4:55:39<1:24:14, 16.41s/it] {'loss': 0.0323, 'grad_norm': 6.285647548057792, 'learning_rate': 1.9130434782608695e-07, 'completion_length': 47.05357360839844, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.955357164144516, 'reward': 1.4553571939468384, 'reward_std': 0.2574945241212845, 'kl': 0.80859375, 'epoch': 4.04} 81%|████████ | 1302/1610 [4:55:39<1:24:14, 16.41s/it] 81%|████████ | 1303/1610 [4:55:47<1:10:58, 13.87s/it] {'loss': 0.0184, 'grad_norm': 6.119852275077587, 'learning_rate': 1.9068322981366459e-07, 'completion_length': 44.88393020629883, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.4732143878936768, 'reward_std': 0.3099055886268616, 'kl': 0.4609375, 'epoch': 4.05} 81%|████████ | 1303/1610 [4:55:47<1:10:58, 13.87s/it] 81%|████████ | 1304/1610 [4:55:58<1:06:34, 13.06s/it] {'loss': 0.0884, 'grad_norm': 5.704907682426358, 'learning_rate': 1.9006211180124224e-07, 'completion_length': 67.66071701049805, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.8214285969734192, 'reward': 1.321428656578064, 'reward_std': 0.5254766494035721, 'kl': 2.20703125, 'epoch': 4.05} 81%|████████ | 1304/1610 [4:55:58<1:06:34, 13.06s/it] 81%|████████ | 1305/1610 [4:56:08<1:02:19, 12.26s/it] {'loss': 0.0519, 'grad_norm': 6.613799252311684, 'learning_rate': 1.8944099378881988e-07, 'completion_length': 53.892860412597656, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.2410715222358704, 'reward_std': 0.30390140414237976, 'kl': 1.298828125, 'epoch': 4.05} 81%|████████ | 1305/1610 [4:56:08<1:02:19, 12.26s/it] 81%|████████ | 1306/1610 [4:56:17<57:13, 11.29s/it] {'loss': 0.0461, 'grad_norm': 3.591711316990193, 'learning_rate': 1.888198757763975e-07, 'completion_length': 49.14285850524902, 'rewards/accuracy_reward': 0.5535714626312256, 
'rewards/format_reward': 0.9375000596046448, 'reward': 1.4910715222358704, 'reward_std': 0.296901635825634, 'kl': 1.1552734375, 'epoch': 4.06} 81%|████████ | 1306/1610 [4:56:17<57:13, 11.29s/it] 81%|████████ | 1307/1610 [4:56:25<52:13, 10.34s/it] {'loss': 0.0294, 'grad_norm': 2.8130130058088407, 'learning_rate': 1.8819875776397514e-07, 'completion_length': 45.642860412597656, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.5625000596046448, 'reward_std': 0.16141103208065033, 'kl': 0.7333984375, 'epoch': 4.06} 81%|████████ | 1307/1610 [4:56:25<52:13, 10.34s/it] 81%|████████ | 1308/1610 [4:56:33<47:47, 9.49s/it] {'loss': 0.0267, 'grad_norm': 4.891710159293425, 'learning_rate': 1.8757763975155277e-07, 'completion_length': 48.42857360839844, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.955357164144516, 'reward': 1.633928656578064, 'reward_std': 0.31684230268001556, 'kl': 0.6669921875, 'epoch': 4.06} 81%|████████ | 1308/1610 [4:56:33<47:47, 9.49s/it] 81%|████████▏ | 1309/1610 [4:56:41<45:05, 8.99s/it] {'loss': 0.0344, 'grad_norm': 8.155711823249078, 'learning_rate': 1.8695652173913043e-07, 'completion_length': 50.37500190734863, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.2820959538221359, 'kl': 0.859375, 'epoch': 4.07} 81%|████████▏ | 1309/1610 [4:56:41<45:05, 8.99s/it] 81%|████████▏ | 1310/1610 [4:56:49<43:41, 8.74s/it] {'loss': 0.0629, 'grad_norm': 6.849557460148373, 'learning_rate': 1.8633540372670807e-07, 'completion_length': 51.16964530944824, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.3392857909202576, 'reward_std': 0.2792610377073288, 'kl': 1.5712890625, 'epoch': 4.07} 81%|████████▏ | 1310/1610 [4:56:49<43:41, 8.74s/it] 81%|████████▏ | 1311/1610 [4:56:56<41:54, 8.41s/it] {'loss': 0.0165, 'grad_norm': 2.5639834156312205, 
'learning_rate': 1.8571428571428572e-07, 'completion_length': 44.97321701049805, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4196429252624512, 'reward_std': 0.23158271610736847, 'kl': 0.412109375, 'epoch': 4.07} 81%|████████▏ | 1311/1610 [4:56:56<41:54, 8.41s/it] 81%|████████▏ | 1312/1610 [4:57:03<39:38, 7.98s/it] {'loss': 0.0231, 'grad_norm': 2.668756969879945, 'learning_rate': 1.8509316770186333e-07, 'completion_length': 51.17857360839844, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.473214328289032, 'reward_std': 0.2618369236588478, 'kl': 0.578125, 'epoch': 4.07} 81%|████████▏ | 1312/1610 [4:57:03<39:38, 7.98s/it] 82%|████████▏ | 1313/1610 [4:57:14<42:58, 8.68s/it] {'loss': 0.0203, 'grad_norm': 2.6834984346697484, 'learning_rate': 1.84472049689441e-07, 'completion_length': 50.33928680419922, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.508928656578064, 'reward_std': 0.12054043263196945, 'kl': 0.5087890625, 'epoch': 4.08} 82%|████████▏ | 1313/1610 [4:57:14<42:58, 8.68s/it] 82%|████████▏ | 1314/1610 [4:57:23<43:51, 8.89s/it] {'loss': 0.0526, 'grad_norm': 4.175391742107041, 'learning_rate': 1.8385093167701862e-07, 'completion_length': 52.09821701049805, 'rewards/accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 0.9017857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.37354570627212524, 'kl': 1.31640625, 'epoch': 4.08} 82%|████████▏ | 1314/1610 [4:57:23<43:51, 8.89s/it] 82%|████████▏ | 1315/1610 [4:57:31<41:59, 8.54s/it] {'loss': 0.0103, 'grad_norm': 5.164645921295919, 'learning_rate': 1.8322981366459628e-07, 'completion_length': 48.15178680419922, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.6875000596046448, 'reward_std': 0.19178560376167297, 'kl': 0.2587890625, 'epoch': 4.08} 82%|████████▏ | 1315/1610 
[4:57:31<41:59, 8.54s/it] 82%|████████▏ | 1316/1610 [4:57:39<41:02, 8.38s/it] {'loss': 0.0157, 'grad_norm': 2.812468059193621, 'learning_rate': 1.8260869565217391e-07, 'completion_length': 46.48214530944824, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.13408025726675987, 'kl': 0.39111328125, 'epoch': 4.09} 82%|████████▏ | 1316/1610 [4:57:39<41:02, 8.38s/it] 82%|████████▏ | 1317/1610 [4:57:46<38:42, 7.93s/it] {'loss': 0.0155, 'grad_norm': 2.7255962132614253, 'learning_rate': 1.8198757763975152e-07, 'completion_length': 39.92857360839844, 'rewards/accuracy_reward': 0.401785746216774, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.383928656578064, 'reward_std': 0.1379830539226532, 'kl': 0.3876953125, 'epoch': 4.09} 82%|████████▏ | 1317/1610 [4:57:46<38:42, 7.93s/it] 82%|████████▏ | 1318/1610 [4:57:53<38:04, 7.82s/it] {'loss': 0.0093, 'grad_norm': 3.1283896427336098, 'learning_rate': 1.8136645962732918e-07, 'completion_length': 43.035715103149414, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.633928656578064, 'reward_std': 0.2291373908519745, 'kl': 0.2333984375, 'epoch': 4.09} 82%|████████▏ | 1318/1610 [4:57:53<38:04, 7.82s/it] 82%|████████▏ | 1319/1610 [4:58:02<38:37, 7.97s/it] {'loss': 0.015, 'grad_norm': 1.5662866249911385, 'learning_rate': 1.807453416149068e-07, 'completion_length': 47.67857360839844, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.508928656578064, 'reward_std': 0.15872059762477875, 'kl': 0.375244140625, 'epoch': 4.1} 82%|████████▏ | 1319/1610 [4:58:02<38:37, 7.97s/it] 82%|████████▏ | 1320/1610 [4:58:08<36:56, 7.64s/it] {'loss': 0.0082, 'grad_norm': 3.0120567245417997, 'learning_rate': 1.8012422360248447e-07, 'completion_length': 51.33928871154785, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 
0.9910714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.21582704409956932, 'kl': 0.203857421875, 'epoch': 4.1} 82%|████████▏ | 1320/1610 [4:58:08<36:56, 7.64s/it] 82%|████████▏ | 1321/1610 [4:58:15<35:39, 7.40s/it] {'loss': 0.0081, 'grad_norm': 2.621167502275807, 'learning_rate': 1.795031055900621e-07, 'completion_length': 42.26785850524902, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.535714328289032, 'reward_std': 0.06222161278128624, 'kl': 0.2021484375, 'epoch': 4.1} 82%|████████▏ | 1321/1610 [4:58:15<35:39, 7.40s/it] 82%|████████▏ | 1322/1610 [4:58:24<37:21, 7.78s/it] {'loss': 0.0347, 'grad_norm': 2.3696764879070185, 'learning_rate': 1.7888198757763976e-07, 'completion_length': 49.25000190734863, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.3571429252624512, 'reward_std': 0.23638546466827393, 'kl': 0.869140625, 'epoch': 4.11} 82%|████████▏ | 1322/1610 [4:58:24<37:21, 7.78s/it] 82%|████████▏ | 1323/1610 [4:58:32<37:54, 7.92s/it] {'loss': 0.0211, 'grad_norm': 4.17463678409391, 'learning_rate': 1.7826086956521737e-07, 'completion_length': 47.33928871154785, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4107143878936768, 'reward_std': 0.24619604647159576, 'kl': 0.5283203125, 'epoch': 4.11} 82%|████████▏ | 1323/1610 [4:58:32<37:54, 7.92s/it] 82%|████████▏ | 1324/1610 [4:58:39<36:41, 7.70s/it] {'loss': 0.01, 'grad_norm': 2.417522561853512, 'learning_rate': 1.7763975155279503e-07, 'completion_length': 43.40178871154785, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.2678571939468384, 'reward_std': 0.05050762742757797, 'kl': 0.25, 'epoch': 4.11} 82%|████████▏ | 1324/1610 [4:58:39<36:41, 7.70s/it] 82%|████████▏ | 1325/1610 [4:58:46<35:39, 7.51s/it] {'loss': 0.0057, 'grad_norm': 1.0657907491204965, 'learning_rate': 
1.7701863354037266e-07, 'completion_length': 43.32143020629883, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.09528662264347076, 'kl': 0.14111328125, 'epoch': 4.11} 82%|████████▏ | 1325/1610 [4:58:46<35:39, 7.51s/it] 82%|████████▏ | 1326/1610 [4:58:54<35:35, 7.52s/it] {'loss': 0.0257, 'grad_norm': 2.30932886498742, 'learning_rate': 1.763975155279503e-07, 'completion_length': 48.517860412597656, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.4107143878936768, 'reward_std': 0.270717054605484, 'kl': 0.6396484375, 'epoch': 4.12} 82%|████████▏ | 1326/1610 [4:58:54<35:35, 7.52s/it] 82%|████████▏ | 1327/1610 [4:59:01<35:03, 7.43s/it] {'loss': 0.0277, 'grad_norm': 4.084391683589715, 'learning_rate': 1.7577639751552795e-07, 'completion_length': 42.40178680419922, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.4017857909202576, 'reward_std': 0.23899829387664795, 'kl': 0.6923828125, 'epoch': 4.12} 82%|████████▏ | 1327/1610 [4:59:01<35:03, 7.43s/it] 82%|████████▏ | 1328/1610 [4:59:09<35:56, 7.65s/it] {'loss': 0.0353, 'grad_norm': 3.0546085991385747, 'learning_rate': 1.7515527950310558e-07, 'completion_length': 52.57143211364746, 'rewards/accuracy_reward': 0.3928571790456772, 'rewards/format_reward': 0.9375000596046448, 'reward': 1.3303572535514832, 'reward_std': 0.29034458100795746, 'kl': 0.8818359375, 'epoch': 4.12} 82%|████████▏ | 1328/1610 [4:59:09<35:56, 7.65s/it] 83%|████████▎ | 1329/1610 [4:59:20<40:42, 8.69s/it] {'loss': 0.0524, 'grad_norm': 8.674717877939578, 'learning_rate': 1.7453416149068322e-07, 'completion_length': 59.44643020629883, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.285714328289032, 'reward_std': 0.4402645081281662, 'kl': 1.3125, 'epoch': 4.13} 83%|████████▎ | 1329/1610 [4:59:20<40:42, 8.69s/it] 
83%|████████▎ | 1330/1610 [4:59:29<40:23, 8.66s/it] {'loss': 0.0301, 'grad_norm': 3.3655236250530867, 'learning_rate': 1.7391304347826085e-07, 'completion_length': 46.83035850524902, 'rewards/accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7410715222358704, 'reward_std': 0.24802187085151672, 'kl': 0.75390625, 'epoch': 4.13} 83%|████████▎ | 1330/1610 [4:59:29<40:23, 8.66s/it] 83%|████████▎ | 1331/1610 [4:59:37<39:10, 8.42s/it] {'loss': 0.0263, 'grad_norm': 2.9254130224521, 'learning_rate': 1.732919254658385e-07, 'completion_length': 47.94643211364746, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.955357164144516, 'reward': 1.410714328289032, 'reward_std': 0.23266181349754333, 'kl': 0.65625, 'epoch': 4.13} 83%|████████▎ | 1331/1610 [4:59:37<39:10, 8.42s/it] 83%|████████▎ | 1332/1610 [4:59:47<41:26, 8.94s/it] {'loss': 0.0408, 'grad_norm': 5.064723961044737, 'learning_rate': 1.7267080745341614e-07, 'completion_length': 50.61607360839844, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9464286267757416, 'reward': 1.3125000596046448, 'reward_std': 0.2759717181324959, 'kl': 1.017578125, 'epoch': 4.14} 83%|████████▎ | 1332/1610 [4:59:47<41:26, 8.94s/it] 83%|████████▎ | 1333/1610 [4:59:54<39:07, 8.47s/it] {'loss': 0.0355, 'grad_norm': 3.2204770530406988, 'learning_rate': 1.720496894409938e-07, 'completion_length': 45.50893020629883, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.955357164144516, 'reward': 1.7053571939468384, 'reward_std': 0.2511918842792511, 'kl': 0.88720703125, 'epoch': 4.14} 83%|████████▎ | 1333/1610 [4:59:54<39:07, 8.47s/it] 83%|████████▎ | 1334/1610 [5:00:05<42:00, 9.13s/it] {'loss': 0.1038, 'grad_norm': 7.209409574633595, 'learning_rate': 1.7142857142857143e-07, 'completion_length': 61.455360412597656, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 0.8392857313156128, 'reward': 1.1428571939468384, 
'reward_std': 0.45728932321071625, 'kl': 2.6015625, 'epoch': 4.14} 83%|████████▎ | 1334/1610 [5:00:05<42:00, 9.13s/it] 83%|████████▎ | 1335/1610 [5:00:12<39:03, 8.52s/it] {'loss': 0.0689, 'grad_norm': 9.514164498804249, 'learning_rate': 1.7080745341614904e-07, 'completion_length': 42.67857360839844, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 0.9017857313156128, 'reward': 1.4375000596046448, 'reward_std': 0.3132643848657608, 'kl': 1.7265625, 'epoch': 4.15} 83%|████████▎ | 1335/1610 [5:00:12<39:03, 8.52s/it] 83%|████████▎ | 1336/1610 [5:00:21<39:07, 8.57s/it] {'loss': 0.076, 'grad_norm': 7.1108221614410505, 'learning_rate': 1.701863354037267e-07, 'completion_length': 49.11607360839844, 'rewards/accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.8839286267757416, 'reward': 1.3392857909202576, 'reward_std': 0.29097503423690796, 'kl': 1.8984375, 'epoch': 4.15} 83%|████████▎ | 1336/1610 [5:00:21<39:07, 8.57s/it] 83%|████████▎ | 1337/1610 [5:00:30<39:41, 8.72s/it] {'loss': 0.0188, 'grad_norm': 3.2005911459244865, 'learning_rate': 1.6956521739130433e-07, 'completion_length': 43.75893020629883, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3392857313156128, 'reward_std': 0.14708422124385834, 'kl': 0.46875, 'epoch': 4.15} 83%|████████▎ | 1337/1610 [5:00:30<39:41, 8.72s/it] 83%|████████▎ | 1338/1610 [5:00:39<39:28, 8.71s/it] {'loss': 0.0512, 'grad_norm': 6.369573556570025, 'learning_rate': 1.68944099378882e-07, 'completion_length': 47.94643020629883, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9017857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.26237280666828156, 'kl': 1.279296875, 'epoch': 4.16} 83%|████████▎ | 1338/1610 [5:00:39<39:28, 8.71s/it] 83%|████████▎ | 1339/1610 [5:00:47<38:41, 8.56s/it] {'loss': 0.0437, 'grad_norm': 2.9704848408075204, 'learning_rate': 1.6832298136645962e-07, 'completion_length': 46.50893211364746, 
'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5267857909202576, 'reward_std': 0.16062404215335846, 'kl': 1.09375, 'epoch': 4.16} 83%|████████▎ | 1339/1610 [5:00:47<38:41, 8.56s/it] 83%|████████▎ | 1340/1610 [5:00:56<38:49, 8.63s/it] {'loss': 0.0574, 'grad_norm': 4.385571960648403, 'learning_rate': 1.6770186335403728e-07, 'completion_length': 45.57143020629883, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5267857909202576, 'reward_std': 0.35745008289813995, 'kl': 1.439453125, 'epoch': 4.16} 83%|████████▎ | 1340/1610 [5:00:56<38:49, 8.63s/it] 83%|████████▎ | 1341/1610 [5:01:03<37:37, 8.39s/it] {'loss': 0.0324, 'grad_norm': 4.156289734099834, 'learning_rate': 1.6708074534161489e-07, 'completion_length': 44.07143020629883, 'rewards/accuracy_reward': 0.232142873108387, 'rewards/format_reward': 0.910714328289032, 'reward': 1.1428571939468384, 'reward_std': 0.2831694483757019, 'kl': 0.810546875, 'epoch': 4.16} 83%|████████▎ | 1341/1610 [5:01:03<37:37, 8.39s/it] 83%|████████▎ | 1342/1610 [5:01:12<37:29, 8.39s/it] {'loss': 0.0184, 'grad_norm': 4.1515118946235905, 'learning_rate': 1.6645962732919252e-07, 'completion_length': 47.89285850524902, 'rewards/accuracy_reward': 0.6071428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.571428656578064, 'reward_std': 0.34209612011909485, 'kl': 0.4580078125, 'epoch': 4.17} 83%|████████▎ | 1342/1610 [5:01:12<37:29, 8.39s/it] 83%|████████▎ | 1343/1610 [5:01:19<35:38, 8.01s/it] {'loss': 0.0319, 'grad_norm': 4.450779364771274, 'learning_rate': 1.6583850931677018e-07, 'completion_length': 47.95535850524902, 'rewards/accuracy_reward': 0.294642873108387, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2232143878936768, 'reward_std': 0.27162449061870575, 'kl': 0.796875, 'epoch': 4.17} 83%|████████▎ | 1343/1610 [5:01:19<35:38, 8.01s/it] 83%|████████▎ | 1344/1610 [5:01:29<38:05, 8.59s/it] {'loss': 0.1045, 
'grad_norm': 4.825701413598981, 'learning_rate': 1.652173913043478e-07, 'completion_length': 57.75893211364746, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.785714328289032, 'reward': 1.2678571939468384, 'reward_std': 0.47602657973766327, 'kl': 2.6171875, 'epoch': 4.17} 83%|████████▎ | 1344/1610 [5:01:29<38:05, 8.59s/it] 84%|████████▎ | 1345/1610 [5:01:38<38:37, 8.75s/it] {'loss': 0.0515, 'grad_norm': 3.7965206183170452, 'learning_rate': 1.6459627329192547e-07, 'completion_length': 57.99107360839844, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.8392857313156128, 'reward': 1.3125000596046448, 'reward_std': 0.4076753407716751, 'kl': 1.283203125, 'epoch': 4.18} 84%|████████▎ | 1345/1610 [5:01:38<38:37, 8.75s/it] 84%|████████▎ | 1346/1610 [5:01:46<37:55, 8.62s/it] {'loss': 0.0601, 'grad_norm': 2.869254058190639, 'learning_rate': 1.639751552795031e-07, 'completion_length': 51.08928680419922, 'rewards/accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.2410714626312256, 'reward_std': 0.34269504249095917, 'kl': 1.5, 'epoch': 4.18} 84%|████████▎ | 1346/1610 [5:01:46<37:55, 8.62s/it] 84%|████████▎ | 1347/1610 [5:01:58<41:12, 9.40s/it] {'loss': 0.0827, 'grad_norm': 8.852140122055584, 'learning_rate': 1.6335403726708073e-07, 'completion_length': 56.83928871154785, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.1607143878936768, 'reward_std': 0.3882565051317215, 'kl': 2.06640625, 'epoch': 4.18} 84%|████████▎ | 1347/1610 [5:01:58<41:12, 9.40s/it] 84%|████████▎ | 1348/1610 [5:02:06<40:11, 9.20s/it] {'loss': 0.0792, 'grad_norm': 6.3217983299758576, 'learning_rate': 1.6273291925465837e-07, 'completion_length': 47.13393020629883, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 0.848214328289032, 'reward': 1.3392857909202576, 'reward_std': 0.4392039477825165, 'kl': 1.9765625, 'epoch': 4.19} 84%|████████▎ 
| 1348/1610 [5:02:06<40:11, 9.20s/it] 84%|████████▍ | 1349/1610 [5:02:16<40:35, 9.33s/it] {'loss': 0.1314, 'grad_norm': 9.270071721355919, 'learning_rate': 1.6211180124223603e-07, 'completion_length': 55.87500190734863, 'rewards/accuracy_reward': 0.5178571939468384, 'rewards/format_reward': 0.7678571939468384, 'reward': 1.2857143878936768, 'reward_std': 0.3558870404958725, 'kl': 3.2890625, 'epoch': 4.19} 84%|████████▍ | 1349/1610 [5:02:16<40:35, 9.33s/it] 84%|████████▍ | 1350/1610 [5:02:24<39:02, 9.01s/it] {'loss': 0.0697, 'grad_norm': 5.574502883312429, 'learning_rate': 1.6149068322981366e-07, 'completion_length': 49.74107360839844, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.321428656578064, 'reward_std': 0.4216592609882355, 'kl': 1.740234375, 'epoch': 4.19} 84%|████████▍ | 1350/1610 [5:02:24<39:02, 9.01s/it] 84%|████████▍ | 1351/1610 [5:02:35<40:41, 9.43s/it] {'loss': 0.0531, 'grad_norm': 3.868045039249816, 'learning_rate': 1.608695652173913e-07, 'completion_length': 54.41071701049805, 'rewards/accuracy_reward': 0.232142873108387, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.1160714626312256, 'reward_std': 0.34288738667964935, 'kl': 1.330078125, 'epoch': 4.2} 84%|████████▍ | 1351/1610 [5:02:35<40:41, 9.43s/it] 84%|████████▍ | 1352/1610 [5:02:42<37:13, 8.66s/it] {'loss': 0.0293, 'grad_norm': 5.401388619948962, 'learning_rate': 1.6024844720496895e-07, 'completion_length': 39.33928680419922, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 0.955357164144516, 'reward': 1.4553571939468384, 'reward_std': 0.2663346976041794, 'kl': 0.728515625, 'epoch': 4.2} 84%|████████▍ | 1352/1610 [5:02:42<37:13, 8.66s/it] 84%|████████▍ | 1353/1610 [5:02:49<35:38, 8.32s/it] {'loss': 0.045, 'grad_norm': 4.271457198964956, 'learning_rate': 1.5962732919254656e-07, 'completion_length': 49.41071701049805, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.8928571939468384, 'reward': 
1.3750000596046448, 'reward_std': 0.3111763410270214, 'kl': 1.123046875, 'epoch': 4.2} 84%|████████▍ | 1353/1610 [5:02:49<35:38, 8.32s/it] 84%|████████▍ | 1354/1610 [5:02:56<34:24, 8.06s/it] {'loss': 0.0464, 'grad_norm': 6.779942898961372, 'learning_rate': 1.5900621118012422e-07, 'completion_length': 42.11607360839844, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.2946429252624512, 'reward_std': 0.3400046229362488, 'kl': 1.16015625, 'epoch': 4.2} 84%|████████▍ | 1354/1610 [5:02:57<34:24, 8.06s/it] 84%|████████▍ | 1355/1610 [5:03:05<35:08, 8.27s/it] {'loss': 0.0726, 'grad_norm': 5.686217274439282, 'learning_rate': 1.5838509316770185e-07, 'completion_length': 48.30357360839844, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.4196428656578064, 'reward_std': 0.3090367168188095, 'kl': 1.8125, 'epoch': 4.21} 84%|████████▍ | 1355/1610 [5:03:05<35:08, 8.27s/it] 84%|████████▍ | 1356/1610 [5:03:13<34:21, 8.12s/it] {'loss': 0.0584, 'grad_norm': 4.42392315127221, 'learning_rate': 1.577639751552795e-07, 'completion_length': 46.50893020629883, 'rewards/accuracy_reward': 0.2142857238650322, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.0982142984867096, 'reward_std': 0.3532022386789322, 'kl': 1.462890625, 'epoch': 4.21} 84%|████████▍ | 1356/1610 [5:03:13<34:21, 8.12s/it] 84%|████████▍ | 1357/1610 [5:03:21<33:38, 7.98s/it] {'loss': 0.0566, 'grad_norm': 4.978923953760584, 'learning_rate': 1.5714285714285714e-07, 'completion_length': 51.37500190734863, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 0.9017857313156128, 'reward': 1.1696429252624512, 'reward_std': 0.22996579855680466, 'kl': 1.412109375, 'epoch': 4.21} 84%|████████▍ | 1357/1610 [5:03:21<33:38, 7.98s/it] 84%|████████▍ | 1358/1610 [5:03:28<32:16, 7.68s/it] {'loss': 0.0336, 'grad_norm': 6.633733096995275, 'learning_rate': 1.565217391304348e-07, 'completion_length': 
41.02678680419922, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5178571939468384, 'reward_std': 0.32015572488307953, 'kl': 0.841796875, 'epoch': 4.22} 84%|████████▍ | 1358/1610 [5:03:28<32:16, 7.68s/it] 84%|████████▍ | 1359/1610 [5:03:37<33:57, 8.12s/it] {'loss': 0.0513, 'grad_norm': 3.4768593562027124, 'learning_rate': 1.559006211180124e-07, 'completion_length': 48.27678871154785, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 0.9017857313156128, 'reward': 1.2053571939468384, 'reward_std': 0.2093869298696518, 'kl': 1.279296875, 'epoch': 4.22} 84%|████████▍ | 1359/1610 [5:03:37<33:57, 8.12s/it] 84%|████████▍ | 1360/1610 [5:03:44<33:00, 7.92s/it] {'loss': 0.0804, 'grad_norm': 8.716174679584698, 'learning_rate': 1.5527950310559004e-07, 'completion_length': 41.47321701049805, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.3928571939468384, 'reward_std': 0.27414587140083313, 'kl': 2.00390625, 'epoch': 4.22} 84%|████████▍ | 1360/1610 [5:03:44<33:00, 7.92s/it] 85%|████████▍ | 1361/1610 [5:03:52<32:25, 7.81s/it] {'loss': 0.03, 'grad_norm': 3.7290944189852313, 'learning_rate': 1.546583850931677e-07, 'completion_length': 45.29464340209961, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3392857909202576, 'reward_std': 0.23535223305225372, 'kl': 0.7470703125, 'epoch': 4.23} 85%|████████▍ | 1361/1610 [5:03:52<32:25, 7.81s/it] 85%|████████▍ | 1362/1610 [5:03:59<31:32, 7.63s/it] {'loss': 0.0424, 'grad_norm': 4.479483919109792, 'learning_rate': 1.5403726708074533e-07, 'completion_length': 43.303571701049805, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 0.9196429252624512, 'reward': 1.446428656578064, 'reward_std': 0.27735617011785507, 'kl': 1.0615234375, 'epoch': 4.23} 85%|████████▍ | 1362/1610 [5:03:59<31:32, 7.63s/it] 85%|████████▍ | 1363/1610 
[5:04:06<30:43, 7.46s/it] {'loss': 0.0138, 'grad_norm': 4.4732001253942855, 'learning_rate': 1.53416149068323e-07, 'completion_length': 40.94643020629883, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.955357164144516, 'reward': 1.4375000596046448, 'reward_std': 0.22519004344940186, 'kl': 0.34375, 'epoch': 4.23} 85%|████████▍ | 1363/1610 [5:04:06<30:43, 7.46s/it] 85%|████████▍ | 1364/1610 [5:04:13<30:23, 7.41s/it] {'loss': 0.0207, 'grad_norm': 2.7507226581961293, 'learning_rate': 1.5279503105590062e-07, 'completion_length': 45.33035850524902, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5446429252624512, 'reward_std': 0.2274085357785225, 'kl': 0.517578125, 'epoch': 4.24} 85%|████████▍ | 1364/1610 [5:04:13<30:23, 7.41s/it] 85%|████████▍ | 1365/1610 [5:04:20<29:20, 7.18s/it] {'loss': 0.0201, 'grad_norm': 3.1599618928303888, 'learning_rate': 1.5217391304347825e-07, 'completion_length': 40.303571701049805, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.973214328289032, 'reward': 1.571428656578064, 'reward_std': 0.26506321132183075, 'kl': 0.5009765625, 'epoch': 4.24} 85%|████████▍ | 1365/1610 [5:04:20<29:20, 7.18s/it] 85%|████████▍ | 1366/1610 [5:04:27<29:11, 7.18s/it] {'loss': 0.0126, 'grad_norm': 8.969513521040238, 'learning_rate': 1.5155279503105589e-07, 'completion_length': 40.94643020629883, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.5178572535514832, 'reward_std': 0.2888980656862259, 'kl': 0.31494140625, 'epoch': 4.24} 85%|████████▍ | 1366/1610 [5:04:27<29:11, 7.18s/it] 85%|████████▍ | 1367/1610 [5:04:35<30:13, 7.46s/it] {'loss': 0.0356, 'grad_norm': 4.137564594272407, 'learning_rate': 1.5093167701863354e-07, 'completion_length': 44.02678871154785, 'rewards/accuracy_reward': 0.3125000223517418, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2410715222358704, 'reward_std': 
0.24419668316841125, 'kl': 0.890625, 'epoch': 4.25} 85%|████████▍ | 1367/1610 [5:04:35<30:13, 7.46s/it] 85%|████████▍ | 1368/1610 [5:04:42<29:36, 7.34s/it] {'loss': 0.0099, 'grad_norm': 2.4758918584976204, 'learning_rate': 1.5031055900621118e-07, 'completion_length': 44.54464530944824, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.973214328289032, 'reward': 1.348214328289032, 'reward_std': 0.12054043263196945, 'kl': 0.24658203125, 'epoch': 4.25} 85%|████████▍ | 1368/1610 [5:04:42<29:36, 7.34s/it] 85%|████████▌ | 1369/1610 [5:04:49<29:03, 7.24s/it] {'loss': 0.0151, 'grad_norm': 3.0231130693987143, 'learning_rate': 1.496894409937888e-07, 'completion_length': 42.50000190734863, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.3750000596046448, 'reward_std': 0.16323686763644218, 'kl': 0.37841796875, 'epoch': 4.25} 85%|████████▌ | 1369/1610 [5:04:49<29:03, 7.24s/it] 85%|████████▌ | 1370/1610 [5:04:56<28:34, 7.14s/it] {'loss': 0.0229, 'grad_norm': 4.015913090538761, 'learning_rate': 1.4906832298136647e-07, 'completion_length': 42.01785850524902, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5000000596046448, 'reward_std': 0.1550101786851883, 'kl': 0.5703125, 'epoch': 4.25} 85%|████████▌ | 1370/1610 [5:04:56<28:34, 7.14s/it] 85%|████████▌ | 1371/1610 [5:05:04<28:44, 7.22s/it] {'loss': 0.0378, 'grad_norm': 3.9026062259430976, 'learning_rate': 1.4844720496894407e-07, 'completion_length': 43.36607360839844, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4553572535514832, 'reward_std': 0.13334746658802032, 'kl': 0.9482421875, 'epoch': 4.26} 85%|████████▌ | 1371/1610 [5:05:04<28:44, 7.22s/it] 85%|████████▌ | 1372/1610 [5:05:11<29:07, 7.34s/it] {'loss': 0.037, 'grad_norm': 5.481864301540194, 'learning_rate': 1.4782608695652173e-07, 'completion_length': 39.36607360839844, 
'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.223214328289032, 'reward_std': 0.3124036490917206, 'kl': 0.923828125, 'epoch': 4.26} 85%|████████▌ | 1372/1610 [5:05:11<29:07, 7.34s/it] 85%|████████▌ | 1373/1610 [5:05:18<28:42, 7.27s/it] {'loss': 0.0373, 'grad_norm': 9.366006644199405, 'learning_rate': 1.4720496894409937e-07, 'completion_length': 44.616071701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5535714626312256, 'reward_std': 0.19112762808799744, 'kl': 0.931640625, 'epoch': 4.26} 85%|████████▌ | 1373/1610 [5:05:18<28:42, 7.27s/it] 85%|████████▌ | 1374/1610 [5:05:25<28:19, 7.20s/it] {'loss': 0.0389, 'grad_norm': 2.546102189029719, 'learning_rate': 1.4658385093167703e-07, 'completion_length': 45.32143020629883, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.3392857909202576, 'reward_std': 0.1282925382256508, 'kl': 0.97265625, 'epoch': 4.27} 85%|████████▌ | 1374/1610 [5:05:25<28:19, 7.20s/it] 85%|████████▌ | 1375/1610 [5:05:33<28:03, 7.16s/it] {'loss': 0.0079, 'grad_norm': 5.36655334933912, 'learning_rate': 1.4596273291925466e-07, 'completion_length': 43.34821701049805, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750000596046448, 'reward_std': 0.17941362410783768, 'kl': 0.197265625, 'epoch': 4.27} 85%|████████▌ | 1375/1610 [5:05:33<28:03, 7.16s/it] 85%|████████▌ | 1376/1610 [5:05:39<27:22, 7.02s/it] {'loss': 0.0174, 'grad_norm': 3.3445509310657964, 'learning_rate': 1.4534161490683232e-07, 'completion_length': 39.67857360839844, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.973214328289032, 'reward': 1.383928656578064, 'reward_std': 0.13446423411369324, 'kl': 0.4326171875, 'epoch': 4.27} 85%|████████▌ | 1376/1610 [5:05:39<27:22, 7.02s/it] 86%|████████▌ | 1377/1610 [5:05:46<27:29, 7.08s/it] {'loss': 
0.0796, 'grad_norm': 8.033351870727145, 'learning_rate': 1.4472049689440992e-07, 'completion_length': 44.29464340209961, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.3392857909202576, 'reward_std': 0.21703942865133286, 'kl': 1.9892578125, 'epoch': 4.28} 86%|████████▌ | 1377/1610 [5:05:46<27:29, 7.08s/it] 86%|████████▌ | 1378/1610 [5:05:54<27:24, 7.09s/it] {'loss': 0.0265, 'grad_norm': 5.170775953595978, 'learning_rate': 1.4409937888198756e-07, 'completion_length': 38.96428871154785, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.955357164144516, 'reward': 1.4017857909202576, 'reward_std': 0.394686296582222, 'kl': 0.66162109375, 'epoch': 4.28} 86%|████████▌ | 1378/1610 [5:05:54<27:24, 7.09s/it] 86%|████████▌ | 1379/1610 [5:06:01<27:18, 7.09s/it] {'loss': 0.0613, 'grad_norm': 5.346858314993642, 'learning_rate': 1.4347826086956521e-07, 'completion_length': 42.41964530944824, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.5178571939468384, 'reward_std': 0.3253912851214409, 'kl': 1.5361328125, 'epoch': 4.28} 86%|████████▌ | 1379/1610 [5:06:01<27:18, 7.09s/it] 86%|████████▌ | 1380/1610 [5:06:08<27:01, 7.05s/it] {'loss': 0.0762, 'grad_norm': 5.167940923245825, 'learning_rate': 1.4285714285714285e-07, 'completion_length': 46.125003814697266, 'rewards/accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.1696428656578064, 'reward_std': 0.4257373958826065, 'kl': 1.91015625, 'epoch': 4.29} 86%|████████▌ | 1380/1610 [5:06:08<27:01, 7.05s/it] 86%|████████▌ | 1381/1610 [5:06:16<27:53, 7.31s/it] {'loss': 0.1013, 'grad_norm': 8.585519503167902, 'learning_rate': 1.422360248447205e-07, 'completion_length': 46.85714530944824, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.3928571939468384, 'reward_std': 0.4141524136066437, 'kl': 2.537109375, 
'epoch': 4.29} 86%|████████▌ | 1381/1610 [5:06:16<27:53, 7.31s/it] 86%|████████▌ | 1382/1610 [5:06:23<27:40, 7.28s/it] {'loss': 0.0266, 'grad_norm': 10.636055391644856, 'learning_rate': 1.4161490683229814e-07, 'completion_length': 41.11607360839844, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.3125000596046448, 'reward_std': 0.3498137593269348, 'kl': 0.6640625, 'epoch': 4.29} 86%|████████▌ | 1382/1610 [5:06:23<27:40, 7.28s/it] 86%|████████▌ | 1383/1610 [5:06:30<27:25, 7.25s/it] {'loss': 0.0252, 'grad_norm': 5.171644906444195, 'learning_rate': 1.4099378881987577e-07, 'completion_length': 45.24107360839844, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.4553571939468384, 'reward_std': 0.3802228420972824, 'kl': 0.630859375, 'epoch': 4.3} 86%|████████▌ | 1383/1610 [5:06:30<27:25, 7.25s/it] 86%|████████▌ | 1384/1610 [5:06:37<27:01, 7.17s/it] {'loss': 0.0209, 'grad_norm': 2.7126008315354273, 'learning_rate': 1.403726708074534e-07, 'completion_length': 44.87500190734863, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.4821429252624512, 'reward_std': 0.22215460240840912, 'kl': 0.5234375, 'epoch': 4.3} 86%|████████▌ | 1384/1610 [5:06:37<27:01, 7.17s/it] 86%|████████▌ | 1385/1610 [5:06:44<26:39, 7.11s/it] {'loss': 0.0231, 'grad_norm': 2.9805891691370046, 'learning_rate': 1.3975155279503104e-07, 'completion_length': 41.65178871154785, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.3392857909202576, 'reward_std': 0.14518078416585922, 'kl': 0.57763671875, 'epoch': 4.3} 86%|████████▌ | 1385/1610 [5:06:44<26:39, 7.11s/it] 86%|████████▌ | 1386/1610 [5:06:52<27:12, 7.29s/it] {'loss': 0.081, 'grad_norm': 8.22336329687187, 'learning_rate': 1.391304347826087e-07, 'completion_length': 43.33035850524902, 'rewards/accuracy_reward': 0.3928571790456772, 
'rewards/format_reward': 0.8750000298023224, 'reward': 1.2678571939468384, 'reward_std': 0.3380538523197174, 'kl': 2.0234375, 'epoch': 4.3} 86%|████████▌ | 1386/1610 [5:06:52<27:12, 7.29s/it] 86%|████████▌ | 1387/1610 [5:07:00<27:49, 7.49s/it] {'loss': 0.0498, 'grad_norm': 6.751964723194883, 'learning_rate': 1.3850931677018633e-07, 'completion_length': 42.37500190734863, 'rewards/accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4732143878936768, 'reward_std': 0.3521358519792557, 'kl': 1.244140625, 'epoch': 4.31} 86%|████████▌ | 1387/1610 [5:07:00<27:49, 7.49s/it] 86%|████████▌ | 1388/1610 [5:07:06<26:45, 7.23s/it] {'loss': 0.0247, 'grad_norm': 3.6008392398050053, 'learning_rate': 1.3788819875776399e-07, 'completion_length': 43.90178680419922, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.383928656578064, 'reward_std': 0.25335483253002167, 'kl': 0.6171875, 'epoch': 4.31} 86%|████████▌ | 1388/1610 [5:07:06<26:45, 7.23s/it] 86%|████████▋ | 1389/1610 [5:07:13<25:54, 7.03s/it] {'loss': 0.0462, 'grad_norm': 20.801800489910438, 'learning_rate': 1.372670807453416e-07, 'completion_length': 42.49107360839844, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.633928656578064, 'reward_std': 0.32911069691181183, 'kl': 1.15234375, 'epoch': 4.31} 86%|████████▋ | 1389/1610 [5:07:13<25:54, 7.03s/it] 86%|████████▋ | 1390/1610 [5:07:20<26:02, 7.10s/it] {'loss': 0.0567, 'grad_norm': 7.132927789834126, 'learning_rate': 1.3664596273291925e-07, 'completion_length': 46.56250190734863, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.285714328289032, 'reward_std': 0.4807567298412323, 'kl': 1.416015625, 'epoch': 4.32} 86%|████████▋ | 1390/1610 [5:07:20<26:02, 7.10s/it] 86%|████████▋ | 1391/1610 [5:07:27<26:13, 7.18s/it] {'loss': 0.0538, 'grad_norm': 4.919862856497752, 
'learning_rate': 1.3602484472049688e-07, 'completion_length': 44.79464530944824, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.4553571939468384, 'reward_std': 0.3823588639497757, 'kl': 1.34375, 'epoch': 4.32} 86%|████████▋ | 1391/1610 [5:07:27<26:13, 7.18s/it] 86%|████████▋ | 1392/1610 [5:07:35<26:17, 7.23s/it] {'loss': 0.0411, 'grad_norm': 4.099664502700367, 'learning_rate': 1.3540372670807454e-07, 'completion_length': 44.38393020629883, 'rewards/accuracy_reward': 0.508928582072258, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4196429252624512, 'reward_std': 0.2870131582021713, 'kl': 1.029296875, 'epoch': 4.32} 86%|████████▋ | 1392/1610 [5:07:35<26:17, 7.23s/it] 87%|████████▋ | 1393/1610 [5:07:42<26:34, 7.35s/it] {'loss': 0.0656, 'grad_norm': 4.984529392034855, 'learning_rate': 1.3478260869565218e-07, 'completion_length': 47.06250190734863, 'rewards/accuracy_reward': 0.2053571492433548, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.0892857313156128, 'reward_std': 0.19057324528694153, 'kl': 1.640625, 'epoch': 4.33} 87%|████████▋ | 1393/1610 [5:07:42<26:34, 7.35s/it] 87%|████████▋ | 1394/1610 [5:07:49<26:14, 7.29s/it] {'loss': 0.0441, 'grad_norm': 5.252070935477526, 'learning_rate': 1.3416149068322978e-07, 'completion_length': 45.88393020629883, 'rewards/accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.3928571939468384, 'reward_std': 0.3570459634065628, 'kl': 1.1025390625, 'epoch': 4.33} 87%|████████▋ | 1394/1610 [5:07:50<26:14, 7.29s/it] 87%|████████▋ | 1395/1610 [5:07:56<25:41, 7.17s/it] {'loss': 0.037, 'grad_norm': 4.094600912376587, 'learning_rate': 1.3354037267080744e-07, 'completion_length': 44.44643020629883, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.16653180867433548, 'kl': 0.92529296875, 'epoch': 4.33} 87%|████████▋ | 1395/1610 
[5:07:56<25:41, 7.17s/it] 87%|████████▋ | 1396/1610 [5:08:06<28:19, 7.94s/it] {'loss': 0.0394, 'grad_norm': 6.700829013939446, 'learning_rate': 1.3291925465838507e-07, 'completion_length': 50.33035850524902, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.3035714626312256, 'reward_std': 0.3152393400669098, 'kl': 0.984375, 'epoch': 4.34} 87%|████████▋ | 1396/1610 [5:08:06<28:19, 7.94s/it] 87%|████████▋ | 1397/1610 [5:08:13<27:33, 7.76s/it] {'loss': 0.057, 'grad_norm': 8.536670041536023, 'learning_rate': 1.3229813664596273e-07, 'completion_length': 39.803571701049805, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4375000596046448, 'reward_std': 0.4059313088655472, 'kl': 1.42578125, 'epoch': 4.34} 87%|████████▋ | 1397/1610 [5:08:13<27:33, 7.76s/it] 87%|████████▋ | 1398/1610 [5:08:22<27:44, 7.85s/it] {'loss': 0.0641, 'grad_norm': 5.726432165466818, 'learning_rate': 1.3167701863354037e-07, 'completion_length': 47.31250190734863, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.2500000596046448, 'reward_std': 0.2821261137723923, 'kl': 1.6015625, 'epoch': 4.34} 87%|████████▋ | 1398/1610 [5:08:22<27:44, 7.85s/it] 87%|████████▋ | 1399/1610 [5:08:29<26:56, 7.66s/it] {'loss': 0.0247, 'grad_norm': 11.973536939911877, 'learning_rate': 1.3105590062111802e-07, 'completion_length': 40.56250190734863, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.571428656578064, 'reward_std': 0.3084176629781723, 'kl': 0.619140625, 'epoch': 4.34} 87%|████████▋ | 1399/1610 [5:08:29<26:56, 7.66s/it] 87%|████████▋ | 1400/1610 [5:08:35<25:45, 7.36s/it] {'loss': 0.0333, 'grad_norm': 5.43462917601122, 'learning_rate': 1.3043478260869563e-07, 'completion_length': 39.66071701049805, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9196428954601288, 'reward': 
1.4910714626312256, 'reward_std': 0.2586623728275299, 'kl': 0.83203125, 'epoch': 4.35} 87%|████████▋ | 1400/1610 [5:08:35<25:45, 7.36s/it] 87%|████████▋ | 1401/1610 [5:09:26<1:10:56, 20.37s/it] {'loss': 0.0639, 'grad_norm': 4.995610031584886, 'learning_rate': 1.298136645962733e-07, 'completion_length': 53.892860412597656, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.8214286267757416, 'reward': 1.3303571939468384, 'reward_std': 0.40715545415878296, 'kl': 1.6015625, 'epoch': 4.35} 87%|████████▋ | 1401/1610 [5:09:26<1:10:56, 20.37s/it] 87%|████████▋ | 1402/1610 [5:09:33<56:29, 16.30s/it] {'loss': 0.0607, 'grad_norm': 5.770613972664435, 'learning_rate': 1.2919254658385092e-07, 'completion_length': 40.44643020629883, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.3839285969734192, 'reward_std': 0.39676326513290405, 'kl': 1.513671875, 'epoch': 4.35} 87%|████████▋ | 1402/1610 [5:09:33<56:29, 16.30s/it] 87%|████████▋ | 1403/1610 [5:09:40<46:52, 13.59s/it] {'loss': 0.0541, 'grad_norm': 4.2457067420364325, 'learning_rate': 1.2857142857142855e-07, 'completion_length': 46.14285850524902, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.892857164144516, 'reward': 1.535714328289032, 'reward_std': 0.3254503905773163, 'kl': 1.353515625, 'epoch': 4.36} 87%|████████▋ | 1403/1610 [5:09:40<46:52, 13.59s/it] 87%|████████▋ | 1404/1610 [5:09:47<39:44, 11.57s/it] {'loss': 0.047, 'grad_norm': 6.689006151860356, 'learning_rate': 1.2795031055900621e-07, 'completion_length': 38.25000190734863, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.892857164144516, 'reward': 1.4732143878936768, 'reward_std': 0.36195534467697144, 'kl': 1.1728515625, 'epoch': 4.36} 87%|████████▋ | 1404/1610 [5:09:47<39:44, 11.57s/it] 87%|████████▋ | 1405/1610 [5:09:57<37:31, 10.98s/it] {'loss': 0.0623, 'grad_norm': 12.683891767476586, 'learning_rate': 1.2732919254658385e-07, 
'completion_length': 49.10714530944824, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.8392857313156128, 'reward': 1.410714328289032, 'reward_std': 0.4848206341266632, 'kl': 1.552734375, 'epoch': 4.36} 87%|████████▋ | 1405/1610 [5:09:57<37:31, 10.98s/it] 87%|████████▋ | 1406/1610 [5:10:04<33:09, 9.75s/it] {'loss': 0.0121, 'grad_norm': 6.141284720475197, 'learning_rate': 1.2670807453416148e-07, 'completion_length': 37.33928680419922, 'rewards/accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.7142857909202576, 'reward_std': 0.24289216101169586, 'kl': 0.30224609375, 'epoch': 4.37} 87%|████████▋ | 1406/1610 [5:10:04<33:09, 9.75s/it] 87%|████████▋ | 1407/1610 [5:10:11<30:16, 8.95s/it] {'loss': 0.0384, 'grad_norm': 7.768747212543534, 'learning_rate': 1.260869565217391e-07, 'completion_length': 38.11607360839844, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.4732143878936768, 'reward_std': 0.23102832213044167, 'kl': 0.962890625, 'epoch': 4.37} 87%|████████▋ | 1407/1610 [5:10:11<30:16, 8.95s/it] 87%|████████▋ | 1408/1610 [5:10:18<28:39, 8.51s/it] {'loss': 0.0426, 'grad_norm': 4.85881488482656, 'learning_rate': 1.2546583850931677e-07, 'completion_length': 45.23214530944824, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.4732143878936768, 'reward_std': 0.33246469497680664, 'kl': 1.064453125, 'epoch': 4.37} 87%|████████▋ | 1408/1610 [5:10:18<28:39, 8.51s/it] 88%|████████▊ | 1409/1610 [5:10:26<28:17, 8.44s/it] {'loss': 0.0409, 'grad_norm': 4.823546694892068, 'learning_rate': 1.248447204968944e-07, 'completion_length': 44.58035850524902, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.4375000596046448, 'reward_std': 0.36271435022354126, 'kl': 1.0234375, 'epoch': 4.38} 88%|████████▊ | 1409/1610 [5:10:26<28:17, 8.44s/it] 88%|████████▊ | 
1410/1610 [5:10:35<27:57, 8.39s/it] {'loss': 0.0929, 'grad_norm': 9.155187233518054, 'learning_rate': 1.2422360248447204e-07, 'completion_length': 46.50893211364746, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.4910714626312256, 'reward_std': 0.2999540716409683, 'kl': 2.33203125, 'epoch': 4.38} 88%|████████▊ | 1410/1610 [5:10:35<27:57, 8.39s/it] 88%|████████▊ | 1411/1610 [5:10:42<26:42, 8.06s/it] {'loss': 0.0673, 'grad_norm': 7.846669267072267, 'learning_rate': 1.236024844720497e-07, 'completion_length': 40.63393020629883, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.8303571939468384, 'reward': 1.4196429252624512, 'reward_std': 0.44196413457393646, 'kl': 1.6796875, 'epoch': 4.38} 88%|████████▊ | 1411/1610 [5:10:42<26:42, 8.06s/it] 88%|████████▊ | 1412/1610 [5:10:49<25:31, 7.73s/it] {'loss': 0.0824, 'grad_norm': 5.44954509373456, 'learning_rate': 1.2298136645962733e-07, 'completion_length': 40.91964530944824, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.8303571939468384, 'reward': 1.3392857909202576, 'reward_std': 0.5092727988958359, 'kl': 2.06640625, 'epoch': 4.39} 88%|████████▊ | 1412/1610 [5:10:49<25:31, 7.73s/it] 88%|████████▊ | 1413/1610 [5:11:01<30:08, 9.18s/it] {'loss': 0.0719, 'grad_norm': 7.3117221729171025, 'learning_rate': 1.2236024844720496e-07, 'completion_length': 51.40178680419922, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.2500000596046448, 'reward_std': 0.42012353241443634, 'kl': 1.80078125, 'epoch': 4.39} 88%|████████▊ | 1413/1610 [5:11:01<30:08, 9.18s/it] 88%|████████▊ | 1414/1610 [5:11:09<28:22, 8.69s/it] {'loss': 0.0191, 'grad_norm': 5.838771616365456, 'learning_rate': 1.2173913043478262e-07, 'completion_length': 37.63393020629883, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.473214328289032, 'reward_std': 
0.2703763544559479, 'kl': 0.478515625, 'epoch': 4.39} 88%|████████▊ | 1414/1610 [5:11:09<28:22, 8.69s/it] 88%|████████▊ | 1415/1610 [5:11:17<27:31, 8.47s/it] {'loss': 0.0723, 'grad_norm': 10.352721319062264, 'learning_rate': 1.2111801242236025e-07, 'completion_length': 45.07143211364746, 'rewards/accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.8214286267757416, 'reward': 1.1875000596046448, 'reward_std': 0.3012109845876694, 'kl': 1.80859375, 'epoch': 4.39} 88%|████████▊ | 1415/1610 [5:11:17<27:31, 8.47s/it] 88%|████████▊ | 1416/1610 [5:11:25<26:34, 8.22s/it] {'loss': 0.113, 'grad_norm': 7.84068081447663, 'learning_rate': 1.2049689440993788e-07, 'completion_length': 44.19643020629883, 'rewards/accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.4553571939468384, 'reward_std': 0.5437164306640625, 'kl': 2.82421875, 'epoch': 4.4} 88%|████████▊ | 1416/1610 [5:11:25<26:34, 8.22s/it] 88%|████████▊ | 1417/1610 [5:11:32<25:26, 7.91s/it] {'loss': 0.057, 'grad_norm': 7.849017322122853, 'learning_rate': 1.1987577639751552e-07, 'completion_length': 43.75000190734863, 'rewards/accuracy_reward': 0.2767857313156128, 'rewards/format_reward': 0.7678571939468384, 'reward': 1.0446429252624512, 'reward_std': 0.5033959895372391, 'kl': 1.421875, 'epoch': 4.4} 88%|████████▊ | 1417/1610 [5:11:32<25:26, 7.91s/it] 88%|████████▊ | 1418/1610 [5:11:39<25:06, 7.85s/it] {'loss': 0.0668, 'grad_norm': 5.993914144879054, 'learning_rate': 1.1925465838509315e-07, 'completion_length': 47.35714530944824, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.866071492433548, 'reward': 1.2500000596046448, 'reward_std': 0.30824482440948486, 'kl': 1.66796875, 'epoch': 4.4} 88%|████████▊ | 1418/1610 [5:11:39<25:06, 7.85s/it] 88%|████████▊ | 1419/1610 [5:11:47<24:12, 7.60s/it] {'loss': 0.0537, 'grad_norm': 5.630840920966283, 'learning_rate': 1.1863354037267081e-07, 'completion_length': 41.18750190734863, 'rewards/accuracy_reward': 
0.642857164144516, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.5803571939468384, 'reward_std': 0.26841163635253906, 'kl': 1.34375, 'epoch': 4.41} 88%|████████▊ | 1419/1610 [5:11:47<24:12, 7.60s/it] 88%|████████▊ | 1420/1610 [5:11:54<23:51, 7.53s/it] {'loss': 0.06, 'grad_norm': 5.001095336022731, 'learning_rate': 1.1801242236024844e-07, 'completion_length': 46.37500190734863, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.830357164144516, 'reward': 1.258928656578064, 'reward_std': 0.47697292268276215, 'kl': 1.49609375, 'epoch': 4.41} 88%|████████▊ | 1420/1610 [5:11:54<23:51, 7.53s/it] 88%|████████▊ | 1421/1610 [5:12:04<26:20, 8.36s/it] {'loss': 0.183, 'grad_norm': 15.51566075066573, 'learning_rate': 1.1739130434782609e-07, 'completion_length': 52.30357360839844, 'rewards/accuracy_reward': 0.2053571566939354, 'rewards/format_reward': 0.7142857313156128, 'reward': 0.9196428954601288, 'reward_std': 0.46358053386211395, 'kl': 4.5625, 'epoch': 4.41} 88%|████████▊ | 1421/1610 [5:12:04<26:20, 8.36s/it] 88%|████████▊ | 1422/1610 [5:12:12<25:16, 8.06s/it] {'loss': 0.1001, 'grad_norm': 9.137650256685786, 'learning_rate': 1.1677018633540373e-07, 'completion_length': 43.60714530944824, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.7500000298023224, 'reward': 1.2500000596046448, 'reward_std': 0.5706851482391357, 'kl': 2.5, 'epoch': 4.42} 88%|████████▊ | 1422/1610 [5:12:12<25:16, 8.06s/it] 88%|████████▊ | 1423/1610 [5:12:19<24:39, 7.91s/it] {'loss': 0.0667, 'grad_norm': 6.517155755135767, 'learning_rate': 1.1614906832298136e-07, 'completion_length': 42.01785850524902, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.5000001192092896, 'reward_std': 0.3413098603487015, 'kl': 1.6640625, 'epoch': 4.42} 88%|████████▊ | 1423/1610 [5:12:19<24:39, 7.91s/it] 88%|████████▊ | 1424/1610 [5:12:27<24:12, 7.81s/it] {'loss': 0.0895, 'grad_norm': 10.057474080758533, 
'learning_rate': 1.15527950310559e-07, 'completion_length': 40.93750190734863, 'rewards/accuracy_reward': 0.5803571790456772, 'rewards/format_reward': 0.848214328289032, 'reward': 1.4285714626312256, 'reward_std': 0.31369465589523315, 'kl': 2.234375, 'epoch': 4.42} 88%|████████▊ | 1424/1610 [5:12:27<24:12, 7.81s/it] 89%|████████▊ | 1425/1610 [5:12:34<23:50, 7.73s/it] {'loss': 0.0899, 'grad_norm': 8.855356505699989, 'learning_rate': 1.1490683229813663e-07, 'completion_length': 41.84821701049805, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.830357164144516, 'reward': 1.2500000596046448, 'reward_std': 0.41419604420661926, 'kl': 2.2421875, 'epoch': 4.43} 89%|████████▊ | 1425/1610 [5:12:34<23:50, 7.73s/it] 89%|████████▊ | 1426/1610 [5:12:43<24:40, 8.05s/it] {'loss': 0.0739, 'grad_norm': 11.416235740443199, 'learning_rate': 1.1428571428571427e-07, 'completion_length': 42.02678871154785, 'rewards/accuracy_reward': 0.4553571790456772, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.348214328289032, 'reward_std': 0.3790639042854309, 'kl': 1.84765625, 'epoch': 4.43} 89%|████████▊ | 1426/1610 [5:12:43<24:40, 8.05s/it] 89%|████████▊ | 1427/1610 [5:12:53<26:23, 8.66s/it] {'loss': 0.0475, 'grad_norm': 5.511986770460964, 'learning_rate': 1.1366459627329192e-07, 'completion_length': 46.705360412597656, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.8839286267757416, 'reward': 1.4553572535514832, 'reward_std': 0.29670925438404083, 'kl': 1.1875, 'epoch': 4.43} 89%|████████▊ | 1427/1610 [5:12:53<26:23, 8.66s/it] 89%|████████▊ | 1428/1610 [5:13:01<25:22, 8.37s/it] {'loss': 0.0604, 'grad_norm': 5.78160508354707, 'learning_rate': 1.1304347826086955e-07, 'completion_length': 43.71428680419922, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.223214328289032, 'reward_std': 0.3448506295681, 'kl': 1.505859375, 'epoch': 4.43} 89%|████████▊ | 1428/1610 [5:13:01<25:22, 
8.37s/it] 89%|████████▉ | 1429/1610 [5:13:08<24:31, 8.13s/it] {'loss': 0.059, 'grad_norm': 4.601424109567119, 'learning_rate': 1.124223602484472e-07, 'completion_length': 45.85714530944824, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 0.8839286267757416, 'reward': 1.3750000596046448, 'reward_std': 0.33109965920448303, 'kl': 1.47265625, 'epoch': 4.44} 89%|████████▉ | 1429/1610 [5:13:08<24:31, 8.13s/it] 89%|████████▉ | 1430/1610 [5:13:16<23:49, 7.94s/it] {'loss': 0.0646, 'grad_norm': 4.481929612369556, 'learning_rate': 1.1180124223602484e-07, 'completion_length': 45.705360412597656, 'rewards/accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.892857164144516, 'reward': 1.3660714626312256, 'reward_std': 0.2507122904062271, 'kl': 1.6171875, 'epoch': 4.44} 89%|████████▉ | 1430/1610 [5:13:16<23:49, 7.94s/it] 89%|████████▉ | 1431/1610 [5:13:23<23:22, 7.83s/it] {'loss': 0.0389, 'grad_norm': 5.671777905366679, 'learning_rate': 1.1118012422360248e-07, 'completion_length': 40.44643020629883, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3303571939468384, 'reward_std': 0.33637309074401855, 'kl': 0.97265625, 'epoch': 4.44} 89%|████████▉ | 1431/1610 [5:13:23<23:22, 7.83s/it] 89%|████████▉ | 1432/1610 [5:13:31<23:11, 7.82s/it] {'loss': 0.062, 'grad_norm': 6.723165132583994, 'learning_rate': 1.1055900621118012e-07, 'completion_length': 46.267860412597656, 'rewards/accuracy_reward': 0.2767857313156128, 'rewards/format_reward': 0.848214328289032, 'reward': 1.1250000596046448, 'reward_std': 0.2720487713813782, 'kl': 1.55078125, 'epoch': 4.45} 89%|████████▉ | 1432/1610 [5:13:31<23:11, 7.82s/it] 89%|████████▉ | 1433/1610 [5:13:38<22:16, 7.55s/it] {'loss': 0.0109, 'grad_norm': 4.255061463031153, 'learning_rate': 1.0993788819875776e-07, 'completion_length': 39.06250190734863, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9910714626312256, 'reward': 
1.4285715222358704, 'reward_std': 0.22726977616548538, 'kl': 0.27392578125, 'epoch': 4.45} 89%|████████▉ | 1433/1610 [5:13:38<22:16, 7.55s/it] 89%|████████▉ | 1434/1610 [5:13:45<21:28, 7.32s/it] {'loss': 0.0237, 'grad_norm': 4.31332648530418, 'learning_rate': 1.0931677018633539e-07, 'completion_length': 43.31250190734863, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.4642857909202576, 'reward_std': 0.2792610451579094, 'kl': 0.591796875, 'epoch': 4.45} 89%|████████▉ | 1434/1610 [5:13:45<21:28, 7.32s/it] 89%|████████▉ | 1435/1610 [5:13:52<21:17, 7.30s/it] {'loss': 0.0809, 'grad_norm': 7.5252652844208425, 'learning_rate': 1.0869565217391303e-07, 'completion_length': 44.27678680419922, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 0.892857164144516, 'reward': 1.383928656578064, 'reward_std': 0.26707248389720917, 'kl': 2.017578125, 'epoch': 4.46} 89%|████████▉ | 1435/1610 [5:13:52<21:17, 7.30s/it] 89%|████████▉ | 1436/1610 [5:13:59<21:04, 7.27s/it] {'loss': 0.0515, 'grad_norm': 5.133398179580612, 'learning_rate': 1.0807453416149068e-07, 'completion_length': 42.19643020629883, 'rewards/accuracy_reward': 0.526785746216774, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.4196429252624512, 'reward_std': 0.336430162191391, 'kl': 1.287109375, 'epoch': 4.46} 89%|████████▉ | 1436/1610 [5:13:59<21:04, 7.27s/it] 89%|████████▉ | 1437/1610 [5:14:06<20:26, 7.09s/it] {'loss': 0.0185, 'grad_norm': 5.364966239135723, 'learning_rate': 1.0745341614906831e-07, 'completion_length': 38.73214530944824, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.973214328289032, 'reward': 1.5892857909202576, 'reward_std': 0.2168244868516922, 'kl': 0.4609375, 'epoch': 4.46} 89%|████████▉ | 1437/1610 [5:14:06<20:26, 7.09s/it] 89%|████████▉ | 1438/1610 [5:14:13<20:29, 7.15s/it] {'loss': 0.0475, 'grad_norm': 4.09658406577515, 'learning_rate': 1.0683229813664596e-07, 'completion_length': 
44.30357360839844, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.892857164144516, 'reward': 1.3571429252624512, 'reward_std': 0.4037874639034271, 'kl': 1.1904296875, 'epoch': 4.47} 89%|████████▉ | 1438/1610 [5:14:13<20:29, 7.15s/it] 89%|████████▉ | 1439/1610 [5:14:21<20:34, 7.22s/it] {'loss': 0.0731, 'grad_norm': 8.491534298036514, 'learning_rate': 1.062111801242236e-07, 'completion_length': 41.428571701049805, 'rewards/accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.3125000596046448, 'reward_std': 0.4973255842924118, 'kl': 1.828125, 'epoch': 4.47} 89%|████████▉ | 1439/1610 [5:14:21<20:34, 7.22s/it] 89%|████████▉ | 1440/1610 [5:14:28<20:30, 7.24s/it] {'loss': 0.0464, 'grad_norm': 4.722434448317464, 'learning_rate': 1.0559006211180124e-07, 'completion_length': 40.96428680419922, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.2410715222358704, 'reward_std': 0.3072052597999573, 'kl': 1.16015625, 'epoch': 4.47} 89%|████████▉ | 1440/1610 [5:14:28<20:30, 7.24s/it] 90%|████████▉ | 1441/1610 [5:14:35<20:23, 7.24s/it] {'loss': 0.0295, 'grad_norm': 4.293203239400322, 'learning_rate': 1.0496894409937888e-07, 'completion_length': 46.91964340209961, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.955357164144516, 'reward': 1.3660715222358704, 'reward_std': 0.24633492529392242, 'kl': 0.7392578125, 'epoch': 4.48} 90%|████████▉ | 1441/1610 [5:14:35<20:23, 7.24s/it] 90%|████████▉ | 1442/1610 [5:14:42<20:02, 7.16s/it] {'loss': 0.045, 'grad_norm': 5.04637261751424, 'learning_rate': 1.0434782608695651e-07, 'completion_length': 42.15178680419922, 'rewards/accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.2678571939468384, 'reward_std': 0.20526636391878128, 'kl': 1.125, 'epoch': 4.48} 90%|████████▉ | 1442/1610 [5:14:42<20:02, 7.16s/it] 90%|████████▉ | 1443/1610 [5:14:50<20:00, 7.19s/it] 
{'loss': 0.0777, 'grad_norm': 7.300546600000178, 'learning_rate': 1.0372670807453415e-07, 'completion_length': 40.75893020629883, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.848214328289032, 'reward': 1.446428656578064, 'reward_std': 0.3724116384983063, 'kl': 1.93359375, 'epoch': 4.48} 90%|████████▉ | 1443/1610 [5:14:50<20:00, 7.19s/it] 90%|████████▉ | 1444/1610 [5:14:57<19:57, 7.21s/it] {'loss': 0.0361, 'grad_norm': 4.847915988398458, 'learning_rate': 1.0310559006211179e-07, 'completion_length': 43.85714530944824, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.4107143878936768, 'reward_std': 0.32701918482780457, 'kl': 0.90234375, 'epoch': 4.48} 90%|████████▉ | 1444/1610 [5:14:57<19:57, 7.21s/it] 90%|████████▉ | 1445/1610 [5:15:03<19:24, 7.06s/it] {'loss': 0.0454, 'grad_norm': 4.108655487276695, 'learning_rate': 1.0248447204968944e-07, 'completion_length': 42.61607360839844, 'rewards/accuracy_reward': 0.589285746216774, 'rewards/format_reward': 0.928571492433548, 'reward': 1.5178572535514832, 'reward_std': 0.20141704380512238, 'kl': 1.1337890625, 'epoch': 4.49} 90%|████████▉ | 1445/1610 [5:15:03<19:24, 7.06s/it] 90%|████████▉ | 1446/1610 [5:15:10<19:15, 7.04s/it] {'loss': 0.0433, 'grad_norm': 5.622386249543025, 'learning_rate': 1.0186335403726707e-07, 'completion_length': 42.59821701049805, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4821429252624512, 'reward_std': 0.35663338005542755, 'kl': 1.08203125, 'epoch': 4.49} 90%|████████▉ | 1446/1610 [5:15:10<19:15, 7.04s/it] 90%|████████▉ | 1447/1610 [5:15:17<19:00, 6.99s/it] {'loss': 0.0163, 'grad_norm': 4.490652678615324, 'learning_rate': 1.0124223602484472e-07, 'completion_length': 40.15178871154785, 'rewards/accuracy_reward': 0.4017857238650322, 'rewards/format_reward': 0.973214328289032, 'reward': 1.3750000596046448, 'reward_std': 0.1973949372768402, 'kl': 0.4072265625, 
'epoch': 4.49} 90%|████████▉ | 1447/1610 [5:15:17<19:00, 6.99s/it] 90%|████████▉ | 1448/1610 [5:15:28<21:26, 7.94s/it] {'loss': 0.0551, 'grad_norm': 9.532340414276117, 'learning_rate': 1.0062111801242236e-07, 'completion_length': 45.37500190734863, 'rewards/accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.5892857909202576, 'reward_std': 0.39152926206588745, 'kl': 1.376953125, 'epoch': 4.5} 90%|████████▉ | 1448/1610 [5:15:28<21:26, 7.94s/it] 90%|█████████ | 1449/1610 [5:15:35<21:03, 7.85s/it] {'loss': 0.0296, 'grad_norm': 3.0956708151888757, 'learning_rate': 1e-07, 'completion_length': 44.74107360839844, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.3571429252624512, 'reward_std': 0.2158326581120491, 'kl': 0.740234375, 'epoch': 4.5} 90%|█████████ | 1449/1610 [5:15:35<21:03, 7.85s/it] 90%|█████████ | 1450/1610 [5:15:46<23:01, 8.64s/it] {'loss': 0.0502, 'grad_norm': 6.860737252096048, 'learning_rate': 9.937888198757763e-08, 'completion_length': 44.17857360839844, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.848214328289032, 'reward': 1.3660714626312256, 'reward_std': 0.28844083845615387, 'kl': 1.255859375, 'epoch': 4.5} 90%|█████████ | 1450/1610 [5:15:46<23:01, 8.64s/it] 90%|█████████ | 1451/1610 [5:15:53<21:55, 8.28s/it] {'loss': 0.0541, 'grad_norm': 13.024936027312245, 'learning_rate': 9.875776397515527e-08, 'completion_length': 43.87500190734863, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4285715222358704, 'reward_std': 0.3422795385122299, 'kl': 1.3515625, 'epoch': 4.51} 90%|█████████ | 1451/1610 [5:15:53<21:55, 8.28s/it] 90%|█████████ | 1452/1610 [5:16:01<21:12, 8.06s/it] {'loss': 0.0666, 'grad_norm': 5.530768427109676, 'learning_rate': 9.81366459627329e-08, 'completion_length': 49.12500190734863, 'rewards/accuracy_reward': 0.2767857313156128, 'rewards/format_reward': 
0.8928571939468384, 'reward': 1.1696428656578064, 'reward_std': 0.3061111867427826, 'kl': 1.6640625, 'epoch': 4.51} 90%|█████████ | 1452/1610 [5:16:01<21:12, 8.06s/it] 90%|█████████ | 1453/1610 [5:16:08<20:13, 7.73s/it] {'loss': 0.0431, 'grad_norm': 11.4662978396553, 'learning_rate': 9.751552795031055e-08, 'completion_length': 41.00893020629883, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.3571429252624512, 'reward_std': 0.356418177485466, 'kl': 1.078125, 'epoch': 4.51} 90%|█████████ | 1453/1610 [5:16:08<20:13, 7.73s/it] 90%|█████████ | 1454/1610 [5:16:15<19:38, 7.55s/it] {'loss': 0.0245, 'grad_norm': 4.301113267724599, 'learning_rate': 9.68944099378882e-08, 'completion_length': 46.160715103149414, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.2946429252624512, 'reward_std': 0.25852909684181213, 'kl': 0.61328125, 'epoch': 4.52} 90%|█████████ | 1454/1610 [5:16:15<19:38, 7.55s/it] 90%|█████████ | 1455/1610 [5:16:22<19:22, 7.50s/it] {'loss': 0.021, 'grad_norm': 3.2579060493090557, 'learning_rate': 9.627329192546583e-08, 'completion_length': 42.19643020629883, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.5178571939468384, 'reward_std': 0.17355041205883026, 'kl': 0.5263671875, 'epoch': 4.52} 90%|█████████ | 1455/1610 [5:16:22<19:22, 7.50s/it] 90%|█████████ | 1456/1610 [5:16:29<18:54, 7.37s/it] {'loss': 0.0577, 'grad_norm': 3.891941085688682, 'learning_rate': 9.565217391304348e-08, 'completion_length': 42.30357360839844, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.8839286267757416, 'reward': 1.3839285969734192, 'reward_std': 0.2948778420686722, 'kl': 1.44140625, 'epoch': 4.52} 90%|█████████ | 1456/1610 [5:16:29<18:54, 7.37s/it] 90%|█████████ | 1457/1610 [5:16:36<18:34, 7.28s/it] {'loss': 0.0571, 'grad_norm': 5.780380764344786, 'learning_rate': 9.503105590062112e-08, 
'completion_length': 42.11607360839844, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.3035714626312256, 'reward_std': 0.30659742653369904, 'kl': 1.431640625, 'epoch': 4.52} 90%|█████████ | 1457/1610 [5:16:36<18:34, 7.28s/it] 91%|█████████ | 1458/1610 [5:16:43<18:15, 7.21s/it] {'loss': 0.041, 'grad_norm': 4.205691520386602, 'learning_rate': 9.440993788819875e-08, 'completion_length': 41.56250190734863, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9196429252624512, 'reward': 1.4196429252624512, 'reward_std': 0.3541206568479538, 'kl': 1.02734375, 'epoch': 4.53} 91%|█████████ | 1458/1610 [5:16:43<18:15, 7.21s/it] 91%|█████████ | 1459/1610 [5:16:50<18:04, 7.18s/it] {'loss': 0.0713, 'grad_norm': 6.594278812908184, 'learning_rate': 9.378881987577639e-08, 'completion_length': 41.86607360839844, 'rewards/accuracy_reward': 0.2767857313156128, 'rewards/format_reward': 0.7946428954601288, 'reward': 1.071428656578064, 'reward_std': 0.33976517617702484, 'kl': 1.783203125, 'epoch': 4.53} 91%|█████████ | 1459/1610 [5:16:50<18:04, 7.18s/it] 91%|█████████ | 1460/1610 [5:16:57<17:34, 7.03s/it] {'loss': 0.048, 'grad_norm': 5.712852491107754, 'learning_rate': 9.316770186335403e-08, 'completion_length': 44.06250190734863, 'rewards/accuracy_reward': 0.2946428656578064, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.2142857909202576, 'reward_std': 0.2383313700556755, 'kl': 1.2021484375, 'epoch': 4.53} 91%|█████████ | 1460/1610 [5:16:57<17:34, 7.03s/it] 91%|█████████ | 1461/1610 [5:17:04<17:40, 7.12s/it] {'loss': 0.038, 'grad_norm': 3.348014513577402, 'learning_rate': 9.254658385093167e-08, 'completion_length': 44.33928680419922, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4642857909202576, 'reward_std': 0.21972984075546265, 'kl': 0.9482421875, 'epoch': 4.54} 91%|█████████ | 1461/1610 [5:17:04<17:40, 7.12s/it] 91%|█████████ | 1462/1610 
[5:17:12<17:52, 7.25s/it] {'loss': 0.0349, 'grad_norm': 4.546117661800463, 'learning_rate': 9.192546583850931e-08, 'completion_length': 47.392860412597656, 'rewards/accuracy_reward': 0.3660714402794838, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.3035714626312256, 'reward_std': 0.20532545447349548, 'kl': 0.875, 'epoch': 4.54} 91%|█████████ | 1462/1610 [5:17:12<17:52, 7.25s/it] 91%|█████████ | 1463/1610 [5:17:19<17:26, 7.12s/it] {'loss': 0.0664, 'grad_norm': 5.197897939058672, 'learning_rate': 9.130434782608696e-08, 'completion_length': 37.6875, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.6517857909202576, 'reward_std': 0.3045148551464081, 'kl': 1.654296875, 'epoch': 4.54} 91%|█████████ | 1463/1610 [5:17:19<17:26, 7.12s/it] 91%|█████████ | 1464/1610 [5:17:27<17:55, 7.37s/it] {'loss': 0.0826, 'grad_norm': 6.576984693924923, 'learning_rate': 9.068322981366459e-08, 'completion_length': 36.08928871154785, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.29036155343055725, 'kl': 2.05859375, 'epoch': 4.55} 91%|█████████ | 1464/1610 [5:17:27<17:55, 7.37s/it] 91%|█████████ | 1465/1610 [5:17:35<18:40, 7.73s/it] {'loss': 0.0511, 'grad_norm': 6.876691165154465, 'learning_rate': 9.006211180124224e-08, 'completion_length': 41.83928871154785, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.892857164144516, 'reward': 1.508928656578064, 'reward_std': 0.4030574858188629, 'kl': 1.28125, 'epoch': 4.55} 91%|█████████ | 1465/1610 [5:17:35<18:40, 7.73s/it] 91%|█████████ | 1466/1610 [5:17:43<18:23, 7.66s/it] {'loss': 0.0351, 'grad_norm': 7.1970995846633, 'learning_rate': 8.944099378881988e-08, 'completion_length': 44.33928871154785, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.3660715222358704, 'reward_std': 0.24639402329921722, 'kl': 0.875, 
'epoch': 4.55} 91%|█████████ | 1466/1610 [5:17:43<18:23, 7.66s/it] 91%|█████████ | 1467/1610 [5:17:51<18:27, 7.75s/it] {'loss': 0.0664, 'grad_norm': 4.63781558966631, 'learning_rate': 8.881987577639751e-08, 'completion_length': 43.23214530944824, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.2740923762321472, 'kl': 1.662109375, 'epoch': 4.56} 91%|█████████ | 1467/1610 [5:17:51<18:27, 7.75s/it] 91%|█████████ | 1468/1610 [5:17:58<17:55, 7.57s/it] {'loss': 0.0312, 'grad_norm': 7.349360933936707, 'learning_rate': 8.819875776397515e-08, 'completion_length': 38.71428680419922, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4910714626312256, 'reward_std': 0.322754368185997, 'kl': 0.779296875, 'epoch': 4.56} 91%|█████████ | 1468/1610 [5:17:58<17:55, 7.57s/it] 91%|█████████ | 1469/1610 [5:18:05<17:33, 7.47s/it] {'loss': 0.0776, 'grad_norm': 4.995945183012538, 'learning_rate': 8.757763975155279e-08, 'completion_length': 39.1875, 'rewards/accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.508928656578064, 'reward_std': 0.3893774598836899, 'kl': 1.94140625, 'epoch': 4.56} 91%|█████████ | 1469/1610 [5:18:05<17:33, 7.47s/it] 91%|█████████▏| 1470/1610 [5:18:16<19:34, 8.39s/it] {'loss': 0.0788, 'grad_norm': 7.298483615448526, 'learning_rate': 8.695652173913042e-08, 'completion_length': 51.74107360839844, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.8303571939468384, 'reward': 1.2500000596046448, 'reward_std': 0.49990931153297424, 'kl': 1.96875, 'epoch': 4.57} 91%|█████████▏| 1470/1610 [5:18:16<19:34, 8.39s/it] 91%|█████████▏| 1471/1610 [5:18:23<18:29, 7.98s/it] {'loss': 0.0574, 'grad_norm': 14.298632603948636, 'learning_rate': 8.633540372670807e-08, 'completion_length': 42.16071701049805, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 
0.9285714626312256, 'reward': 1.5267857909202576, 'reward_std': 0.23690403252840042, 'kl': 1.43359375, 'epoch': 4.57} 91%|█████████▏| 1471/1610 [5:18:23<18:29, 7.98s/it] 91%|█████████▏| 1472/1610 [5:18:31<18:29, 8.04s/it] {'loss': 0.0891, 'grad_norm': 7.0221325466218145, 'learning_rate': 8.571428571428572e-08, 'completion_length': 42.75893020629883, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.8214285969734192, 'reward': 1.3750000596046448, 'reward_std': 0.5364250987768173, 'kl': 2.2265625, 'epoch': 4.57} 91%|█████████▏| 1472/1610 [5:18:31<18:29, 8.04s/it] 91%|█████████▏| 1473/1610 [5:18:38<17:35, 7.70s/it] {'loss': 0.0404, 'grad_norm': 3.98554668222008, 'learning_rate': 8.509316770186335e-08, 'completion_length': 41.90178680419922, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.955357164144516, 'reward': 1.5446429252624512, 'reward_std': 0.2133289948105812, 'kl': 1.009765625, 'epoch': 4.57} 91%|█████████▏| 1473/1610 [5:18:38<17:35, 7.70s/it] 92%|█████████▏| 1474/1610 [5:18:47<18:50, 8.31s/it] {'loss': 0.0835, 'grad_norm': 7.497116174364523, 'learning_rate': 8.4472049689441e-08, 'completion_length': 48.36607360839844, 'rewards/accuracy_reward': 0.401785746216774, 'rewards/format_reward': 0.8303571939468384, 'reward': 1.2321429252624512, 'reward_std': 0.4518769532442093, 'kl': 2.08203125, 'epoch': 4.58} 92%|█████████▏| 1474/1610 [5:18:47<18:50, 8.31s/it] 92%|█████████▏| 1475/1610 [5:18:55<17:53, 7.95s/it] {'loss': 0.0501, 'grad_norm': 4.0365434329665195, 'learning_rate': 8.385093167701864e-08, 'completion_length': 39.81250190734863, 'rewards/accuracy_reward': 0.6250000149011612, 'rewards/format_reward': 0.9196429252624512, 'reward': 1.5446429252624512, 'reward_std': 0.3199223056435585, 'kl': 1.251953125, 'epoch': 4.58} 92%|█████████▏| 1475/1610 [5:18:55<17:53, 7.95s/it] 92%|█████████▏| 1476/1610 [5:19:02<17:30, 7.84s/it] {'loss': 0.0316, 'grad_norm': 5.315216049884273, 'learning_rate': 8.322981366459626e-08, 
'completion_length': 37.473215103149414, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5446429252624512, 'reward_std': 0.2825503796339035, 'kl': 0.78759765625, 'epoch': 4.58} 92%|█████████▏| 1476/1610 [5:19:02<17:30, 7.84s/it] 92%|█████████▏| 1477/1610 [5:19:10<17:02, 7.69s/it] {'loss': 0.0746, 'grad_norm': 6.540637668254599, 'learning_rate': 8.26086956521739e-08, 'completion_length': 40.02678871154785, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.830357164144516, 'reward': 1.3839285969734192, 'reward_std': 0.45602038502693176, 'kl': 1.865234375, 'epoch': 4.59} 92%|█████████▏| 1477/1610 [5:19:10<17:02, 7.69s/it] 92%|█████████▏| 1478/1610 [5:19:17<16:56, 7.70s/it] {'loss': 0.0921, 'grad_norm': 7.5906490584796495, 'learning_rate': 8.198757763975155e-08, 'completion_length': 44.39285850524902, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.2767857909202576, 'reward_std': 0.48910272121429443, 'kl': 2.3046875, 'epoch': 4.59} 92%|█████████▏| 1478/1610 [5:19:17<16:56, 7.70s/it] 92%|█████████▏| 1479/1610 [5:19:25<16:32, 7.57s/it] {'loss': 0.0544, 'grad_norm': 7.189560647653164, 'learning_rate': 8.136645962732918e-08, 'completion_length': 39.73214340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4285715222358704, 'reward_std': 0.2838655114173889, 'kl': 1.3642578125, 'epoch': 4.59} 92%|█████████▏| 1479/1610 [5:19:25<16:32, 7.57s/it] 92%|█████████▏| 1480/1610 [5:19:32<16:15, 7.50s/it] {'loss': 0.0391, 'grad_norm': 4.940815043258649, 'learning_rate': 8.074534161490683e-08, 'completion_length': 43.77678871154785, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4375000596046448, 'reward_std': 0.32642027735710144, 'kl': 0.978515625, 'epoch': 4.6} 92%|█████████▏| 1480/1610 [5:19:32<16:15, 7.50s/it] 92%|█████████▏| 1481/1610 
[5:19:39<15:54, 7.40s/it] {'loss': 0.04, 'grad_norm': 8.401293270856504, 'learning_rate': 8.012422360248448e-08, 'completion_length': 43.767860412597656, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.2767857313156128, 'reward_std': 0.35070421546697617, 'kl': 1.0, 'epoch': 4.6} 92%|█████████▏| 1481/1610 [5:19:39<15:54, 7.40s/it] 92%|█████████▏| 1482/1610 [5:19:47<16:00, 7.50s/it] {'loss': 0.0643, 'grad_norm': 4.989245669727075, 'learning_rate': 7.950310559006211e-08, 'completion_length': 39.473215103149414, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.910714328289032, 'reward': 1.6250000596046448, 'reward_std': 0.2910432890057564, 'kl': 1.607421875, 'epoch': 4.6} 92%|█████████▏| 1482/1610 [5:19:47<16:00, 7.50s/it] 92%|█████████▏| 1483/1610 [5:19:54<15:32, 7.34s/it] {'loss': 0.0502, 'grad_norm': 4.855023799015556, 'learning_rate': 7.888198757763975e-08, 'completion_length': 43.52678680419922, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9196428656578064, 'reward': 1.4017857909202576, 'reward_std': 0.22990670055150986, 'kl': 1.25634765625, 'epoch': 4.61} 92%|█████████▏| 1483/1610 [5:19:54<15:32, 7.34s/it] 92%|█████████▏| 1484/1610 [5:20:01<15:17, 7.28s/it] {'loss': 0.0535, 'grad_norm': 4.480829050935437, 'learning_rate': 7.82608695652174e-08, 'completion_length': 44.55357360839844, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.3392857313156128, 'reward_std': 0.35198988020420074, 'kl': 1.34375, 'epoch': 4.61} 92%|█████████▏| 1484/1610 [5:20:01<15:17, 7.28s/it] 92%|█████████▏| 1485/1610 [5:20:08<15:17, 7.34s/it] {'loss': 0.1208, 'grad_norm': 20.377974129851623, 'learning_rate': 7.763975155279502e-08, 'completion_length': 46.04464530944824, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.8392857313156128, 'reward': 1.3214285969734192, 'reward_std': 0.3778505325317383, 'kl': 
3.015625, 'epoch': 4.61} 92%|█████████▏| 1485/1610 [5:20:08<15:17, 7.34s/it] 92%|█████████▏| 1486/1610 [5:20:16<15:13, 7.37s/it] {'loss': 0.0324, 'grad_norm': 4.319700855283583, 'learning_rate': 7.701863354037266e-08, 'completion_length': 41.71428871154785, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.1875000596046448, 'reward_std': 0.25977765023708344, 'kl': 0.80859375, 'epoch': 4.61} 92%|█████████▏| 1486/1610 [5:20:16<15:13, 7.37s/it] 92%|█████████▏| 1487/1610 [5:20:23<15:15, 7.44s/it] {'loss': 0.0289, 'grad_norm': 3.1899294511781373, 'learning_rate': 7.639751552795031e-08, 'completion_length': 42.178571701049805, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.348214328289032, 'reward_std': 0.1827620565891266, 'kl': 0.72265625, 'epoch': 4.62} 92%|█████████▏| 1487/1610 [5:20:23<15:15, 7.44s/it] 92%|█████████▏| 1488/1610 [5:20:33<16:45, 8.24s/it] {'loss': 0.0533, 'grad_norm': 5.404107349355144, 'learning_rate': 7.577639751552794e-08, 'completion_length': 49.06250190734863, 'rewards/accuracy_reward': 0.5625000149011612, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.4642857909202576, 'reward_std': 0.22936688363552094, 'kl': 1.333984375, 'epoch': 4.62} 92%|█████████▏| 1488/1610 [5:20:34<16:45, 8.24s/it] 92%|█████████▏| 1489/1610 [5:20:40<15:47, 7.83s/it] {'loss': 0.0393, 'grad_norm': 3.8343392738293485, 'learning_rate': 7.515527950310559e-08, 'completion_length': 37.75893020629883, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4642857909202576, 'reward_std': 0.16323687136173248, 'kl': 0.984375, 'epoch': 4.62} 92%|█████████▏| 1489/1610 [5:20:40<15:47, 7.83s/it] 93%|█████████▎| 1490/1610 [5:20:47<15:05, 7.55s/it] {'loss': 0.0291, 'grad_norm': 4.994576718672895, 'learning_rate': 7.453416149068323e-08, 'completion_length': 39.90178871154785, 'rewards/accuracy_reward': 0.4910714477300644, 
'rewards/format_reward': 0.9196428954601288, 'reward': 1.410714328289032, 'reward_std': 0.2344820573925972, 'kl': 0.7265625, 'epoch': 4.63} 93%|█████████▎| 1490/1610 [5:20:47<15:05, 7.55s/it] 93%|█████████▎| 1491/1610 [5:20:54<14:25, 7.27s/it] {'loss': 0.0447, 'grad_norm': 3.8872400552359005, 'learning_rate': 7.391304347826087e-08, 'completion_length': 44.47321701049805, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6428572535514832, 'reward_std': 0.2962091565132141, 'kl': 1.1171875, 'epoch': 4.63} 93%|█████████▎| 1491/1610 [5:20:54<14:25, 7.27s/it] 93%|█████████▎| 1492/1610 [5:21:01<14:14, 7.24s/it] {'loss': 0.0557, 'grad_norm': 6.688465770721729, 'learning_rate': 7.329192546583851e-08, 'completion_length': 38.21428680419922, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.508928656578064, 'reward_std': 0.38019727170467377, 'kl': 1.39453125, 'epoch': 4.63} 93%|█████████▎| 1492/1610 [5:21:01<14:14, 7.24s/it] 93%|█████████▎| 1493/1610 [5:21:09<14:18, 7.34s/it] {'loss': 0.0318, 'grad_norm': 10.856322743331788, 'learning_rate': 7.267080745341616e-08, 'completion_length': 41.01785850524902, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.535714328289032, 'reward_std': 0.3683473616838455, 'kl': 0.7939453125, 'epoch': 4.64} 93%|█████████▎| 1493/1610 [5:21:09<14:18, 7.34s/it] 93%|█████████▎| 1494/1610 [5:21:16<14:05, 7.29s/it] {'loss': 0.026, 'grad_norm': 3.1889487296849643, 'learning_rate': 7.204968944099378e-08, 'completion_length': 39.95535850524902, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.4642857909202576, 'reward_std': 0.18276765942573547, 'kl': 0.6494140625, 'epoch': 4.64} 93%|█████████▎| 1494/1610 [5:21:16<14:05, 7.29s/it] 93%|█████████▎| 1495/1610 [5:21:23<13:57, 7.28s/it] {'loss': 0.0887, 'grad_norm': 8.362119379616576, 
'learning_rate': 7.142857142857142e-08, 'completion_length': 40.09821701049805, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.410714328289032, 'reward_std': 0.3997337967157364, 'kl': 2.22265625, 'epoch': 4.64} 93%|█████████▎| 1495/1610 [5:21:23<13:57, 7.28s/it] 93%|█████████▎| 1496/1610 [5:21:30<13:54, 7.32s/it] {'loss': 0.0451, 'grad_norm': 5.303481797366078, 'learning_rate': 7.080745341614907e-08, 'completion_length': 40.40178871154785, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.4375001192092896, 'reward_std': 0.2808719426393509, 'kl': 1.126953125, 'epoch': 4.65} 93%|█████████▎| 1496/1610 [5:21:30<13:54, 7.32s/it] 93%|█████████▎| 1497/1610 [5:21:38<13:51, 7.36s/it] {'loss': 0.0326, 'grad_norm': 3.7688606152690154, 'learning_rate': 7.01863354037267e-08, 'completion_length': 41.26785850524902, 'rewards/accuracy_reward': 0.5267857611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.473214328289032, 'reward_std': 0.2938939183950424, 'kl': 0.81640625, 'epoch': 4.65} 93%|█████████▎| 1497/1610 [5:21:38<13:51, 7.36s/it] 93%|█████████▎| 1498/1610 [5:21:45<13:36, 7.29s/it] {'loss': 0.0558, 'grad_norm': 4.123321205011006, 'learning_rate': 6.956521739130435e-08, 'completion_length': 42.33928680419922, 'rewards/accuracy_reward': 0.4553571566939354, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3660714626312256, 'reward_std': 0.2865186333656311, 'kl': 1.39453125, 'epoch': 4.65} 93%|█████████▎| 1498/1610 [5:21:45<13:36, 7.29s/it] 93%|█████████▎| 1499/1610 [5:21:54<14:31, 7.85s/it] {'loss': 0.0665, 'grad_norm': 5.500198623431777, 'learning_rate': 6.894409937888199e-08, 'completion_length': 48.15178680419922, 'rewards/accuracy_reward': 0.4196428656578064, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.3125000596046448, 'reward_std': 0.3416806310415268, 'kl': 1.6611328125, 'epoch': 4.66} 93%|█████████▎| 1499/1610 [5:21:54<14:31, 
7.85s/it] 93%|█████████▎| 1500/1610 [5:22:02<14:20, 7.82s/it] {'loss': 0.058, 'grad_norm': 6.472585377187822, 'learning_rate': 6.832298136645963e-08, 'completion_length': 42.54464530944824, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.8571429252624512, 'reward': 1.2678571939468384, 'reward_std': 0.42827680706977844, 'kl': 1.44921875, 'epoch': 4.66} 93%|█████████▎| 1500/1610 [5:22:02<14:20, 7.82s/it] 93%|█████████▎| 1501/1610 [5:22:56<39:26, 21.71s/it] {'loss': 0.0375, 'grad_norm': 4.527929657044937, 'learning_rate': 6.770186335403727e-08, 'completion_length': 38.31250190734863, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.955357164144516, 'reward': 1.4910715222358704, 'reward_std': 0.23057925701141357, 'kl': 0.9375, 'epoch': 4.66} 93%|█████████▎| 1501/1610 [5:22:56<39:26, 21.71s/it] 93%|█████████▎| 1502/1610 [5:23:04<31:43, 17.62s/it] {'loss': 0.0566, 'grad_norm': 7.491581701822286, 'learning_rate': 6.708074534161489e-08, 'completion_length': 42.50000190734863, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.285714328289032, 'reward_std': 0.26940906047821045, 'kl': 1.41796875, 'epoch': 4.66} 93%|█████████▎| 1502/1610 [5:23:04<31:43, 17.62s/it] 93%|█████████▎| 1503/1610 [5:23:12<26:02, 14.60s/it] {'loss': 0.052, 'grad_norm': 3.7903644441781386, 'learning_rate': 6.645962732919254e-08, 'completion_length': 44.92857360839844, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.19811318069696426, 'kl': 1.296875, 'epoch': 4.67} 93%|█████████▎| 1503/1610 [5:23:12<26:02, 14.60s/it] 93%|█████████▎| 1504/1610 [5:23:22<23:32, 13.32s/it] {'loss': 0.0761, 'grad_norm': 6.271050027614355, 'learning_rate': 6.583850931677018e-08, 'completion_length': 50.142860412597656, 'rewards/accuracy_reward': 0.3303571715950966, 'rewards/format_reward': 0.8125000596046448, 'reward': 
1.1428571939468384, 'reward_std': 0.34411102533340454, 'kl': 1.90234375, 'epoch': 4.67} 93%|█████████▎| 1504/1610 [5:23:22<23:32, 13.32s/it] 93%|█████████▎| 1505/1610 [5:23:29<19:54, 11.38s/it] {'loss': 0.0161, 'grad_norm': 3.3890584987885153, 'learning_rate': 6.521739130434782e-08, 'completion_length': 36.10714530944824, 'rewards/accuracy_reward': 0.6071428805589676, 'rewards/format_reward': 0.973214328289032, 'reward': 1.5803571939468384, 'reward_std': 0.19178562611341476, 'kl': 0.4013671875, 'epoch': 4.67} 93%|█████████▎| 1505/1610 [5:23:29<19:54, 11.38s/it] 94%|█████████▎| 1506/1610 [5:23:36<17:44, 10.23s/it] {'loss': 0.0976, 'grad_norm': 10.595031858524328, 'learning_rate': 6.459627329192546e-08, 'completion_length': 39.27678680419922, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.3035714626312256, 'reward_std': 0.4892376661300659, 'kl': 2.44140625, 'epoch': 4.68} 94%|█████████▎| 1506/1610 [5:23:36<17:44, 10.23s/it] 94%|█████████▎| 1507/1610 [5:23:44<16:03, 9.36s/it] {'loss': 0.0619, 'grad_norm': 5.763035068888961, 'learning_rate': 6.397515527950311e-08, 'completion_length': 43.56250190734863, 'rewards/accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5803572535514832, 'reward_std': 0.31224703788757324, 'kl': 1.548828125, 'epoch': 4.68} 94%|█████████▎| 1507/1610 [5:23:44<16:03, 9.36s/it] 94%|█████████▎| 1508/1610 [5:23:51<14:39, 8.62s/it] {'loss': 0.0372, 'grad_norm': 4.009139681616405, 'learning_rate': 6.335403726708074e-08, 'completion_length': 38.25000190734863, 'rewards/accuracy_reward': 0.6071428805589676, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.5267857909202576, 'reward_std': 0.2410864681005478, 'kl': 0.927734375, 'epoch': 4.68} 94%|█████████▎| 1508/1610 [5:23:51<14:39, 8.62s/it] 94%|█████████▎| 1509/1610 [5:23:57<13:35, 8.08s/it] {'loss': 0.0266, 'grad_norm': 5.1460045832788195, 'learning_rate': 6.273291925465838e-08, 
'completion_length': 39.91071701049805, 'rewards/accuracy_reward': 0.4553571790456772, 'rewards/format_reward': 0.955357164144516, 'reward': 1.410714328289032, 'reward_std': 0.2570082098245621, 'kl': 0.6650390625, 'epoch': 4.69} 94%|█████████▎| 1509/1610 [5:23:57<13:35, 8.08s/it] 94%|█████████▍| 1510/1610 [5:24:04<12:47, 7.67s/it] {'loss': 0.0768, 'grad_norm': 6.745736432761188, 'learning_rate': 6.211180124223602e-08, 'completion_length': 41.517860412597656, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.830357164144516, 'reward': 1.3750000596046448, 'reward_std': 0.587340772151947, 'kl': 1.91796875, 'epoch': 4.69} 94%|█████████▍| 1510/1610 [5:24:04<12:47, 7.67s/it] 94%|█████████▍| 1511/1610 [5:24:11<12:09, 7.36s/it] {'loss': 0.0374, 'grad_norm': 4.278857768387558, 'learning_rate': 6.149068322981366e-08, 'completion_length': 41.28571701049805, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.473214328289032, 'reward_std': 0.2702430784702301, 'kl': 0.9375, 'epoch': 4.69} 94%|█████████▍| 1511/1610 [5:24:11<12:09, 7.36s/it] 94%|█████████▍| 1512/1610 [5:24:18<11:56, 7.31s/it] {'loss': 0.0618, 'grad_norm': 4.6699069256907615, 'learning_rate': 6.086956521739131e-08, 'completion_length': 41.61607360839844, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.3660715222358704, 'reward_std': 0.29928920418024063, 'kl': 1.544921875, 'epoch': 4.7} 94%|█████████▍| 1512/1610 [5:24:18<11:56, 7.31s/it] 94%|█████████▍| 1513/1610 [5:24:25<11:47, 7.29s/it] {'loss': 0.04, 'grad_norm': 4.749337971869439, 'learning_rate': 6.024844720496894e-08, 'completion_length': 38.70535850524902, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3571429252624512, 'reward_std': 0.23817552626132965, 'kl': 1.001953125, 'epoch': 4.7} 94%|█████████▍| 1513/1610 [5:24:25<11:47, 7.29s/it] 94%|█████████▍| 1514/1610 
[5:24:32<11:34, 7.24s/it] {'loss': 0.103, 'grad_norm': 13.950262874398765, 'learning_rate': 5.962732919254657e-08, 'completion_length': 42.83035850524902, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.2946429252624512, 'reward_std': 0.26522670686244965, 'kl': 2.578125, 'epoch': 4.7} 94%|█████████▍| 1514/1610 [5:24:32<11:34, 7.24s/it] 94%|█████████▍| 1515/1610 [5:24:40<11:30, 7.27s/it] {'loss': 0.0425, 'grad_norm': 3.0466446099692632, 'learning_rate': 5.900621118012422e-08, 'completion_length': 41.91964530944824, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4017857909202576, 'reward_std': 0.2320038080215454, 'kl': 1.064453125, 'epoch': 4.7} 94%|█████████▍| 1515/1610 [5:24:40<11:30, 7.27s/it] 94%|█████████▍| 1516/1610 [5:24:48<11:51, 7.57s/it] {'loss': 0.0493, 'grad_norm': 5.204788590441447, 'learning_rate': 5.8385093167701866e-08, 'completion_length': 42.580360412597656, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.910714328289032, 'reward': 1.410714328289032, 'reward_std': 0.2907600849866867, 'kl': 1.234375, 'epoch': 4.71} 94%|█████████▍| 1516/1610 [5:24:48<11:51, 7.57s/it] 94%|█████████▍| 1517/1610 [5:24:55<11:34, 7.47s/it] {'loss': 0.0262, 'grad_norm': 3.118414295699997, 'learning_rate': 5.77639751552795e-08, 'completion_length': 42.37500190734863, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.446428656578064, 'reward_std': 0.26591917872428894, 'kl': 0.6552734375, 'epoch': 4.71} 94%|█████████▍| 1517/1610 [5:24:55<11:34, 7.47s/it] 94%|█████████▍| 1518/1610 [5:25:03<11:36, 7.57s/it] {'loss': 0.0939, 'grad_norm': 9.617636044351629, 'learning_rate': 5.714285714285714e-08, 'completion_length': 41.26785850524902, 'rewards/accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.4196429252624512, 'reward_std': 0.25048841536045074, 
'kl': 2.33984375, 'epoch': 4.71} 94%|█████████▍| 1518/1610 [5:25:03<11:36, 7.57s/it] 94%|█████████▍| 1519/1610 [5:25:11<11:26, 7.55s/it] {'loss': 0.0744, 'grad_norm': 7.617927489203349, 'learning_rate': 5.6521739130434777e-08, 'completion_length': 42.03571701049805, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.910714328289032, 'reward': 1.2857143878936768, 'reward_std': 0.2065005898475647, 'kl': 1.859375, 'epoch': 4.72} 94%|█████████▍| 1519/1610 [5:25:11<11:26, 7.55s/it] 94%|█████████▍| 1520/1610 [5:25:20<12:18, 8.21s/it] {'loss': 0.0903, 'grad_norm': 5.277145801289365, 'learning_rate': 5.590062111801242e-08, 'completion_length': 51.22321701049805, 'rewards/accuracy_reward': 0.508928582072258, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.3660715222358704, 'reward_std': 0.2663346827030182, 'kl': 2.25, 'epoch': 4.72} 94%|█████████▍| 1520/1610 [5:25:20<12:18, 8.21s/it] 94%|█████████▍| 1521/1610 [5:25:28<11:44, 7.91s/it] {'loss': 0.0354, 'grad_norm': 3.9865705187980014, 'learning_rate': 5.527950310559006e-08, 'completion_length': 39.55357360839844, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9464286267757416, 'reward': 1.5357143878936768, 'reward_std': 0.17226043343544006, 'kl': 0.8857421875, 'epoch': 4.72} 94%|█████████▍| 1521/1610 [5:25:28<11:44, 7.91s/it] 95%|█████████▍| 1522/1610 [5:25:34<11:01, 7.52s/it] {'loss': 0.0128, 'grad_norm': 3.4323344800353683, 'learning_rate': 5.4658385093167694e-08, 'completion_length': 38.02678680419922, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.973214328289032, 'reward': 1.5892857909202576, 'reward_std': 0.1963018700480461, 'kl': 0.3203125, 'epoch': 4.73} 95%|█████████▍| 1522/1610 [5:25:34<11:01, 7.52s/it] 95%|█████████▍| 1523/1610 [5:25:41<10:47, 7.44s/it] {'loss': 0.0692, 'grad_norm': 4.8653197788245555, 'learning_rate': 5.403726708074534e-08, 'completion_length': 42.21428871154785, 'rewards/accuracy_reward': 0.4821428656578064, 
'rewards/format_reward': 0.8750000596046448, 'reward': 1.3571429252624512, 'reward_std': 0.2876489609479904, 'kl': 1.73046875, 'epoch': 4.73} 95%|█████████▍| 1523/1610 [5:25:41<10:47, 7.44s/it] 95%|█████████▍| 1524/1610 [5:25:49<10:33, 7.37s/it] {'loss': 0.094, 'grad_norm': 8.02379705985569, 'learning_rate': 5.341614906832298e-08, 'completion_length': 40.26785850524902, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.8303571939468384, 'reward': 1.2500000596046448, 'reward_std': 0.3796628415584564, 'kl': 2.34765625, 'epoch': 4.73} 95%|█████████▍| 1524/1610 [5:25:49<10:33, 7.37s/it] 95%|█████████▍| 1525/1610 [5:25:55<10:09, 7.17s/it] {'loss': 0.0369, 'grad_norm': 3.79968772766901, 'learning_rate': 5.279503105590062e-08, 'completion_length': 36.875, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9464286267757416, 'reward': 1.285714328289032, 'reward_std': 0.22626957297325134, 'kl': 0.9228515625, 'epoch': 4.74} 95%|█████████▍| 1525/1610 [5:25:55<10:09, 7.17s/it] 95%|█████████▍| 1526/1610 [5:26:03<10:12, 7.29s/it] {'loss': 0.0653, 'grad_norm': 5.151816374255565, 'learning_rate': 5.217391304347826e-08, 'completion_length': 41.55357360839844, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.910714328289032, 'reward': 1.508928656578064, 'reward_std': 0.3018244355916977, 'kl': 1.6328125, 'epoch': 4.74} 95%|█████████▍| 1526/1610 [5:26:03<10:12, 7.29s/it] 95%|█████████▍| 1527/1610 [5:26:13<11:13, 8.11s/it] {'loss': 0.0747, 'grad_norm': 6.665381265867258, 'learning_rate': 5.1552795031055897e-08, 'completion_length': 47.55357360839844, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.348214328289032, 'reward_std': 0.41933226585388184, 'kl': 1.869140625, 'epoch': 4.74} 95%|█████████▍| 1527/1610 [5:26:13<11:13, 8.11s/it] 95%|█████████▍| 1528/1610 [5:26:21<11:03, 8.10s/it] {'loss': 0.0767, 'grad_norm': 6.0309695275947774, 'learning_rate': 
5.0931677018633536e-08, 'completion_length': 40.910715103149414, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.3750000596046448, 'reward_std': 0.5494644343852997, 'kl': 1.9140625, 'epoch': 4.75} 95%|█████████▍| 1528/1610 [5:26:21<11:03, 8.10s/it] 95%|█████████▍| 1529/1610 [5:26:28<10:28, 7.76s/it] {'loss': 0.0204, 'grad_norm': 6.444423208078394, 'learning_rate': 5.031055900621118e-08, 'completion_length': 36.50893020629883, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.22936688363552094, 'kl': 0.5107421875, 'epoch': 4.75} 95%|█████████▍| 1529/1610 [5:26:28<10:28, 7.76s/it] 95%|█████████▌| 1530/1610 [5:26:35<10:05, 7.57s/it] {'loss': 0.043, 'grad_norm': 5.066447292141788, 'learning_rate': 4.9689440993788814e-08, 'completion_length': 42.07143211364746, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.4285714626312256, 'reward_std': 0.19057324528694153, 'kl': 1.072265625, 'epoch': 4.75} 95%|█████████▌| 1530/1610 [5:26:35<10:05, 7.57s/it] 95%|█████████▌| 1531/1610 [5:26:43<09:59, 7.59s/it] {'loss': 0.0412, 'grad_norm': 5.553377246069584, 'learning_rate': 4.906832298136645e-08, 'completion_length': 39.491071701049805, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.8839286267757416, 'reward': 1.3303571939468384, 'reward_std': 0.3705857843160629, 'kl': 1.029296875, 'epoch': 4.75} 95%|█████████▌| 1531/1610 [5:26:43<09:59, 7.59s/it] 95%|█████████▌| 1532/1610 [5:26:50<09:41, 7.46s/it] {'loss': 0.0451, 'grad_norm': 4.362977498880168, 'learning_rate': 4.84472049689441e-08, 'completion_length': 41.82143020629883, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.4017857909202576, 'reward_std': 0.20411306619644165, 'kl': 1.126953125, 'epoch': 4.76} 95%|█████████▌| 1532/1610 [5:26:50<09:41, 7.46s/it] 
95%|█████████▌| 1533/1610 [5:26:57<09:18, 7.25s/it] {'loss': 0.0261, 'grad_norm': 4.157893656951718, 'learning_rate': 4.782608695652174e-08, 'completion_length': 43.03571701049805, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.4910714626312256, 'reward_std': 0.17104806005954742, 'kl': 0.65234375, 'epoch': 4.76} 95%|█████████▌| 1533/1610 [5:26:57<09:18, 7.25s/it] 95%|█████████▌| 1534/1610 [5:27:05<09:37, 7.59s/it] {'loss': 0.0716, 'grad_norm': 8.27504485995726, 'learning_rate': 4.720496894409938e-08, 'completion_length': 41.02678871154785, 'rewards/accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.2678571939468384, 'reward_std': 0.353810116648674, 'kl': 1.78515625, 'epoch': 4.76} 95%|█████████▌| 1534/1610 [5:27:05<09:37, 7.59s/it] 95%|█████████▌| 1535/1610 [5:27:13<09:31, 7.62s/it] {'loss': 0.0314, 'grad_norm': 7.1171149028885425, 'learning_rate': 4.6583850931677016e-08, 'completion_length': 39.57143020629883, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9375000596046448, 'reward': 1.4910715222358704, 'reward_std': 0.22049059718847275, 'kl': 0.78515625, 'epoch': 4.77} 95%|█████████▌| 1535/1610 [5:27:13<09:31, 7.62s/it] 95%|█████████▌| 1536/1610 [5:27:20<09:27, 7.68s/it] {'loss': 0.1057, 'grad_norm': 5.867086278613332, 'learning_rate': 4.5962732919254656e-08, 'completion_length': 43.12500190734863, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.2678571939468384, 'reward_std': 0.3724707067012787, 'kl': 2.6328125, 'epoch': 4.77} 95%|█████████▌| 1536/1610 [5:27:20<09:27, 7.68s/it] 95%|█████████▌| 1537/1610 [5:27:28<09:09, 7.53s/it] {'loss': 0.0442, 'grad_norm': 5.986862998471629, 'learning_rate': 4.5341614906832295e-08, 'completion_length': 42.49107360839844, 'rewards/accuracy_reward': 0.3035714402794838, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.223214328289032, 
'reward_std': 0.3439870774745941, 'kl': 1.103515625, 'epoch': 4.77} 95%|█████████▌| 1537/1610 [5:27:28<09:09, 7.53s/it] 96%|█████████▌| 1538/1610 [5:27:35<08:49, 7.35s/it] {'loss': 0.0254, 'grad_norm': 3.189806783221097, 'learning_rate': 4.472049689440994e-08, 'completion_length': 35.89285850524902, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.5535715222358704, 'reward_std': 0.13408026099205017, 'kl': 0.63671875, 'epoch': 4.78} 96%|█████████▌| 1538/1610 [5:27:35<08:49, 7.35s/it] 96%|█████████▌| 1539/1610 [5:27:41<08:26, 7.14s/it] {'loss': 0.0985, 'grad_norm': 7.879519955561625, 'learning_rate': 4.409937888198757e-08, 'completion_length': 38.660715103149414, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.3303571939468384, 'reward_std': 0.2863853722810745, 'kl': 2.4609375, 'epoch': 4.78} 96%|█████████▌| 1539/1610 [5:27:41<08:26, 7.14s/it] 96%|█████████▌| 1540/1610 [5:27:48<08:12, 7.03s/it] {'loss': 0.064, 'grad_norm': 4.973405516556921, 'learning_rate': 4.347826086956521e-08, 'completion_length': 42.94643020629883, 'rewards/accuracy_reward': 0.473214328289032, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.3928571939468384, 'reward_std': 0.3441730886697769, 'kl': 1.6015625, 'epoch': 4.78} 96%|█████████▌| 1540/1610 [5:27:48<08:12, 7.03s/it] 96%|█████████▌| 1541/1610 [5:27:55<07:57, 6.92s/it] {'loss': 0.0287, 'grad_norm': 8.160213458665265, 'learning_rate': 4.285714285714286e-08, 'completion_length': 40.45535850524902, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.3303571939468384, 'reward_std': 0.351686954498291, 'kl': 0.7197265625, 'epoch': 4.79} 96%|█████████▌| 1541/1610 [5:27:55<07:57, 6.92s/it] 96%|█████████▌| 1542/1610 [5:28:02<07:53, 6.96s/it] {'loss': 0.0468, 'grad_norm': 5.022197365531575, 'learning_rate': 4.22360248447205e-08, 'completion_length': 43.33928871154785, 
'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.2321429252624512, 'reward_std': 0.2404673919081688, 'kl': 1.1669921875, 'epoch': 4.79} 96%|█████████▌| 1542/1610 [5:28:02<07:53, 6.96s/it] 96%|█████████▌| 1543/1610 [5:28:09<07:49, 7.00s/it] {'loss': 0.0824, 'grad_norm': 8.067726422082368, 'learning_rate': 4.161490683229813e-08, 'completion_length': 41.58928680419922, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.7767857611179352, 'reward': 1.2053571939468384, 'reward_std': 0.43390046060085297, 'kl': 2.0546875, 'epoch': 4.79} 96%|█████████▌| 1543/1610 [5:28:09<07:49, 7.00s/it] 96%|█████████▌| 1544/1610 [5:28:16<07:41, 6.99s/it] {'loss': 0.0228, 'grad_norm': 4.612850820637881, 'learning_rate': 4.0993788819875776e-08, 'completion_length': 41.76785850524902, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.571428656578064, 'reward_std': 0.2262553796172142, 'kl': 0.5693359375, 'epoch': 4.8} 96%|█████████▌| 1544/1610 [5:28:16<07:41, 6.99s/it] 96%|█████████▌| 1545/1610 [5:28:23<07:44, 7.15s/it] {'loss': 0.0575, 'grad_norm': 5.958782554184991, 'learning_rate': 4.0372670807453415e-08, 'completion_length': 43.50000190734863, 'rewards/accuracy_reward': 0.589285746216774, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5000000596046448, 'reward_std': 0.32976867258548737, 'kl': 1.4375, 'epoch': 4.8} 96%|█████████▌| 1545/1610 [5:28:23<07:44, 7.15s/it] 96%|█████████▌| 1546/1610 [5:28:30<07:35, 7.12s/it] {'loss': 0.0543, 'grad_norm': 4.503495979964896, 'learning_rate': 3.9751552795031054e-08, 'completion_length': 43.535715103149414, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9017857313156128, 'reward': 1.544642984867096, 'reward_std': 0.096499003469944, 'kl': 1.3642578125, 'epoch': 4.8} 96%|█████████▌| 1546/1610 [5:28:30<07:35, 7.12s/it] 96%|█████████▌| 1547/1610 [5:28:37<07:21, 7.01s/it] {'loss': 0.0378, 
'grad_norm': 4.580697850631515, 'learning_rate': 3.91304347826087e-08, 'completion_length': 40.71428680419922, 'rewards/accuracy_reward': 0.348214291036129, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2946429252624512, 'reward_std': 0.2275410294532776, 'kl': 0.9453125, 'epoch': 4.8} 96%|█████████▌| 1547/1610 [5:28:37<07:21, 7.01s/it] 96%|█████████▌| 1548/1610 [5:28:44<07:05, 6.87s/it] {'loss': 0.011, 'grad_norm': 3.9170531630756487, 'learning_rate': 3.850931677018633e-08, 'completion_length': 42.88393020629883, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.446428656578064, 'reward_std': 0.21703943610191345, 'kl': 0.2744140625, 'epoch': 4.81} 96%|█████████▌| 1548/1610 [5:28:44<07:05, 6.87s/it] 96%|█████████▌| 1549/1610 [5:28:50<06:54, 6.79s/it] {'loss': 0.0606, 'grad_norm': 7.315048027300612, 'learning_rate': 3.788819875776397e-08, 'completion_length': 37.08928680419922, 'rewards/accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3482143878936768, 'reward_std': 0.24229323863983154, 'kl': 1.515625, 'epoch': 4.81} 96%|█████████▌| 1549/1610 [5:28:50<06:54, 6.79s/it] 96%|█████████▋| 1550/1610 [5:28:58<06:59, 7.00s/it] {'loss': 0.0409, 'grad_norm': 7.037728756093704, 'learning_rate': 3.726708074534162e-08, 'completion_length': 42.01785850524902, 'rewards/accuracy_reward': 0.196428582072258, 'rewards/format_reward': 0.910714328289032, 'reward': 1.1071429252624512, 'reward_std': 0.24680949747562408, 'kl': 1.025390625, 'epoch': 4.81} 96%|█████████▋| 1550/1610 [5:28:58<06:59, 7.00s/it] 96%|█████████▋| 1551/1610 [5:29:08<07:55, 8.05s/it] {'loss': 0.0415, 'grad_norm': 4.320163930539939, 'learning_rate': 3.6645962732919256e-08, 'completion_length': 49.86607360839844, 'rewards/accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.5178572535514832, 'reward_std': 0.23321618139743805, 'kl': 1.037109375, 'epoch': 4.82} 
96%|█████████▋| 1551/1610 [5:29:08<07:55, 8.05s/it] 96%|█████████▋| 1552/1610 [5:29:16<07:35, 7.86s/it] {'loss': 0.0533, 'grad_norm': 6.00408620084741, 'learning_rate': 3.602484472049689e-08, 'completion_length': 43.83035850524902, 'rewards/accuracy_reward': 0.160714291036129, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.0357142984867096, 'reward_std': 0.25791001319885254, 'kl': 1.333984375, 'epoch': 4.82} 96%|█████████▋| 1552/1610 [5:29:16<07:35, 7.86s/it] 96%|█████████▋| 1553/1610 [5:29:22<07:09, 7.54s/it] {'loss': 0.0374, 'grad_norm': 3.2916933582501575, 'learning_rate': 3.5403726708074535e-08, 'completion_length': 38.56250190734863, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5803572535514832, 'reward_std': 0.1923990696668625, 'kl': 0.9375, 'epoch': 4.82} 96%|█████████▋| 1553/1610 [5:29:22<07:09, 7.54s/it] 97%|█████████▋| 1554/1610 [5:29:29<06:46, 7.26s/it] {'loss': 0.0286, 'grad_norm': 2.435836946199543, 'learning_rate': 3.4782608695652174e-08, 'completion_length': 37.83035850524902, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.955357164144516, 'reward': 1.5892857909202576, 'reward_std': 0.17355040460824966, 'kl': 0.71484375, 'epoch': 4.83} 97%|█████████▋| 1554/1610 [5:29:29<06:46, 7.26s/it] 97%|█████████▋| 1555/1610 [5:29:37<06:50, 7.47s/it] {'loss': 0.0489, 'grad_norm': 6.263119398850902, 'learning_rate': 3.416149068322981e-08, 'completion_length': 41.205360412597656, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.848214328289032, 'reward': 1.285714328289032, 'reward_std': 0.2778605967760086, 'kl': 1.22265625, 'epoch': 4.83} 97%|█████████▋| 1555/1610 [5:29:37<06:50, 7.47s/it] 97%|█████████▋| 1556/1610 [5:29:44<06:42, 7.45s/it] {'loss': 0.0798, 'grad_norm': 7.595653585961243, 'learning_rate': 3.3540372670807445e-08, 'completion_length': 41.77678680419922, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 
0.8660714626312256, 'reward': 1.1517857909202576, 'reward_std': 0.3251933455467224, 'kl': 1.9921875, 'epoch': 4.83} 97%|█████████▋| 1556/1610 [5:29:44<06:42, 7.45s/it] 97%|█████████▋| 1557/1610 [5:29:51<06:27, 7.31s/it] {'loss': 0.0481, 'grad_norm': 5.7188760058603, 'learning_rate': 3.291925465838509e-08, 'completion_length': 36.410715103149414, 'rewards/accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.928571492433548, 'reward': 1.5625001192092896, 'reward_std': 0.33357347175478935, 'kl': 1.201171875, 'epoch': 4.84} 97%|█████████▋| 1557/1610 [5:29:51<06:27, 7.31s/it] 97%|█████████▋| 1558/1610 [5:29:58<06:11, 7.14s/it] {'loss': 0.0179, 'grad_norm': 3.442230858158753, 'learning_rate': 3.229813664596273e-08, 'completion_length': 39.98214340209961, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9732142984867096, 'reward': 1.4375000596046448, 'reward_std': 0.13671719655394554, 'kl': 0.44970703125, 'epoch': 4.84} 97%|█████████▋| 1558/1610 [5:29:58<06:11, 7.14s/it] 97%|█████████▋| 1559/1610 [5:30:05<06:04, 7.14s/it] {'loss': 0.0534, 'grad_norm': 5.007246982371467, 'learning_rate': 3.167701863354037e-08, 'completion_length': 40.428571701049805, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.4375000596046448, 'reward_std': 0.25048841536045074, 'kl': 1.333984375, 'epoch': 4.84} 97%|█████████▋| 1559/1610 [5:30:05<06:04, 7.14s/it] 97%|█████████▋| 1560/1610 [5:30:12<05:57, 7.14s/it] {'loss': 0.0922, 'grad_norm': 6.4862061630922305, 'learning_rate': 3.105590062111801e-08, 'completion_length': 36.29464340209961, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.4285715222358704, 'reward_std': 0.3603498488664627, 'kl': 2.3046875, 'epoch': 4.84} 97%|█████████▋| 1560/1610 [5:30:12<05:57, 7.14s/it] 97%|█████████▋| 1561/1610 [5:30:20<05:50, 7.15s/it] {'loss': 0.0419, 'grad_norm': 7.013446556009237, 'learning_rate': 
3.0434782608695655e-08, 'completion_length': 38.96428871154785, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3571428656578064, 'reward_std': 0.22754664719104767, 'kl': 1.044921875, 'epoch': 4.85} 97%|█████████▋| 1561/1610 [5:30:20<05:50, 7.15s/it] 97%|█████████▋| 1562/1610 [5:30:26<05:38, 7.05s/it] {'loss': 0.0349, 'grad_norm': 3.173805115126616, 'learning_rate': 2.981366459627329e-08, 'completion_length': 37.33035850524902, 'rewards/accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.955357164144516, 'reward': 1.3928572535514832, 'reward_std': 0.21765290200710297, 'kl': 0.87109375, 'epoch': 4.85} 97%|█████████▋| 1562/1610 [5:30:26<05:38, 7.05s/it] 97%|█████████▋| 1563/1610 [5:30:33<05:29, 7.01s/it] {'loss': 0.0455, 'grad_norm': 4.107462655819134, 'learning_rate': 2.9192546583850933e-08, 'completion_length': 38.85714530944824, 'rewards/accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6160714626312256, 'reward_std': 0.23656460642814636, 'kl': 1.13671875, 'epoch': 4.85} 97%|█████████▋| 1563/1610 [5:30:33<05:29, 7.01s/it] 97%|█████████▋| 1564/1610 [5:30:41<05:24, 7.06s/it] {'loss': 0.106, 'grad_norm': 9.491676148291509, 'learning_rate': 2.857142857142857e-08, 'completion_length': 40.80357360839844, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.8125000298023224, 'reward': 1.348214328289032, 'reward_std': 0.40384314954280853, 'kl': 2.6484375, 'epoch': 4.86} 97%|█████████▋| 1564/1610 [5:30:41<05:24, 7.06s/it] 97%|█████████▋| 1565/1610 [5:30:47<05:12, 6.93s/it] {'loss': 0.0735, 'grad_norm': 4.350423214218731, 'learning_rate': 2.795031055900621e-08, 'completion_length': 41.80357360839844, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3035714626312256, 'reward_std': 0.29272185266017914, 'kl': 1.8359375, 'epoch': 4.86} 97%|█████████▋| 1565/1610 [5:30:47<05:12, 6.93s/it] 
97%|█████████▋| 1566/1610 [5:30:55<05:15, 7.17s/it] {'loss': 0.0801, 'grad_norm': 7.223323873689831, 'learning_rate': 2.7329192546583847e-08, 'completion_length': 40.40178680419922, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.8839286267757416, 'reward': 1.321428656578064, 'reward_std': 0.2919161468744278, 'kl': 2.0078125, 'epoch': 4.86} 97%|█████████▋| 1566/1610 [5:30:55<05:15, 7.17s/it] 97%|█████████▋| 1567/1610 [5:31:02<05:09, 7.19s/it] {'loss': 0.0286, 'grad_norm': 6.4133502087167855, 'learning_rate': 2.670807453416149e-08, 'completion_length': 41.55357360839844, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5357143878936768, 'reward_std': 0.1917066052556038, 'kl': 0.71484375, 'epoch': 4.87} 97%|█████████▋| 1567/1610 [5:31:02<05:09, 7.19s/it] 97%|█████████▋| 1568/1610 [5:31:09<05:03, 7.22s/it] {'loss': 0.0791, 'grad_norm': 6.323736437075028, 'learning_rate': 2.608695652173913e-08, 'completion_length': 42.83035850524902, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.3571428656578064, 'reward_std': 0.32199105620384216, 'kl': 1.984375, 'epoch': 4.87} 97%|█████████▋| 1568/1610 [5:31:09<05:03, 7.22s/it] 97%|█████████▋| 1569/1610 [5:31:16<04:51, 7.10s/it] {'loss': 0.0826, 'grad_norm': 6.697345160880611, 'learning_rate': 2.5465838509316768e-08, 'completion_length': 41.19643020629883, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.223214328289032, 'reward_std': 0.24573880061507225, 'kl': 2.0625, 'epoch': 4.87} 97%|█████████▋| 1569/1610 [5:31:16<04:51, 7.10s/it] 98%|█████████▊| 1570/1610 [5:31:23<04:43, 7.10s/it] {'loss': 0.0702, 'grad_norm': 6.037607961639341, 'learning_rate': 2.4844720496894407e-08, 'completion_length': 43.56250190734863, 'rewards/accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5803572535514832, 
'reward_std': 0.24620163440704346, 'kl': 1.75390625, 'epoch': 4.88} 98%|█████████▊| 1570/1610 [5:31:23<04:43, 7.10s/it] 98%|█████████▊| 1571/1610 [5:31:31<04:37, 7.11s/it] {'loss': 0.0275, 'grad_norm': 6.1998606504266505, 'learning_rate': 2.422360248447205e-08, 'completion_length': 38.12500190734863, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.973214328289032, 'reward': 1.4196429252624512, 'reward_std': 0.14537875726819038, 'kl': 0.689453125, 'epoch': 4.88} 98%|█████████▊| 1571/1610 [5:31:31<04:37, 7.11s/it] 98%|█████████▊| 1572/1610 [5:31:37<04:27, 7.03s/it] {'loss': 0.0098, 'grad_norm': 2.331593447989141, 'learning_rate': 2.360248447204969e-08, 'completion_length': 39.32143020629883, 'rewards/accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4910714626312256, 'reward_std': 0.10821298509836197, 'kl': 0.244140625, 'epoch': 4.88} 98%|█████████▊| 1572/1610 [5:31:37<04:27, 7.03s/it] 98%|█████████▊| 1573/1610 [5:31:44<04:16, 6.93s/it] {'loss': 0.0292, 'grad_norm': 6.370621140324908, 'learning_rate': 2.2981366459627328e-08, 'completion_length': 39.17857360839844, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.508928656578064, 'reward_std': 0.29977546632289886, 'kl': 0.73046875, 'epoch': 4.89} 98%|█████████▊| 1573/1610 [5:31:44<04:16, 6.93s/it] 98%|█████████▊| 1574/1610 [5:31:51<04:09, 6.93s/it] {'loss': 0.0564, 'grad_norm': 4.650520130916191, 'learning_rate': 2.236024844720497e-08, 'completion_length': 41.85714340209961, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.4732143878936768, 'reward_std': 0.398899644613266, 'kl': 1.408203125, 'epoch': 4.89} 98%|█████████▊| 1574/1610 [5:31:51<04:09, 6.93s/it] 98%|█████████▊| 1575/1610 [5:31:58<04:02, 6.93s/it] {'loss': 0.084, 'grad_norm': 6.409336447857158, 'learning_rate': 2.1739130434782606e-08, 'completion_length': 39.15178680419922, 
'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.4553571939468384, 'reward_std': 0.33246469497680664, 'kl': 2.09765625, 'epoch': 4.89} 98%|█████████▊| 1575/1610 [5:31:58<04:02, 6.93s/it] 98%|█████████▊| 1576/1610 [5:32:05<03:54, 6.88s/it] {'loss': 0.0588, 'grad_norm': 5.297917625304408, 'learning_rate': 2.111801242236025e-08, 'completion_length': 38.10714530944824, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.4196428656578064, 'reward_std': 0.2589074522256851, 'kl': 1.46875, 'epoch': 4.89} 98%|█████████▊| 1576/1610 [5:32:05<03:54, 6.88s/it] 98%|█████████▊| 1577/1610 [5:32:12<03:48, 6.92s/it] {'loss': 0.0171, 'grad_norm': 5.3702720157689665, 'learning_rate': 2.0496894409937888e-08, 'completion_length': 43.312503814697266, 'rewards/accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.5625000596046448, 'reward_std': 0.23057927191257477, 'kl': 0.427734375, 'epoch': 4.9} 98%|█████████▊| 1577/1610 [5:32:12<03:48, 6.92s/it] 98%|█████████▊| 1578/1610 [5:32:19<03:45, 7.06s/it] {'loss': 0.0358, 'grad_norm': 4.615487248695723, 'learning_rate': 1.9875776397515527e-08, 'completion_length': 39.25000190734863, 'rewards/accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9553571939468384, 'reward': 1.535714328289032, 'reward_std': 0.28684911131858826, 'kl': 0.8935546875, 'epoch': 4.9} 98%|█████████▊| 1578/1610 [5:32:19<03:45, 7.06s/it] 98%|█████████▊| 1579/1610 [5:32:26<03:36, 7.00s/it] {'loss': 0.0703, 'grad_norm': 4.831294142717518, 'learning_rate': 1.9254658385093166e-08, 'completion_length': 38.46428680419922, 'rewards/accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.8660714626312256, 'reward': 1.410714328289032, 'reward_std': 0.31353282928466797, 'kl': 1.7578125, 'epoch': 4.9} 98%|█████████▊| 1579/1610 [5:32:26<03:36, 7.00s/it] 98%|█████████▊| 1580/1610 [5:32:33<03:29, 6.99s/it] {'loss': 0.0436, 
'grad_norm': 5.950644521047667, 'learning_rate': 1.863354037267081e-08, 'completion_length': 42.25000190734863, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.535714328289032, 'reward_std': 0.3207450956106186, 'kl': 1.087890625, 'epoch': 4.91} 98%|█████████▊| 1580/1610 [5:32:33<03:29, 6.99s/it] 98%|█████████▊| 1581/1610 [5:32:40<03:25, 7.08s/it] {'loss': 0.0254, 'grad_norm': 3.890172046699357, 'learning_rate': 1.8012422360248444e-08, 'completion_length': 39.68750190734863, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6071429252624512, 'reward_std': 0.03818017989397049, 'kl': 0.63671875, 'epoch': 4.91} 98%|█████████▊| 1581/1610 [5:32:40<03:25, 7.08s/it] 98%|█████████▊| 1582/1610 [5:32:47<03:19, 7.11s/it] {'loss': 0.0424, 'grad_norm': 4.070765913791815, 'learning_rate': 1.7391304347826087e-08, 'completion_length': 42.27678871154785, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.348214328289032, 'reward_std': 0.19576600193977356, 'kl': 1.05859375, 'epoch': 4.91} 98%|█████████▊| 1582/1610 [5:32:47<03:19, 7.11s/it] 98%|█████████▊| 1583/1610 [5:32:55<03:14, 7.20s/it] {'loss': 0.0232, 'grad_norm': 3.736393565947859, 'learning_rate': 1.6770186335403723e-08, 'completion_length': 40.41071701049805, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.4642857909202576, 'reward_std': 0.2338685840368271, 'kl': 0.58203125, 'epoch': 4.92} 98%|█████████▊| 1583/1610 [5:32:55<03:14, 7.20s/it] 98%|█████████▊| 1584/1610 [5:33:01<03:03, 7.04s/it] {'loss': 0.046, 'grad_norm': 3.917324067963894, 'learning_rate': 1.6149068322981365e-08, 'completion_length': 37.83928680419922, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5178571939468384, 'reward_std': 0.1995968073606491, 'kl': 1.1484375, 'epoch': 4.92} 
98%|█████████▊| 1584/1610 [5:33:01<03:03, 7.04s/it] 98%|█████████▊| 1585/1610 [5:33:09<03:00, 7.22s/it] {'loss': 0.0275, 'grad_norm': 3.2705067647540913, 'learning_rate': 1.5527950310559004e-08, 'completion_length': 45.33035850524902, 'rewards/accuracy_reward': 0.375, 'rewards/format_reward': 0.9642857611179352, 'reward': 1.3392857909202576, 'reward_std': 0.17885925620794296, 'kl': 0.6884765625, 'epoch': 4.92} 98%|█████████▊| 1585/1610 [5:33:09<03:00, 7.22s/it] 99%|█████████▊| 1586/1610 [5:33:16<02:50, 7.11s/it] {'loss': 0.0564, 'grad_norm': 5.862718356213599, 'learning_rate': 1.4906832298136644e-08, 'completion_length': 39.88393020629883, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2321429252624512, 'reward_std': 0.2501044273376465, 'kl': 1.412109375, 'epoch': 4.93} 99%|█████████▊| 1586/1610 [5:33:16<02:50, 7.11s/it] 99%|█████████▊| 1587/1610 [5:33:23<02:44, 7.15s/it] {'loss': 0.0388, 'grad_norm': 4.979410276676519, 'learning_rate': 1.4285714285714284e-08, 'completion_length': 41.65178871154785, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.4553571939468384, 'reward_std': 0.25335484743118286, 'kl': 0.97265625, 'epoch': 4.93} 99%|█████████▊| 1587/1610 [5:33:23<02:44, 7.15s/it] 99%|█████████▊| 1588/1610 [5:33:30<02:35, 7.05s/it] {'loss': 0.0103, 'grad_norm': 4.95402216252188, 'learning_rate': 1.3664596273291924e-08, 'completion_length': 42.473215103149414, 'rewards/accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.9910714626312256, 'reward': 1.3928571939468384, 'reward_std': 0.2338685840368271, 'kl': 0.2578125, 'epoch': 4.93} 99%|█████████▊| 1588/1610 [5:33:30<02:35, 7.05s/it] 99%|█████████▊| 1589/1610 [5:33:37<02:30, 7.15s/it] {'loss': 0.0564, 'grad_norm': 4.618606694468921, 'learning_rate': 1.3043478260869564e-08, 'completion_length': 38.70535850524902, 'rewards/accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 
0.892857164144516, 'reward': 1.508928656578064, 'reward_std': 0.4348701238632202, 'kl': 1.408203125, 'epoch': 4.93} 99%|█████████▊| 1589/1610 [5:33:37<02:30, 7.15s/it] 99%|█████████▉| 1590/1610 [5:33:45<02:28, 7.41s/it] {'loss': 0.0616, 'grad_norm': 5.6492596331859195, 'learning_rate': 1.2422360248447204e-08, 'completion_length': 42.30357360839844, 'rewards/accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.892857164144516, 'reward': 1.3303571939468384, 'reward_std': 0.27804866433143616, 'kl': 1.544921875, 'epoch': 4.94} 99%|█████████▉| 1590/1610 [5:33:45<02:28, 7.41s/it] 99%|█████████▉| 1591/1610 [5:33:55<02:34, 8.13s/it] {'loss': 0.0712, 'grad_norm': 6.878573482141018, 'learning_rate': 1.1801242236024844e-08, 'completion_length': 41.15178680419922, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.8839286267757416, 'reward': 1.348214328289032, 'reward_std': 0.36734098196029663, 'kl': 1.77734375, 'epoch': 4.94} 99%|█████████▉| 1591/1610 [5:33:55<02:34, 8.13s/it] 99%|█████████▉| 1592/1610 [5:34:02<02:21, 7.86s/it] {'loss': 0.0479, 'grad_norm': 8.147338312048067, 'learning_rate': 1.1180124223602485e-08, 'completion_length': 36.58928680419922, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.383928656578064, 'reward_std': 0.49951496720314026, 'kl': 1.19921875, 'epoch': 4.94} 99%|█████████▉| 1592/1610 [5:34:02<02:21, 7.86s/it] 99%|█████████▉| 1593/1610 [5:34:10<02:09, 7.64s/it] {'loss': 0.062, 'grad_norm': 6.847993164671675, 'learning_rate': 1.0559006211180124e-08, 'completion_length': 40.50893020629883, 'rewards/accuracy_reward': 0.401785746216774, 'rewards/format_reward': 0.9017857313156128, 'reward': 1.3035714626312256, 'reward_std': 0.22727259993553162, 'kl': 1.54833984375, 'epoch': 4.95} 99%|█████████▉| 1593/1610 [5:34:10<02:09, 7.64s/it] 99%|█████████▉| 1594/1610 [5:34:16<01:58, 7.38s/it] {'loss': 0.0334, 'grad_norm': 6.495254768537416, 'learning_rate': 
9.937888198757763e-09, 'completion_length': 40.26785850524902, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9196428954601288, 'reward': 1.5803571939468384, 'reward_std': 0.4431169927120209, 'kl': 0.8359375, 'epoch': 4.95} 99%|█████████▉| 1594/1610 [5:34:16<01:58, 7.38s/it] 99%|█████████▉| 1595/1610 [5:34:23<01:47, 7.20s/it] {'loss': 0.0311, 'grad_norm': 7.119805633088792, 'learning_rate': 9.316770186335404e-09, 'completion_length': 45.85714340209961, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3035715222358704, 'reward_std': 0.3169811964035034, 'kl': 0.779296875, 'epoch': 4.95} 99%|█████████▉| 1595/1610 [5:34:23<01:47, 7.20s/it] 99%|█████████▉| 1596/1610 [5:34:30<01:39, 7.10s/it] {'loss': 0.0244, 'grad_norm': 4.052002497202286, 'learning_rate': 8.695652173913043e-09, 'completion_length': 40.58035850524902, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821429252624512, 'reward': 1.4821429252624512, 'reward_std': 0.18397442251443863, 'kl': 0.611328125, 'epoch': 4.96} 99%|█████████▉| 1596/1610 [5:34:30<01:39, 7.10s/it] 99%|█████████▉| 1597/1610 [5:34:37<01:32, 7.10s/it] {'loss': 0.0495, 'grad_norm': 6.311410211467213, 'learning_rate': 8.074534161490683e-09, 'completion_length': 41.848215103149414, 'rewards/accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4017857313156128, 'reward_std': 0.3355749100446701, 'kl': 1.23828125, 'epoch': 4.96} 99%|█████████▉| 1597/1610 [5:34:37<01:32, 7.10s/it] 99%|█████████▉| 1598/1610 [5:34:44<01:25, 7.12s/it] {'loss': 0.0695, 'grad_norm': 8.096592376125, 'learning_rate': 7.453416149068322e-09, 'completion_length': 39.366071701049805, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.2678572535514832, 'reward_std': 0.441828191280365, 'kl': 1.7421875, 'epoch': 4.96} 99%|█████████▉| 1598/1610 [5:34:44<01:25, 7.12s/it] 
99%|█████████▉| 1599/1610 [5:34:51<01:18, 7.13s/it] {'loss': 0.0804, 'grad_norm': 5.588930607104728, 'learning_rate': 6.832298136645962e-09, 'completion_length': 43.45535850524902, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.8839286267757416, 'reward': 1.3303571939468384, 'reward_std': 0.3908363878726959, 'kl': 2.01171875, 'epoch': 4.97} 99%|█████████▉| 1599/1610 [5:34:51<01:18, 7.13s/it] 99%|█████████▉| 1600/1610 [5:34:58<01:10, 7.01s/it] {'loss': 0.053, 'grad_norm': 5.395971195522636, 'learning_rate': 6.211180124223602e-09, 'completion_length': 37.73214530944824, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.928571492433548, 'reward': 1.410714328289032, 'reward_std': 0.2702374756336212, 'kl': 1.3251953125, 'epoch': 4.97} 99%|█████████▉| 1600/1610 [5:34:58<01:10, 7.01s/it] 99%|█████████▉| 1601/1610 [5:35:51<03:05, 20.66s/it] {'loss': 0.025, 'grad_norm': 5.276807682791324, 'learning_rate': 5.5900621118012426e-09, 'completion_length': 34.07143020629883, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5000000596046448, 'reward_std': 0.0739355981349945, 'kl': 0.625, 'epoch': 4.97} 99%|█████████▉| 1601/1610 [5:35:51<03:05, 20.66s/it] 100%|█████████▉| 1602/1610 [5:36:01<02:21, 17.67s/it] {'loss': 0.0549, 'grad_norm': 5.597323192340001, 'learning_rate': 4.968944099378882e-09, 'completion_length': 54.392860412597656, 'rewards/accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 0.8839285969734192, 'reward': 1.4107143878936768, 'reward_std': 0.34235283732414246, 'kl': 1.375, 'epoch': 4.98} 100%|█████████▉| 1602/1610 [5:36:01<02:21, 17.67s/it] 100%|█████████▉| 1603/1610 [5:36:08<01:41, 14.50s/it] {'loss': 0.0402, 'grad_norm': 4.671342675095641, 'learning_rate': 4.347826086956522e-09, 'completion_length': 39.79464530944824, 'rewards/accuracy_reward': 0.4464286118745804, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3750001192092896, 
'reward_std': 0.17433738708496094, 'kl': 1.00830078125, 'epoch': 4.98} 100%|█████████▉| 1603/1610 [5:36:08<01:41, 14.50s/it] 100%|█████████▉| 1604/1610 [5:36:16<01:14, 12.33s/it] {'loss': 0.0468, 'grad_norm': 3.6718873174235034, 'learning_rate': 3.726708074534161e-09, 'completion_length': 43.160715103149414, 'rewards/accuracy_reward': 0.6160714328289032, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5267857909202576, 'reward_std': 0.30440113693475723, 'kl': 1.169921875, 'epoch': 4.98} 100%|█████████▉| 1604/1610 [5:36:16<01:14, 12.33s/it] 100%|█████████▉| 1605/1610 [5:36:23<00:54, 10.84s/it] {'loss': 0.0616, 'grad_norm': 9.310682791800847, 'learning_rate': 3.105590062111801e-09, 'completion_length': 41.37500190734863, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.2872702106833458, 'kl': 1.54296875, 'epoch': 4.98} 100%|█████████▉| 1605/1610 [5:36:23<00:54, 10.84s/it] 100%|█████████▉| 1606/1610 [5:36:30<00:38, 9.68s/it] {'loss': 0.0302, 'grad_norm': 4.200394614773019, 'learning_rate': 2.484472049689441e-09, 'completion_length': 40.08035850524902, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9375000298023224, 'reward': 1.5803572535514832, 'reward_std': 0.24290671944618225, 'kl': 0.7578125, 'epoch': 4.99} 100%|█████████▉| 1606/1610 [5:36:30<00:38, 9.68s/it] 100%|█████████▉| 1607/1610 [5:36:37<00:26, 8.93s/it] {'loss': 0.0853, 'grad_norm': 7.428548232274122, 'learning_rate': 1.8633540372670804e-09, 'completion_length': 44.92857360839844, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.8571429252624512, 'reward': 1.3928571939468384, 'reward_std': 0.4327283799648285, 'kl': 2.1328125, 'epoch': 4.99} 100%|█████████▉| 1607/1610 [5:36:37<00:26, 8.93s/it] 100%|█████████▉| 1608/1610 [5:36:45<00:16, 8.46s/it] {'loss': 0.0286, 'grad_norm': 3.8415583984129253, 'learning_rate': 1.2422360248447204e-09, 'completion_length': 
39.99107360839844, 'rewards/accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4553572535514832, 'reward_std': 0.1767766922712326, 'kl': 0.712890625, 'epoch': 4.99} 100%|█████████▉| 1608/1610 [5:36:45<00:16, 8.46s/it] 100%|█████████▉| 1609/1610 [5:36:55<00:09, 9.08s/it] {'loss': 0.1625, 'grad_norm': 15.105079995470401, 'learning_rate': 6.211180124223602e-10, 'completion_length': 53.017860412597656, 'rewards/accuracy_reward': 0.3839285969734192, 'rewards/format_reward': 0.6964285969734192, 'reward': 1.0803571939468384, 'reward_std': 0.46089892089366913, 'kl': 4.0703125, 'epoch': 5.0} 100%|█████████▉| 1609/1610 [5:36:55<00:09, 9.08s/it] 100%|██████████| 1610/1610 [5:37:05<00:00, 9.25s/it] {'loss': 0.0587, 'grad_norm': 7.387599793913319, 'learning_rate': 0.0, 'completion_length': 40.75893020629883, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9017857611179352, 'reward': 1.2946429252624512, 'reward_std': 0.2546207010746002, 'kl': 1.46875, 'epoch': 5.0} 100%|██████████| 1610/1610 [5:37:05<00:00, 9.25s/it] {'train_runtime': 20285.2457, 'train_samples_per_second': 1.111, 'train_steps_per_second': 0.079, 'train_loss': 0.014168868047992162, 'epoch': 5.0} 100%|██████████| 1610/1610 [5:38:02<00:00, 9.25s/it] 100%|██████████| 1610/1610 [5:38:02<00:00, 12.60s/it] wandb: wandb: 🚀 View run VLLM-Correct-Qwen2-VL-2B-GRPO-GEOQA-4k5-2025-02-21-02-59-18 at: https://wandb.ai/tanhuajie264-peking-university/vison-open-r1/runs/vpi29oym wandb: Find logs at: wandb/run-20250221_030129-vpi29oym/logs