/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/transformers/trainer.py:2833: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint_rng_state = torch.load(rng_file)
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-21 08:34:03,763 >> {'loss': 0.35, 'grad_norm': 12.226541519165039, 'learning_rate': 1.1472985271657697e-07, 'epoch': 0.00027122321670735016, 'num_input_tokens_seen': 7132413952, 'completed': '92.24% (3_401 / 3_687)', 'remaining time': '3:22:21', 'throughput': '3087.42', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:34:35,102 >> {'loss': 0.5296, 'grad_norm': 15.276947975158691, 'learning_rate': 1.1462758757446728e-07, 'epoch': 0.0005424464334147003, 'num_input_tokens_seen': 7134511104, 'completed': '92.27% (3_402 / 3_687)', 'remaining time': '2:55:15', 'throughput': '8364.78', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:35:06,731 >> {'loss': 0.3315, 'grad_norm': 15.25047492980957, 'learning_rate': 1.1452567280350789e-07, 'epoch': 0.0008136696501220504, 'num_input_tokens_seen': 7136608256, 'completed': '92.30% (3_403 / 3_687)', 'remaining time': '2:46:19', 'throughput': '8288.07', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:35:40,881 >> {'loss': 0.369, 'grad_norm': 18.95487403869629, 'learning_rate': 1.1442410848571602e-07, 'epoch': 0.0010848928668294006, 'num_input_tokens_seen': 7138705408, 'completed': '92.32% (3_404 / 3_687)', 'remaining time': '2:44:34', 'throughput': '7676.29', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:36:11,222 >> {'loss': 0.4243, 'grad_norm': 15.813984870910645, 'learning_rate': 1.1432289470282683e-07, 'epoch': 0.0013561160835367507, 'num_input_tokens_seen': 7140802560, 'completed': '92.35% (3_405 / 3_687)', 'remaining time': '2:39:43', 'throughput': '8639.71', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:36:44,685 >> {'loss': 0.3994, 'grad_norm': 15.477920532226562, 'learning_rate': 1.1422203153629312e-07, 'epoch': 0.0016273393002441008, 'num_input_tokens_seen': 7142899712, 'completed': '92.38% (3_406 / 3_687)', 'remaining time': '2:38:44', 'throughput': '7833.95', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:37:15,937 >> {'loss': 0.4129, 'grad_norm': 15.906928062438965, 'learning_rate': 1.1412151906728589e-07, 'epoch': 0.001898562516951451, 'num_input_tokens_seen': 7144996864, 'completed': '92.41% (3_407 / 3_687)', 'remaining time': '2:36:25', 'throughput': '8387.97', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:37:42,984 >> {'loss': 0.7718, 'grad_norm': 19.251590728759766, 'learning_rate': 1.1402135737669372e-07, 'epoch': 0.0021697857336588013, 'num_input_tokens_seen': 7147094016, 'completed': '92.43% (3_408 / 3_687)', 'remaining time': '2:32:05', 'throughput': '9692.38', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:38:13,029 >> {'loss': 0.4258, 'grad_norm': 14.729852676391602, 'learning_rate': 1.1392154654512289e-07, 'epoch': 0.0024410089503661514, 'num_input_tokens_seen': 7149191168, 'completed': '92.46% (3_409 / 3_687)', 'remaining time': '2:30:10', 'throughput': '8724.97', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:38:47,312 >> {'loss': 0.6094, 'grad_norm': 16.52750587463379, 'learning_rate': 1.1382208665289742e-07, 'epoch': 0.0027122321670735015, 'num_input_tokens_seen': 7151288320, 'completed': '92.49% (3_410 / 3_687)', 'remaining time': '2:30:30', 'throughput': '7646.58', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:39:13,008 >> {'loss': 0.7976, 'grad_norm': 30.927867889404297, 'learning_rate': 1.1372297778005883e-07, 'epoch': 0.0029834553837808516, 'num_input_tokens_seen': 7153385472, 'completed': '92.51% (3_411 / 3_687)', 'remaining time': '2:27:04', 'throughput': '10201.77', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:39:44,072 >> {'loss': 0.3816, 'grad_norm': 14.352428436279297, 'learning_rate': 1.1362422000636609e-07, 'epoch': 0.0032546786004882017, 'num_input_tokens_seen': 7155482624, 'completed': '92.54% (3_412 / 3_687)', 'remaining time': '2:26:11', 'throughput': '8438.66', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:40:13,740 >> {'loss': 0.4478, 'grad_norm': 18.680057525634766, 'learning_rate': 1.135258134112958e-07, 'epoch': 0.003525901817195552, 'num_input_tokens_seen': 7157579776, 'completed': '92.57% (3_413 / 3_687)', 'remaining time': '2:24:52', 'throughput': '8836.09', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:40:47,065 >> {'loss': 0.4313, 'grad_norm': 13.85073471069336, 'learning_rate': 1.1342775807404177e-07, 'epoch': 0.003797125033902902, 'num_input_tokens_seen': 7159676928, 'completed': '92.60% (3_414 / 3_687)', 'remaining time': '2:24:52', 'throughput': '7866.13', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:41:17,781 >> {'loss': 0.301, 'grad_norm': 12.328075408935547, 'learning_rate': 1.1333005407351516e-07, 'epoch': 0.0040683482506102524, 'num_input_tokens_seen': 7161774080, 'completed': '92.62% (3_415 / 3_687)', 'remaining time': '2:24:00', 'throughput': '8534.58', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:41:45,555 >> {'loss': 0.6079, 'grad_norm': 15.454139709472656, 'learning_rate': 1.1323270148834461e-07, 'epoch': 0.0043395714673176026, 'num_input_tokens_seen': 7163871232, 'completed': '92.65% (3_416 / 3_687)', 'remaining time': '2:22:20', 'throughput': '9438.32', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:42:14,931 >> {'loss': 0.5509, 'grad_norm': 17.02078628540039, 'learning_rate': 1.1313570039687571e-07, 'epoch': 0.004610794684024953, 'num_input_tokens_seen': 7165968384, 'completed': '92.68% (3_417 / 3_687)', 'remaining time': '2:21:15', 'throughput': '8923.91', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:42:46,987 >> {'loss': 0.5081, 'grad_norm': 16.010780334472656, 'learning_rate': 1.1303905087717111e-07, 'epoch': 0.004882017900732303, 'num_input_tokens_seen': 7168065536, 'completed': '92.70% (3_418 / 3_687)', 'remaining time': '2:20:53', 'throughput': '8177.51', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:43:15,970 >> {'loss': 0.3705, 'grad_norm': 12.919666290283203, 'learning_rate': 1.1294275300701085e-07, 'epoch': 0.005153241117439653, 'num_input_tokens_seen': 7170162688, 'completed': '92.73% (3_419 / 3_687)', 'remaining time': '2:19:47', 'throughput': '9044.73', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:43:47,486 >> {'loss': 0.5029, 'grad_norm': 14.903799057006836, 'learning_rate': 1.1284680686389163e-07, 'epoch': 0.005424464334147003, 'num_input_tokens_seen': 7172259840, 'completed': '92.76% (3_420 / 3_687)', 'remaining time': '2:19:19', 'throughput': '8317.96', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:44:16,977 >> {'loss': 0.408, 'grad_norm': 18.086734771728516, 'learning_rate': 1.1275121252502738e-07, 'epoch': 0.005695687550854353, 'num_input_tokens_seen': 7174356992, 'completed': '92.79% (3_421 / 3_687)', 'remaining time': '2:18:25', 'throughput': '8888.78', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:44:51,748 >> {'loss': 0.3967, 'grad_norm': 13.25420093536377, 'learning_rate': 1.1265597006734872e-07, 'epoch': 0.005966910767561703, 'num_input_tokens_seen': 7176454144, 'completed': '92.81% (3_422 / 3_687)', 'remaining time': '2:18:36', 'throughput': '7539.28', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:45:20,142 >> {'loss': 0.4737, 'grad_norm': 16.535417556762695, 'learning_rate': 1.1256107956750319e-07, 'epoch': 0.006238133984269053, 'num_input_tokens_seen': 7178551296, 'completed': '92.84% (3_423 / 3_687)', 'remaining time': '2:17:30', 'throughput': '9232.39', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:45:49,239 >> {'loss': 0.2047, 'grad_norm': 13.090187072753906, 'learning_rate': 1.1246654110185501e-07, 'epoch': 0.006509357200976403, 'num_input_tokens_seen': 7180648448, 'completed': '92.87% (3_424 / 3_687)', 'remaining time': '2:16:36', 'throughput': '9009.28', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:46:17,642 >> {'loss': 0.8687, 'grad_norm': 21.00216293334961, 'learning_rate': 1.1237235474648516e-07, 'epoch': 0.0067805804176837535, 'num_input_tokens_seen': 7182745600, 'completed': '92.89% (3_425 / 3_687)', 'remaining time': '2:15:35', 'throughput': '9229.57', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:46:50,506 >> {'loss': 0.5996, 'grad_norm': 18.46866226196289, 'learning_rate': 1.1227852057719125e-07, 'epoch': 0.007051803634391104, 'num_input_tokens_seen': 7184842752, 'completed': '92.92% (3_426 / 3_687)', 'remaining time': '2:15:23', 'throughput': '7976.46', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:47:21,673 >> {'loss': 0.4136, 'grad_norm': 21.20566749572754, 'learning_rate': 1.121850386694875e-07, 'epoch': 0.007323026851098454, 'num_input_tokens_seen': 7186939904, 'completed': '92.95% (3_427 / 3_687)', 'remaining time': '2:14:52', 'throughput': '8410.90', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:47:52,825 >> {'loss': 0.5273, 'grad_norm': 15.836469650268555, 'learning_rate': 1.1209190909860453e-07, 'epoch': 0.007594250067805804, 'num_input_tokens_seen': 7189037056, 'completed': '92.98% (3_428 / 3_687)', 'remaining time': '2:14:21', 'throughput': '8415.17', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:48:21,903 >> {'loss': 0.4879, 'grad_norm': 17.805950164794922, 'learning_rate': 1.119991319394894e-07, 'epoch': 0.007865473284513154, 'num_input_tokens_seen': 7191134208, 'completed': '93.00% (3_429 / 3_687)', 'remaining time': '2:13:32', 'throughput': '9015.20', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:48:51,312 >> {'loss': 0.6706, 'grad_norm': 21.2362117767334, 'learning_rate': 1.1190670726680579e-07, 'epoch': 0.008136696501220505, 'num_input_tokens_seen': 7193231360, 'completed': '93.03% (3_430 / 3_687)', 'remaining time': '2:12:47', 'throughput': '8913.64', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:49:23,625 >> {'loss': 0.2483, 'grad_norm': 12.916275024414062, 'learning_rate': 1.1181463515493336e-07, 'epoch': 0.008407919717927854, 'num_input_tokens_seen': 7195328512, 'completed': '93.06% (3_431 / 3_687)', 'remaining time': '2:12:26', 'throughput': '8112.79', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:49:52,576 >> {'loss': 0.3723, 'grad_norm': 14.223495483398438, 'learning_rate': 1.1172291567796846e-07, 'epoch': 0.008679142934635205, 'num_input_tokens_seen': 7197425664, 'completed': '93.08% (3_432 / 3_687)', 'remaining time': '2:11:39', 'throughput': '9054.69', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:50:22,799 >> {'loss': 0.3586, 'grad_norm': 12.331622123718262, 'learning_rate': 1.1163154890972333e-07, 'epoch': 0.008950366151342554, 'num_input_tokens_seen': 7199522816, 'completed': '93.11% (3_433 / 3_687)', 'remaining time': '2:11:02', 'throughput': '8673.70', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:50:53,786 >> {'loss': 0.2378, 'grad_norm': 9.547661781311035, 'learning_rate': 1.1154053492372654e-07, 'epoch': 0.009221589368049905, 'num_input_tokens_seen': 7201619968, 'completed': '93.14% (3_434 / 3_687)', 'remaining time': '2:10:31', 'throughput': '8459.61', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:51:22,996 >> {'loss': 0.5829, 'grad_norm': 19.91844367980957, 'learning_rate': 1.1144987379322254e-07, 'epoch': 0.009492812584757255, 'num_input_tokens_seen': 7203717120, 'completed': '93.17% (3_435 / 3_687)', 'remaining time': '2:09:48', 'throughput': '8974.56', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:51:50,507 >> {'loss': 0.7757, 'grad_norm': 23.920061111450195, 'learning_rate': 1.1135956559117207e-07, 'epoch': 0.009764035801464606, 'num_input_tokens_seen': 7205814272, 'completed': '93.19% (3_436 / 3_687)', 'remaining time': '2:08:53', 'throughput': '9528.68', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:52:21,214 >> {'loss': 0.2022, 'grad_norm': 9.220373153686523, 'learning_rate': 1.1126961039025168e-07, 'epoch': 0.010035259018171955, 'num_input_tokens_seen': 7207911424, 'completed': '93.22% (3_437 / 3_687)', 'remaining time': '2:08:22', 'throughput': '8537.06', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:52:50,711 >> {'loss': 0.3518, 'grad_norm': 14.202072143554688, 'learning_rate': 1.111800082628539e-07, 'epoch': 0.010306482234879306, 'num_input_tokens_seen': 7210008576, 'completed': '93.25% (3_438 / 3_687)', 'remaining time': '2:07:42', 'throughput': '8887.09', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:53:18,769 >> {'loss': 0.5367, 'grad_norm': 16.335010528564453, 'learning_rate': 1.1109075928108715e-07, 'epoch': 0.010577705451586655, 'num_input_tokens_seen': 7212105728, 'completed': '93.27% (3_439 / 3_687)', 'remaining time': '2:06:54', 'throughput': '9342.87', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:53:47,589 >> {'loss': 0.8455, 'grad_norm': 20.81877326965332, 'learning_rate': 1.1100186351677567e-07, 'epoch': 0.010848928668294006, 'num_input_tokens_seen': 7214202880, 'completed': '93.30% (3_440 / 3_687)', 'remaining time': '2:06:12', 'throughput': '9095.79', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:54:17,982 >> {'loss': 0.404, 'grad_norm': 12.109017372131348, 'learning_rate': 1.1091332104145921e-07, 'epoch': 0.011120151885001357, 'num_input_tokens_seen': 7216300032, 'completed': '93.33% (3_441 / 3_687)', 'remaining time': '2:05:40', 'throughput': '8625.26', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:54:48,848 >> {'loss': 0.4581, 'grad_norm': 51.72409439086914, 'learning_rate': 1.1082513192639353e-07, 'epoch': 0.011391375101708706, 'num_input_tokens_seen': 7218397184, 'completed': '93.36% (3_442 / 3_687)', 'remaining time': '2:05:10', 'throughput': '8492.82', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:55:22,060 >> {'loss': 0.2449, 'grad_norm': 11.072718620300293, 'learning_rate': 1.1073729624254984e-07, 'epoch': 0.011662598318416057, 'num_input_tokens_seen': 7220494336, 'completed': '93.38% (3_443 / 3_687)', 'remaining time': '2:04:54', 'throughput': '7893.08', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:55:49,988 >> {'loss': 0.6404, 'grad_norm': 20.055789947509766, 'learning_rate': 1.1064981406061494e-07, 'epoch': 0.011933821535123406, 'num_input_tokens_seen': 7222591488, 'completed': '93.41% (3_444 / 3_687)', 'remaining time': '2:04:08', 'throughput': '9386.46', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:56:19,278 >> {'loss': 0.4006, 'grad_norm': 12.380927085876465, 'learning_rate': 1.1056268545099117e-07, 'epoch': 0.012205044751830757, 'num_input_tokens_seen': 7224688640, 'completed': '93.44% (3_445 / 3_687)', 'remaining time': '2:03:30', 'throughput': '8949.88', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:56:52,378 >> {'loss': 0.2531, 'grad_norm': 13.249361038208008, 'learning_rate': 1.1047591048379635e-07, 'epoch': 0.012476267968538107, 'num_input_tokens_seen': 7226785792, 'completed': '93.46% (3_446 / 3_687)', 'remaining time': '2:03:12', 'throughput': '7919.75', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:57:21,657 >> {'loss': 0.5931, 'grad_norm': 16.427322387695312, 'learning_rate': 1.1038948922886355e-07, 'epoch': 0.012747491185245458, 'num_input_tokens_seen': 7228882944, 'completed': '93.49% (3_447 / 3_687)', 'remaining time': '2:02:34', 'throughput': '8953.54', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:57:50,040 >> {'loss': 0.4725, 'grad_norm': 16.59769630432129, 'learning_rate': 1.1030342175574144e-07, 'epoch': 0.013018714401952807, 'num_input_tokens_seen': 7230980096, 'completed': '93.52% (3_448 / 3_687)', 'remaining time': '2:01:53', 'throughput': '9235.81', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:58:21,899 >> {'loss': 0.6082, 'grad_norm': 25.384323120117188, 'learning_rate': 1.1021770813369378e-07, 'epoch': 0.013289937618660158, 'num_input_tokens_seen': 7233077248, 'completed': '93.54% (3_449 / 3_687)', 'remaining time': '2:01:28', 'throughput': '8228.40', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:58:58,048 >> {'loss': 0.2975, 'grad_norm': 14.333030700683594, 'learning_rate': 1.1013234843169967e-07, 'epoch': 0.013561160835367507, 'num_input_tokens_seen': 7235174400, 'completed': '93.57% (3_450 / 3_687)', 'remaining time': '2:01:24', 'throughput': '7251.67', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:59:28,615 >> {'loss': 0.2593, 'grad_norm': 11.298495292663574, 'learning_rate': 1.100473427184534e-07, 'epoch': 0.013832384052074858, 'num_input_tokens_seen': 7237271552, 'completed': '93.60% (3_451 / 3_687)', 'remaining time': '2:00:52', 'throughput': '8576.17', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 08:59:59,922 >> {'loss': 0.5843, 'grad_norm': 16.831668853759766, 'learning_rate': 1.0996269106236425e-07, 'epoch': 0.014103607268782207, 'num_input_tokens_seen': 7239368704, 'completed': '93.63% (3_452 / 3_687)', 'remaining time': '2:00:24', 'throughput': '8373.87', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:00:30,902 >> {'loss': 0.3489, 'grad_norm': 12.40096664428711, 'learning_rate': 1.0987839353155661e-07, 'epoch': 0.014374830485489558, 'num_input_tokens_seen': 7241465856, 'completed': '93.65% (3_453 / 3_687)', 'remaining time': '1:59:54', 'throughput': '8461.14', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:01:02,748 >> {'loss': 0.4209, 'grad_norm': 14.534159660339355, 'learning_rate': 1.0979445019387e-07, 'epoch': 0.014646053702196907, 'num_input_tokens_seen': 7243563008, 'completed': '93.68% (3_454 / 3_687)', 'remaining time': '1:59:28', 'throughput': '8232.40', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:01:36,517 >> {'loss': 0.3707, 'grad_norm': 14.342657089233398, 'learning_rate': 1.0971086111685883e-07, 'epoch': 0.014917276918904258, 'num_input_tokens_seen': 7245660160, 'completed': '93.71% (3_455 / 3_687)', 'remaining time': '1:59:10', 'throughput': '7762.02', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:02:07,107 >> {'loss': 0.5742, 'grad_norm': 19.60882568359375, 'learning_rate': 1.0962762636779235e-07, 'epoch': 0.015188500135611608, 'num_input_tokens_seen': 7247757312, 'completed': '93.73% (3_456 / 3_687)', 'remaining time': '1:58:38', 'throughput': '8569.64', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:02:38,521 >> {'loss': 0.6129, 'grad_norm': 18.226106643676758, 'learning_rate': 1.0954474601365482e-07, 'epoch': 0.015459723352318959, 'num_input_tokens_seen': 7249854464, 'completed': '93.76% (3_457 / 3_687)', 'remaining time': '1:58:10', 'throughput': '8344.90', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:03:09,006 >> {'loss': 0.438, 'grad_norm': 15.364544868469238, 'learning_rate': 1.094622201211451e-07, 'epoch': 0.015730946569026308, 'num_input_tokens_seen': 7251951616, 'completed': '93.79% (3_458 / 3_687)', 'remaining time': '1:57:38', 'throughput': '8599.15', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:03:41,415 >> {'loss': 0.4399, 'grad_norm': 13.685408592224121, 'learning_rate': 1.0938004875667689e-07, 'epoch': 0.01600216978573366, 'num_input_tokens_seen': 7254048768, 'completed': '93.82% (3_459 / 3_687)', 'remaining time': '1:57:13', 'throughput': '8088.68', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:04:10,646 >> {'loss': 0.3581, 'grad_norm': 13.642942428588867, 'learning_rate': 1.0929823198637866e-07, 'epoch': 0.01627339300244101, 'num_input_tokens_seen': 7256145920, 'completed': '93.84% (3_460 / 3_687)', 'remaining time': '1:56:36', 'throughput': '8967.81', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:04:40,535 >> {'loss': 0.5739, 'grad_norm': 15.478311538696289, 'learning_rate': 1.0921676987609335e-07, 'epoch': 0.01654461621914836, 'num_input_tokens_seen': 7258243072, 'completed': '93.87% (3_461 / 3_687)', 'remaining time': '1:56:02', 'throughput': '8770.76', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:05:09,301 >> {'loss': 0.6112, 'grad_norm': 15.546080589294434, 'learning_rate': 1.0913566249137865e-07, 'epoch': 0.016815839435855708, 'num_input_tokens_seen': 7260340224, 'completed': '93.90% (3_462 / 3_687)', 'remaining time': '1:55:24', 'throughput': '9112.99', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:05:38,458 >> {'loss': 0.532, 'grad_norm': 14.256020545959473, 'learning_rate': 1.0905490989750656e-07, 'epoch': 0.01708706265256306, 'num_input_tokens_seen': 7262437376, 'completed': '93.92% (3_463 / 3_687)', 'remaining time': '1:54:47', 'throughput': '8990.55', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:06:07,816 >> {'loss': 0.673, 'grad_norm': 18.604835510253906, 'learning_rate': 1.0897451215946378e-07, 'epoch': 0.01735828586927041, 'num_input_tokens_seen': 7264534528, 'completed': '93.95% (3_464 / 3_687)', 'remaining time': '1:54:12', 'throughput': '8929.34', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:06:39,726 >> {'loss': 0.5489, 'grad_norm': 17.011693954467773, 'learning_rate': 1.0889446934195141e-07, 'epoch': 0.01762950908597776, 'num_input_tokens_seen': 7266631680, 'completed': '93.98% (3_465 / 3_687)', 'remaining time': '1:53:45', 'throughput': '8214.96', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:07:09,660 >> {'loss': 0.4176, 'grad_norm': 13.888276100158691, 'learning_rate': 1.0881478150938475e-07, 'epoch': 0.01790073230268511, 'num_input_tokens_seen': 7268728832, 'completed': '94.01% (3_466 / 3_687)', 'remaining time': '1:53:11', 'throughput': '8757.48', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:07:40,806 >> {'loss': 0.396, 'grad_norm': 15.176491737365723, 'learning_rate': 1.0873544872589361e-07, 'epoch': 0.01817195551939246, 'num_input_tokens_seen': 7270825984, 'completed': '94.03% (3_467 / 3_687)', 'remaining time': '1:52:42', 'throughput': '8416.67', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:08:14,575 >> {'loss': 0.4393, 'grad_norm': 14.927526473999023, 'learning_rate': 1.08656471055322e-07, 'epoch': 0.01844317873609981, 'num_input_tokens_seen': 7272923136, 'completed': '94.06% (3_468 / 3_687)', 'remaining time': '1:52:21', 'throughput': '7762.74', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:08:41,118 >> {'loss': 0.8493, 'grad_norm': 21.812307357788086, 'learning_rate': 1.0857784856122812e-07, 'epoch': 0.01871440195280716, 'num_input_tokens_seen': 7275020288, 'completed': '94.09% (3_469 / 3_687)', 'remaining time': '1:51:37', 'throughput': '9876.54', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:09:11,892 >> {'loss': 0.3828, 'grad_norm': 14.81186580657959, 'learning_rate': 1.084995813068843e-07, 'epoch': 0.01898562516951451, 'num_input_tokens_seen': 7277117440, 'completed': '94.11% (3_470 / 3_687)', 'remaining time': '1:51:06', 'throughput': '8518.31', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:09:41,302 >> {'loss': 0.5323, 'grad_norm': 16.475910186767578, 'learning_rate': 1.0842166935527716e-07, 'epoch': 0.01925684838622186, 'num_input_tokens_seen': 7279214592, 'completed': '94.14% (3_471 / 3_687)', 'remaining time': '1:50:32', 'throughput': '8913.36', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:10:11,408 >> {'loss': 0.3918, 'grad_norm': 14.052254676818848, 'learning_rate': 1.0834411276910715e-07, 'epoch': 0.01952807160292921, 'num_input_tokens_seen': 7281311744, 'completed': '94.17% (3_472 / 3_687)', 'remaining time': '1:49:59', 'throughput': '8707.33', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:10:45,040 >> {'loss': 0.5653, 'grad_norm': 17.08545684814453, 'learning_rate': 1.0826691161078895e-07, 'epoch': 0.019799294819636562, 'num_input_tokens_seen': 7283408896, 'completed': '94.20% (3_473 / 3_687)', 'remaining time': '1:49:37', 'throughput': '7794.51', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:11:15,929 >> {'loss': 0.3184, 'grad_norm': 13.46341323852539, 'learning_rate': 1.0819006594245114e-07, 'epoch': 0.02007051803634391, 'num_input_tokens_seen': 7285506048, 'completed': '94.22% (3_474 / 3_687)', 'remaining time': '1:49:07', 'throughput': '8486.73', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:11:47,052 >> {'loss': 0.6962, 'grad_norm': 18.920207977294922, 'learning_rate': 1.0811357582593613e-07, 'epoch': 0.02034174125305126, 'num_input_tokens_seen': 7287603200, 'completed': '94.25% (3_475 / 3_687)', 'remaining time': '1:48:37', 'throughput': '8422.62', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:12:16,495 >> {'loss': 0.8108, 'grad_norm': 18.384702682495117, 'learning_rate': 1.0803744132280025e-07, 'epoch': 0.02061296446975861, 'num_input_tokens_seen': 7289700352, 'completed': '94.28% (3_476 / 3_687)', 'remaining time': '1:48:03', 'throughput': '8903.55', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:12:48,355 >> {'loss': 0.5891, 'grad_norm': 19.510459899902344, 'learning_rate': 1.0796166249431371e-07, 'epoch': 0.020884187686465962, 'num_input_tokens_seen': 7291797504, 'completed': '94.30% (3_477 / 3_687)', 'remaining time': '1:47:35', 'throughput': '8228.04', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:13:21,434 >> {'loss': 0.5444, 'grad_norm': 18.225069046020508, 'learning_rate': 1.0788623940146032e-07, 'epoch': 0.02115541090317331, 'num_input_tokens_seen': 7293894656, 'completed': '94.33% (3_478 / 3_687)', 'remaining time': '1:47:11', 'throughput': '7924.79', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:13:52,075 >> {'loss': 0.3763, 'grad_norm': 13.333983421325684, 'learning_rate': 1.0781117210493781e-07, 'epoch': 0.02142663411988066, 'num_input_tokens_seen': 7295991808, 'completed': '94.36% (3_479 / 3_687)', 'remaining time': '1:46:39', 'throughput': '8555.41', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:14:23,093 >> {'loss': 0.6043, 'grad_norm': 18.025068283081055, 'learning_rate': 1.0773646066515748e-07, 'epoch': 0.021697857336588012, 'num_input_tokens_seen': 7298088960, 'completed': '94.39% (3_480 / 3_687)', 'remaining time': '1:46:09', 'throughput': '8451.36', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:14:54,330 >> {'loss': 0.2852, 'grad_norm': 11.711277961730957, 'learning_rate': 1.0766210514224419e-07, 'epoch': 0.021969080553295363, 'num_input_tokens_seen': 7300186112, 'completed': '94.41% (3_481 / 3_687)', 'remaining time': '1:45:40', 'throughput': '8391.87', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:15:23,424 >> {'loss': 0.4692, 'grad_norm': 17.692590713500977, 'learning_rate': 1.0758810559603651e-07, 'epoch': 0.022240303770002714, 'num_input_tokens_seen': 7302283264, 'completed': '94.44% (3_482 / 3_687)', 'remaining time': '1:45:05', 'throughput': '9010.40', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:15:53,420 >> {'loss': 0.3152, 'grad_norm': 13.046731948852539, 'learning_rate': 1.0751446208608642e-07, 'epoch': 0.02251152698671006, 'num_input_tokens_seen': 7304380416, 'completed': '94.47% (3_483 / 3_687)', 'remaining time': '1:44:32', 'throughput': '8739.34', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:16:21,513 >> {'loss': 0.4343, 'grad_norm': 11.570886611938477, 'learning_rate': 1.0744117467165938e-07, 'epoch': 0.022782750203417412, 'num_input_tokens_seen': 7306477568, 'completed': '94.49% (3_484 / 3_687)', 'remaining time': '1:43:55', 'throughput': '9331.34', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:16:51,699 >> {'loss': 0.3972, 'grad_norm': 13.106870651245117, 'learning_rate': 1.0736824341173442e-07, 'epoch': 0.023053973420124763, 'num_input_tokens_seen': 7308574720, 'completed': '94.52% (3_485 / 3_687)', 'remaining time': '1:43:23', 'throughput': '8684.30', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:17:23,082 >> {'loss': 0.4051, 'grad_norm': 16.621774673461914, 'learning_rate': 1.0729566836500373e-07, 'epoch': 0.023325196636832114, 'num_input_tokens_seen': 7310671872, 'completed': '94.55% (3_486 / 3_687)', 'remaining time': '1:42:54', 'throughput': '8353.02', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:17:52,718 >> {'loss': 0.4515, 'grad_norm': 18.08404541015625, 'learning_rate': 1.07223449589873e-07, 'epoch': 0.023596419853539462, 'num_input_tokens_seen': 7312769024, 'completed': '94.58% (3_487 / 3_687)', 'remaining time': '1:42:21', 'throughput': '8845.48', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:18:22,405 >> {'loss': 0.3258, 'grad_norm': 10.743152618408203, 'learning_rate': 1.0715158714446109e-07, 'epoch': 0.023867643070246813, 'num_input_tokens_seen': 7314866176, 'completed': '94.60% (3_488 / 3_687)', 'remaining time': '1:41:48', 'throughput': '8830.20', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:18:49,871 >> {'loss': 0.4079, 'grad_norm': 16.302248001098633, 'learning_rate': 1.0708008108660026e-07, 'epoch': 0.024138866286954164, 'num_input_tokens_seen': 7316963328, 'completed': '94.63% (3_489 / 3_687)', 'remaining time': '1:41:10', 'throughput': '9544.32', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:19:20,286 >> {'loss': 0.2726, 'grad_norm': 10.805963516235352, 'learning_rate': 1.0700893147383582e-07, 'epoch': 0.024410089503661515, 'num_input_tokens_seen': 7319060480, 'completed': '94.66% (3_490 / 3_687)', 'remaining time': '1:40:39', 'throughput': '8618.93', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:19:52,287 >> {'loss': 0.5156, 'grad_norm': 15.196581840515137, 'learning_rate': 1.069381383634263e-07, 'epoch': 0.024681312720368862, 'num_input_tokens_seen': 7321157632, 'completed': '94.68% (3_491 / 3_687)', 'remaining time': '1:40:11', 'throughput': '8191.69', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:20:23,759 >> {'loss': 0.4177, 'grad_norm': 12.510798454284668, 'learning_rate': 1.0686770181234322e-07, 'epoch': 0.024952535937076213, 'num_input_tokens_seen': 7323254784, 'completed': '94.71% (3_492 / 3_687)', 'remaining time': '1:39:42', 'throughput': '8329.49', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:20:53,794 >> {'loss': 0.356, 'grad_norm': 14.379700660705566, 'learning_rate': 1.0679762187727129e-07, 'epoch': 0.025223759153783564, 'num_input_tokens_seen': 7325351936, 'completed': '94.74% (3_493 / 3_687)', 'remaining time': '1:39:10', 'throughput': '8727.88', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:21:27,264 >> {'loss': 0.2734, 'grad_norm': 10.159677505493164, 'learning_rate': 1.0672789861460818e-07, 'epoch': 0.025494982370490915, 'num_input_tokens_seen': 7327449088, 'completed': '94.77% (3_494 / 3_687)', 'remaining time': '1:38:45', 'throughput': '7832.21', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:21:56,382 >> {'loss': 0.5266, 'grad_norm': 14.783156394958496, 'learning_rate': 1.0665853208046449e-07, 'epoch': 0.025766205587198263, 'num_input_tokens_seen': 7329546240, 'completed': '94.79% (3_495 / 3_687)', 'remaining time': '1:38:11', 'throughput': '9002.80', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:22:27,360 >> {'loss': 0.3847, 'grad_norm': 16.891870498657227, 'learning_rate': 1.0658952233066381e-07, 'epoch': 0.026037428803905614, 'num_input_tokens_seen': 7331643392, 'completed': '94.82% (3_496 / 3_687)', 'remaining time': '1:37:41', 'throughput': '8462.32', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:22:57,510 >> {'loss': 0.5542, 'grad_norm': 16.215871810913086, 'learning_rate': 1.0652086942074255e-07, 'epoch': 0.026308652020612965, 'num_input_tokens_seen': 7333740544, 'completed': '94.85% (3_497 / 3_687)', 'remaining time': '1:37:09', 'throughput': '8694.54', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:23:29,008 >> {'loss': 0.2651, 'grad_norm': 11.77064323425293, 'learning_rate': 1.0645257340594988e-07, 'epoch': 0.026579875237320316, 'num_input_tokens_seen': 7335837696, 'completed': '94.87% (3_498 / 3_687)', 'remaining time': '1:36:40', 'throughput': '8322.56', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:24:02,102 >> {'loss': 0.3247, 'grad_norm': 11.754627227783203, 'learning_rate': 1.06384634341248e-07, 'epoch': 0.026851098454027666, 'num_input_tokens_seen': 7337934848, 'completed': '94.90% (3_499 / 3_687)', 'remaining time': '1:36:14', 'throughput': '7921.30', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:24:32,353 >> {'loss': 0.3777, 'grad_norm': 12.69980239868164, 'learning_rate': 1.0631705228131149e-07, 'epoch': 0.027122321670735014, 'num_input_tokens_seen': 7340032000, 'completed': '94.93% (3_500 / 3_687)', 'remaining time': '1:35:42', 'throughput': '8665.84', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:25:04,367 >> {'loss': 0.2684, 'grad_norm': 13.112504959106445, 'learning_rate': 1.0624982728052795e-07, 'epoch': 0.027393544887442365, 'num_input_tokens_seen': 7342129152, 'completed': '94.96% (3_501 / 3_687)', 'remaining time': '1:35:14', 'throughput': '8188.18', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:25:35,962 >> {'loss': 0.3104, 'grad_norm': 13.424850463867188, 'learning_rate': 1.0618295939299752e-07, 'epoch': 0.027664768104149716, 'num_input_tokens_seen': 7344226304, 'completed': '94.98% (3_502 / 3_687)', 'remaining time': '1:34:45', 'throughput': '8297.13', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:26:05,585 >> {'loss': 0.6812, 'grad_norm': 17.421981811523438, 'learning_rate': 1.0611644867253284e-07, 'epoch': 0.027935991320857067, 'num_input_tokens_seen': 7346323456, 'completed': '95.01% (3_503 / 3_687)', 'remaining time': '1:34:12', 'throughput': '8849.27', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:26:38,509 >> {'loss': 0.6363, 'grad_norm': 18.970033645629883, 'learning_rate': 1.0605029517265918e-07, 'epoch': 0.028207214537564414, 'num_input_tokens_seen': 7348420608, 'completed': '95.04% (3_504 / 3_687)', 'remaining time': '1:33:45', 'throughput': '7961.93', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:27:11,377 >> {'loss': 0.3609, 'grad_norm': 35.321075439453125, 'learning_rate': 1.0598449894661445e-07, 'epoch': 0.028478437754271765, 'num_input_tokens_seen': 7350517760, 'completed': '95.06% (3_505 / 3_687)', 'remaining time': '1:33:18', 'throughput': '7975.71', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:27:41,059 >> {'loss': 0.6572, 'grad_norm': 16.84616470336914, 'learning_rate': 1.0591906004734895e-07, 'epoch': 0.028749660970979116, 'num_input_tokens_seen': 7352614912, 'completed': '95.09% (3_506 / 3_687)', 'remaining time': '1:32:46', 'throughput': '8831.85', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:28:10,997 >> {'loss': 0.87, 'grad_norm': 18.14307403564453, 'learning_rate': 1.0585397852752544e-07, 'epoch': 0.029020884187686467, 'num_input_tokens_seen': 7354712064, 'completed': '95.12% (3_507 / 3_687)', 'remaining time': '1:32:14', 'throughput': '8756.23', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:28:41,623 >> {'loss': 0.5537, 'grad_norm': 16.01613426208496, 'learning_rate': 1.0578925443951895e-07, 'epoch': 0.029292107404393815, 'num_input_tokens_seen': 7356809216, 'completed': '95.15% (3_508 / 3_687)', 'remaining time': '1:31:43', 'throughput': '8559.38', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:29:13,498 >> {'loss': 0.2193, 'grad_norm': 8.767593383789062, 'learning_rate': 1.0572488783541702e-07, 'epoch': 0.029563330621101166, 'num_input_tokens_seen': 7358906368, 'completed': '95.17% (3_509 / 3_687)', 'remaining time': '1:31:14', 'throughput': '8224.29', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:29:43,045 >> {'loss': 0.3346, 'grad_norm': 11.045705795288086, 'learning_rate': 1.0566087876701941e-07, 'epoch': 0.029834553837808517, 'num_input_tokens_seen': 7361003520, 'completed': '95.20% (3_510 / 3_687)', 'remaining time': '1:30:41', 'throughput': '8871.90', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:30:10,770 >> {'loss': 0.5335, 'grad_norm': 17.116188049316406, 'learning_rate': 1.0559722728583825e-07, 'epoch': 0.030105777054515868, 'num_input_tokens_seen': 7363100672, 'completed': '95.23% (3_511 / 3_687)', 'remaining time': '1:30:05', 'throughput': '9455.48', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:30:42,643 >> {'loss': 0.3105, 'grad_norm': 10.3762845993042, 'learning_rate': 1.0553393344309775e-07, 'epoch': 0.030377000271223215, 'num_input_tokens_seen': 7365197824, 'completed': '95.25% (3_512 / 3_687)', 'remaining time': '1:29:37', 'throughput': '8224.64', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:31:14,203 >> {'loss': 0.3106, 'grad_norm': 13.67977237701416, 'learning_rate': 1.054709972897344e-07, 'epoch': 0.030648223487930566, 'num_input_tokens_seen': 7367294976, 'completed': '95.28% (3_513 / 3_687)', 'remaining time': '1:29:07', 'throughput': '8305.99', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:31:41,496 >> {'loss': 0.6964, 'grad_norm': 17.335601806640625, 'learning_rate': 1.0540841887639698e-07, 'epoch': 0.030919446704637917, 'num_input_tokens_seen': 7369392128, 'completed': '95.31% (3_514 / 3_687)', 'remaining time': '1:28:31', 'throughput': '9604.83', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:32:13,941 >> {'loss': 0.3862, 'grad_norm': 15.248687744140625, 'learning_rate': 1.0534619825344596e-07, 'epoch': 0.031190669921345268, 'num_input_tokens_seen': 7371489280, 'completed': '95.33% (3_515 / 3_687)', 'remaining time': '1:28:03', 'throughput': '8079.76', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:32:45,974 >> {'loss': 0.3237, 'grad_norm': 12.66186809539795, 'learning_rate': 1.052843354709543e-07, 'epoch': 0.031461893138052616, 'num_input_tokens_seen': 7373586432, 'completed': '95.36% (3_516 / 3_687)', 'remaining time': '1:27:34', 'throughput': '8183.63', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:33:18,497 >> {'loss': 0.3875, 'grad_norm': 13.880027770996094, 'learning_rate': 1.0522283057870675e-07, 'epoch': 0.03173311635475997, 'num_input_tokens_seen': 7375683584, 'completed': '95.39% (3_517 / 3_687)', 'remaining time': '1:27:06', 'throughput': '8060.15', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:33:53,037 >> {'loss': 0.4796, 'grad_norm': 18.82491683959961, 'learning_rate': 1.0516168362620013e-07, 'epoch': 0.03200433957146732, 'num_input_tokens_seen': 7377780736, 'completed': '95.42% (3_518 / 3_687)', 'remaining time': '1:26:41', 'throughput': '7589.65', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:34:24,570 >> {'loss': 0.4197, 'grad_norm': 16.534194946289062, 'learning_rate': 1.0510089466264321e-07, 'epoch': 0.032275562788174665, 'num_input_tokens_seen': 7379877888, 'completed': '95.44% (3_519 / 3_687)', 'remaining time': '1:26:11', 'throughput': '8313.33', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:34:56,906 >> {'loss': 0.283, 'grad_norm': 10.575153350830078, 'learning_rate': 1.0504046373695648e-07, 'epoch': 0.03254678600488202, 'num_input_tokens_seen': 7381975040, 'completed': '95.47% (3_520 / 3_687)', 'remaining time': '1:25:43', 'throughput': '8106.91', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:35:26,003 >> {'loss': 0.4884, 'grad_norm': 16.936920166015625, 'learning_rate': 1.0498039089777265e-07, 'epoch': 0.03281800922158937, 'num_input_tokens_seen': 7384072192, 'completed': '95.50% (3_521 / 3_687)', 'remaining time': '1:25:09', 'throughput': '9009.18', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:35:58,961 >> {'loss': 0.4334, 'grad_norm': 13.413094520568848, 'learning_rate': 1.0492067619343594e-07, 'epoch': 0.03308923243829672, 'num_input_tokens_seen': 7386169344, 'completed': '95.52% (3_522 / 3_687)', 'remaining time': '1:24:42', 'throughput': '7953.78', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:36:29,930 >> {'loss': 0.2049, 'grad_norm': 9.609505653381348, 'learning_rate': 1.0486131967200254e-07, 'epoch': 0.03336045565500407, 'num_input_tokens_seen': 7388266496, 'completed': '95.55% (3_523 / 3_687)', 'remaining time': '1:24:11', 'throughput': '8464.74', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:37:00,161 >> {'loss': 0.296, 'grad_norm': 13.24195384979248, 'learning_rate': 1.048023213812403e-07, 'epoch': 0.033631678871711417, 'num_input_tokens_seen': 7390363648, 'completed': '95.58% (3_524 / 3_687)', 'remaining time': '1:23:39', 'throughput': '8671.34', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:37:31,606 >> {'loss': 0.4047, 'grad_norm': 15.955248832702637, 'learning_rate': 1.0474368136862876e-07, 'epoch': 0.03390290208841877, 'num_input_tokens_seen': 7392460800, 'completed': '95.61% (3_525 / 3_687)', 'remaining time': '1:23:09', 'throughput': '8336.71', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:38:01,413 >> {'loss': 0.6169, 'grad_norm': 18.489961624145508, 'learning_rate': 1.0468539968135922e-07, 'epoch': 0.03417412530512612, 'num_input_tokens_seen': 7394557952, 'completed': '95.63% (3_526 / 3_687)', 'remaining time': '1:22:37', 'throughput': '8794.71', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:38:32,445 >> {'loss': 0.4134, 'grad_norm': 15.667023658752441, 'learning_rate': 1.046274763663345e-07, 'epoch': 0.034445348521833466, 'num_input_tokens_seen': 7396655104, 'completed': '95.66% (3_527 / 3_687)', 'remaining time': '1:22:07', 'throughput': '8447.70', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:39:04,203 >> {'loss': 0.6246, 'grad_norm': 20.19091796875, 'learning_rate': 1.045699114701691e-07, 'epoch': 0.03471657173854082, 'num_input_tokens_seen': 7398752256, 'completed': '95.69% (3_528 / 3_687)', 'remaining time': '1:21:37', 'throughput': '8254.27', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:39:34,484 >> {'loss': 0.5747, 'grad_norm': 18.28851890563965, 'learning_rate': 1.0451270503918906e-07, 'epoch': 0.03498779495524817, 'num_input_tokens_seen': 7400849408, 'completed': '95.71% (3_529 / 3_687)', 'remaining time': '1:21:06', 'throughput': '8657.19', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:40:04,003 >> {'loss': 0.6841, 'grad_norm': 20.15885353088379, 'learning_rate': 1.0445585711943205e-07, 'epoch': 0.03525901817195552, 'num_input_tokens_seen': 7402946560, 'completed': '95.74% (3_530 / 3_687)', 'remaining time': '1:20:34', 'throughput': '8880.48', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:40:33,626 >> {'loss': 0.5073, 'grad_norm': 15.178775787353516, 'learning_rate': 1.0439936775664699e-07, 'epoch': 0.03553024138866287, 'num_input_tokens_seen': 7405043712, 'completed': '95.77% (3_531 / 3_687)', 'remaining time': '1:20:01', 'throughput': '8849.09', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:41:05,865 >> {'loss': 0.3552, 'grad_norm': 13.089923858642578, 'learning_rate': 1.043432369962943e-07, 'epoch': 0.03580146460537022, 'num_input_tokens_seen': 7407140864, 'completed': '95.80% (3_532 / 3_687)', 'remaining time': '1:19:32', 'throughput': '8131.45', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:41:37,888 >> {'loss': 0.4504, 'grad_norm': 12.704035758972168, 'learning_rate': 1.0428746488354606e-07, 'epoch': 0.03607268782207757, 'num_input_tokens_seen': 7409238016, 'completed': '95.82% (3_533 / 3_687)', 'remaining time': '1:19:03', 'throughput': '8185.93', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:42:08,716 >> {'loss': 0.3546, 'grad_norm': 13.723061561584473, 'learning_rate': 1.0423205146328548e-07, 'epoch': 0.03634391103878492, 'num_input_tokens_seen': 7411335168, 'completed': '95.85% (3_534 / 3_687)', 'remaining time': '1:18:32', 'throughput': '8503.48', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:42:36,673 >> {'loss': 0.5327, 'grad_norm': 15.719633102416992, 'learning_rate': 1.0417699678010708e-07, 'epoch': 0.03661513425549227, 'num_input_tokens_seen': 7413432320, 'completed': '95.88% (3_535 / 3_687)', 'remaining time': '1:17:58', 'throughput': '9376.88', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:43:03,235 >> {'loss': 0.607, 'grad_norm': 18.25145721435547, 'learning_rate': 1.0412230087831689e-07, 'epoch': 0.03688635747219962, 'num_input_tokens_seen': 7415529472, 'completed': '95.90% (3_536 / 3_687)', 'remaining time': '1:17:23', 'throughput': '9869.11', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:43:33,732 >> {'loss': 0.4334, 'grad_norm': 16.473398208618164, 'learning_rate': 1.0406796380193203e-07, 'epoch': 0.03715758068890697, 'num_input_tokens_seen': 7417626624, 'completed': '95.93% (3_537 / 3_687)', 'remaining time': '1:16:52', 'throughput': '8595.66', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:44:02,530 >> {'loss': 0.5355, 'grad_norm': 14.576959609985352, 'learning_rate': 1.0401398559468098e-07, 'epoch': 0.03742880390561432, 'num_input_tokens_seen': 7419723776, 'completed': '95.96% (3_538 / 3_687)', 'remaining time': '1:16:19', 'throughput': '9103.02', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:44:32,393 >> {'loss': 0.4434, 'grad_norm': 23.102252960205078, 'learning_rate': 1.0396036630000324e-07, 'epoch': 0.03770002712232167, 'num_input_tokens_seen': 7421820928, 'completed': '95.99% (3_539 / 3_687)', 'remaining time': '1:15:47', 'throughput': '8778.22', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:45:02,422 >> {'loss': 0.3967, 'grad_norm': 14.231484413146973, 'learning_rate': 1.0390710596104965e-07, 'epoch': 0.03797125033902902, 'num_input_tokens_seen': 7423918080, 'completed': '96.01% (3_540 / 3_687)', 'remaining time': '1:15:16', 'throughput': '8729.57', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:45:35,240 >> {'loss': 0.556, 'grad_norm': 20.063039779663086, 'learning_rate': 1.0385420462068206e-07, 'epoch': 0.03824247355573637, 'num_input_tokens_seen': 7426015232, 'completed': '96.04% (3_541 / 3_687)', 'remaining time': '1:14:47', 'throughput': '7987.81', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:46:06,744 >> {'loss': 0.308, 'grad_norm': 11.724411010742188, 'learning_rate': 1.0380166232147354e-07, 'epoch': 0.03851369677244372, 'num_input_tokens_seen': 7428112384, 'completed': '96.07% (3_542 / 3_687)', 'remaining time': '1:14:17', 'throughput': '8321.02', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:46:38,054 >> {'loss': 0.3655, 'grad_norm': 11.57480239868164, 'learning_rate': 1.0374947910570805e-07, 'epoch': 0.038784919989151075, 'num_input_tokens_seen': 7430209536, 'completed': '96.09% (3_543 / 3_687)', 'remaining time': '1:13:47', 'throughput': '8372.36', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:47:10,586 >> {'loss': 0.3675, 'grad_norm': 13.075111389160156, 'learning_rate': 1.0369765501538067e-07, 'epoch': 0.03905614320585842, 'num_input_tokens_seen': 7432306688, 'completed': '96.12% (3_544 / 3_687)', 'remaining time': '1:13:18', 'throughput': '8058.20', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:47:41,385 >> {'loss': 0.3647, 'grad_norm': 15.664528846740723, 'learning_rate': 1.0364619009219743e-07, 'epoch': 0.03932736642256577, 'num_input_tokens_seen': 7434403840, 'completed': '96.15% (3_545 / 3_687)', 'remaining time': '1:12:47', 'throughput': '8511.42', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:48:11,550 >> {'loss': 0.3922, 'grad_norm': 16.314237594604492, 'learning_rate': 1.0359508437757544e-07, 'epoch': 0.039598589639273124, 'num_input_tokens_seen': 7436500992, 'completed': '96.18% (3_546 / 3_687)', 'remaining time': '1:12:16', 'throughput': '8690.39', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:48:39,821 >> {'loss': 0.4337, 'grad_norm': 15.859907150268555, 'learning_rate': 1.0354433791264255e-07, 'epoch': 0.03986981285598047, 'num_input_tokens_seen': 7438598144, 'completed': '96.20% (3_547 / 3_687)', 'remaining time': '1:11:43', 'throughput': '9272.60', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:49:12,038 >> {'loss': 0.639, 'grad_norm': 21.789993286132812, 'learning_rate': 1.0349395073823768e-07, 'epoch': 0.04014103607268782, 'num_input_tokens_seen': 7440695296, 'completed': '96.23% (3_548 / 3_687)', 'remaining time': '1:11:13', 'throughput': '8136.73', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:49:41,292 >> {'loss': 0.3828, 'grad_norm': 13.827730178833008, 'learning_rate': 1.0344392289491038e-07, 'epoch': 0.040412259289395173, 'num_input_tokens_seen': 7442792448, 'completed': '96.26% (3_549 / 3_687)', 'remaining time': '1:10:41', 'throughput': '8960.97', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:50:08,716 >> {'loss': 0.7685, 'grad_norm': 18.37379264831543, 'learning_rate': 1.0339425442292118e-07, 'epoch': 0.04068348250610252, 'num_input_tokens_seen': 7444889600, 'completed': '96.28% (3_550 / 3_687)', 'remaining time': '1:10:08', 'throughput': '9558.77', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:50:39,674 >> {'loss': 0.508, 'grad_norm': 14.437710762023926, 'learning_rate': 1.0334494536224146e-07, 'epoch': 0.040954705722809875, 'num_input_tokens_seen': 7446986752, 'completed': '96.31% (3_551 / 3_687)', 'remaining time': '1:09:37', 'throughput': '8467.81', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:51:09,383 >> {'loss': 0.5133, 'grad_norm': 15.604809761047363, 'learning_rate': 1.0329599575255321e-07, 'epoch': 0.04122592893951722, 'num_input_tokens_seen': 7449083904, 'completed': '96.34% (3_552 / 3_687)', 'remaining time': '1:09:05', 'throughput': '8823.65', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:51:39,437 >> {'loss': 0.311, 'grad_norm': 11.595867156982422, 'learning_rate': 1.0324740563324923e-07, 'epoch': 0.04149715215622457, 'num_input_tokens_seen': 7451181056, 'completed': '96.37% (3_553 / 3_687)', 'remaining time': '1:08:34', 'throughput': '8722.51', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:52:05,300 >> {'loss': 0.8097, 'grad_norm': 20.499563217163086, 'learning_rate': 1.0319917504343297e-07, 'epoch': 0.041768375372931925, 'num_input_tokens_seen': 7453278208, 'completed': '96.39% (3_554 / 3_687)', 'remaining time': '1:07:59', 'throughput': '10135.87', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:52:38,488 >> {'loss': 0.3584, 'grad_norm': 15.316178321838379, 'learning_rate': 1.0315130402191866e-07, 'epoch': 0.04203959858963927, 'num_input_tokens_seen': 7455375360, 'completed': '96.42% (3_555 / 3_687)', 'remaining time': '1:07:31', 'throughput': '7898.85', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:53:08,227 >> {'loss': 0.4657, 'grad_norm': 13.93494701385498, 'learning_rate': 1.0310379260723094e-07, 'epoch': 0.04231082180634662, 'num_input_tokens_seen': 7457472512, 'completed': '96.45% (3_556 / 3_687)', 'remaining time': '1:06:59', 'throughput': '8814.68', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:53:36,083 >> {'loss': 0.6825, 'grad_norm': 18.21516227722168, 'learning_rate': 1.0305664083760532e-07, 'epoch': 0.042582045023053974, 'num_input_tokens_seen': 7459569664, 'completed': '96.47% (3_557 / 3_687)', 'remaining time': '1:06:26', 'throughput': '9410.65', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:54:06,370 >> {'loss': 0.5517, 'grad_norm': 15.643766403198242, 'learning_rate': 1.0300984875098772e-07, 'epoch': 0.04285326823976132, 'num_input_tokens_seen': 7461666816, 'completed': '96.50% (3_558 / 3_687)', 'remaining time': '1:05:55', 'throughput': '8655.35', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:54:37,561 >> {'loss': 0.4861, 'grad_norm': 19.415977478027344, 'learning_rate': 1.0296341638503458e-07, 'epoch': 0.043124491456468676, 'num_input_tokens_seen': 7463763968, 'completed': '96.53% (3_559 / 3_687)', 'remaining time': '1:05:25', 'throughput': '8404.55', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:55:07,499 >> {'loss': 0.3778, 'grad_norm': 14.686422348022461, 'learning_rate': 1.029173437771129e-07, 'epoch': 0.043395714673176024, 'num_input_tokens_seen': 7465861120, 'completed': '96.56% (3_560 / 3_687)', 'remaining time': '1:04:54', 'throughput': '8756.28', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:55:38,483 >> {'loss': 0.7157, 'grad_norm': 18.66650390625, 'learning_rate': 1.0287163096430024e-07, 'epoch': 0.04366693788988337, 'num_input_tokens_seen': 7467958272, 'completed': '96.58% (3_561 / 3_687)', 'remaining time': '1:04:23', 'throughput': '8460.63', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:56:08,984 >> {'loss': 0.331, 'grad_norm': 12.286962509155273, 'learning_rate': 1.0282627798338444e-07, 'epoch': 0.043938161106590726, 'num_input_tokens_seen': 7470055424, 'completed': '96.61% (3_562 / 3_687)', 'remaining time': '1:03:53', 'throughput': '8594.50', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:56:39,513 >> {'loss': 0.5657, 'grad_norm': 14.363128662109375, 'learning_rate': 1.0278128487086387e-07, 'epoch': 0.04420938432329807, 'num_input_tokens_seen': 7472152576, 'completed': '96.64% (3_563 / 3_687)', 'remaining time': '1:03:22', 'throughput': '8586.66', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:57:12,354 >> {'loss': 0.3759, 'grad_norm': 15.838930130004883, 'learning_rate': 1.0273665166294735e-07, 'epoch': 0.04448060754000543, 'num_input_tokens_seen': 7474249728, 'completed': '96.66% (3_564 / 3_687)', 'remaining time': '1:02:53', 'throughput': '7982.21', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:57:42,895 >> {'loss': 0.3603, 'grad_norm': 12.090499877929688, 'learning_rate': 1.0269237839555398e-07, 'epoch': 0.044751830756712775, 'num_input_tokens_seen': 7476346880, 'completed': '96.69% (3_565 / 3_687)', 'remaining time': '1:02:22', 'throughput': '8583.52', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:58:14,155 >> {'loss': 0.2938, 'grad_norm': 11.031499862670898, 'learning_rate': 1.0264846510431307e-07, 'epoch': 0.04502305397342012, 'num_input_tokens_seen': 7478444032, 'completed': '96.72% (3_566 / 3_687)', 'remaining time': '1:01:52', 'throughput': '8385.90', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:58:44,818 >> {'loss': 0.8536, 'grad_norm': 18.495397567749023, 'learning_rate': 1.0260491182456453e-07, 'epoch': 0.04529427719012748, 'num_input_tokens_seen': 7480541184, 'completed': '96.75% (3_567 / 3_687)', 'remaining time': '1:01:21', 'throughput': '8549.03', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:59:15,347 >> {'loss': 0.3868, 'grad_norm': 12.062615394592285, 'learning_rate': 1.0256171859135826e-07, 'epoch': 0.045565500406834825, 'num_input_tokens_seen': 7482638336, 'completed': '96.77% (3_568 / 3_687)', 'remaining time': '1:00:50', 'throughput': '8586.80', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 09:59:45,001 >> {'loss': 0.3781, 'grad_norm': 11.398265838623047, 'learning_rate': 1.0251888543945458e-07, 'epoch': 0.04583672362354217, 'num_input_tokens_seen': 7484735488, 'completed': '96.80% (3_569 / 3_687)', 'remaining time': '1:00:19', 'throughput': '8840.13', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:00:15,054 >> {'loss': 0.368, 'grad_norm': 11.258602142333984, 'learning_rate': 1.0247641240332397e-07, 'epoch': 0.04610794684024953, 'num_input_tokens_seen': 7486832640, 'completed': '96.83% (3_570 / 3_687)', 'remaining time': '0:59:48', 'throughput': '8722.78', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:00:48,632 >> {'loss': 0.389, 'grad_norm': 12.560171127319336, 'learning_rate': 1.0243429951714714e-07, 'epoch': 0.046379170056956874, 'num_input_tokens_seen': 7488929792, 'completed': '96.85% (3_571 / 3_687)', 'remaining time': '0:59:19', 'throughput': '7807.03', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:01:18,710 >> {'loss': 0.4665, 'grad_norm': 13.790288925170898, 'learning_rate': 1.023925468148149e-07, 'epoch': 0.04665039327366423, 'num_input_tokens_seen': 7491026944, 'completed': '96.88% (3_572 / 3_687)', 'remaining time': '0:58:48', 'throughput': '8715.34', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:01:51,663 >> {'loss': 0.1851, 'grad_norm': 8.38277530670166, 'learning_rate': 1.023511543299283e-07, 'epoch': 0.046921616490371576, 'num_input_tokens_seen': 7493124096, 'completed': '96.91% (3_573 / 3_687)', 'remaining time': '0:58:19', 'throughput': '7955.12', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:02:21,811 >> {'loss': 0.4549, 'grad_norm': 13.927973747253418, 'learning_rate': 1.0231012209579831e-07, 'epoch': 0.047192839707078924, 'num_input_tokens_seen': 7495221248, 'completed': '96.94% (3_574 / 3_687)', 'remaining time': '0:57:48', 'throughput': '8695.35', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:02:54,982 >> {'loss': 0.6259, 'grad_norm': 23.091039657592773, 'learning_rate': 1.0226945014544624e-07, 'epoch': 0.04746406292378628, 'num_input_tokens_seen': 7497318400, 'completed': '96.96% (3_575 / 3_687)', 'remaining time': '0:57:19', 'throughput': '7902.62', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:03:28,547 >> {'loss': 0.6754, 'grad_norm': 17.690340042114258, 'learning_rate': 1.0222913851160335e-07, 'epoch': 0.047735286140493625, 'num_input_tokens_seen': 7499415552, 'completed': '96.99% (3_576 / 3_687)', 'remaining time': '0:56:50', 'throughput': '7810.03', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:03:57,934 >> {'loss': 0.4299, 'grad_norm': 13.187427520751953, 'learning_rate': 1.0218918722671074e-07, 'epoch': 0.04800650935720097, 'num_input_tokens_seen': 7501512704, 'completed': '97.02% (3_577 / 3_687)', 'remaining time': '0:56:18', 'throughput': '8920.58', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:04:27,670 >> {'loss': 0.5009, 'grad_norm': 15.680404663085938, 'learning_rate': 1.0214959632291984e-07, 'epoch': 0.04827773257390833, 'num_input_tokens_seen': 7503609856, 'completed': '97.04% (3_578 / 3_687)', 'remaining time': '0:55:47', 'throughput': '8815.63', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:04:58,618 >> {'loss': 0.3405, 'grad_norm': 13.153005599975586, 'learning_rate': 1.0211036583209181e-07, 'epoch': 0.048548955790615675, 'num_input_tokens_seen': 7505707008, 'completed': '97.07% (3_579 / 3_687)', 'remaining time': '0:55:16', 'throughput': '8470.43', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:05:29,331 >> {'loss': 0.2888, 'grad_norm': 11.65854549407959, 'learning_rate': 1.0207149578579789e-07, 'epoch': 0.04882017900732303, 'num_input_tokens_seen': 7507804160, 'completed': '97.10% (3_580 / 3_687)', 'remaining time': '0:54:46', 'throughput': '8535.33', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:05:58,577 >> {'loss': 0.4813, 'grad_norm': 24.025188446044922, 'learning_rate': 1.0203298621531923e-07, 'epoch': 0.04909140222403038, 'num_input_tokens_seen': 7509901312, 'completed': '97.13% (3_581 / 3_687)', 'remaining time': '0:54:14', 'throughput': '8963.54', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:06:28,073 >> {'loss': 0.5313, 'grad_norm': 22.057125091552734, 'learning_rate': 1.0199483715164687e-07, 'epoch': 0.049362625440737724, 'num_input_tokens_seen': 7511998464, 'completed': '97.15% (3_582 / 3_687)', 'remaining time': '0:53:43', 'throughput': '8887.45', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:06:57,794 >> {'loss': 0.8506, 'grad_norm': 21.940088272094727, 'learning_rate': 1.0195704862548167e-07, 'epoch': 0.04963384865744508, 'num_input_tokens_seen': 7514095616, 'completed': '97.18% (3_583 / 3_687)', 'remaining time': '0:53:11', 'throughput': '8819.99', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:07:28,347 >> {'loss': 0.4072, 'grad_norm': 14.703182220458984, 'learning_rate': 1.0191962066723448e-07, 'epoch': 0.049905071874152426, 'num_input_tokens_seen': 7516192768, 'completed': '97.21% (3_584 / 3_687)', 'remaining time': '0:52:41', 'throughput': '8579.93', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:07:58,886 >> {'loss': 0.2465, 'grad_norm': 8.556319236755371, 'learning_rate': 1.0188255330702583e-07, 'epoch': 0.05017629509085978, 'num_input_tokens_seen': 7518289920, 'completed': '97.23% (3_585 / 3_687)', 'remaining time': '0:52:10', 'throughput': '8583.91', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:08:28,637 >> {'loss': 0.6037, 'grad_norm': 18.820598602294922, 'learning_rate': 1.0184584657468615e-07, 'epoch': 0.05044751830756713, 'num_input_tokens_seen': 7520387072, 'completed': '97.26% (3_586 / 3_687)', 'remaining time': '0:51:39', 'throughput': '8811.26', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:08:58,868 >> {'loss': 0.3087, 'grad_norm': 11.99779224395752, 'learning_rate': 1.018095004997556e-07, 'epoch': 0.050718741524274476, 'num_input_tokens_seen': 7522484224, 'completed': '97.29% (3_587 / 3_687)', 'remaining time': '0:51:08', 'throughput': '8671.60', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:09:30,302 >> {'loss': 0.2991, 'grad_norm': 11.369059562683105, 'learning_rate': 1.0177351511148414e-07, 'epoch': 0.05098996474098183, 'num_input_tokens_seen': 7524581376, 'completed': '97.31% (3_588 / 3_687)', 'remaining time': '0:50:37', 'throughput': '8339.29', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:10:02,432 >> {'loss': 0.476, 'grad_norm': 16.702545166015625, 'learning_rate': 1.0173789043883147e-07, 'epoch': 0.05126118795768918, 'num_input_tokens_seen': 7526678528, 'completed': '97.34% (3_589 / 3_687)', 'remaining time': '0:50:07', 'throughput': '8158.99', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:10:32,663 >> {'loss': 0.3989, 'grad_norm': 13.388700485229492, 'learning_rate': 1.0170262651046687e-07, 'epoch': 0.051532411174396525, 'num_input_tokens_seen': 7528775680, 'completed': '97.37% (3_590 / 3_687)', 'remaining time': '0:49:37', 'throughput': '8671.23', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:11:01,710 >> {'loss': 0.476, 'grad_norm': 16.760042190551758, 'learning_rate': 1.0166772335476951e-07, 'epoch': 0.05180363439110388, 'num_input_tokens_seen': 7530872832, 'completed': '97.40% (3_591 / 3_687)', 'remaining time': '0:49:05', 'throughput': '9024.96', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:11:34,673 >> {'loss': 0.4348, 'grad_norm': 16.29047203063965, 'learning_rate': 1.0163318099982808e-07, 'epoch': 0.05207485760781123, 'num_input_tokens_seen': 7532969984, 'completed': '97.42% (3_592 / 3_687)', 'remaining time': '0:48:35', 'throughput': '7952.66', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:12:02,911 >> {'loss': 0.4399, 'grad_norm': 14.072612762451172, 'learning_rate': 1.0159899947344094e-07, 'epoch': 0.05234608082451858, 'num_input_tokens_seen': 7535067136, 'completed': '97.45% (3_593 / 3_687)', 'remaining time': '0:48:04', 'throughput': '9283.45', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:12:34,519 >> {'loss': 0.3253, 'grad_norm': 12.154025077819824, 'learning_rate': 1.0156517880311614e-07, 'epoch': 0.05261730404122593, 'num_input_tokens_seen': 7537164288, 'completed': '97.48% (3_594 / 3_687)', 'remaining time': '0:47:33', 'throughput': '8293.61', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:13:06,746 >> {'loss': 0.3324, 'grad_norm': 14.722796440124512, 'learning_rate': 1.0153171901607118e-07, 'epoch': 0.05288852725793328, 'num_input_tokens_seen': 7539261440, 'completed': '97.50% (3_595 / 3_687)', 'remaining time': '0:47:03', 'throughput': '8134.30', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:13:34,903 >> {'loss': 0.6772, 'grad_norm': 17.35401153564453, 'learning_rate': 1.0149862013923329e-07, 'epoch': 0.05315975047464063, 'num_input_tokens_seen': 7541358592, 'completed': '97.53% (3_596 / 3_687)', 'remaining time': '0:46:32', 'throughput': '9310.09', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:14:05,394 >> {'loss': 0.4358, 'grad_norm': 13.542460441589355, 'learning_rate': 1.0146588219923917e-07, 'epoch': 0.05343097369134798, 'num_input_tokens_seen': 7543455744, 'completed': '97.56% (3_597 / 3_687)', 'remaining time': '0:46:01', 'throughput': '8597.27', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:14:38,117 >> {'loss': 0.3717, 'grad_norm': 11.1475248336792, 'learning_rate': 1.0143350522243509e-07, 'epoch': 0.05370219690805533, 'num_input_tokens_seen': 7545552896, 'completed': '97.59% (3_598 / 3_687)', 'remaining time': '0:45:31', 'throughput': '8011.08', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:15:09,347 >> {'loss': 0.4832, 'grad_norm': 14.331890106201172, 'learning_rate': 1.0140148923487675e-07, 'epoch': 0.05397342012476268, 'num_input_tokens_seen': 7547650048, 'completed': '97.61% (3_599 / 3_687)', 'remaining time': '0:45:01', 'throughput': '8394.01', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:15:39,026 >> {'loss': 0.4623, 'grad_norm': 13.898726463317871, 'learning_rate': 1.0136983426232945e-07, 'epoch': 0.05424464334147003, 'num_input_tokens_seen': 7549747200, 'completed': '97.64% (3_600 / 3_687)', 'remaining time': '0:44:29', 'throughput': '8832.43', 'gpu_mem_free': '5581MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-21 10:16:03,336 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3600
[INFO|configuration_utils.py:472] 2025-01-21 10:16:03,339 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3600/config.json
[INFO|configuration_utils.py:807] 2025-01-21 10:16:03,341 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3600/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-21 10:17:00,824 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3600/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-21 10:17:00,828 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3600/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-21 10:17:00,829 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3600/special_tokens_map.json
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[WARNING|trainer.py:869] 2025-01-21 10:20:47,285 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 7200, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
01/21/2025 10:20:47 - WARNING - streaming.base.dataset - Because `shuffle_block_size` was not specified, it will default to max(4_000_000 // num_canonical_nodes, 1 << 18) if num_canonical_nodes is not None, otherwise 262144. Prior to Streaming v0.7.0, `shuffle_block_size` defaulted to 262144.
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-21 10:21:18,152 >> {'loss': 0.2438, 'grad_norm': 9.847304344177246, 'learning_rate': 1.0133854033026789e-07, 'epoch': 0.05451586655817738, 'num_input_tokens_seen': 7551844352, 'completed': '97.67% (3_601 / 3_687)', 'remaining time': '0:46:11', 'throughput': '773.00', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:21:48,132 >> {'loss': 0.2657, 'grad_norm': 16.682790756225586, 'learning_rate': 1.0130760746387622e-07, 'epoch': 0.05478708977488473, 'num_input_tokens_seen': 7553941504, 'completed': '97.69% (3_602 / 3_687)', 'remaining time': '0:45:38', 'throughput': '8743.82', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:22:17,497 >> {'loss': 0.3056, 'grad_norm': 12.895301818847656, 'learning_rate': 1.0127703568804805e-07, 'epoch': 0.05505831299159208, 'num_input_tokens_seen': 7556038656, 'completed': '97.72% (3_603 / 3_687)', 'remaining time': '0:45:04', 'throughput': '8927.21', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:22:46,809 >> {'loss': 0.368, 'grad_norm': 14.191023826599121, 'learning_rate': 1.0124682502738638e-07, 'epoch': 0.05532953620829943, 'num_input_tokens_seen': 7558135808, 'completed': '97.75% (3_604 / 3_687)', 'remaining time': '0:44:31', 'throughput': '8943.31', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:23:20,426 >> {'loss': 0.5653, 'grad_norm': 18.443326950073242, 'learning_rate': 1.0121697550620365e-07, 'epoch': 0.05560075942500678, 'num_input_tokens_seen': 7560232960, 'completed': '97.78% (3_605 / 3_687)', 'remaining time': '0:43:59', 'throughput': '7797.86', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:23:51,667 >> {'loss': 0.4285, 'grad_norm': 15.984533309936523, 'learning_rate': 1.0118748714852156e-07, 'epoch': 0.055871982641714134, 'num_input_tokens_seen': 7562330112, 'completed': '97.80% (3_606 / 3_687)', 'remaining time': '0:43:27', 'throughput': '8391.16', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:24:22,000 >> {'loss': 0.2091, 'grad_norm': 12.376087188720703, 'learning_rate': 1.011583599780712e-07, 'epoch': 0.05614320585842148, 'num_input_tokens_seen': 7564427264, 'completed': '97.83% (3_607 / 3_687)', 'remaining time': '0:42:54', 'throughput': '8642.20', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:24:53,052 >> {'loss': 0.6654, 'grad_norm': 21.866743087768555, 'learning_rate': 1.01129594018293e-07, 'epoch': 0.05641442907512883, 'num_input_tokens_seen': 7566524416, 'completed': '97.86% (3_608 / 3_687)', 'remaining time': '0:42:21', 'throughput': '8442.07', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:25:25,906 >> {'loss': 0.4428, 'grad_norm': 18.539138793945312, 'learning_rate': 1.0110118929233682e-07, 'epoch': 0.05668565229183618, 'num_input_tokens_seen': 7568621568, 'completed': '97.88% (3_609 / 3_687)', 'remaining time': '0:41:49', 'throughput': '7979.00', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:25:55,085 >> {'loss': 0.6288, 'grad_norm': 19.51919174194336, 'learning_rate': 1.0107314582306156e-07, 'epoch': 0.05695687550854353, 'num_input_tokens_seen': 7570718720, 'completed': '97.91% (3_610 / 3_687)', 'remaining time': '0:41:16', 'throughput': '8983.96', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:26:24,435 >> {'loss': 0.5354, 'grad_norm': 16.458036422729492, 'learning_rate': 1.0104546363303566e-07, 'epoch': 0.05722809872525088, 'num_input_tokens_seen': 7572815872, 'completed': '97.94% (3_611 / 3_687)', 'remaining time': '0:40:43', 'throughput': '8931.78', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:26:52,453 >> {'loss': 0.6547, 'grad_norm': 18.08392906188965, 'learning_rate': 1.0101814274453661e-07, 'epoch': 0.05749932194195823, 'num_input_tokens_seen': 7574913024, 'completed': '97.97% (3_612 / 3_687)', 'remaining time': '0:40:09', 'throughput': '9356.29', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:27:28,110 >> {'loss': 0.5016, 'grad_norm': 16.274539947509766, 'learning_rate': 1.0099118317955127e-07, 'epoch': 0.05777054515866558, 'num_input_tokens_seen': 7577010176, 'completed': '97.99% (3_613 / 3_687)', 'remaining time': '0:39:38', 'throughput': '7351.69', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:27:56,978 >> {'loss': 0.5801, 'grad_norm': 16.817398071289062, 'learning_rate': 1.0096458495977564e-07, 'epoch': 0.058041768375372935, 'num_input_tokens_seen': 7579107328, 'completed': '98.02% (3_614 / 3_687)', 'remaining time': '0:39:05', 'throughput': '9080.79', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:28:29,473 >> {'loss': 0.3728, 'grad_norm': 12.463071823120117, 'learning_rate': 1.0093834810661498e-07, 'epoch': 0.05831299159208028, 'num_input_tokens_seen': 7581204480, 'completed': '98.05% (3_615 / 3_687)', 'remaining time': '0:38:33', 'throughput': '8067.42', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:28:59,063 >> {'loss': 0.3667, 'grad_norm': 11.567461967468262, 'learning_rate': 1.0091247264118372e-07, 'epoch': 0.05858421480878763, 'num_input_tokens_seen': 7583301632, 'completed': '98.07% (3_616 / 3_687)', 'remaining time': '0:38:00', 'throughput': '8859.14', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:29:32,771 >> {'loss': 0.2851, 'grad_norm': 10.735450744628906, 'learning_rate': 1.0088695858430539e-07, 'epoch': 0.058855438025494984, 'num_input_tokens_seen': 7585398784, 'completed': '98.10% (3_617 / 3_687)', 'remaining time': '0:37:28', 'throughput': '7776.94', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:30:04,562 >> {'loss': 0.4644, 'grad_norm': 13.669662475585938, 'learning_rate': 1.0086180595651278e-07, 'epoch': 0.05912666124220233, 'num_input_tokens_seen': 7587495936, 'completed': '98.13% (3_618 / 3_687)', 'remaining time': '0:36:56', 'throughput': '8245.79', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:30:32,384 >> {'loss': 0.5703, 'grad_norm': 15.31719970703125, 'learning_rate': 1.0083701477804778e-07, 'epoch': 0.059397884458909686, 'num_input_tokens_seen': 7589593088, 'completed': '98.16% (3_619 / 3_687)', 'remaining time': '0:36:23', 'throughput': '9422.07', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:31:05,209 >> {'loss': 0.5415, 'grad_norm': 14.730269432067871, 'learning_rate': 1.0081258506886134e-07, 'epoch': 0.059669107675617034, 'num_input_tokens_seen': 7591690240, 'completed': '98.18% (3_620 / 3_687)', 'remaining time': '0:35:51', 'throughput': '7986.24', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:31:35,856 >> {'loss': 0.3662, 'grad_norm': 12.143908500671387, 'learning_rate': 1.0078851684861357e-07, 'epoch': 0.05994033089232438, 'num_input_tokens_seen': 7593787392, 'completed': '98.21% (3_621 / 3_687)', 'remaining time': '0:35:18', 'throughput': '8553.55', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:32:05,063 >> {'loss': 0.2725, 'grad_norm': 12.585685729980469, 'learning_rate': 1.0076481013667376e-07, 'epoch': 0.060211554109031735, 'num_input_tokens_seen': 7595884544, 'completed': '98.24% (3_622 / 3_687)', 'remaining time': '0:34:45', 'throughput': '8975.49', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:32:36,610 >> {'loss': 0.307, 'grad_norm': 11.40467643737793, 'learning_rate': 1.0074146495212001e-07, 'epoch': 0.06048277732573908, 'num_input_tokens_seen': 7597981696, 'completed': '98.26% (3_623 / 3_687)', 'remaining time': '0:34:13', 'throughput': '8309.58', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:33:07,763 >> {'loss': 0.7096, 'grad_norm': 16.664461135864258, 'learning_rate': 1.0071848131373972e-07, 'epoch': 0.06075400054244643, 'num_input_tokens_seen': 7600078848, 'completed': '98.29% (3_624 / 3_687)', 'remaining time': '0:33:41', 'throughput': '8414.76', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:33:37,940 >> {'loss': 0.3869, 'grad_norm': 11.929876327514648, 'learning_rate': 1.0069585924002924e-07, 'epoch': 0.061025223759153785, 'num_input_tokens_seen': 7602176000, 'completed': '98.32% (3_625 / 3_687)', 'remaining time': '0:33:08', 'throughput': '8686.73', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:34:07,063 >> {'loss': 0.7844, 'grad_norm': 20.532058715820312, 'learning_rate': 1.0067359874919395e-07, 'epoch': 0.06129644697586113, 'num_input_tokens_seen': 7604273152, 'completed': '98.35% (3_626 / 3_687)', 'remaining time': '0:32:35', 'throughput': '9001.47', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:34:39,284 >> {'loss': 0.2573, 'grad_norm': 10.287115097045898, 'learning_rate': 1.0065169985914826e-07, 'epoch': 0.06156767019256849, 'num_input_tokens_seen': 7606370304, 'completed': '98.37% (3_627 / 3_687)', 'remaining time': '0:32:03', 'throughput': '8135.77', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:35:08,076 >> {'loss': 0.6302, 'grad_norm': 17.162843704223633, 'learning_rate': 1.0063016258751553e-07, 'epoch': 0.061838893409275834, 'num_input_tokens_seen': 7608467456, 'completed': '98.40% (3_628 / 3_687)', 'remaining time': '0:31:30', 'throughput': '9104.80', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:35:37,175 >> {'loss': 0.4309, 'grad_norm': 14.848957061767578, 'learning_rate': 1.0060898695162816e-07, 'epoch': 0.06211011662598318, 'num_input_tokens_seen': 7610564608, 'completed': '98.43% (3_629 / 3_687)', 'remaining time': '0:30:57', 'throughput': '9008.68', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:36:08,252 >> {'loss': 0.4711, 'grad_norm': 17.66904640197754, 'learning_rate': 1.005881729685275e-07, 'epoch': 0.062381339842690536, 'num_input_tokens_seen': 7612661760, 'completed': '98.45% (3_630 / 3_687)', 'remaining time': '0:30:25', 'throughput': '8435.11', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:36:35,858 >> {'loss': 0.8298, 'grad_norm': 20.336261749267578, 'learning_rate': 1.0056772065496387e-07, 'epoch': 0.06265256305939788, 'num_input_tokens_seen': 7614758912, 'completed': '98.48% (3_631 / 3_687)', 'remaining time': '0:29:52', 'throughput': '9495.95', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:37:08,341 >> {'loss': 0.6695, 'grad_norm': 22.26197052001953, 'learning_rate': 1.005476300273965e-07, 'epoch': 0.06292378627610523, 'num_input_tokens_seen': 7616856064, 'completed': '98.51% (3_632 / 3_687)', 'remaining time': '0:29:20', 'throughput': '8070.40', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:37:36,985 >> {'loss': 0.4288, 'grad_norm': 14.296065330505371, 'learning_rate': 1.0052790110199348e-07, 'epoch': 0.06319500949281258, 'num_input_tokens_seen': 7618953216, 'completed': '98.54% (3_633 / 3_687)', 'remaining time': '0:28:47', 'throughput': '9151.62', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:38:08,625 >> {'loss': 0.6585, 'grad_norm': 18.81780433654785, 'learning_rate': 1.0050853389463205e-07, 'epoch': 0.06346623270951994, 'num_input_tokens_seen': 7621050368, 'completed': '98.56% (3_634 / 3_687)', 'remaining time': '0:28:15', 'throughput': '8285.09', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:38:41,697 >> {'loss': 0.3255, 'grad_norm': 11.14263916015625, 'learning_rate': 1.0048952842089805e-07, 'epoch': 0.06373745592622729, 'num_input_tokens_seen': 7623147520, 'completed': '98.59% (3_635 / 3_687)', 'remaining time': '0:27:44', 'throughput': '7926.69', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:39:11,240 >> {'loss': 0.315, 'grad_norm': 12.55125904083252, 'learning_rate': 1.0047088469608648e-07, 'epoch': 0.06400867914293464, 'num_input_tokens_seen': 7625244672, 'completed': '98.62% (3_636 / 3_687)', 'remaining time': '0:27:11', 'throughput': '8873.19', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:39:40,749 >> {'loss': 0.3163, 'grad_norm': 20.127553939819336, 'learning_rate': 1.00452602735201e-07, 'epoch': 0.06427990235964198, 'num_input_tokens_seen': 7627341824, 'completed': '98.64% (3_637 / 3_687)', 'remaining time': '0:26:39', 'throughput': '8883.41', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:40:09,442 >> {'loss': 0.6731, 'grad_norm': 18.10858726501465, 'learning_rate': 1.0043468255295435e-07, 'epoch': 0.06455112557634933, 'num_input_tokens_seen': 7629438976, 'completed': '98.67% (3_638 / 3_687)', 'remaining time': '0:26:06', 'throughput': '9136.43', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:40:38,660 >> {'loss': 0.5683, 'grad_norm': 15.758679389953613, 'learning_rate': 1.0041712416376795e-07, 'epoch': 0.06482234879305669, 'num_input_tokens_seen': 7631536128, 'completed': '98.70% (3_639 / 3_687)', 'remaining time': '0:25:33', 'throughput': '8971.84', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:41:12,266 >> {'loss': 0.2851, 'grad_norm': 12.081664085388184, 'learning_rate': 1.0039992758177211e-07, 'epoch': 0.06509357200976404, 'num_input_tokens_seen': 7633633280, 'completed': '98.73% (3_640 / 3_687)', 'remaining time': '0:25:02', 'throughput': '7800.60', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:41:42,671 >> {'loss': 0.2524, 'grad_norm': 12.943633079528809, 'learning_rate': 1.0038309282080596e-07, 'epoch': 0.06536479522647139, 'num_input_tokens_seen': 7635730432, 'completed': '98.75% (3_641 / 3_687)', 'remaining time': '0:24:29', 'throughput': '8621.62', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:42:14,102 >> {'loss': 0.5379, 'grad_norm': 16.313081741333008, 'learning_rate': 1.0036661989441755e-07, 'epoch': 0.06563601844317873, 'num_input_tokens_seen': 7637827584, 'completed': '98.78% (3_642 / 3_687)', 'remaining time': '0:23:57', 'throughput': '8340.21', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:42:43,447 >> {'loss': 0.3989, 'grad_norm': 14.559349060058594, 'learning_rate': 1.0035050881586364e-07, 'epoch': 0.06590724165988608, 'num_input_tokens_seen': 7639924736, 'completed': '98.81% (3_643 / 3_687)', 'remaining time': '0:23:25', 'throughput': '8933.39', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:43:14,381 >> {'loss': 0.3097, 'grad_norm': 12.672063827514648, 'learning_rate': 1.0033475959810974e-07, 'epoch': 0.06617846487659344, 'num_input_tokens_seen': 7642021888, 'completed': '98.83% (3_644 / 3_687)', 'remaining time': '0:22:53', 'throughput': '8474.26', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:43:46,823 >> {'loss': 0.4511, 'grad_norm': 12.448989868164062, 'learning_rate': 1.0031937225383036e-07, 'epoch': 0.06644968809330079, 'num_input_tokens_seen': 7644119040, 'completed': '98.86% (3_645 / 3_687)', 'remaining time': '0:22:21', 'throughput': '8080.24', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:44:20,125 >> {'loss': 0.2887, 'grad_norm': 12.244915962219238, 'learning_rate': 1.0030434679540853e-07, 'epoch': 0.06672091131000814, 'num_input_tokens_seen': 7646216192, 'completed': '98.89% (3_646 / 3_687)', 'remaining time': '0:21:49', 'throughput': '7871.85', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:44:50,358 >> {'loss': 0.5713, 'grad_norm': 16.625158309936523, 'learning_rate': 1.0028968323493623e-07, 'epoch': 0.06699213452671549, 'num_input_tokens_seen': 7648313344, 'completed': '98.92% (3_647 / 3_687)', 'remaining time': '0:21:17', 'throughput': '8670.80', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:45:21,365 >> {'loss': 0.6223, 'grad_norm': 18.604345321655273, 'learning_rate': 1.0027538158421413e-07, 'epoch': 0.06726335774342283, 'num_input_tokens_seen': 7650410496, 'completed': '98.94% (3_648 / 3_687)', 'remaining time': '0:20:45', 'throughput': '8454.36', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:45:52,446 >> {'loss': 0.6491, 'grad_norm': 17.25090217590332, 'learning_rate': 1.002614418547516e-07, 'epoch': 0.06753458096013018, 'num_input_tokens_seen': 7652507648, 'completed': '98.97% (3_649 / 3_687)', 'remaining time': '0:20:13', 'throughput': '8434.12', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:46:21,198 >> {'loss': 0.5556, 'grad_norm': 17.5065975189209, 'learning_rate': 1.0024786405776686e-07, 'epoch': 0.06780580417683754, 'num_input_tokens_seen': 7654604800, 'completed': '99.00% (3_650 / 3_687)', 'remaining time': '0:19:41', 'throughput': '9117.44', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:46:51,031 >> {'loss': 0.451, 'grad_norm': 13.661022186279297, 'learning_rate': 1.0023464820418676e-07, 'epoch': 0.06807702739354489, 'num_input_tokens_seen': 7656701952, 'completed': '99.02% (3_651 / 3_687)', 'remaining time': '0:19:08', 'throughput': '8787.08', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:47:22,783 >> {'loss': 0.3287, 'grad_norm': 14.327960968017578, 'learning_rate': 1.00221794304647e-07, 'epoch': 0.06834825061025224, 'num_input_tokens_seen': 7658799104, 'completed': '99.05% (3_652 / 3_687)', 'remaining time': '0:18:36', 'throughput': '8256.01', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:47:56,247 >> {'loss': 0.363, 'grad_norm': 21.388771057128906, 'learning_rate': 1.0020930236949182e-07, 'epoch': 0.06861947382695958, 'num_input_tokens_seen': 7660896256, 'completed': '99.08% (3_653 / 3_687)', 'remaining time': '0:18:05', 'throughput': '7833.53', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:48:28,030 >> {'loss': 0.253, 'grad_norm': 9.650550842285156, 'learning_rate': 1.0019717240877424e-07, 'epoch': 0.06889069704366693, 'num_input_tokens_seen': 7662993408, 'completed': '99.10% (3_654 / 3_687)', 'remaining time': '0:17:33', 'throughput': '8248.03', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:48:55,466 >> {'loss': 0.7714, 'grad_norm': 17.260183334350586, 'learning_rate': 1.001854044322561e-07, 'epoch': 0.0691619202603743, 'num_input_tokens_seen': 7665090560, 'completed': '99.13% (3_655 / 3_687)', 'remaining time': '0:17:00', 'throughput': '9554.75', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:49:25,624 >> {'loss': 0.4834, 'grad_norm': 22.375675201416016, 'learning_rate': 1.0017399844940774e-07, 'epoch': 0.06943314347708164, 'num_input_tokens_seen': 7667187712, 'completed': '99.16% (3_656 / 3_687)', 'remaining time': '0:16:28', 'throughput': '8692.40', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:49:58,310 >> {'loss': 0.3653, 'grad_norm': 13.567085266113281, 'learning_rate': 1.0016295446940827e-07, 'epoch': 0.06970436669378899, 'num_input_tokens_seen': 7669284864, 'completed': '99.19% (3_657 / 3_687)', 'remaining time': '0:15:56', 'throughput': '8020.10', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:50:27,688 >> {'loss': 0.7221, 'grad_norm': 17.413497924804688, 'learning_rate': 1.001522725011455e-07, 'epoch': 0.06997558991049634, 'num_input_tokens_seen': 7671382016, 'completed': '99.21% (3_658 / 3_687)', 'remaining time': '0:15:24', 'throughput': '8922.93', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:50:57,554 >> {'loss': 0.4207, 'grad_norm': 13.802844047546387, 'learning_rate': 1.0014195255321583e-07, 'epoch': 0.07024681312720368, 'num_input_tokens_seen': 7673479168, 'completed': '99.24% (3_659 / 3_687)', 'remaining time': '0:14:52', 'throughput': '8777.42', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:51:30,232 >> {'loss': 0.3311, 'grad_norm': 13.901693344116211, 'learning_rate': 1.0013199463392433e-07, 'epoch': 0.07051803634391104, 'num_input_tokens_seen': 7675576320, 'completed': '99.27% (3_660 / 3_687)', 'remaining time': '0:14:20', 'throughput': '8022.12', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:52:01,202 >> {'loss': 0.2948, 'grad_norm': 11.964288711547852, 'learning_rate': 1.0012239875128484e-07, 'epoch': 0.07078925956061839, 'num_input_tokens_seen': 7677673472, 'completed': '99.29% (3_661 / 3_687)', 'remaining time': '0:13:48', 'throughput': '8464.42', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:52:32,423 >> {'loss': 0.4533, 'grad_norm': 16.391555786132812, 'learning_rate': 1.0011316491301973e-07, 'epoch': 0.07106048277732574, 'num_input_tokens_seen': 7679770624, 'completed': '99.32% (3_662 / 3_687)', 'remaining time': '0:13:16', 'throughput': '8396.51', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:53:03,406 >> {'loss': 0.5279, 'grad_norm': 15.69762134552002, 'learning_rate': 1.0010429312656006e-07, 'epoch': 0.07133170599403309, 'num_input_tokens_seen': 7681867776, 'completed': '99.35% (3_663 / 3_687)', 'remaining time': '0:12:44', 'throughput': '8460.69', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:53:31,401 >> {'loss': 0.6914, 'grad_norm': 18.312103271484375, 'learning_rate': 1.000957833990454e-07, 'epoch': 0.07160292921074043, 'num_input_tokens_seen': 7683964928, 'completed': '99.38% (3_664 / 3_687)', 'remaining time': '0:12:12', 'throughput': '9364.03', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:54:02,040 >> {'loss': 0.2825, 'grad_norm': 11.982022285461426, 'learning_rate': 1.0008763573732421e-07, 'epoch': 0.0718741524274478, 'num_input_tokens_seen': 7686062080, 'completed': '99.40% (3_665 / 3_687)', 'remaining time': '0:11:40', 'throughput': '8555.77', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:54:36,262 >> {'loss': 0.4631, 'grad_norm': 15.081099510192871, 'learning_rate': 1.0007985014795331e-07, 'epoch': 0.07214537564415514, 'num_input_tokens_seen': 7688159232, 'completed': '99.43% (3_666 / 3_687)', 'remaining time': '0:11:09', 'throughput': '7660.23', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:55:03,769 >> {'loss': 0.4301, 'grad_norm': 14.211804389953613, 'learning_rate': 1.0007242663719824e-07, 'epoch': 0.07241659886086249, 'num_input_tokens_seen': 7690256384, 'completed': '99.46% (3_667 / 3_687)', 'remaining time': '0:10:36', 'throughput': '9529.94', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:55:32,765 >> {'loss': 0.5014, 'grad_norm': 16.612119674682617, 'learning_rate': 1.0006536521103325e-07, 'epoch': 0.07268782207756984, 'num_input_tokens_seen': 7692353536, 'completed': '99.48% (3_668 / 3_687)', 'remaining time': '0:10:04', 'throughput': '9040.67', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:56:03,930 >> {'loss': 0.5847, 'grad_norm': 16.976350784301758, 'learning_rate': 1.0005866587514106e-07, 'epoch': 0.07295904529427719, 'num_input_tokens_seen': 7694450688, 'completed': '99.51% (3_669 / 3_687)', 'remaining time': '0:09:32', 'throughput': '8411.63', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:56:35,599 >> {'loss': 0.5873, 'grad_norm': 16.93351936340332, 'learning_rate': 1.0005232863491297e-07, 'epoch': 0.07323026851098453, 'num_input_tokens_seen': 7696547840, 'completed': '99.54% (3_670 / 3_687)', 'remaining time': '0:09:01', 'throughput': '8277.61', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:57:05,428 >> {'loss': 0.5883, 'grad_norm': 19.781625747680664, 'learning_rate': 1.0004635349544907e-07, 'epoch': 0.0735014917276919, 'num_input_tokens_seen': 7698644992, 'completed': '99.57% (3_671 / 3_687)', 'remaining time': '0:08:29', 'throughput': '8788.19', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:57:40,372 >> {'loss': 0.3702, 'grad_norm': 15.636222839355469, 'learning_rate': 1.0004074046155789e-07, 'epoch': 0.07377271494439924, 'num_input_tokens_seen': 7700742144, 'completed': '99.59% (3_672 / 3_687)', 'remaining time': '0:07:57', 'throughput': '7501.73', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:58:12,209 >> {'loss': 0.7943, 'grad_norm': 20.640913009643555, 'learning_rate': 1.000354895377565e-07, 'epoch': 0.07404393816110659, 'num_input_tokens_seen': 7702839296, 'completed': '99.62% (3_673 / 3_687)', 'remaining time': '0:07:25', 'throughput': '8233.89', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:58:40,685 >> {'loss': 0.4785, 'grad_norm': 16.68242073059082, 'learning_rate': 1.0003060072827073e-07, 'epoch': 0.07431516137781394, 'num_input_tokens_seen': 7704936448, 'completed': '99.65% (3_674 / 3_687)', 'remaining time': '0:06:53', 'throughput': '9205.86', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:59:13,156 >> {'loss': 0.3074, 'grad_norm': 14.670442581176758, 'learning_rate': 1.0002607403703492e-07, 'epoch': 0.07458638459452128, 'num_input_tokens_seen': 7707033600, 'completed': '99.67% (3_675 / 3_687)', 'remaining time': '0:06:21', 'throughput': '8073.34', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 10:59:42,560 >> {'loss': 0.7391, 'grad_norm': 19.64070701599121, 'learning_rate': 1.000219094676919e-07, 'epoch': 0.07485760781122865, 'num_input_tokens_seen': 7709130752, 'completed': '99.70% (3_676 / 3_687)', 'remaining time': '0:05:49', 'throughput': '8915.01', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:00:13,335 >> {'loss': 0.3917, 'grad_norm': 12.361337661743164, 'learning_rate': 1.0001810702359326e-07, 'epoch': 0.075128831027936, 'num_input_tokens_seen': 7711227904, 'completed': '99.73% (3_677 / 3_687)', 'remaining time': '0:05:18', 'throughput': '8518.08', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:00:44,613 >> {'loss': 0.3472, 'grad_norm': 11.465216636657715, 'learning_rate': 1.0001466670779896e-07, 'epoch': 0.07540005424464334, 'num_input_tokens_seen': 7713325056, 'completed': '99.76% (3_678 / 3_687)', 'remaining time': '0:04:46', 'throughput': '8381.24', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:01:16,566 >> {'loss': 0.2105, 'grad_norm': 10.229789733886719, 'learning_rate': 1.000115885230777e-07, 'epoch': 0.07567127746135069, 'num_input_tokens_seen': 7715422208, 'completed': '99.78% (3_679 / 3_687)', 'remaining time': '0:04:14', 'throughput': '8203.90', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:01:46,217 >> {'loss': 0.3553, 'grad_norm': 11.707015991210938, 'learning_rate': 1.0000887247190662e-07, 'epoch': 0.07594250067805804, 'num_input_tokens_seen': 7717519360, 'completed': '99.81% (3_680 / 3_687)', 'remaining time': '0:03:42', 'throughput': '8841.17', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:02:16,748 >> {'loss': 0.7441, 'grad_norm': 17.815807342529297, 'learning_rate': 1.000065185564716e-07, 'epoch': 0.0762137238947654, 'num_input_tokens_seen': 7719616512, 'completed': '99.84% (3_681 / 3_687)', 'remaining time': '0:03:10', 'throughput': '8586.05', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:02:44,094 >> {'loss': 0.5363, 'grad_norm': 16.822296142578125, 'learning_rate': 1.0000452677866691e-07, 'epoch': 0.07648494711147275, 'num_input_tokens_seen': 7721713664, 'completed': '99.86% (3_682 / 3_687)', 'remaining time': '0:02:38', 'throughput': '9586.39', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:03:14,324 >> {'loss': 0.5075, 'grad_norm': 16.27877426147461, 'learning_rate': 1.0000289714009542e-07, 'epoch': 0.07675617032818009, 'num_input_tokens_seen': 7723810816, 'completed': '99.89% (3_683 / 3_687)', 'remaining time': '0:02:07', 'throughput': '8671.54', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:03:45,598 >> {'loss': 0.4568, 'grad_norm': 14.539108276367188, 'learning_rate': 1.000016296420687e-07, 'epoch': 0.07702739354488744, 'num_input_tokens_seen': 7725907968, 'completed': '99.92% (3_684 / 3_687)', 'remaining time': '0:01:35', 'throughput': '8382.10', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:04:16,226 >> {'loss': 0.494, 'grad_norm': 16.22719383239746, 'learning_rate': 1.0000072428560674e-07, 'epoch': 0.07729861676159479, 'num_input_tokens_seen': 7728005120, 'completed': '99.95% (3_685 / 3_687)', 'remaining time': '0:01:03', 'throughput': '8559.07', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:04:49,240 >> {'loss': 0.4097, 'grad_norm': 15.531390190124512, 'learning_rate': 1.000001810714381e-07, 'epoch': 0.07756983997830215, 'num_input_tokens_seen': 7730102272, 'completed': '99.97% (3_686 / 3_687)', 'remaining time': '0:00:31', 'throughput': '7940.27', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:175] 2025-01-21 11:05:19,062 >> {'loss': 0.3166, 'grad_norm': 10.840725898742676, 'learning_rate': 1e-07, 'epoch': 0.0778410631950095, 'num_input_tokens_seen': 7732199424, 'completed': '100.00% (3_687 / 3_687)', 'remaining time': '0:00:00', 'throughput': '8790.46', 'gpu_mem_free': '5581MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-21 11:05:43,430 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3687
[INFO|configuration_utils.py:472] 2025-01-21 11:05:43,433 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3687/config.json
[INFO|configuration_utils.py:807] 2025-01-21 11:05:43,434 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3687/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-21 11:06:40,727 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3687/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-21 11:06:40,730 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3687/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-21 11:06:40,731 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/checkpoint-3687/special_tokens_map.json
[WARNING|trainer.py:869] 2025-01-21 11:10:20,871 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 7374, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
[INFO|trainer.py:2394] 2025-01-21 11:10:21,365 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:175] 2025-01-21 11:10:21,367 >> {'train_runtime': 9429.4679, 'train_samples_per_second': 0.782, 'train_steps_per_second': 0.391, 'train_loss': 0.036126901625228955, 'epoch': 0.0778410631950095, 'num_input_tokens_seen': 7732199424, 'completed': '100.00% (3_687 / 3_687)', 'remaining time': '0:00:00', 'throughput': '0.00', 'gpu_mem_free': '5581MB'}
[INFO|trainer.py:3503] 2025-01-21 11:10:45,916 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_
[INFO|configuration_utils.py:472] 2025-01-21 11:10:45,928 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/config.json
[INFO|configuration_utils.py:807] 2025-01-21 11:10:45,930 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-21 11:11:45,895 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-21 11:11:45,899 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-21 11:11:45,900 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-ft/_llama-3.1-8b-instruct_bsz-16_lr-1e-6_epochs-1_/special_tokens_map.json
***** train metrics *****
  epoch                    =     0.0778
  num_input_tokens_seen    = 7732199424
  train_loss               =     0.0361
  train_runtime            = 2:37:09.46
  train_samples_per_second =      0.782
  train_steps_per_second   =      0.391