--- language: - ms - en datasets: - mesolitica/Malay-Dialect-Reasoning base_model: - mesolitica/Malaysian-Qwen2.5-7B-Reasoning-SFT --- # Malaysian Qwen 2.5 7B Instruct Dialect Reasoning GRPO Online Reinforcement learning using GRPO full parameter on warmup reasoning SFT https://huggingface.co/mesolitica/Malaysian-Qwen2.5-7B-Reasoning-SFT on highly curated Malay Dialect Reasoning dataset. ## Improvement 1. Improve reasoning on Dialects, each datapoint been replicated to 6 generations. 2. Actual online reinforcement learning. ## Better performance To get better performance, use system prompt `You are going to enter reasoning mode. First, you try to think step-by-step in Malay. After that, put your final answer within $\\boxed{}$.`, you can check how we trained it at https://github.com/mesolitica/malaya/blob/master/session/qwen2.5/grpo.py#L80 ## Training session Finetune on [huseinzol05/malaysian-dialect-qa](https://huggingface.co/datasets/huseinzol05/malaysian-dialect-qa), this is train set from [mesolitica/Malay-Dialect-Reasoning](https://huggingface.co/datasets/mesolitica/Malay-Dialect-Reasoning). ## How we train 1. GRPO full parameters. 5. WanDB at https://wandb.ai/huseinzol05/fpf-Malaysian-Qwen2.5-7B-Reasoning-SFT-GRPO-v3 ## Checkpoints 1. Epoch 1.1, revision [78435e1edc593a842e6031ba6ee7a5930d9d2a83](https://huggingface.co/mesolitica/Malaysian-Qwen2.5-7B-Dialect-Reasoning-GRPO/commit/78435e1edc593a842e6031ba6ee7a5930d9d2a83) 2. Epoch 2.0, revision [25d418d0032f08c39a506b529e6133e60f998a61](https://huggingface.co/mesolitica/Malaysian-Qwen2.5-7B-Dialect-Reasoning-GRPO/commit/25d418d0032f08c39a506b529e6133e60f998a61) 3. Epoch 2.96, revision [4c6886a43f73767be61d67093f20dbdf1a7d8df6](https://huggingface.co/mesolitica/Malaysian-Qwen2.5-7B-Dialect-Reasoning-GRPO/commit/4c6886a43f73767be61d67093f20dbdf1a7d8df6) ## Source code Source code at https://github.com/mesolitica/malaya/blob/master/session/qwen2.5/7b-grpo-fsdp.sh ## Benchmark All the benchmarks generate using vLLM, evaluation based on sacrebleu CHRF max@5. Source code for evaluation at https://github.com/mesolitica/malaya/tree/master/session/qwen2.5/evaluate-dialect ### Float32 Dialect to standard Malay, ``` From: johor To: malay, score: 58.2189619529139 From: kedah To: malay, score: 59.21260384746205 From: pahang To: malay, score: 53.506270589822165 From: negeri sembilan To: malay, score: 56.94870448682657 From: kelantan To: malay, score: 50.64768195652429 From: penang To: malay, score: 62.964413639258034 From: melaka To: malay, score: 56.24541676643081 average: 56.82057903417684 ``` Standard Malay to dialect, ``` From: malay To: johor, score: 54.83246740931249 From: malay To: kedah, score: 59.069394967356274 From: malay To: pahang, score: 59.695207458023745 From: malay To: negeri sembilan, score: 50.69885056697714 From: malay To: kelantan, score: 44.66310165425512 From: malay To: penang, score: 65.39795752468879 From: malay To: melaka, score: 72.39183991789344 average: 58.10697421407243 ``` ### Float16 Dialect to standard Malay, ``` From: johor To: malay, score: 57.42949426937456 From: kedah To: malay, score: 58.12580212528728 From: pahang To: malay, score: 55.60484906845884 From: negeri sembilan To: malay, score: 56.4509629484568 From: kelantan To: malay, score: 53.944979416369996 From: penang To: malay, score: 62.20935643642939 From: melaka To: malay, score: 57.14492955494046 average: 57.27291054561676 ``` Standard Malay to dialect, ``` From: malay To: johor, score: 55.68356840259747 From: malay To: kedah, score: 56.264707994950186 From: malay To: pahang, score: 60.15982036912563 From: malay To: negeri sembilan, score: 48.71725827604103 From: malay To: kelantan, score: 43.948995049469474 From: malay To: penang, score: 63.15864675162173 From: malay To: melaka, score: 74.12398375006538 average: 57.436711513410124 ``` ## Special thanks Special thanks to https://www.sns.com.my and Nvidia for 8x H100 node!