---
language:
- ms
- en
datasets:
- mesolitica/Malay-Dialect-Reasoning
base_model:
- mesolitica/Malaysian-Qwen2.5-7B-Reasoning-SFT
---

# Malaysian Qwen 2.5 7B Instruct Dialect Reasoning GRPO

Online reinforcement learning using full-parameter GRPO on top of the warmup reasoning SFT model https://huggingface.co/mesolitica/Malaysian-Qwen2.5-7B-Reasoning-SFT, trained on a highly curated Malay Dialect Reasoning dataset.

## Improvement

1. Improved reasoning on dialects.
2. Actual online reinforcement learning.

## Training session

Finetuned on [huseinzol05/malaysian-dialect-qa](https://huggingface.co/datasets/huseinzol05/malaysian-dialect-qa), which is the train set from [mesolitica/Malay-Dialect-Reasoning](https://huggingface.co/datasets/mesolitica/Malay-Dialect-Reasoning).

## How we train

1. GRPO full parameters.
2. WandB at https://wandb.ai/huseinzol05/fpf-Malaysian-Qwen2.5-7B-Reasoning-SFT-GRPO-v3

## Checkpoints

1. Epoch 1.1, revision [78435e1edc593a842e6031ba6ee7a5930d9d2a83](https://huggingface.co/mesolitica/Malaysian-Qwen2.5-7B-Dialect-Reasoning-GRPO/commit/78435e1edc593a842e6031ba6ee7a5930d9d2a83)
2. Epoch 2.0, revision [25d418d0032f08c39a506b529e6133e60f998a61](https://huggingface.co/mesolitica/Malaysian-Qwen2.5-7B-Dialect-Reasoning-GRPO/commit/25d418d0032f08c39a506b529e6133e60f998a61)

## Source code

Source code at https://github.com/mesolitica/malaya/blob/master/session/qwen2.5/7b-grpo-fsdp.sh

## Benchmark

All benchmarks are generated using vLLM.

### Dialect to standard Malay translation

#### Float32

#### Float16

### Standard Malay to dialect translation

#### Float32

#### Float16

## Special thanks

Special thanks to https://www.sns.com.my and Nvidia for the 8x H100 node!
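
## Loading a checkpoint

The two revisions listed under Checkpoints can be pinned when loading the model. The following is a minimal sketch, not the authors' own loading code: it assumes the `transformers` library is installed, takes the repository id from the commit URLs above, and uses the labels `epoch-1.1` and `epoch-2.0` purely as hypothetical dictionary keys for this example.

```python
# Repository id as it appears in the checkpoint commit URLs.
REPO_ID = "mesolitica/Malaysian-Qwen2.5-7B-Dialect-Reasoning-GRPO"

# Revision hashes copied from the Checkpoints list.
# The keys are illustrative labels, not names used by the repository.
CHECKPOINTS = {
    "epoch-1.1": "78435e1edc593a842e6031ba6ee7a5930d9d2a83",
    "epoch-2.0": "25d418d0032f08c39a506b529e6133e60f998a61",
}


def load_checkpoint(name="epoch-2.0"):
    """Load the tokenizer and model pinned to one of the listed revisions."""
    # Imported lazily so the mapping above can be inspected
    # without transformers being installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    revision = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        REPO_ID,
        revision=revision,
        torch_dtype="auto",  # use the dtype stored in the checkpoint config
    )
    return tokenizer, model
```

Calling `load_checkpoint("epoch-1.1")` downloads the full 7B weights for that revision, so it needs enough disk space and memory for a model of that size.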