Update README.md
Browse files
README.md
CHANGED
|
@@ -17,6 +17,10 @@ Online Reinforcement learning using GRPO full parameter on warmup reasoning SFT
|
|
| 17 |
1. Improve reasoning on Dialects, each datapoint been replicated to 12 generations.
|
| 18 |
2. Actual online reinforcement learning.
|
| 19 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 20 |
## Training session
|
| 21 |
|
| 22 |
Finetune on [huseinzol05/malaysian-dialect-qa](https://huggingface.co/datasets/huseinzol05/malaysian-dialect-qa), this is train set from [mesolitica/Malay-Dialect-Reasoning](https://huggingface.co/datasets/mesolitica/Malay-Dialect-Reasoning).
|
|
|
|
| 17 |
1. Improve reasoning on Dialects, each datapoint been replicated to 12 generations.
|
| 18 |
2. Actual online reinforcement learning.
|
| 19 |
|
| 20 |
+
## Better performance
|
| 21 |
+
|
| 22 |
+
To get better performance, use system prompt `You are going to enter reasoning mode. First, you try to think step-by-step in Malay. After that, put your final answer within $\\boxed{}$.`, you can check how we trained it at https://github.com/mesolitica/malaya/blob/master/session/qwen2.5/grpo.py#L80
|
| 23 |
+
|
| 24 |
## Training session
|
| 25 |
|
| 26 |
Finetune on [huseinzol05/malaysian-dialect-qa](https://huggingface.co/datasets/huseinzol05/malaysian-dialect-qa), this is train set from [mesolitica/Malay-Dialect-Reasoning](https://huggingface.co/datasets/mesolitica/Malay-Dialect-Reasoning).
|