
Reasoning-Aware GRPO using Process Mining

BAELAB, Pusan National University, Busan, Korea

Taekhyun Park*, Yongjae Lee*, Hyerim Bae

🌟 GitHub | 📥 1.5B Download | 📥 7B Download | 📄 arXiv Paper

Abstract

Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware Group Relative Policy Optimization (GRPO) method that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are used to compute a scalar conformance reward that measures how closely a policy model's reasoning aligns with that of a pretrained teacher model. Empirical results on five benchmarks demonstrate that PM4GRPO significantly outperforms existing GRPO-based post-training methods. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.
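
A minimal sketch of the reward structure described above, assuming hypothetical step sequences and weights; a simple sequence-similarity ratio stands in for the paper's process-mining conformance check, and the names (`conformance_reward`, `total_reward`, `w_ans`, `w_fmt`, `w_conf`) are illustrative, not the released implementation:

```python
import difflib
import statistics
from typing import List

def conformance_reward(policy_steps: List[str], teacher_steps: List[str]) -> float:
    # Scalar conformance in [0, 1]: how closely the policy's sequence of
    # reasoning steps follows the teacher's. A sequence-similarity ratio is
    # used here as a stand-in for the paper's process-mining conformance check.
    return difflib.SequenceMatcher(None, policy_steps, teacher_steps).ratio()

def total_reward(answer_correct: bool, format_ok: bool, conformance: float,
                 w_ans: float = 1.0, w_fmt: float = 0.5, w_conf: float = 0.5) -> float:
    # Standard answer/format rewards augmented with the conformance signal.
    return w_ans * float(answer_correct) + w_fmt * float(format_ok) + w_conf * conformance

def group_relative_advantages(rewards: List[float]) -> List[float]:
    # GRPO: advantages are rewards normalized within each sampled group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: a group of three sampled responses for one prompt.
teacher = ["parse", "plan", "solve"]
group = [
    total_reward(True,  True,  conformance_reward(["parse", "plan", "solve"], teacher)),
    total_reward(True,  False, conformance_reward(["guess", "solve"], teacher)),
    total_reward(False, True,  conformance_reward(["parse", "solve"], teacher)),
]
print(group_relative_advantages(group))
```

In this sketch, a response that reaches the right answer but skips the teacher's reasoning procedure earns a lower total reward than one that both answers correctly and conforms, which is the intended effect of the conformance term.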

Illustration of PM4GRPO
