GFPO
This feature implements the GFPO algorithm, which enforces concise reasoning in the model's generations, as proposed in the paper Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning.
Usage
To activate GFPO in GFPOTrainer:
- set num_remains_in_group in GFPOConfig
- define a group filter function and set it to group_filter_func in GFPOTrainer

group_filter_func scores the num_generations completions in each group, and the GFPOTrainer keeps the top num_remains_in_group completions according to these scores as the new group. The model is then trained on the filtered group.
# train_gfpo.py
from trl.experimental.gfpo import GFPOConfig, GFPOTrainer
# Dummy group filter that scores each completion by its index in the group
class GroupFilter:
    def __call__(self, group_completions, group_rewards, **kwargs):
        group_scores = []
        for completions, rewards in zip(group_completions, group_rewards):
            scores = [float(i) for i in range(len(completions))]
            group_scores.append(scores)
        return group_scores
training_args = GFPOConfig(
    output_dir="Qwen3-0.6B-GFPO",
    per_device_train_batch_size=4,
    num_remains_in_group=2,
    bf16=True,
)

trainer = GFPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=...,
    train_dataset=...,
    args=training_args,
    group_filter_func=GroupFilter(),
)
trainer.train()
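In practice, the group filter is where the conciseness preference is expressed. Below is a minimal sketch of a length-based filter that keeps the shortest completions in each group by assigning higher scores to shorter ones. It assumes plain-text (non-conversational) completions and uses character length as the measure; the class name ConciseGroupFilter and the choice of character length are illustrative, token counts or a reward-aware score would work just as well.

# Illustrative length-based group filter (not part of TRL)
class ConciseGroupFilter:
    def __call__(self, group_completions, group_rewards, **kwargs):
        group_scores = []
        for completions, rewards in zip(group_completions, group_rewards):
            # Shorter completions get higher scores, so the top
            # num_remains_in_group completions kept by the trainer
            # are the most concise ones in each group.
            scores = [-float(len(completion)) for completion in completions]
            group_scores.append(scores)
        return group_scores

An instance of this class would be passed as group_filter_func to GFPOTrainer, exactly as in the example above.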