Harsh Nilesh Pathak
harsh306
2 followers · 0 following
AI & ML interests
AI and ML
Recent Activity
Replied to Kseniase's post 13 days ago
11 Fascinating new Policy Optimization techniques

Policy optimization (PO) algorithms are central to training AI models with preference-based feedback. In recent weeks, numerous new PO methods have emerged that build on or replace the popular PPO and GRPO and address their shortcomings. Here are 11 of them:

1. BAlanced Policy Optimization (BAPO) → https://huggingface.co/papers/2510.18927
Dynamically adjusts the clipping bounds in PPO-style updates to balance positive and negative gradients and prevent entropy collapse (a minimal sketch of the clipping step these methods modify follows the post)

2. Training-Free GRPO → https://huggingface.co/papers/2510.08191
Instead of using numeric rewards, it compares rollouts semantically to distill useful knowledge as a token prior, which is then applied during inference to guide the model’s behavior

3. Asymmetric Importance Sampling Policy Optimization (ASPO) → https://huggingface.co/papers/2510.06062
Fixes imbalanced token weighting in LLM training. It flips the importance sampling ratios for positive tokens to correct over- and under-updates, and adds a soft dual-clipping step to keep gradients stable

4. In-Context Steered Policy Optimization (ICPO) → https://arxiv.org/abs/2510.26519
Uses a model’s own in-context learning ability to guide training with existing data. It combines Mixed-Policy GRPO with Implicit Expert Forcing to expand exploration and adds Expert Region Reject Sampling and Annealed Expert-Bonus Reward Shaping to ensure stability and balanced expert influence

5. Graph-Enhanced Policy Optimization (GEPO) → https://arxiv.org/abs/2510.26270
Builds a graph of an agent’s experiences to understand how different states connect, guide exploration, and assign rewards more effectively

6. Information Gain-based Policy Optimization (IGPO) → https://huggingface.co/papers/2510.14967
Uses the model’s own belief updates to create dense, informative feedback for smoother multi-turn learning

Read further below ⬇️ If you like this, also subscribe to the Turing post: https://www.turingpost.com/subscribe
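Several of the methods above (notably BAPO in item 1 and the dual-clipping step of ASPO in item 3) intervene in the clipping stage of the standard PPO surrogate objective. Below is a minimal NumPy sketch of that clipped surrogate with decoupled lower/upper bounds, shown only to illustrate where such methods plug in; the function name, bound values, and the use of fixed (rather than dynamically adjusted) bounds are illustrative assumptions, not details taken from either paper.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with decoupled lower/upper clipping bounds.

    Illustrative sketch only: BAPO adjusts these bounds dynamically during
    training and ASPO reworks the importance ratios themselves; here the
    bounds are fixed hyperparameters. `ratio` is pi_new(a|s) / pi_old(a|s)
    and `advantage` is the estimated advantage of the sampled action/token.
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Take the pessimistic (elementwise minimum) objective, as in standard PPO.
    return np.minimum(ratio * advantage, clipped * advantage)

# Toy usage: positive-advantage tokens are capped at 1 + eps_high,
# negative-advantage tokens at 1 - eps_low.
ratios = np.array([0.6, 1.0, 1.5])
advs = np.array([1.0, -0.5, 2.0])
print(clipped_surrogate(ratios, advs))
```

Widening the upper bound relative to the lower one lets positive-advantage tokens contribute larger updates, which is the gradient imbalance these papers target; how the bounds are chosen or adapted is where the individual methods differ.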
Organizations
None yet
harsh306's datasets
None public yet