Harsh Nilesh Pathak
harsh306
2 followers · 0 following
AI & ML interests
AI and ML
Recent Activity
Replied to Kseniase's post 13 days ago
11 Fascinating new Policy Optimization techniques

Policy optimization (PO) algorithms are central to training AI models with preference-based feedback. In recent weeks, numerous new PO methods have emerged that build on or replace the popular PPO and GRPO and address their shortcomings. Here are 11 of them:

1. BAlanced Policy Optimization (BAPO) → https://huggingface.co/papers/2510.18927
Dynamically adjusts the clipping bounds in PPO-style updates to balance positive and negative gradients and prevent entropy collapse (a minimal sketch of the clipping step these methods modify follows the post)

2. Training-Free GRPO → https://huggingface.co/papers/2510.08191
Instead of using numeric rewards, it compares rollouts semantically to distill useful knowledge as a token prior, which is then applied during inference to guide the model’s behavior

3. Asymmetric Importance Sampling Policy Optimization (ASPO) → https://huggingface.co/papers/2510.06062
Fixes imbalanced token weighting in LLM training. It flips the importance sampling ratios for positive tokens to correct over- and under-updates, and adds a soft dual-clipping step to keep gradients stable

4. In-Context Steered Policy Optimization (ICPO) → https://arxiv.org/abs/2510.26519
Uses a model’s own in-context learning ability to guide training with existing data. It combines Mixed-Policy GRPO with Implicit Expert Forcing to expand exploration and adds Expert Region Reject Sampling and Annealed Expert-Bonus Reward Shaping to ensure stability and balanced expert influence

5. Graph-Enhanced Policy Optimization (GEPO) → https://arxiv.org/abs/2510.26270
Builds a graph of an agent’s experiences to understand how different states connect, guide exploration, and assign rewards more effectively

6. Information Gain-based Policy Optimization (IGPO) → https://huggingface.co/papers/2510.14967
Uses the model’s own belief updates to create dense, informative feedback for smoother multi-turn learning

Read further below ⬇️ If you like this, also subscribe to the Turing post: https://www.turingpost.com/subscribe
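Several of the methods above (notably BAPO in item 1 and the dual-clipping step of ASPO in item 3) intervene in the clipping stage of the standard PPO surrogate objective. Below is a minimal NumPy sketch of that clipped surrogate with decoupled lower/upper bounds, shown only to illustrate where such methods plug in; the function name, bound values, and the use of fixed (rather than dynamically adjusted) bounds are illustrative assumptions, not details taken from either paper.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with decoupled lower/upper clipping bounds.

    Illustrative sketch only: BAPO adjusts these bounds dynamically during
    training and ASPO reworks the importance ratios themselves; here the
    bounds are fixed hyperparameters. `ratio` is pi_new(a|s) / pi_old(a|s)
    and `advantage` is the estimated advantage of the sampled action/token.
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Take the pessimistic (elementwise minimum) objective, as in standard PPO.
    return np.minimum(ratio * advantage, clipped * advantage)

# Toy usage: positive-advantage tokens are capped at 1 + eps_high,
# negative-advantage tokens at 1 - eps_low.
ratios = np.array([0.6, 1.0, 1.5])
advs = np.array([1.0, -0.5, 2.0])
print(clipped_surrogate(ratios, advs))
```

Widening the upper bound relative to the lower one lets positive-advantage tokens contribute larger updates, which is the gradient imbalance these papers target; how the bounds are chosen or adapted is where the individual methods differ.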
Organizations
None yet
harsh306's datasets
None public yet