Want to learn how to align a Vision Language Model (VLM) for reasoning using GRPO and TRL? 🌋🧑‍🍳 We've got you covered! NEW multimodal post-training recipe to align a VLM using TRL in @HuggingFace's Cookbook. Go to the recipe 👉 https://huggingface.co/learn/cookbook/fine_tuning_vlm_grpo_trl Powered by the latest TRL v0.20 release, this recipe shows how to teach Qwen2.5-VL-3B-Instruct to reason over images 🌋
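As a rough illustration of what GRPO training on a VLM involves: GRPO scores sampled completions with rule-based reward functions, and a common one checks that the model wraps its reasoning and answer in the expected tags. The sketch below is an assumption about the recipe's reward design, not its exact code; the function name, tag format, and the `trl.GRPOTrainer` wiring shown in the comment are illustrative only.

```python
import re

# Hypothetical format reward of the kind used in GRPO recipes: give 1.0 when a
# completion follows the <think>...</think><answer>...</answer> template, else 0.0.
# TRL reward functions take a list of completions and return one score per completion.
THINK_ANSWER_RE = re.compile(
    r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL
)

def format_reward(completions, **kwargs):
    """Score each completion on whether it matches the reasoning template."""
    return [
        1.0 if THINK_ANSWER_RE.fullmatch(c.strip()) else 0.0
        for c in completions
    ]

# A function like this would typically be passed to the trainer, e.g. (sketch):
#   trainer = trl.GRPOTrainer(
#       model="Qwen/Qwen2.5-VL-3B-Instruct",
#       reward_funcs=[format_reward],  # plus task-specific correctness rewards
#       args=trl.GRPOConfig(output_dir="out"),
#       train_dataset=dataset,
#   )

if __name__ == "__main__":
    demo = [
        "<think>The volcano is erupting.</think><answer>eruption</answer>",
        "just a plain answer with no tags",
    ]
    print(format_reward(demo))
```

In practice a format reward like this is combined with a correctness reward (e.g. exact-match on the extracted answer), and GRPO normalizes the summed rewards within each group of sampled completions.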
gpt-oss Collection Open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. • 2 items • Updated Aug 7 • 381
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO Paper • 2506.07464 • Published Jun 9 • 14 • 3
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning Paper • 2506.09985 • Published Jun 11 • 29
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification Paper • 2506.15569 • Published Jun 18 • 12
Time Blindness: Why Video-Language Models Can't See What Humans Can? Paper • 2505.24867 • Published May 30 • 80
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought Paper • 2505.16192 • Published May 22 • 12
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation Paper • 2505.14640 • Published May 20 • 16
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning Paper • 2505.15966 • Published May 21 • 53
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning Paper • 2505.14231 • Published May 20 • 52
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning Paper • 2505.11049 • Published May 16 • 60
Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA Paper • 2505.06356 • Published May 9 • 3
Aya Vision: Advancing the Frontier of Multilingual Multimodality Paper • 2505.08751 • Published May 13 • 12