Want to learn how to align a Vision Language Model (VLM) for reasoning using GRPO and TRL? 🌋🧑‍🍳 We've got you covered! NEW multimodal post-training recipe to align a VLM using TRL in @HuggingFace's Cookbook. Go to the recipe 👉 https://huggingface.co/learn/cookbook/fine_tuning_vlm_grpo_trl Powered by the latest TRL v0.20 release, this recipe shows how to teach Qwen2.5-VL-3B-Instruct to reason over images 🌋
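As a rough illustration of what GRPO training on a VLM involves: GRPO scores sampled completions with rule-based reward functions, and a common one checks that the model wraps its reasoning and answer in the expected tags. The sketch below is an assumption about the recipe's reward design, not its exact code; the function name, tag format, and the `trl.GRPOTrainer` wiring shown in the comment are illustrative only.

```python
import re

# Hypothetical format reward of the kind used in GRPO recipes: give 1.0 when a
# completion follows the <think>...</think><answer>...</answer> template, else 0.0.
# TRL reward functions take a list of completions and return one score per completion.
THINK_ANSWER_RE = re.compile(
    r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL
)

def format_reward(completions, **kwargs):
    """Score each completion on whether it matches the reasoning template."""
    return [
        1.0 if THINK_ANSWER_RE.fullmatch(c.strip()) else 0.0
        for c in completions
    ]

# A function like this would typically be passed to the trainer, e.g. (sketch):
#   trainer = trl.GRPOTrainer(
#       model="Qwen/Qwen2.5-VL-3B-Instruct",
#       reward_funcs=[format_reward],  # plus task-specific correctness rewards
#       args=trl.GRPOConfig(output_dir="out"),
#       train_dataset=dataset,
#   )

if __name__ == "__main__":
    demo = [
        "<think>The volcano is erupting.</think><answer>eruption</answer>",
        "just a plain answer with no tags",
    ]
    print(format_reward(demo))
```

In practice a format reward like this is combined with a correctness reward (e.g. exact-match on the extracted answer), and GRPO normalizes the summed rewards within each group of sampled completions.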
gpt-oss Collection Open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. • 2 items • Updated Aug 7 • 381
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO Paper • 2506.07464 • Published Jun 9 • 14 • 3
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning Paper • 2506.09985 • Published Jun 11 • 29
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification Paper • 2506.15569 • Published Jun 18 • 12
Time Blindness: Why Video-Language Models Can't See What Humans Can? Paper • 2505.24867 • Published May 30 • 80
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought Paper • 2505.16192 • Published May 22 • 12
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation Paper • 2505.14640 • Published May 20 • 16
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning Paper • 2505.15966 • Published May 21 • 53
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning Paper • 2505.14231 • Published May 20 • 52
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning Paper • 2505.11049 • Published May 16 • 60
Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA Paper • 2505.06356 • Published May 9 • 3
Aya Vision: Advancing the Frontier of Multilingual Multimodality Paper • 2505.08751 • Published May 13 • 12