GlimpsePrune: Dynamic Visual Token Pruning for Large Vision-Language Models

GlimpsePrune is a dynamic visual token pruning framework designed for Large Vision-Language Models (LVLMs). This model was presented in the paper A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models.

Existing methods for visual token compression typically adopt fixed compression ratios that cannot adapt to scenes of varying complexity. This often leads to imprecise pruning that discards informative visual tokens and degrades model performance. Inspired by human cognition, GlimpsePrune addresses this issue by taking a data-driven "glimpse" and pruning irrelevant visual tokens in a single forward pass before answer generation. The approach prunes 92.6% of visual tokens while, on average, fully retaining baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: the enhanced GlimpsePrune+ reaches 110% of baseline performance while maintaining a similarly high pruning rate. This work paves a new way for building more powerful and efficient LVLMs.

For the official code and more details, please refer to the GitHub repository.


GlimpsePrune dynamically prunes a large number of irrelevant visual tokens before answering questions, reducing the model's inference overhead.

✨ Key Features

  • High Pruning Rate: Prunes over 90% of visual tokens on average with almost no performance loss, effectively reducing computational and memory overhead.
  • Robust Performance: Remains stable when processing high-resolution images and handling complex free-form VQA tasks.
  • Lightweight Training: Only a few extra parameters (the glimpse token and the VIP) need to be trained; training completes in under 1 hour on a single A100 GPU (see the sketch after this list).
  • Broad Compatibility: Supports single and multi-image inputs, is compatible with KV-Cache and Flash Attention 2, and provides a fair comparison benchmark with other mainstream visual compression methods.
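
To illustrate what "lightweight training" means in practice, below is a minimal sketch, not the official training code: the base LVLM is frozen and only the newly added parameters are optimized. The attribute name fragments "glimpse_token" and "vip" and the learning rate are assumptions for illustration; the real module names and hyperparameters are in the GitHub repository.

import torch

def make_lightweight_optimizer(model, lr=1e-4):
    """Freeze the base LVLM and optimize only the extra modules (sketch).

    The substrings "glimpse_token" and "vip" are hypothetical parameter-name
    markers used here for illustration only.
    """
    for param in model.parameters():
        param.requires_grad = False
    for name, param in model.named_parameters():
        if "glimpse_token" in name or "vip" in name:
            param.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)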

πŸ–ΌοΈ Framework Overview

The core idea of GlimpsePrune is to introduce a glimpse token and a lightweight Visual tokens Important Predictor (VIP) that can quickly identify and retain the visual regions most relevant to the text prompt, pruning the remaining redundant information.
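
To make the idea concrete, here is a conceptual sketch of prompt-conditioned token pruning: each visual token is scored against a representation of the text prompt, and only tokens above a threshold are kept, so the number of retained tokens adapts to the input. This is an illustration only, not the actual VIP architecture or scoring rule, which are defined in the paper and repository.

import torch
import torch.nn.functional as F

def prune_visual_tokens(visual_tokens, text_embedding, keep_threshold=0.5):
    """Conceptual sketch: keep only visual tokens relevant to the prompt.

    visual_tokens:  (num_tokens, dim) image features
    text_embedding: (dim,) pooled representation of the text prompt
    The cosine-similarity + sigmoid scoring used here is illustrative only.
    """
    scores = torch.sigmoid(
        F.cosine_similarity(visual_tokens, text_embedding.unsqueeze(0), dim=-1)
    )
    keep_mask = scores >= keep_threshold  # number of kept tokens adapts to the scene
    return visual_tokens[keep_mask], keep_mask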

πŸ“Š Performance Results

We evaluated GlimpsePrune on multiple VQA benchmarks. The results show that it achieves a high pruning rate while maintaining performance on par with the original model, outperforming other visual compression methods.

Free-form VQA Benchmarks

Short-form VQA Benchmarks

πŸ“¦ Models and Data

Model Download

All models can be automatically downloaded from the Hugging Face Hub. The weights released in this repository contain only the extra glimpse token and VIP modules we trained; the base LVLM weights are loaded separately.
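
For example, the new-module weights can be fetched with huggingface_hub (a minimal sketch; the returned local directory is then passed to load_new_modules as shown in the usage example below):

from huggingface_hub import snapshot_download

# Download the extra glimpse token / VIP weights from this repository
new_modules_dir = snapshot_download("ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct")
print(new_modules_dir)  # local path usable with model.load_new_modules(...)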

▢️ How to Use

You can use GlimpsePrune through the transformers_gp package provided in the GitHub repository.

from transformers_gp.models.qwen2_5_vl import (
    Qwen2_5_VL_GP_ForConditionalGeneration,
    Qwen2_5_VL_GP_Processor
)
from qwen_vl_utils import process_vision_info
from huggingface_hub import snapshot_download
import torch

# Load the base model and processor
base_model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
new_model_name = "ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct"

model = Qwen2_5_VL_GP_ForConditionalGeneration.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map={"": "cuda:0"},
)
processor = Qwen2_5_VL_GP_Processor.from_pretrained(base_model_name)

# Download and load the trained glimpse token and VIP weights
# (load_new_modules is assumed here to take a local directory path)
new_modules_dir = snapshot_download(new_model_name)
model.load_new_modules(new_modules_dir)
model.eval()

# Prepare messages (image and text input)
question = "What kind of a tie is the groom wearing?"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "../examples/people.png",  # Placeholder: replace with your image path
            },
            {"type": "text", "text": question},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate output
model.reset_image_tokens_cache()  # NOTE: reset the cache before inference
with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=1024, do_selection=True)  # do_selection=True enables glimpse pruning

# Decode and print the response
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(f"User: {question}
Assistant: {output_text[0]}")

πŸ–ŠοΈ Citation

If you find our work helpful, please consider citing our paper:

@misc{zeng2025glimpseprune,
      title={A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models}, 
      author={Quan-Sheng Zeng and Yunheng Li and Qilong Wang and Peng-Tao Jiang and Zuxuan Wu and Ming-Ming Cheng and Qibin Hou},
      year={2025},
      eprint={2508.01548},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.01548}, 
}