Model Card for V2P: Valley-to-Peak GUI Grounding Model
Model Details
- Model Name: V2P (Valley-to-Peak)
- Version: 1.0
- Model Type: GUI Grounding / UI Element Localization
- Developers: Jikai Chen, Long Chen, Dong Wang, Zhixuan Chu, Qinglin Su, Leilei Gan, Chenyi Zhuang, Jinjie Gu
Model Description
V2P (Valley-to-Peak) is an advanced model designed for robust and precise Graphical User Interface (GUI) element localization (grounding). In the field of GUI automation agents, accurately identifying interactive elements on a screen is critical. Traditional methods like bounding box regression or center-point prediction often overlook the spatial uncertainty of interaction and the hierarchical visual-semantic relationships, leading to insufficient localization accuracy.
The V2P model was developed to address two major pain points in existing visual methods:
- Attention Drift due to Background Interference: The model's attention mistakenly disperses to irrelevant background areas.
- Imprecise Click Locations: The model fails to distinguish between the center and the edges of a target element, leading to interaction failures.
Inspired by human visual processing and interaction with GUIs, V2P introduces two innovative core mechanisms:
"Valley" Background Suppression: V2P employs a novel suppressive attention mechanism that actively minimizes the model's focus on irrelevant background regions. This significantly "pushes down" the weight of the background (forming a valley), which in turn "highlights" the target area (forming a peak), fundamentally solving the issue of attention drift.
"Peak" Center Focusing: Drawing inspiration from the classic Fitts' Law, V2P models the GUI interaction process as a 2D Gaussian heatmap. The model is trained to predict a weight distribution that peaks at the center of the target element and gradually decays towards the edges. This design enables the model to learn to focus on the most central and suitable area for clicking, rather than the entire ambiguous region, thereby dramatically improving click precision.
Intended Use
- Primary Use Case: This model is primarily intended for GUI automation agents, providing them with precise visual localization capabilities. It can take a visual interface (screenshot) and an instruction (e.g., "click the 'login' button") as input and output the most suitable interaction coordinates for the target element.
- Target Applications:
- Automated software testing
- Robotic Process Automation (RPA)
- Assistive technology development (e.g., UI interaction tools for people with disabilities)
- UI/UX design analysis
How to Use
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load model and processor from Hugging Face Hub
model_id = "inclusionAI/V2P-7B"
# V2P-7B is built on Qwen/Qwen2.5-VL-7B-Instruct, so the Qwen2.5-VL classes are used here
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # use torch.bfloat16 if your hardware supports it
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
# Prepare the inputs: an image and a text prompt
# Replace this example URL with a screenshot of the UI you want to ground
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt_text = "Find the location of the 'Settings' button on this screen."
# Format the prompt with the chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # image placeholder; the actual pixels are passed to the processor below
            {"type": "text", "text": prompt_text},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)
# Generate the response (greedy decoding for deterministic coordinates)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False,
)
# Decode and print the output
# We need to slice the generated IDs to exclude the prompt tokens
input_token_len = inputs['input_ids'].shape[1]
output_ids = generated_ids[0][input_token_len:]
output_text = processor.decode(output_ids, skip_special_tokens=True)
print(output_text)
# For more visualization code, please refer to the V2P GitHub repository
```
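The exact output format of V2P is not specified here, so the snippet below (continuing from the code above) only sketches one way to consume the result: it assumes the model returns a click point as two numbers, e.g. `(x, y)` in pixel or normalized coordinates, extracts them with a regular expression, and marks the point on the screenshot. Adjust the parsing to whatever format the released model actually emits.

```python
import re
from PIL import ImageDraw

# Assumed format: the first two numbers in the output are the click coordinates
match = re.search(r"(-?\d+(?:\.\d+)?)\D+(-?\d+(?:\.\d+)?)", output_text)
if match:
    x, y = float(match.group(1)), float(match.group(2))
    # If the model emits normalized coordinates in [0, 1], rescale to pixels
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        x, y = x * image.width, y * image.height
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    draw.ellipse([x - 6, y - 6, x + 6, y + 6], outline="red", width=3)  # circle the predicted click point
    annotated.save("v2p_prediction.png")
    print(f"Predicted click point: ({x:.1f}, {y:.1f})")
else:
    print("Could not parse coordinates from the model output:", output_text)
```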
Model Tree for inclusionAI/V2P-7B
- Base model: Qwen/Qwen2.5-VL-7B-Instruct