This model was converted with mlx_vlm from ByteDance-Seed/UI-TARS-1.5-7B.
## Model Description
UI-TARS-1.5 is ByteDance's open-source multimodal agent built upon a powerful vision-language model. It is capable of effectively performing diverse tasks within virtual worlds.
The released UI-TARS-1.5-7B focuses primarily on enhancing general computer-use capabilities and is not specifically optimized for game-based scenarios, where the full UI-TARS-1.5 still holds a significant advantage.
| Benchmark Type | Benchmark | UI-TARS-1.5-7B | UI-TARS-1.5 | 
|---|---|---|---|
| Computer Use | OSWorld | 27.5 | 42.5 | 
| GUI Grounding | ScreenSpotPro | 49.6 | 61.6 | 
P.S. The table above shows the performance of UI-TARS-1.5-7B and UI-TARS-1.5 on OSWorld and ScreenSpotPro.
## Quick Start
```bash
mlx_vlm.generate --model flin775/UI-Tars-1.5-7B-4bit-mlx \
  --max-tokens 1024 \
  --temperature 0.0 \
  --prompt "List all contacts' names and their corresponding grounding boxes ([x1, y1, x2, y2]) from the left sidebar of the IM chat interface, and return the results in JSON format." \
  --image https://wechat.qpic.cn/uploads/2016/05/WeChat-Windows-2.11.jpg
```
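If you prefer calling the model from Python, mlx-vlm also exposes a `load`/`generate` API. The sketch below follows the usage pattern documented in the upstream mlx-vlm README; the exact `generate` keyword arguments vary between mlx-vlm versions, and `screenshot.png` is a placeholder path, not a file shipped with this repo.

```python
# Minimal Python sketch using the mlx-vlm API (load / generate).
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "flin775/UI-Tars-1.5-7B-4bit-mlx"

# Load the quantized model, its processor, and the config used by the chat template.
model, processor = load(model_path)
config = load_config(model_path)

image = ["screenshot.png"]  # placeholder: path or URL of a UI screenshot
prompt = (
    "List all contacts' names and their corresponding grounding boxes "
    "([x1, y1, x2, y2]) from the left sidebar of the IM chat interface, "
    "and return the results in JSON format."
)

# Wrap the raw prompt in the model's chat template before generation.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(image))

# Generation parameters such as max tokens or temperature can be passed as
# keyword arguments, but their names differ across mlx-vlm versions.
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```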
## Model tree for flin775/UI-Tars-1.5-7B-4bit-mlx

- Base model: ByteDance-Seed/UI-TARS-1.5-7B