Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision‑Language Models via Geometric Surrogate Tasks

📢 News

  • [10/24/2025] We trained Qwen3VL (4B, 8B, and 30B) using Euclid30K, and the results show that the models also achieve significant gains across various spatial intelligence tasks. The weights of the fine-tuned models are available here.
| Model | SuperClevr | Omni3D Bench | VSIBench | MindCube |
|---|---|---|---|---|
| Qwen3VL-4B | 55.36 | 27.74 | 35.51 | 26.11 |
| Qwen3VL-Euclid-4B | 61.24 (+5.88) | 31.74 (+4.00) | 42.26 (+6.75) | 32.98 (+6.87) |
| Qwen3VL-8B | 48.30 | 34.01 | 33.25 | 34.16 |
| Qwen3VL-Euclid-8B | 48.96 (+0.66) | 35.03 (+1.02) | 35.54 (+2.29) | 41.02 (+6.86) |
| Qwen3VL-30B | 64.12 | 36.71 | 40.00 | 39.75 |
| Qwen3VL-Euclid-30B | 70.18 (+6.06) | 38.90 (+2.19) | 45.80 (+5.80) | 40.68 (+0.93) |

Qwen3VL and Qwen3VL-Euclid are evaluated using the same prompting template defined in test/eval_qwen.sh to ensure a fair comparison.
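The gains in parentheses are plain differences between the fine-tuned and base scores. As a small sanity check, the sketch below recomputes them from the reported numbers. The names `SCORES`, `gains`, and `mean_gain` are illustrative and not part of this repository.

```python
# Scores copied from the results table: (base model, Euclid-tuned model),
# in the order SuperClevr, Omni3D Bench, VSIBench, MindCube.
BENCHMARKS = ["SuperClevr", "Omni3D Bench", "VSIBench", "MindCube"]
SCORES = {
    "4B": ([55.36, 27.74, 35.51, 26.11], [61.24, 31.74, 42.26, 32.98]),
    "8B": ([48.30, 34.01, 33.25, 34.16], [48.96, 35.03, 35.54, 41.02]),
    "30B": ([64.12, 36.71, 40.00, 39.75], [70.18, 38.90, 45.80, 40.68]),
}


def gains(base, tuned):
    """Per-benchmark gain in percentage points, rounded to two decimals."""
    return [round(t - b, 2) for b, t in zip(base, tuned)]


def mean_gain(base, tuned):
    """Unweighted mean of the four per-benchmark gains."""
    g = gains(base, tuned)
    return sum(g) / len(g)


if __name__ == "__main__":
    for size, (base, tuned) in SCORES.items():
        per_benchmark = dict(zip(BENCHMARKS, gains(base, tuned)))
        print(size, per_benchmark, "mean gain:", round(mean_gain(base, tuned), 2))
```

The unweighted mean is only one possible summary; it treats the four benchmarks as equally important, which is an assumption rather than something this README specifies.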

Abstract

Spatial intelligence spans abilities such as visualizing and transforming shapes, mental rotation, reasoning about relative positions and containment, and counting/estimation. These remain challenging for modern Multimodal Large Language Models (MLLMs). We propose solving Euclidean geometry problems as a surrogate task and construct Euclid30K, a dataset of roughly 30K 2D and 3D geometry questions. We then fine‑tune Qwen2.5‑VL and RoboBrain2.0 models with Group Relative Policy Optimization (GRPO), enabling the models to internalize and apply Euclidean principles for shape recognition, counting, relation extraction, and multi‑step deductive reasoning. Without task‑specific adaptations, our models achieve significant zero‑shot gains on four spatial‑reasoning benchmarks: Super‑CLEVR, Omni3DBench, VSI‑Bench, and MindCube. For example, on VSI‑Bench, average accuracy improves from 34.5% to 40.5% (+5.5 percentage points); RoboBrain2.0‑Euclid‑7B reaches 49.6%, surpassing the previous SOTA (Spatial‑MLLM).
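For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The following is a minimal sketch of that group-relative normalization, not the training code used here. The function name, the `eps` constant, and the choice of population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std and this eps value is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one geometry question: 1.0 = correct, 0.0 = wrong.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group receives the same reward, all advantages are zero, so that prompt contributes no gradient signal in this formulation.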


Quick Start

1) Environment Setup

Training

Evaluation

2) Training

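Below is an example training command. As written, it requests 8 GPUs per node on 2 nodes (trainer.n_gpus_per_node=8, trainer.nnodes=2); set trainer.nnodes=1 to run on a single 8‑GPU machine. For multi‑node multi‑GPU training, see the example script train/dist_train.sh.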

python3 -m verl.trainer.main \
    config=examples/config.yaml \
    data.train_files=/mnt/datasets/Euclid30K/Euclid30K_train.parquet \
    data.val_files=/mnt/datasets/Euclid30K/Euclid30K_val.parquet \
    worker.actor.model.model_path=/mnt/models/Qwen2.5-VL-7B-Instruct \
    trainer.experiment_name=EXPERIMENT_NAME \
    worker.actor.micro_batch_size_per_device_for_update=1 \
    worker.actor.micro_batch_size_per_device_for_experience=8 \
    worker.actor.clip_ratio_low=0.2 \
    worker.actor.clip_ratio_high=0.28 \
    worker.reward.reward_function=/mnt/code/Euclids_Gift/train/euclid.py:compute_score \
    trainer.total_epochs=10 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.save_checkpoint_path=/mnt/models/Qwen2.5-VL-7B-Euclid
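The command sets different lower and upper clip ratios (0.2 and 0.28). One common reading of such a pair is an asymmetric PPO-style clip, where the probability ratio is limited to the range [1 - low, 1 + high]. The sketch below illustrates that reading for a single token. It is an assumption about what these options mean, not the loss implemented in the trainer, and `clipped_surrogate` is a hypothetical name.

```python
def clipped_surrogate(ratio, advantage, clip_low=0.2, clip_high=0.28):
    """PPO-style clipped objective for one token, with asymmetric bounds.

    ratio     : new_policy_prob / old_policy_prob for the sampled token
    advantage : e.g. a group-relative advantage
    The ratio is clipped to [1 - clip_low, 1 + clip_high] (assumed mapping
    of the clip_ratio_low / clip_ratio_high options), and the pessimistic
    (smaller) of the clipped and unclipped terms is returned.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Compare how the objective behaves on both sides of the bounds.
    for ratio in (0.5, 1.0, 1.25, 1.5):
        print(ratio, clipped_surrogate(ratio, 1.0), clipped_surrogate(ratio, -1.0))
```

Under this reading, a larger upper bound leaves more room to raise the probability of well-rewarded tokens before clipping takes effect, while the lower bound stays at the conventional 0.2.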

3) Evaluation

Use test/eval_qwen.sh, test/eval_robo.sh, and test/eval_euclid.sh to evaluate the Qwen2.5‑VL series, the RoboBrain 2.0 series, and Euclid models trained on Euclid30K, respectively.

Before running these scripts, set model_path in each script to the path of the model you want to evaluate.

Note that our VSIBench evaluation differs from the original setup. The original limits the model’s output to 16 tokens and asks for the final answer directly; we allow up to 1024 tokens and instruct the model to think first and then answer, so that responses include traceable reasoning and the explanations expected in real-world use.
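With a think-then-answer format, scoring requires separating the final answer from the reasoning that precedes it. The sketch below shows one way this could be done. It assumes the answer is wrapped in `<answer>...</answer>` tags, which is a hypothetical convention; the actual template is defined in the evaluation scripts and may differ. `extract_final_answer` is likewise an illustrative name.

```python
import re

# Hypothetical answer-tag convention; check the evaluation scripts for the real one.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_final_answer(response):
    """Return the final answer from a reasoning-style response.

    Prefers the last <answer>...</answer> span; if no tag is present,
    falls back to the last non-empty line of the response.
    """
    matches = ANSWER_RE.findall(response)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1] if lines else ""


if __name__ == "__main__":
    sample = "<think>The sofa is left of the table.</think><answer> B </answer>"
    print(extract_final_answer(sample))
```

The fallback to the last non-empty line is a design choice for responses that omit the tags; a stricter evaluator could instead count such responses as incorrect.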

Citation

If you find this project or the dataset helpful, please cite:

@misc{Euclids_Gift,
    title={Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks},
    author={Shijie Lian and Changti Wu and Laurence Tianruo Yang and Hang Yuan and Bin Yu and Lei Zhang and Kai Chen},
    year={2025},
    eprint={2509.24473},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2509.24473}
}