Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision‑Language Models via Geometric Surrogate Tasks
📢 News
- [10/24/2025] We trained Qwen3VL (4B, 8B, and 30B) using Euclid30K, and the results show that the models also achieve significant gains across various spatial intelligence tasks. The weights of the fine-tuned models are available here.
| Model | Super-CLEVR | Omni3DBench | VSI-Bench | MindCube |
|---|---|---|---|---|
| Qwen3VL-4B | 55.36 | 27.74 | 35.51 | 26.11 |
| Qwen3VL-Euclid-4B | 61.24 (+5.88) | 31.74 (+4.00) | 42.26 (+6.75) | 32.98 (+6.87) |
| Qwen3VL-8B | 48.30 | 34.01 | 33.25 | 34.16 |
| Qwen3VL-Euclid-8B | 48.96 (+0.66) | 35.03 (+1.02) | 35.54 (+2.29) | 41.02 (+6.86) |
| Qwen3VL-30B | 64.12 | 36.71 | 40.00 | 39.75 |
| Qwen3VL-Euclid-30B | 70.18 (+6.06) | 38.90 (+2.19) | 45.80 (+5.80) | 40.68 (+0.93) |
Qwen3VL and Qwen3VL-Euclid are evaluated using the same prompting template defined in test/eval_qwen.sh to ensure a fair comparison.
- [10/17/2025] Thanks to Synced (机器之心) for covering our work: WeChat article / Zhihu.
- [09/30/2025] We released our paper on arXiv and the Euclid30K dataset on Hugging Face.
Abstract
Spatial intelligence spans abilities such as visualizing and transforming shapes, mental rotation, reasoning about relative positions and containment, and counting/estimation. These remain challenging for modern Multimodal Large Language Models (MLLMs). We propose solving Euclidean geometry problems as a surrogate task and construct Euclid30K, a dataset of roughly 30K 2D and 3D geometry questions. We then fine‑tune Qwen2.5‑VL and RoboBrain2.0 models with Group Relative Policy Optimization (GRPO), enabling the models to internalize and apply Euclidean principles for shape recognition, counting, relation extraction, and multi‑step deductive reasoning. Without task‑specific adaptations, our models achieve significant zero‑shot gains on four spatial‑reasoning benchmarks: Super‑CLEVR, Omni3DBench, VSI‑Bench, and MindCube. For example, on VSI‑Bench, average accuracy improves from 34.5% to 40.5% (+6.0 percentage points); RoboBrain2.0‑Euclid‑7B reaches 49.6%, surpassing the previous SOTA (Spatial‑MLLM).
Quick Start
1) Environment Setup
Training
- Install EasyR1 following the official documentation.
- Install the required Python dependencies from our GitHub repository: `pip install -r requirements.txt`
- Download the Euclid30K dataset from Hugging Face: https://huggingface.co/datasets/LiamLian0727/Euclid30K (a consolidated setup sketch follows this list).
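Taken together, the three steps above look roughly like the sketch below. The EasyR1 repository URL, the editable-install step, and the local target directory `/mnt/datasets/Euclid30K` (chosen to match the training command in the next section) are assumptions; follow the official EasyR1 documentation for the supported installation.

```bash
# Minimal environment-setup sketch (repository URL and local paths are assumptions).

# 1) Install EasyR1 (see its official documentation for the supported setup).
git clone https://github.com/hiyouga/EasyR1.git
cd EasyR1 && pip install -e . && cd ..

# 2) Install this repository's Python dependencies.
pip install -r requirements.txt

# 3) Download Euclid30K from Hugging Face into a local directory.
huggingface-cli download LiamLian0727/Euclid30K \
    --repo-type dataset \
    --local-dir /mnt/datasets/Euclid30K
```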
Evaluation
- Install lmms‑eval following its official documentation. You can either:
  - use the `lmms-eval/` copy included in this repository; or
  - copy the four task folders provided under `test/lmms_eval/tasks/` into your existing lmms‑eval setup (see the copy sketch after this list).
- Download the benchmark datasets Super‑CLEVR, Omni3DBench, VSI‑Bench, and MindCube_lmms_eval; then update the dataset paths in each corresponding YAML under `test/lmms_eval/tasks/`.
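For the second option (reusing an existing lmms‑eval checkout), the copy step is roughly the following; `/path/to/lmms-eval` is a placeholder for your local installation, and the dataset paths still need to be edited by hand in each YAML.

```bash
# Sketch: copy the provided task definitions into an existing lmms-eval checkout
# (/path/to/lmms-eval is a placeholder for your local installation).
cp -r test/lmms_eval/tasks/* /path/to/lmms-eval/lmms_eval/tasks/

# Then open each copied YAML and point its dataset path at your local copy of
# Super-CLEVR, Omni3DBench, VSI-Bench, or MindCube_lmms_eval.
```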
2) Training
Below is an example training command (here, 2 nodes × 8 GPUs each, as set by trainer.nnodes and trainer.n_gpus_per_node). For multi‑node multi‑GPU training, see the example script train/dist_train.sh; a minimal Ray‑cluster launch sketch is also shown after the command.
```bash
python3 -m verl.trainer.main \
    config=examples/config.yaml \
    data.train_files=/mnt/datasets/Euclid30K/Euclid30K_train.parquet \
    data.val_files=/mnt/datasets/Euclid30K/Euclid30K_val.parquet \
    worker.actor.model.model_path=/mnt/models/Qwen2.5-VL-7B-Instruct \
    trainer.experiment_name=EXPERIMENT_NAME \
    worker.actor.micro_batch_size_per_device_for_update=1 \
    worker.actor.micro_batch_size_per_device_for_experience=8 \
    worker.actor.clip_ratio_low=0.2 \
    worker.actor.clip_ratio_high=0.28 \
    worker.reward.reward_function=/mnt/code/Euclids_Gift/train/euclid.py:compute_score \
    trainer.total_epochs=10 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.save_checkpoint_path=/mnt/models/Qwen2.5-VL-7B-Euclid
```
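Because the command sets trainer.nnodes=2, it expects a Ray cluster spanning both nodes to be up before the trainer starts (verl runs on top of Ray). The sketch below shows that launch pattern under this assumption, with HEAD_IP as a placeholder; train/dist_train.sh in the repository is the authoritative multi‑node launcher.

```bash
# Sketch of a 2-node Ray cluster launch (verl schedules workers through Ray).
# HEAD_IP is a placeholder for the head node's reachable IP address.

# On the head node:
ray start --head --port=6379

# On each worker node:
ray start --address=HEAD_IP:6379

# Then run the `python3 -m verl.trainer.main ...` command above on the head node.
```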
3) Evaluation
Use test/eval_qwen.sh, test/eval_robo.sh, and test/eval_euclid.sh to evaluate the Qwen2.5‑VL series, the RoboBrain 2.0 series, and Euclid models trained on Euclid30K, respectively.
Before running these scripts, set model_path in each script to the path of the model you want to evaluate.
Notably, our VSI‑Bench evaluation differs from the original setup: the original limits the model's output to 16 tokens and asks for the final answer directly, whereas we allow up to 1024 tokens and instruct the model to think first and then answer, so that responses provide traceable reasoning and the explanations expected in real-world use. A direct lmms‑eval invocation sketch is shown below.
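If you would rather call lmms‑eval directly than use the wrapper scripts, an invocation looks roughly like the sketch below. The process count, model path, and the task name `vsibench` are assumptions; check the `task:` field of the YAMLs under test/lmms_eval/tasks/ for the exact identifiers, and test/eval_qwen.sh for the generation settings we actually use.

```bash
# Sketch of a direct lmms-eval run (task name and paths are placeholders;
# the provided test/eval_*.sh scripts are the reference setup).
accelerate launch --num_processes=8 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=/mnt/models/Qwen2.5-VL-7B-Euclid \
    --tasks vsibench \
    --batch_size 1 \
    --log_samples \
    --output_path ./eval_logs/
```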
Citation
If you find this project or the dataset helpful, please cite:
```bibtex
@misc{Euclids_Gift,
      title={Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks},
      author={Shijie Lian and Changti Wu and Laurence Tianruo Yang and Hang Yuan and Bin Yu and Lei Zhang and Kai Chen},
      year={2025},
      eprint={2509.24473},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.24473}
}
```