---
pipeline_tag: image-text-to-text
library_name: transformers
license: cc-by-nc-4.0
---

# COOPER 🧭

[📄 Paper (arXiv)](https://arxiv.org/abs/2512.04563) | [🤗 Paper (Hugging Face)](https://huggingface.co/papers/2512.04563) | [💻 Code](https://github.com/zhangzef/COOPER) | [🤖 COOPER Model](https://huggingface.co/Starrrrrry/COOPER) | [🧠 COOPER-AMG Model](https://huggingface.co/Starrrrrry/COOPER-AMG) | [📂 COOPER Training Data](https://huggingface.co/datasets/Starrrrrry/COOPER_Train_Set)

This project provides the official implementation of COOPER, a unified multimodal large language model for visual spatial intelligence that cooperatively couples perception and reasoning. Built on top of the BAGEL framework, COOPER endows a single model with intrinsic perception enhancement (e.g., depth estimation and semantic segmentation) and reasoning enhancement via multimodal chain-of-thought. We further extend COOPER with reinforcement learning and a cooperative perception–reasoning reward, enabling the model to adaptively decide when to “perceive” and when to “reason” during inference.

*Figure: COOPER model overview.*
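
To make the "perceive vs. reason" idea concrete, here is a minimal conceptual sketch. This is not COOPER's actual interface; every name below is hypothetical, and the perception policy is a toy stand-in for what COOPER learns via GRPO with the cooperative perception–reasoning reward:

```python
# Conceptual sketch only: all names are hypothetical, not COOPER's real API.
from dataclasses import dataclass, field

@dataclass
class Trace:
    question: str
    steps: list[str] = field(default_factory=list)

def wants_perception(trace: Trace) -> bool:
    """Stand-in for the learned policy: COOPER is trained to decide when an
    auxiliary modality (e.g., a depth map or segmentation mask) would help."""
    return len(trace.steps) == 0  # toy rule: perceive once, then reason

def perceive(trace: Trace) -> None:
    # Intrinsic perception enhancement: generate an auxiliary modality.
    trace.steps.append("<auxiliary modality: depth map / segmentation mask>")

def reason(trace: Trace) -> str:
    # Reasoning enhancement: multimodal chain-of-thought over image + auxiliaries.
    trace.steps.append("multimodal chain-of-thought step")
    return "final answer"

trace = Trace("Which object is closer to the camera, the mug or the laptop?")
while wants_perception(trace):
    perceive(trace)
print(reason(trace))
```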

## 🚀 Key Features

- **🧠 GRPO Training for BAGEL via TRL** (see the sketch after this list):
  - Fine-tune BAGEL-style multimodal models with RL-style objectives.
  - Optimize perception–reasoning behavior directly from feedback signals.
  - Seamlessly extend from supervised multimodal CoT training to RL-based refinement.
- **📊 VLMEvalKit Integration for BAGEL**:
  - One-line evaluation on a wide range of multimodal benchmarks.
  - Unified interfaces for dataset loading, inference, and result aggregation.
  - Direct comparison with other VLMs under consistent evaluation protocols.
- **🧩 SIBench (Single-Image Part) + GPT/Deepseek Answer Extraction**:
  - Fully integrated into VLMEvalKit as a first-class evaluation task.
  - Equipped with GPT/Deepseek-based answer extractors that:
    - Robustly parse free-form model outputs.
    - Reduce evaluation noise from formatting and phrasing.
    - Provide more accurate and reliable spatial reasoning scores.
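
The GRPO integration follows TRL's trainer interface. Below is a minimal, hypothetical sketch using stock `trl` (the repo ships its own patched `./trl`, and the real cooperative perception–reasoning reward lives in the training scripts); the model id, dataset, and reward function are placeholders:

```python
# Hypothetical GRPO sketch with TRL; not the repo's actual training script.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def cooperative_reward(completions, **kwargs):
    # Placeholder reward: the real reward couples perception quality with
    # reasoning correctness; here we only reward a well-formed answer tag.
    return [1.0 if "<answer>" in c else 0.0 for c in completions]

train_dataset = Dataset.from_dict(
    {"prompt": ["Where is the mug relative to the laptop?"]}
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # stand-in; COOPER trains a BAGEL-style model
    reward_funcs=cooperative_reward,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```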

## 🔥 Quick Start

### 1️⃣ Set up environment 🛠️

```bash
git clone https://github.com/zhangzef/COOPER.git
cd COOPER
conda create -n cooper python=3.10 -y
conda activate cooper
pip install -r requirements.txt
pip install flash_attn==2.5.8 --no-build-isolation
pip install -e ./transformers-4.54.0
pip install -e ./trl
```
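
After installation, an optional sanity check (assuming the pinned packages above installed cleanly) confirms that the patched `transformers`/`trl` and `flash_attn` are importable:

```python
# Optional environment sanity check.
import flash_attn
import torch
import transformers
import trl

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # expect the patched 4.54.0
print("trl:", trl.__version__)
print("flash_attn:", flash_attn.__version__)
```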

### 2️⃣ Download checkpoints and datasets 📥

```bash
cd models
# Download the pretrained BAGEL and its config files.
huggingface-cli download --resume-download --local-dir-use-symlinks False ByteDance-Seed/BAGEL-7B-MoT --local-dir BAGEL-7B-MoT

# Optional: download the COOPER-AMG checkpoint (trained with Auxiliary Modality Generation).
huggingface-cli download --resume-download --local-dir-use-symlinks False Starrrrrry/COOPER-AMG --local-dir COOPER-AMG

# Optional: download the COOPER checkpoint if you want to run inference with COOPER.
huggingface-cli download --resume-download --local-dir-use-symlinks False Starrrrrry/COOPER --local-dir COOPER

# Download the training data (without Hypersim).
# To train COOPER-AMG, first download the Hypersim dataset (https://github.com/apple/ml-hypersim).
cd ..
huggingface-cli download --resume-download --repo-type dataset Starrrrrry/COOPER_Train_Set --local-dir datasets
cd datasets
# Merge and extract the dataset with multiple threads (recommended, requires pigz):
cat COOPER_Train_Set.tar.gz.part.* | pigz -d | tar xf -
# OR merge with a single thread (if you don't have pigz):
cat COOPER_Train_Set.tar.gz.part.* | gzip -dc | tar xf -
```
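
To confirm the multi-part archive was fully downloaded and extracted, a small check like the following can help (paths simply mirror the commands above):

```python
# Check that all archive parts are present and extraction produced files.
from pathlib import Path

root = Path("datasets")  # local dir used by the huggingface-cli command above
parts = sorted(root.glob("COOPER_Train_Set.tar.gz.part.*"))
print(f"found {len(parts)} archive parts")

extracted_dirs = [p.name for p in root.iterdir() if p.is_dir()]
print("extracted top-level directories:", extracted_dirs)
```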

## 🔥 Train & Eval 🧪

### 🏋️ Train

```bash
# Training for Auxiliary Modality Generation from BAGEL.
# (Alternatively, download the COOPER-AMG checkpoint directly.)
sh ./scripts/train_mix.sh

# Training for interleaved-reasoning SFT.
sh ./scripts/train_reason_interleave_sft.sh

# Training for interleaved-reasoning GRPO.
sh ./scripts/train_reason_interleave_grpo.sh
```

### 📐 Eval

```bash
# You can edit the eval config in /VLMEvalKit/eval_cfg/bagel_with_judge.json.
# Set your OpenAI API key in eval_bagel_with_judge.sh and /VLMEvalKit/.env first.
cd VLMEvalKit
sh eval_bagel_with_judge.sh
```
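
For reference, GPT-based answer extraction in the judge stage works roughly like the following. This is a hypothetical sketch, not the repo's actual extractor; the judge model name and prompt wording are assumptions (configure the real judge in `bagel_with_judge.json`):

```python
# Hypothetical sketch of GPT-based answer extraction for free-form outputs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY, e.g., from /VLMEvalKit/.env

def extract_answer(question: str, prediction: str) -> str:
    """Ask a judge model to pull the final choice out of a free-form reply,
    reducing evaluation noise from formatting and phrasing."""
    prompt = (
        "Extract the final answer option (a single letter) from the model "
        f"output below.\nQuestion: {question}\nModel output: {prediction}\n"
        "Reply with the letter only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(extract_answer(
    "Is the chair left or right of the desk? (A) left (B) right",
    "Looking at the depth cues... so the answer should be (B).",
))
```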

## 📈 Results

*Figure: main results.*

## 📚 Cases

You can find more cases in the ./assets folder.

*Figure: example cases.*

*Figure: generation cases.*

## ✍️ Citation

```bibtex
@article{zhang2025cooper,
  title={COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence},
  author={Zhang, Zefeng and Hao, Xiangzhao and Tang, Hengzhu and Zhang, Zhenyu and Sheng, Jiawei and Li, Xiaodong and Li, Zhenyang and Gao, Li and Shi, Daiting and Yin, Dawei and others},
  journal={arXiv preprint arXiv:2512.04563},
  year={2025}
}
```