Alligat0R: Pre-Training through Covisibility Segmentation for Relative Camera Pose Regression

NeurIPS 2025 Spotlight | Paper | Code | Dataset

Alligat0R is a pre-training approach for binocular vision tasks. Instead of the cross-view completion objective used by CroCo, it uses a covisibility segmentation objective: for each pixel in one image, the model predicts whether the corresponding 3D point is covisible, occluded, or outside the field of view in the other image.
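To make the three classes concrete, here is a minimal sketch of how such labels could be derived for a pair with known depth and relative pose. It is an illustration only and not the authors' annotation pipeline: the function name `covisibility_labels`, the shared intrinsics `K`, the relative depth tolerance `rel_tol`, and the choice to label points behind the second camera as outside the field of view are all assumptions.

```python
import numpy as np

COVISIBLE, OCCLUDED, OUTSIDE_FOV = 0, 1, 2  # same class order as in the usage example

def covisibility_labels(depth1, depth2, K, T_2_from_1, rel_tol=0.05):
    """Label every pixel of view 1 by its visibility in view 2.

    depth1, depth2: (H, W) depth maps; K: (3, 3) intrinsics shared by both views;
    T_2_from_1: (4, 4) rigid transform from camera-1 to camera-2 coordinates.
    rel_tol is a hypothetical tolerance for the occlusion test.
    """
    H, W = depth1.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)  # (3, N)

    # Back-project view-1 pixels to 3D, then move them into the camera-2 frame.
    pts1 = (np.linalg.inv(K) @ pix) * depth1.reshape(1, -1)
    pts2 = T_2_from_1[:3, :3] @ pts1 + T_2_from_1[:3, 3:4]
    z2 = pts2[2]
    in_front = z2 > 1e-6

    # Project into view 2 and round to the nearest pixel.
    proj = K @ pts2
    u2 = np.full(H * W, -1, dtype=np.int64)
    v2 = np.full(H * W, -1, dtype=np.int64)
    u2[in_front] = np.round(proj[0, in_front] / z2[in_front]).astype(np.int64)
    v2[in_front] = np.round(proj[1, in_front] / z2[in_front]).astype(np.int64)
    inside = in_front & (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)

    # Inside the image: covisible if not farther than the surface seen by view 2.
    labels = np.full(H * W, OUTSIDE_FOV, dtype=np.uint8)
    visible = z2[inside] <= depth2[v2[inside], u2[inside]] * (1.0 + rel_tol)
    labels[inside] = np.where(visible, COVISIBLE, OCCLUDED)
    return labels.reshape(H, W)

# Toy example: a fronto-parallel plane at depth 1 and a small sideways translation.
H, W = 4, 8
K = np.array([[100.0, 0.0, 4.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.03  # shifts projections by 3 pixels at depth 1
depth = np.ones((H, W))
labels = covisibility_labels(depth, depth, K, T)            # plane seen by both views
labels_occ = covisibility_labels(depth, 0.5 * depth, K, T)  # closer surface in view 2
```

In the toy example, the three rightmost columns of view 1 project past the image border of view 2 and are labeled outside the field of view; when view 2 sees a closer surface, the remaining pixels become occluded rather than covisible.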

Available Variants

This repository contains four pre-trained Alligat0R backbones. Each was trained with the covisibility segmentation objective only, before any pose fine-tuning:

Subfolder	Training Data	Image Size	Description
`nuscenes_cub3-50`	nuScenes (Cub3-50, >= 50% overlap)	288 x 512	Outdoor driving
`nuscenes_cub3-all`	nuScenes (Cub3-all, >= 5% overlap)	288 x 512	Outdoor driving, challenging pairs
`scannet_cub3-50`	ScanNet (Cub3-50, >= 50% overlap)	384 x 512	Indoor scenes
`scannet_cub3-all`	ScanNet (Cub3-all, >= 5% overlap)	384 x 512	Indoor scenes, challenging pairs

All models use a ViT-Large encoder (24 layers, 1024-dim) and a ViT-Base decoder (12 layers, 768-dim) with RoPE positional embeddings. Weights are stored in fp16 safetensors format (~798 MB each).

Usage

A complete demo with visualization is provided in demo.py in the GitHub repository.

Covisibility prediction

import torch
from PIL import Image
from torchvision import transforms
from reloc3r.alligat0r import Alligat0R

device = "cuda"
model = Alligat0R.from_pretrained(
    "thibautloiseau/alligat0r",
    subfolder="scannet_cub3-all",
    device=device,
)

img_size = (384, 512)  # use (288, 512) for nuScenes variants
tf = transforms.Compose([
    transforms.Resize(img_size),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
view1 = {"img": tf(Image.open("img1.jpg").convert("RGB")).unsqueeze(0).to(device)}
view2 = {"img": tf(Image.open("img2.jpg").convert("RGB")).unsqueeze(0).to(device)}

with torch.no_grad():
    seg1, seg2 = model(view1, view2)  # (B, 3, H, W) logits per view

cov1 = seg1.argmax(1)[0].cpu().numpy()  # 0=covisible, 1=occluded, 2=outside-FOV
cov2 = seg2.argmax(1)[0].cpu().numpy()
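A simple way to summarize a predicted map is the share of pixels assigned to each class, which gives a rough overlap estimate for the pair. The sketch below relies only on the class order given in the comment above; the helper names `class_fractions` and `colorize` and the color palette are illustrative and not part of the repository.

```python
import numpy as np

CLASS_NAMES = ("covisible", "occluded", "outside_fov")  # index order from the example above
# Hypothetical RGB palette for visualization: green, orange, gray.
PALETTE = np.array([[0, 200, 0], [255, 140, 0], [128, 128, 128]], dtype=np.uint8)

def class_fractions(cov):
    """Return the fraction of pixels assigned to each of the 3 classes."""
    counts = np.bincount(cov.ravel(), minlength=3)[:3]
    return dict(zip(CLASS_NAMES, (counts / cov.size).tolist()))

def colorize(cov):
    """Map an (H, W) class-index array to an (H, W, 3) uint8 RGB image."""
    return PALETTE[cov]

# Toy 2 x 4 class map standing in for `cov1`.
toy = np.array([[0, 0, 1, 2],
                [0, 0, 2, 2]])
fractions = class_fractions(toy)
rgb = colorize(toy)
```

With the real outputs, `class_fractions(cov1)["covisible"]` would estimate how much of the first image is also seen in the second.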

Encoder feature extraction

The pre-trained encoder can be used as a feature backbone for downstream tasks:

feat1, feat2, pos1, pos2 = model.encode(view1, view2)
# feat1, feat2: (B, N_patches, 1024)  — ViT-L encoder features
# pos1, pos2:   (B, N_patches, 2)     — 2-D patch positions

# To get intermediate features from all 24 encoder blocks:
feat1_all, feat2_all, pos1, pos2 = model.encode(view1, view2, return_all_blocks=True)
# feat1_all: list of 24 tensors, each (B, N_patches, 1024)
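With the patch size of 16 listed under Architecture, N_patches follows directly from the input size. The sketch below shows that arithmetic and reshapes a token sequence back into a 2-D feature map. It assumes tokens are stored in row-major patch order, which should be checked against `pos1` before relying on it; the helper names are illustrative.

```python
import numpy as np

PATCH = 16  # encoder patch size listed under Architecture

def patch_grid(height, width, patch=PATCH):
    """Return (rows, cols) of the patch grid for a given input size."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch

def tokens_to_map(tokens, grid):
    """Reshape (B, N, C) tokens to a (B, C, h, w) map, assuming row-major order."""
    b, n, c = tokens.shape
    h, w = grid
    if n != h * w:
        raise ValueError("token count does not match the patch grid")
    return tokens.reshape(b, h, w, c).transpose(0, 3, 1, 2)

grid_scannet = patch_grid(384, 512)    # (24, 32), i.e. 768 patches
grid_nuscenes = patch_grid(288, 512)   # (18, 32), i.e. 576 patches

# Dummy array standing in for encoder features converted to NumPy.
dummy = np.zeros((1, grid_scannet[0] * grid_scannet[1], 1024), dtype=np.float32)
feat_map = tokens_to_map(dummy, grid_scannet)  # shape (1, 1024, 24, 32)
```

A torch tensor such as `feat1` would first need to be moved to the CPU and converted to a NumPy array, or reshaped with the equivalent torch operations.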

Fine-Tuning for Pose Regression

These pre-trained weights serve as initialization for downstream tasks. To fine-tune for relative pose regression, use the training script from the GitHub repository:

torchrun --nproc_per_node=4 finetune_pose.py \
    --mode alligat0r_pose \
    --dataset scannet \
    --overlap all \
    --load_pretrained_default

Architecture

Encoder: ViT-Large (24 layers, 1024-dim, 16 heads, patch size 16)
Decoder: ViT-Base (12 layers, 768-dim, 12 heads) with cross-attention
Segmentation head: linear projection from decoder features to 3-class per-pixel predictions
Positional encoding: RoPE (freq=100)
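As an illustration of the segmentation head described above, the following sketch maps each decoder token to a 16 x 16 block of 3-class logits and tiles the blocks into a full-resolution map. The per-token output layout (class, then row, then column within the patch) and the function name `linear_seg_head` are assumptions made for this example; the repository may arrange the output differently.

```python
import numpy as np

def linear_seg_head(tokens, weight, bias, grid, patch=16, num_classes=3):
    """Project decoder tokens to per-pixel class logits.

    tokens: (N, D) decoder features for one image, row-major over the patch grid.
    weight: (D, num_classes * patch * patch); bias: (num_classes * patch * patch,).
    Returns logits of shape (num_classes, h * patch, w * patch).
    """
    h, w = grid
    out = tokens @ weight + bias                        # (N, C * p * p)
    out = out.reshape(h, w, num_classes, patch, patch)  # one (C, p, p) block per patch
    out = out.transpose(2, 0, 3, 1, 4)                  # (C, h, p, w, p)
    return out.reshape(num_classes, h * patch, w * patch)

# Random weights, only to show the shapes for a 384 x 512 input.
rng = np.random.default_rng(0)
grid = (24, 32)
tokens = rng.standard_normal((grid[0] * grid[1], 768))
weight = rng.standard_normal((768, 3 * 16 * 16)) * 0.02
bias = np.zeros(3 * 16 * 16)
logits = linear_seg_head(tokens, weight, bias, grid)  # (3, 384, 512)
pred = logits.argmax(axis=0)                          # (384, 512) class indices
```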

The architecture is symmetric: both images are processed identically and without masking, unlike CroCo, which uses asymmetric masking.

Training Details

Optimizer: AdamW (lr=1.5e-4, weight_decay=0.05, betas=(0.9, 0.95))
Schedule: cosine decay with a 2-epoch warmup, 25 training epochs
Batch size: 32 per GPU
Loss: cross-entropy on the 3-class covisibility prediction
Hardware: NVIDIA A100 GPUs
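The learning-rate schedule above can be sketched as a function of the (possibly fractional) epoch. The listing does not say how the warmup is shaped or what the final learning rate is, so this example assumes a linear warmup from zero, a decay to zero, and that the 25 epochs include the warmup; `lr_at` is a hypothetical helper, not a function from the training script.

```python
import math

def lr_at(epoch, base_lr=1.5e-4, warmup_epochs=2, total_epochs=25, min_lr=0.0):
    """Learning rate at a given epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start of each epoch.
schedule = [lr_at(e) for e in range(25)]
```

The peak value of 1.5e-4 is reached at the end of the warmup (epoch 2), and the rate falls to half of that midway through the cosine phase (epoch 13.5).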

Results

After fine-tuning for metric relative pose regression on Cub3-all (backbone unfrozen):

Method	RUBIK 5deg/0.5m	RUBIK 5deg/2m	RUBIK 10deg/5m	ScanNet 10deg/0.25m	ScanNet 10deg/0.5m	ScanNet 10deg/1m
CroCo (Cub3-50)	12.4	38.3	66.7	75.7	87.4	91.5
Alligat0R (Cub3-all)	24.6	60.3	81.9	85.5	92.5	95.1
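The column headers pair a rotation threshold in degrees with a translation threshold in meters. Assuming each entry is the percentage of test pairs whose rotation and translation errors both fall under the thresholds (the exact protocol is defined in the paper), such a score could be computed as in the sketch below. The function names are illustrative.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3 x 3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error_m(t_pred, t_gt):
    """Euclidean distance in meters between two metric translations."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))

def accuracy_at(rot_errs, trans_errs, max_deg, max_m):
    """Percentage of pairs with both errors under the given thresholds."""
    rot_errs = np.asarray(rot_errs, dtype=np.float64)
    trans_errs = np.asarray(trans_errs, dtype=np.float64)
    ok = (rot_errs <= max_deg) & (trans_errs <= max_m)
    return 100.0 * float(ok.mean())

def rot_z(deg):
    """Rotation about the z axis, used here only to build a toy example."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

err_r = rotation_error_deg(rot_z(10.0), np.eye(3))              # about 10 degrees
err_t = translation_error_m([1.0, 0.0, 0.0], [1.0, 0.3, 0.4])   # 0.5 m
acc = accuracy_at([1, 6, 3, 12], [0.1, 0.2, 0.7, 0.1], max_deg=5, max_m=0.5)
```

In the toy call, only the first of the four pairs satisfies both thresholds, so `acc` is 25.0.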

Limitations

Models are trained on driving (nuScenes) and indoor (ScanNet) domains. Generalization to other domains (e.g., aerial, underwater) has not been evaluated.
nuScenes covisibility annotations rely on monocular depth predictions, which may contain noise on reflective surfaces, transparent objects, or distant geometry.

Citation

@article{loiseau2026alligat0r,
  title={Alligat0r: Pre-training through covisibility segmentation for relative camera pose regression},
  author={Loiseau, Thibaut and Bourmaud, Guillaume and Lepetit, Vincent},
  journal={Advances in Neural Information Processing Systems},
  volume={38},
  pages={13762--13789},
  year={2026}
}

Paper: Alligat0R: Pre-Training Through Co-Visibility Segmentation for Relative Camera Pose Regression (arXiv:2503.07561, published Mar 10, 2025)