Alligat0R: Pre-Training through Covisibility Segmentation for Relative Camera Pose Regression

NeurIPS 2025 Spotlight | Paper | Code | Dataset

Alligat0R is a novel pre-training approach for binocular vision tasks. Instead of cross-view completion (CroCo), it uses a covisibility segmentation objective: for each pixel in one image, the model predicts whether the corresponding 3D point is covisible, occluded, or outside the field of view in the other image.
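To make the three classes concrete, the sketch below labels a single 3D point given a pinhole camera and a depth map for the second view. It is an illustration, not the authors' annotation pipeline: the function name, the pinhole model, and the relative depth tolerance are assumptions made for this example. Only the class indices (0, 1, 2) follow the usage snippet in this card.

```python
import math

COVISIBLE, OCCLUDED, OUTSIDE_FOV = 0, 1, 2


def covisibility_label(point_cam2, fx, fy, cx, cy, width, height, depth2, rel_tol=0.05):
    """Label one 3D point that is already expressed in camera-2 coordinates.

    depth2 is a row-major depth map (list of rows) for the second view.
    rel_tol is a hypothetical relative tolerance for the depth comparison.
    """
    x, y, z = point_cam2
    if z <= 0:
        return OUTSIDE_FOV  # behind the second camera

    # Pinhole projection into the second image.
    col = math.floor(fx * x / z + cx)
    row = math.floor(fy * y / z + cy)
    if not (0 <= col < width and 0 <= row < height):
        return OUTSIDE_FOV  # projects outside the image bounds

    # Inside the image: the point is hidden if a closer surface is observed there.
    if z > depth2[row][col] * (1.0 + rel_tol):
        return OCCLUDED
    return COVISIBLE
```

For example, with a flat 4 x 4 depth map at 2 m and `fx = fy = cx = cy = 2`, a point at `(0, 0, 2)` is covisible, a point at `(0, 0, 5)` is occluded, and a point at `(10, 0, 2)` falls outside the field of view.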

Available Variants

This repository contains four pre-trained Alligat0R backbones (covisibility segmentation, before pose fine-tuning):

| Subfolder | Training Data | Image Size | Description |
| --- | --- | --- | --- |
| nuscenes_cub3-50 | nuScenes (Cub3-50, >= 50% overlap) | 288 x 512 | Outdoor driving |
| nuscenes_cub3-all | nuScenes (Cub3-all, >= 5% overlap) | 288 x 512 | Outdoor driving, challenging pairs |
| scannet_cub3-50 | ScanNet (Cub3-50, >= 50% overlap) | 384 x 512 | Indoor scenes |
| scannet_cub3-all | ScanNet (Cub3-all, >= 5% overlap) | 384 x 512 | Indoor scenes, challenging pairs |

All models use a ViT-Large encoder (24 layers, 1024-dim) and a ViT-Base decoder (12 layers, 768-dim) with RoPE positional embeddings. Weights are stored in fp16 safetensors format (~798 MB each).
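As a rough sanity check on the quoted file size, the sketch below estimates the weight count from the layer dimensions alone. It assumes standard transformer blocks with an MLP ratio of 4 and one cross-attention module per decoder block, and it ignores biases, norms, patch embedding, and the segmentation head. None of these details are confirmed here, so treat the result as an order-of-magnitude estimate.

```python
def block_weights(dim, mlp_ratio=4, cross_attention=False):
    """Approximate weight count of one transformer block (matrices only)."""
    attention = 4 * dim * dim            # Q, K, V and output projections
    mlp = 2 * mlp_ratio * dim * dim      # two linear layers
    extra = 4 * dim * dim if cross_attention else 0
    return attention + mlp + extra


encoder = 24 * block_weights(1024)                        # ViT-Large encoder
decoder = 12 * block_weights(768, cross_attention=True)   # ViT-Base decoder
total = encoder + decoder
fp16_bytes = 2 * total                                    # 2 bytes per fp16 weight

print(f"{total / 1e6:.0f}M weights, about {fp16_bytes / 1e6:.0f} MB in fp16")
```

Under these assumptions the estimate lands in the same range as the ~798 MB stated above.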

Usage

A complete demo with visualization is provided in demo.py on the GitHub repository.

Covisibility prediction

import torch
from PIL import Image
from torchvision import transforms
from reloc3r.alligat0r import Alligat0R

device = "cuda"
model = Alligat0R.from_pretrained(
    "thibautloiseau/alligat0r",
    subfolder="scannet_cub3-all",
    device=device,
)

img_size = (384, 512)  # use (288, 512) for nuScenes variants
tf = transforms.Compose([
    transforms.Resize(img_size),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
view1 = {"img": tf(Image.open("img1.jpg").convert("RGB")).unsqueeze(0).to(device)}
view2 = {"img": tf(Image.open("img2.jpg").convert("RGB")).unsqueeze(0).to(device)}

with torch.no_grad():
    seg1, seg2 = model(view1, view2)  # (B, 3, H, W) logits per view

cov1 = seg1.argmax(1)[0].cpu().numpy()  # 0=covisible, 1=occluded, 2=outside-FOV
cov2 = seg2.argmax(1)[0].cpu().numpy()
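A common follow-up is to summarize a predicted class map as per-class pixel fractions, for instance to estimate how much of one image overlaps the other. The helper below is a minimal standard-library sketch; `class_fractions` and `CLASS_NAMES` are names introduced for this example and are not part of the repository.

```python
from collections import Counter

# Class indices as documented in the snippet above.
CLASS_NAMES = ("covisible", "occluded", "outside_fov")


def class_fractions(labels):
    """Return the fraction of pixels assigned to each covisibility class.

    labels is a flat iterable of integer class indices (0, 1, or 2).
    """
    labels = list(labels)
    if not labels:
        raise ValueError("labels must contain at least one pixel")
    counts = Counter(labels)
    return {name: counts.get(i, 0) / len(labels) for i, name in enumerate(CLASS_NAMES)}
```

With the arrays produced above, it could be called as `class_fractions(cov1.ravel().tolist())`.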

Encoder feature extraction

The pre-trained encoder can be used as a feature backbone for downstream tasks:

feat1, feat2, pos1, pos2 = model.encode(view1, view2)
# feat1, feat2: (B, N_patches, 1024)  — ViT-L encoder features
# pos1, pos2:   (B, N_patches, 2)     — 2-D patch positions

# To get intermediate features from all 24 encoder blocks:
feat1_all, feat2_all, pos1, pos2 = model.encode(view1, view2, return_all_blocks=True)
# feat1_all: list of 24 tensors, each (B, N_patches, 1024)
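The value of `N_patches` follows from the input resolution and the patch size of 16 listed under Architecture. The sketch below works out the patch grid, assuming non-overlapping patches and no extra tokens such as a class token; `patch_grid` is a hypothetical helper, not a repository function.

```python
def patch_grid(height, width, patch_size=16):
    """Return (rows, cols, n_patches) for an image split into square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


print(patch_grid(384, 512))  # ScanNet variants
print(patch_grid(288, 512))  # nuScenes variants
```

Under these assumptions, the ScanNet variants yield a 24 x 32 grid (768 patches) and the nuScenes variants an 18 x 32 grid (576 patches), which is the shape to use when reshaping token features back into a 2-D feature map.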

Fine-Tuning for Pose Regression

These pre-trained weights serve as initialization for downstream tasks. To fine-tune for relative pose regression, use the training script from the GitHub repository:

torchrun --nproc_per_node=4 finetune_pose.py \
    --mode alligat0r_pose \
    --dataset scannet \
    --overlap all \
    --load_pretrained_default

Architecture

  • Encoder: ViT-Large (24 layers, 1024-dim, 16 heads, patch size 16)
  • Decoder: ViT-Base (12 layers, 768-dim, 12 heads) with cross-attention
  • Segmentation head: linear projection from decoder features to 3-class per-pixel predictions
  • Positional encoding: RoPE (freq=100)
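RoPE encodes position by rotating pairs of feature channels so that attention scores depend only on relative offsets. The sketch below is a 1-D, pure-Python illustration of that property. It assumes that `freq=100` plays the role of the rotary base and that channels are paired by splitting the vector in half; neither detail is confirmed by this card. The models here carry 2-D patch positions, so their implementation will differ.

```python
import math


def rope_1d(vec, position, base=100.0):
    """Rotate channel pairs (i, i + d/2) of vec by position-dependent angles."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = position / (base ** (2 * i / d))  # lower frequency for larger i
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]
# Same offset (2) at different absolute positions gives the same score.
score_a = dot(rope_1d(q, 3), rope_1d(k, 5))
score_b = dot(rope_1d(q, 10), rope_1d(k, 12))
```

Here `score_a` and `score_b` agree up to floating-point error, because both pairs of positions are two steps apart.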

The architecture is symmetric: both images are processed identically and without masking, unlike CroCo, which uses asymmetric masking.

Training Details

  • Optimizer: AdamW (lr=1.5e-4, weight_decay=0.05, betas=(0.9, 0.95))
  • Schedule: cosine decay with a 2-epoch warmup; 25 training epochs
  • Batch size: 32 per GPU
  • Loss: cross-entropy on the 3-class covisibility prediction
  • Hardware: NVIDIA A100 GPUs
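The schedule listed above can be sketched as a function of the epoch index. This is a minimal illustration: it assumes a linear warmup from zero, that the 25 epochs include the warmup, and that the cosine decays to a minimum learning rate of zero. The card specifies none of these details.

```python
import math


def lr_at(epoch, base_lr=1.5e-4, warmup_epochs=2, total_epochs=25, min_lr=0.0):
    """Learning rate at a (possibly fractional) epoch index."""
    if epoch < warmup_epochs:
        # Linear warmup from 0 to base_lr (assumed).
        return base_lr * epoch / warmup_epochs
    # Cosine decay from base_lr to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at(e) for e in range(26)]
```

Under these assumptions, the learning rate peaks at 1.5e-4 at the end of epoch 2 and reaches zero at epoch 25.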

Results

After fine-tuning for metric relative pose regression on Cub3-all (backbone unfrozen):

| Method | RUBIK 5deg/0.5m | RUBIK 5deg/2m | RUBIK 10deg/5m | ScanNet 10deg/0.25m | ScanNet 10deg/0.5m | ScanNet 10deg/1m |
| --- | --- | --- | --- | --- | --- | --- |
| CroCo (Cub3-50) | 12.4 | 38.3 | 66.7 | 75.7 | 87.4 | 91.5 |
| Alligat0R (Cub3-all) | 24.6 | 60.3 | 81.9 | 85.5 | 92.5 | 95.1 |
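The column headers pair a rotation threshold with a translation threshold. One plausible reading, assumed here and not stated in this card, is that each number is the percentage of image pairs whose rotation and translation errors both fall within the thresholds. The sketch below computes that quantity on made-up toy errors; `accuracy_at` is a hypothetical helper and the inclusive comparison is also an assumption.

```python
def accuracy_at(rot_errors_deg, trans_errors_m, rot_thresh_deg, trans_thresh_m):
    """Percentage of pairs with both errors within the given thresholds."""
    if len(rot_errors_deg) != len(trans_errors_m):
        raise ValueError("error lists must have the same length")
    if not rot_errors_deg:
        raise ValueError("at least one pair is required")
    hits = sum(
        1
        for r, t in zip(rot_errors_deg, trans_errors_m)
        if r <= rot_thresh_deg and t <= trans_thresh_m
    )
    return 100.0 * hits / len(rot_errors_deg)


# Toy errors for four image pairs (illustrative values, not real results).
rot = [1.0, 4.0, 8.0, 20.0]
trans = [0.1, 1.0, 0.3, 0.2]
acc_5deg_05m = accuracy_at(rot, trans, 5.0, 0.5)
acc_10deg_5m = accuracy_at(rot, trans, 10.0, 5.0)
```

On these toy values, one of four pairs passes the 5deg/0.5m test and three of four pass the 10deg/5m test.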

Limitations

  • Models are trained on driving (nuScenes) and indoor (ScanNet) domains. Generalization to other domains (e.g., aerial, underwater) has not been evaluated.
  • nuScenes covisibility annotations rely on monocular depth predictions, which may contain noise on reflective surfaces, transparent objects, or distant geometry.

Citation

@article{loiseau2026alligat0r,
  title={Alligat0r: Pre-training through covisibility segmentation for relative camera pose regression},
  author={Loiseau, Thibaut and Bourmaud, Guillaume and Lepetit, Vincent},
  journal={Advances in Neural Information Processing Systems},
  volume={38},
  pages={13762--13789},
  year={2026}
}