Alligat0R: Pre-Training through Covisibility Segmentation for Relative Camera Pose Regression
NeurIPS 2025 Spotlight | Paper | Code | Dataset
Alligat0R is a novel pre-training approach for binocular vision tasks. Instead of cross-view completion (CroCo), it uses a covisibility segmentation objective: for each pixel in one image, the model predicts whether the corresponding 3D point is covisible, occluded, or outside the field of view in the other image.
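To make the three classes concrete, here is a minimal NumPy sketch of how per-pixel covisibility labels could be derived from a depth map and a relative pose. It is an illustration only: the function name, the nearest-pixel depth lookup, and the relative tolerance `rel_tol` are assumptions, and the labeling procedure actually used to build the training data may differ.

```python
import numpy as np

COVISIBLE, OCCLUDED, OUTSIDE_FOV = 0, 1, 2  # same class ids as the usage example


def covisibility_labels(depth1, depth2, K1, K2, T_2_from_1, rel_tol=0.05):
    """Label each pixel of view 1 with respect to view 2 (hypothetical helper).

    depth1, depth2: (H, W) metric depth maps.
    K1, K2: (3, 3) pinhole intrinsics.
    T_2_from_1: (4, 4) rigid transform from camera-1 to camera-2 coordinates.
    """
    h, w = depth1.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject view-1 pixels to 3D, then move them into the camera-2 frame.
    pts1 = (pix @ np.linalg.inv(K1).T) * depth1.reshape(-1, 1)
    pts2 = pts1 @ T_2_from_1[:3, :3].T + T_2_from_1[:3, 3]
    z = pts2[:, 2]

    # Everything starts as outside-FOV; points behind camera 2 stay that way.
    labels = np.full(h * w, OUTSIDE_FOV, dtype=np.int64)
    in_front = z > 1e-6

    # Project the remaining points into image 2 (nearest-pixel rounding).
    proj = pts2[in_front] @ K2.T
    u2 = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v2 = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    h2, w2 = depth2.shape
    inside = (u2 >= 0) & (u2 < w2) & (v2 >= 0) & (v2 < h2)

    # Inside the image: occluded if view 2 sees a clearly closer surface there.
    idx = np.flatnonzero(in_front)[inside]
    seen_depth = depth2[v2[inside], u2[inside]]
    occluded = z[idx] > seen_depth * (1.0 + rel_tol)
    labels[idx] = np.where(occluded, OCCLUDED, COVISIBLE)
    return labels.reshape(h, w)
```

With an identity pose and identical depth maps every pixel comes out covisible; shrinking `depth2` turns pixels occluded, and a large sideways translation pushes them outside the field of view.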
Available Variants
This repository contains four pre-trained Alligat0R backbones (covisibility segmentation, before pose fine-tuning):
| Subfolder | Training Data | Image Size | Description |
|---|---|---|---|
| `nuscenes_cub3-50` | nuScenes (Cub3-50, >= 50% overlap) | 288 x 512 | Outdoor driving |
| `nuscenes_cub3-all` | nuScenes (Cub3-all, >= 5% overlap) | 288 x 512 | Outdoor driving, challenging pairs |
| `scannet_cub3-50` | ScanNet (Cub3-50, >= 50% overlap) | 384 x 512 | Indoor scenes |
| `scannet_cub3-all` | ScanNet (Cub3-all, >= 5% overlap) | 384 x 512 | Indoor scenes, challenging pairs |
All models use a ViT-Large encoder (24 layers, 1024-dim) and a ViT-Base decoder (12 layers, 768-dim) with RoPE positional embeddings. Weights are stored in fp16 safetensors format (~798 MB each).
Usage
A complete demo with visualization is provided in demo.py in the GitHub repository.
Covisibility prediction
import torch
from PIL import Image
from torchvision import transforms

from reloc3r.alligat0r import Alligat0R

device = "cuda"
model = Alligat0R.from_pretrained(
    "thibautloiseau/alligat0r",
    subfolder="scannet_cub3-all",
    device=device,
)

img_size = (384, 512)  # use (288, 512) for nuScenes variants
tf = transforms.Compose([
    transforms.Resize(img_size),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

view1 = {"img": tf(Image.open("img1.jpg").convert("RGB")).unsqueeze(0).to(device)}
view2 = {"img": tf(Image.open("img2.jpg").convert("RGB")).unsqueeze(0).to(device)}

with torch.no_grad():
    seg1, seg2 = model(view1, view2)  # (B, 3, H, W) logits per view

cov1 = seg1.argmax(1)[0].cpu().numpy()  # 0=covisible, 1=occluded, 2=outside-FOV
cov2 = seg2.argmax(1)[0].cpu().numpy()
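A predicted map such as `cov1` can be summarized as the fraction of pixels falling in each class, which gives a rough estimate of how much the two views overlap. The helper below is a small sketch and not part of the repository; `class_fractions` and `CLASS_NAMES` are hypothetical names, and the class order follows the comment in the example above.

```python
import numpy as np

CLASS_NAMES = ("covisible", "occluded", "outside_fov")  # ids 0, 1, 2


def class_fractions(cov):
    """Return the share of pixels per covisibility class for one (H, W) label map."""
    cov = np.asarray(cov)
    counts = np.bincount(cov.ravel(), minlength=len(CLASS_NAMES))
    return {name: float(count) / cov.size for name, count in zip(CLASS_NAMES, counts)}


# Toy 2 x 3 label map standing in for cov1.
example = np.array([[0, 0, 1],
                    [2, 0, 0]])
fractions = class_fractions(example)
```

A low `covisible` fraction indicates a hard pair of the kind the Cub3-all variants are trained on.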
Encoder feature extraction
The pre-trained encoder can be used as a feature backbone for downstream tasks:
feat1, feat2, pos1, pos2 = model.encode(view1, view2)
# feat1, feat2: (B, N_patches, 1024) — ViT-L encoder features
# pos1, pos2: (B, N_patches, 2) — 2-D patch positions
# To get intermediate features from all 24 encoder blocks:
feat1_all, feat2_all, pos1, pos2 = model.encode(view1, view2, return_all_blocks=True)
# feat1_all: list of 24 tensors, each (B, N_patches, 1024)
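The value of `N_patches` follows from the input resolution and the encoder patch size of 16 listed under Architecture. The short sketch below computes the patch grid for both supported resolutions; `patch_grid` is a hypothetical helper, not a function from the repository.

```python
def patch_grid(img_size, patch=16):
    """Return (rows, cols) of the patch grid for an (H, W) input."""
    h, w = img_size
    if h % patch or w % patch:
        raise ValueError(f"image size {img_size} is not divisible by patch size {patch}")
    return h // patch, w // patch


scannet_grid = patch_grid((384, 512))   # ScanNet variants
nuscenes_grid = patch_grid((288, 512))  # nuScenes variants
n_patches_scannet = scannet_grid[0] * scannet_grid[1]
n_patches_nuscenes = nuscenes_grid[0] * nuscenes_grid[1]
```

Assuming the tokens are stored in row-major order (not confirmed here), a feature tensor can be turned into a dense map with `feat1.transpose(1, 2).reshape(B, 1024, rows, cols)`.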
Fine-Tuning for Pose Regression
These pre-trained weights serve as initialization for downstream tasks. To fine-tune for relative pose regression, use the training script from the GitHub repository:
torchrun --nproc_per_node=4 finetune_pose.py \
    --mode alligat0r_pose \
    --dataset scannet \
    --overlap all \
    --load_pretrained_default
Architecture
- Encoder: ViT-Large (24 layers, 1024-dim, 16 heads, patch size 16)
- Decoder: ViT-Base (12 layers, 768-dim, 12 heads) with cross-attention
- Segmentation head: linear projection from decoder features to 3-class per-pixel predictions
- Positional encoding: RoPE (freq=100)
The architecture is symmetric: both images are processed identically and without masking, unlike CroCo, which uses asymmetric masking.
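One plausible reading of the segmentation head is sketched below in NumPy: each 768-dim decoder token is projected to 3 x 16 x 16 logits, which are then rearranged into a full-resolution map. This layout is an assumption made for illustration; the head in the repository may organize its outputs differently, and `linear_seg_head` is a hypothetical name.

```python
import numpy as np


def linear_seg_head(tokens, weight, bias, grid_hw, patch=16, num_classes=3):
    """Project decoder tokens to per-pixel class logits.

    tokens: (B, N, D) decoder features, N = rows * cols in row-major order.
    weight: (D, num_classes * patch * patch), bias: (num_classes * patch * patch,).
    Returns logits of shape (B, num_classes, rows * patch, cols * patch).
    """
    b, n, _ = tokens.shape
    rows, cols = grid_hw
    assert n == rows * cols, "token count must match the patch grid"
    out = tokens @ weight + bias                             # (B, N, C*p*p)
    out = out.reshape(b, rows, cols, num_classes, patch, patch)
    out = out.transpose(0, 3, 1, 4, 2, 5)                    # (B, C, rows, p, cols, p)
    return out.reshape(b, num_classes, rows * patch, cols * patch)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((1, 24 * 32, 768)).astype(np.float32)
weight = (0.02 * rng.standard_normal((768, 3 * 16 * 16))).astype(np.float32)
bias = np.zeros(3 * 16 * 16, dtype=np.float32)
logits = linear_seg_head(tokens, weight, bias, grid_hw=(24, 32))  # 384 x 512 input
```

Taking `logits.argmax(1)` then yields a label map with the same layout as the covisibility predictions in the usage example.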
Training Details
- Optimizer: AdamW (lr=1.5e-4, weight_decay=0.05, betas=(0.9, 0.95))
- Schedule: cosine decay with 2 warmup epochs; 25 training epochs in total
- Batch size: 32 per GPU
- Loss: cross-entropy on the 3-class covisibility prediction
- Hardware: NVIDIA A100 GPUs
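The learning-rate schedule listed above can be sketched as follows. The linear warmup from zero, the decay to `min_lr=0`, and counting the warmup epochs inside the 25 total are assumptions; the training script may differ on any of these details.

```python
import math


def lr_at_epoch(epoch, base_lr=1.5e-4, warmup_epochs=2, total_epochs=25, min_lr=0.0):
    """Learning rate at a (possibly fractional) epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start of each of the 25 epochs.
schedule = [lr_at_epoch(e) for e in range(25)]
```

Passing a fractional epoch (for example `step / steps_per_epoch`) gives a per-iteration schedule, which is how warmup is commonly applied in ViT training.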
Results
After fine-tuning for metric relative pose regression on Cub3-all (backbone unfrozen):
| Method | RUBIK 5deg/0.5m | RUBIK 5deg/2m | RUBIK 10deg/5m | ScanNet 10deg/0.25m | ScanNet 10deg/0.5m | ScanNet 10deg/1m |
|---|---|---|---|---|---|---|
| CroCo (Cub3-50) | 12.4 | 38.3 | 66.7 | 75.7 | 87.4 | 91.5 |
| Alligat0R (Cub3-all) | 24.6 | 60.3 | 81.9 | 85.5 | 92.5 | 95.1 |
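The column headers suggest that a pair counts as correct when its rotation error and its translation error are both under the stated thresholds. Under that assumption (the exact evaluation protocol is defined in the paper and may differ, for example in how the thresholds are compared), the metric can be sketched as below; all function names are hypothetical.

```python
import numpy as np


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3 x 3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def translation_error_m(t_pred, t_gt):
    """Euclidean distance in meters between two metric translation vectors."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))


def accuracy_at(rot_errs_deg, trans_errs_m, rot_thr_deg, trans_thr_m):
    """Percentage of pairs whose errors are within both thresholds."""
    rot = np.asarray(rot_errs_deg, dtype=np.float64)
    trans = np.asarray(trans_errs_m, dtype=np.float64)
    return 100.0 * float(np.mean((rot <= rot_thr_deg) & (trans <= trans_thr_m)))


# Three toy pairs scored at the 5deg/0.5m threshold: only the first one passes.
acc = accuracy_at([1.0, 6.0, 3.0], [0.1, 0.1, 1.0], rot_thr_deg=5.0, trans_thr_m=0.5)
```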
Limitations
- Models are trained on driving (nuScenes) and indoor (ScanNet) domains. Generalization to other domains (e.g., aerial, underwater) has not been evaluated.
- nuScenes covisibility annotations rely on monocular depth predictions, which may contain noise on reflective surfaces, transparent objects, or distant geometry.
Citation
@article{loiseau2026alligat0r,
  title={Alligat0r: Pre-training through covisibility segmentation for relative camera pose regression},
  author={Loiseau, Thibaut and Bourmaud, Guillaume and Lepetit, Vincent},
  journal={Advances in Neural Information Processing Systems},
  volume={38},
  pages={13762--13789},
  year={2026}
}