# GEOLIP-Bertenstein
A multi-expert geometric fusion transformer that bridges four independently trained encoders into a shared embedding space, using BERT-large as a universal text hub.
One layer. One epoch. Perfect cross-modal retrieval. Universal pentachoron geometry.
## Results
| Expert Pair | R@1 | Cosine | Pentachoron CV |
|---|---|---|---|
| text ↔ audio (1.5K val) | 1.0000 | 0.972 | 0.203 |
| text ↔ code (5K val) | 0.9996 | 0.988 | 0.195 |
| text ↔ image (5K val) | 1.0000 | 0.986 | 0.196 |
| text ↔ protein (2.3K val) | 0.9987 | 0.979 | 0.200 |
| text ↔ image (40K test) | 1.0000 | 0.980 | 0.208 |
All CV values converge to the 0.20 ± 0.01 universal band, the same geometric constant measured across 17 neural architectures before this model existed.
The current protein and audio results come from partial datasets used for this preliminary run, so their sample sizes are smaller and less conclusive than a full-dataset run, which should improve alignment through the additional data.
## Architecture
```
                  ┌─────────────┐
                  │   Shared    │
┌───────┐         │   Fusion    │         ┌───────┐
│ BERT  │──text──▶│ Transformer │◀──img───│DINOv2 │
│ large │         │  (1 layer)  │         │ large │
└───────┘         │   1024-d    │         └───────┘
                  │  16 heads   │
┌───────┐         │             │         ┌───────┐
│Whisp. │──audio─▶│ Procrustes  │◀──prot──│ ESM-2 │
│ large │         │ pre-aligned │         │ 650M  │
└───────┘         │             │         └───────┘
┌───────┐         │             │
│Code-  │──code──▶│             │
│BERT   │         └─────────────┘
└───────┘
```
Frozen encoders (not trained, not included in this repo):
- BERT-large (336M) – universal text hub
- DINOv2-large (302M) – natural images
- Whisper-large-v3 encoder (1.5B) – speech audio
- ESM-2-650M (652M) – protein sequences
- CodeBERT-base (125M) – source code
Trainable fusion (this repo): 41.5M params
- 1 shared transformer layer (8.9M)
- 5 expert modules (text + 4 modalities, ~7M each)
- Includes Procrustes pre-alignment buffers per expert
## How It Works
Procrustes Pre-Alignment: Before training, compute the optimal orthogonal rotation that aligns each expert's coordinate system with BERT's text space (whitened Procrustes with centering).
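A minimal sketch of such an alignment, assuming paired (expert, BERT) embeddings and simple per-dimension whitening; the helper names and toy data below are illustrative, not the repo's exact code:

```python
import numpy as np

def procrustes_align(X, Y, eps=1e-8):
    """Fit an orthogonal map R sending expert embeddings X toward hub
    embeddings Y (rows are paired samples): center, whiten per dimension,
    then solve the classic orthogonal Procrustes problem via SVD.
    Illustrative sketch; the repo's exact recipe may differ."""
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + eps)
    Yc = (Y - Y.mean(axis=0)) / (Y.std(axis=0) + eps)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    return U @ Vt  # orthogonal: R.T @ R = I

def mean_cosine(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A * B).sum(axis=1).mean())

# Toy demo: recover a hidden rotation between two 8-d "encoders"
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
R_true, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Y = X @ R_true
R = procrustes_align(X, Y)
print(mean_cosine(X, Y), "->", mean_cosine(X @ R, Y))
```

As in the table below, the rotation lifts mean cosine similarity from near zero to a clearly positive value.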
Expert Modules: Each modality gets a projection layer, learned cross-attention pooling (257 image patches → 16 tokens, 1500 audio frames → 16 tokens, etc.), a special `<|MODALITY|>` token, and an output head.

Fusion Sequence: `[<|TEXT|>] [text_tokens] [<|IMAGE|>] [img_tokens] ...` → bidirectional self-attention across all modalities. Text tokens attend to image patches; audio tokens attend to text tokens.

Geometric Loss: InfoNCE contrastive + pentachoron CV variance (Cayley-Menger) + Procrustes SVD alignment. The geometric components drive the embedding manifold toward the universal 0.20 CV band.
Text Hub: All modalities pair with text during training. Cross-modal alignment (e.g., audio ↔ image) emerges transitively through the shared text space.
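The InfoNCE contrastive term of the loss can be sketched as follows (a minimal numpy sketch; the `info_nce` name and temperature value are illustrative, and the geometric terms are omitted):

```python
import numpy as np

def info_nce(text_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (text, modality) embeddings;
    positives sit on the diagonal of the similarity matrix. Hypothetical
    sketch of the contrastive term only."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    o = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    logits = t @ o.T / temperature  # (B, B) cosine similarities

    def xent(lg):
        # Row-wise cross-entropy against the diagonal "correct pair" labels
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # Average the text->modality and modality->text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly paired batches drive this term toward zero; mismatched pairs keep it high, which is what pushes each expert toward the text hub.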
## Procrustes Pre-Alignment
| Expert | cos before | cos after | Dimension |
|---|---|---|---|
| audio | 0.0004 | 0.4404 | 1280 → 1024 (PCA down) |
| code | -0.0016 | 0.4036 | 768 → 1024 (zero-pad up) |
| image | 0.0038 | 0.4107 | 1024 → 1024 (direct) |
| protein | 0.0005 | 0.3771 | 1280 → 1024 (PCA down) |
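The dimension handling in the table (PCA down for wider encoders, zero-pad up for narrower ones) can be sketched as below; the `match_width` helper is hypothetical, and the PCA branch assumes at least `target_dim` samples:

```python
import numpy as np

def match_width(X, target_dim):
    """Bring encoder embeddings of width d to the hub width:
    PCA projection down when d > target_dim, zero-padding up when
    d < target_dim. Illustrative sketch, not the repo's exact code."""
    n, d = X.shape
    if d == target_dim:
        return X
    if d > target_dim:
        Xc = X - X.mean(axis=0)                    # center before PCA
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:target_dim].T              # top principal components
    pad = np.zeros((n, target_dim - d))            # extra dims carry no signal
    return np.concatenate([X, pad], axis=1)

rng = np.random.default_rng(0)
down = match_width(rng.normal(size=(200, 48)), 32)  # like 1280 -> 1024 (PCA down)
up = match_width(rng.normal(size=(200, 24)), 32)    # like 768 -> 1024 (zero-pad up)
print(down.shape, up.shape)
```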
## Training
- Data: COCO-Caption (40K), LibriSpeech clean-100 (10K), CodeSearchNet Python (50K), Protein2Text-QA (15K)
- Schedule: 3 epochs, round-robin across experts, cosine LR 3e-4 → 1e-6
- Hardware: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)
- Time: ~11 minutes total (3 × ~220s)
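The schedule above can be sketched as follows (illustrative helper; the repo's exact per-step schedule and round-robin batching may differ):

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1e-6):
    """Cosine decay from lr_max down to lr_min over total_steps (sketch)."""
    progress = min(step / max(1, total_steps), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Round-robin across experts: one expert per step, cycling
experts = ["audio", "code", "image", "protein"]
for step in range(8):
    expert = experts[step % len(experts)]
    lr = cosine_lr(step, total_steps=1000)
```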
## Key Finding: Universal Pentachoron Geometry
The coefficient of variation (CV) of pentachoron (4-simplex) volumes, measured via Cayley-Menger determinants, converges to 0.20 ± 0.01 across:
- 17 pretrained models (T5, BERT, CLIP, DINOv2, SD UNets, VAEs)
- 5 architecture families (transformer, UNet, convolutional autoencoder)
- 4 modalities in this fusion model (audio, code, image, protein)
This constant emerges in ANY embedding space that reaches convergence through gradient-based contrastive learning, regardless of modality, architecture, or training objective.
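The measurement itself can be sketched as: sample 5-point subsets of the embedding cloud, compute each pentachoron's volume from the Cayley-Menger determinant, and take std/mean. The `simplex_volume` and `pentachoron_cv` helpers below are hypothetical, not the repo's exact code:

```python
import math
import numpy as np

def simplex_volume(points):
    """Volume of an n-simplex from its n+1 vertices via the
    Cayley-Menger determinant."""
    n = len(points) - 1
    diffs = points[:, None, :] - points[None, :, :]
    D = np.square(diffs).sum(-1)        # squared pairwise distances
    CM = np.ones((n + 2, n + 2))        # bordered Cayley-Menger matrix
    CM[0, 0] = 0.0
    CM[1:, 1:] = D
    vol_sq = (-1) ** (n + 1) * np.linalg.det(CM) / (2 ** n * math.factorial(n) ** 2)
    return math.sqrt(max(vol_sq, 0.0))  # clamp tiny negatives from round-off

def pentachoron_cv(emb, n_samples=500, seed=0):
    """CV (std/mean) of volumes of randomly sampled pentachora
    (4-simplices: 5 vertices) in an embedding cloud. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    vols = np.asarray([
        simplex_volume(emb[rng.choice(len(emb), size=5, replace=False)])
        for _ in range(n_samples)
    ])
    return vols.std() / vols.mean()
```

As a sanity check, the unit right triangle has area 1/2, the unit right tetrahedron volume 1/6, and the unit right 4-simplex volume 1/24 under this formula.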
## File Structure
```
geolip-bertenstein/
├── checkpoints/
│   ├── epoch_001/
│   │   ├── model.safetensors
│   │   ├── loss.safetensors
│   │   ├── training_state.pt
│   │   └── config.json
│   ├── epoch_002/
│   ├── epoch_003/
│   └── final/
│       ├── model.safetensors
│       ├── loss.safetensors
│       ├── training_state.pt
│       ├── config.json
│       ├── aligner_audio.safetensors
│       ├── aligner_code.safetensors
│       ├── aligner_image.safetensors
│       └── aligner_protein.safetensors
├── tensorboard/
├── bertenstein_results.json
├── stage2_bertenstein.py
└── README.md
```
## Usage
```python
from safetensors.torch import load_file
import torch

# Load model weights
state = load_file("checkpoints/final/model.safetensors")

# Reconstruct model (see stage2_bertenstein.py for full class definitions)
from stage2_bertenstein import BertensteinFusion, ExpertModule, ProcrustesAligner, FusionConfig
# ... build model, load state_dict, run inference
```
Precomputed embedding caches (Arrow format) for all modalities: AbstractPhil/geolip-bertenstein-cache
## Geometric Terrain Analysis
The foundational profiling of 17 models and Procrustes alignment analysis: AbstractPhil/procrustes-analysis
## Citation
```bibtex
@misc{abstractphil2026bertenstein,
  title={GEOLIP-Bertenstein: Cross-Modal Alignment Through Shared Pentachoron Geometry},
  author={AbstractPhil},
  year={2026},
  url={https://huggingface.co/AbstractPhil/geolip-bertenstein}
}
```
## License
MIT