# GEOLIP-Bertenstein
A multi-expert geometric fusion transformer that bridges four independently trained encoders into a shared embedding space, using BERT-large as a universal text hub.
One layer. One epoch. Perfect cross-modal retrieval. Universal pentachoron geometry.
## Results
| Expert Pair | R@1 | Cosine | Pentachoron CV |
|---|---|---|---|
| text ↔ audio (1.5K val) | 1.0000 | 0.972 | 0.203 |
| text ↔ code (5K val) | 0.9996 | 0.988 | 0.195 |
| text ↔ image (5K val) | 1.0000 | 0.986 | 0.196 |
| text ↔ protein (2.3K val) | 0.9987 | 0.979 | 0.200 |
| text ↔ image (40K test) | 1.0000 | 0.980 | 0.208 |
All CV values converge to the 0.20 ± 0.01 universal band, the same geometric constant measured across 17 neural architectures before this model existed.
The current protein and audio results come from partial datasets used for this preliminary run, so their sample sizes are smaller and less conclusive than a full-dataset run, which should improve alignment through the additional data.
## Architecture
```
                  ┌─────────────┐
                  │   Shared    │
┌───────┐         │   Fusion    │         ┌───────┐
│ BERT  │──text──▶│ Transformer │◀──img───│DINOv2 │
│ large │         │  (1 layer)  │         │ large │
└───────┘         │   1024-d    │         └───────┘
                  │  16 heads   │
┌───────┐         │             │         ┌───────┐
│Whisp. │──audio─▶│ Procrustes  │◀──prot──│ ESM-2 │
│ large │         │ pre-aligned │         │ 650M  │
└───────┘         │             │         └───────┘
┌───────┐         │             │
│Code-  │──code──▶│             │
│BERT   │         └─────────────┘
└───────┘
```
Frozen encoders (not trained, not included in this repo):
- BERT-large (336M) – universal text hub
- DINOv2-large (302M) – natural images
- Whisper-large-v3 encoder (1.5B) – speech audio
- ESM-2-650M (652M) – protein sequences
- CodeBERT-base (125M) – source code
Trainable fusion (this repo): 41.5M params
- 1 shared transformer layer (8.9M)
- 5 expert modules (text + 4 modalities, ~7M each)
- Includes Procrustes pre-alignment buffers per expert
## How It Works
Procrustes Pre-Alignment: Before training, compute the optimal orthogonal rotation that aligns each expert's coordinate system with BERT's text space (whitened Procrustes with centering).
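A minimal sketch of such an alignment, assuming paired (expert, BERT) embeddings and simple per-dimension whitening; the helper names and toy data below are illustrative, not the repo's exact code:

```python
import numpy as np

def procrustes_align(X, Y, eps=1e-8):
    """Fit an orthogonal map R sending expert embeddings X toward hub
    embeddings Y (rows are paired samples): center, whiten per dimension,
    then solve the classic orthogonal Procrustes problem via SVD.
    Illustrative sketch; the repo's exact recipe may differ."""
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + eps)
    Yc = (Y - Y.mean(axis=0)) / (Y.std(axis=0) + eps)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    return U @ Vt  # orthogonal: R.T @ R = I

def mean_cosine(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A * B).sum(axis=1).mean())

# Toy demo: recover a hidden rotation between two 8-d "encoders"
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
R_true, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Y = X @ R_true
R = procrustes_align(X, Y)
print(mean_cosine(X, Y), "->", mean_cosine(X @ R, Y))
```

As in the table below, the rotation lifts mean cosine similarity from near zero to a clearly positive value.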
Expert Modules: Each modality gets a projection layer, learned cross-attention pooling (257 image patches → 16 tokens, 1500 audio frames → 16 tokens, etc.), a special `<|MODALITY|>` token, and an output head.

Fusion Sequence: `[<|TEXT|>] [text_tokens] [<|IMAGE|>] [img_tokens] ...` → bidirectional self-attention across all modalities. Text tokens attend to image patches; audio tokens attend to text tokens.

Geometric Loss: InfoNCE contrastive + pentachoron CV variance (Cayley-Menger) + Procrustes SVD alignment. The geometric components drive the embedding manifold toward the universal 0.20 CV band.
Text Hub: All modalities pair with text during training. Cross-modal alignment (e.g., audio ↔ image) emerges transitively through the shared text space.
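The InfoNCE contrastive term of the loss can be sketched as follows (a minimal numpy sketch; the `info_nce` name and temperature value are illustrative, and the geometric terms are omitted):

```python
import numpy as np

def info_nce(text_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (text, modality) embeddings;
    positives sit on the diagonal of the similarity matrix. Hypothetical
    sketch of the contrastive term only."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    o = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    logits = t @ o.T / temperature  # (B, B) cosine similarities

    def xent(lg):
        # Row-wise cross-entropy against the diagonal "correct pair" labels
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # Average the text->modality and modality->text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly paired batches drive this term toward zero; mismatched pairs keep it high, which is what pushes each expert toward the text hub.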
## Procrustes Pre-Alignment
| Expert | cos before | cos after | Dimension |
|---|---|---|---|
| audio | 0.0004 | 0.4404 | 1280 → 1024 (PCA down) |
| code | -0.0016 | 0.4036 | 768 → 1024 (zero-pad up) |
| image | 0.0038 | 0.4107 | 1024 → 1024 (direct) |
| protein | 0.0005 | 0.3771 | 1280 → 1024 (PCA down) |
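The dimension handling in the table (PCA down for wider encoders, zero-pad up for narrower ones) can be sketched as below; the `match_width` helper is hypothetical, and the PCA branch assumes at least `target_dim` samples:

```python
import numpy as np

def match_width(X, target_dim):
    """Bring encoder embeddings of width d to the hub width:
    PCA projection down when d > target_dim, zero-padding up when
    d < target_dim. Illustrative sketch, not the repo's exact code."""
    n, d = X.shape
    if d == target_dim:
        return X
    if d > target_dim:
        Xc = X - X.mean(axis=0)                    # center before PCA
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:target_dim].T              # top principal components
    pad = np.zeros((n, target_dim - d))            # extra dims carry no signal
    return np.concatenate([X, pad], axis=1)

rng = np.random.default_rng(0)
down = match_width(rng.normal(size=(200, 48)), 32)  # like 1280 -> 1024 (PCA down)
up = match_width(rng.normal(size=(200, 24)), 32)    # like 768 -> 1024 (zero-pad up)
print(down.shape, up.shape)
```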
## Training
- Data: COCO-Caption (40K), LibriSpeech clean-100 (10K), CodeSearchNet Python (50K), Protein2Text-QA (15K)
- Schedule: 3 epochs, round-robin across experts, cosine LR 3e-4 → 1e-6
- Hardware: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)
- Time: ~11 minutes total (3 × ~220s)
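The schedule above can be sketched as follows (illustrative helper; the repo's exact per-step schedule and round-robin batching may differ):

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1e-6):
    """Cosine decay from lr_max down to lr_min over total_steps (sketch)."""
    progress = min(step / max(1, total_steps), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Round-robin across experts: one expert per step, cycling
experts = ["audio", "code", "image", "protein"]
for step in range(8):
    expert = experts[step % len(experts)]
    lr = cosine_lr(step, total_steps=1000)
```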
## Key Finding: Universal Pentachoron Geometry
The coefficient of variation (CV) of pentachoron (4-simplex) volumes, measured via Cayley-Menger determinants, converges to 0.20 ± 0.01 across:
- 17 pretrained models (T5, BERT, CLIP, DINOv2, SD UNets, VAEs)
- 5 architecture families (transformer, UNet, convolutional autoencoder)
- 4 modalities in this fusion model (audio, code, image, protein)
This constant emerges in ANY embedding space that reaches convergence through gradient-based contrastive learning, regardless of modality, architecture, or training objective.
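The measurement itself can be sketched as: sample 5-point subsets of the embedding cloud, compute each pentachoron's volume from the Cayley-Menger determinant, and take std/mean. The `simplex_volume` and `pentachoron_cv` helpers below are hypothetical, not the repo's exact code:

```python
import math
import numpy as np

def simplex_volume(points):
    """Volume of an n-simplex from its n+1 vertices via the
    Cayley-Menger determinant."""
    n = len(points) - 1
    diffs = points[:, None, :] - points[None, :, :]
    D = np.square(diffs).sum(-1)        # squared pairwise distances
    CM = np.ones((n + 2, n + 2))        # bordered Cayley-Menger matrix
    CM[0, 0] = 0.0
    CM[1:, 1:] = D
    vol_sq = (-1) ** (n + 1) * np.linalg.det(CM) / (2 ** n * math.factorial(n) ** 2)
    return math.sqrt(max(vol_sq, 0.0))  # clamp tiny negatives from round-off

def pentachoron_cv(emb, n_samples=500, seed=0):
    """CV (std/mean) of volumes of randomly sampled pentachora
    (4-simplices: 5 vertices) in an embedding cloud. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    vols = np.asarray([
        simplex_volume(emb[rng.choice(len(emb), size=5, replace=False)])
        for _ in range(n_samples)
    ])
    return vols.std() / vols.mean()
```

As a sanity check, the unit right triangle has area 1/2, the unit right tetrahedron volume 1/6, and the unit right 4-simplex volume 1/24 under this formula.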
## File Structure
```
geolip-bertenstein/
├── checkpoints/
│   ├── epoch_001/
│   │   ├── model.safetensors
│   │   ├── loss.safetensors
│   │   ├── training_state.pt
│   │   └── config.json
│   ├── epoch_002/
│   ├── epoch_003/
│   └── final/
│       ├── model.safetensors
│       ├── loss.safetensors
│       ├── training_state.pt
│       ├── config.json
│       ├── aligner_audio.safetensors
│       ├── aligner_code.safetensors
│       ├── aligner_image.safetensors
│       └── aligner_protein.safetensors
├── tensorboard/
├── bertenstein_results.json
├── stage2_bertenstein.py
└── README.md
```
## Usage
```python
from safetensors.torch import load_file
import torch

# Load model weights
state = load_file("checkpoints/final/model.safetensors")

# Reconstruct model (see stage2_bertenstein.py for full class definitions)
from stage2_bertenstein import BertensteinFusion, ExpertModule, ProcrustesAligner, FusionConfig
# ... build model, load state_dict, run inference
```
Precomputed embedding caches (Arrow format) for all modalities: AbstractPhil/geolip-bertenstein-cache
## Geometric Terrain Analysis
The foundational profiling of 17 models and Procrustes alignment analysis: AbstractPhil/procrustes-analysis
## Citation
```bibtex
@misc{abstractphil2026bertenstein,
  title={GEOLIP-Bertenstein: Cross-Modal Alignment Through Shared Pentachoron Geometry},
  author={AbstractPhil},
  year={2026},
  url={https://huggingface.co/AbstractPhil/geolip-bertenstein}
}
```
## License
MIT