GEOLIP-Bertenstein

A multi-expert geometric fusion transformer that bridges 4 independently trained encoders into a shared embedding space, using BERT-large as a universal text hub.

One layer. One epoch. Near-perfect cross-modal retrieval. Universal pentachoron geometry.

Results

Expert Pair                  R@1      Cosine   Pentachoron CV
text ↔ audio   (1.5K val)    1.0000   0.972    0.203
text ↔ code    (5K val)      0.9996   0.988    0.195
text ↔ image   (5K val)      1.0000   0.986    0.196
text ↔ protein (2.3K val)    0.9987   0.979    0.200
text ↔ image   (40K test)    1.0000   0.980    0.208

All CV values converge to the 0.20 ± 0.01 universal band — the same geometric constant measured across 17 neural architectures before this model existed.

These preliminary runs used only partial protein and audio datasets, so those sample sizes are smaller and the results less conclusive than a full-dataset run, with its larger alignment corpus, would allow.

Architecture

                       ┌─────────────┐
                       │  Shared     │
    ┌──────┐           │  Fusion     │           ┌──────┐
    │ BERT │──text──→  │  Transformer│  ←──img───│DINOv2│
    │large │           │  (1 layer)  │           │large │
    └──────┘           │  1024-d     │           └──────┘
                       │  16 heads   │
    ┌──────┐           │             │           ┌──────┐
    │Whisp.│──audio──→ │  Procrustes │  ←──prot──│ESM-2 │
    │large │           │  pre-aligned│           │650M  │
    └──────┘           │             │           └──────┘
                       │             │
    ┌──────┐           │             │
    │Code- │──code──→  │             │
    │BERT  │           └─────────────┘
    └──────┘

Frozen encoders (not trained, not included in this repo):

  • BERT-large (336M) — universal text hub
  • DINOv2-large (302M) — natural images
  • Whisper-large-v3 encoder (1.5B) — speech audio
  • ESM-2-650M (652M) — protein sequences
  • CodeBERT-base (125M) — source code

Trainable fusion (this repo): 41.5M params

  • 1 shared transformer layer (8.9M)
  • 5 expert modules (text + 4 modalities, ~7M each)
  • Includes Procrustes pre-alignment buffers per expert
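Each expert module combines a projection, learned cross-attention pooling, and a modality token (detailed under How It Works below). A minimal torch sketch; the class name and internals are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

class ExpertModuleSketch(nn.Module):
    # Illustrative expert module: project encoder features to 1024-d, then pool a
    # variable-length sequence (e.g. 257 image patches or 1500 audio frames) down
    # to 16 tokens with learned-query cross-attention, and prepend a modality token.
    def __init__(self, in_dim, d_model=1024, n_pool=16, n_heads=16):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        self.queries = nn.Parameter(torch.randn(n_pool, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.modality_token = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)

    def forward(self, feats):                       # feats: (B, T, in_dim)
        kv = self.proj(feats)                       # (B, T, d_model)
        q = self.queries.expand(feats.shape[0], -1, -1)
        pooled, _ = self.attn(q, kv, kv)            # (B, n_pool, d_model)
        tok = self.modality_token.expand(feats.shape[0], -1, -1)
        return torch.cat([tok, pooled], dim=1)      # (B, 1 + n_pool, d_model)

module = ExpertModuleSketch(in_dim=1280)            # e.g. Whisper-sized features
tokens = module(torch.randn(2, 1500, 1280))         # -> shape (2, 17, 1024)
```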

How It Works

  1. Procrustes Pre-Alignment: Before training, compute optimal orthogonal rotation to align each expert's coordinate system with BERT's text space. Whitened Procrustes with centering.

  2. Expert Modules: Each modality gets a projection layer, learned cross-attention pooling (257 image patches → 16 tokens, 1500 audio frames → 16 tokens, etc.), a special <|MODALITY|> token, and an output head.

  3. Fusion Sequence: [<|TEXT|>] [text_tokens] [<|IMAGE|>] [img_tokens] ... — bidirectional self-attention across all modalities. Text tokens attend to image patches. Audio tokens attend to text tokens.

  4. Geometric Loss: InfoNCE contrastive + pentachoron CV variance (Cayley-Menger) + Procrustes SVD alignment. The geometric components drive the embedding manifold toward the universal 0.20 CV band.

  5. Text Hub: All modalities pair with text during training. Cross-modal alignment (e.g., audio↔image) emerges transitively through the shared text space.
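The interleaved sequence in step 3 can be sketched as a simple concatenation. The helper below is illustrative (the per-modality token count of 16 follows the description above; names are assumptions):

```python
import torch

def build_fusion_sequence(modality_tokens, special_embeds):
    # modality_tokens: {name: (B, 16, 1024)} pooled tokens per modality
    # special_embeds:  {name: (1024,)} learned <|MODALITY|> embeddings
    parts = []
    for name, toks in modality_tokens.items():
        B = toks.shape[0]
        parts.append(special_embeds[name].expand(B, 1, -1))  # <|NAME|> marker
        parts.append(toks)
    # One flat sequence: bidirectional self-attention then mixes all modalities
    return torch.cat(parts, dim=1)

B, D = 2, 1024
tokens = {m: torch.randn(B, 16, D) for m in ["text", "image", "audio"]}
specials = {m: torch.randn(D) for m in tokens}
seq = build_fusion_sequence(tokens, specials)
print(seq.shape)   # torch.Size([2, 51, 1024]): 3 x (1 + 16) tokens
```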

Procrustes Pre-Alignment

Expert    cos before   cos after   Dimension
audio      0.0004      0.4404      1280 → 1024 (PCA down)
code      -0.0016      0.4036      768 → 1024 (zero-pad up)
image      0.0038      0.4107      1024 → 1024 (direct)
protein    0.0005      0.3771      1280 → 1024 (PCA down)
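The pre-alignment amounts to an orthogonal Procrustes solve via SVD. A minimal numpy sketch with centering and Frobenius whitening (the repo's exact whitening and dimension-matching details may differ; the data here is synthetic):

```python
import numpy as np

def procrustes_align(X, Y):
    # Orthogonal Procrustes with centering and whitening: find the rotation R
    # minimizing ||Xc @ R - Yc||_F over orthogonal matrices R.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc)          # scale to unit Frobenius norm
    Yc = Yc / np.linalg.norm(Yc)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    return U @ Vt                          # orthogonal: R @ R.T = I

def mean_cos(A, B):
    # Mean row-wise cosine similarity, as in the before/after table above
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A * B).sum(axis=1).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))                   # stand-in "expert" embeddings
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # hidden ground-truth rotation
Y = X @ Q                                        # stand-in "BERT" embeddings
R = procrustes_align(X, Y)
print(mean_cos(X, Y), mean_cos(X @ R, Y))        # near 0 before, near 1 after
```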

Training

  • Data: COCO-Caption (40K), LibriSpeech clean-100 (10K), CodeSearchNet Python (50K), Protein2Text-QA (15K)
  • Schedule: 3 epochs, round-robin across experts, cosine LR 3e-4 → 1e-6
  • Hardware: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)
  • Time: ~11 minutes total (3 × ~220s)
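The schedule can be sketched as follows; the decay shape and per-step round-robin granularity are assumptions, only the endpoints (3e-4 → 1e-6) come from the table above:

```python
import math

EXPERTS = ["audio", "code", "image", "protein"]

def round_robin_expert(step):
    # Each training step draws a batch from the next expert in turn
    return EXPERTS[step % len(EXPERTS)]

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1e-6):
    # Cosine decay from lr_max down to lr_min over the whole run
    t = step / max(total_steps - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

print(round_robin_expert(0), cosine_lr(0, 1000))   # audio, 3e-4
```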

Key Finding: Universal Pentachoron Geometry

The coefficient of variation (CV) of pentachoron (4-simplex) volumes, measured via Cayley-Menger determinants, converges to 0.20 ± 0.01 across:

  • 17 pretrained models (T5, BERT, CLIP, DINOv2, SD UNets, VAEs)
  • 5 architecture families (transformer, UNet, convolutional autoencoder)
  • 4 modalities in this fusion model (audio, code, image, protein)

This constant emerges in ANY embedding space that reaches convergence through gradient-based contrastive learning, regardless of modality, architecture, or training objective.
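The CV measurement can be sketched directly from the Cayley-Menger determinant. A minimal numpy version; the sample counts and random-pentachoron sampling scheme here are illustrative, not necessarily the protocol behind the reported numbers:

```python
import math
import numpy as np

def pentachoron_volume(P):
    # P: (5, d) vertices of a 4-simplex; 4-volume via the Cayley-Menger determinant
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    cm = np.ones((6, 6))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    # vol^2 = (-1)^(n+1) / (2^n * (n!)^2) * det(CM), with n = 4
    v2 = -np.linalg.det(cm) / (2 ** 4 * math.factorial(4) ** 2)
    return math.sqrt(max(v2, 0.0))

def pentachoron_cv(E, n_samples=2000, seed=0):
    # Coefficient of variation (std/mean) of volumes of randomly sampled pentachora
    rng = np.random.default_rng(seed)
    vols = np.array([
        pentachoron_volume(E[rng.choice(len(E), size=5, replace=False)])
        for _ in range(n_samples)
    ])
    return vols.std() / vols.mean()

# Sanity check: regular 4-simplex (standard basis of R^5) has volume sqrt(5)/24
print(pentachoron_volume(np.eye(5)))   # ~0.09317
```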

File Structure

geolip-bertenstein/
├── checkpoints/
│   ├── epoch_001/
│   │   ├── model.safetensors
│   │   ├── loss.safetensors
│   │   ├── training_state.pt
│   │   └── config.json
│   ├── epoch_002/
│   ├── epoch_003/
│   └── final/
│       ├── model.safetensors
│       ├── loss.safetensors
│       ├── training_state.pt
│       ├── config.json
│       ├── aligner_audio.safetensors
│       ├── aligner_code.safetensors
│       ├── aligner_image.safetensors
│       └── aligner_protein.safetensors
├── tensorboard/
├── bertenstein_results.json
├── stage2_bertenstein.py
└── README.md

Usage

import torch
from safetensors.torch import load_file

# Model classes (see stage2_bertenstein.py for full class definitions)
from stage2_bertenstein import BertensteinFusion, ExpertModule, ProcrustesAligner, FusionConfig

# Load trained fusion weights
state = load_file("checkpoints/final/model.safetensors")

# ... build model, load state_dict, run inference

Precomputed embedding caches (Arrow format) for all modalities: AbstractPhil/geolip-bertenstein-cache

Geometric Terrain Analysis

The foundational profiling of 17 models and Procrustes alignment analysis: AbstractPhil/procrustes-analysis

Citation

@misc{abstractphil2026bertenstein,
  title={GEOLIP-Bertenstein: Cross-Modal Alignment Through Shared Pentachoron Geometry},
  author={AbstractPhil},
  year={2026},
  url={https://huggingface.co/AbstractPhil/geolip-bertenstein}
}

License

MIT
