---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
- sdxl
- stable-diffusion
- adaptive-cantor
- geometric-fusion
license: mit
---

# VAE Lyra 🎵 - Adaptive Cantor Edition

Multi-modal Variational Autoencoder for SDXL text embedding transformation using adaptive Cantor fractal fusion with learned alpha (visibility) and beta (capacity) parameters. Fuses CLIP-L, CLIP-G, and decoupled T5-XL scales into a unified latent space.

## Model Details

- **Fusion Strategy**: adaptive_cantor
- **Latent Dimension**: 2048
- **Training Steps**: 78,750
- **Best Loss**: 0.2336

## Learned Parameters

**Alpha (Visibility):**
- clip_g: 0.7291
- clip_l: 0.7280
- t5_xl_g: 0.7244
- t5_xl_l: 0.7161

**Beta (Capacity):**
- clip_l_t5_xl_l: 0.5726
- clip_g_t5_xl_g: 0.5744

## Architecture

- **Modalities** (with sequence lengths):
  - CLIP-L (768d @ 77 tokens) - SDXL text_encoder
  - CLIP-G (1280d @ 77 tokens) - SDXL text_encoder_2
  - T5-XL-L (2048d @ 512 tokens) - Auxiliary for CLIP-L
  - T5-XL-G (2048d @ 512 tokens) - Auxiliary for CLIP-G
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024
- **Cantor Depth**: 8
- **Local Window**: 3

## Key Features

### Adaptive Cantor Fusion

- **Cantor Fractal Routing**: Sparse attention based on fractal coordinate mapping
- **Learned Alpha (Visibility)**: Per-modality parameters controlling latent space usage (tied to KL divergence)
- **Learned Beta (Capacity)**: Per-binding-pair parameters controlling source influence strength

### Decoupled T5 Scales

- T5-XL-L binds specifically to CLIP-L (weight: 0.3)
- T5-XL-G binds specifically to CLIP-G (weight: 0.3)
- Independent T5 representations allow specialized semantic enrichment per CLIP encoder

### Variable Sequence Lengths

- CLIP: 77 tokens (standard)
- T5: 512 tokens (extended context for richer semantic capture)

## SDXL Compatibility

This model outputs both CLIP embeddings needed for SDXL:

- `clip_l`: [batch, 77, 768] → text_encoder output
- `clip_g`: [batch, 77, 1280] → text_encoder_2 output

T5 information is encoded into the latent space and influences both CLIP outputs through learned binding weights.
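As a rough orientation, the sketch below shows how the two reconstructed streams are typically combined for SDXL conditioning: the 768-d and 1280-d hidden states are concatenated along the feature axis into the 2048-d `prompt_embeds` the UNet consumes. This is an illustrative snippet only; the dummy tensors stand in for `recons["clip_l"]` / `recons["clip_g"]` from the Usage example below, and pooled CLIP-G embeddings (which SDXL also requires) are not produced here.

```python
import torch

# Stand-ins for recons["clip_l"] and recons["clip_g"] (see Usage below).
batch = 1
clip_l_out = torch.randn(batch, 77, 768)   # text_encoder-style hidden states
clip_g_out = torch.randn(batch, 77, 1280)  # text_encoder_2-style hidden states

# SDXL conditions its UNet on the two encoders' hidden states concatenated
# along the feature dimension: 768 + 1280 = 2048.
prompt_embeds = torch.cat([clip_l_out, clip_g_out], dim=-1)
print(prompt_embeds.shape)  # torch.Size([1, 77, 2048])

# Note: SDXL additionally expects pooled CLIP-G embeddings
# (pooled_prompt_embeds); obtain those from text_encoder_2 as usual.
```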
## Usage

```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra-xl-adaptive-cantor",
    filename="model.pt"
)

# Load checkpoint (map to CPU so loading works without a GPU)
checkpoint = torch.load(model_path, map_location="cpu")

# Create model with the same configuration used during training
config = MultiModalVAEConfig(
    modality_dims={
        "clip_l": 768,
        "clip_g": 1280,
        "t5_xl_l": 2048,
        "t5_xl_g": 2048
    },
    modality_seq_lens={
        "clip_l": 77,
        "clip_g": 77,
        "t5_xl_l": 512,
        "t5_xl_g": 512
    },
    binding_config={
        "clip_l": {"t5_xl_l": 0.3},
        "clip_g": {"t5_xl_g": 0.3},
        "t5_xl_l": {},
        "t5_xl_g": {}
    },
    latent_dim=2048,
    fusion_strategy="adaptive_cantor",
    cantor_depth=8,
    cantor_local_window=3
)

model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Training uses all four modalities as inputs
inputs = {
    "clip_l": clip_l_embeddings,    # [batch, 77, 768]
    "clip_g": clip_g_embeddings,    # [batch, 77, 1280]
    "t5_xl_l": t5_xl_l_embeddings,  # [batch, 512, 2048]
    "t5_xl_g": t5_xl_g_embeddings   # [batch, 512, 2048]
}

# For SDXL inference, decode only the CLIP outputs
recons, mu, logvar, per_mod_mus = model(inputs, target_modalities=["clip_l", "clip_g"])

# Use recons["clip_l"] and recons["clip_g"] with SDXL
```

## Training Details

- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001
- Alpha Init: 1.0
- Beta Init: 0.3

## Citation

```bibtex
@software{vae_lyra_adaptive_cantor_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Adaptive Cantor Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra-xl-adaptive-cantor}
}
```