# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

FlashWorld is a high-quality 3D scene generation system that creates 3D scenes from text or image prompts in ~7 seconds on a single A100/A800 GPU. The project uses diffusion-based transformers with Gaussian Splatting for 3D reconstruction.

**Key capabilities:**
- Fast 3D scene generation (~7 seconds on A100/A800)
- Text-to-3D and image-to-3D generation
- Runs on GPUs with as little as 24 GB of memory
- Outputs 3D Gaussian Splatting (.ply) files

## Running the Application

### Local Demo (Flask + Custom UI)

```bash
python app.py --port 7860 --gpu 0 --cache_dir ./tmpfiles --max_concurrent 1
```

Access the web interface at `http://HOST_IP:7860`.

**Important flags:**
- `--offload_t5`: Offload text encoding to CPU to reduce GPU memory (trades speed for memory)
- `--ckpt`: Path to a custom checkpoint (auto-downloads from HuggingFace if not provided)
- `--max_concurrent`: Maximum concurrent generation tasks (default: 1)

### ZeroGPU Demo (Gradio)

```bash
python app_gradio.py
```

**ZeroGPU configuration:**
- Uses the `@spaces.GPU(duration=15)` decorator with a 15-second GPU budget
- Model loading happens **outside** the GPU decorator scope (in global scope)
- Requires Gradio 5.49.1+
- Compatible with Hugging Face Spaces ZeroGPU hardware
- Automatically downloads the model checkpoint from the HuggingFace Hub

### Installation

Dependencies are listed in `requirements.txt`. Key packages:
- PyTorch 2.6.0 with CUDA support
- Custom gsplat version pinned to a specific commit
- Custom diffusers version pinned to a specific commit

Install with:
```bash
pip install -r requirements.txt
```

## Architecture

### Core Components

**GenerationSystem** (app.py:90-346)
- Main neural network system combining the VAE, text encoder, transformer, and 3D reconstruction decoder
- Key submodules:
  - `vae`: AutoencoderKLWan for image encoding/decoding (from the Wan2.2-TI2V-5B model)
  - `text_encoder`: UMT5 for text embedding
  - `transformer`: WanTransformer3DModel for diffusion denoising
  - `recon_decoder`: WANDecoderPixelAligned3DGSReconstructionModel for 3D Gaussian Splatting reconstruction
- Uses a flow-matching scheduler with 4 denoising steps
- Implements a feedback mechanism where the previous 3D prediction informs the next denoising step

**Key Generation Pipeline** (sketched in code below):
1. Text/image prompt → text embeddings + optional image latents
2. Create raymaps from camera parameters (6DOF)
3. Iterative denoising with a 3D feedback loop (4 steps at timesteps [0, 250, 500, 750])
4. Final prediction → 3D Gaussian parameters → render to images
5. Export to PLY file format
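To make the loop concrete, here is a minimal, hypothetical sketch of steps 1-5; `system`, its submodule call signatures, and `encode_feedback` are illustrative names, not the actual `app.py` API:

```python
import torch

@torch.no_grad()
def generate_sketch(system, text, image, cameras, timesteps=(0, 250, 500, 750)):
    """Hypothetical outline of the denoising-with-feedback pipeline above."""
    text_emb = system.encode_prompt(text, image)      # step 1: prompt conditioning
    raymaps = system.create_raymaps(cameras)          # step 2: 6DOF cameras -> raymaps
    latents = torch.randn(system.latent_shape)        # start from pure noise
    feedback = torch.zeros_like(latents)              # 3D feedback, empty at first

    for t in timesteps:                               # step 3: 4 denoising steps
        # (traversal order follows the flow-matching scheduler's convention)
        model_in = torch.cat([latents, raymaps, feedback], dim=1)
        pred = system.transformer(model_in, t, text_emb)
        latents = system.scheduler.step(pred, t, latents)
        gaussians = system.recon_decoder(latents, cameras)  # latents -> 3D Gaussians
        feedback = system.encode_feedback(gaussians)        # prior prediction informs next step

    return gaussians  # steps 4-5: render to images, then export_ply_for_gaussians(...)
```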
### Model Files

**models/transformer_wan.py**
- 3D transformer for video diffusion (adapted from the Wan2.2 model)
- Handles temporal + spatial attention with RoPE (Rotary Position Embeddings)

**models/reconstruction_model.py**
- `WANDecoderPixelAligned3DGSReconstructionModel`: converts latent features to 3D Gaussian parameters
- `PixelAligned3DGS`: per-pixel Gaussian parameter prediction
- Outputs: positions (xyz), opacity, scales, rotations, SH features

**models/autoencoder_kl_wan.py**
- VAE for image encoding/decoding (WAN architecture)
- Custom 3D causal convolutions adapted for single-frame processing

**models/render.py**
- Gaussian Splatting rasterization using the gsplat library

**utils.py**
- Camera utilities: normalize_cameras, create_rays, create_raymaps
- Quaternion operations: quaternion_to_matrix, matrix_to_quaternion, quaternion_slerp
- Camera interpolation: sample_from_dense_cameras, sample_from_two_pose
- Export: export_ply_for_gaussians

### Gradio Interface (app_gradio.py)

**ZeroGPU integration:**
- Model initialized in global scope (outside the `@spaces.GPU` decorator)
- `generate_scene()` function decorated with `@spaces.GPU(duration=15)`
- Accepts image prompts (PIL), text prompts, camera JSON, and resolution
- Returns a PLY file and a status message
- Uses the Gradio Progress API for user feedback

**Input format:**
- Image: PIL Image (optional)
- Text: string prompt (optional)
- Camera JSON: array of camera dictionaries with `quaternion`, `position`, `fx`, `fy`, `cx`, `cy`
- Resolution: string of the form "NxHxW" (e.g., "24x480x704")

### Flask API (app.py - Local Only)

**Concurrency Management** (concurrency_manager.py)
- Thread-pool based task queue for handling multiple generation requests
- Task states: QUEUED → RUNNING → COMPLETED/FAILED
- Automatic cleanup of old cached files (30-minute TTL)

**API Endpoints** (see the client sketch after the Camera System section below):
- `POST /generate`: Submit a generation task (returns a task_id immediately)
- `GET /task/<task_id>`: Poll task status and get results
- `GET /download/<task_id>`: Download the generated PLY file
- `DELETE /delete/<task_id>`: Clean up generated files
- `GET /status`: Get queue status
- `GET /`: Serve the web interface (index.html)

**Request format:**
```json
{
  "image_prompt": "",   // optional
  "text_prompt": "...",
  "cameras": [{"quaternion": [...], "position": [...], "fx": ..., "fy": ..., "cx": ..., "cy": ...}],
  "resolution": [n_frames, height, width],
  "image_index": 0      // which frame to condition on
}
```

### Camera System

Cameras are represented as 11D vectors: `[qw, qx, qy, qz, tx, ty, tz, fx, fy, cx, cy]`
- First 4: quaternion rotation (real-first convention)
- Next 3: translation
- Last 4: intrinsics (normalized by image dimensions)

**Camera normalization** (utils.py:269-296):
- Centers the scene around the first camera
- Normalizes translation scale based on the maximum camera distance
- Critical for stable 3D generation
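For orientation, a minimal sketch of what this normalization could look like, assuming real-first quaternions and `[N, 3]` camera positions; the actual `normalize_cameras` in `utils.py` may also align rotations and handle edge cases differently:

```python
import torch

def normalize_cameras_sketch(quats: torch.Tensor, positions: torch.Tensor):
    """quats: [N, 4] real-first quaternions; positions: [N, 3] camera centers."""
    # Center the scene around the first camera.
    centered = positions - positions[0]
    # Normalize the translation scale by the farthest camera from the first.
    scale = centered.norm(dim=-1).max().clamp(min=1e-6)
    return quats, centered / scale
```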
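Tying back to the Flask API above, a hypothetical end-to-end client, assuming the server runs locally on port 7860, that task ids are appended to the endpoint paths, and that responses carry `task_id` and `status` keys (the exact schema lives in `app.py`):

```python
import time
import requests

BASE = "http://localhost:7860"

payload = {
    "text_prompt": "a cozy cabin in a snowy forest",
    # Illustrative single camera: identity rotation, normalized intrinsics.
    "cameras": [{"quaternion": [1, 0, 0, 0], "position": [0, 0, 0],
                 "fx": 0.7, "fy": 0.7, "cx": 0.5, "cy": 0.5}],
    "resolution": [24, 480, 704],
}

# Submit the task; the server returns a task_id immediately.
task_id = requests.post(f"{BASE}/generate", json=payload).json()["task_id"]

# Poll until the task leaves QUEUED/RUNNING.
while True:
    status = requests.get(f"{BASE}/task/{task_id}").json()
    if status.get("status") in ("COMPLETED", "FAILED"):
        break
    time.sleep(1)

# Fetch the generated Gaussian Splatting file on success.
if status["status"] == "COMPLETED":
    with open("scene.ply", "wb") as f:
        f.write(requests.get(f"{BASE}/download/{task_id}").content)
```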
## Development Notes

### Memory Management
- The transformer uses FP8 quantization (quant.py) to reduce memory
- The VAE and text encoder can be offloaded to CPU with the `--offload_vae` and `--offload_t5` flags
- A checkpointing mechanism for the decoder reduces memory during training

### Key Constants
- Latent dimension: 48 channels
- Temporal downsample: 4x
- Spatial downsample: 16x
- Feature dimension: 1024 channels
- Latent patch size: 2
- Denoising timesteps: [0, 250, 500, 750]

As a rough worked example: a "24x480x704" request (24 frames at 480×704) is denoised in a latent volume of about 48 channels × 6 × 30 × 44 after the 4x temporal and 16x spatial downsampling.

### Model Weights
- Primary checkpoint auto-downloads from HuggingFace: `imlixinyang/FlashWorld`
- Base diffusion model: `Wan-AI/Wan2.2-TI2V-5B-Diffusers`
- The model is adapted with additional input/output channels for 3D features

### Rendering
- Uses gsplat 1.5.2 for differentiable Gaussian Splatting
- SH degree: 2 (supports spherical harmonics up to degree 2)
- Background modes: 'white', 'black', 'random'
- Output FPS: 15

## License

CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International) - Academic research use only.