# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

FlashWorld is a high-quality 3D scene generation system that creates 3D scenes from text or image prompts in ~7 seconds on a single A100/A800 GPU. The project uses diffusion-based transformers with Gaussian Splatting for 3D reconstruction.

**Key capabilities:**
- Fast 3D scene generation (~7 seconds on a single A100/A800)
- Text-to-3D and Image-to-3D generation
- Runs on GPUs with as little as 24 GB of memory
- Outputs 3D Gaussian Splatting (.ply) files

## Running the Application

### Local Demo (Flask + Custom UI)
```bash
python app.py --port 7860 --gpu 0 --cache_dir ./tmpfiles --max_concurrent 1
```

Access the web interface at `http://HOST_IP:7860`.

**Important flags:**
- `--offload_t5`: Offload text encoding to CPU to reduce GPU memory (trades speed for memory)
- `--ckpt`: Path to custom checkpoint (auto-downloads from HuggingFace if not provided)
- `--max_concurrent`: Maximum concurrent generation tasks (default: 1)
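
For example, a lower-memory run that offloads the text encoder and loads a local checkpoint might look like this (the checkpoint path is hypothetical):

```bash
python app.py --port 7860 --gpu 0 --offload_t5 --ckpt ./checkpoints/flashworld.pt
```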

### ZeroGPU Demo (Gradio)
```bash
python app_gradio.py
```

**ZeroGPU Configuration:**
- Uses `@spaces.GPU(duration=15)` decorator with 15-second GPU budget
- Model loading happens **outside** the GPU decorator scope (in global scope)
- Gradio 5.49.1+ required
- Compatible with Hugging Face Spaces ZeroGPU hardware
- Automatically downloads model checkpoint from HuggingFace Hub
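
A minimal sketch of this pattern (`load_generation_system` is a placeholder, not the actual API; the real wiring in app_gradio.py differs in detail):

```python
import spaces   # provided on Hugging Face Spaces ZeroGPU hardware
import gradio as gr

# Heavy setup runs at import time, outside any @spaces.GPU scope, so no
# GPU budget is spent on model loading. `load_generation_system` stands in
# for the real GenerationSystem construction.
system = load_generation_system()

@spaces.GPU(duration=15)   # a GPU is attached only while this call runs
def generate_scene(image, text, cameras_json, resolution,
                   progress=gr.Progress()):
    progress(0.0, desc="Generating...")
    return system.generate(image, text, cameras_json, resolution)
```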

### Installation
Dependencies are in `requirements.txt`. Key packages:
- PyTorch 2.6.0 with CUDA support
- Custom gsplat build, pinned to a specific commit
- Custom diffusers build, pinned to a specific commit

Install with:
```bash
pip install -r requirements.txt
```

## Architecture

### Core Components

**GenerationSystem** (app.py:90-346)
- Main neural network system combining VAE, text encoder, transformer, and 3D reconstruction
- Key submodules:
  - `vae`: AutoencoderKLWan for image encoding/decoding (from Wan2.2-TI2V-5B model)
  - `text_encoder`: UMT5 for text embedding
  - `transformer`: WanTransformer3DModel for diffusion denoising
  - `recon_decoder`: WANDecoderPixelAligned3DGSReconstructionModel for 3D Gaussian Splatting reconstruction
- Uses a flow matching scheduler with 4 denoising steps
- Implements a feedback mechanism in which each step's 3D prediction informs the next denoising step (see the sketch after the pipeline below)

**Key Generation Pipeline:**
1. Text/image prompt → text embeddings + optional image latents
2. Create raymaps from camera parameters (6DOF)
3. Iterative denoising with 3D feedback loop (4 steps at timesteps [0, 250, 500, 750])
4. Final prediction → 3D Gaussian parameters → render to images
5. Export to PLY file format
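
A hedged sketch of the feedback loop in steps 3-4, with toy stand-ins so it runs end to end (the real transformer, decoder, and scheduler live in app.py; channel counts for raymaps and text embeddings are illustrative):

```python
import torch

def transformer(x, t, text_emb):
    return x[:, :48]                     # stub: predict a 48-channel latent

def recon_decoder(latent):
    return latent                        # stub: latent -> 3DGS parameters

def scheduler_step(pred, t, latents):
    return pred                          # stub: flow-matching update

timesteps = [0, 250, 500, 750]           # the 4 denoising steps
latents = torch.randn(1, 48, 6, 30, 44)  # noise at [B, 48, T/4, H/16, W/16]
raymaps = torch.zeros_like(latents)      # camera raymaps from step 2
feedback = torch.zeros_like(latents)     # 3D feedback, empty before step one
text_emb = torch.zeros(1, 512, 1024)     # UMT5 text embeddings

for t in timesteps:
    model_in = torch.cat([latents, raymaps, feedback], dim=1)
    pred = transformer(model_in, t, text_emb)
    feedback = recon_decoder(pred)               # previous 3D prediction...
    latents = scheduler_step(pred, t, latents)   # ...informs the next step
```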

### Model Files

**models/transformer_wan.py**
- 3D transformer for video diffusion (adapted from Wan2.2 model)
- Handles temporal + spatial attention with RoPE (Rotary Position Embeddings)

**models/reconstruction_model.py**
- `WANDecoderPixelAligned3DGSReconstructionModel`: Converts latent features to 3D Gaussian parameters
- `PixelAligned3DGS`: Per-pixel Gaussian parameter prediction
- Outputs: positions (xyz), opacity, scales, rotations, SH features

**models/autoencoder_kl_wan.py**
- VAE for image encoding/decoding (WAN architecture)
- Custom 3D causal convolutions adapted for single-frame processing

**models/render.py**
- Gaussian Splatting rasterization using gsplat library

**utils.py**
- Camera utilities: normalize_cameras, create_rays, create_raymaps
- Quaternion operations: quaternion_to_matrix, matrix_to_quaternion, quaternion_slerp
- Camera interpolation: sample_from_dense_cameras, sample_from_two_pose
- Export: export_ply_for_gaussians
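
For reference, a standalone quaternion slerp matching the real-first convention used here (a generic implementation, not the actual `quaternion_slerp` code):

```python
import torch

def slerp(q0: torch.Tensor, q1: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two unit quaternions [w, x, y, z]."""
    q0, q1 = q0 / q0.norm(), q1 / q1.norm()
    dot = (q0 * q1).sum()
    if dot < 0:                      # take the shorter arc on the 4-sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:                 # nearly parallel: lerp and renormalize
        q = q0 + t * (q1 - q0)
        return q / q.norm()
    theta = torch.arccos(dot.clamp(-1.0, 1.0))
    return (torch.sin((1 - t) * theta) * q0
            + torch.sin(t * theta) * q1) / torch.sin(theta)
```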

### Gradio Interface (app_gradio.py)

**ZeroGPU Integration:**
- Model initialized in global scope (outside @spaces.GPU decorator)
- `generate_scene()` function decorated with `@spaces.GPU(duration=15)`
- Accepts image prompts (PIL), text prompts, camera JSON, and resolution
- Returns PLY file and status message
- Uses Gradio Progress API for user feedback

**Input Format:**
- Image: PIL Image (optional)
- Text: String prompt (optional)
- Camera JSON: Array of camera dictionaries with `quaternion`, `position`, `fx`, `fy`, `cx`, `cy`
- Resolution: String format "NxHxW" (e.g., "24x480x704")
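
A minimal, hypothetical camera JSON with a single entry (identity rotation at the origin; the intrinsic values are illustrative):

```json
[
  {
    "quaternion": [1.0, 0.0, 0.0, 0.0],
    "position": [0.0, 0.0, 0.0],
    "fx": 1.2, "fy": 1.2,
    "cx": 0.5, "cy": 0.5
  }
]
```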

### Flask API (app.py - Local Only)

**Concurrency Management** (concurrency_manager.py)
- Thread-pool based task queue for handling multiple generation requests
- Task states: QUEUED → RUNNING → COMPLETED/FAILED
- Automatic cleanup of old cached files (30-minute TTL)

**API Endpoints:**
- `POST /generate`: Submit generation task (returns task_id immediately)
- `GET /task/<task_id>`: Poll task status and get results
- `GET /download/<file_id>`: Download generated PLY file
- `DELETE /delete/<file_id>`: Clean up generated files
- `GET /status`: Get queue status
- `GET /`: Serve web interface (index.html)

**Request Format:**
```json
{
  "image_prompt": "<base64 or path>",  // optional
  "text_prompt": "...",
  "cameras": [{"quaternion": [...], "position": [...], "fx": ..., "fy": ..., "cx": ..., "cy": ...}],
  "resolution": [n_frames, height, width],
  "image_index": 0  // which frame to condition on
}
```
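
An illustrative request flow with curl (the response bodies shown are assumptions based on the endpoint descriptions above):

```bash
# Submit a task; the server returns a task_id immediately
curl -X POST http://HOST_IP:7860/generate \
     -H "Content-Type: application/json" \
     -d @request.json
# e.g. {"task_id": "abc123"}

# Poll until the task reports COMPLETED, then download the PLY
curl http://HOST_IP:7860/task/abc123
curl -O http://HOST_IP:7860/download/<file_id>
```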

### Camera System

Cameras are represented as 11D vectors: `[qw, qx, qy, qz, tx, ty, tz, fx, fy, cx, cy]`
- First 4: quaternion rotation (real-first convention)
- Next 3: translation
- Last 4: intrinsics (normalized by image dimensions)
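
For example, an identity-rotation camera at the origin might be encoded as follows (intrinsic values are illustrative):

```python
import numpy as np

camera = np.array([
    1.0, 0.0, 0.0, 0.0,  # qw, qx, qy, qz (real-first quaternion, identity)
    0.0, 0.0, 0.0,       # tx, ty, tz (camera at the origin)
    1.2, 1.2,            # fx, fy (normalized by image width/height)
    0.5, 0.5,            # cx, cy (principal point at the image center)
])
```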

**Camera normalization** (utils.py:269-296):
- Centers scene around first camera
- Normalizes translation scale based on max camera distance
- Critical for stable 3D generation
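
A hedged sketch of the positional part of this normalization (the actual utils.py code also handles rotations and the full 11D format):

```python
import numpy as np

def normalize_positions(positions: np.ndarray) -> np.ndarray:
    """positions: (N, 3) camera centers; the first camera becomes the origin."""
    centered = positions - positions[0]              # center on first camera
    scale = np.linalg.norm(centered, axis=1).max()   # max camera distance
    return centered / max(scale, 1e-8)               # unit max distance
```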

## Development Notes

### Memory Management
- The transformer uses FP8 quantization (quant.py) to reduce memory
- The text encoder and VAE can be offloaded to CPU with the `--offload_t5` and `--offload_vae` flags, respectively (sketched below)
- A checkpointing mechanism for the decoder reduces memory during training
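
A hedged sketch of the offload pattern behind `--offload_t5` (assuming the encoder returns a tensor; the actual handling lives in app.py):

```python
import torch

def encode_prompt(text_encoder, prompt_ids, device="cuda"):
    text_encoder.to("cpu")              # weights stay in system RAM
    with torch.no_grad():
        emb = text_encoder(prompt_ids)  # runs on CPU: slower, but saves VRAM
    return emb.to(device)               # only the small embedding hits the GPU
```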

### Key Constants
- Latent dimension: 48 channels
- Temporal downsample: 4x
- Spatial downsample: 16x
- Feature dimension: 1024 channels
- Latent patch size: 2
- Denoising timesteps: [0, 250, 500, 750]
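
Putting these together, a `"24x480x704"` request maps to a latent tensor as follows (assuming the downsample factors apply exactly as stated):

```python
n_frames, height, width = 24, 480, 704  # from the "24x480x704" resolution string
latent_shape = (48, n_frames // 4, height // 16, width // 16)
print(latent_shape)                      # (48, 6, 30, 44)
```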

### Model Weights
- The primary checkpoint auto-downloads from Hugging Face: `imlixinyang/FlashWorld`
- Base diffusion model: `Wan-AI/Wan2.2-TI2V-5B-Diffusers`
- The base model is adapted with additional input/output channels for 3D features
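
To fetch the weights manually (app.py does this automatically; `snapshot_download` is the generic Hugging Face Hub call, not necessarily the one app.py uses):

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download("imlixinyang/FlashWorld")  # downloads the repo files
```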

### Rendering
- Uses gsplat 1.5.2 for differentiable Gaussian Splatting
- SH degree: 2 (spherical harmonics up to degree 2 for view-dependent color)
- Background modes: 'white', 'black', 'random'
- Output FPS: 15
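
A hedged sketch of a gsplat render call over toy Gaussians (argument names follow gsplat's `rasterization()` API; the exact call in models/render.py may differ, e.g. in how SH features are passed):

```python
import torch
from gsplat import rasterization

N = 1000  # toy scene: random Gaussians
means = torch.randn(N, 3, device="cuda")
quats = torch.nn.functional.normalize(torch.randn(N, 4, device="cuda"), dim=-1)
scales = torch.rand(N, 3, device="cuda") * 0.02
opacities = torch.rand(N, device="cuda")
colors = torch.rand(N, 3, device="cuda")      # plain RGB here, not SH, for simplicity
viewmats = torch.eye(4, device="cuda")[None]  # world-to-camera, [1, 4, 4]
Ks = torch.tensor([[[300.0, 0.0, 352.0],
                    [0.0, 300.0, 240.0],
                    [0.0, 0.0, 1.0]]], device="cuda")

rgb, alpha, _ = rasterization(
    means, quats, scales, opacities, colors,
    viewmats, Ks, width=704, height=480,
    backgrounds=torch.ones(1, 3, device="cuda"),  # the 'white' background mode
)
```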

## License

CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International) - Academic research use only.