CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

FlashWorld is a 3D scene generation system that produces high-quality 3D scenes from text or image prompts in ~7 seconds on a single A100/A800 GPU. It pairs a diffusion-based transformer with 3D Gaussian Splatting for reconstruction.

Key capabilities:

  • Fast 3D scene generation (~7 seconds on an A100/A800)
  • Text-to-3D and Image-to-3D generation
  • Runs on GPUs with as little as 24 GB of memory
  • Outputs 3D Gaussian Splatting (.ply) files

Running the Application

Local Demo (Flask + Custom UI)

python app.py --port 7860 --gpu 0 --cache_dir ./tmpfiles --max_concurrent 1

Access the web interface at http://HOST_IP:7860

Important flags:

  • --offload_t5: Offload the T5 text encoder to CPU to reduce GPU memory (trades speed for memory)
  • --offload_vae: Offload the VAE to CPU (see Memory Management below)
  • --ckpt: Path to custom checkpoint (auto-downloads from HuggingFace if not provided)
  • --max_concurrent: Maximum concurrent generation tasks (default: 1)
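
For example, a memory-constrained invocation that offloads the text encoder:

python app.py --port 7860 --gpu 0 --offload_t5 --max_concurrent 1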

ZeroGPU Demo (Gradio)

python app_gradio.py

ZeroGPU Configuration:

  • Uses the @spaces.GPU(duration=15) decorator, which caps each call at a 15-second GPU budget
  • Model loading happens outside GPU decorator scope (in global scope)
  • Gradio 5.49.1+ required
  • Compatible with Hugging Face Spaces ZeroGPU hardware
  • Automatically downloads model checkpoint from HuggingFace Hub
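
A minimal sketch of this pattern (the @spaces.GPU decorator is the real spaces API; the GenerationSystem construction and method names are abbreviated placeholders, not the exact app_gradio.py code):

import spaces
import torch

# Heavy setup runs once at import time, in global scope; ZeroGPU attaches
# a GPU to the process only while a decorated function is executing.
system = GenerationSystem(...)  # constructor arguments omitted; see app_gradio.py

@spaces.GPU(duration=15)  # each call gets up to a 15-second GPU budget
def generate_scene(image, text, cameras_json, resolution):
    # Method name below is illustrative, not GenerationSystem's actual API.
    with torch.no_grad():
        return system.generate(image, text, cameras_json, resolution)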

Installation

Dependencies are in requirements.txt. Key packages:

  • PyTorch 2.6.0 with CUDA support
  • gsplat pinned to a specific Git commit
  • diffusers pinned to a specific Git commit

Install with:

pip install -r requirements.txt
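
The commit pins follow pip's Git-URL syntax; schematically (hashes elided, and the actual pins in requirements.txt may point at forks rather than the upstream repositories):

git+https://github.com/nerfstudio-project/gsplat.git@<commit>
git+https://github.com/huggingface/diffusers.git@<commit>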

Architecture

Core Components

GenerationSystem (app.py:90-346)

  • Main neural network system combining VAE, text encoder, transformer, and 3D reconstruction
  • Key submodules:
    • vae: AutoencoderKLWan for image encoding/decoding (from Wan2.2-TI2V-5B model)
    • text_encoder: UMT5 for text embedding
    • transformer: WanTransformer3DModel for diffusion denoising
    • recon_decoder: WANDecoderPixelAligned3DGSReconstructionModel for 3D Gaussian Splatting reconstruction
  • Uses flow matching scheduler with 4 denoising steps
  • Implements feedback mechanism where previous predictions inform next denoising step

Key Generation Pipeline:

  1. Text/image prompt β†’ text embeddings + optional image latents
  2. Create raymaps from camera parameters (6DOF)
  3. Iterative denoising with 3D feedback loop (4 steps at timesteps [0, 250, 500, 750])
  4. Final prediction β†’ 3D Gaussian parameters β†’ render to images
  5. Export to PLY file format
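
A schematic sketch of this loop (all names and signatures are illustrative; the real implementation lives in GenerationSystem in app.py):

# Schematic of the 4-step flow-matching loop with 3D feedback.
def denoise_with_feedback(latents, text_emb, raymaps, cameras,
                          transformer, recon_decoder, render, scheduler_step):
    feedback = None
    for t in (0, 250, 500, 750):                 # the 4 denoising timesteps
        pred = transformer(latents, t, text_emb, raymaps, feedback)
        gaussians = recon_decoder(pred)          # pixel-aligned 3D Gaussians
        feedback = render(gaussians, cameras)    # rendered views condition the next step
        latents = scheduler_step(pred, t, latents)
    return gaussians                             # final parameters, exported to PLY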

Model Files

models/transformer_wan.py

  • 3D transformer for video diffusion (adapted from Wan2.2 model)
  • Handles temporal + spatial attention with RoPE (Rotary Position Embeddings)

models/reconstruction_model.py

  • WANDecoderPixelAligned3DGSReconstructionModel: Converts latent features to 3D Gaussian parameters
  • PixelAligned3DGS: Per-pixel Gaussian parameter prediction
  • Outputs: positions (xyz), opacity, scales, rotations, SH features
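
Hypothetical tensor shapes for these outputs, one Gaussian per pixel of an H x W map (field names are assumptions; SH degree 2 gives (2+1)^2 = 9 coefficients per color channel):

import torch

H, W = 480, 704                         # rendered resolution
N = H * W                               # one Gaussian per pixel
gaussians = {
    "xyz":       torch.empty(N, 3),     # positions
    "opacity":   torch.empty(N, 1),
    "scales":    torch.empty(N, 3),
    "rotations": torch.empty(N, 4),     # unit quaternions
    "sh":        torch.empty(N, 9, 3),  # degree-2 SH coefficients, RGB
}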

models/autoencoder_kl_wan.py

  • VAE for image encoding/decoding (WAN architecture)
  • Custom 3D causal convolutions adapted for single-frame processing

models/render.py

  • Gaussian Splatting rasterization using gsplat library

utils.py

  • Camera utilities: normalize_cameras, create_rays, create_raymaps
  • Quaternion operations: quaternion_to_matrix, matrix_to_quaternion, quaternion_slerp
  • Camera interpolation: sample_from_dense_cameras, sample_from_two_pose
  • Export: export_ply_for_gaussians
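
For reference, a generic quaternion slerp in PyTorch; the repo's quaternion_slerp may differ in signature and edge-case handling:

import torch

def slerp(q0: torch.Tensor, q1: torch.Tensor, t: float) -> torch.Tensor:
    # Spherical linear interpolation between unit quaternions of shape (..., 4).
    dot = (q0 * q1).sum(-1, keepdim=True)
    q1 = torch.where(dot < 0, -q1, q1)              # take the shorter arc
    theta = torch.arccos(dot.abs().clamp(max=1.0))  # angle between rotations
    sin_theta = torch.sin(theta)
    w0 = torch.sin((1.0 - t) * theta) / sin_theta.clamp(min=1e-8)
    w1 = torch.sin(t * theta) / sin_theta.clamp(min=1e-8)
    out = w0 * q0 + w1 * q1
    # Fall back to linear interpolation when the quaternions nearly coincide.
    out = torch.where(sin_theta < 1e-4, (1.0 - t) * q0 + t * q1, out)
    return out / out.norm(dim=-1, keepdim=True)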

Gradio Interface (app_gradio.py)

ZeroGPU Integration:

  • Model initialized in global scope (outside @spaces.GPU decorator)
  • generate_scene() function decorated with @spaces.GPU(duration=15)
  • Accepts image prompts (PIL), text prompts, camera JSON, and resolution
  • Returns PLY file and status message
  • Uses Gradio Progress API for user feedback

Input Format:

  • Image: PIL Image (optional)
  • Text: String prompt (optional)
  • Camera JSON: Array of camera dictionaries with quaternion, position, fx, fy, cx, cy
  • Resolution: String in "NxHxW" format (frames x height x width), e.g., "24x480x704"
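
For example, a single identity camera with normalized intrinsics (values illustrative):

[
  {
    "quaternion": [1.0, 0.0, 0.0, 0.0],
    "position": [0.0, 0.0, 0.0],
    "fx": 0.7, "fy": 0.7, "cx": 0.5, "cy": 0.5
  }
]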

Flask API (app.py - Local Only)

Concurrency Management (concurrency_manager.py)

  • Thread-pool based task queue for handling multiple generation requests
  • Task states: QUEUED β†’ RUNNING β†’ COMPLETED/FAILED
  • Automatic cleanup of old cached files (30-minute TTL)
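
Schematically (names are illustrative, not concurrency_manager.py's actual API):

import enum, time

class TaskState(enum.Enum):      # the states listed above
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"

CACHE_TTL = 30 * 60              # cached files expire after 30 minutes

def is_expired(created_at: float) -> bool:
    return time.time() - created_at > CACHE_TTL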

API Endpoints:

  • POST /generate: Submit generation task (returns task_id immediately)
  • GET /task/<task_id>: Poll task status and get results
  • GET /download/<file_id>: Download generated PLY file
  • DELETE /delete/<file_id>: Clean up generated files
  • GET /status: Get queue status
  • GET /: Serve web interface (index.html)

Request Format:

{
  "image_prompt": "<base64 or path>",  // optional
  "text_prompt": "...",
  "cameras": [{"quaternion": [...], "position": [...], "fx": ..., "fy": ..., "cx": ..., "cy": ...}],
  "resolution": [n_frames, height, width],
  "image_index": 0  // which frame to condition on
}
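
A hypothetical client flow using Python requests (task_id comes from the endpoint docs above; other response field names are assumptions):

import time
import requests

BASE = "http://localhost:7860"
payload = {
    "text_prompt": "a cozy cabin interior",
    "cameras": [{"quaternion": [1, 0, 0, 0], "position": [0, 0, 0],
                 "fx": 0.7, "fy": 0.7, "cx": 0.5, "cy": 0.5}],
    "resolution": [24, 480, 704],
}

task_id = requests.post(f"{BASE}/generate", json=payload).json()["task_id"]
while True:
    status = requests.get(f"{BASE}/task/{task_id}").json()
    if status.get("state") in ("COMPLETED", "FAILED"):  # field name assumed
        break
    time.sleep(1.0)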

Camera System

Cameras are represented as 11D vectors: [qw, qx, qy, qz, tx, ty, tz, fx, fy, cx, cy]

  • First 4: quaternion rotation (real-first convention)
  • Next 3: translation
  • Last 4: intrinsics (normalized by image dimensions)
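
For example, an identity camera at the origin (intrinsic values illustrative):

# [qw, qx, qy, qz, tx, ty, tz, fx, fy, cx, cy]
camera = [1.0, 0.0, 0.0, 0.0,   # identity rotation, real-first quaternion
          0.0, 0.0, 0.0,        # translation at the origin
          0.7, 0.7, 0.5, 0.5]   # intrinsics, normalized by image dimensions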

Camera normalization (utils.py:269-296):

  • Centers scene around first camera
  • Normalizes translation scale based on max camera distance
  • Critical for stable 3D generation
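
A minimal sketch of that normalization (the actual normalize_cameras in utils.py may differ in conventions and return values):

import torch

def normalize_cameras_sketch(c2w: torch.Tensor) -> torch.Tensor:
    # c2w: (N, 4, 4) camera-to-world matrices.
    # Re-express every camera relative to the first, which becomes the identity...
    c2w = torch.linalg.inv(c2w[0]) @ c2w
    # ...then rescale translations so the farthest camera sits at unit distance.
    scale = c2w[:, :3, 3].norm(dim=-1).max().clamp(min=1e-8)
    c2w[:, :3, 3] = c2w[:, :3, 3] / scale
    return c2w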

Development Notes

Memory Management

  • The transformer uses FP8 quantization (quant.py) to reduce memory
  • The VAE and text encoder can be offloaded to CPU with the --offload_vae and --offload_t5 flags
  • Gradient checkpointing in the decoder reduces memory during training
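
As a generic illustration of FP8 weight storage (quant.py's actual scheme may add per-tensor scaling and custom kernels):

import torch

linear = torch.nn.Linear(1024, 1024)
w_fp8 = linear.weight.data.to(torch.float8_e4m3fn)  # 1 byte/weight vs 2 for fp16
w = w_fp8.to(torch.bfloat16)                        # dequantize before compute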

Key Constants

  • Latent dimension: 48 channels
  • Temporal downsample: 4x
  • Spatial downsample: 16x
  • Feature dimension: 1024 channels
  • Latent patch size: 2
  • Denoising timesteps: [0, 250, 500, 750]
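
As a worked example: the default "24x480x704" resolution gives roughly 24 / 4 = 6 latent frames (the exact count depends on the causal VAE's frame handling) and 480 / 16 = 30 by 704 / 16 = 44 spatial positions, i.e., about 6 x 30 x 44 = 7,920 transformer tokens, assuming the 16x spatial factor already folds in the latent patch size of 2.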

Model Weights

  • Primary checkpoint auto-downloads from HuggingFace: imlixinyang/FlashWorld
  • Base diffusion model: Wan-AI/Wan2.2-TI2V-5B-Diffusers
  • Model is adapted with additional input/output channels for 3D features
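
To fetch the same checkpoint manually with huggingface_hub (the app handles this automatically):

from huggingface_hub import snapshot_download

# Downloads the FlashWorld checkpoint repo into the local HF cache.
local_dir = snapshot_download(repo_id="imlixinyang/FlashWorld")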

Rendering

  • Uses gsplat 1.5.2 for differentiable Gaussian Splatting
  • SH degree: 2 (spherical harmonics up to degree 2, i.e., 9 coefficients per color channel)
  • Background modes: 'white', 'black', 'random'
  • Output FPS: 15
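
A hedged sketch of gsplat's rasterization entry point, which models/render.py presumably wraps (shapes follow gsplat 1.x conventions; the actual wrapper may differ):

import torch
import gsplat

N = 10_000                                           # number of Gaussians
device = "cuda"
colors, alphas, meta = gsplat.rasterization(
    means=torch.randn(N, 3, device=device),          # positions
    quats=torch.randn(N, 4, device=device),          # quaternions (normalized internally)
    scales=torch.rand(N, 3, device=device),
    opacities=torch.rand(N, device=device),
    colors=torch.rand(N, 9, 3, device=device),       # degree-2 SH: (2+1)^2 = 9
    viewmats=torch.eye(4, device=device)[None],      # (C, 4, 4) world-to-camera
    Ks=torch.tensor([[500.0, 0.0, 352.0],
                     [0.0, 500.0, 240.0],
                     [0.0, 0.0, 1.0]], device=device)[None],  # (C, 3, 3) pixel intrinsics
    width=704, height=480,
    sh_degree=2,
)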

License

CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International) - Academic research use only.