# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

InfiniteTalk generates realistic talking-head videos with accurate lip-sync. It supports two modes:

- **Image-to-Video**: Transform static portraits into talking videos using audio input
- **Video Dubbing**: Re-sync existing videos with new audio while maintaining natural movements

It is built on the Wan2.1 diffusion model with specialized audio conditioning for photorealistic results.

## Architecture

### Core Components

**Main Application** (`app.py`)
- Gradio interface with ZeroGPU support via the `@spaces.GPU(duration=180)` decorator
- Two-tab interface: Image-to-Video and Video Dubbing
- Lazy model loading on first inference to minimize startup time
- Global `ModelManager` and `GPUManager` instances for resource management

**Model Pipeline** (`wan/multitalk.py`)
- `InfiniteTalkPipeline`: Main generation pipeline using the Wan2.1-I2V-14B model
- Supports two resolutions: 480p (640x640) and 720p (960x960)
- Uses diffusion-based generation with audio conditioning
- Implements chunked processing for long videos to manage memory

**Audio Processing** (`src/audio_analysis/wav2vec2.py`)
- Custom `Wav2Vec2Model` extending HuggingFace's implementation
- Extracts audio embeddings with temporal interpolation via `linear_interpolation`
- Processes audio at 16kHz with loudness normalization (pyloudnorm)
- Stacks hidden states from all encoder layers for a rich audio representation

**Model Management** (`utils/model_loader.py`)
- `ModelManager`: Handles lazy loading and caching of models from the HuggingFace Hub
- Downloads three model types:
  - Wan2.1-I2V-14B: Main video generation model (Kijai/WanVideo_comfy)
  - InfiniteTalk weights: Specialized talking head weights (MeiGen-AI/InfiniteTalk)
  - Wav2Vec2: Audio encoder (TencentGameMate/chinese-wav2vec2-base)
- Models cached in `HF_HOME` or `/data/.huggingface`

**GPU Management** (`utils/gpu_manager.py`)
- `GPUManager`: Monitors memory usage and performs cleanup
- Calculates the ZeroGPU duration based on video length and resolution
- Memory estimation: ~20GB base plus ~0.8GB per second of video (480p) or ~1.5GB per second (720p)
- Recommends chunking for videos requiring >50GB of memory

**Configuration** (`wan/configs/__init__.py`)
- `WAN_CONFIGS`: Model configurations for different tasks (t2v, i2v, infinitetalk)
- `SIZE_CONFIGS`: Resolution mappings (infinitetalk-480: 640x640, infinitetalk-720: 960x960)
- `SUPPORTED_SIZES`: Valid resolution options per model type

### Data Flow

1. **Audio Processing**: Audio file → librosa load → loudness normalization → Wav2Vec2 feature extraction → audio embeddings (shape: [seq_len, batch, dim])
2. **Input Processing**: Image/video → PIL/cache_video → frame extraction → resize and center crop to target resolution
3. **Generation**: InfiniteTalk pipeline combines visual input + audio embeddings → diffusion sampling → video tensor
4. **Output**: Video tensor → save_video_ffmpeg with audio track → MP4 file

### Key Design Patterns

- **Lazy Loading**: Models are only loaded on first inference to reduce cold start time (see the sketch after this list)
- **Memory Management**: Aggressive cleanup with `torch.cuda.empty_cache()` and `gc.collect()` after generation
- **ZeroGPU Integration**: `@spaces.GPU` decorator with a duration calculated from the video length
- **Offloading**: Models can be offloaded to CPU between forward passes to save VRAM
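A minimal sketch of the lazy-loading and post-generation cleanup patterns. The names `LazyPipelineHolder`, `get`, and `cleanup_after_generation` are illustrative, not the actual APIs of `ModelManager` or `GPUManager` (those live in `utils/model_loader.py` and `utils/gpu_manager.py`):

```python
import gc
from typing import Callable, Optional

import torch


class LazyPipelineHolder:
    """Illustrative holder that builds the heavy pipeline once, on first use."""

    def __init__(self, loader: Callable[[], object]):
        self._loader = loader  # e.g. a function that constructs the generation pipeline
        self._pipeline: Optional[object] = None

    def get(self) -> object:
        # Nothing is downloaded or moved to the GPU until the first
        # generation request arrives, which keeps Space cold starts fast.
        if self._pipeline is None:
            self._pipeline = self._loader()
        return self._pipeline


def cleanup_after_generation() -> None:
    """Aggressively release memory between requests."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```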
## Development Commands

### Docker Build and Run

```bash
# Build Docker image
docker build -t infinitetalk .

# Run locally
docker run -p 7860:7860 --gpus all infinitetalk
```

### Python Environment

```bash
# Install dependencies (requires PyTorch 2.5.1+ for xfuser compatibility)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.7.4.post1 --no-build-isolation  # Optional, may fail on some systems
pip install -r requirements.txt

# Run application
python app.py
```

### System Dependencies

Required packages (see `packages.txt`):
- ffmpeg (video processing)
- build-essential (compilation)
- libsndfile1 (audio I/O)
- git (model downloads)

## Important Implementation Details

### Resolution Handling

- The user selects "480p" or "720p" in the UI
- Internally mapped to `infinitetalk-480` (640x640) or `infinitetalk-720` (960x960)
- `sample_shift` parameter: 7 for 480p, 11 for 720p (controls diffusion sampling)

### Audio Embedding Format

Audio embeddings must be saved as `.pt` files in the format expected by the pipeline:

```python
audio_embeddings = torch.stack(embeddings.hidden_states[1:], dim=1).squeeze(0)
audio_embeddings = rearrange(audio_embeddings, "b s d -> s b d")  # Shape: [seq_len, batch, dim]
torch.save(audio_embeddings, emb_path)
```

### Pipeline Input Format

The `generate_infinitetalk` method expects:

```python
input_clip = {
    "prompt": "",  # Empty for talking head
    "cond_video": image_or_video_path,
    "cond_audio": {"person1": embedding_path},
    "video_audio": audio_wav_path
}
```

### ZeroGPU Duration Calculation

```python
base_time = 60  # Model loading
processing_rate = 2.5 if resolution == "480p" else 3.5  # GPU seconds per second of video
duration = int((base_time + video_duration * processing_rate) * 1.2)  # 20% safety margin
duration = min(duration, 300)  # Cap at 300s for the free tier
```

### Memory Optimization

- Use `offload_model=True` in the pipeline to offload models to CPU between forward passes
- Enable VRAM management for low-memory scenarios: `pipeline.enable_vram_management()`
- Flash-attention (if available) reduces memory usage significantly
- Chunked processing for videos >15s (480p) or >10s (720p)

## HuggingFace Space Deployment

This project is designed for HuggingFace Spaces with ZeroGPU:

- SDK: `docker` (specified in the README.md frontmatter)
- Hardware: `zero-gpu` (H200 with 70GB VRAM)
- Port: `7860` (Gradio default)
- The first generation downloads ~15GB of models (2-3 minutes)
- Subsequent generations: ~40s for a 10s video at 480p

See `DEPLOYMENT.md` for detailed deployment instructions and troubleshooting.

## Common Pitfalls

1. **Flash-attn compilation**: May fail on some systems. The Dockerfile handles this gracefully with an `|| echo "Warning..."` fallback
2. **PyTorch version**: Must be 2.5.1+ for xfuser's `torch.distributed.tensor.experimental` support
3. **Audio sample rate**: Must be 16kHz for the Wav2Vec2 model (see the sketch after this list)
4. **Frame format**: The pipeline expects 4n+1 frames (e.g., 81 frames) for proper temporal modeling
5. **Model paths**: InfiniteTalk weights must be loaded separately from the base Wan model
6. **TOKENIZERS_PARALLELISM**: Set to 'false' to avoid deadlocks in multi-threaded environments
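A minimal sketch of guarding against pitfalls 3, 4, and 6. The helper names are illustrative and not the repository's actual functions:

```python
import os

import librosa

# Pitfall 6: set before any tokenizer or threading work starts.
os.environ["TOKENIZERS_PARALLELISM"] = "false"


def load_audio_16k(path: str):
    """Pitfall 3: Wav2Vec2 expects 16 kHz audio, so resample on load."""
    audio, sample_rate = librosa.load(path, sr=16000)
    return audio, sample_rate


def clamp_to_4n_plus_1(num_frames: int) -> int:
    """Pitfall 4: trim the frame count down to the nearest 4n+1 value (e.g. 81)."""
    return max(1, ((num_frames - 1) // 4) * 4 + 1)
```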
## File Structure

```
├── app.py                  # Main Gradio application
├── Dockerfile              # Docker build configuration
├── requirements.txt        # Python dependencies
├── packages.txt            # System dependencies
├── utils/
│   ├── model_loader.py     # Model download and loading
│   └── gpu_manager.py      # GPU memory management
├── wan/
│   ├── multitalk.py        # InfiniteTalk pipeline
│   ├── configs/            # Model configurations
│   ├── modules/            # Model architecture (VAE, DiT, etc.)
│   └── utils/              # Video/audio utilities
└── src/
    └── audio_analysis/
        └── wav2vec2.py     # Audio encoder with interpolation
```