Wan 2.2 Image-to-Video (I2V-A14B) - GGUF Models (FP16 + Q4_K_S)
This repository contains GGUF quantized versions of the Wan 2.2 Image-to-Video A14B model, optimized for efficient inference with reduced VRAM requirements while maintaining high-quality video generation capabilities.
Model Description
Wan 2.2 is an advanced large-scale video generative model that uses a Mixture-of-Experts (MoE) architecture specifically designed for image-to-video synthesis. The A14B variant features a dual-expert design with approximately 14 billion parameters per expert:
- High-Noise Expert: Optimized for early denoising stages, focusing on overall layout and composition
- Low-Noise Expert: Specialized for later denoising stages, refining video details and quality
The model generates videos at 480P and 720P resolutions from static images, with support for text-guided prompts to control the generation process. Wan 2.2 incorporates meticulously curated aesthetic data with detailed labels for lighting, composition, contrast, and color tone, enabling precise cinematic-style video generation.
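For intuition, here is a minimal sketch (not the model's actual implementation) of how denoising steps can be routed between the two experts; the 0.5 boundary, the function name, and the expert labels are illustrative assumptions.
# Illustrative only: route each denoising step to one of the two experts.
# The 0.5 boundary and all names here are assumptions for clarity.
def select_expert(step: int, num_steps: int, boundary: float = 0.5) -> str:
    progress = step / max(num_steps - 1, 1)
    # Early (high-noise) steps shape layout; later (low-noise) steps refine detail.
    return "high_noise_expert" if progress < boundary else "low_noise_expert"

# A 50-step schedule splits roughly evenly between the two experts.
schedule = [select_expert(t, 50) for t in range(50)]
print(schedule[0], schedule[-1])  # high_noise_expert low_noise_expert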
Repository Contents
This repository contains three GGUF model files optimized for different use cases:
diffusion_models/wan/
├── wan22-i2v-a14b-high.gguf (15 GB) - Full FP16 high-noise expert
├── wan22-i2v-a14b-high-q4-k-s.gguf (8.2 GB) - Q4_K_S quantized high-noise expert
└── wan22-i2v-a14b-low-q4-k-s.gguf (8.2 GB) - Q4_K_S quantized low-noise expert
Total Repository Size: 31 GB
Model Files Explained
- wan22-i2v-a14b-high.gguf: Full precision FP16 high-noise expert model for maximum quality
- wan22-i2v-a14b-high-q4-k-s.gguf: Q4_K_S quantized high-noise expert (46% size reduction)
- wan22-i2v-a14b-low-q4-k-s.gguf: Q4_K_S quantized low-noise expert (46% size reduction)
Quantization Format: Q4_K_S (4-bit K-quant Small) provides an optimal balance between model size, memory usage, and generation quality.
Hardware Requirements
Minimum Requirements
| Configuration | VRAM | Disk Space | RAM |
|---|---|---|---|
| Full FP16 | 24 GB | 31 GB | 32 GB |
| Q4_K_S Quantized | 12 GB | 31 GB | 16 GB |
| Mixed (FP16 + Q4_K_S) | 18 GB | 31 GB | 24 GB |
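As a quick way to match your GPU against the table above, the following snippet (assuming PyTorch with CUDA is installed) reports total VRAM and suggests a starting configuration; the 24 GB threshold mirrors the Full FP16 row.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    suggestion = "Full FP16" if total_gb >= 24 else "Q4_K_S quantized"
    print(f"{props.name}: {total_gb:.1f} GB VRAM -> start with {suggestion}")
else:
    print("No CUDA GPU detected; these models are not practical on CPU.")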
Recommended Requirements
- GPU: NVIDIA RTX 4090 (24GB), RTX 6000 Ada (48GB), or A6000 (48GB)
- CPU: Modern multi-core processor (8+ cores recommended)
- Storage: SSD for faster model loading
- Operating System: Windows 10/11, Linux (Ubuntu 22.04+)
Performance Notes
- FP16 models provide the highest quality but require more VRAM
- Q4_K_S quantization reduces VRAM usage by ~50% with minimal quality loss
- Video generation time depends on resolution (480P ~30-60s, 720P ~60-120s per video)
- Batch processing can improve throughput but requires additional VRAM
Usage Examples
ComfyUI Integration
The most common way to use these GGUF models is through ComfyUI with the ComfyUI-GGUF custom node.
Installation:
# Navigate to ComfyUI custom nodes directory
cd ComfyUI/custom_nodes
# Clone the GGUF node
git clone https://github.com/city96/ComfyUI-GGUF
# Install dependencies
cd ComfyUI-GGUF
pip install -r requirements.txt
Model Setup:
# Copy models into ComfyUI's model directory (PowerShell example)
copy E:\huggingface\wan22-fp16-i2v-gguf\diffusion_models\wan\*.gguf ComfyUI\models\unet\
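If you prefer downloading directly into ComfyUI rather than copying local files, something like the following works with huggingface_hub; the repo ID below is a placeholder for this repository's actual ID.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="<user>/wan22-fp16-i2v-gguf",  # placeholder; use this repository's real ID
    filename="diffusion_models/wan/wan22-i2v-a14b-high-q4-k-s.gguf",
    local_dir="ComfyUI/models/unet",       # the file keeps its subfolder path under this directory
)
print(local_path)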
Workflow Configuration:
- Load image input node
- Add GGUF Model Loader node
- Select wan22-i2v-a14b-high-q4-k-s.gguf for the high-noise expert (a second loader with wan22-i2v-a14b-low-q4-k-s.gguf typically handles the low-noise stages)
- Add prompt conditioning (optional)
- Configure the video sampler with:
  - Steps: 50-100
  - CFG Scale: 7-9
  - Resolution: 480P or 720P
- Connect to video output node
Python Usage (Diffusers)
For direct Python usage with absolute paths:
from diffusers import DiffusionPipeline
from diffusers.utils import load_image  # used below to read the input image
import torch
# Note: GGUF models require conversion or specialized loaders
# For native Diffusers support, use the base model:
# pipe = DiffusionPipeline.from_pretrained("Wan-AI/Wan2.2-I2V-A14B-Diffusers")
# For GGUF files, use ComfyUI or llama.cpp-based loaders
# Example using a custom GGUF loader (illustrative only; comfyui_gguf_loader
# is a placeholder name, not a published package):
from comfyui_gguf_loader import load_gguf_model
model_path = r"E:\huggingface\wan22-fp16-i2v-gguf\diffusion_models\wan\wan22-i2v-a14b-high-q4-k-s.gguf"
model = load_gguf_model(model_path, device="cuda", dtype=torch.float16)
# Generate video from image
image = load_image("input_image.jpg")
video = model.generate(
image=image,
prompt="A serene landscape with gentle wind moving through grass",
num_frames=48,
resolution="720p",
guidance_scale=8.0,
num_inference_steps=75
)
# Save video
video.save("output_video.mp4")
Advanced Configuration
# Memory-optimized configuration for 12GB VRAM
config = {
"model_path": r"E:\huggingface\wan22-fp16-i2v-gguf\diffusion_models\wan\wan22-i2v-a14b-high-q4-k-s.gguf",
"vae_tiling": True, # Reduce VAE memory usage
"enable_xformers": True, # Memory-efficient attention
"gradient_checkpointing": True,
"low_vram_mode": True,
"chunk_size": 2, # Process video in chunks
}
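The dictionary above is a configuration sketch rather than a concrete API. When running the Diffusers base model instead of the GGUF files, comparable savings come from the pipeline's own helpers, roughly as follows (a sketch, assuming the Wan-AI/Wan2.2-I2V-A14B-Diffusers checkpoint):
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()               # analogous to "low_vram_mode": True
if hasattr(pipe, "vae") and hasattr(pipe.vae, "enable_tiling"):
    pipe.vae.enable_tiling()                  # analogous to "vae_tiling": True
# xFormers attention needs the optional xformers package:
# pipe.enable_xformers_memory_efficient_attention()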
Model Specifications
Architecture
- Base Model: Wan 2.2 I2V-A14B (Image-to-Video)
- Parameters: 14.3 billion per expert (~27B total, 14B active)
- Architecture: Mixture-of-Experts (MoE) Diffusion Transformer
- Experts: Dual-expert design (high-noise + low-noise)
- Precision: FP16 (full) / Q4_K_S (quantized)
- Format: GGUF (GPT-Generated Unified Format)
Capabilities
- Input: Static images (any resolution, recommended 512x512 or higher)
- Output: Video sequences at 480P (854x480) or 720P (1280x720)
- Frame Count: Configurable (typically 24-96 frames)
- Frame Rate: 24 FPS (configurable)
- Duration: 1-4 seconds typical output (duration = frames / frame rate; see the quick check after this list)
- Text Conditioning: Optional prompt-guided generation
- Style Control: Lighting, composition, contrast, color tone
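A quick check of how the frame count, frame rate, and duration figures above fit together:
# Duration = frames / fps; the typical 24-96 frame range at 24 FPS gives 1-4 seconds.
for frames in (24, 48, 96):
    print(f"{frames} frames @ 24 FPS = {frames / 24:.1f} s")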
Quantization Details
Q4_K_S Quantization:
- Bit Depth: 4-bit per weight (mixed with some 6-bit components)
- Method: K-quant Small (balanced quality/size trade-off)
- Size Reduction: ~46% compared to FP16
- Quality Loss: Minimal (~2-5% perceptual difference)
- Speed: Similar or faster inference thanks to lower memory-bandwidth requirements
Performance Tips and Optimization
Memory Optimization
- Use Quantized Models: Start with Q4_K_S versions for 12GB VRAM systems
- Enable VAE Tiling: Reduces memory usage by processing image tiles
- Lower Resolution: Generate at 480P first, upscale if needed
- Reduce Batch Size: Process one video at a time on limited VRAM
- Model Offloading: Move models to CPU between inference steps
Quality Optimization
- Inference Steps: Use 75-100 steps for best quality (50 minimum)
- Guidance Scale: CFG 7-9 provides good prompt adherence
- Prompt Engineering: Describe motion, lighting, and camera movement
- Input Image Quality: Higher quality input = better video output
- Resolution Matching: Match input aspect ratio to output resolution
Speed Optimization
- Use Quantized Models: Q4_K_S inference is 10-20% faster
- Enable xFormers: Memory-efficient attention for faster processing
- Optimize Steps: Balance quality vs speed (50-75 steps for faster generation)
- Compile Model: Use torch.compile() for a 15-25% speedup (PyTorch 2.0+); see the sketch after this list
- GPU Warmup: Run one generation to compile kernels before batch processing
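A minimal sketch of the torch.compile() step above, assuming the Diffusers base model; component names vary between pipeline versions, so this inspects pipe.components instead of hard-coding them.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.float16
).to("cuda")
# Compile any transformer-like denoiser components; the first generation
# afterwards acts as the warmup run that triggers kernel compilation.
for name, module in pipe.components.items():
    if isinstance(module, torch.nn.Module) and "transformer" in name:
        setattr(pipe, name, torch.compile(module))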
Example Prompts
Good Prompts:
- "Gentle camera pan right, golden hour lighting, soft wind through trees"
- "Slow zoom in, dramatic lighting from left, subtle motion in background"
- "Static camera, clouds moving across sky, soft ambient lighting"
Avoid:
- Overly complex multi-action prompts
- Conflicting motion directions
- Unrealistic physics or transformations
License
This model is released under a custom Wan license. Please refer to the original Wan 2.2 model repository for complete licensing terms.
Usage Terms
Users are accountable for the content they generate and must not:
- Violate laws or regulations
- Cause harm to individuals or groups
- Generate or spread misinformation or disinformation
- Target or harm vulnerable populations
Commercial Use
Please consult the original Wan 2.2 license for commercial use terms and conditions.
Citation
If you use Wan 2.2 models in your research or applications, please cite:
@article{wan2025,
title={Wan: Open and Advanced Large-Scale Video Generative Models},
author={Team Wan and Contributors},
journal={arXiv preprint arXiv:2503.20314},
year={2025}
}
Related Resources
Official Resources
- Original Model: Wan-AI/Wan2.2-I2V-A14B
- Diffusers Version: Wan-AI/Wan2.2-I2V-A14B-Diffusers
- GGUF Collection: QuantStack/Wan2.2-I2V-A14B-GGUF
- GitHub Repository: Wan-Video/Wan2.2
- Research Paper: arXiv:2503.20314
Community Resources
- ComfyUI Integration: ComfyUI-GGUF
- Tutorial: Wan 2.2 VideoGen in ComfyUI
- Low VRAM Guide: Running Wan 2.2 GGUF with Low VRAM
Other Wan 2.2 Variants
- Text-to-Video: Wan2.2-T2V-A14B
- Text+Image-to-Video: Wan2.2-TI2V-5B
- Speech-to-Video: Wan2.2-S2V-14B
Troubleshooting
Common Issues
Issue: Out of memory errors
Solution: Use Q4_K_S quantized models, enable VAE tiling, reduce resolution to 480P
Issue: Slow generation speed
Solution: Use quantized models, enable xFormers, reduce inference steps to 50-75
Issue: Poor video quality
Solution: Increase inference steps to 75-100, use a higher guidance scale (8-9), improve input image quality
Issue: Model fails to load
Solution: Verify GGUF loader compatibility, check file integrity (see the header check below), ensure sufficient disk space
Issue: Inconsistent motion
Solution: Use clearer motion prompts, adjust the guidance scale, increase inference steps
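For the "check file integrity" step, a quick first pass is to confirm the GGUF magic bytes at the start of the file (this only validates the header, not the whole download):
from pathlib import Path

# Every valid GGUF file begins with the 4-byte ASCII magic "GGUF".
path = Path(r"E:\huggingface\wan22-fp16-i2v-gguf\diffusion_models\wan\wan22-i2v-a14b-high-q4-k-s.gguf")
with path.open("rb") as f:
    ok = f.read(4) == b"GGUF"
print("GGUF header OK" if ok else "Unexpected header - re-download the file")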
Support and Contact
For issues, questions, or contributions:
- Model Issues: Wan-AI on Hugging Face
- GGUF Issues: ComfyUI-GGUF GitHub
- General Discussion: Hugging Face Forums
- Model Version: v2.2
- README Version: v1.3
- Last Updated: 2025-10-14
- Format: GGUF (FP16 + Q4_K_S)
- Base Model: Wan-AI/Wan2.2-I2V-A14B