- WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens
  Paper • 2401.09985 • Published • 18
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
  Paper • 2401.09962 • Published • 9
- Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
  Paper • 2401.10404 • Published • 10
- ActAnywhere: Subject-Aware Video Background Generation
  Paper • 2401.10822 • Published • 13
Collections
Collections including paper arxiv:2603.21986

- Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
  Paper • 2602.20161 • Published • 23
- A Very Big Video Reasoning Suite
  Paper • 2602.20159 • Published • 522
- Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
  Paper • 2603.21986 • Published • 125
- AURA: Always-On Understanding and Real-Time Assistance via Video Streams
  Paper • 2604.04184 • Published • 50

- yandex/stable-diffusion-3.5-medium-alchemist
  Text-to-Image • Updated • 12 • 7
- Ovis-U1 Technical Report
  Paper • 2506.23044 • Published • 61
- FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model
  Paper • 2507.01953 • Published • 18
- LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
  Paper • 2507.01945 • Published • 76

- MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data
  Paper • 2603.25319 • Published • 32
- Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
  Paper • 2603.25040 • Published • 131
- MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
  Paper • 2603.22458 • Published • 135
- Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
  Paper • 2603.21986 • Published • 125

- The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
  Paper • 2509.26507 • Published • 550
- mHC: Manifold-Constrained Hyper-Connections
  Paper • 2512.24880 • Published • 324
- NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
  Paper • 2601.00393 • Published • 133
- LTX-2: Efficient Joint Audio-Visual Foundation Model
  Paper • 2601.03233 • Published • 177

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 39
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
  Paper • 2507.16746 • Published • 35
- MolmoAct: Action Reasoning Models that can Reason in Space
  Paper • 2508.07917 • Published • 45
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
  Paper • 2508.20072 • Published • 32

- DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
  Paper • 2412.11100 • Published • 7
- LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
  Paper • 2412.09856 • Published • 11
- DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
  Paper • 2412.09349 • Published • 8
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
  Paper • 2412.04448 • Published • 10