How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval? Paper • 2407.07479 • Published Jul 10, 2024
SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses Paper • 2408.01669 • Published Aug 3, 2024
Music-driven Dance Regeneration with Controllable Key Pose Constraints Paper • 2207.03682 • Published Jul 8, 2022
Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization Paper • 2207.03190 • Published Jul 7, 2022
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts Paper • 2507.20939 • Published Jul 28, 2025
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning Paper • 2509.18094 • Published Sep 22, 2025
mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA Paper • 2411.15041 • Published Nov 22, 2024
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries Paper • 2511.14349 • Published Nov 2025
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models Paper • 2412.19645 • Published Dec 27, 2024