Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models Paper • 2310.05863 • Published Oct 9, 2023 • 2
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization Paper • 2410.06682 • Published Oct 9, 2024
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models Paper • 2506.15220 • Published Jun 18, 2025 • 1