plmsmile's Collections
image llm
video llm
vision foundation models
mllm datasets
image-video llm
benchmarks
llm
text datasets
video generation
train methods
mllm applications

vision foundation models

updated Jun 17, 2024

vision foundation models

  • ViTAR: Vision Transformer with Any Resolution

    Paper • 2403.18361 • Published Mar 27, 2024 • 55

  • BRAVE: Broadening the visual encoding of vision-language models

    Paper • 2404.07204 • Published Apr 10, 2024 • 19

  • CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

    Paper • 2404.15653 • Published Apr 24, 2024 • 29

  • Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Paper • 2405.09818 • Published May 16, 2024 • 131

  • Many-Shot In-Context Learning in Multimodal Foundation Models

    Paper • 2405.09798 • Published May 16, 2024 • 32

  • ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

    Paper • 2405.15738 • Published May 24, 2024 • 46

  • Matryoshka Multimodal Models

    Paper • 2405.17430 • Published May 27, 2024 • 34

  • An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

    Paper • 2406.09415 • Published Jun 13, 2024 • 51