deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane that it can parse charts and re-render them in HTML
> it concatenates CLIP and SAM features, so better grounding
> very efficient vision-token-to-performance ratio
> covers ~100 languages
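The post mentions that CLIP and SAM features are concatenated; here is a minimal numpy sketch of what channel-wise concatenation of per-patch features from two vision encoders could look like (the shapes are illustrative assumptions, not DeepSeek-OCR's actual dimensions):

```python
import numpy as np

# Illustrative shapes only -- not DeepSeek-OCR's real dimensions.
num_patches = 256                # e.g. a 16x16 patch grid
clip_dim, sam_dim = 1024, 256

# Per-patch features from two separate vision encoders
clip_feats = np.random.randn(num_patches, clip_dim)  # semantic features (CLIP-style)
sam_feats = np.random.randn(num_patches, sam_dim)    # dense/local features (SAM-style)

# Concatenate along the channel axis so each vision token
# carries both the semantic and the dense view of its patch
fused = np.concatenate([clip_feats, sam_feats], axis=-1)
print(fused.shape)  # (256, 1280)
```

The number of vision tokens stays the same; only the per-token feature dimension grows, which is one way to add grounding signal without paying more tokens.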
The goal here is to get a sense of the profile of the models with the greatest impact in open source (closed models are out of scope here!).
IBM just released a small Swiss army knife for document models: granite-docling-258M on Hugging Face 🔥
> not only a document converter, it can also do document question answering and understands multiple languages 🤯
> best part: released under the Apache 2.0 license 🔓 use it in your commercial projects!
> it supports transformers, vLLM and MLX from the get-go! 🤗
> built on SigLIP2 & granite-165M
first vision language model built on openai/gpt-oss-20b just dropped! 🔥
InternVL3.5 comes with 32 models 🤯 pre-trained, fine-tuned, and aligned variants in various sizes OpenGVLab/internvl35-68ac87bd52ebe953485927fb uses gpt-oss or Qwen3 for the LLM part ⤵️
Fine-tune Gemma3n on videos with audio inside on a Colab A100 🔥 Just dropped the notebook where you can learn how to fine-tune Gemma3n on images + audio + text at the same time!
keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train in <40GB VRAM stretch modalities and unfreeze layers as you wish! 👇🏻 merve/smol-vision
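The post says LoRA is what keeps training under 40GB VRAM; a tiny self-contained sketch of why, counting trainable parameters for a single linear layer (the dimensions here are illustrative, not Gemma3n's actual config):

```python
# Hypothetical layer size for illustration -- not Gemma3n's actual config.
d_in, d_out, rank = 4096, 4096, 16

# Full fine-tuning trains the whole weight matrix W (d_out x d_in)
full_params = d_in * d_out

# LoRA freezes W and trains only a low-rank update B @ A,
# with A of shape (rank x d_in) and B of shape (d_out x rank)
lora_params = rank * (d_in + d_out)

print(full_params, lora_params)  # 16777216 vs 131072
print(f"trainable fraction: {lora_params / full_params:.4f}")  # 0.0078
```

With rank 16 you train under 1% of the layer's parameters, which is where the VRAM savings for optimizer states and gradients come from.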