deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient vision-token-to-performance ratio
> covers 100 languages
DeepSeek-OCR is a new open-source, vision-language OCR model from DeepSeek-AI (the same lab behind the DeepSeek-V and DeepSeek-R series). It’s built to read complex, real-world documents — screenshots, PDFs, forms, tables, and handwritten or noisy text — and output clean, structured Markdown.
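For orientation, here is a minimal usage sketch following the Hugging Face `trust_remote_code` loading pattern from the model card. Treat the prompt string, the `infer(...)` arguments (`image_file`, `output_path`, `base_size`, `image_size`, `crop_mode`), and the file names as illustrative assumptions and check the repo for the exact API.

```python
# Minimal sketch, not the official recipe: load DeepSeek-OCR with
# trust_remote_code and convert one document image to Markdown.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# Prompt and infer() signature follow the model card; the input image and
# output directory names are hypothetical.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="report_page.png",
    output_path="./ocr_output",
    base_size=1024,
    image_size=640,
    crop_mode=True,   # adaptive tiling ("Gundam mode") for dense pages
    save_results=True,
)
```

The model card also shows a plain-text prompt variant ("Free OCR.") for cases where you want raw text rather than structured Markdown.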
---
⚙️ Core capabilities
Multimodal (Vision + Language): Uses a hybrid vision encoder + causal text decoder to “see” layouts and generate text like a language model rather than just classifying characters.
Markdown output: Instead of raw text, it structures output with Markdown syntax — headings, bullet lists, tables, and inline formatting — which makes the results ideal for direct use in notebooks or LLM pipelines.
PDF-aware: Includes a built-in PDF runner that automatically slices pages into tiles, processes each region, and re-assembles multi-page outputs.
Adaptive tiling (“crop_mode”): Automatically splits large pages into overlapping tiles for better recognition of dense, small fonts (the “Gundam mode” mentioned in their docs); see the tiling sketch after this list.
Vision backbone: A hybrid encoder that concatenates SAM and CLIP features (matching the point above), trained on massive document + scene-text corpora; the overall model weighs in at roughly 3 B parameters. Handles resolutions up to 1280 × 1280 px and dynamically scales lower.
Language head: Uses the same causal decoder family as DeepSeek-V2, fine-tuned for text reconstruction, so it can reason about table alignment, code blocks, and list structures.
Open and MIT-licensed: Weights and inference code are fully open under the MIT license, allowing integration into other projects or retraining for domain-specific OCR.
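To make the PDF-runner and crop-mode ideas above concrete, here is an illustrative sketch of the overall flow: render PDF pages to images, then cover each page with overlapping tiles before OCR. It uses pdf2image as a stand-in renderer, and the tile size, overlap, and helper names are my assumptions; DeepSeek-OCR ships its own PDF script and tiler, which differ in detail.

```python
# Illustrative only: render PDF pages and split them into overlapping tiles,
# mimicking the "crop_mode" idea. Sizes and helpers here are assumptions,
# not DeepSeek-OCR's actual implementation.
from pdf2image import convert_from_path   # external helper, requires poppler
from PIL import Image


def split_into_tiles(page: Image.Image, tile: int = 1024, overlap: int = 128):
    """Yield (left, top, crop) tiles that cover the page with some overlap."""
    step = tile - overlap
    width, height = page.size
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            box = (left, top, min(left + tile, width), min(top + tile, height))
            yield left, top, page.crop(box)


pages = convert_from_path("contract.pdf", dpi=200)   # hypothetical input PDF
for page_num, page in enumerate(pages, start=1):
    tiles = list(split_into_tiles(page))
    # Each tile would be sent to the model, and the per-tile results merged
    # back by (left, top) position into one Markdown document per page.
    print(f"page {page_num}: {len(tiles)} tiles")
```

The re-assembly step is where the model's layout awareness pays off: overlapping regions have to be deduplicated and merged back into a single reading order.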
---
🆕 What’s new about its approach
Traditional OCR (e.g., Tesseract, PaddleOCR) → detects and classifies glyphs. DeepSeek-OCR → interprets the entire document as a multimodal sequence: the vision encoder takes in the whole layout, and the causal decoder generates structured text (headings, tables, lists) the way a language model generates any other text.