DeepSeek AI

🌟 Github | 📥 Model Download | 📄 Paper Link | 📄 Arxiv Paper Link

DeepSeek-OCR: Contexts Optical Compression

Explore the boundaries of visual-text compression.

The official release of DeepSeek-OCR pins the transformers version to 4.46.3 and has not been adapted to newer releases. This community edition therefore modifies the modeling.py module so that no transformers downgrade is required. It has also been adapted for MindSpore + MindNLP, and users are welcome to run it on Ascend hardware.

Feel free to choose among attention implementations such as FlashAttention or SDPA to take advantage of the latest optimizations in transformers for a performance boost.
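For instance, a minimal sketch of selecting the backend at load time (mirroring the usage sections below, which pass it through the _attn_implementation argument of from_pretrained; the checkpoint name is the one used later in this card):

import torch
from transformers import AutoModel

# choose 'sdpa' or, if flash-attn is installed, 'flash_attention_2'
model = AutoModel.from_pretrained(
    'lvyufeng/DeepSeek-OCR',
    dtype=torch.bfloat16,
    _attn_implementation='sdpa',
    trust_remote_code=True,
    use_safetensors=True,
)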

Combined MoE

In Transformer-based Mixture-of-Experts (MoE) models, the conventional approach uses an MoE gating module to select experts and then loops over the experts to process the hidden states. This leads to host-bound comparisons that can significantly slow down token generation, especially on Ascend hardware.
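For reference, here is a simplified, illustrative sketch of that conventional dispatch loop (not this repository's code; the gate and expert interfaces are assumptions) showing where the host-side control flow comes from:

import torch

def loop_moe_forward(hidden_states, gate, experts):
    # flatten tokens: (batch, seq, hidden) -> (num_tokens, hidden)
    tokens = hidden_states.view(-1, hidden_states.shape[-1])
    selected_experts, routing_weights = gate(tokens)   # each (num_tokens, top_k)
    output = torch.zeros_like(tokens)
    for expert_id, expert in enumerate(experts):       # host-side loop over experts
        token_idx, k_idx = torch.where(selected_experts == expert_id)
        if token_idx.numel() == 0:                      # host-bound comparison / device sync
            continue
        expert_out = expert(tokens[token_idx])
        output.index_add_(0, token_idx, expert_out * routing_weights[token_idx, k_idx, None])
    return output.view_as(hidden_states)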

To address this, we introduce a method that consolidates the MoE layer into three stacked weight tensors (gate_proj, up_proj, and down_proj). This design is particularly suitable for smaller MoE models whose experts fit entirely in memory. Below is the key implementation:

# combine the expert weights before inference:
for layer in self.model.layers:
    if isinstance(layer.mlp, DeepseekV2MoE):
        moe_layer = layer.mlp
        # stack the per-expert projections into three (n_experts, in_features, out_features) tensors
        moe_layer.w1 = nn.Parameter(torch.stack([moe_layer.experts[i].gate_proj.weight.T for i in range(moe_layer.config.n_routed_experts)]), requires_grad=False)
        moe_layer.w2 = nn.Parameter(torch.stack([moe_layer.experts[i].down_proj.weight.T for i in range(moe_layer.config.n_routed_experts)]), requires_grad=False)
        moe_layer.w3 = nn.Parameter(torch.stack([moe_layer.experts[i].up_proj.weight.T for i in range(moe_layer.config.n_routed_experts)]), requires_grad=False)
        # the combined forward below uses self.act; expose the experts' shared activation
        # (assumes the expert MLPs store it as act_fn)
        moe_layer.act = moe_layer.experts[0].act_fn

# patch DeepseekV2MoE with the new forward method

def new_forward_for_moe(self, hidden_states):
    batch_size, sequence_length, hidden_dim = hidden_states.shape
    selected_experts, routing_weights = self.gate(hidden_states)
    # dense (num_tokens, n_routed_experts) score matrix built from the top-k routing results
    router_scores = torch.zeros(size=(batch_size * sequence_length, self.config.n_routed_experts), device=hidden_states.device, dtype=hidden_states.dtype)
    # cast the routing weights back to the input dtype
    routing_weights = routing_weights.to(hidden_states.dtype)
    router_scores = torch.scatter_add(router_scores, -1, selected_experts, routing_weights)
    hidden_states = hidden_states.view(-1, hidden_dim)
    if self.config.n_shared_experts is not None:
        shared_expert_output = self.shared_experts(hidden_states)

    # (num_tokens, hidden) x (n_experts, hidden, inter) -> (n_experts, num_tokens, inter) via broadcasting
    hidden_w1 = torch.matmul(hidden_states, self.w1)
    hidden_w3 = torch.matmul(hidden_states, self.w3)
    hidden_states = self.act(hidden_w1) * hidden_w3
    # project back to hidden size and weight each expert's output by its routing score
    hidden_states = torch.bmm(hidden_states, self.w2) * torch.transpose(router_scores, 0, 1).unsqueeze(-1)
    # sum over the expert dimension
    final_hidden_states = hidden_states.sum(dim=0, dtype=hidden_states.dtype)
    if self.config.n_shared_experts is not None:
        final_hidden_states = final_hidden_states + shared_expert_output
    return final_hidden_states.view(batch_size, sequence_length, hidden_dim)
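One way to attach this forward, assuming the attribute and class names used above (the binding step itself is not shown in this card), is to patch it onto each combined layer:

import types

for layer in self.model.layers:
    if isinstance(layer.mlp, DeepseekV2MoE):
        # bind the patched forward to this MoE instance
        layer.mlp.forward = types.MethodType(new_forward_for_moe, layer.mlp)

In this repository the weight combination is exposed through model.combine_moe(), which the usage sections below call once before inference.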

As a result, we achieve a 3–4x speedup in OCR text generation, which makes the optimized model considerably more practical for production use.

MindSpore Usage

Inference using Hugging Face transformers on Ascend NPUs. Requirements, tested with MindSpore 2.7 + CANN 8.2:

mindspore==2.7.0
mindnlp==0.5.0rc4
transformers==4.57.1
tokenizers
einops
addict 
easydict
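Assuming CANN 8.2 and the Ascend drivers are already set up, the Python dependencies can typically be installed with pip:

pip install mindspore==2.7.0 mindnlp==0.5.0rc4 transformers==4.57.1 tokenizers einops addict easydict

With the dependencies in place, inference looks like this: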
import os
import mindnlp
import mindspore
from transformers import AutoModel, AutoTokenizer

model_name = 'lvyufeng/DeepSeek-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, dtype=mindspore.float16, _attn_implementation='sdpa', trust_remote_code=True, use_safetensors=True, device_map='auto')
model = model.eval()

# combine experts
model.combine_moe()

# prompt = "<image>\nFree OCR. "
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'your/output/dir'

# infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):

# Tiny: base_size = 512, image_size = 512, crop_mode = False
# Small: base_size = 640, image_size = 640, crop_mode = False
# Base: base_size = 1024, image_size = 1024, crop_mode = False
# Large: base_size = 1280, image_size = 1280, crop_mode = False

# Gundam: base_size = 1024, image_size = 640, crop_mode = True

res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path = output_path, base_size = 1024, image_size = 640, crop_mode=True, save_results = True, test_compress = True)
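For example, switching to the Tiny setting from the comments above only changes the size arguments:

res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path = output_path, base_size = 512, image_size = 512, crop_mode=False, save_results = True)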

PyTorch Usage

Inference using Hugging Face transformers on NVIDIA GPUs. Requirements, tested with Python 3.12.9 + CUDA 11.8:

torch
transformers==4.57.1
tokenizers
einops
addict 
easydict
pip install flash-attn
from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'lvyufeng/DeepSeek-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, dtype=torch.bfloat16, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True, device_map='auto')
model = model.eval()

# combine experts
model.combine_moe()

# prompt = "<image>\nFree OCR. "
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'your/output/dir'
# infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):
# Tiny: base_size = 512, image_size = 512, crop_mode = False
# Small: base_size = 640, image_size = 640, crop_mode = False
# Base: base_size = 1024, image_size = 1024, crop_mode = False
# Large: base_size = 1280, image_size = 1280, crop_mode = False
# Gundam: base_size = 1024, image_size = 640, crop_mode = True
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path = output_path, base_size = 1024, image_size = 640, crop_mode=True, save_results = True, test_compress = True)
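To process a folder of images with the same settings, the call can be wrapped in a loop; the input directory here is a placeholder:

import glob

results = []
for image_file in sorted(glob.glob('your/input/dir/*.jpg')):
    res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path = output_path, base_size = 1024, image_size = 640, crop_mode=True, save_results = False)
    results.append(res)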

Acknowledgement

We would like to thank Vary, GOT-OCR2.0, MinerU, PaddleOCR, OneChart, and Slow Perception for their valuable models and ideas.

We also appreciate the benchmarks Fox and OmniDocBench.
