🌟 GitHub | 📥 Model Download | 📄 Paper Link | 📄 arXiv Paper Link
DeepSeek-OCR: Contexts Optical Compression
Explore the boundaries of visual-text compression.
The official release of DeepSeek-OCR pins the transformers version to 4.46.3 and has not been adapted to newer releases. This community edition therefore modifies the modeling.py module so the model can be used without downgrading transformers. It has also been adapted for MindSpore + MindNLP, and users are welcome to run it on Ascend hardware.
Feel free to opt for various attention implementations such as Flash Attention or SDPA to leverage the latest optimizations in transformers for a performance boost.
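For example, the backend can be selected with the _attn_implementation argument when loading the model (a minimal sketch that mirrors the usage sections below; the repository id is the one used throughout this card):

from transformers import AutoModel

# choose 'sdpa' or 'flash_attention_2' (the latter requires flash-attn to be installed)
model = AutoModel.from_pretrained('lvyufeng/DeepSeek-OCR', _attn_implementation='sdpa', trust_remote_code=True)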
Combined MoE
In Transformer-based Mixture-of-Experts (MoE) models, the conventional approach relies on an MoE gating module to select experts, followed by processing hidden states through iterative loops. This often results in host-bound comparisons, which can significantly slow down token generation—especially on Ascend hardware.
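For context, the conventional dispatch pattern looks roughly like the sketch below (a simplified illustration, not the actual DeepSeek-V2 code): the per-expert Python loop and the token-index comparisons run on the host and stall the device between experts.

import torch

def naive_moe_dispatch(hidden_states, experts, selected_experts, routing_weights):
    # hidden_states: [tokens, hidden]; selected_experts / routing_weights: [tokens, top_k]
    output = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        # which tokens picked this expert -- a host-bound comparison per expert
        token_idx, k_idx = torch.where(selected_experts == expert_idx)
        if token_idx.numel() == 0:
            continue
        expert_out = expert(hidden_states[token_idx])
        output.index_add_(0, token_idx, expert_out * routing_weights[token_idx, k_idx].unsqueeze(-1))
    return output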
To address this, we consolidate each MoE layer into three stacked weight matrices (gate_proj, up_proj, and down_proj, stored as w1, w3, and w2), so all routed experts run as batched matrix multiplications. This design is particularly suitable for smaller MoE models that fit entirely in memory. Below is the key implementation:
# combine the per-expert weights before inference (inside the model, where self.model.layers is available):
for layer in self.model.layers:
    if isinstance(layer.mlp, DeepseekV2MoE):
        moe_layer = layer.mlp
        # stack the transposed per-expert weights into three batched matrices
        moe_layer.w1 = nn.Parameter(torch.stack([moe_layer.experts[i].gate_proj.weight.T for i in range(moe_layer.config.n_routed_experts)]), requires_grad=False)
        moe_layer.w2 = nn.Parameter(torch.stack([moe_layer.experts[i].down_proj.weight.T for i in range(moe_layer.config.n_routed_experts)]), requires_grad=False)
        moe_layer.w3 = nn.Parameter(torch.stack([moe_layer.experts[i].up_proj.weight.T for i in range(moe_layer.config.n_routed_experts)]), requires_grad=False)
# patch the new forward method of DeepseekV2MoE
def new_forward_for_moe(self, hidden_states):
    batch_size, sequence_length, hidden_dim = hidden_states.shape
    selected_experts, routing_weights = self.gate(hidden_states)
    # dense routing scores: one column per routed expert
    router_scores = torch.zeros(size=(batch_size * sequence_length, self.config.n_routed_experts), device=hidden_states.device, dtype=hidden_states.dtype)
    # cast back to the input dtype
    routing_weights = routing_weights.to(hidden_states.dtype)
    router_scores = torch.scatter_add(router_scores, -1, selected_experts, routing_weights)
    hidden_states = hidden_states.view(-1, hidden_dim)
    if self.config.n_shared_experts is not None:
        shared_expert_output = self.shared_experts(hidden_states)
    # run all routed experts as batched matmuls over the stacked weights
    hidden_w1 = torch.matmul(hidden_states, self.w1)
    hidden_w3 = torch.matmul(hidden_states, self.w3)
    hidden_states = self.act(hidden_w1) * hidden_w3
    # weight each expert's output by its routing score, then reduce over experts
    hidden_states = torch.bmm(hidden_states, self.w2) * torch.transpose(router_scores, 0, 1).unsqueeze(-1)
    final_hidden_states = hidden_states.sum(dim=0, dtype=hidden_states.dtype)
    if self.config.n_shared_experts is not None:
        final_hidden_states = final_hidden_states + shared_expert_output
    return final_hidden_states.view(batch_size, sequence_length, hidden_dim)
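The patched forward then needs to be bound to each MoE module. A minimal sketch of one way to do this (in this community edition the weight stacking and the patch are wrapped by model.combine_moe(), used in the usage sections below):

import types

for layer in self.model.layers:
    if isinstance(layer.mlp, DeepseekV2MoE):
        # replace this instance's forward with the combined-weight version
        layer.mlp.forward = types.MethodType(new_forward_for_moe, layer.mlp)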
As a result, we observe a 3–4x speedup in OCR text generation with this combined-MoE forward, which makes the model considerably more practical for production use on Ascend hardware.
MindSpore Usage
Inference with Hugging Face transformers on Ascend NPUs. Requirements tested with MindSpore 2.7 + CANN 8.2:
mindspore==2.7.0
mindnlp==0.5.0rc4
transformers==4.57.1
tokenizers
einops
addict
easydict
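These can be installed in one step, for example (package names assumed to match the PyPI names):
pip install mindspore==2.7.0 mindnlp==0.5.0rc4 transformers==4.57.1 tokenizers einops addict easydict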
import os
import mindnlp
import mindspore
from transformers import AutoModel, AutoTokenizer
model_name = 'lvyufeng/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, dtype=mindspore.float16, _attn_implementation='sdpa', trust_remote_code=True, use_safetensors=True, device_map='auto')
model = model.eval()
# combine experts
model.combine_moe()
# prompt = "<image>\nFree OCR. "
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'your/output/dir'
# infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):
# Tiny: base_size = 512, image_size = 512, crop_mode = False
# Small: base_size = 640, image_size = 640, crop_mode = False
# Base: base_size = 1024, image_size = 1024, crop_mode = False
# Large: base_size = 1280, image_size = 1280, crop_mode = False
# Gundam: base_size = 1024, image_size = 640, crop_mode = True
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path, base_size=1024, image_size=640, crop_mode=True, save_results=True, test_compress=True)
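To run one of the other presets listed above, only the size arguments change. For example, the Small preset (reusing the objects defined above):

# Small preset: a single 640x640 view, no cropping
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path, base_size=640, image_size=640, crop_mode=False, save_results=True)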
PyTorch Usage
Inference with Hugging Face transformers on NVIDIA GPUs. Requirements tested with Python 3.12.9 + CUDA 11.8:
torch
transformers==4.57.1
tokenizers
einops
addict
easydict
pip install flash-attn
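The remaining requirements above can be installed the same way, for example:
pip install torch transformers==4.57.1 tokenizers einops addict easydict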
from transformers import AutoModel, AutoTokenizer
import torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'lvyufeng/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, dtype=torch.bfloat16, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True, device_map='auto')
model = model.eval()
# combine experts
model.combine_moe()
# prompt = "<image>\nFree OCR. "
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'your/output/dir'
# infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):
# Tiny: base_size = 512, image_size = 512, crop_mode = False
# Small: base_size = 640, image_size = 640, crop_mode = False
# Base: base_size = 1024, image_size = 1024, crop_mode = False
# Large: base_size = 1280, image_size = 1280, crop_mode = False
# Gundam: base_size = 1024, image_size = 640, crop_mode = True
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path, base_size=1024, image_size=640, crop_mode=True, save_results=True, test_compress=True)
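For plain text extraction without layout grounding or markdown conversion, the commented-out prompt above can be used instead, with the same call:

prompt = "<image>\nFree OCR. "
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path, base_size=1024, image_size=640, crop_mode=True, save_results=True)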
Acknowledgement
We would like to thank Vary, GOT-OCR2.0, MinerU, PaddleOCR, OneChart, and Slow Perception for their valuable models and ideas.
We also appreciate the benchmarks Fox and OmniDocBench.