# DeepEncoder

This is the encoder component of DeepSeek-OCR, containing:

- **sam_encoder.pth**: SAM ViT-B encoder for high-resolution feature extraction
- **clip_encoder.pth**: CLIP-L encoder for semantic feature extraction
- **projector.pth**: Linear projector layer

## Architecture

The DeepEncoder processes images through:

1. SAM encoder: extracts high-resolution visual features with window attention
2. CLIP encoder: extracts semantic features with global attention, using the SAM features as input
3. Projector: projects the concatenated features to the decoder dimension (1280)

## Usage

```python
import torch
from addict import Dict

from deepencoder.sam_vary_sdpa import build_sam_vit_b
from deepencoder.clip_sdpa import build_clip_l
from deepencoder.build_linear import MlpProjector

# Build models
sam_model = build_sam_vit_b()
vision_model = build_clip_l()
projector = MlpProjector(Dict(projector_type="linear", input_dim=2048, n_embed=1280))

# Load weights
sam_model.load_state_dict(torch.load("sam_encoder.pth"))
vision_model.load_state_dict(torch.load("clip_encoder.pth"))
projector.load_state_dict(torch.load("projector.pth"))

# Process image
# `image` is a preprocessed input tensor of shape [B, 3, H, W]
# (see the preprocessing sketch at the end of this README)
with torch.no_grad():
    sam_features = sam_model(image)                    # [B, 1024, H/16, W/16]
    clip_features = vision_model(image, sam_features)  # [B, N, 1024]

    # Concatenate features: drop the CLS token from CLIP, flatten the SAM feature map
    combined_features = torch.cat(
        (clip_features[:, 1:], sam_features.flatten(2).permute(0, 2, 1)), dim=-1
    )  # [B, N, 2048]

    # Project to decoder dimension
    vision_embeddings = projector(combined_features)   # [B, N, 1280]
```

## Source

Extracted from [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
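
## Preprocessing (sketch)

The usage example above assumes `image` is an already-preprocessed tensor. Below is a minimal sketch of one way to build it; the 1024x1024 resolution, the normalization constants, and the file name `page.png` are placeholder assumptions, not values taken from the DeepSeek-OCR preprocessing code, so verify them against the upstream repository before use.

```python
from PIL import Image
from torchvision import transforms

# Placeholder preprocessing -- the target resolution and normalization
# statistics below are assumptions, not the values used by DeepSeek-OCR.
preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

# "page.png" stands in for any document image
image = preprocess(Image.open("page.png").convert("RGB")).unsqueeze(0)  # [1, 3, 1024, 1024]
```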