16 - 24B models with FP8 quantization
Hello DeepSeek team,
Your models are great. I wish I could try them locally, but they are too big for me. I think most people, like me, want to try your models locally.
Could you please release smaller models, such as 16B or 24B with FP8 quantization? Many users like me can build a PC with 32 GB of VRAM, so 16-24B FP8 models are a reasonable fit.
Many thanks.
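For what it's worth, the sizing in the request above can be sanity-checked with rough arithmetic. The sketch below assumes FP8 stores one byte per parameter and BF16 two, and it counts weights only: KV cache, activations, and runtime overhead are ignored, so real headroom will be smaller. The 16B and 24B sizes are the hypothetical ones from the post, not released models, and the function names are made up for illustration.

```python
# Rough weight-only memory estimate. Assumptions: 1 GB = 1e9 bytes,
# FP8 = 1 byte per parameter, BF16 = 2 bytes per parameter.
# KV cache, activations, and framework overhead are NOT included.

VRAM_GB = 32.0  # the VRAM budget mentioned in the post


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GB."""
    return params_billion * bytes_per_param


def headroom_gb(params_billion: float, bytes_per_param: float,
                vram_gb: float = VRAM_GB) -> float:
    """VRAM left after loading the weights (negative means they do not fit)."""
    return vram_gb - weight_memory_gb(params_billion, bytes_per_param)


if __name__ == "__main__":
    for size in (16, 24):  # hypothetical model sizes, in billions of parameters
        for name, nbytes in (("FP8", 1), ("BF16", 2)):
            print(f"{size}B {name}: weights ~{weight_memory_gb(size, nbytes):.0f} GB, "
                  f"headroom ~{headroom_gb(size, nbytes):.0f} GB of {VRAM_GB:.0f} GB")
```

Under these assumptions a 24B FP8 model leaves roughly 8 GB for everything else, while the same model in BF16 would not fit in 32 GB at all, which is presumably why the request asks for FP8 specifically.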
I'm so active
It won't be any better than Gemma 4 31b, so why?
@Tikhonum Gemma 4 31b is unquestionably the best small model, easily outperforming Qwen3.5 across most tasks. However, Google deliberately crippled Gemma, so in theory DeepSeek could easily make a comparably sized model that's far more usable to the general population despite having slightly lower scores on standardized STEM-focused tests.
That is, Google's primary revenue stream comes from search/ads, which is why Gemini is the most generally capable AI model with the broadest knowledge (e.g. highest SimpleQA score). Because of this, Google doesn't want local models to be generally capable, since that would reduce online users and hence its profits. Consequently, Google deliberately designed Gemma to hallucinate like crazy when it comes to what most people in the general population care about.
This has nothing to do with Gemma 4's relatively small size (31b), since much smaller earlier models like Llama 3.1 8b, including Google's own Gemma 2 9b, have significantly broader knowledge.
Anyways, Google made a model that is good at coding, math, STEM, writing stories, and so on, but not good enough to justify the performance drop relative to the proprietary models (e.g. you're better off paying a few bucks for a proprietary coding model than wasting hours fixing the sub-par coding of Gemma). It then made Gemma little more than a hallucination generator when it comes to what most people in the general population care about. By doing both, Google not only protects its business model, but also pushes out potential competition from DeepSeek and others who can't match the performance in STEM-focused domains without also crippling their general performance. So in the end there's a bunch of generally useless OS models.
Still, kudos to Google for releasing the best performing small OS model. But its release was strategic, not altruistic.
That sounds reasonable. Fortunately, Qwen looks like the king of small open models, especially the recent Qwen3.6.
I think general users will try many models to choose the best one. Gemma 4 will be forgotten soon if it isn't as good as expected.
In fact, models from China are the best in all categories, such as text, image, video, embedding, reranker, etc. So I hope DeepSeek will also help the community with its best models.
@Duonglv Models from China are NOT the best in all categories, and that's not my personal opinion.
For example, across the board on https://arena.ai Gemma 4 handily outperforms Qwen 3.5 in every tested category, and Qwen 3.6 wasn't even evaluated because it's the exact same model, just grossly overtrained for select domains, hence it performs generally worse than Qwen 3.5.
And in my testing, even though Qwen 3.5 is a notable improvement over Qwen 3, Gemma 4 easily outperforms it in virtually every category. Plus, professional institutions using exhaustive and complex hidden tests came to the same conclusions as LMsys and myself. For example, the Center for AI Standards and Innovation (CAISI) found DeepSeek v4 is the most powerful Chinese model, but is still well behind the latest models from OpenAI and Anthropic.
For the life of me I can't figure out the fanboy obsession the coding-obsessed, early-adopting community has for Qwen models. Even when a clearly superior model family is released (Gemma 4), they still claim the infinite-looping, token-burning, and mistake-prone Qwen 3.5 family is better.