DeepSeek V4 and Gemma 4 are not in the same class for local deployment. With Gemma 4, it still makes sense to discuss how to run 26B or 31B models on 24GB or 32GB GPUs. DeepSeek V4 is a huge MoE model, and full local deployment quickly moves into multi-GPU workstation or server territory.
The official DeepSeek V4 Preview release mainly includes two inference models:
- DeepSeek-V4-Pro: 1.6T total / 49B active params
- DeepSeek-V4-Flash: 284B total / 13B active params
The official Hugging Face collection also includes two Base models:
- DeepSeek-V4-Pro-Base
- DeepSeek-V4-Flash-Base
This article only discusses rough VRAM requirements when the full model weights are loaded.
For MoE models, the active parameter count mainly affects per-token compute. It does not mean only those parameters need to be loaded.
Without expert-on-demand loading, CPU/NVMe offload, distributed inference, or specialized runtime optimizations, VRAM should still be estimated from the full weight size.
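To make that gap concrete, here is a minimal back-of-the-envelope sketch. The bit-widths are assumptions for illustration; the officially shipped files may use mixed precision, which is why the real sizes listed in the next section differ somewhat.

```python
# Minimal sketch: why "active params" is not a VRAM estimate for an MoE model.
# Parameter counts are the publicly stated figures; bit-widths are assumptions.

def weight_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight footprint in decimal GB for a given precision."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# DeepSeek-V4-Flash: 284B total / 13B active
total_b, active_b = 284, 13

for bits in (8, 4):  # e.g. FP8 and a hypothetical 4-bit quantization
    full = weight_size_gb(total_b, bits)
    active_only = weight_size_gb(active_b, bits)
    print(f"{bits}-bit: full weights ~{full:.0f} GB, active-only ~{active_only:.0f} GB")

# Without expert-on-demand loading or offload, plan around the "full weights" number,
# not the active-only one. Shipped checkpoints may use mixed precision, so official
# file sizes will not match these round numbers exactly.
```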
Quick Summary
| VRAM Scale | What Is Realistic | Do Not Expect |
|---|---|---|
| 24GB | Cannot fully run DeepSeek V4; use smaller distilled models or API | Full V4-Flash / V4-Pro local loading |
| 48GB | Still not suitable for full loading; good for small models or remote API clients | Stable V4-Flash Q4 |
| 80GB | Theoretically try V4-Flash Q2/Q3 or heavy offload | V4-Pro |
| 128GB | V4-Flash Q4 becomes more realistic; Q5/Q6 still tight | V4-Pro Q4 |
| 192GB | V4-Flash FP8/Q6 is more comfortable; Pro Q2 enters experimental range | V4-Pro Q4 |
| 256GB | V4-Flash FP8 is fairly comfortable; Pro Q2/Q3 can be tested | V4-Pro Q5 and above |
| 512GB | V4-Pro Q4 becomes worth discussing | V4-Pro FP8 |
| 1TB+ | V4-Pro FP8 and low-bit Pro-Base are more realistic | Low-cost single-machine deployment |
| 2TB+ | Pro-Base FP8 class | Ordinary workstation deployment |
If your goal is to run a model on a personal computer, DeepSeek V4 is not the right target. More realistic options are:
- Use the official DeepSeek API or compatible services (see the client sketch after this list).
- Wait for stable community GGUF/EXL2/MLX quantizations and inference support.
- Use smaller DeepSeek distilled models.
- Use local models in the 7B to 70B range from Qwen, Gemma, Llama, and similar families.
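If you take the API route, the call itself is simple. The sketch below uses the OpenAI-compatible Python client; the model alias shown is the current general-purpose one and is only a placeholder here, since whichever alias ends up pointing at V4 may differ. Check the official docs for the current name.

```python
# Minimal sketch of calling DeepSeek through its OpenAI-compatible API.
# "deepseek-chat" is a placeholder alias; verify the current model name
# in the official documentation before relying on it.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # issued by the DeepSeek platform console
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",  # placeholder alias, not necessarily the V4 endpoint
    messages=[{"role": "user", "content": "Summarize MoE inference trade-offs in one paragraph."}],
)
print(response.choices[0].message.content)
```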
Official Weight Sizes
The following figures come from model.safetensors.index.json in the official Hugging Face repositories.
They reflect current public weight file sizes, not full runtime VRAM use under long context.
| Model | Parameter Scale | Official Weight Size | Notes |
|---|---|---|---|
| DeepSeek-V4-Flash | 284B total / 13B active | 159.61GB | Inference model, smallest in this group |
| DeepSeek-V4-Pro | 1.6T total / 49B active | 864.70GB | Inference model, stronger but enormous |
| DeepSeek-V4-Flash-Base | 284B total | 294.67GB | Base model, closer to full FP8 weight size |
| DeepSeek-V4-Pro-Base | 1.6T total | 1606.03GB | Base model, about 1.6TB |
Even the smallest V4-Flash is already close to 160GB of official weights.
That is why it should not be treated like a 13B model just because it has 13B active params.
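These sizes can be checked without downloading any weights: the model.safetensors.index.json mentioned above records the summed shard size. A minimal sketch, assuming you have already fetched the index file from the repo:

```python
# Minimal sketch: read the official weight size from model.safetensors.index.json.
import json

def official_weight_size_gb(index_path: str) -> float:
    """Rough weight size in decimal GB from a sharded safetensors index file."""
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    # Sharded checkpoints exported by transformers normally record the summed
    # shard size here; if the field is missing, sum the sizes of the shard files
    # listed under "weight_map" instead.
    total_bytes = index["metadata"]["total_size"]
    return total_bytes / 1e9  # use 1024**3 instead if you prefer GiB

print(f"{official_weight_size_gb('model.safetensors.index.json'):.2f} GB")
```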
DeepSeek V4 Flash VRAM Estimate
V4-Flash is the most approachable DeepSeek V4 variant for local experiments.
But that only means “more approachable than Pro”; it is still not a consumer single-GPU model.
The table below uses the official 159.61GB weight size as the baseline. Q4/Q3/Q2 rows are bit-width estimates and do not imply that stable official GGUF versions currently exist.
| Version / Quantization | Estimated Weight Size | Minimum VRAM | Safer VRAM | Best For |
|---|---|---|---|---|
| FP8 / official weights | 159.61GB | 192GB | 256GB | Multi-GPU servers, inference service |
| Q6 | 120GB | 160GB | 192GB | Quality-first quantization tests |
| Q5 | 100GB | 128GB | 160GB | Quality/size balance |
| Q4 | 80GB | 96GB | 128GB | More realistic starting point for Flash |
| Q3 | 60GB | 80GB | 96GB | Large-VRAM single GPU or multi-GPU tests |
| Q2 | 40GB | 48GB | 64GB | Extreme low-bit experiments with clear quality risk |
If mature V4-Flash Q4 builds appear later, it still probably will not be a 24GB GPU model.
A more realistic starting point is 96GB to 128GB total VRAM, or CPU/offload setups that trade speed for capacity.
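The quantized rows in the table above are simple bit-width scalings of the 159.61GB official size; the "Minimum VRAM" column then rounds up to the next common GPU configuration and leaves headroom for KV cache, activations, and runtime overhead. A minimal sketch of that arithmetic:

```python
# Minimal sketch of the arithmetic behind the V4-Flash table:
# each quantized row is the official weight size scaled by bits/8.
FLASH_OFFICIAL_GB = 159.61  # official weight files, treated as the 8-bit baseline

def estimated_weight_gb(bits: int, baseline_gb: float = FLASH_OFFICIAL_GB) -> float:
    return baseline_gb * bits / 8

for bits in (6, 5, 4, 3, 2):
    # Weights only; add headroom for KV cache and runtime overhead on top of this.
    print(f"Q{bits}: ~{estimated_weight_gb(bits):.0f} GB of weights")
```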
DeepSeek V4 Pro VRAM Estimate
V4-Pro is the flagship inference model, with official weights around 864.70GB.
Even at 4-bit quantization, the full weights remain in the hundreds of GB.
| Version / Quantization | Estimated Weight Size | Minimum VRAM | Safer VRAM | Best For |
|---|---|---|---|---|
| FP8 / official weights | 864.70GB | 1TB | 1.2TB+ | Multi-node or multi-GPU inference service |
| Q6 | 648GB | 768GB | 1TB | High-quality quantized service |
| Q5 | 540GB | 640GB | 768GB | Quality/cost balance |
| Q4 | 432GB | 512GB | 640GB | Lowest practical quality line for Pro |
| Q3 | 324GB | 384GB | 512GB | Low-bit experiments |
| Q2 | 216GB | 256GB | 320GB | Extreme experiments with high quality and stability risk |
For individual users, V4-Pro is better consumed through an API.
If the goal is full local deployment, treat it as a multi-GPU server model, not a 4090, 5090, or RTX PRO single-GPU model.
DeepSeek V4 Flash-Base VRAM Estimate
Base models are usually for research, fine-tuning, or continued training, not ordinary chat deployment.
V4-Flash-Base has official weights of about 294.67GB.
| Version / Quantization | Estimated Weight Size | Minimum VRAM | Safer VRAM | Best For |
|---|---|---|---|---|
| FP8 / official weights | 294.67GB | 384GB | 512GB | Research, preprocessing, evaluation |
| Q6 | 221GB | 256GB | 320GB | High-quality quantization research |
| Q5 | 184GB | 224GB | 256GB | Quality/size balance |
| Q4 | 147GB | 192GB | 224GB | Lower-cost Base experiments |
| Q3 | 111GB | 128GB | 160GB | Low-bit experiments |
| Q2 | 74GB | 96GB | 128GB | Extreme experiments |
If you only want to use DeepSeek V4 capabilities, do not start with the Base model. Base models cost more to deploy and tune; most applications should use the inference model or API.
DeepSeek V4 Pro-Base VRAM Estimate
V4-Pro-Base is the heaviest variant, with official weights around 1606.03GB.
That is already a 1.6TB-class model file.
| Version / Quantization | Estimated Weight Size | Minimum VRAM | Safer VRAM | Best For |
|---|---|---|---|---|
| FP8 / official weights | 1606.03GB | 2TB | 2.4TB+ | Large-scale research clusters |
| Q6 | 1205GB | 1.5TB | 2TB | High-quality quantization research |
| Q5 | 1004GB | 1.2TB | 1.5TB | Research and evaluation |
| Q4 | 803GB | 1TB | 1.2TB | Low-bit research |
| Q3 | 602GB | 768GB | 1TB | Extreme low-bit research |
| Q2 | 402GB | 512GB | 640GB | Extreme experiments |
This kind of model should not be discussed in the framework of “can a home GPU run it?” Even Q4 is already beyond the comfortable range of most single-machine workstations.
Why Active Params Are Not Enough
DeepSeek V4 is an MoE model. MoE means each token activates only part of the experts, so compute is much lower than the total parameter count. But this does not mean VRAM only needs to hold the active parameters.
Full local inference also depends on:
- Whether all expert weights must stay resident on GPU.
- Whether on-demand expert loading is supported.
- CPU memory to GPU memory transfer costs.
- NVMe offload latency.
- KV cache growth under long context.
- Extra runtime overhead under 1M context.
- Multi-node and multi-GPU communication cost.
So V4-Pro with 49B active should not be deployed like a 49B model.
V4-Flash with 13B active should not be treated like a 13B small model either.
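To make the KV cache point concrete, here is a rough sketch of how the cache grows with context length for a plain multi-head attention layout. The layer, head, and dimension values are placeholders, not DeepSeek V4's real configuration; DeepSeek's recent architectures use MLA-style compressed caches, so the actual numbers will differ. The point is only that the term grows linearly with sequence length and becomes large at 1M context.

```python
# Minimal sketch: KV cache growth with context length for a plain multi-head
# attention layout. Architecture numbers are placeholders, NOT DeepSeek V4's
# real configuration; the linear growth with sequence length is the point.
def kv_cache_gb(seq_len: int, layers: int = 60, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    # 2x for keys and values, per token, per layer
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len / 1e9

for ctx in (8_000, 128_000, 1_000_000):
    print(f"{ctx:>9} tokens -> ~{kv_cache_gb(ctx):.1f} GB KV cache (single sequence)")
```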
How to Choose
If you are an ordinary individual user:
- Do not try to fully self-host DeepSeek V4.
- Use the official API when you need DeepSeek V4 capabilities.
- For private local deployment, first check whether you have mature inference infrastructure or internal multi-GPU servers.
- With only 24GB to 48GB VRAM, 7B, 14B, 32B, or 70B quantized models are more practical.
If you have 128GB to 256GB total VRAM:
- Watch for stable community implementations of V4-Flash Q4/Q5.
- Do not treat V4-Pro as your main local model.
If you have 512GB+ total VRAM:
- V4-Pro Q4 starts to become an engineering validation target.
- You still need to care about inference framework support, expert scheduling, KV cache, throughput, and concurrency.
The key question for DeepSeek V4 local deployment is not “which quantized file should I download?” It is “do I have the system-level inference capacity for this model?” It is closer to a server model than a desktop model.