VRAM on KnightLi Blog

Running Qwen3.6 Locally: VRAM Requirements for 27B and 35B-A3B Quantized Models

Fri, 01 May 2026 12:02:00 +0800

The Qwen3.6 open-weight models that are most relevant for local deployment are mainly:

Qwen3.6-27B: a 27B dense model.
Qwen3.6-35B-A3B: a 35B total / 3B active MoE model.

There are also online product or API model names such as Qwen3.6-Plus and Qwen3.6-Max. If a model does not have public full weights and stable quantized files, it does not belong in a local VRAM table. This article only covers versions that can be deployed from public Hugging Face weights and GGUF quantized files.

As with the Gemma 4 table in /05/10, two concepts need to be separated first:

GGUF file size: how large the model weight file is.
Actual VRAM usage: affected by weights, KV cache, context length, runtime backend, multimodal modules, and batch size.

Qwen3.6 has a very long default context. The official model card states native support for 262,144 tokens and extension to 1,010,000 tokens. So the “minimum VRAM” column below only applies to short or medium context. If you really want 128K, 256K, or longer context, reserve much more room for KV cache.

Quick Summary

VRAM	Good Fit	Avoid
8GB	Extreme 2-bit tests for 27B / 35B-A3B with partial CPU offload (the smallest GGUF files are 9.39GB and 10.76GB), with clear quality risk	Q4 and above
12GB	27B Q2/Q3, 35B-A3B Q2/Q3 with short context	27B Q4 with long context
16GB	27B Q3/Q4, 35B-A3B Q3/IQ4_XS	35B-A3B Q4 with long context
24GB	27B Q4/Q5/Q6, 35B-A3B Q4	35B-A3B Q8, BF16
32GB	27B Q8, 35B-A3B Q5/Q6	BF16
48GB	35B-A3B Q8, 27B with longer context more comfortably	35B-A3B BF16
80GB+	27B / 35B-A3B BF16	No need to chase BF16 for ordinary local chat

If you have a 24GB GPU, focus on:

Qwen3.6-27B Q4_K_M
Qwen3.6-27B Q5_K_M
Qwen3.6-35B-A3B UD-Q4_K_M

If you only have 16GB VRAM, start with low-bit variants and do not enable very long context right away.

Official Weight Sizes

The following BF16 weight sizes come from model.safetensors.index.json in the official Hugging Face repositories. They are useful as a reference for the original model scale.

Model	Architecture	Official BF16 Weight Size	Official Context
`Qwen3.6-27B`	27B dense	55.56GB	Native 262K, extendable to 1,010K
`Qwen3.6-35B-A3B`	35B total / 3B active MoE	71.90GB	Native 262K, extendable to 1,010K

Although 35B-A3B activates only about 3B parameters per step, it still needs to load the full MoE weights, so its VRAM should not be estimated as if it were a small 3B model.

Qwen3.6-27B VRAM Table

Qwen3.6-27B is a dense model. Its advantage is stable behavior, and its inference cost is that of a traditional 27B model, because every parameter is used for every token. For local deployment it is more compute-heavy than 35B-A3B, but its VRAM requirements are easier to estimate.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	9.39GB	12GB	16GB	Extreme low-VRAM tests
`UD-IQ2_M`	10.85GB	12GB	16GB	Low-VRAM usability
`UD-Q2_K_XL`	11.85GB	14GB	18GB	Low-bit compromise
`UD-IQ3_XXS`	11.99GB	14GB	18GB	VRAM-saving 3-bit
`Q3_K_S`	12.36GB	16GB	20GB	3-bit entry point
`Q3_K_M`	13.59GB	16GB	20GB	Common 3-bit compromise
`IQ4_XS`	15.44GB	20GB	24GB	Near-Q4, more VRAM efficient
`IQ4_NL`	16.07GB	20GB	24GB	Quality/size balance
`Q4_K_M`	16.82GB	20GB	24GB	Recommended 27B default
`Q5_K_M`	19.51GB	24GB	32GB	Higher-quality quantization
`Q6_K`	22.52GB	28GB	32GB	Quality first
`Q8_0`	28.60GB	32GB	40GB	Near-original precision
`BF16`	53.80GB	64GB	80GB	Research, evaluation, precision comparison

For ordinary local coding and chat, Q4_K_M is the easiest starting point to recommend. A 24GB GPU can run Q4_K_M fairly comfortably, but for long context, drop to a smaller quantization or shorten the context.
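To make the quantization names in the table more concrete, the file sizes can be converted into an average number of bits stored per weight. This is a minimal sketch that assumes "27B" means exactly 27e9 parameters and that 1GB is 1e9 bytes; both are simplifications, so the results are approximate.

```python
# Assumptions (not from the article): "27B" is exactly 27e9 parameters,
# and 1 GB is 1e9 bytes.
PARAMS_27B = 27e9

# GGUF file sizes in GB, copied from the Qwen3.6-27B table above.
GGUF_SIZES_GB = {
    "UD-IQ2_M": 10.85,
    "Q3_K_M": 13.59,
    "Q4_K_M": 16.82,
    "Q5_K_M": 19.51,
    "Q6_K": 22.52,
    "Q8_0": 28.60,
}


def bits_per_weight(file_size_gb: float, n_params: float) -> float:
    """Average number of bits the file stores per model parameter."""
    return file_size_gb * 1e9 * 8 / n_params


for name, size_gb in GGUF_SIZES_GB.items():
    bpw = bits_per_weight(size_gb, PARAMS_27B)
    print(f"{name:10s} {size_gb:6.2f} GB  ~{bpw:.2f} bits/weight")
```

Under these assumptions Q4_K_M works out to roughly 5 bits per weight rather than exactly 4, which is one reason a "4-bit" file is larger than a naive 27B x 0.5 byte estimate.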

Qwen3.6-35B-A3B VRAM Table

Qwen3.6-35B-A3B is an MoE model with 35B total parameters and about 3B active parameters per step. Its advantage is a strong balance between speed and capability, especially for local agents, tool use, and coding workflows.

Note, however, that the 3B active parameter count mainly affects compute. It does not mean VRAM usage is comparable to a 3B model: running the full model still requires all expert weights to be loaded.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	10.76GB	12GB	16GB	Extreme low-VRAM tests
`UD-IQ2_M`	11.52GB	14GB	16GB	Low-VRAM usability
`UD-Q2_K_XL`	12.29GB	14GB	18GB	Low-bit compromise
`UD-IQ3_XXS`	13.21GB	16GB	20GB	VRAM-saving 3-bit
`UD-Q3_K_S`	15.36GB	18GB	24GB	3-bit entry point
`UD-Q3_K_M`	16.60GB	20GB	24GB	Common 3-bit compromise
`UD-IQ4_XS`	17.73GB	20GB	24GB	Quality/size balance
`UD-IQ4_NL`	18.04GB	20GB	24GB	Near-Q4 recommended option
`UD-Q4_K_M`	22.13GB	24GB	32GB	Recommended 35B-A3B default
`UD-Q5_K_M`	26.46GB	32GB	40GB	Higher-quality quantization
`UD-Q6_K`	29.31GB	32GB	48GB	Quality first
`Q8_0`	36.90GB	48GB	64GB	Near-original precision
`BF16`	69.37GB	80GB	96GB	Research, evaluation, precision comparison

With 24GB VRAM, UD-Q4_K_M is a key option, but do not set the context too high. If you want room for 128K+ context, UD-IQ4_XS, UD-IQ4_NL, or 3-bit versions are more realistic.
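The trade-off between quantization size and context room can be sketched as simple subtraction: whatever VRAM the weight file does not occupy is what remains for KV cache and runtime overhead. The `reserve_gb` parameter below is a hypothetical knob for illustration; the article does not give a specific reserve figure.

```python
# GGUF file sizes in GB, copied from the Qwen3.6-35B-A3B table above.
QUANTS_GB = {
    "UD-IQ3_XXS": 13.21,
    "UD-Q3_K_M": 16.60,
    "UD-IQ4_XS": 17.73,
    "UD-IQ4_NL": 18.04,
    "UD-Q4_K_M": 22.13,
}


def headroom_gb(vram_gb: float, file_size_gb: float) -> float:
    """VRAM left after the weights, before KV cache and runtime overhead."""
    return vram_gb - file_size_gb


def largest_fit(vram_gb: float, reserve_gb: float):
    """Largest quantization that still leaves reserve_gb free, or None.

    reserve_gb is a hypothetical budget for KV cache and overhead.
    """
    fitting = {
        name: size
        for name, size in QUANTS_GB.items()
        if headroom_gb(vram_gb, size) >= reserve_gb
    }
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for reserve in (1.5, 4.0, 8.0):
    print(f"24GB GPU, reserve {reserve}GB -> {largest_fit(24, reserve)}")
```

With a small reserve the 24GB card lands on UD-Q4_K_M; as the reserve grows, the choice moves down to UD-IQ4_NL and then to a 3-bit file, which mirrors the advice above.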

27B vs 35B-A3B

Need	Better Choice
Stable dense-model behavior	`Qwen3.6-27B`
Faster response, agents, and tool use	`Qwen3.6-35B-A3B`
Daily local use on 24GB VRAM	`35B-A3B UD-Q4_K_M` or `27B Q4_K_M`
Testing on 16GB VRAM	Use 2-bit/3-bit for both; avoid long context
Long context first	Use lower-bit quantization and leave more KV cache room
Quality first with 32GB+ VRAM	`27B Q5/Q6` or `35B-A3B Q5/Q6`

If you mainly write code, run agents, or use tools, 35B-A3B is worth trying first. If you care more about dense-model stability and consistency, 27B is more straightforward.

Why Long Context Uses So Much VRAM

The Qwen3.6 model card recommends keeping longer context for complex tasks and even notes that 128K+ context can help reasoning. But for local deployment, long context means a much larger KV cache.

Actual VRAM usage is affected by:

KV cache: longer context means higher usage.
Whether vision input is enabled: Qwen3.6 includes a vision encoder, and multimodal use adds overhead.
Whether --language-model-only is used: in runtimes such as vLLM, skipping vision can free memory for KV cache.
Batch size and concurrency: more concurrency requires more VRAM.
KV cache quantization: q8_0, q4_0, and similar settings can save VRAM, but may affect details.
Runtime differences: llama.cpp, vLLM, SGLang, KTransformers, and LM Studio do not use exactly the same amount of memory.

So do not look only at GGUF file size. If the file is already close to the VRAM limit, the model may load but still OOM when generating long outputs or using long context.
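A rough sketch of why context length dominates: for a standard attention KV cache, memory grows linearly with the number of cached tokens. The layer count, KV head count, and head dimension below are placeholder values chosen to give round numbers; they are not Qwen3.6's actual configuration, and the real architecture and runtime may cache a different amount per token.

```python
def kv_cache_bytes(
    n_cached_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_element: float = 2.0,  # 2 bytes assumes an f16 cache
    n_sequences: int = 1,
) -> float:
    """Estimate KV cache size for a standard attention cache.

    The leading factor of 2 accounts for storing both keys and values.
    """
    per_token = 2 * n_cached_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens * n_sequences


# Placeholder architecture values (assumptions, NOT Qwen3.6's real config).
LAYERS, KV_HEADS, HEAD_DIM = 16, 4, 256

for ctx in (8_192, 32_768, 131_072, 262_144):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 1024**3
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB of KV cache")
```

Whatever the real per-token cost is, the linear scaling is the point: going from 8K to 256K context multiplies the cache by 32, which is why a file that loads fine can still run out of memory later.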

How to Choose

If you just want to try Qwen3.6 locally:

12GB VRAM: try 27B UD-IQ2_M or 35B-A3B UD-IQ2_M, with short context.
16GB VRAM: try 27B Q3_K_M or 35B-A3B UD-IQ3_XXS.
24GB VRAM: prefer 27B Q4_K_M, 35B-A3B UD-IQ4_NL, or 35B-A3B UD-Q4_K_M.
32GB VRAM: consider 27B Q5/Q6 or 35B-A3B Q5/Q6.
48GB and above: try Q8_0, or reserve more room for long context.

Most users do not need BF16. The point of local Qwen3.6 deployment is not to choose the largest file, but to balance VRAM, context length, speed, and output quality.

Running DeepSeek V4 Locally: VRAM Estimates for Pro, Flash, and Base Versions

Fri, 01 May 2026 11:55:25 +0800

DeepSeek V4 and Gemma 4 are not in the same class for local deployment. With Gemma 4, it still makes sense to discuss how to run 26B or 31B models on 24GB or 32GB GPUs. DeepSeek V4 is a huge MoE model, and full local deployment quickly moves into multi-GPU workstation or server territory.

The official DeepSeek V4 Preview release mainly includes two inference models:

DeepSeek-V4-Pro: 1.6T total / 49B active params
DeepSeek-V4-Flash: 284B total / 13B active params

The official Hugging Face collection also includes two Base models:

DeepSeek-V4-Pro-Base
DeepSeek-V4-Flash-Base

This article only discusses rough VRAM requirements when the full model weights are loaded. For MoE models, the active parameter count mainly affects per-token compute. It does not mean only those parameters need to be loaded. Without expert-on-demand loading, CPU/NVMe offload, distributed inference, or specialized runtime optimizations, VRAM should still be estimated from the full weight size.

Quick Summary

VRAM Scale	What Is Realistic	Do Not Expect
24GB	Cannot fully run DeepSeek V4; use smaller distilled models or API	Full V4-Flash / V4-Pro local loading
48GB	Still not suitable for full loading; good for small models or remote API clients	Stable V4-Flash Q4
80GB	Theoretically try V4-Flash Q2/Q3 or heavy offload	V4-Pro
128GB	V4-Flash Q4 becomes more realistic; Q5/Q6 still tight	V4-Pro Q4
192GB	V4-Flash FP8/Q6 is more comfortable; Pro Q2 enters experimental range	V4-Pro Q4
256GB	V4-Flash FP8 is fairly comfortable; Pro Q2/Q3 can be tested	V4-Pro Q5 and above
512GB	V4-Pro Q4 starts to become discussable	V4-Pro FP8
1TB+	V4-Pro FP8 and low-bit Pro-Base are more realistic	Low-cost single-machine deployment
2TB+	Pro-Base FP8 class	Ordinary workstation deployment

If your goal is to run a model on a personal computer, DeepSeek V4 is not the right target. More realistic options are:

Use the official DeepSeek API or compatible services.
Wait for stable community GGUF/EXL2/MLX quantizations and inference support.
Use smaller DeepSeek distilled models.
Use local models in the 7B to 70B range from Qwen, Gemma, Llama, and similar families.

Official Weight Sizes

The following figures come from model.safetensors.index.json in the official Hugging Face repositories. They reflect current public weight file sizes, not full runtime VRAM use under long context.

Model	Parameter Scale	Official Weight Size	Notes
`DeepSeek-V4-Flash`	284B total / 13B active	159.61GB	Inference model, smallest in this group
`DeepSeek-V4-Pro`	1.6T total / 49B active	864.70GB	Inference model, stronger but enormous
`DeepSeek-V4-Flash-Base`	284B total	294.67GB	Base model, closer to full FP8 weight size
`DeepSeek-V4-Pro-Base`	1.6T total	1606.03GB	Base model, about 1.6TB

Even the smallest V4-Flash is already close to 160GB of official weights. That is why it should not be treated like a 13B model just because it has 13B active params.
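One way to read the weight-size table is to divide each file size by the total parameter count and see how many bits per parameter the checkpoint implies. This sketch assumes 1GB is 1e9 bytes and takes the rounded parameter counts (284B, 1.6T) at face value, so the results are approximate.

```python
# (total parameters, official weight size in GB), copied from the table above.
# Assumption: parameter counts are the rounded marketing figures.
MODELS = {
    "DeepSeek-V4-Flash": (284e9, 159.61),
    "DeepSeek-V4-Pro": (1.6e12, 864.70),
    "DeepSeek-V4-Flash-Base": (284e9, 294.67),
    "DeepSeek-V4-Pro-Base": (1.6e12, 1606.03),
}


def implied_bits_per_param(total_params: float, size_gb: float) -> float:
    """Average bits per parameter implied by the on-disk weight size."""
    return size_gb * 1e9 * 8 / total_params


for name, (params, size_gb) in MODELS.items():
    bits = implied_bits_per_param(params, size_gb)
    print(f"{name:24s} ~{bits:.2f} bits/param")
```

Under these assumptions the two Base checkpoints come out near 8 bits per parameter, in line with the table's note that Base is closer to full FP8 size, while the two inference checkpoints come out between 4 and 5 bits per parameter.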

DeepSeek V4 Flash VRAM Estimate

V4-Flash is the most approachable DeepSeek V4 variant for local experiments. But that only means “more approachable than Pro”; it is still not a consumer single-GPU model.

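The table below uses the official 159.61GB weight size as the baseline and scales it as if that baseline were 8-bit. Because 159.61GB for 284B parameters already works out to roughly 4.5 bits per parameter, the Q6 to Q2 rows should be read as optimistic; the Flash-Base table, which starts from 294.67GB, is the more conservative reference for the same parameter count. The Q4/Q3/Q2 rows are bit-width estimates and do not imply that stable official GGUF versions currently exist.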

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	159.61GB	192GB	256GB	Multi-GPU servers, inference service
`Q6`	120GB	160GB	192GB	Quality-first quantization tests
`Q5`	100GB	128GB	160GB	Quality/size balance
`Q4`	80GB	96GB	128GB	More realistic starting point for Flash
`Q3`	60GB	80GB	96GB	Large-VRAM single GPU or multi-GPU tests
`Q2`	40GB	48GB	64GB	Extreme low-bit experiments with clear quality risk

If mature V4-Flash Q4 builds appear later, they still probably will not fit on a 24GB GPU. A more realistic starting point is 96GB to 128GB of total VRAM, or CPU/offload setups that trade speed for capacity.
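The estimated weight sizes in the Flash table appear to follow a simple proportional rule: treat the 159.61GB baseline as 8-bit and scale by the target bit width. That reading is inferred from the numbers rather than stated in the article, and the function name below is hypothetical.

```python
def scaled_weight_gb(baseline_gb: float, bits: int, baseline_bits: int = 8) -> float:
    """Scale a baseline weight size linearly with the target bit width.

    baseline_bits=8 is an assumption inferred from the table's numbers.
    """
    return baseline_gb * bits / baseline_bits


FLASH_BASELINE_GB = 159.61  # official V4-Flash weight size from the table

for bits in (6, 5, 4, 3, 2):
    estimate = scaled_weight_gb(FLASH_BASELINE_GB, bits)
    print(f"Q{bits}: ~{round(estimate)}GB")
```

Real quantized files rarely land exactly on the nominal bit width, since some tensors are kept at higher precision and each block carries scale metadata, so these figures are a floor for planning rather than a prediction of actual file sizes.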

DeepSeek V4 Pro VRAM Estimate

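V4-Pro is the flagship inference model, with official weights around 864.70GB. Even at 4-bit quantization, the full weights remain in the hundreds of GB. The quantized rows below scale the official size as if it were 8-bit; since 864.70GB for 1.6T parameters is already about 4.3 bits per parameter, read them as optimistic.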

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	864.70GB	1TB	1.2TB+	Multi-node or multi-GPU inference service
`Q6`	648GB	768GB	1TB	High-quality quantized service
`Q5`	540GB	640GB	768GB	Quality/cost balance
`Q4`	432GB	512GB	640GB	Lowest practical quality line for Pro
`Q3`	324GB	384GB	512GB	Low-bit experiments
`Q2`	216GB	256GB	320GB	Extreme experiments with high quality and stability risk

For individual users, V4-Pro is better consumed through an API. If the goal is full local deployment, treat it as a multi-GPU server model, not a 4090, 5090, or RTX PRO single-GPU model.

DeepSeek V4 Flash-Base VRAM Estimate

Base models are usually for research, fine-tuning, or continued training, not ordinary chat deployment. V4-Flash-Base has official weights of about 294.67GB.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	294.67GB	384GB	512GB	Research, preprocessing, evaluation
`Q6`	221GB	256GB	320GB	High-quality quantization research
`Q5`	184GB	224GB	256GB	Quality/size balance
`Q4`	147GB	192GB	224GB	Lower-cost Base experiments
`Q3`	111GB	128GB	160GB	Low-bit experiments
`Q2`	74GB	96GB	128GB	Extreme experiments

If you only want to use DeepSeek V4 capabilities, do not start with the Base model. Base models cost more to deploy and tune; most applications should use the inference model or API.

DeepSeek V4 Pro-Base VRAM Estimate

V4-Pro-Base is the heaviest variant, with official weights around 1606.03GB. That is already a 1.6TB-class model file.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	1606.03GB	2TB	2.4TB+	Large-scale research clusters
`Q6`	1205GB	1.5TB	2TB	High-quality quantization research
`Q5`	1004GB	1.2TB	1.5TB	Research and evaluation
`Q4`	803GB	1TB	1.2TB	Low-bit research
`Q3`	602GB	768GB	1TB	Extreme low-bit research
`Q2`	402GB	512GB	640GB	Extreme experiments

This kind of model should not be framed as “can a home GPU run it?” Even Q4 is already beyond the comfortable range of most single-machine workstations.

Why Active Params Are Not Enough

DeepSeek V4 is an MoE model. MoE means each token activates only some of the experts, so per-token compute is much lower than the total parameter count would suggest. But this does not mean VRAM only needs to hold the active parameters.

Full local inference also depends on:

Whether all expert weights must stay resident on GPU.
Whether on-demand expert loading is supported.
CPU memory to GPU memory transfer costs.
NVMe offload latency.
KV cache growth under long context.
Extra runtime overhead under 1M context.
Multi-node and multi-GPU communication cost.

So V4-Pro with 49B active should not be deployed like a 49B model. V4-Flash with 13B active should not be treated like a 13B small model either.
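The split between memory and compute can be put in numbers. The sketch below uses the total and active parameter counts from the article, plus the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token; that rule ignores attention over the context and is an approximation, not a DeepSeek figure.

```python
# (total parameters, active parameters), from the article's model list.
MODELS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the weights that participate in computing one token."""
    return active_params / total_params


def approx_forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: ~2 FLOPs per active parameter per token."""
    return 2 * active_params


for name, (total, active) in MODELS.items():
    frac = active_fraction(total, active)
    flops = approx_forward_flops_per_token(active)
    print(f"{name}: {frac:.1%} of weights active per token, "
          f"~{flops / 1e9:.0f} GFLOPs per token, "
          f"but all {total / 1e9:.0f}B parameters must be stored somewhere")
```

Compute scales with the small active share, while storage scales with the full parameter count, which is the asymmetry this section describes.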

How to Choose

If you are an ordinary individual user:

Do not try to fully self-host DeepSeek V4.
Use the official API when you need DeepSeek V4 capabilities.
For private local deployment, first check whether you have mature inference infrastructure or internal multi-GPU servers.
With only 24GB to 48GB VRAM, 7B, 14B, 32B, or 70B quantized models are more practical.

If you have 128GB to 256GB total VRAM:

Watch for stable community implementations of V4-Flash Q4/Q5.
Do not treat V4-Pro as your main local model.

If you have 512GB+ total VRAM:

V4-Pro Q4 starts to become an engineering validation target.
You still need to care about inference framework support, expert scheduling, KV cache, throughput, and concurrency.

The key question for DeepSeek V4 local deployment is not “which quantized file should I download?” It is “do I have the system-level inference capacity for this model?” It is closer to a server model than a desktop model.

Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models

Fri, 01 May 2026 11:42:34 +0800

Gemma 4 currently has four main sizes for local deployment: E2B, E4B, 26B A4B, and 31B. E2B and E4B target lightweight and edge devices, 26B A4B uses an MoE architecture, and 31B is the larger dense model.

The easiest mistake in local inference is mixing up two numbers:

GGUF file size: how large the model weight file is.
Actual VRAM usage: affected by model weights, KV cache, runtime overhead, context length, and whether multimodal projection files are loaded.

The tables below estimate VRAM requirements based on GGUF file size. The default assumption is local text inference with llama.cpp, LM Studio, Ollama, or similar runtimes, using short to medium context. If you need long context, image/audio input, or concurrent requests, leave more VRAM headroom.

Quick Summary

VRAM	Good Fit	Avoid
4GB	Low-bit E2B quantizations	E4B and above
6GB	E2B Q4/Q5, low-bit E4B	26B, 31B
8GB	E2B Q8, E4B Q4/Q5	26B Q4, 31B Q4
12GB	E4B Q8, low-quality 2-bit/3-bit 26B or 31B tests	26B Q4 with long context, 31B Q4
16GB	Low-bit 26B, low-bit 31B	31B Q4 with long context, 26B Q5 and above
24GB	26B Q4/Q5, 31B Q4	31B Q8, BF16
32GB	26B Q6/Q8, 31B Q5/Q6	BF16
48GB	31B Q8 more comfortably, 26B Q8 with longer context	31B BF16
80GB+	26B/31B BF16	Single consumer GPU deployment

If you just want something usable locally, start with E4B Q4_K_M or E2B Q4_K_M. With 24GB VRAM, 26B A4B Q4_K_M and 31B Q4_K_M start to become realistic choices.

Gemma 4 E2B VRAM Table

E2B is the lightest version, suitable for laptops, mini PCs, mobile devices, and low-VRAM testing. It is easy to run, but complex reasoning, coding, and long tasks are limited.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	2.29GB	4GB	6GB	Extreme low-VRAM tests
`UD-Q2_K_XL`	2.40GB	4GB	6GB	Low-VRAM usability
`Q3_K_M`	2.54GB	4GB	6GB	Lightweight chat and summaries
`IQ4_XS`	2.98GB	6GB	8GB	Balance of quality and size
`Q4_K_M`	3.11GB	6GB	8GB	Recommended E2B default
`Q5_K_M`	3.36GB	6GB	8GB	Slightly steadier than Q4
`Q6_K`	4.50GB	8GB	10GB	Higher-quality small model
`Q8_0`	5.05GB	8GB	10GB	Near-original precision for lightweight deployment
`BF16`	9.31GB	12GB	16GB	Debugging, comparison, research

For daily use, E2B Q4_K_M is already enough. With only 4GB VRAM, 2-bit or 3-bit variants can work, but output quality will be less stable.

Gemma 4 E4B VRAM Table

E4B is the more practical lightweight model. Compared with E2B, it is better for everyday writing, document summaries, light coding assistance, and local assistant use.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	3.53GB	6GB	8GB	Low-VRAM tests
`UD-Q2_K_XL`	3.74GB	6GB	8GB	Low-VRAM usability
`Q3_K_M`	4.06GB	6GB	10GB	Lightweight local assistant
`IQ4_XS`	4.72GB	8GB	12GB	Balance of quality and speed
`Q4_K_M`	4.98GB	8GB	12GB	Recommended E4B default
`Q5_K_M`	5.48GB	8GB	12GB	Steadier everyday use
`Q6_K`	7.07GB	10GB	16GB	Quality first
`Q8_0`	8.19GB	12GB	16GB	Near-original precision
`BF16`	15.05GB	20GB	24GB	Research, evaluation, precision comparison

If your GPU has 8GB VRAM, E4B Q4_K_M is a realistic starting point. With 12GB or 16GB VRAM, E4B Q8_0 is also worth considering.

Gemma 4 26B A4B VRAM Table

26B A4B is the MoE version. It has a larger total parameter count, but activates only part of the experts during inference. It is better suited to more complex Q&A, coding, tool use, and agent workflows.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	9.97GB	14GB	16GB	Extreme 16GB GPU tests
`UD-Q2_K_XL`	10.55GB	14GB	16GB	Running 26B with low VRAM
`UD-Q3_K_M`	12.53GB	16GB	20GB	Better quality while still VRAM-conscious
`UD-IQ4_XS`	13.42GB	16GB	24GB	Balance of quality and size
`UD-Q4_K_M`	16.87GB	20GB	24GB	Recommended 26B default
`UD-Q5_K_M`	21.15GB	24GB	32GB	Higher-quality quantization
`UD-Q6_K`	23.17GB	28GB	32GB	Quality first
`Q8_0`	26.86GB	32GB	40GB	Near-original precision
`BF16`	50.51GB	64GB	80GB	Not realistic for most single consumer GPUs

24GB VRAM is the comfortable dividing line for 26B A4B. A 16GB GPU can try low-bit versions, but context length, concurrency, and multimodal input should be kept modest.
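The two VRAM columns can be read as thresholds: below "Minimum VRAM" the model is not expected to start, between the two it runs with little room to spare, and at or above "Safer VRAM" there is headroom. A small sketch of that reading follows; the three labels are hypothetical names, not terms from the table.

```python
# (minimum VRAM, safer VRAM) in GB, copied from the 26B A4B table above.
TABLE_26B_A4B = {
    "UD-IQ2_M": (14, 16),
    "UD-Q2_K_XL": (14, 16),
    "UD-Q3_K_M": (16, 20),
    "UD-IQ4_XS": (16, 24),
    "UD-Q4_K_M": (20, 24),
    "UD-Q5_K_M": (24, 32),
    "UD-Q6_K": (28, 32),
    "Q8_0": (32, 40),
    "BF16": (64, 80),
}


def classify(vram_gb: float, quant: str) -> str:
    """Classify a quantization for a given GPU using the table's thresholds."""
    minimum, safer = TABLE_26B_A4B[quant]
    if vram_gb >= safer:
        return "comfortable"
    if vram_gb >= minimum:
        return "tight"
    return "too small"


for vram in (16, 24, 32):
    summary = {quant: classify(vram, quant) for quant in TABLE_26B_A4B}
    print(f"{vram}GB:", summary)
```

For a 24GB card this marks UD-Q4_K_M as comfortable and UD-Q5_K_M as tight, which matches the description of 24GB as the dividing line.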

Gemma 4 31B VRAM Table

31B is the larger dense model. Its strength is overall capability, but its VRAM pressure is more direct than that of 26B A4B.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	8.53GB	12GB	16GB	Extreme low-VRAM tests with clear quality loss
`UD-IQ2_M`	10.75GB	14GB	18GB	Low-VRAM tests
`UD-Q2_K_XL`	11.77GB	16GB	20GB	16GB GPU experiments
`Q3_K_S`	13.21GB	16GB	24GB	More VRAM-efficient 3-bit
`Q3_K_M`	14.74GB	20GB	24GB	Common 3-bit compromise
`IQ4_XS`	16.37GB	20GB	24GB	Near-Q4 compromise
`Q4_K_M`	18.32GB	24GB	32GB	Recommended 31B default
`Q5_K_M`	21.66GB	28GB	32GB	Higher-quality quantization
`Q6_K`	25.20GB	32GB	40GB	Quality first
`Q8_0`	32.64GB	40GB	48GB	Near-original precision
`BF16`	61.41GB	80GB	96GB	Server or large-VRAM workstation

Low-bit 31B can be tested on a 16GB GPU, but for daily use, 24GB VRAM is a better starting point. Q4_K_M is the balanced choice, while Q5_K_M and above make more sense with 32GB+ VRAM.

Why Actual Usage Is Higher Than File Size

The GGUF file size is only the weight size. Runtime usage also includes:

KV cache: longer context means higher memory use.
Batch size and concurrency: processing more tokens or more users increases VRAM.
Multimodal components: image, audio, or video input often requires mmproj or extra modules.
Runtime backend: CUDA, Metal, ROCm, and CPU/GPU split loading behave differently.
KV cache quantization: q8_0, q4_0, and similar modes can save VRAM, but may affect detail.

So the “minimum VRAM” column should be read as the threshold for startup and short-context inference. For 32K, 64K, 128K, or even 256K context, VRAM requirements rise significantly.
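How much KV cache quantization saves can be estimated from the storage layout of each type. The sketch below assumes the ggml block layouts as commonly documented for llama.cpp: q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes, each including a 2-byte scale. Other runtimes may use different formats, so treat the ratios as llama.cpp-style estimates.

```python
# Bytes needed to store one block of 32 cache elements.
# Assumed layouts: f16 = 32 x 2 bytes; q8_0 = 32 x 1 byte + 2-byte scale;
# q4_0 = 32 x 0.5 byte + 2-byte scale.
BYTES_PER_32_ELEMENTS = {
    "f16": 64,
    "q8_0": 34,
    "q4_0": 18,
}


def relative_cache_size(cache_type: str) -> float:
    """KV cache size relative to an f16 cache of the same shape."""
    return BYTES_PER_32_ELEMENTS[cache_type] / BYTES_PER_32_ELEMENTS["f16"]


for cache_type in BYTES_PER_32_ELEMENTS:
    ratio = relative_cache_size(cache_type)
    print(f"{cache_type:5s} -> {ratio:.1%} of the f16 KV cache size")
```

So q8_0 roughly halves the cache and q4_0 cuts it to under a third, which can be the difference between fitting a 64K context and not, at the cost of some precision in the cached keys and values.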

How to Choose

If you just want to try Gemma 4 locally:

4GB to 6GB VRAM: choose E2B Q3_K_M or E2B Q4_K_M.
8GB VRAM: prefer E4B Q4_K_M; E2B Q8_0 is also fine.
12GB VRAM: choose E4B Q8_0, or try low-bit 26B/31B variants.
16GB VRAM: try 26B A4B UD-Q3_K_M or 31B Q3_K_S, but do not expect long context to feel comfortable.
24GB VRAM: focus on 26B A4B UD-Q4_K_M and 31B Q4_K_M.
32GB and above: consider Q5_K_M, Q6_K, or longer context.

Most users do not need BF16. Local deployment is not about picking the largest file, but about balancing VRAM, speed, context length, and output quality.