Running Qwen3.6 Locally: VRAM Requirements for 27B and 35B-A3B Quantized Models

Fri, 01 May 2026 12:02:00 +0800

Two Qwen3.6 open-weight models matter most for local deployment:

Qwen3.6-27B: a 27B dense model.
Qwen3.6-35B-A3B: a 35B total / 3B active MoE model.

Hosted product or API models such as Qwen3.6-Plus and Qwen3.6-Max also exist. A model without publicly released full weights and stable quantized files does not belong in a local VRAM table, so this article covers only versions that can be deployed from Hugging Face weights and GGUF quantized files.

As with the Gemma 4 table in /05/10, two concepts need to be separated first:

GGUF file size: how large the model weight file is.
Actual VRAM usage: affected by weights, KV cache, context length, runtime backend, multimodal modules, and batch size.

Qwen3.6 has a very long default context: the official model card states native support for 262,144 tokens, extendable to 1,010,000 tokens. The “Minimum VRAM” column below therefore applies only to short or medium context. If you really want 128K, 256K, or longer context, reserve much more room for KV cache.

Quick Summary

VRAM	Good Fit	Avoid
8GB	Extreme 2-bit tests for 27B / 35B-A3B, only with partial CPU offload (the smallest files are 9.39GB and 10.76GB) and with clear quality risk	Q4 and above
12GB	27B Q2/Q3, 35B-A3B Q2/Q3 with short context	27B Q4 with long context
16GB	27B Q3/Q4, 35B-A3B Q3/IQ4_XS	35B-A3B Q4 with long context
24GB	27B Q4/Q5/Q6, 35B-A3B Q4	35B-A3B Q8, BF16
32GB	27B Q8, 35B-A3B Q5/Q6	BF16
48GB	35B-A3B Q8, 27B with longer context more comfortably	35B-A3B BF16
80GB+	27B / 35B-A3B BF16	No need to chase BF16 for ordinary local chat

If you have a 24GB GPU, focus on:

Qwen3.6-27B Q4_K_M
Qwen3.6-27B Q5_K_M
Qwen3.6-35B-A3B UD-Q4_K_M

If you only have 16GB VRAM, start with low-bit variants and do not enable very long context right away.

Official Weight Sizes

The following BF16 weight sizes come from model.safetensors.index.json in the official Hugging Face repositories. They are useful as a reference for the original model scale.

Model	Architecture	Official BF16 Weight Size	Official Context
`Qwen3.6-27B`	27B dense	55.56GB	Native 262K, extendable to 1,010K
`Qwen3.6-35B-A3B`	35B total / 3B active MoE	71.90GB	Native 262K, extendable to 1,010K

Although 35B-A3B activates only about 3B parameters per step, the full MoE weights must still be loaded, so it should not be sized like a small 3B model.
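A rough back-of-the-envelope check of this point, assuming weight memory is simply parameter count times bits per weight. The helper name `weight_gb` is hypothetical, and real checkpoints carry some extra tensors and metadata on top of this estimate:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) at a given precision."""
    return params_billion * bits_per_weight / 8


# BF16 stores 16 bits per weight.
total_moe = weight_gb(35, 16)    # all experts must be resident
active_only = weight_gb(3, 16)   # what a "3B model" estimate would suggest
dense_27b = weight_gb(27, 16)

print(f"35B total at BF16: ~{total_moe:.0f} GB")
print(f"3B active at BF16: ~{active_only:.0f} GB")
print(f"27B dense at BF16: ~{dense_27b:.0f} GB")
```

The 35B and 27B estimates land close to the official 71.90GB and 55.56GB figures, while a 3B-style estimate would be off by roughly an order of magnitude.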

Qwen3.6-27B VRAM Table

Qwen3.6-27B is a dense model. Its advantage is stable behavior, and its inference cost is close to that of a traditional 27B model. For local deployment it is more compute-heavy than 35B-A3B, but its VRAM requirements are easier to estimate.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	9.39GB	12GB	16GB	Extreme low-VRAM tests
`UD-IQ2_M`	10.85GB	12GB	16GB	Low-VRAM usability
`UD-Q2_K_XL`	11.85GB	14GB	18GB	Low-bit compromise
`UD-IQ3_XXS`	11.99GB	14GB	18GB	VRAM-saving 3-bit
`Q3_K_S`	12.36GB	16GB	20GB	3-bit entry point
`Q3_K_M`	13.59GB	16GB	20GB	Common 3-bit compromise
`IQ4_XS`	15.44GB	20GB	24GB	Near-Q4, more VRAM efficient
`IQ4_NL`	16.07GB	20GB	24GB	Quality/size balance
`Q4_K_M`	16.82GB	20GB	24GB	Recommended 27B default
`Q5_K_M`	19.51GB	24GB	32GB	Higher-quality quantization
`Q6_K`	22.52GB	28GB	32GB	Quality first
`Q8_0`	28.60GB	32GB	40GB	Near-original precision
`BF16`	53.80GB	64GB	80GB	Research, evaluation, precision comparison

For ordinary local coding and chat, Q4_K_M is the easiest starting point to recommend. A 24GB GPU can run Q4_K_M fairly comfortably at short or medium context. For long context, either drop to a smaller quantization or cap the context length.
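To look up the 27B table programmatically, a minimal sketch could filter the rows by VRAM budget. The data is copied from the table above; the names `QWEN36_27B` and `fitting_quants` are hypothetical helpers, not part of any tool:

```python
# (quantization, GGUF file size GB, minimum VRAM GB, safer VRAM GB)
# Values copied from the Qwen3.6-27B table in this article.
QWEN36_27B = [
    ("UD-IQ2_XXS", 9.39, 12, 16),
    ("UD-IQ2_M", 10.85, 12, 16),
    ("UD-Q2_K_XL", 11.85, 14, 18),
    ("UD-IQ3_XXS", 11.99, 14, 18),
    ("Q3_K_S", 12.36, 16, 20),
    ("Q3_K_M", 13.59, 16, 20),
    ("IQ4_XS", 15.44, 20, 24),
    ("IQ4_NL", 16.07, 20, 24),
    ("Q4_K_M", 16.82, 20, 24),
    ("Q5_K_M", 19.51, 24, 32),
    ("Q6_K", 22.52, 28, 32),
    ("Q8_0", 28.60, 32, 40),
    ("BF16", 53.80, 64, 80),
]


def fitting_quants(table, vram_gb, column="minimum"):
    """Return quantizations whose minimum (or safer) VRAM fits the budget."""
    idx = 2 if column == "minimum" else 3
    return [row[0] for row in table if row[idx] <= vram_gb]


# Largest option that merely loads vs. largest option with comfortable margin.
print("fits (minimum):", fitting_quants(QWEN36_27B, 24)[-1])
print("fits (safer):  ", fitting_quants(QWEN36_27B, 24, column="safer")[-1])
```

Picking from the "safer" column rather than the "minimum" column is the more conservative choice when you expect to raise the context length later.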

Qwen3.6-35B-A3B VRAM Table

Qwen3.6-35B-A3B is an MoE model with 35B total parameters and about 3B active parameters per step. Its advantage is a strong balance between speed and capability, especially for local agents, tool use, and coding workflows.

Note, however, that the 3B active figure mainly affects compute, not memory. VRAM usage is not comparable to a 3B model, because full operation still needs all the expert weights.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	10.76GB	12GB	16GB	Extreme low-VRAM tests
`UD-IQ2_M`	11.52GB	14GB	16GB	Low-VRAM usability
`UD-Q2_K_XL`	12.29GB	14GB	18GB	Low-bit compromise
`UD-IQ3_XXS`	13.21GB	16GB	20GB	VRAM-saving 3-bit
`UD-Q3_K_S`	15.36GB	18GB	24GB	3-bit entry point
`UD-Q3_K_M`	16.60GB	20GB	24GB	Common 3-bit compromise
`UD-IQ4_XS`	17.73GB	20GB	24GB	Quality/size balance
`UD-IQ4_NL`	18.04GB	20GB	24GB	Near-Q4 recommended option
`UD-Q4_K_M`	22.13GB	24GB	32GB	Recommended 35B-A3B default
`UD-Q5_K_M`	26.46GB	32GB	40GB	Higher-quality quantization
`UD-Q6_K`	29.31GB	32GB	48GB	Quality first
`Q8_0`	36.90GB	48GB	64GB	Near-original precision
`BF16`	69.37GB	80GB	96GB	Research, evaluation, precision comparison

With 24GB VRAM, UD-Q4_K_M is a key option, but do not set the context too high. If you want room for 128K+ context, UD-IQ4_XS, UD-IQ4_NL, or 3-bit versions are more realistic.
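The trade-off is easier to see as simple subtraction. The sketch below treats nominal VRAM and GGUF file sizes as the same unit and assumes a 1GB allowance for the runtime and compute buffers; that allowance and the helper name `headroom_gb` are illustrative assumptions, and real overhead varies by backend:

```python
# GGUF file sizes in GB, copied from the Qwen3.6-35B-A3B table above.
FILE_SIZE_GB = {
    "UD-Q3_K_M": 16.60,
    "UD-IQ4_XS": 17.73,
    "UD-IQ4_NL": 18.04,
    "UD-Q4_K_M": 22.13,
}


def headroom_gb(vram_gb: float, quant: str, reserved_gb: float = 1.0) -> float:
    """VRAM left for KV cache after loading the weights.

    reserved_gb is an assumed allowance for the runtime and compute
    buffers; the real figure depends on the backend and settings.
    """
    return round(vram_gb - FILE_SIZE_GB[quant] - reserved_gb, 2)


for quant in FILE_SIZE_GB:
    print(f"{quant:10s} leaves ~{headroom_gb(24, quant):.2f} GB on a 24GB card")
```

Under these assumptions, UD-Q4_K_M leaves under 1GB for KV cache on a 24GB card, while UD-IQ4_XS leaves several times more.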

27B vs 35B-A3B

Need	Better Choice
Stable dense-model behavior	`Qwen3.6-27B`
Faster response, agents, and tool use	`Qwen3.6-35B-A3B`
Daily local use on 24GB VRAM	`35B-A3B UD-Q4_K_M` or `27B Q4_K_M`
Testing on 16GB VRAM	Use 2-bit/3-bit for both; avoid long context
Long context first	Use lower-bit quantization and leave more KV cache room
Quality first with 32GB+ VRAM	`27B Q5/Q6` or `35B-A3B Q5/Q6`

If you mainly write code, run agents, or use tools, 35B-A3B is worth trying first. If you care more about dense-model stability and consistency, 27B is more straightforward.

Why Long Context Uses So Much VRAM

The Qwen3.6 model card recommends keeping longer context for complex tasks and even notes that 128K+ context can help reasoning. But for local deployment, long context means a much larger KV cache.

Actual VRAM usage is affected by:

KV cache: longer context means higher usage.
Whether vision input is enabled: Qwen3.6 includes a vision encoder, and multimodal use adds overhead.
Whether --language-model-only is used: in runtimes such as vLLM, skipping vision can free memory for KV cache.
Batch size and concurrency: more concurrency requires more VRAM.
KV cache quantization: q8_0, q4_0, and similar settings can save VRAM, but may cost some accuracy on fine details.
Runtime differences: llama.cpp, vLLM, SGLang, KTransformers, and LM Studio do not use exactly the same amount of memory.

So do not look only at GGUF file size. If the file is already close to the VRAM limit, the model may load but still OOM when generating long outputs or using long context.
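For a feel of how KV cache grows with context, here is a minimal sketch of the standard formula for attention layers: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The architecture values in `ARCH` are hypothetical placeholders, not Qwen3.6's real configuration; read the actual numbers from the model's config.json, and count only the layers that actually keep a KV cache.

```python
def kv_cache_gb(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Approximate KV cache size in GB (10^9 bytes) for one sequence.

    The factor 2 covers keys and values; bytes_per_elem is 2.0 for f16.
    """
    total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens
    return total / 1e9


# HYPOTHETICAL architecture values, for illustration only.
ARCH = dict(n_layers=32, n_kv_heads=4, head_dim=128)

# llama.cpp block formats: q8_0 packs 32 values into 34 bytes,
# q4_0 packs 32 values into 18 bytes.
CACHE_BYTES = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

for ctx in (8_192, 131_072, 262_144):
    row = ", ".join(
        f"{name}: {kv_cache_gb(ctx, bytes_per_elem=b, **ARCH):.2f} GB"
        for name, b in CACHE_BYTES.items()
    )
    print(f"{ctx:>7} tokens -> {row}")
```

The cache scales linearly with context length and with batch size or concurrency, which is why a file that loads with room to spare at 8K can still run out of memory at 128K.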

How to Choose

If you just want to try Qwen3.6 locally:

12GB VRAM: try 27B UD-IQ2_M or 35B-A3B UD-IQ2_M, with short context.
16GB VRAM: try 27B Q3_K_M or 35B-A3B UD-IQ3_XXS.
24GB VRAM: prefer 27B Q4_K_M, 35B-A3B UD-IQ4_NL, or 35B-A3B UD-Q4_K_M.
32GB VRAM: consider 27B Q5/Q6 or 35B-A3B Q5/Q6.
48GB and above: try Q8_0, or reserve more room for long context.

Most users do not need BF16. The point of local Qwen3.6 deployment is not to choose the largest file, but to balance VRAM, context length, speed, and output quality.

Qwen3.6 on KnightLi Blog