Running Qwen3.6 Locally: VRAM Requirements for 27B and 35B-A3B Quantized Models

A Gemma 4-style VRAM table for Qwen3.6-27B and Qwen3.6-35B-A3B across common GGUF quantization levels, including file size, minimum VRAM, and safer VRAM targets.

The two Qwen3.6 open-weight models most relevant for local deployment are:

  • Qwen3.6-27B: a 27B dense model.
  • Qwen3.6-35B-A3B: a 35B total / 3B active MoE model.

There are also online product or API model names such as Qwen3.6-Plus and Qwen3.6-Max. A model without public full weights and stable quantized files is not suitable for a local VRAM table, so this article only covers versions that can be deployed locally from Hugging Face weights and GGUF quantized files.

As with the Gemma 4 table in /05/10, two concepts need to be separated first:

  • GGUF file size: how large the model weight file is.
  • Actual VRAM usage: affected by weights, KV cache, context length, runtime backend, multimodal modules, and batch size.

Qwen3.6 has a very long default context. The official model card states native support for 262,144 tokens and extension to 1,010,000 tokens. So the “minimum VRAM” column below only applies to short or medium context. If you really want 128K, 256K, or longer context, reserve much more room for KV cache.
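
To make the KV cache cost concrete, here is a rough sketch of how an f16 KV cache grows with context length. The layer and head counts are illustrative placeholders, not the actual Qwen3.6 configuration; read the real values from the model's config.json if you want a serious estimate.

```python
# Rough f16 KV cache estimate for a decoder-only transformer.
# n_layers, n_kv_heads, and head_dim are illustrative placeholders,
# NOT the real Qwen3.6 configuration -- take them from config.json.
def kv_cache_gib(context_len: int,
                 n_layers: int = 48,
                 n_kv_heads: int = 8,
                 head_dim: int = 128,
                 bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, one slot per layer per token
    total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len
    return total / 1024**3

for ctx in (8_192, 32_768, 131_072, 262_144):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gib(ctx):.1f} GiB of KV cache")
```

With these placeholder numbers the KV cache alone at 256K context is already far larger than any 4-bit weight file, which is why the long-context columns matter so much.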

Quick Summary

| VRAM | Good Fit | Avoid |
|------|----------|-------|
| 8GB | Extreme 2-bit tests for 27B / 35B-A3B, with clear quality risk | Q4 and above |
| 12GB | 27B Q2/Q3, 35B-A3B Q2/Q3 with short context | 27B Q4 with long context |
| 16GB | 27B Q3/Q4, 35B-A3B Q3/IQ4_XS | 35B-A3B Q4 with long context |
| 24GB | 27B Q4/Q5/Q6, 35B-A3B Q4 | 35B-A3B Q8, BF16 |
| 32GB | 27B Q8, 35B-A3B Q5/Q6 | BF16 |
| 48GB | 35B-A3B Q8; more comfortable long context for 27B | 35B-A3B BF16 |
| 80GB+ | 27B / 35B-A3B BF16 | No need to chase BF16 for ordinary local chat |

If you have a 24GB GPU, focus on:

  • Qwen3.6-27B Q4_K_M
  • Qwen3.6-27B Q5_K_M
  • Qwen3.6-35B-A3B UD-Q4_K_M

If you only have 16GB VRAM, start with low-bit variants and do not enable very long context right away.

Official Weight Sizes

The following BF16 weight sizes come from model.safetensors.index.json in the official Hugging Face repositories. They are useful as a reference for the original model scale.

| Model | Architecture | Official BF16 Weight Size | Official Context |
|-------|--------------|---------------------------|------------------|
| Qwen3.6-27B | 27B dense | 55.56GB | Native 262K, extendable to 1,010K |
| Qwen3.6-35B-A3B | 35B total / 3B active MoE | 71.90GB | Native 262K, extendable to 1,010K |

Although 35B-A3B activates about 3B parameters per step, it still needs to load the full MoE weights, so it should not be estimated like a small 3B model.
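
As a quick sanity check on those sizes, BF16 stores two bytes per parameter, so the nominal parameter count alone already predicts most of the file size; the rest of the gap comes from the "27B" and "35B" labels being rounded.

```python
# BF16 stores 2 bytes per parameter, so weights ~= param_count * 2 bytes.
# The nominal "27B" / "35B" labels are rounded, so expect a gap of a few GB
# versus the exact sizes reported in model.safetensors.index.json.
for name, params_billion in (("Qwen3.6-27B", 27), ("Qwen3.6-35B-A3B", 35)):
    approx_gb = params_billion * 2  # billions of params * 2 bytes each, in GB
    print(f"{name}: ~{approx_gb} GB in BF16")
```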

Qwen3.6-27B VRAM Table

Qwen3.6-27B is a dense model. Its advantage is stable behavior, while its inference cost is closer to a traditional 27B model. For local deployment, it is more compute-heavy than 35B-A3B, but its VRAM requirements are easier to estimate.

| Quantization | GGUF File Size | Minimum VRAM | Safer VRAM | Best For |
|--------------|----------------|--------------|------------|----------|
| UD-IQ2_XXS | 9.39GB | 12GB | 16GB | Extreme low-VRAM tests |
| UD-IQ2_M | 10.85GB | 12GB | 16GB | Low-VRAM usability |
| UD-Q2_K_XL | 11.85GB | 14GB | 18GB | Low-bit compromise |
| UD-IQ3_XXS | 11.99GB | 14GB | 18GB | VRAM-saving 3-bit |
| Q3_K_S | 12.36GB | 16GB | 20GB | 3-bit entry point |
| Q3_K_M | 13.59GB | 16GB | 20GB | Common 3-bit compromise |
| IQ4_XS | 15.44GB | 20GB | 24GB | Near-Q4, more VRAM efficient |
| IQ4_NL | 16.07GB | 20GB | 24GB | Quality/size balance |
| Q4_K_M | 16.82GB | 20GB | 24GB | Recommended 27B default |
| Q5_K_M | 19.51GB | 24GB | 32GB | Higher-quality quantization |
| Q6_K | 22.52GB | 28GB | 32GB | Quality first |
| Q8_0 | 28.60GB | 32GB | 40GB | Near-original precision |
| BF16 | 53.80GB | 64GB | 80GB | Research, evaluation, precision comparison |

For ordinary local coding and chat, Q4_K_M is the easiest starting point to recommend. A 24GB GPU can run Q4_K_M fairly comfortably, but if you need long context, drop to a smaller quantization or shorten the context.
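
As one possible starting point, a minimal llama-cpp-python sketch for the Q4_K_M build could look like the following. The GGUF file name is a placeholder, and 16K context is just a conservative choice for 24GB, not an official recommendation.

```python
# Minimal llama-cpp-python sketch for Qwen3.6-27B Q4_K_M on a single 24GB GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen3.6-27B-Q4_K_M.gguf",  # placeholder path to your GGUF
    n_gpu_layers=-1,  # offload every layer to the GPU
    n_ctx=16384,      # keep the context modest on 24GB; raise only if VRAM allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a KV cache is in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```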

Qwen3.6-35B-A3B VRAM Table

Qwen3.6-35B-A3B is an MoE model with 35B total parameters and about 3B active parameters per step. Its advantage is a strong balance between speed and capability, especially for local agents, tool use, and coding workflows.

Note, however, that the 3B active parameter count mainly affects compute. It does not mean VRAM usage is comparable to a 3B model: the full set of expert weights still has to be loaded.

| Quantization | GGUF File Size | Minimum VRAM | Safer VRAM | Best For |
|--------------|----------------|--------------|------------|----------|
| UD-IQ2_XXS | 10.76GB | 12GB | 16GB | Extreme low-VRAM tests |
| UD-IQ2_M | 11.52GB | 14GB | 16GB | Low-VRAM usability |
| UD-Q2_K_XL | 12.29GB | 14GB | 18GB | Low-bit compromise |
| UD-IQ3_XXS | 13.21GB | 16GB | 20GB | VRAM-saving 3-bit |
| UD-Q3_K_S | 15.36GB | 18GB | 24GB | 3-bit entry point |
| UD-Q3_K_M | 16.60GB | 20GB | 24GB | Common 3-bit compromise |
| UD-IQ4_XS | 17.73GB | 20GB | 24GB | Quality/size balance |
| UD-IQ4_NL | 18.04GB | 20GB | 24GB | Near-Q4 recommended option |
| UD-Q4_K_M | 22.13GB | 24GB | 32GB | Recommended 35B-A3B default |
| UD-Q5_K_M | 26.46GB | 32GB | 40GB | Higher-quality quantization |
| UD-Q6_K | 29.31GB | 32GB | 48GB | Quality first |
| Q8_0 | 36.90GB | 48GB | 64GB | Near-original precision |
| BF16 | 69.37GB | 80GB | 96GB | Research, evaluation, precision comparison |

With 24GB VRAM, UD-Q4_K_M is a key option, but do not set the context too high. If you want room for 128K+ context, UD-IQ4_XS, UD-IQ4_NL, or 3-bit versions are more realistic.
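
For a rough feel of where the ceiling sits, the sketch below stacks an assumed per-token KV cost and a fixed runtime overhead on top of the GGUF size. Both constants are illustrative guesses rather than measured values for Qwen3.6-35B-A3B, so treat the output as a ballpark and confirm real usage on your own backend.

```python
# Back-of-the-envelope check: GGUF weights + KV cache + overhead vs. a VRAM budget.
# kv_kib_per_token and overhead_gb are assumed illustration values, not measured
# numbers for Qwen3.6-35B-A3B -- verify with nvidia-smi on your own setup.
def fits(gguf_gb: float, context_len: int, vram_gb: float,
         kv_kib_per_token: float = 48.0, overhead_gb: float = 1.0):
    kv_gb = context_len * kv_kib_per_token / 1024**2  # KiB -> GiB
    total = gguf_gb + kv_gb + overhead_gb
    return total, total <= vram_gb

for ctx in (8_192, 32_768, 131_072):
    total, ok = fits(gguf_gb=22.13, context_len=ctx, vram_gb=24.0)  # UD-Q4_K_M, 24GB
    print(f"ctx={ctx:>7}: ~{total:.1f} GB -> {'fits' if ok else 'too tight'}")
```

Under these assumptions the Q4 file itself fits on 24GB, but the headroom left for KV cache disappears quickly as the context grows, which is why the lower-bit variants are the realistic path to 128K+.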

27B vs 35B-A3B

| Need | Better Choice |
|------|---------------|
| Stable dense-model behavior | Qwen3.6-27B |
| Faster response, agents, and tool use | Qwen3.6-35B-A3B |
| Daily local use on 24GB VRAM | 35B-A3B UD-Q4_K_M or 27B Q4_K_M |
| Testing on 16GB VRAM | Use 2-bit/3-bit for both; avoid long context |
| Long context first | Use lower-bit quantization and leave more KV cache room |
| Quality first with 32GB+ VRAM | 27B Q5/Q6 or 35B-A3B Q5/Q6 |

If you mainly write code, run agents, or use tools, 35B-A3B is worth trying first. If you care more about dense-model stability and consistency, 27B is more straightforward.

Why Long Context Uses So Much VRAM

The Qwen3.6 model card recommends keeping longer context for complex tasks and even notes that 128K+ context can help reasoning. But for local deployment, long context means a much larger KV cache.

Actual VRAM usage is affected by:

  • KV cache: longer context means higher usage.
  • Whether vision input is enabled: Qwen3.6 includes a vision encoder, and multimodal use adds overhead.
  • Whether --language-model-only is used: in runtimes such as vLLM, skipping vision can free memory for KV cache.
  • Batch size and concurrency: more concurrency requires more VRAM.
  • KV cache quantization: q8_0, q4_0, and similar settings can save VRAM, but may affect details.
  • Runtime differences: llama.cpp, vLLM, SGLang, KTransformers, and LM Studio do not use exactly the same amount of memory.

So do not look only at GGUF file size. If the file is already close to the VRAM limit, the model may load but still OOM when generating long outputs or using long context.
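
One concrete lever from the list above is KV cache quantization. A sketch with llama-cpp-python might look like this, assuming a build that exposes the type_k, type_v, and flash_attn parameters and the GGML_TYPE_Q8_0 constant; the file name is a placeholder, and quantizing the V cache in llama.cpp generally requires flash attention to be enabled.

```python
# Sketch: quantizing the KV cache to q8_0, which roughly halves KV memory vs. f16,
# assuming a llama-cpp-python build that exposes type_k / type_v / flash_attn.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    n_ctx=32768,                      # longer context fits once the KV cache shrinks
    flash_attn=True,                  # V-cache quantization generally needs flash attention
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # quantize K cache to 8-bit
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # quantize V cache to 8-bit
)
```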

How to Choose

If you just want to try Qwen3.6 locally:

  • 12GB VRAM: try 27B UD-IQ2_M or 35B-A3B UD-IQ2_M, with short context.
  • 16GB VRAM: try 27B Q3_K_M or 35B-A3B UD-IQ3_XXS.
  • 24GB VRAM: prefer 27B Q4_K_M, 35B-A3B UD-IQ4_NL, or 35B-A3B UD-Q4_K_M.
  • 32GB VRAM: consider 27B Q5/Q6 or 35B-A3B Q5/Q6.
  • 48GB and above: try Q8_0, or reserve more room for long context.

Most users do not need BF16. The point of local Qwen3.6 deployment is not to choose the largest file, but to balance VRAM, context length, speed, and output quality.
