Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models

A practical VRAM table for Gemma 4 E2B, E4B, 26B A4B, and 31B across common GGUF quantization levels, including file size, minimum VRAM, and safer VRAM targets.

Gemma 4 currently has four main sizes for local deployment: E2B, E4B, 26B A4B, and 31B. E2B and E4B target lightweight and edge devices, 26B A4B uses an MoE architecture, and 31B is the larger dense model.

The easiest mistake in local inference is mixing up two numbers:

  • GGUF file size: how large the model weight file is.
  • Actual VRAM usage: affected by model weights, KV cache, runtime overhead, context length, and whether multimodal projection files are loaded.

The tables below estimate VRAM requirements based on GGUF file size. The default assumption is local text inference with llama.cpp, LM Studio, Ollama, or similar runtimes, using short to medium context. If you need long context, image/audio input, or concurrent requests, leave more VRAM headroom.
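
As a rough sanity check before downloading anything, total usage can be approximated as weights plus KV cache plus a flat runtime overhead. The Python sketch below is a minimal estimate under stated assumptions: the layer count, KV-head count, and head dimension are illustrative placeholders, not published Gemma 4 architecture numbers; read the real values from the GGUF metadata of the model you actually download.

```python
# Back-of-the-envelope VRAM estimate: weights + KV cache + flat overhead.
# n_layers, n_kv_heads, and head_dim are PLACEHOLDER values, not official
# Gemma 4 specs -- read the real ones from the GGUF metadata of your model.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """KV cache in GiB: a K and a V tensor per layer, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

def vram_estimate_gib(gguf_gib, ctx_len, n_layers=32, n_kv_heads=8,
                      head_dim=128, overhead_gib=1.0):
    """Weights + KV cache + a flat allowance for buffers and the runtime."""
    return gguf_gib + kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len) + overhead_gib

# Example: a 4.98GB E4B Q4_K_M file at 8K context under these assumptions.
print(f"{vram_estimate_gib(4.98, 8192):.1f} GiB")  # ~7.0 GiB
```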

Quick Summary

| VRAM  | Good Fit                                         | Avoid                                      |
|-------|--------------------------------------------------|--------------------------------------------|
| 4GB   | Low-bit E2B quantizations                        | E4B and above                              |
| 6GB   | E2B Q4/Q5, low-bit E4B                           | 26B, 31B                                   |
| 8GB   | E2B Q8, E4B Q4/Q5                                | 26B Q4, 31B Q4                             |
| 12GB  | E4B Q8, low-quality 2-bit/3-bit 26B or 31B tests | 26B Q4 with long context, 31B Q4           |
| 16GB  | Low-bit 26B, low-bit 31B                         | 31B Q4 with long context, 26B Q5 and above |
| 24GB  | 26B Q4/Q5, 31B Q4                                | 31B Q8, BF16                               |
| 32GB  | 26B Q6/Q8, 31B Q5/Q6                             | BF16                                       |
| 48GB  | 31B Q8 more comfortably, 26B Q8 with longer context | 31B BF16                                |
| 80GB+ | 26B/31B BF16                                     | Single consumer GPU deployment             |

If you just want something usable locally, start with E4B Q4_K_M or E2B Q4_K_M. With 24GB VRAM, 26B A4B UD-Q4_K_M and 31B Q4_K_M become realistic choices.

Gemma 4 E2B VRAM Table

E2B is the lightest variant, suitable for laptops, mini PCs, mobile devices, and low-VRAM testing. It is easy to run, but it is limited for complex reasoning, coding, and long-running tasks.

| Quantization | GGUF File Size | Minimum VRAM | Safer VRAM | Best For |
|--------------|----------------|--------------|------------|----------|
| UD-IQ2_M   | 2.29GB | 4GB  | 6GB  | Extreme low-VRAM tests |
| UD-Q2_K_XL | 2.40GB | 4GB  | 6GB  | Low-VRAM usability |
| Q3_K_M     | 2.54GB | 4GB  | 6GB  | Lightweight chat and summaries |
| IQ4_XS     | 2.98GB | 6GB  | 8GB  | Balance of quality and size |
| Q4_K_M     | 3.11GB | 6GB  | 8GB  | Recommended E2B default |
| Q5_K_M     | 3.36GB | 6GB  | 8GB  | Slightly steadier than Q4 |
| Q6_K       | 4.50GB | 8GB  | 10GB | Higher-quality small model |
| Q8_0       | 5.05GB | 8GB  | 10GB | Near-original precision for lightweight deployment |
| BF16       | 9.31GB | 12GB | 16GB | Debugging, comparison, research |

For daily use, E2B Q4_K_M is already enough. With only 4GB VRAM, 2-bit or 3-bit variants can work, but output quality will be less stable.

Gemma 4 E4B VRAM Table

E4B is the more practical lightweight model. Compared with E2B, it is better for everyday writing, document summaries, light coding assistance, and local assistant use.

| Quantization | GGUF File Size | Minimum VRAM | Safer VRAM | Best For |
|--------------|----------------|--------------|------------|----------|
| UD-IQ2_M   | 3.53GB  | 6GB  | 8GB  | Low-VRAM tests |
| UD-Q2_K_XL | 3.74GB  | 6GB  | 8GB  | Low-VRAM usability |
| Q3_K_M     | 4.06GB  | 6GB  | 10GB | Lightweight local assistant |
| IQ4_XS     | 4.72GB  | 8GB  | 12GB | Balance of quality and speed |
| Q4_K_M     | 4.98GB  | 8GB  | 12GB | Recommended E4B default |
| Q5_K_M     | 5.48GB  | 8GB  | 12GB | Steadier everyday use |
| Q6_K       | 7.07GB  | 10GB | 16GB | Quality first |
| Q8_0       | 8.19GB  | 12GB | 16GB | Near-original precision |
| BF16       | 15.05GB | 20GB | 24GB | Research, evaluation, precision comparison |

If your GPU has 8GB VRAM, E4B Q4_K_M is a realistic starting point. With 12GB or 16GB VRAM, E4B Q8_0 is also worth considering.

Gemma 4 26B A4B VRAM Table

26B A4B is the MoE version. It has a larger total parameter count, but only a subset of its experts is activated during inference. It is better suited to more complex Q&A, coding, tool use, and agent workflows.

| Quantization | GGUF File Size | Minimum VRAM | Safer VRAM | Best For |
|--------------|----------------|--------------|------------|----------|
| UD-IQ2_M   | 9.97GB  | 14GB | 16GB | Extreme 16GB GPU tests |
| UD-Q2_K_XL | 10.55GB | 14GB | 16GB | Running 26B with low VRAM |
| UD-Q3_K_M  | 12.53GB | 16GB | 20GB | Better quality while still VRAM-conscious |
| UD-IQ4_XS  | 13.42GB | 16GB | 24GB | Balance of quality and size |
| UD-Q4_K_M  | 16.87GB | 20GB | 24GB | Recommended 26B default |
| UD-Q5_K_M  | 21.15GB | 24GB | 32GB | Higher-quality quantization |
| UD-Q6_K    | 23.17GB | 28GB | 32GB | Quality first |
| Q8_0       | 26.86GB | 32GB | 40GB | Near-original precision |
| BF16       | 50.51GB | 64GB | 80GB | Not realistic for most single consumer GPUs |

24GB VRAM is the comfortable dividing line for 26B A4B. A 16GB GPU can try low-bit versions, but context length, concurrency, and multimodal input should be kept modest.

Gemma 4 31B VRAM Table

31B is the larger dense model. Its strength is stronger overall capability, but because every parameter is active on each forward pass, its VRAM pressure is more direct than 26B A4B's.

| Quantization | GGUF File Size | Minimum VRAM | Safer VRAM | Best For |
|--------------|----------------|--------------|------------|----------|
| UD-IQ2_XXS | 8.53GB  | 12GB | 16GB | Extreme low-VRAM tests with clear quality loss |
| UD-IQ2_M   | 10.75GB | 14GB | 18GB | Low-VRAM tests |
| UD-Q2_K_XL | 11.77GB | 16GB | 20GB | 16GB GPU experiments |
| Q3_K_S     | 13.21GB | 16GB | 24GB | More VRAM-efficient 3-bit |
| Q3_K_M     | 14.74GB | 20GB | 24GB | Common 3-bit compromise |
| IQ4_XS     | 16.37GB | 20GB | 24GB | Near-Q4 compromise |
| Q4_K_M     | 18.32GB | 24GB | 32GB | Recommended 31B default |
| Q5_K_M     | 21.66GB | 28GB | 32GB | Higher-quality quantization |
| Q6_K       | 25.20GB | 32GB | 40GB | Quality first |
| Q8_0       | 32.64GB | 40GB | 48GB | Near-original precision |
| BF16       | 61.41GB | 80GB | 96GB | Server or large-VRAM workstation |

Low-bit 31B can be tested on a 16GB GPU, but for daily use, 24GB VRAM is a better starting point. Q4_K_M is the balanced choice, while Q5_K_M and above make more sense with 32GB+ VRAM.

Why Actual Usage Is Higher Than File Size

The GGUF file size is only the weight size. Runtime usage also includes:

  • KV cache: longer context means higher memory use.
  • Batch size and concurrency: processing more tokens or more users increases VRAM.
  • Multimodal components: image, audio, or video input often requires mmproj or extra modules.
  • Runtime backend: CUDA, Metal, ROCm, and CPU/GPU split loading behave differently.
  • KV cache quantization: q8_0, q4_0, and similar modes save VRAM, but may cost some output fidelity.

So the “minimum VRAM” column should be read as the threshold for startup and short-context inference. For 32K, 64K, 128K, or even 256K context, VRAM requirements rise significantly.
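
To make the context-length effect concrete, here is a small Python loop using the same placeholder architecture constants as the earlier sketch (again, assumptions rather than official Gemma 4 numbers); bytes_per_elem=2 models an fp16 cache, and 1 roughly approximates a q8_0 cache:

```python
# KV cache growth with context length, using the same placeholder
# architecture constants as the earlier sketch (not Gemma 4 specs).
def kv_cache_gib(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

for ctx in (8_192, 32_768, 131_072, 262_144):
    fp16 = kv_cache_gib(ctx)                  # fp16 K/V cache
    q8 = kv_cache_gib(ctx, bytes_per_elem=1)  # roughly a q8_0 K/V cache
    print(f"{ctx:>7} tokens: fp16 {fp16:5.1f} GiB | q8_0 ~{q8:5.1f} GiB")
```

Under these assumptions, a 128K-token fp16 cache alone can dwarf a Q4 weight file, which is why the "Safer VRAM" column leaves headroom.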

How to Choose

If you just want to try Gemma 4 locally:

  • 4GB to 6GB VRAM: choose E2B Q3_K_M or E2B Q4_K_M.
  • 8GB VRAM: prefer E4B Q4_K_M; E2B Q8_0 is also fine.
  • 12GB VRAM: choose E4B Q8_0, or try low-bit 26B/31B variants.
  • 16GB VRAM: try 26B A4B UD-Q3_K_M or 31B Q3_K_S, but do not expect long context to feel comfortable.
  • 24GB VRAM: focus on 26B A4B UD-Q4_K_M and 31B Q4_K_M.
  • 32GB and above: consider Q5_K_M, Q6_K, or longer context.

Most users do not need BF16. Local deployment is not about picking the largest file, but about balancing VRAM, speed, context length, and output quality.
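
If you want the list above as a reusable helper, a direct encoding looks like this; the thresholds and picks simply mirror this article's recommendations, so treat it as a convenience, not a rule:

```python
# The "How to Choose" list as a lookup: given a VRAM budget in GB,
# return this article's suggested starting point.
RECOMMENDATIONS = [
    (32, "26B/31B Q5_K_M or Q6_K, or spend the headroom on longer context"),
    (24, "26B A4B UD-Q4_K_M or 31B Q4_K_M"),
    (16, "26B A4B UD-Q3_K_M or 31B Q3_K_S, short context only"),
    (12, "E4B Q8_0, or low-bit 26B/31B experiments"),
    (8,  "E4B Q4_K_M (E2B Q8_0 also fine)"),
    (4,  "E2B Q3_K_M or E2B Q4_K_M"),
]

def suggest(vram_gb: float) -> str:
    for threshold, pick in RECOMMENDATIONS:
        if vram_gb >= threshold:
            return pick
    return "Below 4GB: CPU offload or a smaller model"

print(suggest(16))  # 26B A4B UD-Q3_K_M or 31B Q3_K_S, short context only
```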
