Gemma 4 currently has four main sizes for local deployment: E2B, E4B, 26B A4B, and 31B.
E2B and E4B target lightweight and edge devices, 26B A4B uses an MoE architecture, and 31B is the larger dense model.
The easiest mistake in local inference is mixing up two numbers:
- GGUF file size: how large the model weight file is.
- Actual VRAM usage: affected by model weights, KV cache, runtime overhead, context length, and whether multimodal projection files are loaded.
The tables below estimate VRAM requirements based on GGUF file size.
The default assumption is local text inference with llama.cpp, LM Studio, Ollama, or similar runtimes, using short to medium context.
If you need long context, image/audio input, or concurrent requests, leave more VRAM headroom.
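As a back-of-the-envelope check before consulting the tables, you can take the GGUF file size, add a context-dependent KV cache allowance, and add a fixed runtime overhead. Here is a minimal sketch in Python; the per-token KV cost and overhead constants are illustrative assumptions, not measured Gemma 4 values:

```python
def estimate_vram_gb(gguf_size_gb: float,
                     context_tokens: int = 4096,
                     kv_mb_per_1k_tokens: float = 100.0,  # assumed, varies by model
                     overhead_gb: float = 1.0) -> float:   # assumed runtime overhead
    """Rough VRAM estimate: weights + KV cache + runtime overhead.

    The KV cost per token really depends on layer count, KV head count,
    head size, and cache precision; these defaults are placeholders.
    """
    kv_gb = (context_tokens / 1000) * kv_mb_per_1k_tokens / 1024
    return gguf_size_gb + kv_gb + overhead_gb

# Example: E4B Q4_K_M (4.98GB file) at 4K context
print(f"{estimate_vram_gb(4.98):.1f} GB")  # ~6.4 GB, fitting within the 8GB tier
```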
## Quick Summary
| VRAM | Good Fit | Avoid |
|---|---|---|
| 4GB | Low-bit E2B quantizations | E4B and above |
| 6GB | E2B Q4/Q5, low-bit E4B | 26B, 31B |
| 8GB | E2B Q8, E4B Q4/Q5 | 26B Q4, 31B Q4 |
| 12GB | E4B Q8; lossy 2-bit/3-bit 26B/31B experiments | 26B Q4 with long context, 31B Q4 |
| 16GB | Low-bit 26B, low-bit 31B | 31B Q4 with long context, 26B Q5 and above |
| 24GB | 26B Q4/Q5, 31B Q4 | 31B Q8, BF16 |
| 32GB | 26B Q6/Q8, 31B Q5/Q6 | BF16 |
| 48GB | 31B Q8 more comfortably, 26B Q8 with longer context | 31B BF16 |
| 80GB+ | 26B/31B BF16 | Single consumer GPU deployment |
If you just want something usable locally, start with E4B Q4_K_M or E2B Q4_K_M.
With 24GB VRAM, 26B A4B Q4_K_M and 31B Q4_K_M start to become realistic choices.
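For a first run, here is a minimal sketch using the `llama-cpp-python` bindings, assuming you have already downloaded an E4B Q4_K_M GGUF (the file name below is a placeholder for whatever build you grab):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical local path to an E4B Q4_K_M GGUF; substitute your own file.
llm = Llama(
    model_path="./gemma-4-e4b-it-Q4_K_M.gguf",
    n_ctx=4096,        # short-to-medium context keeps the KV cache small
    n_gpu_layers=-1,   # offload all layers to the GPU if they fit
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF size vs VRAM in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

`n_gpu_layers=-1` offloads every layer to the GPU; if you run out of VRAM, lowering it splits the model between GPU and CPU at the cost of speed.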
## Gemma 4 E2B VRAM Table
E2B is the lightest version, suitable for laptops, mini PCs, mobile devices, and low-VRAM testing.
It is easy to run, but it is limited for complex reasoning, coding, and long-context tasks.
| Quantization | GGUF File Size | Minimum VRAM | Safer VRAM | Best For |
|---|---|---|---|---|
| UD-IQ2_M | 2.29GB | 4GB | 6GB | Extreme low-VRAM tests |
| UD-Q2_K_XL | 2.40GB | 4GB | 6GB | Low-VRAM usability |
| Q3_K_M | 2.54GB | 4GB | 6GB | Lightweight chat and summaries |
| IQ4_XS | 2.98GB | 6GB | 8GB | Balance of quality and size |
| Q4_K_M | 3.11GB | 6GB | 8GB | Recommended E2B default |
| Q5_K_M | 3.36GB | 6GB | 8GB | Slightly steadier than Q4 |
| Q6_K | 4.50GB | 8GB | 10GB | Higher-quality small model |
| Q8_0 | 5.05GB | 8GB | 10GB | Near-original precision for lightweight deployment |
| BF16 | 9.31GB | 12GB | 16GB | Debugging, comparison, research |
For daily use, E2B Q4_K_M is already enough.
With only 4GB VRAM, 2-bit or 3-bit variants can work, but output quality will be less stable.
## Gemma 4 E4B VRAM Table
E4B is the more practical lightweight model.
Compared with E2B, it is better for everyday writing, document summaries, light coding assistance, and local assistant use.
| Quantization | GGUF File Size | Minimum VRAM | Safer VRAM | Best For |
|---|---|---|---|---|
| UD-IQ2_M | 3.53GB | 6GB | 8GB | Low-VRAM tests |
| UD-Q2_K_XL | 3.74GB | 6GB | 8GB | Low-VRAM usability |
| Q3_K_M | 4.06GB | 6GB | 10GB | Lightweight local assistant |
| IQ4_XS | 4.72GB | 8GB | 12GB | Balance of quality and speed |
| Q4_K_M | 4.98GB | 8GB | 12GB | Recommended E4B default |
| Q5_K_M | 5.48GB | 8GB | 12GB | Steadier everyday use |
| Q6_K | 7.07GB | 10GB | 16GB | Quality first |
| Q8_0 | 8.19GB | 12GB | 16GB | Near-original precision |
| BF16 | 15.05GB | 20GB | 24GB | Research, evaluation, precision comparison |
If your GPU has 8GB VRAM, E4B Q4_K_M is a realistic starting point.
With 12GB or 16GB VRAM, E4B Q8_0 is also worth considering.
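If you are not sure how much VRAM is actually free before loading a model, NVML can report it. A small sketch using the `pynvml` module (NVIDIA GPUs only; `nvidia-smi` shows the same numbers):

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)      # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)      # bytes: .free / .used / .total
print(f"free: {info.free / 1024**3:.1f} GB / total: {info.total / 1024**3:.1f} GB")
pynvml.nvmlShutdown()
```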
## Gemma 4 26B A4B VRAM Table
26B A4B is the MoE version. It has a larger total parameter count, but activates only a subset of its experts during inference.
It is better suited to more complex Q&A, coding, tool use, and agent workflows.
| Quantization | GGUF File Size | Minimum VRAM | Safer VRAM | Best For |
|---|---|---|---|---|
| UD-IQ2_M | 9.97GB | 14GB | 16GB | Extreme 16GB GPU tests |
| UD-Q2_K_XL | 10.55GB | 14GB | 16GB | Running 26B with low VRAM |
| UD-Q3_K_M | 12.53GB | 16GB | 20GB | Better quality while still VRAM-conscious |
| UD-IQ4_XS | 13.42GB | 16GB | 24GB | Balance of quality and size |
| UD-Q4_K_M | 16.87GB | 20GB | 24GB | Recommended 26B default |
| UD-Q5_K_M | 21.15GB | 24GB | 32GB | Higher-quality quantization |
| UD-Q6_K | 23.17GB | 28GB | 32GB | Quality first |
| Q8_0 | 26.86GB | 32GB | 40GB | Near-original precision |
| BF16 | 50.51GB | 64GB | 80GB | Not realistic for most single consumer GPUs |
24GB VRAM is the comfortable dividing line for 26B A4B: the UD-Q4_K_M file alone is about 16.9GB, and with KV cache and runtime overhead on top the total lands near 20GB, which is tight on anything smaller. A 16GB GPU can try the low-bit versions, but context length, concurrency, and multimodal input should all be kept modest.
## Gemma 4 31B VRAM Table
31B is the larger dense model.
Its strength is stronger overall capability, but because every parameter is active on each token, it puts more direct pressure on VRAM than 26B A4B.
| Quantization | GGUF File Size | Minimum VRAM | Safer VRAM | Best For |
|---|---|---|---|---|
| UD-IQ2_XXS | 8.53GB | 12GB | 16GB | Extreme low-VRAM tests with clear quality loss |
| UD-IQ2_M | 10.75GB | 14GB | 18GB | Low-VRAM tests |
| UD-Q2_K_XL | 11.77GB | 16GB | 20GB | 16GB GPU experiments |
| Q3_K_S | 13.21GB | 16GB | 24GB | More VRAM-efficient 3-bit |
| Q3_K_M | 14.74GB | 20GB | 24GB | Common 3-bit compromise |
| IQ4_XS | 16.37GB | 20GB | 24GB | Near-Q4 compromise |
| Q4_K_M | 18.32GB | 24GB | 32GB | Recommended 31B default |
| Q5_K_M | 21.66GB | 28GB | 32GB | Higher-quality quantization |
| Q6_K | 25.20GB | 32GB | 40GB | Quality first |
| Q8_0 | 32.64GB | 40GB | 48GB | Near-original precision |
| BF16 | 61.41GB | 80GB | 96GB | Server or large-VRAM workstation |
Low-bit 31B can be tested on a 16GB GPU, but for daily use, 24GB VRAM is a better starting point: the Q4_K_M file is about 18.3GB, leaving only a few GB of headroom for KV cache and overhead.
Q4_K_M is the balanced choice, while Q5_K_M and above make more sense with 32GB+ VRAM.
## Why Actual Usage Is Higher Than File Size
The GGUF file size is only the weight size. Runtime usage also includes:
- KV cache: longer context means higher memory use.
- Batch size and concurrency: processing more tokens or more users increases VRAM.
- Multimodal components: image, audio, or video input often requires `mmproj` or extra modules.
- Runtime backend: CUDA, Metal, ROCm, and CPU/GPU split loading behave differently.
- KV cache quantization: `q8_0`, `q4_0`, and similar modes can save VRAM, but may affect detail.
So the “minimum VRAM” column should be read as the threshold for startup and short-context inference. For 32K, 64K, 128K, or even 256K context, VRAM requirements rise significantly.
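To see why long context dominates this budget, the KV cache can be sized directly from the model architecture. A minimal sketch, assuming a hypothetical mid-size configuration (48 layers, 8 KV heads, head dimension 128; illustrative numbers, not Gemma 4's published architecture):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: float) -> float:
    """K and V each store n_layers * n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem / 1024**3

# Hypothetical config: 48 layers, 8 KV heads, head_dim 128
for ctx in (4_096, 32_768, 131_072):
    f16 = kv_cache_gb(48, 8, 128, ctx, 2.0)  # f16 cache: 2 bytes per element
    q8 = kv_cache_gb(48, 8, 128, ctx, 1.0)   # q8_0 cache: roughly half of f16
    print(f"{ctx:>7} tokens: f16 {f16:5.2f} GB, q8_0 {q8:5.2f} GB")
```

Under these assumptions the f16 cache grows from under 1GB at 4K tokens to roughly 24GB at 128K, which is exactly why the "minimum VRAM" figures only hold for short contexts.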
## How to Choose
If you just want to try Gemma 4 locally, these rules of thumb cover the common tiers (a small helper encoding them follows the list):
- 4GB to 6GB VRAM: choose `E2B Q3_K_M` or `E2B Q4_K_M`.
- 8GB VRAM: prefer `E4B Q4_K_M`; `E2B Q8_0` is also fine.
- 12GB VRAM: choose `E4B Q8_0`, or try low-bit 26B/31B variants.
- 16GB VRAM: try `26B A4B UD-Q3_K_M` or `31B Q3_K_S`, but do not expect long context to feel comfortable.
- 24GB VRAM: focus on `26B A4B UD-Q4_K_M` and `31B Q4_K_M`.
- 32GB and above: consider `Q5_K_M`, `Q6_K`, or longer context.
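The same tiers as a small helper function, purely a convenience sketch mirroring the list above rather than any official mapping:

```python
def pick_quant(vram_gb: float) -> str:
    """Map available VRAM to the rule-of-thumb picks from the list above."""
    if vram_gb >= 32:
        return "26B/31B Q5_K_M or Q6_K, or spend the headroom on longer context"
    if vram_gb >= 24:
        return "26B A4B UD-Q4_K_M or 31B Q4_K_M"
    if vram_gb >= 16:
        return "26B A4B UD-Q3_K_M or 31B Q3_K_S (keep context short)"
    if vram_gb >= 12:
        return "E4B Q8_0, or low-bit 26B/31B experiments"
    if vram_gb >= 8:
        return "E4B Q4_K_M (E2B Q8_0 also fine)"
    if vram_gb >= 4:
        return "E2B Q3_K_M or E2B Q4_K_M"
    return "below 4GB: CPU offload or a smaller model"

print(pick_quant(24.0))  # -> "26B A4B UD-Q4_K_M or 31B Q4_K_M"
```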
Most users do not need BF16. Local deployment is not about picking the largest file, but about balancing VRAM, speed, context length, and output quality.