<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Ollama on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/ollama/</link>
        <description>Recent content in Ollama on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sun, 05 Apr 2026 22:09:11 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/ollama/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>LLM Quantization Explained: How to Choose FP16, Q8, Q5, Q4, or Q2</title>
        <link>https://www.knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</link>
        <pubDate>Sun, 05 Apr 2026 22:09:11 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</guid>
        <description>&lt;p&gt;The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.&lt;br&gt;
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.&lt;/p&gt;
&lt;h2 id=&#34;what-is-quantization&#34;&gt;What Is Quantization
&lt;/h2&gt;&lt;p&gt;Quantization stores model weights in lower-bit formats (such as &lt;code&gt;Q8&lt;/code&gt; and &lt;code&gt;Q4&lt;/code&gt;) instead of higher-precision ones (such as &lt;code&gt;FP16&lt;/code&gt;), shrinking both the file on disk and the memory needed to run the model.&lt;/p&gt;
&lt;p&gt;A simple analogy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Original model: like a high-quality photo, clear but large.&lt;/li&gt;
&lt;li&gt;Quantized model: like a compressed photo, slightly less detail but lighter and faster.&lt;/li&gt;
&lt;/ul&gt;
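&lt;p&gt;The trade-off can be sketched in a few lines of Python. This is an illustrative single-scale quantizer, not the block-wise K-quants scheme real GGUF files use:&lt;/p&gt;

```python
# Minimal symmetric quantization sketch (illustrative only; real
# K-quants use per-block scales and mixed bit widths).

def quantize(weights, bits=8):
    """Map floats onto a signed integer grid with one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]

weights = [0.12, -0.53, 0.97, -0.08]
q8, s8 = quantize(weights, bits=8)
q4, s4 = quantize(weights, bits=4)
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))
# Fewer bits give a coarser grid, so the 4-bit round-trip error is larger.
```

&lt;p&gt;The fewer the bits, the coarser the grid the weights get snapped to, which is exactly the &amp;ldquo;compressed photo&amp;rdquo; effect above.&lt;/p&gt;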
&lt;h2 id=&#34;common-quantization-formats&#34;&gt;Common Quantization Formats
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th&gt;Precision / Bit Width&lt;/th&gt;
          &lt;th&gt;Size&lt;/th&gt;
          &lt;th&gt;Quality Loss&lt;/th&gt;
          &lt;th&gt;Recommended Use&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;FP16&lt;/td&gt;
          &lt;td&gt;16-bit float&lt;/td&gt;
          &lt;td&gt;Largest&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;Research, evaluation, max quality&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q8_0&lt;/td&gt;
          &lt;td&gt;8-bit integer&lt;/td&gt;
          &lt;td&gt;Larger&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;High-end PCs, quality + performance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q5_K_M&lt;/td&gt;
          &lt;td&gt;5-bit mixed&lt;/td&gt;
          &lt;td&gt;Medium&lt;/td&gt;
          &lt;td&gt;Slight&lt;/td&gt;
          &lt;td&gt;Daily driver, balanced choice&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;4-bit mixed&lt;/td&gt;
          &lt;td&gt;Smaller&lt;/td&gt;
          &lt;td&gt;Acceptable&lt;/td&gt;
          &lt;td&gt;General default, strong value&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q3_K_M&lt;/td&gt;
          &lt;td&gt;3-bit mixed&lt;/td&gt;
          &lt;td&gt;Very small&lt;/td&gt;
          &lt;td&gt;Noticeable&lt;/td&gt;
&lt;td&gt;Low-spec devices; getting it to run at all comes first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q2_K&lt;/td&gt;
          &lt;td&gt;2-bit mixed&lt;/td&gt;
          &lt;td&gt;Smallest&lt;/td&gt;
          &lt;td&gt;Significant&lt;/td&gt;
          &lt;td&gt;Extreme resource limits, fallback&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
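&lt;p&gt;To turn the table into concrete numbers, file size is roughly parameter count times bits per weight. The bits-per-weight values below are approximations I am assuming for GGUF-style formats (mixed formats store scales alongside weights, so effective bpw sits a little above the nominal bit count); always check the real file size:&lt;/p&gt;

```python
# Rough size estimate: parameters x effective bits-per-weight / 8.
# The bpw figures are assumed approximations, not exact specifications.
BPW = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7,
       "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 2.6}

def approx_size_gb(params_billion, fmt):
    """Approximate model file size in (decimal) GB."""
    return params_billion * BPW[fmt] / 8

# An 8B model: about 16 GB at FP16 but under 5 GB at Q4_K_M.
for fmt, bpw in BPW.items():
    print(fmt, round(approx_size_gb(8, fmt), 1), "GB")
```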
&lt;h2 id=&#34;quantization-naming-rules&#34;&gt;Quantization Naming Rules
&lt;/h2&gt;&lt;p&gt;Take &lt;code&gt;gemma-4:4b-q4_k_m&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gemma-4:4b&lt;/code&gt;: model name and parameter count (4 billion).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;q4&lt;/code&gt;: 4-bit quantization.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k&lt;/code&gt;: K-quants (an improved quantization method).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m&lt;/code&gt;: medium level (common options also include &lt;code&gt;s&lt;/code&gt;/small and &lt;code&gt;l&lt;/code&gt;/large).&lt;/li&gt;
&lt;/ul&gt;
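&lt;p&gt;The naming is regular enough to split mechanically. The helper below is a hypothetical illustration (not part of Ollama) that assumes the &lt;code&gt;name:size-quant&lt;/code&gt; pattern described above:&lt;/p&gt;

```python
def parse_tag(tag):
    """Split an Ollama-style tag such as 'gemma-4:4b-q4_k_m'.
    Hypothetical helper assuming the name:size-quant pattern."""
    name, _, rest = tag.partition(":")
    size, _, quant = rest.partition("-")
    bits = int(quant.lstrip("qQ").split("_")[0]) if quant else None
    return {"model": name, "params": size, "quant": quant, "bits": bits}

info = parse_tag("gemma-4:4b-q4_k_m")
# {'model': 'gemma-4', 'params': '4b', 'quant': 'q4_k_m', 'bits': 4}
```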
&lt;h2 id=&#34;quick-selection-by-vram&#34;&gt;Quick Selection by VRAM
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;RAM / VRAM&lt;/th&gt;
          &lt;th&gt;Recommended Quantization&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;4 GB&lt;/td&gt;
          &lt;td&gt;Q3_K_M / Q2_K&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;8 GB&lt;/td&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16 GB&lt;/td&gt;
          &lt;td&gt;Q5_K_M / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32 GB+&lt;/td&gt;
          &lt;td&gt;FP16 / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
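&lt;p&gt;The table collapses to a few thresholds. Here is the same rule of thumb as a sketch; the cutoffs mirror the table and are starting points, not hard limits:&lt;/p&gt;

```python
def pick_quant(vram_gb):
    """Map available RAM/VRAM to the quantization tiers suggested above."""
    if vram_gb >= 32:
        return ["FP16", "Q8_0"]
    if vram_gb >= 16:
        return ["Q5_K_M", "Q8_0"]
    if vram_gb >= 8:
        return ["Q4_K_M"]
    return ["Q3_K_M", "Q2_K"]
```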
&lt;p&gt;Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.&lt;/p&gt;
&lt;h2 id=&#34;practical-tips&#34;&gt;Practical Tips
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;Q4_K_M&lt;/code&gt; by default and test it on real tasks first.&lt;/li&gt;
&lt;li&gt;If response quality falls short, move up to &lt;code&gt;Q5_K_M&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If VRAM or speed is the main bottleneck, move down to &lt;code&gt;Q3_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use the same test set every time you switch quantization formats, so comparisons stay fair.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Quality first: &lt;code&gt;FP16&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Balance first: &lt;code&gt;Q5_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;General default: &lt;code&gt;Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Low-spec fallback: &lt;code&gt;Q3_K_M&lt;/code&gt; or &lt;code&gt;Q2_K&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key is not &amp;ldquo;bigger is always better&amp;rdquo;, but &amp;ldquo;the most stable and usable result under your hardware limits.&amp;rdquo;&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B</title>
        <link>https://www.knightli.com/en/2026/04/05/google-gemma-4-model-comparison/</link>
        <pubDate>Sun, 05 Apr 2026 08:30:00 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/05/google-gemma-4-model-comparison/</guid>
        <description>&lt;p&gt;Gemma 4 focuses on &lt;code&gt;multimodality&lt;/code&gt; and &lt;code&gt;local offline inference&lt;/code&gt;, with a full range from lightweight to high-performance models. For most local deployment users, the key is not choosing the largest model, but choosing the one that best matches hardware and task needs.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-model-comparison&#34;&gt;Gemma 4 Model Comparison
&lt;/h2&gt;&lt;blockquote&gt;
&lt;p&gt;The table below is for quick model selection. Actual performance and resource usage should be validated in your own environment.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Parameter Size&lt;/th&gt;
          &lt;th&gt;Positioning&lt;/th&gt;
          &lt;th&gt;Key Strengths&lt;/th&gt;
          &lt;th&gt;Main Limitations&lt;/th&gt;
          &lt;th&gt;Recommended Scenarios&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 2B&lt;/td&gt;
          &lt;td&gt;2B&lt;/td&gt;
          &lt;td&gt;Ultra-lightweight&lt;/td&gt;
          &lt;td&gt;Low latency, low resource usage, lowest deployment barrier&lt;/td&gt;
          &lt;td&gt;Limited performance on complex reasoning and long task chains&lt;/td&gt;
          &lt;td&gt;Mobile, IoT, lightweight Q&amp;amp;A, simple automation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 4B&lt;/td&gt;
          &lt;td&gt;4B&lt;/td&gt;
          &lt;td&gt;Lightweight enhanced&lt;/td&gt;
          &lt;td&gt;Stronger understanding and generation than 2B, still easy to deploy locally&lt;/td&gt;
          &lt;td&gt;Limited ceiling for heavy coding and complex agent tasks&lt;/td&gt;
          &lt;td&gt;Local assistant, basic document work, multilingual daily tasks&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 26B&lt;/td&gt;
          &lt;td&gt;26B&lt;/td&gt;
          &lt;td&gt;High-performance (MoE)&lt;/td&gt;
          &lt;td&gt;Better reasoning and tool use, suitable for production workflows&lt;/td&gt;
          &lt;td&gt;Significantly higher VRAM requirement and hardware threshold&lt;/td&gt;
          &lt;td&gt;Coding assistant, complex workflows, enterprise internal agents&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 31B&lt;/td&gt;
          &lt;td&gt;31B&lt;/td&gt;
          &lt;td&gt;High-performance (dense)&lt;/td&gt;
          &lt;td&gt;Best overall capability and stronger stability on complex tasks&lt;/td&gt;
          &lt;td&gt;Highest resource cost and tuning complexity&lt;/td&gt;
          &lt;td&gt;Advanced reasoning, complex coding tasks, heavy automation&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;how-to-choose-start-from-hardware-and-tasks&#34;&gt;How to Choose: Start from Hardware and Tasks
&lt;/h2&gt;&lt;p&gt;If your top concern is whether it runs smoothly, use this guideline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8 GB&lt;/code&gt; VRAM: prioritize &lt;code&gt;2B/4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;12 GB&lt;/code&gt; VRAM: prioritize &lt;code&gt;4B&lt;/code&gt; or quantized variants of larger models.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;24 GB&lt;/code&gt; VRAM: focus on &lt;code&gt;26B&lt;/code&gt;, and evaluate quantized &lt;code&gt;31B&lt;/code&gt; based on workload.&lt;/li&gt;
&lt;li&gt;Higher VRAM or multi-GPU: consider high-precision &lt;code&gt;31B&lt;/code&gt; setups.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prioritize stability and inference speed first, then scale up model size gradually.&lt;/p&gt;
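&lt;p&gt;The guideline above can be written down as one function. The thresholds are assumptions taken straight from the bullet list, and the right answer still depends on which quantization you run:&lt;/p&gt;

```python
def pick_gemma(vram_gb):
    """Encode the VRAM guideline above; starting points, not hard rules."""
    if vram_gb >= 24:
        return "26B, evaluating quantized 31B per workload"
    if vram_gb >= 12:
        return "4B or quantized variants of larger models"
    return "2B/4B"
```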
&lt;h2 id=&#34;four-typical-use-cases&#34;&gt;Four Typical Use Cases
&lt;/h2&gt;&lt;h3 id=&#34;1-local-general-assistant&#34;&gt;1) Local General Assistant
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;4B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: strong balance between cost and quality, suitable for long-running local use.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;2-coding-and-automation&#34;&gt;2) Coding and Automation
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;26B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: more stable in multi-step tasks, tool calls, and script generation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;3-advanced-reasoning-and-complex-agents&#34;&gt;3) Advanced Reasoning and Complex Agents
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;31B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: stronger robustness under complex context.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;4-edge-devices-and-lightweight-offline-use&#34;&gt;4) Edge Devices and Lightweight Offline Use
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;2B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: easiest to deploy on resource-constrained devices.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;deployment-suggestions-ollama&#34;&gt;Deployment Suggestions (Ollama)
&lt;/h2&gt;&lt;p&gt;A practical approach is to iterate in small steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;4B&lt;/code&gt; to establish a baseline (latency, memory, quality).&lt;/li&gt;
&lt;li&gt;Build a fixed test set from real tasks (for example, 20 common questions + 10 automation tasks).&lt;/li&gt;
&lt;li&gt;Compare &lt;code&gt;26B/31B&lt;/code&gt; against that set for accuracy, latency, and VRAM cost.&lt;/li&gt;
&lt;li&gt;Upgrade only when the gain is clear.&lt;/li&gt;
&lt;/ol&gt;
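&lt;p&gt;Steps 2 and 3 amount to a tiny evaluation harness. A sketch, with the model call injected as a plain callable so the same test set can be replayed against each candidate (a real &lt;code&gt;generate&lt;/code&gt; would wrap your local Ollama client; the stub below is only for illustration):&lt;/p&gt;

```python
import time

def evaluate(generate, test_set):
    """Replay a fixed test set, recording per-prompt accuracy and latency.
    'generate' is any callable mapping a prompt string to an answer string;
    injecting it keeps the harness model-agnostic, so it can be rerun
    unchanged against 4B, 26B, and 31B candidates."""
    results = []
    for prompt, expected in test_set:
        start = time.perf_counter()
        answer = generate(prompt)
        results.append({"prompt": prompt,
                        "correct": expected in answer,
                        "latency_s": time.perf_counter() - start})
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy, results

# Stub model, for illustration only.
stub = lambda p: "4" if "2+2" in p else "unsure"
acc, results = evaluate(stub, [("what is 2+2?", "4"),
                               ("capital of France?", "Paris")])
# acc == 0.5 for the stub
```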
&lt;p&gt;This avoids jumping to a large model too early and running into lag, low throughput, and maintenance overhead.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;The real value of Gemma 4 is not just larger parameter counts, but a practical model ladder from lightweight to high-performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For low-cost fast rollout: start with &lt;code&gt;2B/4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For production-grade local AI workflows: prioritize &lt;code&gt;26B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For advanced reasoning and heavy automation: move to &lt;code&gt;31B&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In most cases, the best Gemma 4 choice is not the biggest model, but the one with the best fit for your hardware and task goals.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
