LLM Quantization Explained: How to Choose FP16, Q8, Q5, Q4, or Q2

A practical guide to LLM quantization, common format differences, and VRAM-based model selection to balance quality, speed, and resource cost.

The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.

What Is Quantization

Quantization means compressing model parameters from higher-precision formats (such as FP16) into lower-bit formats (such as Q8 and Q4).

A simple analogy:

  • Original model: like a high-quality photo, clear but large.
  • Quantized model: like a compressed photo, slightly less detail but lighter and faster.
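The idea behind the analogy can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor int8 quantization, not how Q8_0 or the K-quants actually work (real GGUF schemes quantize in small blocks with per-block scales), but it shows the core trade: 4x less storage than float32 for a small, bounded round-trip error.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Restore approximate float values from the int8 codes."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# int8 storage is 4x smaller than float32; the worst-case
# round-trip error is half a quantization step (scale / 2).
print(f"max round-trip error: {np.abs(weights - restored).max():.4f}")
```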

Common Quantization Formats

| Quantization | Precision / Bit Width | Size | Quality Loss | Recommended Use |
|---|---|---|---|---|
| FP16 | 16-bit float | Largest | Almost none | Research, evaluation, max quality |
| Q8_0 | 8-bit integer | Larger | Almost none | High-end PCs, quality + performance |
| Q5_K_M | 5-bit mixed | Medium | Slight | Daily driver, balanced choice |
| Q4_K_M | 4-bit mixed | Smaller | Acceptable | General default, strong value |
| Q3_K_M | 3-bit mixed | Very small | Noticeable | Low-spec devices, getting it running first |
| Q2_K | 2-bit mixed | Smallest | Significant | Extreme resource limits, fallback |
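The size column can be estimated directly from the bit width. The sketch below uses approximate bits-per-weight figures for each format (mixed-precision K-quants average slightly above their nominal bit count, and the exact value varies by model architecture, so treat these numbers as rough estimates):

```python
# Approximate bits-per-weight for common GGUF formats (rough estimates;
# actual values vary slightly by model architecture).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
}

def estimate_size_gb(params_billion: float, fmt: str) -> float:
    """Rough file size: parameter count x bits per weight, in gigabytes."""
    bits = params_billion * 1e9 * BITS_PER_WEIGHT[fmt]
    return bits / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"7B model, {fmt}: ~{estimate_size_gb(7, fmt):.1f} GB")
```

A 7B model drops from about 14 GB at FP16 to roughly 4 GB at Q4_K_M, which is why 4-bit formats fit comfortably on 8 GB cards.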

Quantization Naming Rules

Take gemma-4:4b-q4_k_m as an example:

  • gemma-4:4b: model name and parameter scale.
  • q4: 4-bit quantization.
  • k: K-quants (an improved quantization method).
  • m: medium level (common options also include s/small and l/large).
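The naming rules above are regular enough to parse mechanically. Below is a hypothetical helper (`parse_quant_tag` is not part of any real library, and naming in the wild is not perfectly uniform) that splits a suffix like `q4_k_m` into its parts:

```python
import re

def parse_quant_tag(tag: str) -> dict:
    """Split a quantization suffix such as 'q4_k_m' or 'q8_0' into
    bit width, quantization family, and size variant."""
    m = re.fullmatch(r"q(\d)(?:_(k|0|1))?(?:_(s|m|l))?", tag.lower())
    if not m:
        raise ValueError(f"unrecognized quant tag: {tag}")
    return {
        "bits": int(m.group(1)),                       # e.g. 4 -> 4-bit
        "k_quant": m.group(2) == "k",                  # K-quants family?
        "size": {"s": "small", "m": "medium", "l": "large"}.get(m.group(3)),
    }

print(parse_quant_tag("q4_k_m"))
# {'bits': 4, 'k_quant': True, 'size': 'medium'}
```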

Quick Selection by VRAM

| RAM / VRAM | Recommended Quantization |
|---|---|
| 4 GB | Q3_K_M / Q2_K |
| 8 GB | Q4_K_M |
| 16 GB | Q5_K_M / Q8_0 |
| 32 GB+ | FP16 / Q8_0 |
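The selection table translates directly into a lookup. This is a small sketch that mirrors the rows above (the thresholds are the same rules of thumb, not hard limits; actual headroom also depends on context length and what else is using the GPU):

```python
def recommend_quant(vram_gb: float) -> list[str]:
    """Map available RAM/VRAM (in GB) to the suggested quantization
    formats from the selection table."""
    if vram_gb >= 32:
        return ["FP16", "Q8_0"]
    if vram_gb >= 16:
        return ["Q5_K_M", "Q8_0"]
    if vram_gb >= 8:
        return ["Q4_K_M"]
    return ["Q3_K_M", "Q2_K"]

print(recommend_quant(8))    # ['Q4_K_M']
print(recommend_quant(32))   # ['FP16', 'Q8_0']
```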

Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.

Practical Tips

  1. Start with Q4_K_M by default and test real tasks first.
  2. If response quality is not enough, move up to Q5_K_M or Q8_0.
  3. If VRAM or speed is the main bottleneck, move down to Q3_K_M.
  4. Use the same test set every time you switch quantization formats.
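Tip 4 is easy to automate. The harness below is a minimal sketch: `run_model` is a placeholder standing in for whatever backend you use (ollama, llama.cpp, an HTTP API), and the model names are illustrative examples. The point is that every quantization sees exactly the same prompts, so outputs can be compared side by side:

```python
def run_model(model: str, prompt: str) -> str:
    """Placeholder: swap in a real call to your inference backend."""
    return f"[{model}] answer to: {prompt}"

# Keep this list fixed across quantization switches.
TEST_SET = [
    "Summarize this paragraph in one sentence: ...",
    "Write a Python function that reverses a string.",
]

def compare_quants(models: list[str]) -> dict[str, list[str]]:
    """Run the SAME prompts against each quantization and collect
    the outputs side by side for manual comparison."""
    return {m: [run_model(m, p) for p in TEST_SET] for m in models}

results = compare_quants(["llama3:8b-q4_K_M", "llama3:8b-q5_K_M"])
for model, outputs in results.items():
    print(model, "->", len(outputs), "responses")
```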

Conclusion

  • Quality first: FP16 or Q8_0.
  • Balance first: Q5_K_M.
  • General default: Q4_K_M.
  • Low-spec fallback: Q3_K_M or Q2_K.

The key is not “bigger is always better”, but “the most stable and usable result under your hardware limits.”
