LLM Quantization Explained: How to Choose FP16, Q8, Q5, Q4, or Q2

A practical guide to LLM quantization, common format differences, and VRAM-based model selection to balance quality, speed, and resource cost.

The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.

What Is Quantization

Quantization means compressing model parameters from higher-precision formats (such as FP16) into lower-bit formats (such as Q8 and Q4).

A simple analogy:

  • Original model: like a high-quality photo, clear but large.
  • Quantized model: like a compressed photo, slightly less detail but lighter and faster.
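The idea behind the analogy can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor int8 quantization, not how Q8_0 or the K-quants actually work (real GGUF schemes quantize in small blocks with per-block scales), but it shows the core trade: 4x less storage than float32 for a small, bounded round-trip error.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Restore approximate float values from the int8 codes."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# int8 storage is 4x smaller than float32; the worst-case
# round-trip error is half a quantization step (scale / 2).
print(f"max round-trip error: {np.abs(weights - restored).max():.4f}")
```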

Common Quantization Formats

| Quantization | Precision / Bit Width | Size | Quality Loss | Recommended Use |
|---|---|---|---|---|
| FP16 | 16-bit float | Largest | Almost none | Research, evaluation, max quality |
| Q8_0 | 8-bit integer | Larger | Almost none | High-end PCs, quality + performance |
| Q5_K_M | 5-bit mixed | Medium | Slight | Daily driver, balanced choice |
| Q4_K_M | 4-bit mixed | Smaller | Acceptable | General default, strong value |
| Q3_K_M | 3-bit mixed | Very small | Noticeable | Low-spec devices, getting it running first |
| Q2_K | 2-bit mixed | Smallest | Significant | Extreme resource limits, fallback |
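The size column can be estimated directly from the bit width. The sketch below uses approximate bits-per-weight figures for each format (mixed-precision K-quants average slightly above their nominal bit count, and the exact value varies by model architecture, so treat these numbers as rough estimates):

```python
# Approximate bits-per-weight for common GGUF formats (rough estimates;
# actual values vary slightly by model architecture).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
}

def estimate_size_gb(params_billion: float, fmt: str) -> float:
    """Rough file size: parameter count x bits per weight, in gigabytes."""
    bits = params_billion * 1e9 * BITS_PER_WEIGHT[fmt]
    return bits / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"7B model, {fmt}: ~{estimate_size_gb(7, fmt):.1f} GB")
```

A 7B model drops from about 14 GB at FP16 to roughly 4 GB at Q4_K_M, which is why 4-bit formats fit comfortably on 8 GB cards.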

Quantization Naming Rules

Take gemma-4:4b-q4_k_m as an example:

  • gemma-4:4b: model name and parameter scale.
  • q4: 4-bit quantization.
  • k: K-quants (an improved quantization method).
  • m: medium level (common options also include s/small and l/large).
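The naming rules above are regular enough to parse mechanically. Below is a hypothetical helper (`parse_quant_tag` is not part of any real library, and naming in the wild is not perfectly uniform) that splits a suffix like `q4_k_m` into its parts:

```python
import re

def parse_quant_tag(tag: str) -> dict:
    """Split a quantization suffix such as 'q4_k_m' or 'q8_0' into
    bit width, quantization family, and size variant."""
    m = re.fullmatch(r"q(\d)(?:_(k|0|1))?(?:_(s|m|l))?", tag.lower())
    if not m:
        raise ValueError(f"unrecognized quant tag: {tag}")
    return {
        "bits": int(m.group(1)),                       # e.g. 4 -> 4-bit
        "k_quant": m.group(2) == "k",                  # K-quants family?
        "size": {"s": "small", "m": "medium", "l": "large"}.get(m.group(3)),
    }

print(parse_quant_tag("q4_k_m"))
# {'bits': 4, 'k_quant': True, 'size': 'medium'}
```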

Quick Selection by VRAM

| RAM / VRAM | Recommended Quantization |
|---|---|
| 4 GB | Q3_K_M / Q2_K |
| 8 GB | Q4_K_M |
| 16 GB | Q5_K_M / Q8_0 |
| 32 GB+ | FP16 / Q8_0 |
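The selection table translates directly into a lookup. This is a small sketch that mirrors the rows above (the thresholds are the same rules of thumb, not hard limits; actual headroom also depends on context length and what else is using the GPU):

```python
def recommend_quant(vram_gb: float) -> list[str]:
    """Map available RAM/VRAM (in GB) to the suggested quantization
    formats from the selection table."""
    if vram_gb >= 32:
        return ["FP16", "Q8_0"]
    if vram_gb >= 16:
        return ["Q5_K_M", "Q8_0"]
    if vram_gb >= 8:
        return ["Q4_K_M"]
    return ["Q3_K_M", "Q2_K"]

print(recommend_quant(8))    # ['Q4_K_M']
print(recommend_quant(32))   # ['FP16', 'Q8_0']
```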

Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.

Practical Tips

  1. Start with Q4_K_M by default and test real tasks first.
  2. If response quality is not enough, move up to Q5_K_M or Q8_0.
  3. If VRAM or speed is the main bottleneck, move down to Q3_K_M.
  4. Use the same test set every time you switch quantization formats.
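Tip 4 is easy to automate. The harness below is a minimal sketch: `run_model` is a placeholder standing in for whatever backend you use (ollama, llama.cpp, an HTTP API), and the model names are illustrative examples. The point is that every quantization sees exactly the same prompts, so outputs can be compared side by side:

```python
def run_model(model: str, prompt: str) -> str:
    """Placeholder: swap in a real call to your inference backend."""
    return f"[{model}] answer to: {prompt}"

# Keep this list fixed across quantization switches.
TEST_SET = [
    "Summarize this paragraph in one sentence: ...",
    "Write a Python function that reverses a string.",
]

def compare_quants(models: list[str]) -> dict[str, list[str]]:
    """Run the SAME prompts against each quantization and collect
    the outputs side by side for manual comparison."""
    return {m: [run_model(m, p) for p in TEST_SET] for m in models}

results = compare_quants(["llama3:8b-q4_K_M", "llama3:8b-q5_K_M"])
for model, outputs in results.items():
    print(model, "->", len(outputs), "responses")
```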

Conclusion

  • Quality first: FP16 or Q8_0.
  • Balance first: Q5_K_M.
  • General default: Q4_K_M.
  • Low-spec fallback: Q3_K_M or Q2_K.

The key is not “bigger is always better”, but “the most stable and usable result under your hardware limits.”
