The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.
## What Is Quantization
Quantization means compressing model parameters from higher-precision formats (such as FP16) into lower-bit formats (such as Q8 and Q4).
A simple analogy:
- Original model: like a high-quality photo, clear but large.
- Quantized model: like a compressed photo, slightly less detail but lighter and faster.
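The round-trip behind that analogy can be sketched in a few lines of Python. This is a toy symmetric 8-bit quantizer for illustration only; real formats like Q4_K_M quantize block-wise with per-block scales, and the `quantize`/`dequantize` names here are made up for the sketch.

```python
# Toy symmetric quantization: map floats to small integers and back.
# Real GGUF formats work block-wise with per-block scales; this sketch
# only shows the core round-trip and the precision loss it introduces.

def quantize(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1                    # e.g. 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax   # one scale for all weights
    q = [round(w / scale) for w in weights]       # store these small ints
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.53, 0.98, -0.07]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Each restored value is close to, but not exactly, the original:
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Lower bit widths shrink storage but coarsen the grid the weights are snapped to, which is exactly the size-versus-quality trade-off in the table below.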
## Common Quantization Formats
| Quantization | Precision / Bit Width | Size | Quality Loss | Recommended Use |
|---|---|---|---|---|
| FP16 | 16-bit float | Largest | Almost none | Research, evaluation, max quality |
| Q8_0 | 8-bit integer | Larger | Almost none | High-end PCs, quality + performance |
| Q5_K_M | 5-bit mixed | Medium | Slight | Daily driver, balanced choice |
| Q4_K_M | 4-bit mixed | Smaller | Acceptable | General default, strong value |
| Q3_K_M | 3-bit mixed | Very small | Noticeable | Low-spec devices, run-first |
| Q2_K | 2-bit mixed | Smallest | Significant | Extreme resource limits, fallback |
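The "Size" column follows from a back-of-the-envelope rule: file size scales with parameters times bits per weight. The sketch below ignores per-block scale metadata and layers kept at higher precision, so real files run somewhat larger; treating Q4_K_M as roughly 4.5 effective bits per weight is an assumption, not an official figure.

```python
def approx_size_gb(n_params_billion, bits_per_weight):
    """Rough size estimate: parameters x bits, converted to gigabytes.

    Ignores quantization metadata (per-block scales) and layers kept
    at higher precision, so actual files are somewhat larger.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at a few bit widths (4.5 bits for Q4_K_M is an assumption):
for label, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5)]:
    print(f"{label}: ~{approx_size_gb(7, bits):.1f} GB")
```

Even as a crude estimate, this makes the table's ordering concrete: halving the bit width roughly halves the download and the memory footprint.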
## Quantization Naming Rules
Take `gemma-4:4b-q4_k_m` as an example:

- `gemma-4:4b`: model name and parameter scale.
- `q4`: 4-bit quantization.
- `k`: K-quants (an improved quantization method).
- `m`: medium level (other common options are `s`/small and `l`/large).
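These suffixes are regular enough to split mechanically. `parse_quant_tag` below is a hypothetical helper written for this article, not part of any tool, and it only handles the `q<bits>[_k][_s|_m|_l]` pattern described above (not variants like `q8_0`).

```python
import re

def parse_quant_tag(tag):
    """Split a quantization tag such as 'q4_k_m' into its parts.

    Hypothetical helper for illustration; returns None for tags
    outside the q<bits>[_k][_s|_m|_l] pattern (e.g. 'q8_0').
    """
    m = re.fullmatch(r"q(\d)(_k)?(?:_([sml]))?", tag.lower())
    if not m:
        return None
    bits, k, level = m.groups()
    return {
        "bits": int(bits),
        "k_quant": k is not None,
        "level": {"s": "small", "m": "medium", "l": "large"}.get(level),
    }

print(parse_quant_tag("q4_K_M"))
# {'bits': 4, 'k_quant': True, 'level': 'medium'}
```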
## Quick Selection by VRAM
| RAM / VRAM | Recommended Quantization |
|---|---|
| 4 GB | Q3_K_M / Q2_K |
| 8 GB | Q4_K_M |
| 16 GB | Q5_K_M / Q8_0 |
| 32 GB+ | FP16 / Q8_0 |
Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.
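The table's thresholds can be encoded directly. `pick_quant` is an illustrative helper following the same tiers; in practice the right choice also depends on model size and context length.

```python
def pick_quant(vram_gb):
    """Suggest quantization tiers for a given amount of RAM/VRAM,
    following the thresholds in the table above. Illustrative only:
    model size and context length also affect what actually fits."""
    if vram_gb >= 32:
        return ["FP16", "Q8_0"]
    if vram_gb >= 16:
        return ["Q5_K_M", "Q8_0"]
    if vram_gb >= 8:
        return ["Q4_K_M"]
    return ["Q3_K_M", "Q2_K"]

print(pick_quant(8))   # ['Q4_K_M']
print(pick_quant(16))  # ['Q5_K_M', 'Q8_0']
```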
## Practical Tips
- Start with `Q4_K_M` by default and test real tasks first.
- If response quality is not enough, move up to `Q5_K_M` or `Q8_0`.
- If VRAM or speed is the main bottleneck, move down to `Q3_K_M`.
- Use the same test set every time you switch quantization formats.
## Conclusion
- Quality first: `FP16` or `Q8_0`.
- Balance first: `Q5_K_M`.
- General default: `Q4_K_M`.
- Low-spec fallback: `Q3_K_M` or `Q2_K`.
The key is not "bigger is always better," but "the most stable, usable result within your hardware limits."