How to Use llama-quantize for GGUF Models

A short introduction to what llama-quantize does, its basic commands, common options, and the tradeoffs between model size, speed, and quality.

llama-quantize is the quantization tool in llama.cpp. It is used to convert high-precision GGUF models into smaller quantized versions.

Its most common use is turning high-precision formats such as F32, BF16, or F16 into quantized versions like Q4_K_M, Q5_K_M, or Q8_0 that are easier to run locally. Quantized models are much smaller and often faster at inference, but some quality loss is expected.

Basic workflow

A typical workflow is to prepare the original model, convert it to GGUF, and then run quantization.

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert_hf_to_gguf.py ./models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

After that, you can run the quantized model with llama-cli:

# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"

Common options

  • --allow-requantize: allows requantizing an already quantized model; requantizing compounds rounding error, so quality usually suffers compared to quantizing from the original precision
  • --leave-output-tensor: keeps the output layer unquantized, increasing size but sometimes helping quality
  • --pure: disables mixed quantization and uses a more uniform quant type
  • --imatrix <file>: uses an importance matrix (produced with the llama-imatrix tool) to improve quantization quality, which helps most at low bit widths
  • --keep-split: keeps the original shard layout instead of producing one merged file
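Of these, --imatrix usually has the largest effect on quality. A minimal sketch of how the two tools fit together; the model paths and calibration.txt are placeholder names, not files shipped with llama.cpp:

```shell
# Sketch: imatrix-assisted quantization (paths are placeholders).

# 1. Compute an importance matrix from representative calibration text.
./llama-imatrix -m ./models/mymodel/ggml-model-f16.gguf \
    -f calibration.txt -o imatrix.dat

# 2. Quantize, letting the importance matrix decide which weights
#    keep more precision.
./llama-quantize --imatrix imatrix.dat \
    ./models/mymodel/ggml-model-f16.gguf \
    ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
```

The calibration text matters: it should resemble the text you actually plan to run, since the matrix records which weights are most active on that input.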

If you just want a practical starting point, this is often enough:

./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

How to choose a quant

You can think of quant levels as a tradeoff between size, speed, and quality:

  • Q8_0: larger, but usually safer for quality
  • Q6_K / Q5_K_M: common balanced choices
  • Q4_K_M: a very common default with a good size-quality balance
  • Q3 / Q2: useful when hardware is very limited, but quality loss is more visible
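One practical way to decide is empirical: quantize the same F16 model at several levels and compare the resulting file sizes against your available RAM or VRAM. A sketch, assuming the model path from the workflow above:

```shell
# Sketch: produce several quant levels from one F16 GGUF and compare
# their sizes on disk; adjust the input path to your own model.
src=./models/mymodel/ggml-model-f16.gguf
for q in Q8_0 Q6_K Q5_K_M Q4_K_M; do
  ./llama-quantize "$src" "./models/mymodel/ggml-model-$q.gguf" "$q"
done

# Pick the largest file that fits comfortably in memory,
# leaving headroom for the KV cache and context.
ls -lh ./models/mymodel/ggml-model-*.gguf
```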

The practical goal is usually not to pick the biggest quant you can fit, but the one that runs reliably on your hardware while keeping acceptable quality.

Practical takeaway

  • start with Q4_K_M or Q5_K_M
  • move up to Q6_K or Q8_0 if quality matters more
  • move down to Q3 or Q2 if memory is tight
  • compare versions with the same prompt set
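For the last point, the llama-perplexity tool in llama.cpp gives a rough, repeatable quality signal: lower perplexity on the same text means less quantization damage. A sketch, where eval.txt is a placeholder for any held-out text file of your own:

```shell
# Sketch: score two quant levels on the same evaluation text.
# eval.txt is a placeholder; use any representative held-out text.
for q in Q8_0 Q4_K_M; do
  echo "== $q =="
  ./llama-perplexity -m "./models/mymodel/ggml-model-$q.gguf" -f eval.txt
done
```

Perplexity is only a proxy, so it is still worth eyeballing outputs from your own prompt set side by side.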

In short, llama-quantize is useful because it makes GGUF models easier to run on local hardware, not just because it makes files smaller.
