llama-quantize is the quantization tool in llama.cpp. It is used to convert high-precision GGUF models into smaller quantized versions.
Its most common use is turning full-precision formats such as F32, BF16, or F16 into quantized types like Q4_K_M, Q5_K_M, or Q8_0 that are easier to run locally. Quantized models are usually much smaller and often faster at inference, at the cost of some quality loss.
Basic workflow
A typical workflow is to prepare the original model, convert it to GGUF, and then run quantization.
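A minimal sketch of those steps, assuming a Hugging Face model directory at ./my-model; file names and paths are illustrative:

```shell
# 1. Convert the original model to a full-precision GGUF file
#    (convert_hf_to_gguf.py ships with llama.cpp)
python convert_hf_to_gguf.py ./my-model --outtype f16 --outfile my-model-f16.gguf

# 2. Quantize the F16 GGUF down to Q4_K_M
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```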
After that, you can run the quantized model with llama-cli:
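For example, assuming the quantized file produced above (the prompt is illustrative):

```shell
# Load the quantized model and run a short prompt
./llama-cli -m my-model-Q4_K_M.gguf -p "Explain quantization in one sentence."
```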
Common options
- --allow-requantize: allows requantizing an already quantized model; usually not ideal for quality
- --leave-output-tensor: keeps the output layer unquantized, increasing size but sometimes helping quality
- --pure: disables mixed quantization and uses a more uniform quant type
- --imatrix: uses an importance matrix to improve quantization quality
- --keep-split: keeps the original shard layout instead of producing one merged file
If you just want a practical starting point, this is often enough:
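A sketch of that starting point, with and without an importance matrix; the imatrix file is assumed to have been generated beforehand with llama.cpp's llama-imatrix tool:

```shell
# Plain Q4_K_M quantization: a reasonable default for most local setups
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M

# The same, guided by an importance matrix (generated separately)
# for somewhat better quality at the same size
./llama-quantize --imatrix imatrix.dat my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```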
How to choose a quant
You can think of quant levels as a tradeoff between size, speed, and quality:
- Q8_0: larger, but usually safer for quality
- Q6_K / Q5_K_M: common balanced choices
- Q4_K_M: a very common default with a good size-quality balance
- Q3 / Q2: useful when hardware is very limited, but quality loss is more visible
The practical goal is usually not to pick the biggest quant you can fit, but the one that runs reliably on your hardware while keeping acceptable quality.
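To make the size side of the tradeoff concrete, here is a rough estimate for a 7B-parameter model. The bits-per-weight figures are approximations (mixed quants vary by architecture), not exact llama.cpp numbers:

```shell
# Approximate file size in GB: parameters * bits-per-weight / 8.
# The bpw values below are rough averages for each quant type.
PARAMS=7000000000
for entry in "F16 16.0" "Q8_0 8.5" "Q6_K 6.6" "Q5_K_M 5.7" "Q4_K_M 4.9" "Q2_K 2.6"; do
  set -- $entry
  awk -v q="$1" -v bpw="$2" -v n="$PARAMS" \
      'BEGIN { printf "%-7s ~%.1f GB\n", q, n * bpw / 8 / 1e9 }'
done
```

So a 7B model that needs roughly 14 GB at F16 drops to roughly 4-5 GB at Q4_K_M, which is what makes it fit on common consumer GPUs.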
Practical takeaway
- start with Q4_K_M or Q5_K_M
- move up to Q6_K or Q8_0 if quality matters more
- move down to Q3 or Q2 if memory is tight
- compare versions with the same prompt set
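One simple way to run that comparison, assuming both quantized files exist and using llama.cpp's llama-perplexity tool on a text file of your own prompts (lower perplexity is better):

```shell
# Same input text, two quant levels; compare the reported perplexity
./llama-perplexity -m my-model-Q4_K_M.gguf -f prompts.txt
./llama-perplexity -m my-model-Q8_0.gguf -f prompts.txt
```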
In short, llama-quantize is useful because it makes GGUF models easier to run on local hardware, not just because it makes files smaller.