llama-quantize is the quantization tool in llama.cpp. It is used to convert high-precision GGUF models into smaller quantized versions.
Its most common use is turning full-precision formats such as F32, BF16, or F16 into quantized types like Q4_K_M, Q5_K_M, or Q8_0 that are easier to run locally. Quantized models are usually much smaller and often faster at inference, at the cost of some quality loss.
Basic workflow
A typical workflow is to prepare the original model, convert it to GGUF, and then run quantization.
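A minimal sketch of those steps, assuming a Hugging Face model directory at ./my-model; file names and paths are illustrative:

```shell
# 1. Convert the original model to a full-precision GGUF file
#    (convert_hf_to_gguf.py ships with llama.cpp)
python convert_hf_to_gguf.py ./my-model --outtype f16 --outfile my-model-f16.gguf

# 2. Quantize the F16 GGUF down to Q4_K_M
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```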
After that, you can run the quantized model with llama-cli:
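For example, assuming the quantized file produced above (the prompt is illustrative):

```shell
# Load the quantized model and run a short prompt
./llama-cli -m my-model-Q4_K_M.gguf -p "Explain quantization in one sentence."
```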
Common options
- --allow-requantize: allows requantizing an already quantized model; usually not ideal for quality
- --leave-output-tensor: keeps the output layer unquantized, increasing size but sometimes helping quality
- --pure: disables mixed quantization and uses a more uniform quant type
- --imatrix: uses an importance matrix to improve quantization quality
- --keep-split: keeps the original shard layout instead of producing one merged file
If you just want a practical starting point, this is often enough:
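A sketch of that starting point, with and without an importance matrix; the imatrix file is assumed to have been generated beforehand with llama.cpp's llama-imatrix tool:

```shell
# Plain Q4_K_M quantization: a reasonable default for most local setups
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M

# The same, guided by an importance matrix (generated separately)
# for somewhat better quality at the same size
./llama-quantize --imatrix imatrix.dat my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```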
How to choose a quant
You can think of quant levels as a tradeoff between size, speed, and quality:
- Q8_0: larger, but usually safer for quality
- Q6_K / Q5_K_M: common balanced choices
- Q4_K_M: a very common default with a good size-quality balance
- Q3 / Q2: useful when hardware is very limited, but quality loss is more visible
The practical goal is usually not to pick the biggest quant you can fit, but the one that runs reliably on your hardware while keeping acceptable quality.
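To make the size side of the tradeoff concrete, here is a rough estimate for a 7B-parameter model. The bits-per-weight figures are approximations (mixed quants vary by architecture), not exact llama.cpp numbers:

```shell
# Approximate file size in GB: parameters * bits-per-weight / 8.
# The bpw values below are rough averages for each quant type.
PARAMS=7000000000
for entry in "F16 16.0" "Q8_0 8.5" "Q6_K 6.6" "Q5_K_M 5.7" "Q4_K_M 4.9" "Q2_K 2.6"; do
  set -- $entry
  awk -v q="$1" -v bpw="$2" -v n="$PARAMS" \
      'BEGIN { printf "%-7s ~%.1f GB\n", q, n * bpw / 8 / 1e9 }'
done
```

So a 7B model that needs roughly 14 GB at F16 drops to roughly 4-5 GB at Q4_K_M, which is what makes it fit on common consumer GPUs.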
Practical takeaway
- start with Q4_K_M or Q5_K_M
- move up to Q6_K or Q8_0 if quality matters more
- move down to Q3 or Q2 if memory is tight
- compare versions with the same prompt set
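One simple way to run that comparison, assuming both quantized files exist and using llama.cpp's llama-perplexity tool on a text file of your own prompts (lower perplexity is better):

```shell
# Same input text, two quant levels; compare the reported perplexity
./llama-perplexity -m my-model-Q4_K_M.gguf -f prompts.txt
./llama-perplexity -m my-model-Q8_0.gguf -f prompts.txt
```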
In short, llama-quantize is useful because it makes GGUF models easier to run on local hardware, not just because it makes files smaller.