<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Model Conversion on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/model-conversion/</link>
        <description>Recent content in Model Conversion on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sun, 12 Apr 2026 09:42:36 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/model-conversion/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>How to Use llama-quantize for GGUF Models</title>
        <link>https://www.knightli.com/en/2026/04/12/llama-quantize-gguf-guide/</link>
        <pubDate>Sun, 12 Apr 2026 09:42:36 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/12/llama-quantize-gguf-guide/</guid>
        <description>&lt;p&gt;&lt;code&gt;llama-quantize&lt;/code&gt; is the quantization tool in &lt;code&gt;llama.cpp&lt;/code&gt;. It is used to convert high-precision &lt;code&gt;GGUF&lt;/code&gt; models into smaller quantized versions.&lt;/p&gt;
&lt;p&gt;Its most common use is turning high-precision formats such as &lt;code&gt;F32&lt;/code&gt;, &lt;code&gt;BF16&lt;/code&gt;, or &lt;code&gt;F16&lt;/code&gt; into quantized versions like &lt;code&gt;Q4_K_M&lt;/code&gt;, &lt;code&gt;Q5_K_M&lt;/code&gt;, or &lt;code&gt;Q8_0&lt;/code&gt; that are easier to run locally. Quantized models are usually much smaller and often faster at inference, but some quality loss is expected.&lt;/p&gt;
&lt;h2 id=&#34;basic-workflow&#34;&gt;Basic workflow
&lt;/h2&gt;&lt;p&gt;A typical workflow is to prepare the original model, convert it to GGUF, and then run quantization.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# install Python dependencies&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python3 -m pip install -r requirements.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# convert the model to ggml FP16 format&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python3 convert_hf_to_gguf.py ./models/mymodel/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# quantize the model to 4-bits (using Q4_K_M method)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After that, you can run the quantized model with &lt;code&gt;llama-cli&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# start inference on a gguf model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p &lt;span class=&#34;s2&#34;&gt;&amp;#34;You are a helpful assistant&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;common-options&#34;&gt;Common options
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--allow-requantize&lt;/code&gt;: allows requantizing a model that is already quantized; convenient, but quality usually suffers compared with quantizing from the original high-precision file&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--leave-output-tensor&lt;/code&gt;: leaves the output tensor unquantized, increasing file size but sometimes helping quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--pure&lt;/code&gt;: disables the default mixed scheme and quantizes all tensors to the same type&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--imatrix &amp;lt;file&amp;gt;&lt;/code&gt;: uses importance matrix data from the given file to improve quantization quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--keep-split&lt;/code&gt;: keeps the original shard layout instead of merging everything into one file&lt;/li&gt;
&lt;/ul&gt;
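&lt;p&gt;As an illustration, &lt;code&gt;--imatrix&lt;/code&gt; and &lt;code&gt;--leave-output-tensor&lt;/code&gt; can be combined in a single run. This is a sketch, not a recipe: it assumes an importance matrix file has already been generated beforehand (for example with &lt;code&gt;llama-imatrix&lt;/code&gt;), and the paths are placeholders.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# quantize using an importance matrix, keeping the output tensor unquantized
./llama-quantize --imatrix ./models/mymodel/imatrix.dat --leave-output-tensor \
    ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/code&gt;&lt;/pre&gt;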
&lt;p&gt;If you just want a practical starting point, this is often enough:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;how-to-choose-a-quant&#34;&gt;How to choose a quant
&lt;/h2&gt;&lt;p&gt;You can think of quant levels as a tradeoff between size, speed, and quality:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q8_0&lt;/code&gt;: larger, but usually safer for quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K&lt;/code&gt; / &lt;code&gt;Q5_K_M&lt;/code&gt;: common balanced choices&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;: a very common default with a good size-quality balance&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3_K_M&lt;/code&gt; / &lt;code&gt;Q2_K&lt;/code&gt;: useful when hardware is very limited, but quality loss is more visible&lt;/li&gt;
&lt;/ul&gt;
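&lt;p&gt;One practical way to see the size side of this tradeoff is to produce several quants from the same &lt;code&gt;F16&lt;/code&gt; source file and compare them directly (a sketch; the model paths are placeholders):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# build several quant levels from the same source model
for q in Q8_0 Q5_K_M Q4_K_M; do
    ./llama-quantize ./models/mymodel/ggml-model-f16.gguf \
        ./models/mymodel/ggml-model-${q}.gguf ${q}
done

# compare the resulting file sizes
ls -lh ./models/mymodel/ggml-model-*.gguf
&lt;/code&gt;&lt;/pre&gt;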
&lt;p&gt;The practical goal is usually not to pick the biggest quant you can fit, but the one that runs reliably on your hardware while keeping acceptable quality.&lt;/p&gt;
&lt;h2 id=&#34;practical-takeaway&#34;&gt;Practical takeaway
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;start with &lt;code&gt;Q4_K_M&lt;/code&gt; or &lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;move up to &lt;code&gt;Q6_K&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt; if quality matters more&lt;/li&gt;
&lt;li&gt;move down to &lt;code&gt;Q3&lt;/code&gt; or &lt;code&gt;Q2&lt;/code&gt; if memory is tight&lt;/li&gt;
&lt;li&gt;compare versions with the same prompt set&lt;/li&gt;
&lt;/ul&gt;
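&lt;p&gt;For the quality side of the comparison, &lt;code&gt;llama.cpp&lt;/code&gt; also ships a &lt;code&gt;llama-perplexity&lt;/code&gt; tool that scores a model on a fixed text file, which is more objective than eyeballing chat outputs. A sketch, with the evaluation file as a placeholder; lower perplexity on the same text generally means less quality loss:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# score two quant levels on the same evaluation text
./llama-perplexity -m ./models/mymodel/ggml-model-Q4_K_M.gguf -f ./wiki.test.raw
./llama-perplexity -m ./models/mymodel/ggml-model-Q8_0.gguf -f ./wiki.test.raw
&lt;/code&gt;&lt;/pre&gt;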
&lt;p&gt;In short, &lt;code&gt;llama-quantize&lt;/code&gt; is useful because it makes GGUF models easier to run on local hardware, not just because it makes files smaller.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
