<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Ollama on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/ollama/</link>
        <description>Recent content in Ollama on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sun, 05 Apr 2026 22:09:11 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/ollama/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>LLM Quantization Explained: How to Choose FP16, Q8, Q5, Q4, or Q2</title>
        <link>https://www.knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</link>
        <pubDate>Sun, 05 Apr 2026 22:09:11 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</guid>
        <description>&lt;p&gt;The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.&lt;br&gt;
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.&lt;/p&gt;
&lt;h2 id=&#34;what-is-quantization&#34;&gt;What Is Quantization
&lt;/h2&gt;&lt;p&gt;Quantization stores model weights in lower-bit formats (such as &lt;code&gt;Q8&lt;/code&gt; and &lt;code&gt;Q4&lt;/code&gt;) instead of higher-precision ones (such as &lt;code&gt;FP16&lt;/code&gt;), shrinking both the file on disk and the memory needed to run the model.&lt;/p&gt;
&lt;p&gt;A simple analogy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Original model: like a high-quality photo, clear but large.&lt;/li&gt;
&lt;li&gt;Quantized model: like a compressed photo, slightly less detail but lighter and faster.&lt;/li&gt;
&lt;/ul&gt;
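&lt;p&gt;The trade-off can be sketched in a few lines of Python. This is an illustrative single-scale quantizer, not the block-wise K-quants scheme real GGUF files use:&lt;/p&gt;

```python
# Minimal symmetric quantization sketch (illustrative only; real
# K-quants use per-block scales and mixed bit widths).

def quantize(weights, bits=8):
    """Map floats onto a signed integer grid with one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]

weights = [0.12, -0.53, 0.97, -0.08]
q8, s8 = quantize(weights, bits=8)
q4, s4 = quantize(weights, bits=4)
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))
# Fewer bits give a coarser grid, so the 4-bit round-trip error is larger.
```

&lt;p&gt;The fewer the bits, the coarser the grid the weights get snapped to, which is exactly the &amp;ldquo;compressed photo&amp;rdquo; effect above.&lt;/p&gt;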
&lt;h2 id=&#34;common-quantization-formats&#34;&gt;Common Quantization Formats
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th&gt;Precision / Bit Width&lt;/th&gt;
          &lt;th&gt;Size&lt;/th&gt;
          &lt;th&gt;Quality Loss&lt;/th&gt;
          &lt;th&gt;Recommended Use&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;FP16&lt;/td&gt;
          &lt;td&gt;16-bit float&lt;/td&gt;
          &lt;td&gt;Largest&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;Research, evaluation, max quality&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q8_0&lt;/td&gt;
          &lt;td&gt;8-bit integer&lt;/td&gt;
          &lt;td&gt;Larger&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;High-end PCs, quality + performance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q5_K_M&lt;/td&gt;
          &lt;td&gt;5-bit mixed&lt;/td&gt;
          &lt;td&gt;Medium&lt;/td&gt;
          &lt;td&gt;Slight&lt;/td&gt;
          &lt;td&gt;Daily driver, balanced choice&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;4-bit mixed&lt;/td&gt;
          &lt;td&gt;Smaller&lt;/td&gt;
          &lt;td&gt;Acceptable&lt;/td&gt;
          &lt;td&gt;General default, strong value&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q3_K_M&lt;/td&gt;
          &lt;td&gt;3-bit mixed&lt;/td&gt;
          &lt;td&gt;Very small&lt;/td&gt;
          &lt;td&gt;Noticeable&lt;/td&gt;
&lt;td&gt;Low-spec devices; getting it to run at all comes first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q2_K&lt;/td&gt;
          &lt;td&gt;2-bit mixed&lt;/td&gt;
          &lt;td&gt;Smallest&lt;/td&gt;
          &lt;td&gt;Significant&lt;/td&gt;
          &lt;td&gt;Extreme resource limits, fallback&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
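&lt;p&gt;To turn the table into concrete numbers, file size is roughly parameter count times bits per weight. The bits-per-weight values below are approximations I am assuming for GGUF-style formats (mixed formats store scales alongside weights, so effective bpw sits a little above the nominal bit count); always check the real file size:&lt;/p&gt;

```python
# Rough size estimate: parameters x effective bits-per-weight / 8.
# The bpw figures are assumed approximations, not exact specifications.
BPW = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7,
       "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 2.6}

def approx_size_gb(params_billion, fmt):
    """Approximate model file size in (decimal) GB."""
    return params_billion * BPW[fmt] / 8

# An 8B model: about 16 GB at FP16 but under 5 GB at Q4_K_M.
for fmt, bpw in BPW.items():
    print(fmt, round(approx_size_gb(8, fmt), 1), "GB")
```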
&lt;h2 id=&#34;quantization-naming-rules&#34;&gt;Quantization Naming Rules
&lt;/h2&gt;&lt;p&gt;Take &lt;code&gt;gemma-4:4b-q4_k_m&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gemma-4:4b&lt;/code&gt;: model name and parameter count (4 billion).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;q4&lt;/code&gt;: 4-bit quantization.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k&lt;/code&gt;: K-quants (an improved quantization method).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m&lt;/code&gt;: medium level (common options also include &lt;code&gt;s&lt;/code&gt;/small and &lt;code&gt;l&lt;/code&gt;/large).&lt;/li&gt;
&lt;/ul&gt;
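&lt;p&gt;The naming is regular enough to split mechanically. The helper below is a hypothetical illustration (not part of Ollama) that assumes the &lt;code&gt;name:size-quant&lt;/code&gt; pattern described above:&lt;/p&gt;

```python
def parse_tag(tag):
    """Split an Ollama-style tag such as 'gemma-4:4b-q4_k_m'.
    Hypothetical helper assuming the name:size-quant pattern."""
    name, _, rest = tag.partition(":")
    size, _, quant = rest.partition("-")
    bits = int(quant.lstrip("qQ").split("_")[0]) if quant else None
    return {"model": name, "params": size, "quant": quant, "bits": bits}

info = parse_tag("gemma-4:4b-q4_k_m")
# {'model': 'gemma-4', 'params': '4b', 'quant': 'q4_k_m', 'bits': 4}
```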
&lt;h2 id=&#34;quick-selection-by-vram&#34;&gt;Quick Selection by VRAM
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;RAM / VRAM&lt;/th&gt;
          &lt;th&gt;Recommended Quantization&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;4 GB&lt;/td&gt;
          &lt;td&gt;Q3_K_M / Q2_K&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;8 GB&lt;/td&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16 GB&lt;/td&gt;
          &lt;td&gt;Q5_K_M / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32 GB+&lt;/td&gt;
          &lt;td&gt;FP16 / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
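&lt;p&gt;The table collapses to a few thresholds. Here is the same rule of thumb as a sketch; the cutoffs mirror the table and are starting points, not hard limits:&lt;/p&gt;

```python
def pick_quant(vram_gb):
    """Map available RAM/VRAM to the quantization tiers suggested above."""
    if vram_gb >= 32:
        return ["FP16", "Q8_0"]
    if vram_gb >= 16:
        return ["Q5_K_M", "Q8_0"]
    if vram_gb >= 8:
        return ["Q4_K_M"]
    return ["Q3_K_M", "Q2_K"]
```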
&lt;p&gt;Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.&lt;/p&gt;
&lt;h2 id=&#34;practical-tips&#34;&gt;Practical Tips
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;Q4_K_M&lt;/code&gt; by default and test it on real tasks first.&lt;/li&gt;
&lt;li&gt;If response quality falls short, move up to &lt;code&gt;Q5_K_M&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If VRAM or speed is the main bottleneck, move down to &lt;code&gt;Q3_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use the same test set every time you switch quantization formats, so comparisons stay fair.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Quality first: &lt;code&gt;FP16&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Balance first: &lt;code&gt;Q5_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;General default: &lt;code&gt;Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Low-spec fallback: &lt;code&gt;Q3_K_M&lt;/code&gt; or &lt;code&gt;Q2_K&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key is not &amp;ldquo;bigger is always better&amp;rdquo;, but &amp;ldquo;the most stable and usable result under your hardware limits.&amp;rdquo;&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B</title>
        <link>https://www.knightli.com/en/2026/04/05/google-gemma-4-model-comparison/</link>
        <pubDate>Sun, 05 Apr 2026 08:30:00 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/05/google-gemma-4-model-comparison/</guid>
        <description>&lt;p&gt;Gemma 4 focuses on &lt;code&gt;multimodality&lt;/code&gt; and &lt;code&gt;local offline inference&lt;/code&gt;, with a full range from lightweight to high-performance models. For most local deployment users, the key is not choosing the largest model, but choosing the one that best matches hardware and task needs.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-model-comparison&#34;&gt;Gemma 4 Model Comparison
&lt;/h2&gt;&lt;blockquote&gt;
&lt;p&gt;The table below is for quick model selection. Actual performance and resource usage should be validated in your own environment.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Parameter Size&lt;/th&gt;
          &lt;th&gt;Positioning&lt;/th&gt;
          &lt;th&gt;Key Strengths&lt;/th&gt;
          &lt;th&gt;Main Limitations&lt;/th&gt;
          &lt;th&gt;Recommended Scenarios&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 2B&lt;/td&gt;
          &lt;td&gt;2B&lt;/td&gt;
          &lt;td&gt;Ultra-lightweight&lt;/td&gt;
          &lt;td&gt;Low latency, low resource usage, lowest deployment barrier&lt;/td&gt;
          &lt;td&gt;Limited performance on complex reasoning and long task chains&lt;/td&gt;
          &lt;td&gt;Mobile, IoT, lightweight Q&amp;amp;A, simple automation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 4B&lt;/td&gt;
          &lt;td&gt;4B&lt;/td&gt;
          &lt;td&gt;Lightweight enhanced&lt;/td&gt;
          &lt;td&gt;Stronger understanding and generation than 2B, still easy to deploy locally&lt;/td&gt;
          &lt;td&gt;Limited ceiling for heavy coding and complex agent tasks&lt;/td&gt;
          &lt;td&gt;Local assistant, basic document work, multilingual daily tasks&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 26B&lt;/td&gt;
          &lt;td&gt;26B&lt;/td&gt;
          &lt;td&gt;High-performance (MoE)&lt;/td&gt;
          &lt;td&gt;Better reasoning and tool use, suitable for production workflows&lt;/td&gt;
          &lt;td&gt;Significantly higher VRAM requirement and hardware threshold&lt;/td&gt;
          &lt;td&gt;Coding assistant, complex workflows, enterprise internal agents&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 31B&lt;/td&gt;
          &lt;td&gt;31B&lt;/td&gt;
          &lt;td&gt;High-performance (dense)&lt;/td&gt;
          &lt;td&gt;Best overall capability and stronger stability on complex tasks&lt;/td&gt;
          &lt;td&gt;Highest resource cost and tuning complexity&lt;/td&gt;
          &lt;td&gt;Advanced reasoning, complex coding tasks, heavy automation&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;how-to-choose-start-from-hardware-and-tasks&#34;&gt;How to Choose: Start from Hardware and Tasks
&lt;/h2&gt;&lt;p&gt;If your top concern is whether it runs smoothly, use this guideline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8 GB&lt;/code&gt; VRAM: prioritize &lt;code&gt;2B/4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;12 GB&lt;/code&gt; VRAM: prioritize &lt;code&gt;4B&lt;/code&gt; or quantized variants of larger models.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;24 GB&lt;/code&gt; VRAM: focus on &lt;code&gt;26B&lt;/code&gt;, and evaluate quantized &lt;code&gt;31B&lt;/code&gt; based on workload.&lt;/li&gt;
&lt;li&gt;Higher VRAM or multi-GPU: consider high-precision &lt;code&gt;31B&lt;/code&gt; setups.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prioritize stability and inference speed first, then scale up model size gradually.&lt;/p&gt;
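&lt;p&gt;The guideline above can be written down as one function. The thresholds are assumptions taken straight from the bullet list, and the right answer still depends on which quantization you run:&lt;/p&gt;

```python
def pick_gemma(vram_gb):
    """Encode the VRAM guideline above; starting points, not hard rules."""
    if vram_gb >= 24:
        return "26B, evaluating quantized 31B per workload"
    if vram_gb >= 12:
        return "4B or quantized variants of larger models"
    return "2B/4B"
```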
&lt;h2 id=&#34;four-typical-use-cases&#34;&gt;Four Typical Use Cases
&lt;/h2&gt;&lt;h3 id=&#34;1-local-general-assistant&#34;&gt;1) Local General Assistant
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;4B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: strong balance between cost and quality, suitable for long-running local use.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;2-coding-and-automation&#34;&gt;2) Coding and Automation
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;26B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: more stable in multi-step tasks, tool calls, and script generation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;3-advanced-reasoning-and-complex-agents&#34;&gt;3) Advanced Reasoning and Complex Agents
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;31B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: stronger robustness under complex context.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;4-edge-devices-and-lightweight-offline-use&#34;&gt;4) Edge Devices and Lightweight Offline Use
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;2B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: easiest to deploy on resource-constrained devices.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;deployment-suggestions-ollama&#34;&gt;Deployment Suggestions (Ollama)
&lt;/h2&gt;&lt;p&gt;A practical approach is to iterate in small steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;4B&lt;/code&gt; to establish a baseline (latency, memory, quality).&lt;/li&gt;
&lt;li&gt;Build a fixed test set from real tasks (for example, 20 common questions + 10 automation tasks).&lt;/li&gt;
&lt;li&gt;Compare &lt;code&gt;26B/31B&lt;/code&gt; against that set for accuracy, latency, and VRAM cost.&lt;/li&gt;
&lt;li&gt;Upgrade only when the gain is clear.&lt;/li&gt;
&lt;/ol&gt;
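&lt;p&gt;Steps 2 and 3 amount to a tiny evaluation harness. A sketch, with the model call injected as a plain callable so the same test set can be replayed against each candidate (a real &lt;code&gt;generate&lt;/code&gt; would wrap your local Ollama client; the stub below is only for illustration):&lt;/p&gt;

```python
import time

def evaluate(generate, test_set):
    """Replay a fixed test set, recording per-prompt accuracy and latency.
    'generate' is any callable mapping a prompt string to an answer string;
    injecting it keeps the harness model-agnostic, so it can be rerun
    unchanged against 4B, 26B, and 31B candidates."""
    results = []
    for prompt, expected in test_set:
        start = time.perf_counter()
        answer = generate(prompt)
        results.append({"prompt": prompt,
                        "correct": expected in answer,
                        "latency_s": time.perf_counter() - start})
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy, results

# Stub model, for illustration only.
stub = lambda p: "4" if "2+2" in p else "unsure"
acc, results = evaluate(stub, [("what is 2+2?", "4"),
                               ("capital of France?", "Paris")])
# acc == 0.5 for the stub
```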
&lt;p&gt;This avoids jumping to a large model too early and running into lag, low throughput, and maintenance overhead.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;The real value of Gemma 4 is not just larger parameter counts, but a practical model ladder from lightweight to high-performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For low-cost fast rollout: start with &lt;code&gt;2B/4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For production-grade local AI workflows: prioritize &lt;code&gt;26B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For advanced reasoning and heavy automation: move to &lt;code&gt;31B&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In most cases, the best Gemma 4 choice is not the biggest model, but the one with the best fit for your hardware and task goals.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
