<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>GPU Tuning on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/gpu-tuning/</link>
        <description>Recent content in GPU Tuning on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Thu, 23 Apr 2026 12:13:04 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/gpu-tuning/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>How to Tune llama.cpp on 8GB VRAM: Why 32K Is Safer and 64K Needs KV Cache Quantization</title>
        <link>https://www.knightli.com/en/2026/04/23/llama-cpp-8g-vram-32k-64k-kv-cache-tuning/</link>
        <pubDate>Thu, 23 Apr 2026 12:13:04 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/23/llama-cpp-8g-vram-32k-64k-kv-cache-tuning/</guid>
<description>&lt;p&gt;Whether &lt;code&gt;8GB&lt;/code&gt; of VRAM is enough to run local LLMs smoothly, especially under long-context workloads, is one of the most common questions &lt;code&gt;llama.cpp&lt;/code&gt; users ask.&lt;/p&gt;
&lt;p&gt;There are three key takeaways worth remembering first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;On &lt;code&gt;8GB&lt;/code&gt; VRAM, &lt;code&gt;32K&lt;/code&gt; context is usually the safer balance point&lt;/li&gt;
&lt;li&gt;If you really want to run &lt;code&gt;64K&lt;/code&gt;, &lt;code&gt;KV Cache&lt;/code&gt; quantization is often essential&lt;/li&gt;
&lt;li&gt;In full-GPU inference, blindly increasing &lt;code&gt;CPU&lt;/code&gt; thread count can actually make performance worse&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;1-first-what-do-32k-64k-and-kv-cache-actually-mean&#34;&gt;1. First, what do 32K, 64K, and KV Cache actually mean?
&lt;/h2&gt;&lt;p&gt;For many readers, these are the three terms that cause the most confusion.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;32K&lt;/code&gt; and &lt;code&gt;64K&lt;/code&gt; refer to context length: how many &lt;code&gt;tokens&lt;/code&gt; the model can process at one time. The &lt;code&gt;K&lt;/code&gt; nominally means thousand, though in practice these limits are powers of two, so &lt;code&gt;32K&lt;/code&gt; is typically &lt;code&gt;32768 tokens&lt;/code&gt; and &lt;code&gt;64K&lt;/code&gt; is &lt;code&gt;65536 tokens&lt;/code&gt;. The longer the context, the more prior content the model can see at once, which is useful for long-document QA, long conversations, and multi-step analysis.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;KV Cache&lt;/code&gt; is an intermediate-result cache that the model keeps in order to speed up autoregressive generation. You can think of it like this: once the model has already read and computed part of the context, it does not need to recompute everything from scratch every time. Instead, it stores key intermediate information and reuses it. The &lt;code&gt;K&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt; come from &lt;code&gt;Key&lt;/code&gt; and &lt;code&gt;Value&lt;/code&gt; in the Transformer architecture.&lt;/p&gt;
&lt;p&gt;Why do these three terms always appear together? Because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32K&lt;/code&gt; and &lt;code&gt;64K&lt;/code&gt; define how much content you want the model to remember at once&lt;/li&gt;
&lt;li&gt;&lt;code&gt;KV Cache&lt;/code&gt; determines how much extra VRAM is needed to maintain that memory&lt;/li&gt;
&lt;li&gt;The longer the context, the larger the &lt;code&gt;KV Cache&lt;/code&gt; usually becomes, and the higher the VRAM pressure gets&lt;/li&gt;
&lt;/ul&gt;
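&lt;p&gt;The relationship above can be sketched with a rough size formula. The model shape used here (layer count, KV heads, head size) is an illustrative assumption for an 8B-class model with grouped-query attention, not a figure from any specific benchmark:&lt;/p&gt;

```python
# Rough f16 KV cache size as context grows (a sketch, not an exact accounting:
# real allocators add padding and per-backend overhead).
def kv_cache_gib(ctx_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2):
    # 2 tensors per layer (Key and Value), one vector per KV head per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 2**30

print(kv_cache_gib(32768))  # 4.0 GiB at 32K
print(kv_cache_gib(65536))  # 8.0 GiB at 64K -- the entire card, before weights
```

&lt;p&gt;Doubling the context doubles the cache, which is exactly why the jump from &lt;code&gt;32K&lt;/code&gt; to &lt;code&gt;64K&lt;/code&gt; hurts far more than it looks on paper.&lt;/p&gt;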
&lt;p&gt;So when long-context inference slows down, the root problem is often not that the model is &amp;ldquo;bad at computing&amp;rdquo;, but that the cache has grown large enough to push VRAM to its limit.&lt;/p&gt;
&lt;h2 id=&#34;2-why-does-32k-perform-so-differently-from-64k&#34;&gt;2. Why does 32K perform so differently from 64K?
&lt;/h2&gt;&lt;p&gt;Using roughly &lt;code&gt;30000&lt;/code&gt; Chinese characters from &lt;em&gt;The Three-Body Problem&lt;/em&gt; as a stress-test input, the comparison between &lt;code&gt;32K&lt;/code&gt; and &lt;code&gt;64K&lt;/code&gt; context can look dramatic: with similar document size, &lt;code&gt;64K&lt;/code&gt; can become much slower and total runtime can increase significantly.&lt;/p&gt;
&lt;p&gt;The reason is not that the model suddenly becomes worse. The real issue is hitting the VRAM boundary.&lt;/p&gt;
&lt;p&gt;At &lt;code&gt;32K&lt;/code&gt;, model weights plus cache may still fit within &lt;code&gt;8GB&lt;/code&gt; VRAM, so most data traffic stays on the GPU&amp;rsquo;s own memory bandwidth. But once you move to &lt;code&gt;64K&lt;/code&gt;, the cache grows further, total memory use approaches or exceeds the VRAM ceiling, and part of the data gets pushed into shared or system memory.&lt;/p&gt;
&lt;p&gt;At that point, what collapses is not raw compute, but bandwidth.&lt;/p&gt;
&lt;p&gt;In other words, what looks like &amp;ldquo;context doubled and performance crashed&amp;rdquo; is often really a case of the data path falling out of VRAM and into much slower memory.&lt;/p&gt;
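&lt;p&gt;A back-of-envelope comparison shows how steep that cliff is. Both bandwidth figures are illustrative assumptions (a mid-range 8GB card and a PCIe 4.0 x16 link), not measurements:&lt;/p&gt;

```python
# How much narrower the data path gets once cache spills out of VRAM.
vram_bandwidth_gbs = 250.0   # assumed on-card GDDR6 bandwidth, mid-range 8GB GPU
pcie_bandwidth_gbs = 16.0    # assumed practical PCIe 4.0 x16 throughput

slowdown = vram_bandwidth_gbs / pcie_bandwidth_gbs
print(f"spilled data moves ~{slowdown:.0f}x slower than data kept in VRAM")
```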
&lt;h2 id=&#34;3-if-you-want-64k-kv-cache-quantization-matters-a-lot&#34;&gt;3. If you want 64K, KV Cache quantization matters a lot
&lt;/h2&gt;&lt;p&gt;One of the most important conclusions for &lt;code&gt;8GB&lt;/code&gt; VRAM users is that &lt;code&gt;KV Cache&lt;/code&gt; quantization matters a great deal.&lt;/p&gt;
&lt;p&gt;Without changing the model itself, quantizing only the cache can directly reduce cache memory usage under long context. That means some of the data that previously spilled out of VRAM can move back into VRAM. As a result, &lt;code&gt;64K&lt;/code&gt; is still heavier than &lt;code&gt;32K&lt;/code&gt;, but it is less likely to fall into the slowest performance zone.&lt;/p&gt;
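&lt;p&gt;In &lt;code&gt;llama.cpp&lt;/code&gt; the cache element type is controlled with &lt;code&gt;--cache-type-k&lt;/code&gt; and &lt;code&gt;--cache-type-v&lt;/code&gt; (quantizing the V cache typically also requires flash attention via &lt;code&gt;-fa&lt;/code&gt;). The sketch below compares cache footprints at 64K; the per-element sizes follow the GGML block formats, while the model shape is an illustrative 8B-class assumption:&lt;/p&gt;

```python
# Approximate KV cache footprint at 64K context for different cache types.
BYTES_PER_ELEM = {
    "f16":  2.0,       # the default cache type
    "q8_0": 34 / 32,   # 32 values packed into 34 bytes (incl. 2-byte scale)
    "q4_0": 18 / 32,   # 32 values packed into 18 bytes (incl. 2-byte scale)
}

def kv_cache_gib(ctx, bpe, n_layers=32, n_kv_heads=8, head_dim=128):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bpe / 2**30

for name, bpe in BYTES_PER_ELEM.items():
    print(f"{name}: {kv_cache_gib(65536, bpe):.2f} GiB at 64K")
# f16: 8.00, q8_0: 4.25, q4_0: 2.25 -- quantization wins back gigabytes
```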
&lt;p&gt;Put simply:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32K&lt;/code&gt; is the more practical default range for &lt;code&gt;8GB&lt;/code&gt; VRAM&lt;/li&gt;
&lt;li&gt;&lt;code&gt;64K&lt;/code&gt; is not impossible&lt;/li&gt;
&lt;li&gt;But without cache quantization, performance can drop from &amp;ldquo;usable&amp;rdquo; to &amp;ldquo;hard to use&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your goal is stable long-context inference, the usual priority should be:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Check whether VRAM is already near its ceiling&lt;/li&gt;
&lt;li&gt;Decide whether to enable &lt;code&gt;KV Cache&lt;/code&gt; quantization&lt;/li&gt;
&lt;li&gt;Only then continue experimenting with more aggressive throughput settings&lt;/li&gt;
&lt;/ol&gt;
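&lt;p&gt;Steps 1 and 2 amount to a simple headroom budget. All sizes below are illustrative assumptions (a ~4.5GB 4-bit model file and a flat 1GB allowance for compute buffers and other overhead):&lt;/p&gt;

```python
def headroom_gib(model_gib, kv_gib, vram_gib=8.0, overhead_gib=1.0):
    # Positive: everything stays in VRAM. Negative: the cache spills into
    # much slower shared/system memory.
    return vram_gib - (model_gib + kv_gib + overhead_gib)

print(headroom_gib(4.5, kv_gib=8.0))    # -5.5: 64K f16 cache blows the budget
print(headroom_gib(4.5, kv_gib=2.25))   # 0.25: a ~4-bit cache squeezes back in
```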
&lt;h2 id=&#34;4-low-gpu-utilization-does-not-mean-the-gpu-is-idle&#34;&gt;4. Low GPU utilization does not mean the GPU is idle
&lt;/h2&gt;&lt;p&gt;This is a point that often breaks intuition.&lt;/p&gt;
&lt;p&gt;When people see only 20% or 30% &lt;code&gt;GPU&lt;/code&gt; usage in Task Manager, they often assume:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the parameters must be wrong&lt;/li&gt;
&lt;li&gt;the model is not really running on the GPU&lt;/li&gt;
&lt;li&gt;the GPU is not being used fully&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But the more likely explanation in &lt;code&gt;llama.cpp&lt;/code&gt; inference is that the bottleneck is not core compute, but memory reads and writes.&lt;/p&gt;
&lt;p&gt;That means GPU cores may finish a batch of computation quickly, then spend the rest of the time waiting for the next batch of weights or cached data to arrive.&lt;/p&gt;
&lt;p&gt;So what you see becomes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;core utilization is not especially high&lt;/li&gt;
&lt;li&gt;but end-to-end speed still fails to improve&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is not the GPU being lazy. It is the data path being too narrow.&lt;/p&gt;
&lt;p&gt;That is why you should not look only at &lt;code&gt;GPU Usage&lt;/code&gt; when judging local LLM performance. VRAM capacity, memory bandwidth, and cache spillover often matter more.&lt;/p&gt;
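&lt;p&gt;The memory-bound nature of generation can be made concrete with a roofline-style estimate: each new token reads roughly every model weight once, so bandwidth, not core compute, sets the ceiling. Both figures below are illustrative assumptions:&lt;/p&gt;

```python
# Upper bound on generation speed when weight reads dominate.
def bandwidth_bound_tps(model_gib, bandwidth_gibs):
    # Ignores KV cache reads and compute entirely: a ceiling, not a prediction.
    return bandwidth_gibs / model_gib

print(round(bandwidth_bound_tps(4.5, 250.0)))  # ~56 tok/s with weights in VRAM
print(round(bandwidth_bound_tps(4.5, 16.0)))   # ~4 tok/s if reads go over PCIe
```

&lt;p&gt;Under this model the cores can sit mostly idle while the card is already running as fast as its memory allows.&lt;/p&gt;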
&lt;h2 id=&#34;5-increasing-throughput-parameters-can-help-but-only-if-vram-can-handle-it&#34;&gt;5. Increasing throughput parameters can help, but only if VRAM can handle it
&lt;/h2&gt;&lt;p&gt;Another useful idea is this: if GPU cores are not fully saturated, maybe you can increase throughput-related parameters so the GPU processes more data at once and uses its parallelism more effectively.&lt;/p&gt;
&lt;p&gt;This can indeed improve speed.&lt;/p&gt;
&lt;p&gt;But there is an important condition: VRAM must still have headroom.&lt;/p&gt;
&lt;p&gt;Because once you increase throughput-related settings, you often also increase VRAM usage. If you are already in a &lt;code&gt;64K&lt;/code&gt; scenario with large cache and VRAM near exhaustion, pushing those parameters further can lead to two outcomes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a crash&lt;/li&gt;
&lt;li&gt;or a fallback into much slower shared-memory behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the safer sequence is usually not &amp;ldquo;max out the knobs first&amp;rdquo;, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;protect the VRAM boundary first&lt;/li&gt;
&lt;li&gt;then try throughput optimization&lt;/li&gt;
&lt;li&gt;after every change, check both speed and stability again&lt;/li&gt;
&lt;/ul&gt;
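&lt;p&gt;In &lt;code&gt;llama.cpp&lt;/code&gt; the main throughput knobs are the batch sizes (&lt;code&gt;--batch-size&lt;/code&gt; / &lt;code&gt;--ubatch-size&lt;/code&gt;). The sketch below shows the &amp;ldquo;protect the boundary first&amp;rdquo; idea with a deliberately crude, made-up linear cost model; the real per-batch VRAM cost varies by model and backend and is best found empirically:&lt;/p&gt;

```python
def largest_safe_batch(base_gib, cost_gib_per_256, vram_gib=8.0):
    # Raise the batch only while the projected total stays under the ceiling.
    best = 256
    for batch in (256, 512, 1024, 2048):
        projected = base_gib + cost_gib_per_256 * batch / 256
        if projected > vram_gib:
            break
        best = batch
    return best

# e.g. 6.5 GiB already committed, assumed 0.3 GiB of buffers per 256 batch:
print(largest_safe_batch(6.5, 0.3))  # 1024 -- 2048 would cross the boundary
```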
&lt;h2 id=&#34;6-more-cpu-threads-are-not-always-better&#34;&gt;6. More CPU threads are not always better
&lt;/h2&gt;&lt;p&gt;This is one of the easiest traps to remember.&lt;/p&gt;
&lt;p&gt;It is very natural to assume that more threads should mean better speed. But in practice, once the model is already running mostly on the GPU, forcing &lt;code&gt;CPU&lt;/code&gt; thread count higher can make performance noticeably worse.&lt;/p&gt;
&lt;p&gt;The reason is straightforward.&lt;/p&gt;
&lt;p&gt;In full-GPU inference, the &lt;code&gt;CPU&lt;/code&gt; is more of a scheduler and preprocessing helper than the main compute engine. Spinning up too many threads increases CPU-side contention, scheduling overhead, and context-switching costs, which can disrupt a data flow that should have stayed smooth.&lt;/p&gt;
&lt;p&gt;The result is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;CPU&lt;/code&gt; looks busier&lt;/li&gt;
&lt;li&gt;but overall speed gets slower&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So in this kind of setup, default settings or lower thread counts are often more reliable than simply maxing everything out.&lt;/p&gt;
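&lt;p&gt;A toy cost model makes the trap easy to see: per-step CPU work shrinks as threads are added, but contention and scheduling overhead grow. The constants are invented purely for illustration:&lt;/p&gt;

```python
# Toy model: CPU-side time per generation step vs. thread count.
def step_ms(threads, cpu_work_ms=8.0, overhead_ms_per_thread=1.5):
    # Useful work parallelizes; scheduling/contention overhead does not.
    return cpu_work_ms / threads + overhead_ms_per_thread * threads

timings = {t: step_ms(t) for t in (1, 2, 4, 8, 16)}
print(min(timings, key=timings.get))  # 2 -- a small thread count wins here
```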
&lt;h2 id=&#34;7-a-more-practical-approach-for-8gb-vram-users&#34;&gt;7. A more practical approach for 8GB VRAM users
&lt;/h2&gt;&lt;p&gt;If we compress the conclusions above into a practical workflow, it looks roughly like this:&lt;/p&gt;
&lt;h3 id=&#34;1-treat-32k-as-the-default-goal&#34;&gt;1. Treat 32K as the default goal
&lt;/h3&gt;&lt;p&gt;If you only have an &lt;code&gt;8GB&lt;/code&gt; GPU, do not rush to chase &lt;code&gt;64K&lt;/code&gt;. &lt;code&gt;32K&lt;/code&gt; is usually the more realistic balance between speed, stability, and memory usage.&lt;/p&gt;
&lt;h3 id=&#34;2-if-you-want-64k-deal-with-the-cache-first&#34;&gt;2. If you want 64K, deal with the cache first
&lt;/h3&gt;&lt;p&gt;Do not start by asking whether you can squeeze out a little more speed. First confirm whether &lt;code&gt;KV Cache&lt;/code&gt; is quantized and whether VRAM is already near the limit.&lt;/p&gt;
&lt;h3 id=&#34;3-do-not-judge-everything-by-gpu-utilization&#34;&gt;3. Do not judge everything by GPU utilization
&lt;/h3&gt;&lt;p&gt;Low utilization does not necessarily mean the settings are wrong. It may simply mean memory bandwidth is the real bottleneck.&lt;/p&gt;
&lt;h3 id=&#34;4-throughput-optimization-is-valid-but-do-not-cross-the-vram-boundary&#34;&gt;4. Throughput optimization is valid, but do not cross the VRAM boundary
&lt;/h3&gt;&lt;p&gt;These parameters can help, but only if there is still enough VRAM headroom.&lt;/p&gt;
&lt;h3 id=&#34;5-be-conservative-with-cpu-threads-first&#34;&gt;5. Be conservative with CPU threads first
&lt;/h3&gt;&lt;p&gt;If the model is already running mostly on the GPU, higher CPU thread counts are not automatically better. Start with defaults or lower thread counts, then test gradually.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;The most valuable part of this whole discussion is not the benchmark numbers themselves, but the easily overlooked truth they make clear:&lt;/p&gt;
&lt;p&gt;Local LLM tuning is often not about pushing every setting to the maximum. It is about understanding whether your real bottleneck is compute, VRAM capacity, memory bandwidth, or &lt;code&gt;CPU&lt;/code&gt; scheduling.&lt;/p&gt;
&lt;p&gt;For &lt;code&gt;8GB&lt;/code&gt; VRAM users, the safer strategy is usually not to force the longest possible context, but to protect the VRAM boundary first and only then decide how far to push further.&lt;/p&gt;
&lt;p&gt;If you only remember one sentence, make it this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;32K&lt;/code&gt; is often the more stable working range for &lt;code&gt;8GB&lt;/code&gt; VRAM; &lt;code&gt;64K&lt;/code&gt; is possible, but only if you have already brought &lt;code&gt;KV Cache&lt;/code&gt; and VRAM usage under control.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
