<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Flash Attention on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/flash-attention/</link>
        <description>Recent content in Flash Attention on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Thu, 23 Apr 2026 00:15:00 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/flash-attention/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>What the Common GPU Inference Benchmark Metrics Actually Mean: FA, pp512, tg128, and Q4_0</title>
        <link>https://www.knightli.com/en/2026/04/23/how-to-read-llm-cuda-scoreboard-fa-pp512-tg128-q4-0/</link>
        <pubDate>Thu, 23 Apr 2026 00:15:00 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/23/how-to-read-llm-cuda-scoreboard-fa-pp512-tg128-q4-0/</guid>
        <description>&lt;p&gt;As soon as you start looking at local LLM or GPU inference benchmarks, you quickly run into a stack of abbreviations: &lt;code&gt;FA&lt;/code&gt;, &lt;code&gt;pp512&lt;/code&gt;, &lt;code&gt;tg128&lt;/code&gt;, and &lt;code&gt;Q4_0&lt;/code&gt;. They all look like performance metrics, but without context they can be surprisingly hard to interpret.&lt;/p&gt;
&lt;p&gt;For example, you may see a line like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then right below it, you might also see:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pp512 t/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;tg128 t/s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you do not unpack what these terms mean, it becomes difficult to understand what the benchmark is actually measuring, or how to compare the results of two different GPUs.&lt;/p&gt;
&lt;p&gt;This article is not about which GPU is the better buy. It is specifically about breaking down the most common metrics you see in GPU inference benchmarks.&lt;/p&gt;
&lt;h2 id=&#34;first-what-the-whole-title-line-is-actually-saying&#34;&gt;First, what the whole title line is actually saying
&lt;/h2&gt;&lt;p&gt;A line like &lt;code&gt;CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)&lt;/code&gt; already tells you most of the test setup.&lt;/p&gt;
&lt;p&gt;At minimum, it contains four layers of information:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CUDA&lt;/code&gt;: the benchmark is running on the NVIDIA CUDA path&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Llama 2 7B&lt;/code&gt;: the model being tested is the 7B version of &lt;code&gt;Llama 2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_0&lt;/code&gt;: the model uses a 4-bit quantized format&lt;/li&gt;
&lt;li&gt;&lt;code&gt;no FA&lt;/code&gt;: &lt;code&gt;Flash Attention&lt;/code&gt; was disabled in this test&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So in practical terms, this kind of title usually means:&lt;/p&gt;
&lt;p&gt;&amp;ldquo;A 4-bit quantized Llama 2 7B, running through the NVIDIA CUDA backend, benchmarked with Flash Attention turned off.&amp;rdquo;&lt;/p&gt;
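&lt;p&gt;A quick way to internalize this decoding is to write it down as a tiny parser. This is only an illustrative sketch; the field names are my own, not from any benchmark tool:&lt;/p&gt;

```python
import re

def parse_scoreboard_title(title):
    """Split a scoreboard title like the one above into its four parts.

    The field names here are illustrative, not from any benchmark tool.
    """
    # Backend comes first ("CUDA Scoreboard for ..."), then model and
    # quantization format, then an "(FA)" / "(no FA)" suffix.
    m = re.match(r"(\w+) Scoreboard for (.+), (\S+) \((no )?FA\)", title)
    if m is None:
        return None
    backend, model, quant, no_fa = m.groups()
    return {
        "backend": backend,
        "model": model,
        "quant": quant,
        "flash_attention": no_fa is None,
    }

print(parse_scoreboard_title("CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)"))
```

&lt;p&gt;The point is simply that one short title string carries the backend, the model, the quantization format, and the optimization flag all at once.&lt;/p&gt;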
&lt;h2 id=&#34;what-fa-means-flash-attention&#34;&gt;What FA means: Flash Attention
&lt;/h2&gt;&lt;p&gt;Here, &lt;code&gt;FA&lt;/code&gt; stands for &lt;code&gt;Flash Attention&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is one of the most important acceleration techniques in large-model training and inference, mainly because it optimizes how attention is computed. In Transformer models, attention is already one of the most expensive and memory-bandwidth-heavy parts of the entire pipeline.&lt;/p&gt;
&lt;p&gt;A traditional attention implementation often suffers from a few problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;frequent memory reads and writes&lt;/li&gt;
&lt;li&gt;many intermediate results&lt;/li&gt;
&lt;li&gt;repeated data movement between VRAM and on-chip cache&lt;/li&gt;
&lt;li&gt;rapidly growing overhead as context length increases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What &lt;code&gt;Flash Attention&lt;/code&gt; does, in simple terms, is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reorganize the computation order&lt;/li&gt;
&lt;li&gt;reduce how often intermediate results are written back to VRAM&lt;/li&gt;
&lt;li&gt;keep more of the work inside faster cache&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That gives it three typical advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it is faster&lt;/li&gt;
&lt;li&gt;it saves memory&lt;/li&gt;
&lt;li&gt;it is mathematically equivalent to standard attention rather than a lower-accuracy shortcut&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why so many modern inference and training frameworks treat it as a major optimization feature.&lt;/p&gt;
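&lt;p&gt;The blockwise idea, and the fact that it stays mathematically equivalent, can be sketched in a few lines of NumPy. This is a simplified single-query illustration of the online-softmax trick, not the actual fused CUDA kernel:&lt;/p&gt;

```python
import numpy as np

def naive_attention(q, K, V):
    """Standard softmax attention for one query vector q."""
    scores = K @ q                      # one score per key
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

def blockwise_attention(q, K, V, block=4):
    """Same result, but keys/values are visited one block at a time.

    A running max (m), running normalizer (s), and running weighted sum
    (acc) are updated per block, so the full score vector is never
    materialized. This mirrors the online-softmax idea behind Flash
    Attention, which keeps each block's work in fast on-chip memory.
    """
    m = -np.inf                         # running max of scores seen so far
    s = 0.0                             # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])          # running weighted sum of values
    for i in range(0, len(K), block):
        scores = K[i:i + block] @ q
        m_new = max(m, scores.max())
        correction = np.exp(m - m_new)  # rescale old state to the new max
        w = np.exp(scores - m_new)
        s = s * correction + w.sum()
        acc = acc * correction + w @ V[i:i + block]
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(np.allclose(naive_attention(q, K, V), blockwise_attention(q, K, V)))  # True
```

&lt;p&gt;The real kernel also tiles over queries and fuses everything on-chip, but the equivalence shown here is exactly why &lt;code&gt;FA&lt;/code&gt; and &lt;code&gt;no FA&lt;/code&gt; scores measure the same mathematical operation at different speeds.&lt;/p&gt;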
&lt;h2 id=&#34;what-no-fa-means&#34;&gt;What no FA means
&lt;/h2&gt;&lt;p&gt;If &lt;code&gt;FA&lt;/code&gt; means &lt;code&gt;Flash Attention&lt;/code&gt;, then &lt;code&gt;no FA&lt;/code&gt; simply means that &lt;code&gt;Flash Attention&lt;/code&gt; was not enabled for this test.&lt;/p&gt;
&lt;p&gt;In other words, the benchmark was measured using a more traditional attention implementation.&lt;/p&gt;
&lt;p&gt;There are several reasons benchmark tables explicitly label &lt;code&gt;no FA&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;to keep a baseline for comparison&lt;/li&gt;
&lt;li&gt;to support hardware or software environments where &lt;code&gt;FA&lt;/code&gt; is unavailable&lt;/li&gt;
&lt;li&gt;to avoid mixing scores from different optimization conditions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So when you see &lt;code&gt;no FA&lt;/code&gt;, you should not read it as &amp;ldquo;this GPU is weak.&amp;rdquo; A more accurate reading is:&lt;/p&gt;
&lt;p&gt;&amp;ldquo;This score was measured without Flash Attention enabled.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;what-q4_0-means-a-quantization-format&#34;&gt;What Q4_0 means: a quantization format
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Q4_0&lt;/code&gt; refers to a 4-bit quantization format.&lt;/p&gt;
&lt;p&gt;The original model weights are typically released at 16-bit precision (FP16 or BF16), not at such low precision. Quantization compresses those higher-precision weights into a lower-bit representation so the model becomes easier to run on consumer GPUs.&lt;/p&gt;
&lt;p&gt;A rough way to think about it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q&lt;/code&gt;: Quantization&lt;/li&gt;
&lt;li&gt;&lt;code&gt;4&lt;/code&gt;: 4-bit&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_0&lt;/code&gt;: the variant of the scheme (&lt;code&gt;Q4_0&lt;/code&gt; stores one scale per block of weights, while the related &lt;code&gt;Q4_1&lt;/code&gt; also stores an offset)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Its practical importance is straightforward:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;smaller model size&lt;/li&gt;
&lt;li&gt;lower VRAM requirements&lt;/li&gt;
&lt;li&gt;better chances of fitting on consumer hardware&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;Llama 2 7B, Q4_0&lt;/code&gt; does not mean just &amp;ldquo;a normal 7B model.&amp;rdquo; It means &amp;ldquo;a 7B model already compressed using a 4-bit quantization format.&amp;rdquo;&lt;/p&gt;
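&lt;p&gt;To make the compression concrete, here is a simplified sketch of this style of block quantization: blocks of 32 weights, one scale per block, 4-bit integers. The real &lt;code&gt;Q4_0&lt;/code&gt; format in &lt;code&gt;ggml&lt;/code&gt; follows the same idea, but with a packed binary layout and an FP16 scale, so treat this as an illustration rather than the exact format:&lt;/p&gt;

```python
import numpy as np

def quantize_q4_0_style(weights, block=32):
    """Quantize float weights to 4-bit integers, one scale per block.

    Simplified version of the Q4_0 idea: each block stores a scale d and
    32 unsigned 4-bit values q, reconstructed as d * (q - 8).
    """
    weights = weights.reshape(-1, block)
    # Scale chosen so the largest-magnitude weight in each block maps
    # near the edge of the 4-bit integer range.
    absmax = np.abs(weights).max(axis=1, keepdims=True)
    d = np.where(absmax == 0, 1.0, absmax / 8.0)
    q = np.clip(np.round(weights / d) + 8, 0, 15).astype(np.uint8)
    return d, q

def dequantize(d, q):
    """Reconstruct approximate float weights from scales and 4-bit codes."""
    return d * (q.astype(np.float32) - 8)

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
d, q = quantize_q4_0_style(w)
w_hat = dequantize(d, q).reshape(-1)
print("max abs error:", np.abs(w - w_hat).max())
```

&lt;p&gt;At roughly 4.5 bits per weight (4 bits plus a shared scale) instead of 16, storage drops to well under a third of the released model, which is why a 7B model ends up only a few gigabytes in this format.&lt;/p&gt;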
&lt;h2 id=&#34;what-pp512-ts-means&#34;&gt;What pp512 t/s means
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;pp512&lt;/code&gt; usually means:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt Processing 512 tokens&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;It measures how fast the model processes the input prompt, usually in &lt;code&gt;t/s&lt;/code&gt;, meaning &lt;code&gt;tokens per second&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here, &lt;code&gt;512&lt;/code&gt; means the prompt length used in the test was &lt;code&gt;512 tokens&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This metric does not measure output speed. It measures how quickly the model encodes and computes over the input before it starts responding. You can think of it as the speed of the &amp;ldquo;reading the prompt first&amp;rdquo; stage.&lt;/p&gt;
&lt;p&gt;One important property of this stage is that it is usually much more parallelizable.&lt;/p&gt;
&lt;p&gt;Because the input sequence can be processed in batches, the GPU can often keep its compute units highly utilized. That is why &lt;code&gt;pp512&lt;/code&gt; numbers can look extremely high, sometimes almost suspiciously high at first glance.&lt;/p&gt;
&lt;p&gt;So if you see something like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pp512 ≈ 14000 t/s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;there is no reason to panic. That is measuring prompt-processing throughput, not the speed of token-by-token output generation.&lt;/p&gt;
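&lt;p&gt;One way to make a number like that concrete is to turn throughput into latency: roughly how long the prompt stage takes before the first output token appears. A back-of-envelope sketch, ignoring everything except raw prompt throughput:&lt;/p&gt;

```python
def prompt_latency_seconds(prompt_tokens, pp_tokens_per_second):
    """Rough time to process the whole prompt at a given pp throughput."""
    return prompt_tokens / pp_tokens_per_second

# At pp512 of about 14000 t/s, a 512-token prompt is read almost instantly:
print(round(prompt_latency_seconds(512, 14000), 3))  # 0.037 seconds
```

&lt;p&gt;So a seemingly enormous &lt;code&gt;pp512&lt;/code&gt; number mostly translates into a shorter wait before the answer starts.&lt;/p&gt;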
&lt;h2 id=&#34;what-tg128-ts-means&#34;&gt;What tg128 t/s means
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;tg128&lt;/code&gt; usually means:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Text Generation 128 tokens&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;It measures the average speed of generating &lt;code&gt;128 tokens&lt;/code&gt;, again in &lt;code&gt;t/s&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This metric is much closer to what people intuitively mean when they ask whether a model feels fast, because it is directly measuring the output stage.&lt;/p&gt;
&lt;p&gt;But the biggest difference from &lt;code&gt;pp512&lt;/code&gt; is that text generation is usually autoregressive.&lt;/p&gt;
&lt;p&gt;That means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the model must generate the first token&lt;/li&gt;
&lt;li&gt;then use that to generate the second&lt;/li&gt;
&lt;li&gt;then continue to the third&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So this stage cannot be parallelized the way prompt processing can, and it is naturally much slower.&lt;/p&gt;
&lt;p&gt;That is why it is perfectly normal to see something like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pp512&lt;/code&gt; in the tens of thousands of &lt;code&gt;t/s&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tg128&lt;/code&gt; only in the hundreds of &lt;code&gt;t/s&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is not a benchmark error. These two metrics are measuring fundamentally different workloads.&lt;/p&gt;
&lt;h2 id=&#34;why-pp512-and-tg128-differ-so-much&#34;&gt;Why pp512 and tg128 differ so much
&lt;/h2&gt;&lt;p&gt;This is often the first thing people find confusing when reading a scoreboard.&lt;/p&gt;
&lt;p&gt;The short explanation is:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pp512&lt;/code&gt; is closer to measuring parallel throughput, while &lt;code&gt;tg128&lt;/code&gt; is closer to measuring token-by-token generation ability.&lt;/p&gt;
&lt;p&gt;To expand on that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the input stage is easier to parallelize&lt;/li&gt;
&lt;li&gt;the output stage depends on sequential token generation&lt;/li&gt;
&lt;li&gt;generation is usually more sensitive to memory bandwidth and cache behavior&lt;/li&gt;
&lt;li&gt;so generation speed being much lower than prompt-processing speed is entirely normal&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That also explains an interesting pattern you sometimes see in GPU comparisons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one GPU is stronger in &lt;code&gt;pp512&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;another ends up slightly faster in &lt;code&gt;tg128&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is not contradictory. One metric leans more toward peak compute throughput, while the other reflects the actual memory and latency behavior of the generation path.&lt;/p&gt;
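&lt;p&gt;The bandwidth point can be made concrete with a back-of-envelope estimate: in the bandwidth-bound regime, generating one token requires reading roughly the whole model from VRAM once, so generation speed is capped near memory bandwidth divided by model size. The numbers below are illustrative assumptions, not measurements from any specific GPU:&lt;/p&gt;

```python
def tg_upper_bound(model_bytes, bandwidth_bytes_per_s):
    """Rough ceiling on tokens/s when each token reads all weights once."""
    return bandwidth_bytes_per_s / model_bytes

# Illustrative assumptions: a Q4_0 7B model is on the order of 3.8 GB,
# and a high-end consumer GPU might have around 1000 GB/s of bandwidth.
model_gb = 3.8
bandwidth_gb_s = 1000.0
print(round(tg_upper_bound(model_gb * 1e9, bandwidth_gb_s * 1e9)), "t/s ceiling")
```

&lt;p&gt;Under those assumptions the ceiling lands in the low hundreds of &lt;code&gt;t/s&lt;/code&gt;, which is why &lt;code&gt;tg128&lt;/code&gt; tracks memory bandwidth far more closely than peak compute.&lt;/p&gt;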
&lt;h2 id=&#34;how-to-think-about-ts&#34;&gt;How to think about t/s
&lt;/h2&gt;&lt;p&gt;Here, &lt;code&gt;t/s&lt;/code&gt; simply means &lt;code&gt;tokens per second&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It tells you how many tokens the model can process or generate in one second.&lt;/p&gt;
&lt;p&gt;But there is one important caveat: a &lt;code&gt;token&lt;/code&gt; is not the same thing as a character or a word. It is the unit produced by the model&amp;rsquo;s tokenizer, and its actual text length can vary a lot across models and languages.&lt;/p&gt;
&lt;p&gt;So in practice, &lt;code&gt;t/s&lt;/code&gt; is most useful for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;comparing different GPUs on the same model&lt;/li&gt;
&lt;li&gt;comparing different parameter settings in the same environment&lt;/li&gt;
&lt;li&gt;comparing a framework before and after a specific optimization is enabled&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is much less reliable as a universal &amp;ldquo;absolute speed&amp;rdquo; metric across different models, frameworks, and tokenizers.&lt;/p&gt;
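&lt;p&gt;If you still want a rough feel for absolute speed, you can convert &lt;code&gt;t/s&lt;/code&gt; into approximate English words per second. The 0.75 words-per-token ratio below is a commonly quoted rule of thumb for English text only, and it varies by tokenizer and language:&lt;/p&gt;

```python
def tokens_per_s_to_words_per_s(tps, words_per_token=0.75):
    """Very rough conversion; the ratio varies by tokenizer and language."""
    return tps * words_per_token

# A tg128 of 100 t/s would be roughly 75 English words per second.
print(tokens_per_s_to_words_per_s(100))  # 75.0
```

&lt;p&gt;At that ratio, even 100 &lt;code&gt;t/s&lt;/code&gt; of generation is far faster than anyone reads, which is why modest-looking &lt;code&gt;tg&lt;/code&gt; numbers can still feel instant in a chat UI.&lt;/p&gt;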
&lt;h2 id=&#34;what-to-focus-on-first-when-reading-a-scoreboard&#34;&gt;What to focus on first when reading a scoreboard
&lt;/h2&gt;&lt;p&gt;If you do not want to get buried under abbreviations every time, start with these questions.&lt;/p&gt;
&lt;h3 id=&#34;1-what-model-is-being-tested&#34;&gt;1. What model is being tested
&lt;/h3&gt;&lt;p&gt;For example, is it &lt;code&gt;Llama 2 7B&lt;/code&gt;? Is it the same quantized variant, such as &lt;code&gt;Q4_0&lt;/code&gt;? If the model or quantization format changes, direct comparison becomes much less meaningful.&lt;/p&gt;
&lt;h3 id=&#34;2-whether-key-optimizations-are-enabled&#34;&gt;2. Whether key optimizations are enabled
&lt;/h3&gt;&lt;p&gt;The most common example is &lt;code&gt;FA&lt;/code&gt;. If one benchmark uses &lt;code&gt;Flash Attention&lt;/code&gt; and the other does not, those scores are not directly comparable.&lt;/p&gt;
&lt;h3 id=&#34;3-whether-the-metric-is-measuring-input-speed-or-output-speed&#34;&gt;3. Whether the metric is measuring input speed or output speed
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;pp512&lt;/code&gt; and &lt;code&gt;tg128&lt;/code&gt; are measuring different stages. One is closer to prompt-reading speed, the other is closer to answer-generation speed.&lt;/p&gt;
&lt;h3 id=&#34;4-whether-you-care-about-throughput-or-user-feel&#34;&gt;4. Whether you care about throughput or user feel
&lt;/h3&gt;&lt;p&gt;If you care more about how quickly a long prompt gets processed, &lt;code&gt;pp512&lt;/code&gt; matters more. If you care more about how fast the model feels while answering, &lt;code&gt;tg128&lt;/code&gt; is usually closer to the real experience.&lt;/p&gt;
&lt;h2 id=&#34;a-more-practical-way-to-remember-all-this&#34;&gt;A more practical way to remember all this
&lt;/h2&gt;&lt;p&gt;If you want to compress all of these into one short memory aid, you can think of them like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_0&lt;/code&gt;: the model is compressed into a 4-bit quantized version&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FA&lt;/code&gt;: whether Flash Attention is enabled&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pp512&lt;/code&gt;: how fast the model processes a 512-token input&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tg128&lt;/code&gt;: how fast the model generates a 128-token output&lt;/li&gt;
&lt;li&gt;&lt;code&gt;t/s&lt;/code&gt;: speed unit, tokens per second&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once those five points are clear, it becomes much easier to judge what a given CUDA Scoreboard is actually measuring.&lt;/p&gt;
&lt;h2 id=&#34;closing&#34;&gt;Closing
&lt;/h2&gt;&lt;p&gt;GPU benchmark tables often look more complicated than they really are, not because the metrics themselves are mysterious, but because model identity, quantization, optimization flags, and different stages of throughput are all compressed into a few short abbreviations.&lt;/p&gt;
&lt;p&gt;Once you unpack terms like &lt;code&gt;FA&lt;/code&gt;, &lt;code&gt;Q4_0&lt;/code&gt;, &lt;code&gt;pp512&lt;/code&gt;, and &lt;code&gt;tg128&lt;/code&gt;, these benchmark tables become much easier to read.&lt;/p&gt;
&lt;p&gt;What matters is not just remembering a raw score, but knowing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;which model configuration the score came from&lt;/li&gt;
&lt;li&gt;whether key optimizations were enabled&lt;/li&gt;
&lt;li&gt;whether it measured input or output behavior&lt;/li&gt;
&lt;li&gt;whether it reflects compute throughput or something closer to actual generation feel&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That makes it much easier to judge what these results really mean.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
