What the Common GPU Inference Benchmark Metrics Actually Mean: FA, pp512, tg128, and Q4_0

When reading GPU inference benchmarks, you often run into metrics like FA, pp512, tg128, Q4_0, and t/s. They all relate to performance, but they do not measure the same thing. This article breaks down what each of them actually means.

As soon as you start looking at local LLM or GPU inference benchmarks, you quickly run into a stack of abbreviations: FA, pp512, tg128, and Q4_0. They all look like performance metrics, but without context they can be surprisingly hard to interpret.

For example, you may see a line like this:

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)

Then right below it, you might also see:

pp512 t/s
tg128 t/s

If you do not unpack what these terms mean, it becomes difficult to understand what the benchmark is actually measuring, or how to compare the results of two different GPUs.

This article is not about which GPU is the better buy. It is specifically about breaking down the most common metrics you see in GPU inference benchmarks.

First, what the whole title line is actually saying

A line like CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA) already tells you most of the test setup.

At minimum, it contains four layers of information:

  • CUDA: the benchmark is running on the NVIDIA CUDA path
  • Llama 2 7B: the model being tested is the 7B version of Llama 2
  • Q4_0: the model uses a 4-bit quantized format
  • no FA: Flash Attention was disabled in this test

So in practical terms, this kind of title usually means:

“A benchmark of a quantized large model running on an NVIDIA GPU, measured under a specific inference path.”
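Pulling those four layers out of a title line can even be done mechanically. Here is a small sketch (the `parse_title` helper and its field names are hypothetical, not part of any benchmark tool):

```python
import re

def parse_title(title: str) -> dict:
    """Split a scoreboard title such as
    'CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)'
    into its four layers of information."""
    backend = title.split()[0]                          # e.g. "CUDA"
    model = re.search(r"for (.+?), Q", title).group(1)  # e.g. "Llama 2 7B"
    quant = re.search(r"Q\d+_\w+", title).group(0)      # e.g. "Q4_0"
    flash_attention = "(no FA)" not in title            # FA disabled?
    return {"backend": backend, "model": model,
            "quant": quant, "flash_attention": flash_attention}

print(parse_title("CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)"))
```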

What FA means: Flash Attention

Here, FA stands for Flash Attention.

It is one of the most important acceleration techniques in large-model training and inference, mainly because it optimizes how attention is computed. In Transformer models, attention is already one of the most expensive and memory-bandwidth-heavy parts of the entire pipeline.

A traditional attention implementation often suffers from a few problems:

  • frequent memory reads and writes
  • many intermediate results
  • repeated data movement between VRAM and on-chip cache
  • rapidly growing overhead as context length increases

What Flash Attention does, in simple terms, is:

  • reorganize the computation order
  • reduce how often intermediate results are written back to VRAM
  • keep more of the work inside faster cache

That gives it three typical advantages:

  • it is faster
  • it saves memory
  • it is mathematically equivalent to standard attention rather than a lower-accuracy shortcut

That is why so many modern inference and training frameworks treat it as a major optimization feature.
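To see where those intermediate results come from, here is a minimal NumPy sketch of a naive attention computation. The full n-by-n score matrix it materializes is exactly the kind of intermediate Flash Attention avoids writing back to VRAM; this illustrates the math being computed, not the FA kernel itself:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d)) V.
    The score matrix S has shape (n, n) -- for a 4096-token context
    that is 4096*4096 floats per head, written out to memory by a
    naive kernel. Flash Attention produces the same result by
    processing tiles that stay in fast on-chip memory."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                 # (n, n) intermediate
    S = S - S.max(axis=-1, keepdims=True)    # numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)    # softmax over keys
    return P @ V                             # (n, d) output

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Because Flash Attention is mathematically equivalent, both paths return the same values; the difference is how much data moves through VRAM along the way.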

What no FA means

If FA means Flash Attention, then no FA simply means that Flash Attention was not enabled for this test.

In other words, the benchmark was measured using a more traditional attention implementation.

There are several reasons benchmark tables explicitly label no FA:

  • to keep a baseline for comparison
  • to support hardware or software environments where FA is unavailable
  • to avoid mixing scores from different optimization conditions

So when you see no FA, you should not read it as “this GPU is weak.” A more accurate reading is:

“This score was measured without Flash Attention enabled.”

What Q4_0 means: a quantization format

Q4_0 refers to a 4-bit quantization format.

The original model weights are usually not stored at such low precision. Quantization compresses higher-precision weights into a lower-bit representation so the model becomes easier to run on consumer GPUs.

A rough way to think about it is:

  • Q: Quantization
  • 4: 4-bit weights
  • _0: the scheme variant; in the llama.cpp/GGUF family, _0 stores a per-block scale only, while _1 adds a per-block offset

Its practical importance is straightforward:

  • smaller model size
  • lower VRAM requirements
  • better chances of fitting on consumer hardware

So Llama 2 7B, Q4_0 does not mean just “a normal 7B model.” It means “a 7B model already compressed using a 4-bit quantization format.”
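The size savings follow directly: 7 billion weights at 16 bits is roughly 14 GB, while at around 4.5 bits per weight (4-bit values plus per-block scales) the same model drops to roughly 4 GB. A simplified sketch of the idea behind blockwise 4-bit quantization follows; the real ggml Q4_0 code differs in details such as how the scale is chosen and how two 4-bit values are packed per byte:

```python
import numpy as np

def quantize_block(x):
    """Quantize one block of weights to 4-bit integers plus a single
    float scale, roughly in the spirit of Q4_0 (simplified)."""
    amax = np.max(np.abs(x))
    scale = amax / 7.0 if amax > 0 else 1.0           # map into [-7, 7]
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return scale, q

def dequantize_block(scale, q):
    """Reconstruct approximate weights from scale and 4-bit values."""
    return scale * q.astype(np.float32)

rng = np.random.default_rng(1)
w = rng.standard_normal(32).astype(np.float32)  # one 32-weight block
scale, q = quantize_block(w)
w_hat = dequantize_block(scale, q)
print(np.max(np.abs(w - w_hat)))  # small per-weight reconstruction error
```

The trade is explicit: each weight costs 4 bits instead of 16, at the price of a bounded rounding error per block.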

What pp512 t/s means

pp512 usually means:

Prompt Processing 512 tokens

It measures how fast the model processes the input prompt, usually in t/s, meaning tokens per second.

Here, 512 means the prompt length used in the test was 512 tokens.

This metric does not measure output speed. It measures how quickly the model encodes and computes over the input before it starts responding. You can think of it as the speed of the “reading the prompt first” stage.

One important property of this stage is that it is usually much more parallelizable.

Because the input sequence can be processed in batches, the GPU can often keep its compute units highly utilized. That is why pp512 numbers can look extremely high, sometimes almost suspiciously high at first glance.

So if you see something like:

pp512 ≈ 14000 t/s

there is no reason to panic. That is measuring prompt-processing throughput, not the speed of token-by-token output generation.
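To turn a pp512 figure into something intuitive, divide the prompt length by the throughput. A quick sketch of that arithmetic:

```python
def prompt_latency_s(prompt_tokens: int, pp_tps: float) -> float:
    """Seconds spent processing the prompt before the first
    output token can be generated."""
    return prompt_tokens / pp_tps

# At pp512 ≈ 14000 t/s, a 512-token prompt is consumed in well
# under a tenth of a second:
print(round(prompt_latency_s(512, 14000.0), 4))  # 0.0366
```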

What tg128 t/s means

tg128 usually means:

Text Generation 128 tokens

It measures the average speed of generating 128 tokens, again in t/s.

This metric is much closer to what people intuitively mean when they ask whether a model feels fast, because it is directly measuring the output stage.

But the biggest difference from pp512 is that text generation is usually autoregressive.

That means:

  • the model must generate the first token
  • then use that to generate the second
  • then continue to the third

So this stage cannot be parallelized the way prompt processing can, and it is naturally much slower.

That is why it is perfectly normal to see something like:

  • pp512 in the tens of thousands of t/s
  • tg128 only in the hundreds of t/s

This is not a benchmark error. These two metrics are measuring fundamentally different workloads.
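The sequential dependency described above can be sketched as a loop: each iteration needs the token produced by the previous one, which is exactly why this stage cannot be batched across positions the way prompt processing can. This is a toy sketch; `next_token` stands in for a full forward pass of the model:

```python
def generate(prompt_tokens, n_new, next_token):
    """Autoregressive decoding: one model call per output token.
    `next_token` is a stand-in for a full forward pass over the
    sequence so far -- the sequential dependency is the point."""
    seq = list(prompt_tokens)
    for _ in range(n_new):
        tok = next_token(seq)  # depends on everything generated so far
        seq.append(tok)
    return seq[len(prompt_tokens):]

# Toy "model": the next token is (last token + 1) mod 100.
out = generate([1, 2, 3], 5, lambda seq: (seq[-1] + 1) % 100)
print(out)  # [4, 5, 6, 7, 8]
```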

Why pp512 and tg128 differ so much

This is often the first thing people find confusing when reading a scoreboard.

The short explanation is:

pp512 is closer to measuring parallel throughput, while tg128 is closer to measuring token-by-token generation ability.

To expand on that:

  • the input stage is easier to parallelize
  • the output stage depends on sequential token generation
  • generation is usually more sensitive to memory bandwidth and cache behavior
  • so generation speed being much lower than prompt-processing speed is entirely normal

That also explains an interesting pattern you sometimes see in GPU comparisons:

  • one GPU is stronger in pp512
  • another ends up slightly faster in tg128

That is not contradictory. One metric leans more toward peak compute throughput, while the other reflects the actual memory and latency behavior of the generation path.

How to think about t/s

Here, t/s simply means tokens per second.

It tells you how many tokens the model can process or generate in one second.

But there is one important caveat: a token is not the same thing as a character or a word. It is the unit produced by the model’s tokenizer, and its actual text length can vary a lot across models and languages.

So in practice, t/s is most useful for:

  • comparing different GPUs on the same model
  • comparing different parameter settings in the same environment
  • comparing a framework before and after a specific optimization is enabled

It is much less reliable as a universal “absolute speed” metric across different models, frameworks, and tokenizers.
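A small sketch of why: the same t/s figure translates into different real-text speeds depending on how many characters a tokenizer packs into an average token (the averages below are made-up illustrative values):

```python
def text_speed_chars_per_s(tps: float, chars_per_token: float) -> float:
    """Convert tokens/second into characters/second for a given
    tokenizer's average token length (hypothetical figures)."""
    return tps * chars_per_token

# The same 100 t/s means twice the visible text speed if one
# tokenizer averages 4 chars/token and another only 2:
print(text_speed_chars_per_s(100, 4.0))  # 400.0
print(text_speed_chars_per_s(100, 2.0))  # 200.0
```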

What to focus on first when reading a scoreboard

If you do not want to get buried under abbreviations every time, start with these questions.

1. What model is being tested

For example, is it Llama 2 7B? Is it the same quantized variant, such as Q4_0? If the model or quantization format changes, direct comparison becomes much less meaningful.

2. Whether key optimizations are enabled

The most common example is FA. If one benchmark uses Flash Attention and the other does not, those scores are not directly comparable.

3. Whether the metric is measuring input speed or output speed

pp512 and tg128 are measuring different stages. One is closer to prompt-reading speed, the other is closer to answer-generation speed.

4. Whether you care about throughput or user feel

If you care more about how quickly a long prompt gets processed, pp512 matters more. If you care more about how fast the model feels while answering, tg128 is usually closer to the real experience.

A more practical way to remember all this

If you want to compress all of these into one short memory aid, you can think of them like this:

  • Q4_0: the model is compressed into a 4-bit quantized version
  • FA: whether Flash Attention is enabled
  • pp512: how fast the model processes a 512-token input
  • tg128: how fast the model generates a 128-token output
  • t/s: speed unit, tokens per second

Once those five points are clear, it becomes much easier to judge what a given CUDA Scoreboard is actually measuring.

Closing

GPU benchmark tables often look more complicated than they really are, not because the metrics themselves are mysterious, but because model identity, quantization, optimization flags, and different stages of throughput are all compressed into a few short abbreviations.

Once you unpack terms like FA, Q4_0, pp512, and tg128, these benchmark tables become much easier to read.

What matters is not just remembering a raw score, but knowing:

  • which model configuration the score came from
  • whether key optimizations were enabled
  • whether it measured input or output behavior
  • whether it reflects compute throughput or something closer to actual generation feel

That makes it much easier to judge what these results really mean.
