As soon as you start working with large-model training, inference, or deployment, you quickly run into a familiar set of abbreviations: FP32, FP16, BF16, TF32, and FP8. They may look like small labels on a model page, but their impact is much bigger than a naming difference.
These formats determine how numbers are stored in memory and represented during computation. They directly affect training stability, inference speed, and even how large a model a given GPU can realistically handle.
So if you want to understand precision trade-offs in large models, one of the best places to start is not a benchmark chart for a specific model, but a clear picture of what these tensor formats are and why they were designed the way they are.
What tensor formats actually determine
At its core, a large model is a massive set of matrix operations over huge numbers of parameters, and the tensor format is how those numbers are stored in memory and represented during computation.
The trade-off usually revolves around three dimensions:
- precision
- VRAM usage
- compute speed
This is actually a lot like image formats. Lossless formats preserve more detail, but take more space and load more slowly. Compressed formats discard information that is less noticeable to the eye in exchange for smaller size and faster handling. Large models can accept similar trade-offs because, across extremely large parameter sets, many tiny numerical changes do not significantly affect the final output.
That is why the model world has developed a whole family of precision formats.
How a number is represented
Before getting into the formats, it helps to remember one basic structure. A floating-point number is usually made of three parts:
- sign bit: determines positive or negative
- exponent bits: determine numerical range
- mantissa bits: determine numerical detail
In large models, mantissa precision certainly matters, but many models are even more sensitive to insufficient numerical range, meaning too few exponent bits and a higher risk of overflow or unstable training. A lot of tensor format design is essentially about reallocating a limited number of bits between range and detail.
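To make those three parts concrete, here is a small Python sketch (standard library only; the function name is my own) that splits an FP32 value into its sign, exponent, and mantissa fields:

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split an FP32 value into its sign (1 bit), exponent (8 bits),
    and mantissa (23 bits) fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the float as a uint32
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

# -1.5 = (-1)^1 x 1.1 (binary) x 2^0: sign 1, biased exponent 127, top mantissa bit set
print(fp32_fields(-1.5))  # (1, 127, 4194304)
```

Reallocating bits between the exponent and mantissa fields is exactly the lever the formats below pull in different directions.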
The table below gives a quick overall view of how each format splits its bits:

| Format | Sign | Exponent bits | Mantissa bits | Total bits |
| --- | --- | --- | --- | --- |
| FP32 | 1 | 8 | 23 | 32 |
| TF32 | 1 | 8 | 10 | 19 |
| FP16 | 1 | 5 | 10 | 16 |
| BF16 | 1 | 8 | 7 | 16 |
| FP8 E4M3 | 1 | 4 | 3 | 8 |
| FP8 E5M2 | 1 | 5 | 2 | 8 |
FP32: the most stable, but expensive
FP32 is the traditional single-precision floating-point format. It uses 32 bits in total, or 4 bytes.
Its strengths are straightforward:
- wide numerical range
- high precision
- the most stable training behavior
But the downside is just as clear: it consumes a lot of VRAM.
A very rough estimate is:

    weight memory ≈ parameter count × bytes per parameter

If a 27B model stores weights entirely in FP32, the weights alone take roughly:

    27 × 10^9 parameters × 4 bytes ≈ 108 GB
And that still does not include activations, KV cache, optimizer state, or other runtime overhead. So in modern large-model training and inference, FP32 is no longer the default so much as the most stable baseline format.
FP16: half the size, but less stable
FP16 compresses each parameter to 2 bytes, cutting memory usage roughly in half compared with FP32.
For the same 27B model, if you only look at weight size:
    27 × 10^9 parameters × 2 bytes ≈ 54 GB
That already explains why many deployment guides place a 27B model around the 50GB VRAM range.
The advantages of FP16 are obvious:
- much lower VRAM pressure
- higher throughput
- widely used in early mixed-precision training
Its weakness is the relatively small exponent range. In large-model training, that makes overflow more likely and often requires extra techniques such as loss scaling, which adds engineering complexity.
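You can see that small exponent range directly with NumPy's `float16` type (a minimal sketch; NumPy is only used here because it ships an FP16 implementation):

```python
import numpy as np

# FP16 has only 5 exponent bits, so its largest finite value is about 65504.
print(np.finfo(np.float16).max)    # 65504.0

# Anything bigger overflows to infinity -- the failure mode loss scaling guards against.
big = np.float32(70000.0)
print(np.float16(big))             # inf
print(big)                         # 70000.0 -- FP32 still has plenty of range
```

Loss scaling exists precisely because gradients can stray past that 65504 ceiling (or underflow to zero) during training.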
So FP16 is still common, but in many scenarios it is no longer the most comfortable option.
BF16: a more practical half precision for the large-model era
BF16 also uses 2 bytes, but it makes a different trade-off from FP16.
It keeps a much larger exponent range, making its dynamic range closer to FP32, while giving up some mantissa precision. That trade-off works especially well for large models, because they are often more sensitive to range than to losing a few mantissa bits.
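A rough way to see that trade-off in code: BF16 is essentially FP32 with the low 16 bits dropped, so zeroing those bits simulates the conversion (this sketch truncates rather than rounding to nearest, which real conversions typically do):

```python
import struct

def to_bf16(x: float) -> float:
    """Approximate BF16 by zeroing the low 16 bits of the FP32 encoding:
    the 8-bit exponent survives intact, the mantissa shrinks to 7 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# BF16 inherits FP32's range: 1e38 stays finite (FP16 would overflow here)...
print(to_bf16(1e38))
# ...but with 7 mantissa bits, fine detail is rounded away.
print(to_bf16(1.2345678))  # 1.234375
```

The value `1e38` is far past FP16's ~65504 limit, yet BF16 represents it without trouble; what it gives up is the last few decimal digits.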
That is why many training frameworks, many large-model papers, and many real deployment setups prefer BF16.
A simple way to think about it is:
- VRAM cost close to FP16
- stability closer to FP32
If one 27B deployment guide asks for roughly 50GB of VRAM while another optimized one gets closer to 30GB, the former often still lives in the FP16/BF16 layer, while the latter has usually moved further toward lower precision or quantization.
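The back-of-the-envelope math behind those VRAM thresholds can be wrapped in a tiny helper. This is a sketch for weight storage only (activations, KV cache, and optimizer state are ignored), and the format table plus the 1 GB = 10^9 bytes convention are my own choices here:

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}

def weight_gb(n_params: float, fmt: str) -> float:
    """Weight-only memory estimate in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in ("fp32", "bf16", "fp8"):
    print(f"27B weights in {fmt}: ~{weight_gb(27e9, fmt):.0f} GB")
# fp32 ~108 GB, bf16 ~54 GB, fp8 ~27 GB
```

Comparing the 54 GB BF16 figure with the 27 GB FP8 figure shows why halving the bytes per parameter is the single biggest lever a deployment guide can pull.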
TF32: not about saving VRAM, but about accelerating FP32 workflows
TF32 is easy to mistake for yet another memory-saving format, but its role is different.
In common terms, you can roughly think of it as a computation format that keeps a large exponent range while shortening mantissa precision.
But it is important to note that TF32 is more like an internal computation format used on the Tensor Core path, rather than something primarily used to store weights like FP16 or BF16.
It is mainly a computation mode NVIDIA provides on newer GPUs. The goal is not to reduce VRAM usage, but to make originally FP32-based training workflows run faster without requiring major code changes.
Its role can be summarized in one sentence:
- externally it still looks like an FP32 workflow
- internally it performs faster approximate matrix math
So TF32 mainly solves the problem that FP32 is too slow, not that FP32 uses too much memory. If your question is why the same model can have very different VRAM requirements, TF32 is not the main answer.
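As one concrete example of "no major code changes", PyTorch exposes TF32 as a pair of backend flags. The snippet below is a configuration sketch, assuming a recent PyTorch build on an Ampere-or-newer NVIDIA GPU; it changes only how FP32 matmuls and convolutions are computed, not how any tensor is stored:

```python
import torch

# Let Tensor Cores execute FP32 matmuls and cuDNN convolutions in TF32.
# Tensors stay FP32 in memory; only the internal compute path changes.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

Everything downstream of these two lines still reads and writes ordinary FP32 tensors, which is exactly the point.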
FP8: further compression, but much more demanding engineering
Going one step further leads to FP8. It compresses each value into even fewer bits, reducing memory bandwidth and storage cost even more.
It usually appears not as one single format, but as two common variants: E4M3 and E5M2.
But FP8 comes with an obvious cost: once the bit count gets that low, it becomes very hard to preserve both range and precision at the same time. In practice, different variants are often used for different stages to balance forward passes, backward passes, and gradients.
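E5M2 shares FP16's 5-bit exponent, so one way to get a feel for it is to truncate an FP16 value's mantissa from 10 bits down to 2. The helper below is a rough simulation (truncation rather than proper rounding; NumPy is used only because it ships a float16 type):

```python
import numpy as np

def to_e5m2(x: float) -> float:
    """Rough FP8 E5M2 simulation: take FP16 (1-5-10) and zero the
    low 8 mantissa bits, leaving the 2-bit E5M2 mantissa."""
    bits = np.float16(x).view(np.uint16)
    return float((bits & np.uint16(0xFF00)).view(np.float16))

# Range survives, since E5M2 keeps FP16's exponent...
print(to_e5m2(40000.0))  # 32768.0 -- coarse, but not infinity
# ...but with 2 mantissa bits, values snap to a very coarse grid.
print(to_e5m2(1.2))      # 1.0
```

E4M3 makes the opposite trade: 4 exponent bits and 3 mantissa bits give finer steps but a maximum value of only 448, which is why the two variants are commonly split between forward activations and backward gradients.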
This format family represents a more aggressive strategy:
- give up more precision
- gain lower storage cost and higher throughput
- rely on more mature hardware and frameworks
It has a lot of potential, but for most users, the main practical dividing lines are still FP32, FP16, and BF16.
Why understanding these formats matters
Many people first treat these abbreviations as implementation details on a download page. In practice, though, they change how you think about both training and deployment.
For example, they help explain:
- why some training setups care so much about numerical stability
- why some inference stacks emphasize quantization and low precision first
- why models with similar parameter counts can still have very different deployment requirements
- why some formats are better for storing weights while others make more sense as compute paths
If you keep unpacking those questions, they usually lead back to the same issue: how you choose to trade off precision, range, memory, and speed.
That is why understanding FP32, FP16, BF16, TF32, and FP8 is not just about decoding a glossary. It is about understanding what is really being exchanged when you read a training config, choose an inference engine, or compare deployment options.
A practical mental model
If you do not want to memorize all the details right away, it helps to remember them in this order:
- FP32: most stable, most expensive
- FP16: lower VRAM use, but smaller range
- BF16: similar VRAM cost to FP16, but better stability for large models
- TF32: mainly solves slow FP32, not VRAM usage
- FP8: a more aggressive compression and acceleration route
After that, when you see fp16, bf16, or fp8 on a model download page, or when different deployment guides give wildly different VRAM thresholds, it no longer looks like a difference in wording. Those labels reflect very different precision budgets and engineering choices.
Closing
Tensor formats in large models may look like a discussion about bit widths, but underneath they are really a discussion about engineering trade-offs.
FP32, FP16, BF16, TF32, and FP8 are not simply better or worse than one another. Each one sits at a different point on the trade-off curve between stability, range, precision, memory, and speed.
Once you understand that layer clearly, it becomes much easier to read training papers, tune inference settings, and compare deployment strategies with the right mental model.