<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>FP16 on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/fp16/</link>
        <description>Recent content in FP16 on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Wed, 22 Apr 2026 22:40:00 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/fp16/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>A Practical Guide to Common Tensor Formats in LLMs: FP32, FP16, BF16, TF32, and FP8</title>
        <link>https://www.knightli.com/en/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/</link>
        <pubDate>Wed, 22 Apr 2026 22:40:00 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/</guid>
        <description>&lt;p&gt;As soon as you start working with large-model training, inference, or deployment, you quickly run into a familiar set of abbreviations: &lt;code&gt;FP32&lt;/code&gt;, &lt;code&gt;FP16&lt;/code&gt;, &lt;code&gt;BF16&lt;/code&gt;, &lt;code&gt;TF32&lt;/code&gt;, and &lt;code&gt;FP8&lt;/code&gt;. They may look like small labels on a model page, but their impact is much bigger than a naming difference.&lt;/p&gt;
&lt;p&gt;These formats determine how numbers are stored in memory and represented during computation. They directly affect training stability, inference speed, and even how large a model a given GPU can realistically handle.&lt;/p&gt;
&lt;p&gt;So if you want to understand precision trade-offs in large models, one of the best places to start is not a benchmark chart for a specific model, but a clear picture of what these tensor formats are and why they were designed the way they are.&lt;/p&gt;
&lt;h2 id=&#34;what-tensor-formats-actually-determine&#34;&gt;What tensor formats actually determine
&lt;/h2&gt;&lt;p&gt;At its core, a large model is a massive set of matrix operations over huge numbers of parameters, and the tensor format is how those numbers are stored in memory and represented during computation.&lt;/p&gt;
&lt;p&gt;The trade-off usually revolves around three dimensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;precision&lt;/li&gt;
&lt;li&gt;VRAM usage&lt;/li&gt;
&lt;li&gt;compute speed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is actually a lot like image formats. Lossless formats preserve more detail, but take more space and load more slowly. Compressed formats discard information that is less noticeable to the eye in exchange for smaller size and faster handling. Large models can accept similar trade-offs because, across extremely large parameter sets, many tiny numerical changes do not significantly affect the final output.&lt;/p&gt;
&lt;p&gt;That is why the model world has developed a whole family of precision formats.&lt;/p&gt;
&lt;h2 id=&#34;how-a-number-is-represented&#34;&gt;How a number is represented
&lt;/h2&gt;&lt;p&gt;Before getting into the formats, it helps to remember one basic structure. A floating-point number is usually made of three parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;sign bit: determines positive or negative&lt;/li&gt;
&lt;li&gt;exponent bits: determine numerical range&lt;/li&gt;
&lt;li&gt;mantissa bits: determine numerical detail&lt;/li&gt;
&lt;/ul&gt;
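&lt;p&gt;You can inspect these three fields directly. The sketch below (plain Python, standard library only; the helper name is made up for illustration) unpacks an FP32 value into its 1 sign bit, 8 exponent bits, and 23 mantissa bits:&lt;/p&gt;

```python
import struct

def fp32_parts(x):
    """Split an FP32 value into its sign, exponent, and mantissa fields."""
    # Reinterpret the float's 4 bytes as a 32-bit unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31              # 1 bit: positive or negative
    exponent = (bits >> 23) % 256  # 8 bits: numerical range (bias 127)
    mantissa = bits % 2**23        # 23 bits: numerical detail
    return sign, exponent, mantissa

print(fp32_parts(1.0))   # (0, 127, 0): exponent field 127 encodes 2**0
print(fp32_parts(-2.0))  # (1, 128, 0): sign bit set, exponent encodes 2**1
```

&lt;p&gt;Every format in this post is just a different way of splitting this same three-part budget.&lt;/p&gt;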
&lt;p&gt;In large models, mantissa precision certainly matters, but many models are even more sensitive to insufficient numerical range: too few exponent bits raise the risk of overflow and unstable training. A lot of tensor format design is essentially about reallocating a limited number of bits between range and detail.&lt;/p&gt;
&lt;p&gt;The diagram below gives a quick overall view:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/tensor-format-overview.svg&#34; loading=&#34;lazy&#34; alt=&#34;Overview of the bit layouts of FP32, FP16, BF16, TF32, and FP8&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;fp32-the-most-stable-but-expensive&#34;&gt;FP32: the most stable, but expensive
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;FP32&lt;/code&gt; is the traditional single-precision floating-point format. It uses 32 bits in total, or 4 bytes: 1 sign bit, 8 exponent bits, and 23 mantissa bits.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/fp32-layout.svg&#34; loading=&#34;lazy&#34; alt=&#34;FP32 bit layout diagram&#34;&gt;&lt;/p&gt;
&lt;p&gt;Its strengths are straightforward:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;wide numerical range&lt;/li&gt;
&lt;li&gt;high precision&lt;/li&gt;
&lt;li&gt;the most stable training behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But the downside is just as clear: it consumes a lot of VRAM.&lt;/p&gt;
&lt;p&gt;A very rough estimate is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;VRAM usage ≈ parameter count × bytes per parameter
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If a 27B model stores weights entirely in &lt;code&gt;FP32&lt;/code&gt;, the weights alone take roughly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;27B × 4 bytes ≈ 108GB
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;And that still does not include activations, KV cache, optimizer state, or other runtime overhead. So in modern large-model training and inference, &lt;code&gt;FP32&lt;/code&gt; serves less as the default than as the most stable baseline to compare other formats against.&lt;/p&gt;
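&lt;p&gt;The estimate above is easy to turn into a one-line helper (the function and table names here are made up for illustration):&lt;/p&gt;

```python
# Rough weight-only VRAM estimate: parameter count x bytes per parameter.
# Ignores activations, KV cache, optimizer state, and other runtime overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}

def weight_gb(params_billion, fmt):
    # billions of parameters x bytes per parameter = gigabytes
    return params_billion * BYTES_PER_PARAM[fmt]

print(weight_gb(27, "fp32"))  # 108 GB: weights alone overflow most single GPUs
print(weight_gb(27, "fp16"))  # 54 GB
```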
&lt;h2 id=&#34;fp16-half-the-size-but-less-stable&#34;&gt;FP16: half the size, but less stable
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;FP16&lt;/code&gt; compresses each parameter to 2 bytes (1 sign bit, 5 exponent bits, and 10 mantissa bits), cutting memory usage roughly in half compared with &lt;code&gt;FP32&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/fp16-layout.svg&#34; loading=&#34;lazy&#34; alt=&#34;FP16 bit layout diagram&#34;&gt;&lt;/p&gt;
&lt;p&gt;For the same 27B model, if you only look at weight size:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;27B × 2 bytes ≈ 54GB
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;That already explains why many deployment guides place a 27B model around the 50GB VRAM range.&lt;/p&gt;
&lt;p&gt;The advantages of &lt;code&gt;FP16&lt;/code&gt; are obvious:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;much lower VRAM pressure&lt;/li&gt;
&lt;li&gt;higher throughput&lt;/li&gt;
&lt;li&gt;widely used in early mixed-precision training&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Its weakness is the relatively small exponent range. In large-model training, that makes overflow more likely and often requires extra techniques such as loss scaling, which adds engineering complexity.&lt;/p&gt;
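&lt;p&gt;You can see both limits with nothing but the standard library, since Python's &lt;code&gt;struct&lt;/code&gt; module can pack IEEE 754 half-precision values (the helper name is made up for illustration):&lt;/p&gt;

```python
import struct

def to_fp16(x):
    """Round-trip a value through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

print(to_fp16(65504.0))  # 65504.0: the largest finite FP16 value
print(to_fp16(1.0001))   # 1.0: with only 10 mantissa bits, the detail rounds away
# struct.pack("e", 70000.0) raises OverflowError: this is exactly the
# overflow risk that loss scaling works around during training.
```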
&lt;p&gt;So &lt;code&gt;FP16&lt;/code&gt; is still common, but in many scenarios it is no longer the most comfortable option.&lt;/p&gt;
&lt;h2 id=&#34;bf16-a-more-practical-half-precision-for-the-large-model-era&#34;&gt;BF16: a more practical half precision for the large-model era
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;BF16&lt;/code&gt; also uses 2 bytes, but it makes a different trade-off from &lt;code&gt;FP16&lt;/code&gt;: 1 sign bit, 8 exponent bits, and 7 mantissa bits.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/bf16-layout.svg&#34; loading=&#34;lazy&#34; alt=&#34;BF16 bit layout diagram&#34;&gt;&lt;/p&gt;
&lt;p&gt;It keeps a much larger exponent range, making its dynamic range closer to &lt;code&gt;FP32&lt;/code&gt;, while giving up some mantissa precision. That trade-off works especially well for large models, because they are often more sensitive to range than to losing a few mantissa bits.&lt;/p&gt;
&lt;p&gt;That is why many training frameworks, many large-model papers, and many real deployment setups prefer &lt;code&gt;BF16&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A simple way to think about it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VRAM cost close to &lt;code&gt;FP16&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;stability closer to &lt;code&gt;FP32&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
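&lt;p&gt;BF16 is literally the top half of an FP32 value, which makes it easy to sketch with the standard library (truncation only; real conversions also round, and the helper name is made up for illustration):&lt;/p&gt;

```python
import struct

def to_bf16(x):
    """Keep only the top 16 bits of an FP32 value (sign + 8 exp + 7 mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits -= bits % 65536  # zero the low 16 bits, i.e. drop 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_bf16(3.0e38))  # still a huge finite number: FP32's range survives
print(to_bf16(1.0001))  # 1.0: with 7 mantissa bits, fine detail does not
```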
&lt;p&gt;If one 27B deployment guide asks for roughly 50GB of VRAM while another optimized one gets closer to 30GB, the former often still lives in the &lt;code&gt;FP16/BF16&lt;/code&gt; layer, while the latter has usually moved further toward lower precision or quantization.&lt;/p&gt;
&lt;h2 id=&#34;tf32-not-about-saving-vram-but-about-accelerating-fp32-workflows&#34;&gt;TF32: not about saving VRAM, but about accelerating FP32 workflows
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;TF32&lt;/code&gt; is easy to mistake for yet another memory-saving format, but its role is different.&lt;/p&gt;
&lt;p&gt;In common terms, you can roughly think of it as a computation format that keeps &lt;code&gt;FP32&lt;/code&gt;'s full 8-bit exponent range while shortening the mantissa to 10 bits, about the precision of &lt;code&gt;FP16&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/tf32-layout.svg&#34; loading=&#34;lazy&#34; alt=&#34;TF32 computation format diagram&#34;&gt;&lt;/p&gt;
&lt;p&gt;But it is important to note that &lt;code&gt;TF32&lt;/code&gt; is more like an internal computation format used on the Tensor Core path, rather than something primarily used to store weights like &lt;code&gt;FP16&lt;/code&gt; or &lt;code&gt;BF16&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is a computation mode NVIDIA introduced with its Ampere generation of GPUs. The goal is not to reduce VRAM usage, but to make originally &lt;code&gt;FP32&lt;/code&gt;-based training workflows run faster without requiring major code changes.&lt;/p&gt;
&lt;p&gt;Its role can be summarized in one sentence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;externally it still looks like an &lt;code&gt;FP32&lt;/code&gt; workflow&lt;/li&gt;
&lt;li&gt;internally it performs faster approximate matrix math&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;TF32&lt;/code&gt; mainly solves the problem that &lt;code&gt;FP32&lt;/code&gt; is too slow, not that &lt;code&gt;FP32&lt;/code&gt; uses too much memory. If your question is why the same model can have very different VRAM requirements, &lt;code&gt;TF32&lt;/code&gt; is not the main answer.&lt;/p&gt;
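&lt;p&gt;The effect on values can be sketched with the standard library: the number stays stored in 32 bits, and only the low 13 mantissa bits are dropped on the compute path (truncation here for simplicity; hardware rounds, and the helper name is made up for illustration):&lt;/p&gt;

```python
import struct

def to_tf32(x):
    """Keep FP32's sign and 8 exponent bits plus the top 10 of 23 mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits -= bits % 8192  # 8192 = 2**13: zero the 13 low mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_tf32(3.0e38))  # range is untouched, same as FP32
print(to_tf32(1.0001))  # detail drops to roughly FP16 level
```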
&lt;h2 id=&#34;fp8-further-compression-but-much-more-demanding-engineering&#34;&gt;FP8: further compression, but much more demanding engineering
&lt;/h2&gt;&lt;p&gt;Going one step further leads to &lt;code&gt;FP8&lt;/code&gt;. It compresses each value into even fewer bits, reducing memory bandwidth and storage cost even more.&lt;/p&gt;
&lt;p&gt;It usually appears not as one single format, but as two common variants: &lt;code&gt;E4M3&lt;/code&gt; (4 exponent bits, 3 mantissa bits) and &lt;code&gt;E5M2&lt;/code&gt; (5 exponent bits, 2 mantissa bits).&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/fp8-layout.svg&#34; loading=&#34;lazy&#34; alt=&#34;FP8 variant diagram&#34;&gt;&lt;/p&gt;
&lt;p&gt;But &lt;code&gt;FP8&lt;/code&gt; comes with an obvious cost: once the bit count gets that low, it becomes very hard to preserve both range and precision at the same time. In practice, different variants are often used for different stages: for example, &lt;code&gt;E4M3&lt;/code&gt; for forward-pass weights and activations, where detail matters more, and &lt;code&gt;E5M2&lt;/code&gt; for gradients, which need the extra range.&lt;/p&gt;
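&lt;p&gt;A quick calculation shows how differently the two variants spend their 8 bits. The largest finite values below follow the widely used OCP FP8 convention, under which &lt;code&gt;E4M3&lt;/code&gt; reclaims its top exponent code for finite numbers while &lt;code&gt;E5M2&lt;/code&gt; reserves it for infinity and NaN, as IEEE formats do:&lt;/p&gt;

```python
# Largest finite value = 2**(max unbiased exponent) x (1 + max mantissa fraction)
E4M3_MAX = 2**8 * (1 + 6 / 8)   # 448.0: more detail, much less range
E5M2_MAX = 2**15 * (1 + 3 / 4)  # 57344.0: more range, less detail

print(E4M3_MAX, E5M2_MAX)
```

&lt;p&gt;A range that tops out at 448 versus one that reaches 57344 is exactly why the two variants end up assigned to different stages of training.&lt;/p&gt;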
&lt;p&gt;This format family represents a more aggressive strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;give up more precision&lt;/li&gt;
&lt;li&gt;gain lower storage cost and higher throughput&lt;/li&gt;
&lt;li&gt;rely on more mature hardware and frameworks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It has a lot of potential, but for most users, the main practical dividing lines are still &lt;code&gt;FP32&lt;/code&gt;, &lt;code&gt;FP16&lt;/code&gt;, and &lt;code&gt;BF16&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;why-understanding-these-formats-matters&#34;&gt;Why understanding these formats matters
&lt;/h2&gt;&lt;p&gt;Many people first treat these abbreviations as implementation details on a download page. In practice, though, they change how you think about both training and deployment.&lt;/p&gt;
&lt;p&gt;For example, they help explain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;why some training setups care so much about numerical stability&lt;/li&gt;
&lt;li&gt;why some inference stacks emphasize quantization and low precision first&lt;/li&gt;
&lt;li&gt;why models with similar parameter counts can still have very different deployment requirements&lt;/li&gt;
&lt;li&gt;why some formats are better for storing weights while others make more sense as compute paths&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you keep unpacking those questions, they usually lead back to the same issue: how you choose to trade off precision, range, memory, and speed.&lt;/p&gt;
&lt;p&gt;That is why understanding &lt;code&gt;FP32&lt;/code&gt;, &lt;code&gt;FP16&lt;/code&gt;, &lt;code&gt;BF16&lt;/code&gt;, &lt;code&gt;TF32&lt;/code&gt;, and &lt;code&gt;FP8&lt;/code&gt; is not just about decoding a glossary. It is about understanding what is really being exchanged when you read a training config, choose an inference engine, or compare deployment options.&lt;/p&gt;
&lt;h2 id=&#34;a-practical-mental-model&#34;&gt;A practical mental model
&lt;/h2&gt;&lt;p&gt;If you do not want to memorize all the details right away, it helps to remember them in this order:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;FP32&lt;/code&gt;: most stable, most expensive&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FP16&lt;/code&gt;: lower VRAM use, but smaller range&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BF16&lt;/code&gt;: similar VRAM cost to &lt;code&gt;FP16&lt;/code&gt;, but stability better suited to large models&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TF32&lt;/code&gt;: mainly solves slow &lt;code&gt;FP32&lt;/code&gt;, not VRAM usage&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FP8&lt;/code&gt;: a more aggressive compression and acceleration route&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/tensor-format-summary.svg&#34; loading=&#34;lazy&#34; alt=&#34;Summary chart of common tensor formats&#34;&gt;&lt;/p&gt;
&lt;p&gt;After that, when you see &lt;code&gt;fp16&lt;/code&gt;, &lt;code&gt;bf16&lt;/code&gt;, or &lt;code&gt;fp8&lt;/code&gt; on a model download page, or when different deployment guides give wildly different VRAM thresholds, it no longer looks like a difference in wording. Those labels reflect very different precision budgets and engineering choices.&lt;/p&gt;
&lt;h2 id=&#34;closing&#34;&gt;Closing
&lt;/h2&gt;&lt;p&gt;Tensor formats in large models may look like a discussion about bit widths, but underneath they are really a discussion about engineering trade-offs.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;FP32&lt;/code&gt;, &lt;code&gt;FP16&lt;/code&gt;, &lt;code&gt;BF16&lt;/code&gt;, &lt;code&gt;TF32&lt;/code&gt;, and &lt;code&gt;FP8&lt;/code&gt; are not simply better or worse than one another. Each one sits at a different point on the trade-off curve between stability, range, precision, memory, and speed.&lt;/p&gt;
&lt;p&gt;Once you understand that layer clearly, it becomes much easier to read training papers, tune inference settings, and compare deployment strategies with the right mental model.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
