<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Cooling on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/cooling/</link>
        <description>Recent content in Cooling on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Thu, 23 Apr 2026 11:15:10 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/cooling/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Is Tesla V100 Still Worth Buying: ECC Checks, Cooling Mods, and DIY Pitfalls</title>
        <link>https://www.knightli.com/en/2026/04/23/tesla-v100-buying-ecc-cooling-diy-guide/</link>
        <pubDate>Thu, 23 Apr 2026 11:15:10 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/23/tesla-v100-buying-ecc-cooling-diy-guide/</guid>
        <description>&lt;p&gt;If you have been looking at used &lt;code&gt;Tesla V100&lt;/code&gt; cards recently, you have probably seen two very different opinions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one side says the card is still strong and offers great value&lt;/li&gt;
&lt;li&gt;the other says the market is full of traps and DIY users can easily get burned&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both are true.&lt;/p&gt;
&lt;p&gt;The point is not that &lt;code&gt;V100&lt;/code&gt; is unbuyable. The point is that you cannot buy it the same way you would buy a normal consumer GPU. What matters is not only whether it boots, and not only whether the seller says &amp;ldquo;like new&amp;rdquo; or &amp;ldquo;pulled from an original server&amp;rdquo;. What matters is whether the card has been tampered with, what its &lt;code&gt;ECC&lt;/code&gt; condition looks like, and whether the cooling and power setup are actually reliable.&lt;/p&gt;
&lt;p&gt;This article pulls together the most useful checks for buying and using one in practice.&lt;/p&gt;
&lt;h2 id=&#34;quick-takeaways&#34;&gt;Quick Takeaways
&lt;/h2&gt;&lt;p&gt;If you only want the short version, remember these points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;V100&lt;/code&gt; was produced roughly from &lt;code&gt;2017&lt;/code&gt; to &lt;code&gt;2021&lt;/code&gt;, and &lt;code&gt;16G&lt;/code&gt; cards dated &lt;code&gt;2021&lt;/code&gt; are uncommon&lt;/li&gt;
&lt;li&gt;looking only at &amp;ldquo;zero ECC&amp;rdquo; or &amp;ldquo;original pull&amp;rdquo; is not enough, because both data and physical condition can be altered&lt;/li&gt;
&lt;li&gt;the biggest risk is often not buying an old card, but buying one that was disassembled, reflashed, or paired with a bad cooling setup&lt;/li&gt;
&lt;li&gt;for &lt;code&gt;DIY&lt;/code&gt; users, the real problem is usually not the core itself, but the adapter board, power delivery, hotspot temperature, and backplate cooling&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;1-start-with-production-date-and-batch-clues&#34;&gt;1. Start with Production Date and Batch Clues
&lt;/h2&gt;&lt;p&gt;A very practical method is to check the chip date first, then see whether the dates on nearby components match it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.knightli.com/2026/04/23/tesla-v100-buying-ecc-cooling-diy-guide/1.png&#34;
	width=&#34;1139&#34;
	height=&#34;670&#34;
	srcset=&#34;https://www.knightli.com/2026/04/23/tesla-v100-buying-ecc-cooling-diy-guide/1_hu_a8325dae98af3ae7.png 480w, https://www.knightli.com/2026/04/23/tesla-v100-buying-ecc-cooling-diy-guide/1_hu_40537b27bd676168.png 1024w&#34;
	loading=&#34;lazy&#34;
	
		alt=&#34;Tesla V100&#34;
	
	
		class=&#34;gallery-image&#34; 
		data-flex-grow=&#34;170&#34;
		data-flex-basis=&#34;408px&#34;
	
&gt;&lt;/p&gt;
&lt;p&gt;For example, if the chip surface shows &lt;code&gt;1828&lt;/code&gt;, it usually means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;18&lt;/code&gt; = year &lt;code&gt;2018&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;28&lt;/code&gt; = week &lt;code&gt;28&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So that chip was produced in week &lt;code&gt;28&lt;/code&gt; of &lt;code&gt;2018&lt;/code&gt;.&lt;/p&gt;
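&lt;p&gt;The decoding can be sketched in a few lines of Python. This is only an illustration: &lt;code&gt;decode_date_code&lt;/code&gt; is a hypothetical helper, and it assumes the marking follows the two-digit-year, two-digit-week pattern described above, which not every component does.&lt;/p&gt;

```python
def decode_date_code(code):
    """Split a four-digit YYWW marking into (year, week).

    Assumes the first two digits are the year (20xx) and the
    last two are the week number, as in the 1828 example.
    """
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected four digits, got {code!r}")
    year = 2000 + int(code[:2])
    week = int(code[2:])
    if week not in range(1, 54):
        raise ValueError(f"week {week} is outside 1-53")
    return year, week


print(decode_date_code("1828"))  # (2018, 28)
```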
&lt;p&gt;Besides the chip package, nearby inductors often carry date-related markings too. If the chip date and inductor date are far apart, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;chip date is &lt;code&gt;2017&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;inductors point to &lt;code&gt;2020&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then you should be cautious. A mismatch does not automatically prove the card is bad, but it does suggest the card is no longer in original condition.&lt;/p&gt;
&lt;p&gt;On the other hand, if the dates broadly line up, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a &lt;code&gt;2018&lt;/code&gt; chip with &lt;code&gt;2018&lt;/code&gt; surrounding components&lt;/li&gt;
&lt;li&gt;a late &lt;code&gt;2019&lt;/code&gt; chip paired with &lt;code&gt;2020&lt;/code&gt; components&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;that is much more normal.&lt;/p&gt;
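&lt;p&gt;To make &amp;ldquo;far apart&amp;rdquo; concrete, the gap between two date codes can be computed in weeks. The sketch below assumes both markings are four-digit year-week codes using ISO week numbering, and the 52-week cutoff is an arbitrary value chosen for illustration, not a rule from any vendor.&lt;/p&gt;

```python
from datetime import date


def weeks_between(code_a, code_b):
    """Return the gap in whole weeks between two YYWW date codes."""
    def monday_of(code):
        # Assumption: YY is a 20xx year and WW is an ISO week number.
        year, week = 2000 + int(code[:2]), int(code[2:])
        return date.fromisocalendar(year, week, 1)

    return abs((monday_of(code_a) - monday_of(code_b)).days) // 7


# Hypothetical cutoff: anything beyond about a year deserves a closer look.
SUSPICIOUS_GAP_WEEKS = 52

for chip, inductor in [("1828", "1835"), ("1720", "2010")]:
    gap = weeks_between(chip, inductor)
    verdict = "check closely" if gap > SUSPICIOUS_GAP_WEEKS else "plausible"
    print(chip, inductor, gap, verdict)
```

&lt;p&gt;A large gap is only a prompt to inspect the card more carefully. It is not proof of tampering on its own.&lt;/p&gt;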
&lt;h2 id=&#34;2-do-not-only-look-at-the-chip-check-inductors-springs-and-frame&#34;&gt;2. Do Not Only Look at the Chip: Check Inductors, Springs, and Frame
&lt;/h2&gt;&lt;p&gt;Visual inspection is best broken into a few separate checks.&lt;/p&gt;
&lt;h3 id=&#34;1-touch-the-inductors-first&#34;&gt;1. Touch the inductors first
&lt;/h3&gt;&lt;p&gt;Gently press or touch the inductors. Under normal conditions, none of them should feel loose.&lt;/p&gt;
&lt;p&gt;If one of them is already moving, it usually means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the solder condition is not healthy&lt;/li&gt;
&lt;li&gt;the problem may worsen with continued use&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even if the card still works now, that is not a good sign.&lt;/p&gt;
&lt;h3 id=&#34;2-check-whether-the-retaining-spring-has-been-removed-before&#34;&gt;2. Check whether the retaining spring has been removed before
&lt;/h3&gt;&lt;p&gt;The logic is simple: if the seller insists this is an &amp;ldquo;original server pull&amp;rdquo;, then the retaining spring generally should not have been removed.&lt;/p&gt;
&lt;p&gt;In a normal factory server environment, people do not usually remove this spring for no reason.&lt;/p&gt;
&lt;p&gt;If the spring comes off very easily, the card was probably opened before. If the seller is also claiming it is untouched, that claim deserves skepticism.&lt;/p&gt;
&lt;h3 id=&#34;3-if-the-frame-comes-apart-too-easily-that-is-also-suspicious&#34;&gt;3. If the frame comes apart too easily, that is also suspicious
&lt;/h3&gt;&lt;p&gt;Once the middle frame is removed, if the whole structure separates with almost no effort, that usually means the card has already been disassembled multiple times.&lt;/p&gt;
&lt;p&gt;That matters on used &lt;code&gt;V100&lt;/code&gt; cards because reflashing, modification, and repair work often leave exactly these kinds of traces.&lt;/p&gt;
&lt;h2 id=&#34;3-if-the-backplate-separates-too-easily-suspect-a-reflash-or-prior-tampering&#34;&gt;3. If the Backplate Separates Too Easily, Suspect a Reflash or Prior Tampering
&lt;/h2&gt;&lt;p&gt;One especially important detail is that there is a metal plate under the &lt;code&gt;PCB&lt;/code&gt;. It is not only for protection; it also helps with heat dissipation.&lt;/p&gt;
&lt;p&gt;In a normal original condition, this backplate is usually not easy to remove. Reasons include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;adhesive&lt;/li&gt;
&lt;li&gt;a tight structural fit&lt;/li&gt;
&lt;li&gt;the design was not meant for repeated disassembly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the backplate separates from the &lt;code&gt;PCB&lt;/code&gt; with only a little force, then you should suspect:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it has been opened before&lt;/li&gt;
&lt;li&gt;the card may have had its &lt;code&gt;VBIOS&lt;/code&gt; reflashed&lt;/li&gt;
&lt;li&gt;there may have been secondary modifications&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That does not automatically make it unusable, but it is clearly inconsistent with &amp;ldquo;original and untouched.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;4-how-to-read-ecc-what-matters-most-is-not-whether-it-is-zero-but-whether-it-grows&#34;&gt;4. How to Read &lt;code&gt;ECC&lt;/code&gt;: What Matters Most Is Not Whether It Is Zero, but Whether It Grows
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;ECC&lt;/code&gt; is one of the first things people look at on a &lt;code&gt;V100&lt;/code&gt;, and it really needs to be interpreted carefully.&lt;/p&gt;
&lt;p&gt;A common method is to use &lt;code&gt;nvidia-smi&lt;/code&gt; in detailed mode and check the &lt;code&gt;ECC Errors&lt;/code&gt; section.&lt;/p&gt;
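&lt;p&gt;As a rough sketch, the detailed query can be wrapped in a small Python script. The &lt;code&gt;-q&lt;/code&gt;, &lt;code&gt;-d&lt;/code&gt;, and &lt;code&gt;-i&lt;/code&gt; options are standard &lt;code&gt;nvidia-smi&lt;/code&gt; flags; the helper names &lt;code&gt;ecc_report_command&lt;/code&gt; and &lt;code&gt;read_ecc_report&lt;/code&gt; are made up for this example, and the script simply returns nothing if the tool is not installed.&lt;/p&gt;

```python
import shutil
import subprocess


def ecc_report_command(gpu_index=0):
    """Build the nvidia-smi call that prints the ECC and page retirement sections."""
    return ["nvidia-smi", "-q", "-d", "ECC,PAGE_RETIREMENT", "-i", str(gpu_index)]


def read_ecc_report(gpu_index=0):
    """Return the raw report text, or None if nvidia-smi is not on the PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ecc_report_command(gpu_index),
        capture_output=True,
        text=True,
        check=False,  # leave error handling to the caller
    )
    return result.stdout


report = read_ecc_report()
print(report if report is not None else "nvidia-smi not found on this machine")
```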
&lt;h3 id=&#34;1-real-time-errors-are-the-most-dangerous&#34;&gt;1. Real-time errors are the most dangerous
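&lt;/h3&gt;&lt;p&gt;The upper section, labeled &lt;code&gt;Volatile&lt;/code&gt; in &lt;code&gt;nvidia-smi&lt;/code&gt; output, counts errors since the driver was last loaded, so it can be read as real-time errors.&lt;/p&gt;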
&lt;p&gt;If those numbers keep increasing while the card is running, that usually means the card is already in an unstable state.&lt;/p&gt;
&lt;p&gt;In simple terms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a card that runs without new errors matters more than a static zero reading&lt;/li&gt;
&lt;li&gt;a card that starts increasing errors under stress is much more worrying than one with only historical accumulated counts&lt;/li&gt;
&lt;/ul&gt;
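&lt;p&gt;One way to apply this is to note the counters before and after a stress run and compare them. The sketch below is only an illustration: the counter names and numbers are invented, and the values are assumed to be copied by hand from two readings.&lt;/p&gt;

```python
def grown_counters(before, after):
    """Return the counters that increased between two snapshots, with their deltas."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


# Hypothetical readings taken before and after a stress run.
before = {"volatile_single_bit": 0, "volatile_double_bit": 0, "aggregate_single_bit": 7}
after = {"volatile_single_bit": 3, "volatile_double_bit": 0, "aggregate_single_bit": 10}

print(grown_counters(before, after))
# {'volatile_single_bit': 3, 'aggregate_single_bit': 3}
```

&lt;p&gt;An empty result after a long run is more reassuring than a single zero reading taken at idle.&lt;/p&gt;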
&lt;h3 id=&#34;2-lifetime-accumulated-errors-are-not-always-scary&#34;&gt;2. Lifetime accumulated errors are not always scary
&lt;/h3&gt;&lt;p&gt;Another section, labeled &lt;code&gt;Aggregate&lt;/code&gt;, shows lifetime accumulated errors, meaning how many corrected or uncorrected events happened across the card&amp;rsquo;s life.&lt;/p&gt;
&lt;p&gt;If those values are only:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;single digits&lt;/li&gt;
&lt;li&gt;or maybe in the teens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;that is not automatically a disaster.&lt;/p&gt;
&lt;p&gt;If real-time errors do not continue increasing during actual use, the card may still be perfectly usable.&lt;/p&gt;
&lt;h3 id=&#34;3-the-page-retirement-section-deserves-more-attention&#34;&gt;3. The page retirement section deserves more attention
&lt;/h3&gt;&lt;p&gt;The page retirement section is even more important, because it lists memory blocks (pages) that the driver retired after repeated single-bit errors or a double-bit error.&lt;/p&gt;
&lt;p&gt;A practical way to think about it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;single-bit and double-bit categories may each have retired blocks&lt;/li&gt;
&lt;li&gt;if the total climbs past &lt;code&gt;10&lt;/code&gt;, you are entering a range where caution is warranted&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That does not always mean the card is unusable, but it does suggest reduced effective memory and weaker long-term confidence.&lt;/p&gt;
&lt;h2 id=&#34;5-do-not-worship-zero-ecc-the-data-itself-can-be-manipulated&#34;&gt;5. Do Not Worship &amp;ldquo;Zero ECC&amp;rdquo;: The Data Itself Can Be Manipulated
&lt;/h2&gt;&lt;p&gt;There is a very practical warning here:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;ECC&lt;/code&gt; numbers are not inherently sacred.&lt;/p&gt;
&lt;p&gt;If a card has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;extremely clean-looking data&lt;/li&gt;
&lt;li&gt;but obvious signs of disassembly&lt;/li&gt;
&lt;li&gt;and a structure that clearly looks worked on&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then you should not trust &amp;ldquo;zero ECC&amp;rdquo; by itself.&lt;/p&gt;
&lt;p&gt;A useful analogy is an old car that suddenly shows &lt;code&gt;0&lt;/code&gt; mileage and almost no tire wear after many years. It is hard not to suspect the odometer was touched.&lt;/p&gt;
&lt;p&gt;The same idea applies to &lt;code&gt;V100&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;numbers that look too perfect are not always good news&lt;/li&gt;
&lt;li&gt;what matters is whether the data, the physical condition, and the stress-test behavior all make sense together&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;6-stress-testing-is-necessary-but-testing-only-the-core-is-not-enough&#34;&gt;6. Stress Testing Is Necessary, but Testing Only the Core Is Not Enough
&lt;/h2&gt;&lt;p&gt;You can use a tool such as &lt;code&gt;gpu-burn&lt;/code&gt; to stress the card for several minutes or longer and watch:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether it remains stable&lt;/li&gt;
&lt;li&gt;whether the card drops out&lt;/li&gt;
&lt;li&gt;whether new &lt;code&gt;ECC&lt;/code&gt; errors appear&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But there is another important point:&lt;/p&gt;
&lt;p&gt;Testing only the core does not prove the entire card is healthy.&lt;/p&gt;
&lt;p&gt;A lot of &lt;code&gt;V100&lt;/code&gt; failures do not start with the core. They start with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;overheating in the power-delivery area&lt;/li&gt;
&lt;li&gt;insufficient cooling around the backplate&lt;/li&gt;
&lt;li&gt;excessive hotspot temperatures&lt;/li&gt;
&lt;li&gt;adapter boards and cooling systems that are always operating too close to the edge&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So stress testing only proves that &amp;ldquo;the card can run right now.&amp;rdquo; It does not prove that &amp;ldquo;this DIY setup will survive in the long run.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;7-for-diy-users-the-real-failure-point-is-usually-cooling-and-power-not-the-purchase-itself&#34;&gt;7. For DIY Users, the Real Failure Point Is Usually Cooling and Power, Not the Purchase Itself
&lt;/h2&gt;&lt;p&gt;This is probably the most important part of the entire topic.&lt;/p&gt;
&lt;p&gt;The core idea is simple:&lt;/p&gt;
&lt;p&gt;For &lt;code&gt;DIY&lt;/code&gt; users, casually combining an adapter base with a generic cooler is not a robust plan.&lt;/p&gt;
&lt;p&gt;That is because &lt;code&gt;V100&lt;/code&gt; is not a normal consumer card. It is a server accelerator with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;high power draw&lt;/li&gt;
&lt;li&gt;high heat density&lt;/li&gt;
&lt;li&gt;complicated heat distribution&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The chip is not the only thing producing heat. The backplate, power area, and connector region also get hot, and sometimes very hot.&lt;/p&gt;
&lt;h3 id=&#34;1-do-not-only-watch-average-gpu-temperature&#34;&gt;1. Do not only watch average GPU temperature
&lt;/h3&gt;&lt;p&gt;Many monitoring tools show the average card temperature, but the more dangerous number is often the &lt;code&gt;hot spot&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the visible temperature may only be in the 60s Celsius&lt;/li&gt;
&lt;li&gt;while local hotspots may already be over 100°C&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why some DIY &lt;code&gt;V100&lt;/code&gt; builds look &amp;ldquo;fine&amp;rdquo; on paper and then suddenly die later.&lt;/p&gt;
&lt;h3 id=&#34;2-backplate-cooling-must-be-considered&#34;&gt;2. Backplate cooling must be considered
&lt;/h3&gt;&lt;p&gt;Cooling for the backplate and power area cannot be ignored.&lt;/p&gt;
&lt;p&gt;If you only cool the core, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;MOS&lt;/code&gt; (power MOSFET) area is neglected&lt;/li&gt;
&lt;li&gt;the backplate gets no heat transfer help&lt;/li&gt;
&lt;li&gt;the rear side lacks proper thermal design&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then the full setup is still incomplete.&lt;/p&gt;
&lt;h3 id=&#34;3-cheap-improvised-water-cooling-setups-are-risky&#34;&gt;3. Cheap improvised water-cooling setups are risky
&lt;/h3&gt;&lt;p&gt;You should be cautious about the &amp;ldquo;random adapter board + cheap AIO water cooler&amp;rdquo; style setup.&lt;/p&gt;
&lt;p&gt;The issue is not that it always fails immediately. The issue is that it often has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;uneven water-channel coverage&lt;/li&gt;
&lt;li&gt;incomplete cooling for the power-delivery area&lt;/li&gt;
&lt;li&gt;poor control of the actual hotspot zones&lt;/li&gt;
&lt;li&gt;unpredictable long-term lifespan&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;8-if-you-still-want-to-diy-at-least-watch-these-points&#34;&gt;8. If You Still Want to DIY, At Least Watch These Points
&lt;/h2&gt;&lt;p&gt;The most practical recommendations are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;prefer more mature adapter-board solutions with a better track record&lt;/li&gt;
&lt;li&gt;do not focus only on the core; the rear power area and backplate need thermal attention too&lt;/li&gt;
&lt;li&gt;the water block needs real coverage and even heat handling, not just physical contact&lt;/li&gt;
&lt;li&gt;after stress testing, keep watching temperatures, hotspots, and long-term behavior&lt;/li&gt;
&lt;li&gt;PSU quality also affects coil whine and overall stability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, the hard part of a DIY &lt;code&gt;V100&lt;/code&gt; build is not &amp;ldquo;getting it to boot.&amp;rdquo; The hard part is &amp;ldquo;keeping it alive and stable afterward.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;9-coil-whine-and-adapter-board-variance-are-real-problems-too&#34;&gt;9. Coil Whine and Adapter-Board Variance Are Real Problems Too
&lt;/h2&gt;&lt;p&gt;Two more points are often overlooked.&lt;/p&gt;
&lt;h3 id=&#34;1-coil-whine-may-not-be-fully-eliminable&#34;&gt;1. Coil whine may not be fully eliminable
&lt;/h3&gt;&lt;p&gt;Coil whine depends on the individual card, its inductors and capacitors, and the power environment. It is not something you can always solve with one cable or one small accessory.&lt;/p&gt;
&lt;h3 id=&#34;2-adapter-board-variance-is-huge&#34;&gt;2. Adapter-board variance is huge
&lt;/h3&gt;&lt;p&gt;Adapter boards differ widely from one another, which is why some sellers, even when they are willing to sell a bare card, still emphasize:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;bench-testing it first&lt;/li&gt;
&lt;li&gt;recording the serial number&lt;/li&gt;
&lt;li&gt;doing stress tests&lt;/li&gt;
&lt;li&gt;documenting the process&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A lot of disputes are not caused by the silicon itself. They are caused by the adapter board and cooling solution paired with it afterward.&lt;/p&gt;
&lt;h2 id=&#34;closing&#34;&gt;Closing
&lt;/h2&gt;&lt;p&gt;So, is &lt;code&gt;Tesla V100&lt;/code&gt; still worth buying? Yes, but only if you understand what you are buying and how you plan to use it afterward.&lt;/p&gt;
&lt;p&gt;If you only check:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether it powers on&lt;/li&gt;
&lt;li&gt;whether &lt;code&gt;ECC&lt;/code&gt; is all zero&lt;/li&gt;
&lt;li&gt;whether the seller says &amp;ldquo;original pull&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;that is nowhere near enough.&lt;/p&gt;
&lt;p&gt;The more useful things to verify are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether the dates and batch clues line up&lt;/li&gt;
&lt;li&gt;whether there are suspicious signs of prior disassembly&lt;/li&gt;
&lt;li&gt;whether the backplate and structure were clearly opened before&lt;/li&gt;
&lt;li&gt;whether errors increase under stress testing&lt;/li&gt;
&lt;li&gt;whether your cooling and power setup are actually trustworthy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Especially for &lt;code&gt;DIY&lt;/code&gt; users, the most dangerous part of &lt;code&gt;V100&lt;/code&gt; is often not &amp;ldquo;buying an old card&amp;rdquo;, but underestimating how demanding these cards are about cooling, power delivery, and modification quality.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
