<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>NVIDIA on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/nvidia/</link>
        <description>Recent content in NVIDIA on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Fri, 24 Apr 2026 14:41:35 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/nvidia/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>What Is NVIDIA nvbandwidth: How to Use This GPU Bandwidth Testing Tool</title>
        <link>https://www.knightli.com/en/2026/04/24/nvidia-nvbandwidth-guide/</link>
        <pubDate>Fri, 24 Apr 2026 14:41:35 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/24/nvidia-nvbandwidth-guide/</guid>
<description>&lt;p&gt;If you have recently been troubleshooting interconnect performance between multiple &lt;code&gt;NVIDIA GPU&lt;/code&gt;s, or you want to verify the real bandwidth delivered over &lt;code&gt;PCIe&lt;/code&gt; and &lt;code&gt;NVLink&lt;/code&gt;, or between host memory and VRAM, &lt;code&gt;NVIDIA/nvbandwidth&lt;/code&gt; is a small tool worth knowing about.&lt;/p&gt;
&lt;p&gt;It is not a general benchmark utility, and it is not a hidden command inside a large model framework. It is an open-source tool from NVIDIA specifically designed to measure bandwidth and latency for GPU-related memory copies. Instead of only looking at theoretical bandwidth, &lt;code&gt;nvbandwidth&lt;/code&gt; is better at answering a practical question: &lt;strong&gt;how much bandwidth can this machine and its current GPU interconnects actually deliver right now?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&#34;1-what-does-nvbandwidth-do&#34;&gt;1. What does &lt;code&gt;nvbandwidth&lt;/code&gt; do
&lt;/h2&gt;&lt;p&gt;According to the official README, &lt;code&gt;nvbandwidth&lt;/code&gt; is a command-line tool for measuring bandwidth on &lt;code&gt;NVIDIA GPU&lt;/code&gt;s.&lt;/p&gt;
&lt;p&gt;It mainly focuses on transfer performance across different &lt;code&gt;memcpy&lt;/code&gt; patterns, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GPU -&amp;gt; GPU&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CPU -&amp;gt; GPU&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GPU -&amp;gt; CPU&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Transfers between GPUs across multiple nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tests are especially useful in scenarios like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Troubleshooting interconnect bottlenecks in multi-GPU training or inference&lt;/li&gt;
&lt;li&gt;Verifying the actual behavior of links such as &lt;code&gt;NVLink&lt;/code&gt;, &lt;code&gt;PCIe&lt;/code&gt;, and &lt;code&gt;C2C&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Comparing transfer differences across servers, topologies, drivers, or CUDA versions&lt;/li&gt;
&lt;li&gt;Performing baseline hardware validation before cluster deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, &lt;code&gt;nvbandwidth&lt;/code&gt; is not about model throughput. It is about the lower-level ability to move data.&lt;/p&gt;
&lt;h2 id=&#34;2-it-does-not-produce-just-one-simple-score&#34;&gt;2. It does not produce just one simple score
&lt;/h2&gt;&lt;p&gt;Many people think of a bandwidth test as something that ends with a single number, but &lt;code&gt;nvbandwidth&lt;/code&gt; provides more detailed output than that.&lt;/p&gt;
&lt;p&gt;It reports results as matrices for each test type. For example, in a test like &lt;code&gt;device_to_device_memcpy_write_ce&lt;/code&gt;, it shows the bandwidth between each pair of GPUs by row and column. That means you can see more than just a rough system-wide speed estimate. You can also spot:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which GPU pairs are especially fast&lt;/li&gt;
&lt;li&gt;Which paths are clearly limited by &lt;code&gt;PCIe&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Whether certain GPU pairs show abnormally low bandwidth&lt;/li&gt;
&lt;li&gt;Whether the multi-GPU topology matches your expectations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are working with an 8-GPU server, a dual-socket platform, or a multinode system, this matrix-style output is often more useful than a single average number.&lt;/p&gt;
&lt;h2 id=&#34;3-how-to-understand-ce-and-sm-copies&#34;&gt;3. How to understand &lt;code&gt;CE&lt;/code&gt; and &lt;code&gt;SM&lt;/code&gt; copies
&lt;/h2&gt;&lt;p&gt;The official documentation splits tests into two categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CE&lt;/code&gt;: copy engine transfers based on &lt;code&gt;memcpy&lt;/code&gt; APIs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SM&lt;/code&gt;: kernel-based transfers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These two result types are not guaranteed to match exactly, because they represent different copy paths.&lt;br&gt;
If you mainly want to understand regular device-to-device transfer behavior, you will usually look at &lt;code&gt;CE&lt;/code&gt; first. If you want to study execution details more closely, then &lt;code&gt;SM&lt;/code&gt; is worth checking too.&lt;/p&gt;
&lt;p&gt;The README also explains that bandwidth results use the median across multiple test runs by default. Newer versions additionally include variability statistics, which makes it easier to judge how stable the numbers are.&lt;/p&gt;
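&lt;p&gt;For a quick side-by-side look at the two copy paths, you can run the &lt;code&gt;CE&lt;/code&gt; and &lt;code&gt;SM&lt;/code&gt; variants of the same transfer pattern back to back. The test names below follow the naming scheme shown by &lt;code&gt;./nvbandwidth -l&lt;/code&gt;; confirm the exact names on your build first:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Copy-engine (CE) path: transfers driven by memcpy APIs
./nvbandwidth -t device_to_device_memcpy_read_ce

# Kernel (SM) path: the same transfer pattern, executed by CUDA kernels
./nvbandwidth -t device_to_device_memcpy_read_sm
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some difference between the two matrices is expected, since the copies take different hardware paths.&lt;/p&gt;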
&lt;h2 id=&#34;4-what-environment-does-it-require&#34;&gt;4. What environment does it require
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;nvbandwidth&lt;/code&gt; is not distributed as a prebuilt binary that you simply download and run. You build it from source, so it expects a standard CUDA development environment.&lt;/p&gt;
&lt;p&gt;The current README lists these basic requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CUDA Toolkit 11.x&lt;/code&gt; or newer&lt;/li&gt;
&lt;li&gt;A compiler with &lt;code&gt;C++17&lt;/code&gt; support&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CMake 3.20+&lt;/code&gt;, with &lt;code&gt;3.24+&lt;/code&gt; recommended&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Boost program_options&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A usable &lt;code&gt;CUDA&lt;/code&gt; device and a compatible driver&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The requirements are higher if you want the multinode version. The current README explicitly states:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multinode builds require &lt;code&gt;CUDA Toolkit 12.3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The driver must be &lt;code&gt;550&lt;/code&gt; or newer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MPI&lt;/code&gt; is required&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;nvidia-imex&lt;/code&gt; service must be configured&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So this is much more of an engineering tool for Linux GPU servers and clusters than something aimed at casual desktop use.&lt;/p&gt;
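&lt;p&gt;On a recent &lt;code&gt;Ubuntu&lt;/code&gt; or &lt;code&gt;Debian&lt;/code&gt; system, the single-node dependencies above map to a handful of packages. The package names below are the typical ones, and this sketch assumes the CUDA Toolkit is already installed from NVIDIA&amp;rsquo;s repositories:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Compiler, CMake, and Boost.Program_options (typical Ubuntu/Debian names)
sudo apt-get install -y build-essential cmake libboost-program-options-dev
&lt;/code&gt;&lt;/pre&gt;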
&lt;h2 id=&#34;5-how-to-build-and-run-the-single-node-version&#34;&gt;5. How to build and run the single-node version
&lt;/h2&gt;&lt;p&gt;The single-node build process is straightforward:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake .
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;make
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;On &lt;code&gt;Ubuntu&lt;/code&gt; / &lt;code&gt;Debian&lt;/code&gt;, the project also provides a &lt;code&gt;debian_install.sh&lt;/code&gt; script that installs common dependencies and builds the project.&lt;/p&gt;
&lt;p&gt;After building, you can check the help output first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./nvbandwidth -h
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Some commonly used options include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-l&lt;/code&gt;: list available tests&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-t&lt;/code&gt;: run a specific test by name or index&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-p&lt;/code&gt;: run tests by prefix&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-b&lt;/code&gt;: set the memcpy buffer size, default &lt;code&gt;512 MiB&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-i&lt;/code&gt;: set the number of benchmark iterations&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-j&lt;/code&gt;: output &lt;code&gt;JSON&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-H&lt;/code&gt;: enable huge pages for host memory allocation&lt;/li&gt;
&lt;/ul&gt;
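&lt;p&gt;Combining these options, a typical automation-friendly run might look like the sketch below. The buffer size and iteration count are illustrative values, not recommendations; check &lt;code&gt;./nvbandwidth -h&lt;/code&gt; for the exact units and defaults on your version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Run all tests whose names start with device_to_device,
# with a smaller buffer, 10 iterations, and JSON output saved for analysis
./nvbandwidth -p device_to_device -b 256 -i 10 -j &amp;gt; results.json
&lt;/code&gt;&lt;/pre&gt;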
&lt;p&gt;If you just want to run the default test suite once, use:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./nvbandwidth
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you only want to test one specific item, such as a device-to-device copy:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./nvbandwidth -t device_to_device_memcpy_read_ce
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;6-multinode-support-is-one-of-its-standout-features&#34;&gt;6. Multinode support is one of its standout features
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;nvbandwidth&lt;/code&gt; is not only for single-node multi-GPU testing. It also supports multinode scenarios.&lt;/p&gt;
&lt;p&gt;According to the README, the multinode build is done like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake -DMULTINODE&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; .
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;make
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;At runtime, it is typically used together with &lt;code&gt;mpirun&lt;/code&gt;, with one process launched per GPU.&lt;br&gt;
The documentation also requires all participating ranks to belong to the same multinode clique, and it recommends mainly running tests with the &lt;code&gt;multinode&lt;/code&gt; prefix under MPI.&lt;/p&gt;
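&lt;p&gt;As a sketch, a run across two hosts with four GPUs each might look like this. The hostnames, slot counts, and the exact multinode test name are illustrative; list the real test names with &lt;code&gt;./nvbandwidth -l&lt;/code&gt; on the multinode build, and adjust the &lt;code&gt;mpirun&lt;/code&gt; host syntax to your MPI implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# One MPI rank per GPU: 8 ranks across two 4-GPU nodes (Open MPI syntax)
mpirun -np 8 -H node1:4,node2:4 ./nvbandwidth -t multinode_device_to_device_memcpy_read_ce
&lt;/code&gt;&lt;/pre&gt;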
&lt;p&gt;That makes its positioning much closer to high-performance computing and large GPU systems than to simple workstation self-checks.&lt;/p&gt;
&lt;p&gt;If you are working with &lt;code&gt;NVLink&lt;/code&gt; multinode deployments or more complex platforms such as &lt;code&gt;GB200&lt;/code&gt; / &lt;code&gt;Grace Hopper&lt;/code&gt;, the value of &lt;code&gt;nvbandwidth&lt;/code&gt; is much higher than it would be on a typical consumer GPU setup.&lt;/p&gt;
&lt;h2 id=&#34;7-what-changed-in-v09&#34;&gt;7. What changed in &lt;code&gt;v0.9&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;As of &lt;strong&gt;April 24, 2026&lt;/strong&gt;, the GitHub Releases page shows that the latest version of &lt;code&gt;nvbandwidth&lt;/code&gt; is &lt;strong&gt;&lt;code&gt;v0.9&lt;/code&gt;&lt;/strong&gt;, released on &lt;strong&gt;April 8, 2026&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The most notable updates in this release include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Added variability statistics to bandwidth output&lt;/li&gt;
&lt;li&gt;Added huge page support for host memory allocations (not available on &lt;code&gt;Windows&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Added pair sampling for device-to-device tests&lt;/li&gt;
&lt;li&gt;Added a troubleshooting guide&lt;/li&gt;
&lt;li&gt;Unified single-node and multinode execution paths&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Two engineering-oriented changes are also worth noting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improved CUDA architecture detection without relying as much on direct GPU access&lt;/li&gt;
&lt;li&gt;Deprecated &lt;code&gt;Volta&lt;/code&gt; (&lt;code&gt;sm_70&lt;/code&gt; / &lt;code&gt;sm_72&lt;/code&gt;) support in &lt;code&gt;CUDA Toolkit 13.0+&lt;/code&gt; environments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So if you only looked at early versions before, &lt;code&gt;v0.9&lt;/code&gt; is no longer just a basic bandwidth tester. It is clearly moving toward better automation, troubleshooting, and large-scale system validation.&lt;/p&gt;
&lt;h2 id=&#34;8-when-is-it-a-good-fit&#34;&gt;8. When is it a good fit
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;nvbandwidth&lt;/code&gt; is especially suitable when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want to verify real interconnect bandwidth between multiple &lt;code&gt;NVIDIA GPU&lt;/code&gt;s&lt;/li&gt;
&lt;li&gt;You suspect one GPU is installed in a bandwidth-limited &lt;code&gt;PCIe&lt;/code&gt; slot&lt;/li&gt;
&lt;li&gt;You want to compare &lt;code&gt;NVLink&lt;/code&gt; paths against non-&lt;code&gt;NVLink&lt;/code&gt; paths&lt;/li&gt;
&lt;li&gt;You are deploying a multinode GPU cluster and need to validate the links&lt;/li&gt;
&lt;li&gt;You want test results in &lt;code&gt;JSON&lt;/code&gt; for automation pipelines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But if your goal is only to answer questions like &amp;ldquo;how fast is training&amp;rdquo; or &amp;ldquo;how many tokens per second can inference reach,&amp;rdquo; this tool is not the whole answer.&lt;br&gt;
In that case, you still need workload-level testing with your training framework, inference engine, or real application.&lt;/p&gt;
&lt;h2 id=&#34;9-how-to-think-about-its-value&#34;&gt;9. How to think about its value
&lt;/h2&gt;&lt;p&gt;Many GPU performance problems are not really caused by insufficient compute. They happen because the data path is not working as expected.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPUs are not using the intended interconnect path&lt;/li&gt;
&lt;li&gt;Cross-NUMA access is reducing speed&lt;/li&gt;
&lt;li&gt;Certain GPU pairs have abnormal bandwidth&lt;/li&gt;
&lt;li&gt;Multinode communication is only partially configured&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These issues are often hard to diagnose if you only look at &lt;code&gt;nvidia-smi&lt;/code&gt; or model throughput.&lt;br&gt;
A lower-level, matrix-oriented tool like &lt;code&gt;nvbandwidth&lt;/code&gt; is useful precisely because it exposes what is happening at the interconnect layer.&lt;/p&gt;
&lt;p&gt;So a simple way to think about it is: &lt;strong&gt;&lt;code&gt;nvbandwidth&lt;/code&gt; is a command-line health check tool for bandwidth on NVIDIA GPU systems.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&#34;related-links&#34;&gt;Related links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;GitHub project: &lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/nvbandwidth&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/NVIDIA/nvbandwidth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Releases: &lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/nvbandwidth/releases&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/NVIDIA/nvbandwidth/releases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
