What Is NVIDIA nvbandwidth: How to Use This GPU Bandwidth Testing Tool

Based on the official NVIDIA/nvbandwidth repository and Releases page, this article explains what the GPU bandwidth testing tool does, what it depends on, how to use it, how multinode testing works, and what changed in v0.9.

If you have recently been troubleshooting interconnect performance between multiple NVIDIA GPUs, or you want to verify the real bandwidth between PCIe, NVLink, host memory, and VRAM, NVIDIA/nvbandwidth is a small tool worth knowing about.

It is not a general-purpose benchmark utility, nor a hidden command inside some larger framework. It is an open-source tool from NVIDIA designed specifically to measure bandwidth and latency for GPU-related memory copies. Rather than quoting theoretical peak bandwidth, nvbandwidth answers a practical question: how much bandwidth can this machine and its current GPU interconnects actually deliver right now?

1. What does nvbandwidth do

According to the official README, nvbandwidth is a command-line tool for measuring bandwidth on NVIDIA GPUs.

It mainly focuses on transfer performance across different memcpy patterns, such as:

  • GPU -> GPU
  • CPU -> GPU
  • GPU -> CPU
  • Transfers between GPUs across multiple nodes

These tests are especially useful in scenarios like:

  • Troubleshooting interconnect bottlenecks in multi-GPU training or inference
  • Verifying the actual behavior of links such as NVLink, PCIe, and C2C
  • Comparing transfer differences across servers, topologies, drivers, or CUDA versions
  • Performing baseline hardware validation before cluster deployment

In short, nvbandwidth is not about model throughput. It is about the lower-level ability to move data.

2. It does not produce just one simple score

Many people think of a bandwidth test as something that ends with a single number, but nvbandwidth provides more detailed output than that.

It reports results as matrices for each test type. For example, in a test like device_to_device_memcpy_write_ce, it shows the bandwidth between each pair of GPUs by row and column. That means you can see more than just a rough system-wide speed estimate. You can also spot:

  • Which GPU pairs are especially fast
  • Which paths are clearly limited by PCIe
  • Whether certain GPU pairs show abnormally low bandwidth
  • Whether the multi-GPU topology matches your expectations

If you are working with an 8-GPU server, a dual-socket platform, or a multinode system, this matrix-style output is often more useful than a single average number.
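One practical use of the matrix output is scanning it for weak links automatically. The sketch below is a minimal illustration of that idea; the matrix values are synthetic and the flagging heuristic (anything well below the off-diagonal median) is my own, not part of nvbandwidth.

```python
# Sketch: scan a per-pair bandwidth matrix (GB/s) for weak links.
# The matrix below is synthetic illustration data, not real nvbandwidth output.

def find_slow_pairs(matrix, threshold_ratio=0.5):
    """Return (src, dst, value) entries well below the off-diagonal median."""
    values = [v for i, row in enumerate(matrix)
                for j, v in enumerate(row) if i != j]
    values.sort()
    median = values[len(values) // 2]
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if i != j and v < median * threshold_ratio]

# 4-GPU example: the pair (0, 3) is suspiciously slow compared to the rest,
# which would suggest a PCIe-limited or misrouted path between those two GPUs.
bw = [
    [0.0,   250.1, 248.9, 24.7],
    [249.8, 0.0,   251.3, 250.0],
    [249.2, 250.7, 0.0,   249.5],
    [25.1,  250.4, 249.9, 0.0],
]

print(find_slow_pairs(bw))  # flags (0, 3) and (3, 0)
```

A check like this is easy to drop into a fleet validation script once you export the matrices from each machine.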

3. How to understand CE and SM copies

The official documentation splits tests into two categories:

  • CE: copy engine transfers based on memcpy APIs
  • SM: kernel-based transfers

These two result types are not guaranteed to match exactly, because they represent different copy paths.
If you mainly want to understand regular device-to-device transfer behavior, you will usually look at CE first. If you want to study execution details more closely, then SM is worth checking too.

The README also explains that bandwidth results use the median across multiple test runs by default. Newer versions additionally include variability statistics, which makes it easier to judge how stable the numbers are.
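The choice of the median matters: a single slow outlier run barely moves it, while it drags the mean down noticeably. The sketch below illustrates that with made-up samples; nvbandwidth's exact variability statistics may differ from the stdev shown here.

```python
# Sketch: why a median-based summary is robust for repeated bandwidth runs.
# Sample values are invented for illustration; one run is an outlier.
import statistics

samples_gbs = [249.8, 250.1, 250.3, 249.9, 183.2]  # one slow outlier run

median = statistics.median(samples_gbs)  # barely affected by the outlier
mean = statistics.mean(samples_gbs)      # pulled down by the outlier
stdev = statistics.stdev(samples_gbs)    # a simple variability indicator

print(f"median={median:.1f} GB/s, mean={mean:.1f} GB/s, stdev={stdev:.1f}")
```

This is also why a variability statistic next to the median is useful: a large spread tells you the median alone may be hiding unstable behavior.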

4. What environment does it require

nvbandwidth is not a prebuilt binary that you simply download and run. You build it from source, which requires a standard CUDA development environment.

The current README lists these basic requirements:

  • CUDA Toolkit 11.x or newer
  • A compiler with C++17 support
  • CMake 3.20+, with 3.24+ recommended
  • Boost program_options
  • A usable CUDA device and a compatible driver

The requirements are higher if you want the multinode version. The current README explicitly states:

  • Multinode builds require CUDA Toolkit 12.3
  • The driver must be 550 or newer
  • MPI is required
  • The nvidia-imex service must be configured

So this is much more of an engineering tool for Linux GPU servers and clusters than something aimed at casual desktop use.

5. How to build and run the single-node version

The single-node build process is straightforward:

cmake .
make

On Ubuntu / Debian, the project also provides a debian_install.sh script that installs common dependencies and builds the project.

After building, you can check the help output first:

./nvbandwidth -h

Some commonly used options include:

  • -l: list available tests
  • -t: run a specific test by name or index
  • -p: run tests by prefix
  • -b: set the memcpy buffer size, default 512 MiB
  • -i: set the number of benchmark iterations
  • -j: output JSON
  • -H: enable huge pages for host memory allocation

If you just want to run the default test suite once, use:

./nvbandwidth

If you only want to test one specific item, such as a device-to-device copy:

./nvbandwidth -t device_to_device_memcpy_read_ce
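Since the tool can emit JSON with -j, its output is easy to consume in an automation pipeline. The sketch below shows the general idea; note that the `raw` structure here is a hypothetical stand-in, so check the actual -j output on your system, as the schema can change between versions.

```python
# Sketch: consuming nvbandwidth JSON output in a validation pipeline.
# The structure of `raw` is a HYPOTHETICAL example, not the guaranteed schema
# of `./nvbandwidth -j`; inspect real output before relying on field names.
import json

raw = json.dumps({
    "nvbandwidth": {
        "testcases": [
            {"name": "host_to_device_memcpy_ce",
             "status": "Passed",
             "sum": 55.2}
        ]
    }
})  # in practice: raw = subprocess.run([...], capture_output=True).stdout

report = json.loads(raw)
for case in report["nvbandwidth"]["testcases"]:
    if case["status"] != "Passed":
        raise SystemExit(f"{case['name']} did not pass")
    print(case["name"], case["sum"], "GB/s")
```

In a real pipeline you would run the binary with subprocess, parse its stdout the same way, and fail the job when a testcase does not pass or a value falls below a site-specific threshold.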

6. Multinode support is one of its standout features

nvbandwidth is not only for single-node multi-GPU testing. It also supports multinode scenarios.

According to the README, the multinode build is done like this:

cmake -DMULTINODE=1 .
make

At runtime, it is typically used together with mpirun, with one process launched per GPU.
The documentation also requires all participating ranks to belong to the same multinode clique, and it recommends mainly running tests with the multinode prefix under MPI.

That makes its positioning much closer to high-performance computing and large GPU systems than to simple workstation self-checks.

If you are working with NVLink multinode deployments or more complex platforms such as GB200 / Grace Hopper, the value of nvbandwidth is much higher than it would be on a typical consumer GPU setup.

7. What changed in v0.9

As of April 24, 2026, the GitHub Releases page shows that the latest version of nvbandwidth is v0.9, released on April 8, 2026.

The most notable updates in this release include:

  • Added variability statistics to bandwidth output
  • Added huge page support for host memory (Windows excluded)
  • Added pair sampling for device-to-device tests
  • Added a troubleshooting guide
  • Unified single-node and multinode execution paths

Two engineering-oriented changes are also worth noting:

  • Improved CUDA architecture detection without relying as much on direct GPU access
  • Deprecated Volta (sm_70 / sm_72) support in CUDA Toolkit 13.0+ environments

So if you have only seen early versions, v0.9 is no longer just a basic bandwidth tester. It is clearly moving toward better automation, troubleshooting, and large-scale system validation.

8. When is it a good fit

nvbandwidth is especially suitable when:

  • You want to verify real interconnect bandwidth between multiple NVIDIA GPUs
  • You suspect one GPU is installed in a bandwidth-limited PCIe slot
  • You want to compare NVLink paths against non-NVLink paths
  • You are deploying a multinode GPU cluster and need to validate the links
  • You want test results in JSON for automation pipelines

But if your goal is only to answer questions like “how fast is training” or “how many tokens per second can inference reach,” this tool is not the whole answer.
In that case, you still need workload-level testing with your training framework, inference engine, or real application.

9. How to think about its value

Many GPU performance problems are not really caused by insufficient compute. They happen because the data path is not working as expected.

For example:

  • GPUs are not using the intended interconnect path
  • Cross-NUMA access is reducing speed
  • Certain GPU pairs have abnormal bandwidth
  • Multinode communication is only partially configured

These issues are often hard to diagnose if you only look at nvidia-smi or model throughput.
A lower-level, matrix-oriented tool like nvbandwidth is useful precisely because it exposes what is happening at the interconnect layer.

So a simple way to think about it is: nvbandwidth is a command-line health check tool for bandwidth on NVIDIA GPU systems.
