<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>GPU Acceleration on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/gpu-acceleration/</link>
        <description>Recent content in GPU Acceleration on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sat, 09 May 2026 15:05:41 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/gpu-acceleration/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>A Practical llama.cpp Multi-GPU Benchmarking Approach: Is 2x V100 16GB Faster Than One 32GB Card?</title>
        <link>https://www.knightli.com/en/2026/05/09/llama-cpp-multi-gpu-offload-performance/</link>
        <pubDate>Sat, 09 May 2026 15:05:41 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/05/09/llama-cpp-multi-gpu-offload-performance/</guid>
        <description>&lt;p&gt;Short version: llama.cpp multi-GPU offload is not free performance just because you add a second card. If the model already fits fully on one 32GB GPU, 2x V100 16GB is often less convenient than a single 32GB card and may even be slower. If the model does not fit on one 16GB card, the main value of dual GPUs is that the model can stay on GPU, and the benefit can be obvious.&lt;/p&gt;
&lt;h2 id=&#34;first-understand-split-mode&#34;&gt;First, Understand split mode
&lt;/h2&gt;&lt;p&gt;llama.cpp multi-GPU usage mainly revolves around &lt;code&gt;--split-mode&lt;/code&gt;, which chooses how work is divided across GPUs, and &lt;code&gt;--tensor-split&lt;/code&gt;, which chooses how much each GPU receives. When discussing performance, distinguish the split modes first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;layer&lt;/code&gt;: splits whole layers across GPUs. It is the default and usually the most compatible starting point.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;row&lt;/code&gt;: splits the rows of large tensors across GPUs, so both cards work on the same layer. It is closer to true tensor-parallel compute, but depends more heavily on inter-GPU bandwidth and backend support.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;none&lt;/code&gt;: keeps everything on a single GPU even when several are visible, which makes it a useful baseline.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In simple terms, &lt;code&gt;layer&lt;/code&gt; is like putting different floors on different cards. During single-token generation, it may not keep both cards fully busy at the same time. &lt;code&gt;row&lt;/code&gt; is more like letting both cards work on the same layer together. It has more theoretical parallelism, but inter-GPU communication can become the bottleneck.&lt;/p&gt;
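&lt;p&gt;A minimal sketch of how the two modes are selected, assuming a dual-GPU machine, a CUDA build of llama.cpp, and a local &lt;code&gt;model.gguf&lt;/code&gt; (the model path is a placeholder):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Layer split (default): whole layers are assigned to different GPUs
llama-server -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1

# Row split: rows of the same tensors are split across both GPUs
llama-server -m model.gguf -ngl 99 --split-mode row --tensor-split 1,1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;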
&lt;h2 id=&#34;if-one-32gb-card-can-fit-the-model-dual-16gb-is-not-always-faster&#34;&gt;If One 32GB Card Can Fit the Model, Dual 16GB Is Not Always Faster
&lt;/h2&gt;&lt;p&gt;If the model and KV cache fit fully on one 32GB GPU, a single card is usually steadier and often faster. For hardware in the same generation, such as 1x V100 32GB versus 2x V100 16GB, the dual-card setup does not necessarily win.&lt;/p&gt;
&lt;p&gt;A conservative expectation is that 2x V100 16GB may be 10% to 40% slower than one V100 32GB, especially for single-user chat, Continue Agent, and code Q&amp;amp;A workloads where one request is mainly generating one answer.&lt;/p&gt;
&lt;p&gt;The reason is straightforward: multi-GPU does not simply merge VRAM into one fast pool. With layer splitting, inference moves across GPUs and one card may wait for the other during token generation. With tensor splitting, both cards can compute together, but intermediate results need cross-GPU synchronization, so bandwidth and latency directly affect throughput.&lt;/p&gt;
&lt;p&gt;So if your choice is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1x V100 32GB&lt;/li&gt;
&lt;li&gt;2x V100 16GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and the target model already fits fully on one 32GB card, the single 32GB card is often the more comfortable option.&lt;/p&gt;
&lt;h2 id=&#34;if-one-16gb-card-cannot-fit-the-model-dual-cards-matter&#34;&gt;If One 16GB Card Cannot Fit the Model, Dual Cards Matter
&lt;/h2&gt;&lt;p&gt;The situation changes completely when the model does not fit on one 16GB card but can fit across two 16GB cards.&lt;/p&gt;
&lt;p&gt;In that case, the value of dual GPUs is very direct:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One 16GB card: may require heavy CPU offload, which can slow things down a lot.&lt;/li&gt;
&lt;li&gt;2x 16GB cards: weights can stay mostly on GPU, which may be much faster than mixed CPU/GPU execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this scenario, 2x V100 16GB is not guaranteed to beat one 32GB card, but it may be several times faster than a single 16GB card with heavy system-memory offload. In other words, the first value of dual cards is not acceleration. It is avoiding the need to push model weights into slower system RAM.&lt;/p&gt;
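&lt;p&gt;A rough comparison sketch of that situation, assuming a quantized model whose weights do not fit in 16GB; the layer count of 40 is only a placeholder for however many layers actually fit on one card:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# One 16GB card: only part of the layers fit, the rest stays on the CPU
CUDA_VISIBLE_DEVICES=0 llama-bench -m model.gguf -ngl 40

# Two 16GB cards: all layers stay on GPU, split by layer
CUDA_VISIBLE_DEVICES=0,1 llama-bench -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;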
&lt;h2 id=&#34;v100-pcie-and-v100-sxm2-are-very-different&#34;&gt;V100 PCIe and V100 SXM2 Are Very Different
&lt;/h2&gt;&lt;p&gt;The easiest thing to overlook in multi-GPU inference is the interconnect.&lt;/p&gt;
&lt;p&gt;If you have V100 SXM2 with NVLink, cross-GPU communication bandwidth is much higher. NVIDIA&amp;rsquo;s V100 material lists NVLink interconnect bandwidth up to 300GB/s. In that environment, &lt;code&gt;row&lt;/code&gt; mode or higher-batch workloads have a better chance of approaching or exceeding single-card performance.&lt;/p&gt;
&lt;p&gt;If you have V100 PCIe, expectations should be more conservative. V100 PCIe mainly uses PCIe Gen3, and the listed interconnect bandwidth is 32GB/s. That is a very different class from NVLink, which is why dual PCIe cards often provide enough VRAM without doubling speed.&lt;/p&gt;
&lt;p&gt;So when judging whether 2x V100 16GB is worthwhile, do not only add the VRAM to 32GB. Also check whether the cards are PCIe or SXM2/NVLink.&lt;/p&gt;
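&lt;p&gt;One quick way to check, assuming the standard NVIDIA driver tools are available on the machine:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Show how the GPUs are connected: NV# entries indicate NVLink,
# the other codes indicate PCIe or system-level paths
nvidia-smi topo -m

# If NVLink is present, show per-link status
nvidia-smi nvlink --status
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;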
&lt;h2 id=&#34;a-practical-buying-rule&#34;&gt;A Practical Buying Rule
&lt;/h2&gt;&lt;p&gt;If the model fits on one 32GB GPU, choose the single card first. Its latency, stability, and tuning cost are usually better.&lt;/p&gt;
&lt;p&gt;If the model does not fit on one 16GB GPU but can fit on two 16GB GPUs, dual cards are worth using. At that point, the goal is to keep weights on GPU as much as possible, not to expect linear performance scaling.&lt;/p&gt;
&lt;p&gt;If you have dual V100 PCIe cards, start with &lt;code&gt;--split-mode layer&lt;/code&gt; and aim for stable execution with less CPU fallback.&lt;/p&gt;
&lt;p&gt;If you have V100 SXM2/NVLink, it is worth benchmarking &lt;code&gt;row&lt;/code&gt; mode too, especially for prefill, larger batches, or concurrent serving.&lt;/p&gt;
&lt;h2 id=&#34;when-to-buy-2x16gb-and-when-to-buy-1x32gb&#34;&gt;When to Buy 2x16GB and When to Buy 1x32GB
&lt;/h2&gt;&lt;p&gt;If you serve only one user and mainly do chat, code completion, Continue Agent, or long-context Q&amp;amp;A, and the target model fits within 32GB, 1x32GB is usually the better choice. It avoids cross-GPU scheduling, has steadier latency, and is easier to debug.&lt;/p&gt;
&lt;p&gt;If you already own one 16GB card and want a lower-cost path to run 30B or 32B class models, or less aggressive quantizations, 2x16GB makes sense. It may not double token/s, but it can keep weights on GPU that would otherwise require CPU offload.&lt;/p&gt;
&lt;p&gt;If you are buying from scratch, the priority can look like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single model, single user, latency-sensitive: prefer 1x32GB.&lt;/li&gt;
&lt;li&gt;Model does not fit on one card and budget is limited: consider 2x16GB.&lt;/li&gt;
&lt;li&gt;Machine has NVLink or SXM2: 2x16GB is much more interesting than ordinary PCIe dual cards.&lt;/li&gt;
&lt;li&gt;You want longer context later: do not only count model weights; reserve VRAM for KV cache too.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;practical-advice-for-layer-split-and-tensor-split&#34;&gt;Practical Advice for layer split and row split
&lt;/h2&gt;&lt;p&gt;The practical rule is: start with &lt;code&gt;layer&lt;/code&gt;, then benchmark &lt;code&gt;row&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;layer&lt;/code&gt; is the default starting point. It splits the model by layer, has better compatibility, and is friendlier to PCIe dual-card systems. The downside is that generation can behave more like a pipeline: at certain moments one card is busy while the other waits.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;row&lt;/code&gt; is better suited to machines with strong interconnects, such as V100 SXM2/NVLink. It splits part of the same layer&amp;rsquo;s computation across GPUs, so it has more parallelism in theory, but it also synchronizes across cards more often. On PCIe dual cards, communication overhead may eat the benefit.&lt;/p&gt;
&lt;p&gt;You can start with these tests:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode layer --tensor-split 1,1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode row --tensor-split 1,1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode layer --tensor-split 1,0
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The third command is not meant as the long-term configuration. It gives you a single-card reference, so you can see whether dual GPUs are actually faster or only distributing VRAM pressure.&lt;/p&gt;
&lt;h2 id=&#34;why-prefill-and-decode-behave-differently&#34;&gt;Why prefill and decode Behave Differently
&lt;/h2&gt;&lt;p&gt;Local LLM performance should usually be viewed in two stages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;prefill&lt;/code&gt;: processes the input prompt. A typical metric is prompt-processing throughput such as &lt;code&gt;pp512&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;decode&lt;/code&gt;: generates the response token by token. A typical metric is token-generation throughput such as &lt;code&gt;tg128&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;prefill&lt;/code&gt; is more like large-batch matrix computation. With larger batches, it is easier to keep GPUs busy and more likely to benefit from multi-GPU parallelism. &lt;code&gt;decode&lt;/code&gt; generates one token after another. The batch is smaller and synchronization is more frequent, so cross-card communication and scheduling latency are easier to notice.&lt;/p&gt;
&lt;p&gt;That is why you may see dual GPUs improve &lt;code&gt;pp512&lt;/code&gt; while &lt;code&gt;tg128&lt;/code&gt; barely improves or even gets worse. For chat and agent workflows, user experience is closer to &lt;code&gt;tg128&lt;/code&gt;. For long document ingestion, batch prefill, or concurrent serving, &lt;code&gt;pp512&lt;/code&gt; also matters.&lt;/p&gt;
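&lt;p&gt;If you want to see the two stages separately, llama-bench can vary the prompt and generation lengths with &lt;code&gt;-p&lt;/code&gt; and &lt;code&gt;-n&lt;/code&gt;; the sizes below are only examples:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# pp = prompt processing (prefill), tg = token generation (decode)
llama-bench -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1 -p 512,2048 -n 128,256
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;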
&lt;h2 id=&#34;can-kv-cache-become-a-second-vram-bottleneck&#34;&gt;Can KV cache Become a Second VRAM Bottleneck?
&lt;/h2&gt;&lt;p&gt;Yes. Many people only count model weights and forget KV cache.&lt;/p&gt;
&lt;p&gt;Model weights decide whether the model can load. KV cache decides whether you can use the context length you want. The longer the context, the higher the concurrency, and the larger the batch, the more visible KV cache usage becomes. You may find that the model itself fits in 32GB, but 32K or 64K context pushes VRAM over the limit.&lt;/p&gt;
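&lt;p&gt;As a rough order-of-magnitude sketch, for a hypothetical model with 48 layers, 8 KV heads, head dimension 128, and an fp16 KV cache at 32K context:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# 2 (K and V) x layers x context x kv_heads x head_dim x 2 bytes (fp16)
echo $(( 2 * 48 * 32768 * 8 * 128 * 2 ))   # about 6.4 GB of KV cache
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;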
&lt;p&gt;At minimum, leave VRAM headroom for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;KV cache&lt;/li&gt;
&lt;li&gt;CUDA graph or backend runtime overhead&lt;/li&gt;
&lt;li&gt;prompt batch and ubatch&lt;/li&gt;
&lt;li&gt;desktop, driver, and other process usage&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you use 2x16GB, VRAM is not a fully equivalent 32GB pool. Some buffers, KV cache, or intermediate tensors may still be limited by remaining memory on a single card. When testing long context, use the target &lt;code&gt;--ctx-size&lt;/code&gt; and target concurrency directly instead of only checking whether the model starts.&lt;/p&gt;
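&lt;p&gt;For example, a sketch of a long-context test, where the context size and the number of parallel slots are placeholders for your real targets:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Launch at the context length and concurrency you actually plan to use,
# then watch nvidia-smi to see whether either card runs out of VRAM
llama-server -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1 --ctx-size 32768 --parallel 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;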
&lt;h2 id=&#34;how-to-benchmark-dual-cards-with-llama-bench&#34;&gt;How to Benchmark Dual Cards with llama-bench
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;llama-bench&lt;/code&gt; is better than direct chatting for hardware comparison because it separates prompt processing and token generation into comparable metrics. The default example in the official README is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For dual V100 cards, test at least these sets:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Single-card baseline&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Dual-card layer split&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode layer --tensor-split 1,1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Dual-card row split&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode row --tensor-split 1,1
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Focus on two columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pp512&lt;/code&gt;: prompt processing, more relevant to long inputs and batch prefill.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tg128&lt;/code&gt;: token generation, more relevant to single-user chat and agent responsiveness.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Keep the model, quantization, context length, batch settings, driver version, and llama.cpp version fixed. Run each group several times and compare medians rather than one-off results. Finally, test your real workflow too, such as Continue Agent, an OpenAI-compatible server, or your own RAG requests, because a good benchmark does not always mean better interactive experience.&lt;/p&gt;
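&lt;p&gt;A repeatable way to do that, assuming a recent llama-bench build that supports repetitions (&lt;code&gt;-r&lt;/code&gt;) and formatted output (&lt;code&gt;-o&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# 5 runs per configuration, results as a markdown table for easy comparison
llama-bench -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1 -r 5 -o md
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;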
&lt;h2 id=&#34;one-sentence-conclusion&#34;&gt;One-Sentence Conclusion
&lt;/h2&gt;&lt;p&gt;The main advantage of 2x V100 16GB is VRAM capacity, not guaranteed generation speed. If the model fits on one card, a single 32GB card is usually faster and steadier. If the model does not fit on one 16GB card, dual 16GB cards become valuable because they avoid heavy CPU offload. Whether they are faster depends on split mode, batch size, model size, and whether the two V100 cards are connected through PCIe or NVLink.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp server README&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.mintlify.com/ggml-org/llama.cpp/concepts/backends&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp Compute Backends&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-gb/data-center/tesla-v100/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA Tesla V100&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet.pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA V100 Datasheet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
