Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings

A practical summary of Ollama multi-GPU behavior: when models are split across GPUs, how to limit devices with CUDA_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES, whether VRAM can be pooled, whether mixed GPUs work, and common Docker, PCIe, and performance pitfalls.

When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?

This note summarizes how Ollama behaves with multiple GPUs. The short version:

  • Ollama supports multiple GPUs.
  • The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.
  • By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.
  • If a model does not fit on one GPU, Ollama can spread it across available GPUs.
  • Mixed GPU setups (different card models) may be visible to Ollama, but performance and placement may not be ideal.
  • SLI / NVLink is not required for multi-GPU use.
  • To limit which GPUs Ollama can use, use CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, or GGML_VK_VISIBLE_DEVICES.

Official Behavior: Single GPU First, Multi-GPU When Needed

Ollama’s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.

The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.

So do not think of Ollama multi-GPU as “more cards automatically means several times faster.” A more accurate model is:

  • Small model fits on one GPU: usually runs on one GPU.
  • Large model does not fit on one GPU: split across multiple GPUs.
  • Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.

Use this command to see where the model is loaded:

ollama ps

The PROCESSOR column may show something like:

100% GPU
48%/52% CPU/GPU
100% CPU

If you see 48%/52% CPU/GPU, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.
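
If you script health checks around ollama ps, the PROCESSOR value can be turned into a number. A minimal sketch, assuming the column formats shown above (the exact output layout may change between Ollama versions, and gpu_fraction is a hypothetical helper name):

```python
# Hypothetical helper: convert the PROCESSOR column of `ollama ps`
# into the fraction of the model resident on GPU.
def gpu_fraction(processor: str) -> float:
    """Return the GPU-resident fraction of a model (0.0 to 1.0)."""
    processor = processor.strip()
    if processor == "100% GPU":
        return 1.0
    if processor == "100% CPU":
        return 0.0
    # Mixed form looks like "48%/52% CPU/GPU": first number is CPU share.
    percents, _labels = processor.split(" ")
    cpu_pct, gpu_pct = (int(p.rstrip("%")) for p in percents.split("/"))
    return gpu_pct / 100.0

print(gpu_fraction("100% GPU"))        # 1.0
print(gpu_fraction("48%/52% CPU/GPU"))  # 0.52
```

Anything below 1.0 means part of the model is in system memory and generation speed will suffer.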

Multi-GPU Is Not Simple Compute Stacking

Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.

So multi-GPU benefits usually fall into two categories:

  • VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.
  • Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.

If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama’s default “use one GPU when it fits” strategy avoids that unnecessary PCIe cost.
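
A rough back-of-envelope check shows why: quantized weights dominate VRAM use. This heuristic (parameter count times bits per weight, plus a flat overhead guess for KV cache and buffers) is an illustration, not Ollama's actual estimator:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weight bytes plus a flat overhead for
    KV cache / runtime buffers. A heuristic, not Ollama's real logic."""
    weight_gb = params_b * bits_per_weight / 8  # billions of params * bytes per param
    return weight_gb + overhead_gb

# An 8B model at 4-bit quantization fits easily in a 24 GB card:
print(estimate_vram_gb(8, 4))   # 5.5
# A 70B model at 4-bit clearly does not fit on one 24 GB card:
print(estimate_vram_gb(70, 4))  # 36.5
```

The first case should stay on one GPU; the second is where multi-GPU splitting earns its keep.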

Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.

NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.

What you should pay attention to is PCIe bandwidth. The difference between x1, x4, x8, and x16 affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.

Safer rules:

  • Prefer x16 / x8 over mining-style x1 risers.
  • PCIe bandwidth matters more when switching large models frequently.
  • If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.
  • For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.
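
The load-time effect of link width is easy to estimate. The bandwidth figures below are approximate theoretical PCIe 4.0 numbers (real throughput is lower, and disk speed is often the true bottleneck), so treat this as an order-of-magnitude sketch:

```python
# Approximate theoretical PCIe 4.0 bandwidth per link width, in GB/s.
# Real-world effective throughput is lower.
PCIE4_GBPS = {"x1": 2.0, "x4": 8.0, "x8": 16.0, "x16": 32.0}

def load_seconds(model_gb: float, link: str) -> float:
    """Seconds to copy model weights into VRAM at a given link width,
    ignoring disk read speed."""
    return model_gb / PCIE4_GBPS[link]

# Loading a 40 GB quantized model:
for link in ("x1", "x4", "x16"):
    print(link, load_seconds(40, link), "s")
```

The gap between a mining-style x1 riser and a full x16 slot is roughly a factor of sixteen per load, which is exactly why frequent model switching makes PCIe bandwidth visible.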

Limit Which NVIDIA GPUs Ollama Uses

On NVIDIA multi-GPU systems, use CUDA_VISIBLE_DEVICES to control which GPUs Ollama can see.

Temporary run:

CUDA_VISIBLE_DEVICES=0,1 ollama serve

Use only the second GPU:

CUDA_VISIBLE_DEVICES=1 ollama serve

Force Ollama not to use NVIDIA GPUs:

CUDA_VISIBLE_DEVICES=-1 ollama serve

The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:

nvidia-smi -L

Example output:

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)

Then specify the UUID:

CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
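
Extracting UUIDs from nvidia-smi -L is easy to automate. A small sketch that builds the environment variable value from sample output (the sample UUIDs below are made up; in practice you would pipe in the real command output):

```python
import re

# Sample `nvidia-smi -L` output; the UUIDs here are placeholders.
sample = """GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-aaaaaaaa-1111-2222-3333-444444444444)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-bbbbbbbb-5555-6666-7777-888888888888)"""

# GPU UUIDs are stable across reboots, unlike numeric indices.
uuids = re.findall(r"UUID: (GPU-[0-9a-fA-F-]+)", sample)
print("CUDA_VISIBLE_DEVICES=" + ",".join(uuids))
```

Selecting by UUID protects you from the device-reordering problem the official docs warn about.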

If Ollama is installed as a Linux systemd service, put the variable into the service environment:

sudo systemctl edit ollama.service

Add:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

AMD and Vulkan Device Selection

For AMD ROCm, use ROCR_VISIBLE_DEVICES to control visible GPUs:

ROCR_VISIBLE_DEVICES=0,1 ollama serve

To force Ollama not to use ROCm GPUs, use an invalid ID:

ROCR_VISIBLE_DEVICES=-1 ollama serve

Ollama’s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use GGML_VK_VISIBLE_DEVICES:

OLLAMA_VULKAN=1 GGML_VK_VISIBLE_DEVICES=0 ollama serve

If Vulkan devices cause problems, disable them:

GGML_VK_VISIBLE_DEVICES=-1 ollama serve

AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as HSA_OVERRIDE_GFX_VERSION. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.

Exposing Multiple GPUs in Docker

If you run Ollama in Docker, NVIDIA setups usually require nvidia-container-toolkit, then --gpus to expose devices.

Expose all GPUs:

docker run -d \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Expose specific GPUs:

docker run -d \
  --gpus '"device=0,1"' \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

You can also combine this with environment variables:

docker run -d \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

If nvidia-smi cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.

What Is OLLAMA_SCHED_SPREAD

In some multi-GPU configuration discussions, you may see OLLAMA_SCHED_SPREAD=1 or OLLAMA_SCHED_SPREAD=true. It tells Ollama's scheduler to spread a model across all available GPUs even when it would otherwise fit on one, which people sometimes use to balance VRAM usage across cards.

Example:

OLLAMA_SCHED_SPREAD=1 ollama serve

Or with systemd:

[Service]
Environment="OLLAMA_SCHED_SPREAD=true"

But it is not a magic switch. Enabling it does not imply linear token/s scaling, and it may still run into OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The core FAQ behavior still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, then multi-GPU splitting becomes useful.

Treat OLLAMA_SCHED_SPREAD as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on ollama ps, logs, and nvidia-smi.

How to Check Whether Multiple GPUs Are Being Used

Useful commands:

ollama ps
watch -n 0.5 nvidia-smi

View the Ollama service logs:

journalctl -u ollama -f

If using Docker:

docker logs -f ollama

Watch for:

  • Whether Ollama discovers compatible GPUs.
  • Whether the model shows 100% GPU or a CPU/GPU split.
  • Whether each GPU has VRAM allocated.
  • Whether VRAM grows on multiple GPUs during model loading.
  • Whether generation token/s improves compared with CPU/RAM spillover.
  • Whether OOM or model unloading happens frequently.

GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.
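
For scripted checks, nvidia-smi's CSV query output is easier to parse than the default table. A minimal sketch, with hardcoded sample text standing in for the real command output (values in MiB; in practice pipe it in via subprocess):

```python
import csv
import io

# Parse the output of:
#   nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader,nounits
# The sample string below is made-up stand-in data.
sample = """0, NVIDIA GeForce RTX 3090, 18432, 24576
1, NVIDIA GeForce RTX 3070, 312, 8192"""

rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))
for idx, name, used, total in rows:
    pct = int(used) / int(total) * 100
    print(f"GPU {idx} ({name}): {pct:.0f}% VRAM used")
```

If only one GPU shows significant VRAM use during multi-GPU loading, the model was not actually split.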

Common Misunderstandings

Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU

Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the “does not fit” problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.

Misunderstanding 2: Different GPU Models Cannot Be Mixed

Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, smaller VRAM, and PCIe topology. The most predictable setup is still same model, same VRAM size, and well-supported same-generation drivers.

Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU

Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.

Misunderstanding 4: Multi-GPU Requires SLI or NVLink

No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.

Misunderstanding 5: Adding a GPU Does Not Require Restarting Services

Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.

GPU Selection Suggestions

For Ollama local inference, the rough priority is:

  1. Larger single-GPU VRAM is usually easier to manage.
  2. Identical GPUs are easier to troubleshoot than mixed GPUs.
  3. More complete PCIe lanes make large-model loading smoother.
  4. Older cards should be checked for CUDA compute capability or ROCm support first.
  5. Multi-GPU power, cooling, and chassis airflow must be planned ahead.

For budget second-hand platforms:

  • Dual RTX 3090 remains a common high-VRAM option.
  • Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all need trade-offs.
  • Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.
  • Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.

Summary

Ollama multi-GPU support is best understood as “VRAM expansion first, performance acceleration second.” If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.

In practice, use ollama ps to check where the model is loaded, then use nvidia-smi or ROCm tools to observe VRAM allocation. For GPU selection, use CUDA_VISIBLE_DEVICES on NVIDIA, ROCR_VISIBLE_DEVICES on AMD ROCm, and GGML_VK_VISIBLE_DEVICES for Vulkan. If running in Docker, first make sure the container can see the GPUs.

Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.
