How to Fix Ollama Using CPU Instead of GPU

A practical troubleshooting guide for Ollama running on CPU instead of GPU, covering GPU detection, ROCm or CUDA setup, service restarts, VRAM limits, and common AMD compatibility issues.

When running local LLMs, one of the most frustrating problems is this: your machine clearly has a GPU, yet Ollama still leans heavily on the CPU, and performance is painfully slow.

The short version is that this is usually not caused by one single issue. The most common causes are:

  • Ollama is not detecting any usable GPU
  • The driver, ROCm, or CUDA environment is not set up correctly
  • The Ollama service was started without the right environment variables
  • The model is too large and has fallen back to CPU or mixed CPU/GPU loading
  • On AMD platforms, there may be extra compatibility issues such as ROCm version mismatch, gfx settings, or device visibility problems

The fastest way to troubleshoot it is to go through the checks below in order.

1. First, confirm whether Ollama is really not using the GPU

The most direct check is:

```shell
ollama ps
```

Focus on the PROCESSOR column.

  • 100% GPU: the model is fully running on the GPU
  • 100% CPU: the GPU is not being used at all
  • Results like 48%/52% CPU/GPU: part of the model is in VRAM, and part has spilled into system memory

If you see 100% CPU, the next step is to focus on environment and service configuration.
If you see mixed loading, that does not necessarily mean the GPU is broken. In many cases, it simply means VRAM is not enough.
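To make the three cases concrete, here is a small sketch that classifies the PROCESSOR value. The sample output below is illustrative (model name and ID are made up); the real column shows "100% GPU", "100% CPU", or a split like "48%/52% CPU/GPU".

```shell
# Illustrative `ollama ps` output (values are made up for this example)
sample='NAME        ID            SIZE    PROCESSOR        UNTIL
qwen3:8b    a1b2c3d4e5f6  6.5 GB  48%/52% CPU/GPU  4 minutes from now'

# Classify the PROCESSOR value of the first model row
row=$(printf '%s\n' "$sample" | tail -n +2)
case "$row" in
  *'100% GPU'*) verdict="fully on GPU" ;;
  *'100% CPU'*) verdict="fully on CPU: check drivers and service env" ;;
  *'CPU/GPU'*)  verdict="mixed load: likely VRAM pressure" ;;
  *)            verdict="unknown" ;;
esac
echo "$verdict"
```

Note that the check order matters: the exact strings "100% GPU" and "100% CPU" are tested before the generic split pattern.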

2. Rule out the most common misunderstanding first: the model does not fit into VRAM

Many people assume that once a GPU is installed, Ollama will always run fully on it. That is not how it works.

If the model is too large, the context is too long, or some other loaded model is already occupying VRAM, Ollama may fall back to:

  • Partial GPU + partial CPU
  • Full 100% CPU

At this point, the two simplest tests are:

  1. Try a smaller model first
    For example, test with a 4B or 7B model before jumping straight to much larger ones.
  2. Unload other active models and test again
    Run ollama ps first and make sure nothing else is occupying VRAM.

If smaller models use the GPU but larger ones do not, the real problem is usually VRAM capacity rather than the driver.
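A rough back-of-the-envelope check makes the VRAM case easy to reason about: a quantized model needs approximately its file size in VRAM, plus room for the KV cache and runtime buffers. All the numbers below are illustrative guesses, not measurements; the point is the arithmetic, not the exact values.

```shell
# Rough VRAM fit estimate (all numbers are illustrative assumptions)
model_mb=4700        # e.g. a ~4.7 GB Q4-quantized 7B model file
kv_cache_mb=1500     # KV cache grows with context length; rough guess for 8K
runtime_mb=800       # driver/runtime overhead; varies by stack
vram_mb=8192         # an 8 GB card

needed_mb=$((model_mb + kv_cache_mb + runtime_mb))
if [ "$needed_mb" -le "$vram_mb" ]; then
  verdict="should fit fully in VRAM"
else
  verdict="will likely spill to CPU/system RAM"
fi
echo "$needed_mb MB needed vs $vram_mb MB available: $verdict"
```

With these numbers the model fits; raise the context length (and so `kv_cache_mb`) and the same card tips into mixed CPU/GPU loading.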

3. Check whether the GPU driver and the lower-level runtime are actually working

If even small models run only on CPU, the next step is to check the underlying environment.

NVIDIA

First confirm that the driver is working and the system can see the GPU. A common check is:

```shell
nvidia-smi
```

If this already fails, Ollama is very unlikely to use the GPU correctly.

AMD / ROCm

If you are using an AMD GPU, especially with ROCm, start with:

```shell
rocminfo
rocm-smi
```

If these tools cannot list the device properly, the problem is still below Ollama, so there is no point debugging the application layer yet.

On AMD, the most common issue is not simply “is the driver installed,” but rather:

  • The ROCm version does not match the OS version
  • The current GPU architecture has incomplete support
  • The device exists, but the runtime is not being exposed correctly to Ollama
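One useful detail to pull out of `rocminfo` is the GPU's gfx target string, since ROCm support is decided per architecture. `rocminfo` prints one agent per device with a `Name:` line carrying that target; the excerpt below is an illustrative sample of that shape, not captured output.

```shell
# Illustrative excerpt of `rocminfo` output (shape only, not real capture)
sample='Agent 2
  Name:                    gfx1151
  Marketing Name:          AMD Radeon Graphics'

# Pull the first gfx target string out of the agent list
gfx=$(printf '%s\n' "$sample" | awk '/Name:/ && $2 ~ /^gfx/ {print $2; exit}')
echo "detected gfx target: $gfx"
```

Knowing the gfx target (here gfx1151, the Strix Halo architecture) tells you which ROCm versions support the card and what an `HSA_OVERRIDE_GFX_VERSION` workaround would need to target.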

4. Restart the Ollama service, not just your terminal

This is a very common trap.

Many people install drivers, change environment variables, fix ROCm, then just open a new terminal and continue with ollama run. But if Ollama is running as a background service, it may still be using the old environment.

So the safer approach is:

  • Fully restart the Ollama service
  • Reboot the machine if necessary

If you are running it as a service on Linux, make sure the service process was actually restarted instead of reusing the old one.
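Assuming the default Linux install, where Ollama runs as a systemd unit named `ollama`, you can verify that the process was actually replaced by comparing its PID before and after:

```shell
systemctl show -p MainPID ollama    # note the old PID
sudo systemctl restart ollama
systemctl show -p MainPID ollama    # should now print a different PID
```

If the PID did not change, the old process (with the old environment) is still serving requests.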

5. Check whether the environment variables are really reaching the service

This matters especially on AMD ROCm systems.

Some machines work fine when commands are run manually in a shell, but the Ollama service still uses only CPU. In that case, the usual reason is that the service process never received the variables you set in your shell.

Common variables to look at include:

```shell
ROCR_VISIBLE_DEVICES
HSA_OVERRIDE_GFX_VERSION
```

Specifically:

  • ROCR_VISIBLE_DEVICES limits or selects which GPUs ROCm can see
  • HSA_OVERRIDE_GFX_VERSION is often used as a compatibility workaround on some AMD platforms

If you only export these variables in the current terminal, but Ollama is started by systemd, a desktop background service, or another daemon, they may not take effect.

In other words, “it looks set in my terminal” does not mean Ollama is actually using it.
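If Ollama is managed by systemd, the standard way to deliver these variables is a drop-in override rather than a shell export. A minimal sketch (the `11.0.0` value is only an example commonly used as a workaround on some RDNA3-class parts; the right value, if any, depends on your GPU):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="ROCR_VISIBLE_DEVICES=0"
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
```

Apply it with `sudo systemctl daemon-reload && sudo systemctl restart ollama`, then run `systemctl show ollama -p Environment` to confirm what the service process actually received.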

6. On AMD platforms, focus on ROCm compatibility

The original video behind this guide covers the AMD Ryzen AI Max+ 395 (Strix Halo) with ROCm.
On setups like these, whether Ollama uses the GPU depends far more on version matching than it typically does on NVIDIA systems.

Start by checking these:

  1. Whether the installed ROCm version fits the current OS and GPU
  2. Whether the GPU belongs to an architecture with solid ROCm support
  3. Whether you need to set HSA_OVERRIDE_GFX_VERSION
  4. Whether an older Ollama build or older inference runtime is causing compatibility issues

If rocminfo works and the GPU is visible to the system, but Ollama still runs only on CPU, the issue is often in the version combination rather than in model parameters.

7. In Docker, WSL, or remote environments, also check device mapping

If you are not running on bare metal but inside:

  • Docker
  • WSL
  • Remote containers
  • Virtualized environments

then you need to check one more layer: whether the GPU device is actually being exposed inside that environment.

A typical symptom looks like this:

  • The host machine can see the GPU
  • Ollama inside the container or subsystem still uses only CPU

In that case, the issue may not be Ollama itself. The container or subsystem may simply not have GPU access.
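For Docker specifically, Ollama's images expect the GPU to be passed in explicitly; a sketch of both variants, following the project's documented run commands:

```shell
# NVIDIA: requires the NVIDIA Container Toolkit on the host
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# AMD/ROCm: pass the kernel devices through and use the rocm image tag
docker run -d --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm
```

Without `--gpus` or the `--device` flags, the container cannot see the GPU no matter how healthy the host's driver stack is.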

8. Check logs last, but check them for the right reason

If you have already gone through the earlier steps, the most effective next move is not endless reinstalling, but looking directly at the Ollama startup and runtime logs.

Focus on two kinds of messages:

  • Whether a GPU was detected at all
  • Whether there are driver, library loading, or device initialization errors

If the logs clearly say something like “no compatible GPU found” or “failed to initialize ROCm/CUDA,” the troubleshooting direction becomes much clearer immediately.
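On a systemd install the logs live in the journal (`journalctl -u ollama`). Since the exact phrasing varies between Ollama versions, grep for broad keywords rather than exact strings; the log lines below are an illustrative sample of that kind of message, not real captured output.

```shell
# Illustrative log lines (phrasing varies by Ollama version)
sample='level=INFO msg="looking for compatible GPUs"
level=WARN msg="no compatible GPUs were discovered"'

# Broad keyword match covering the two failure families from above
if printf '%s\n' "$sample" | grep -qiE 'no compatible GPU|failed to initialize'; then
  verdict="GPU detection failed: fix driver/runtime first"
else
  verdict="GPU detected: look at VRAM and model size instead"
fi
echo "$verdict"
```

In a live setup, replace the embedded sample with `journalctl -u ollama -n 200` piped into the same grep.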

Troubleshooting Order

If you only want the shortest path, use this order:

  1. Run ollama ps and confirm whether it is GPU, CPU, or mixed loading
  2. Try a smaller model to rule out VRAM limits
  3. Use nvidia-smi, rocminfo, and rocm-smi to verify the lower-level environment first
  4. Fully restart the Ollama service
  5. Check service environment variables, especially ROCR_VISIBLE_DEVICES and HSA_OVERRIDE_GFX_VERSION on AMD
  6. If you are in Docker or WSL, verify device mapping
  7. Finally, inspect logs for the exact error

Conclusion

When Ollama uses CPU instead of GPU, the root cause usually falls into one of three groups:

  • The GPU is not being detected at all
  • The GPU is detectable, but the runtime environment is not reaching Ollama
  • The GPU is working, but the model is too large and falls back to CPU or mixed memory

Once you separate those three cases, troubleshooting becomes much faster.
If you are on an AMD platform, pay special attention to ROCm version matching, device visibility, and compatibility variables instead of focusing only on the Ollama command itself.

Original video: https://www.bilibili.com/video/BV1cHoYBqE8k/
