This article documents a local Agent deployment plan: run a Qwen3.6 GGUF model with llama.cpp inside WSL2, then connect Hermes Agent to the local OpenAI-compatible API. The result is a long-running local AI assistant on your own computer, with no pay-per-token charges from an online service.
This setup suits users who want to try local AI Agents while keeping their data private and under their own control over the long term. It can be used for daily Q&A, writing, coding assistance, document organization, and simple automation tasks. The larger the model, the higher the VRAM requirement: the original example uses Qwen3.6-27B, which runs more stably with 24GB of VRAM. If you have less VRAM, choose a smaller model or a lower quantization.
Architecture
The overall chain is simple:
- Install WSL2 and Ubuntu 24.04 on Windows.
- Install CUDA Toolkit inside WSL2 and compile `llama.cpp`.
- Download the Qwen3.6 GGUF model.
- Start a local model service with `llama-server`.
- Install Hermes Agent and configure it to `http://localhost:8080/v1`.
- Optional: write a startup script so the model service starts automatically when WSL2 opens.
Hermes provides the Agent capability, while Qwen3.6 provides the local LLM capability. Together, they turn the computer into a private local AI assistant.
Install WSL2 and Ubuntu
Run in an administrator Windows PowerShell window:
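A minimal sketch, assuming a current Windows build where a single command enables the required WSL2 components (on older builds you may need to enable the "Virtual Machine Platform" feature manually):

```powershell
# Enable WSL2 without installing a distribution yet
wsl --install --no-distribution
```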
After rebooting, install Ubuntu 24.04:
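```powershell
wsl --install -d Ubuntu-24.04
```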
After installation, Ubuntu prompts you to set a username and password. Once inside Ubuntu, first check whether the NVIDIA GPU is visible in WSL2:
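```bash
# Should list the GPU via the Windows driver; if not, update the driver on Windows
nvidia-smi
```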
If the GPU cannot be detected, update the NVIDIA driver on Windows first. WSL2 inherits the Windows driver, but CUDA Toolkit still needs to be installed separately inside WSL2.
Install Python and Basic Tools
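A typical baseline; the exact package set is an assumption, so adjust to taste:

```bash
sudo apt update
sudo apt install -y python3 python3-pip python3-venv
```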
You also need build tools, Git, and CMake:
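```bash
sudo apt install -y build-essential git cmake
```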
Compile llama.cpp
Clone the repository:
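```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
```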
If CUDA is already available in WSL2, compile directly:
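A typical CUDA build; `-j` parallelizes compilation across all cores:

```bash
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j
```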
`CMAKE_CUDA_ARCHITECTURES=89` is suitable for Ada GPUs, such as RTX 40-series cards. Adjust it according to your actual GPU architecture.
If compilation reports that CUDA Toolkit is missing, install CUDA Toolkit inside WSL2 first:
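One common route is NVIDIA's WSL-Ubuntu apt repository; the toolkit version below is only an example, so check NVIDIA's download page for the current one:

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-6
```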
Configure environment variables:
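Assuming the toolkit installed under `/usr/local/cuda`:

```bash
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```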
Then rebuild:
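Clearing the old build directory avoids stale CMake cache entries:

```bash
rm -rf build
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j
```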
Download the Qwen3.6 GGUF Model
The example uses Qwen3.6-27B-UD-Q4_K_XL.gguf from unsloth/Qwen3.6-27B-GGUF:
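One way to fetch it, assuming the `huggingface-cli` tool and a `~/models` target directory:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download unsloth/Qwen3.6-27B-GGUF \
  Qwen3.6-27B-UD-Q4_K_XL.gguf --local-dir ~/models
```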
The file is about 17GB. If Hugging Face is slow, use a mirror such as ModelScope. Do not force a 27B model if your VRAM is insufficient; use a smaller model or lower quantization.
Start the Local Model Service
Start llama-server with your own model file name:
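A sketch, run from the `llama.cpp` directory; the model path is an example, `-ngl 99` offloads all layers to the GPU, and the context size matches the notes at the end:

```bash
./build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --port 8080 \
  --ctx-size 65536 \
  -ngl 99
```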
After startup, open this in a Windows browser:
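```
http://localhost:8080
```

llama-server's built-in web UI should load, which confirms the service is reachable from Windows.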
For Hermes Agent or other OpenAI-compatible clients, the API endpoint is usually:
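```
http://localhost:8080/v1
```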
Thinking Mode Tradeoff
Qwen3.6 may enable Thinking mode by default. It is suitable for complex reasoning, complicated coding problems, and multi-step analysis, but it is slower.
To disable Thinking mode, stop the service and relaunch it with `--chat-template-kwargs`:
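A sketch reusing the launch command from above, assuming a recent llama.cpp build that supports `--chat-template-kwargs`; the `enable_thinking` key follows the Qwen chat-template convention:

```bash
./build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --port 8080 \
  --ctx-size 65536 \
  -ngl 99 \
  --chat-template-kwargs '{"enable_thinking": false}'
```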
After disabling Thinking, simple Q&A, writing, code completion, and code explanation become faster. For complex algorithm design, difficult debugging, and architecture analysis, Thinking mode is still recommended.
Install Hermes Agent
Keep llama-server running, then open a new WSL2 terminal and install Hermes Agent:
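The install command itself lives in the repo's README and may change; the sketch below only shows the generic clone-and-run pattern, with `install.sh` as a hypothetical entry point:

```bash
git clone https://github.com/NousResearch/hermes-agent.git
cd hermes-agent
# Hypothetical entry point; use the actual command from the repo's README
./install.sh
```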
The installer handles dependencies such as Python, Node.js, ripgrep, and ffmpeg. When configuring the model endpoint, choose a custom endpoint:
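For a generic OpenAI-compatible client, the settings look like this; field names vary by client:

```
Base URL: http://localhost:8080/v1
API Key:  any-placeholder-value
```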
For a local llama-server, the API Key can be any placeholder value. After configuration, you can connect Telegram, WeChat, QQ, Discord, and other chat tools, allowing Hermes Agent to call the local model and execute tasks from those entry points.
Auto-Start the Model Service
You can write a startup script so the model service starts automatically when a WSL2 terminal opens.
Create the script:
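A minimal sketch; all paths are examples and should match where you cloned llama.cpp and stored the model:

```bash
cat > ~/start-llama.sh << 'EOF'
#!/bin/bash
# Start llama-server in the background only if it is not already running
if ! pgrep -f llama-server > /dev/null; then
  nohup "$HOME/llama.cpp/build/bin/llama-server" \
    -m "$HOME/models/Qwen3.6-27B-UD-Q4_K_XL.gguf" \
    --port 8080 --ctx-size 65536 -ngl 99 \
    > "$HOME/llama-server.log" 2>&1 &
fi
EOF
chmod +x ~/start-llama.sh
```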
Write it into .bashrc:
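```bash
# Append once; re-running this line would add duplicate entries
echo '~/start-llama.sh' >> ~/.bashrc
```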
Each time you open a WSL2 terminal, it will start llama-server if it is not already running. If it is running, it skips startup and avoids duplicate processes.
Notes
- 27B models require substantial VRAM; 24GB VRAM is more stable. Use a smaller model if VRAM is limited.
- `--ctx-size 65536` significantly increases VRAM and RAM pressure. If unstable, reduce it to `32768` or lower.
- Both the CUDA Toolkit in WSL2 and the Windows GPU driver must work properly; a problem on either side can cause CUDA compilation or runtime failures.
- Hermes Agent calls the local service through an OpenAI-compatible API; the key is that `http://localhost:8080/v1` responds correctly.
- If accessing from a phone or another device, handle Windows Firewall, LAN addressing, and security isolation. Do not expose the local model service directly to the public internet.
Related Links
- Original article (Chinese): Hermes + Qwen3.6: The Strongest Local Agent Combo! Zero Cost, Unlimited Tokens, What a Deal!
- llama.cpp: ggerganov/llama.cpp
- Hermes Agent: NousResearch/hermes-agent
- Qwen3.6 GGUF example: unsloth/Qwen3.6-27B-GGUF