How to Get GGUF Models from Hugging Face with llama.cpp

A short guide to downloading GGUF models with llama.cpp from Hugging Face, switching compatible endpoints, and converting non-GGUF formats.

llama.cpp can work directly with GGUF models hosted on Hugging Face, so you do not always need to download model files manually first.

If a model repository already provides GGUF files, you can use the -hf argument in the CLI, for example:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

By default, this downloads from Hugging Face.
If you use another service that exposes a Hugging Face compatible API, you can switch the download endpoint with the MODEL_ENDPOINT environment variable.
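As a minimal sketch, switching the endpoint looks like the following; the base URL here is a placeholder, not a real mirror, so substitute the endpoint of the service you actually use:

```shell
# Point -hf downloads at a Hugging Face compatible endpoint.
# The URL below is a placeholder; replace it with your service's base URL.
export MODEL_ENDPOINT=https://example-mirror.invalid/
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```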

One important detail is that llama.cpp only works directly with the GGUF format.
If your model is in another format, you need to convert it first with the convert_*.py scripts provided in the repository.
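For instance, converting a Hugging Face checkpoint with convert_hf_to_gguf.py might look like this; the paths ./my-model and my-model-f16.gguf are placeholders, and a local clone of the llama.cpp repository is assumed:

```shell
# Install the conversion script's Python dependencies,
# then convert a locally downloaded Hugging Face checkpoint to GGUF.
# ./my-model and my-model-f16.gguf are placeholder paths.
pip install -r llama.cpp/requirements.txt
python llama.cpp/convert_hf_to_gguf.py ./my-model \
    --outfile my-model-f16.gguf \
    --outtype f16
```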

Hugging Face also offers several online tools related to llama.cpp, including:

  • converting models to GGUF
  • quantizing weights to reduce size
  • converting LoRA adapters
  • editing GGUF metadata in the browser
  • hosting llama.cpp inference endpoints
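The quantization step from the list above can also be done locally with the llama-quantize tool that ships with llama.cpp. A small sketch, assuming you already have an f16 GGUF file (the file names are placeholders):

```shell
# Shrink an f16 GGUF to 4-bit weights.
# Q4_K_M is a commonly used size/quality trade-off; file names are placeholders.
llama-quantize my-model-f16.gguf my-model-q4_k_m.gguf Q4_K_M
```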

If you only want the practical takeaway, start with repositories that already provide GGUF, then use llama-cli -hf <user>/<model>. In most cases, that is the simplest path.
