name: huggingface-local-models description: "Select GGUF artifacts and quantizations for llama.cpp on CPU, Mac Metal, CUDA, or ROCm runtimes. Covers Q4_K_M vs Q5_K_M vs Q6_K trade-offs, llama-server launch flags, --hf-repo/--hf-file fallback for non-standard naming, and conversion from Transformers weights when no GGUF exists. Use when the user picks llama.cpp / LM Studio / Ollama on non-Apple-Silicon platforms. Loaded by llm-externalizer-setup-agent." user-invocable: false
Overview
Search the Hugging Face Hub for llama.cpp-compatible GGUF repos, choose the right quant for the target hardware, and launch the model with llama-cli or llama-server. For MLX on Apple Silicon, see huggingface-mlx-models instead.
Prerequisites
The setup wizard's scripts/setup/recommend-models.py emits, for every recommended model, the list of GGUF artifacts on HF (whatcani.run runtime=llama.cpp filter) with pre-built download_command lines. The wizard runs that command verbatim. This skill is consulted for:
- Quantization-quality trade-offs (multiple compatible quants for the same model)
llama-serverlaunch flags (context size, threading, KV-cache offload, flash-attention)--hf-repo/--hf-filefallback for non-standard naming- Converting from Transformers weights when no GGUF exists
apps=llama.cppHF Hub filters for obscure repos
The wizard does NOT call this skill on Apple Silicon arm64 when the user picked MLX as the runtime — huggingface-mlx-models handles that path.
External requirements:
llama.cppinstalled (brew install llama.cpp,winget install llama.cpp, or build from source)hfCLI authenticated for gated repos (hf auth login)
Instructions
- Search the Hub with
apps=llama.cpp. - Open
https://huggingface.co/<repo>?local-app=llama.cpp. - Prefer the exact HF local-app snippet and quant recommendation when visible.
- Confirm exact
.gguffilenames withhttps://huggingface.co/api/models/<repo>/tree/main?recursive=true. - Launch with
llama-cli -hf <repo>:<QUANT>orllama-server -hf <repo>:<QUANT>. - Fall back to
--hf-repoplus--hf-filewhen the repo uses custom file naming. - Convert from Transformers weights only if the repo does not already expose GGUF files.
Output
Return the recommended GGUF artifact + launch command + verified smoke-test result to the user.
Error Handling
See launch-recipes.md §Failure modes: custom file naming, no GGUF artifact, gated repo, smoke-test fails.
Examples
# Install + auth + serve
brew install llama.cpp
hf auth login
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
# Run an exact GGUF file with explicit context size
llama-server --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -c 4096
Resources
- launch-recipes
Install llama.cpp · Authenticate for gated repos · Search the Hub · Run directly from the Hub · Run an exact GGUF file · Convert only when no GGUF is available · Smoke test a local server · Quant Choice · Failure modes
- hub-discovery
Core URLs · Search for llama.cpp-compatible models · Use the local-app page for the recommended quant · Confirm exact files from the tree API · Build the command · Example:
unsloth/Qwen3.6-35B-A3B-GGUF· Notes - quantization
Hub-first quant selection · Quantization Formats · Converting Models · K-Quantization Methods · Quality Testing · Use Case Guide · Model Size Scaling · Finding Pre-Quantized Models · Importance Matrices (
imatrix) · Troubleshooting - hardware
Apple Silicon (Metal) · NVIDIA (CUDA) · AMD (ROCm) · CPU
- llama.cpp:
https://github.com/ggml-org/llama.cpp - HF GGUF + llama.cpp docs:
https://huggingface.co/docs/hub/gguf-llamacpp - HF Local Apps docs:
https://huggingface.co/docs/hub/main/local-apps - GGUF converter Space:
https://huggingface.co/spaces/ggml-org/gguf-my-repo