name: llama-cpp-manage
description: Install, configure, troubleshoot, and operate llama.cpp on Linux or macOS — covers source builds (Debian/Ubuntu apt deps, Arch pacman, Fedora/openSUSE/Alpine, macOS brew), GPU backend selection (Vulkan / Metal / ROCm / CUDA), server lifecycle (port conflicts, health probes, detached vs foreground), pi-coding-agent integration via ~/.pi/agent/models.json, and per-model tuning (ctx, sampling, vision mmproj, GGUF metadata). USE THIS SKILL whenever the user mentions llama-server, llama-cli, llama-bench, llama.cpp build errors, "my model won't load", "the server failed to start", GGUF files, an OpenAI-compatible local server, port conflicts on 8080/8081, or trouble with the pi coding agent against a local model — even if they don't say "llama.cpp" by name. The user's primary CLI is locca; treat it as the frontend and llama.cpp as the runtime it manages.
llama.cpp on Linux / macOS — operations runbook
This skill is the institutional knowledge that goes alongside locca — the TUI that runs llama.cpp, manages models, and launches the pi coding agent against a local server. locca handles the happy path; this skill is the runbook for everything else: installation, troubleshooting, tuning, and integration with adjacent tools.
When you're helping the user
Before changing anything, get a clean read on state (what's installed, what's running). The most useful starting points:
locca status # server source (pid/external/attached), llama.cpp path, models dir
locca api # if a server is up, get URL + LAN/Tailscale URLs + endpoints
which llama-server # is the binary on PATH, and where?
ss -tlnp | grep ':8081' # is something else on the port?
Hit /health, /props, and /slots directly when you need ground truth (locca status already does /props for ctx + slot count, but you may want raw data):
curl -s http://127.0.0.1:8081/health
curl -s http://127.0.0.1:8081/props | jq '.default_generation_settings.params | {n_ctx, temperature, top_p, top_k, min_p}'
curl -s http://127.0.0.1:8081/slots
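The /health probe above is easiest to act on when reduced to its HTTP status. A minimal sketch, assuming llama-server's usual behaviour of answering 200 once the model is loaded and 503 while it is still loading. The function name classify_health is made up for this example, and the curl call is left as a comment so the block has no side effects:

```shell
# classify_health CODE -> prints a one-word state for an HTTP status code.
# Assumption: llama-server answers 200 when ready and 503 while loading;
# curl reports 000 when nothing accepted the connection.
classify_health() {
  case "$1" in
    200) echo ready ;;
    503) echo loading ;;
    000|"") echo unreachable ;;
    *) echo "unexpected:$1" ;;
  esac
}

# Usage against the default locca port (not run here):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 http://127.0.0.1:8081/health)
#   classify_health "$code"
```

An "unexpected" result (for example a 404 or a redirect) is a strong hint that a non-llama service owns the port.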
Logs for a locca-managed server:
tail -f "${XDG_RUNTIME_DIR:-/tmp}/locca-server.log" # or: locca logs
Installing llama.cpp
The right path depends on the distro. Default to a source build with Vulkan, except on Arch (sudo pacman -S llama.cpp) and macOS (brew install llama.cpp), where a package install is a one-liner. Vulkan is the most widely supported GPU backend across AMD, Intel, and NVIDIA on Linux. Use Metal on macOS, and ROCm only when the user has a discrete AMD GPU and asks for it specifically.
See references/install.md for the full per-distro deps matrix. Quick reference:
| Distro | Easy path | Source-build deps |
|---|---|---|
| Arch | sudo pacman -S llama.cpp | base-devel + vulkan-headers + glslang + curl |
| Debian / Ubuntu | (no package) — build from source | cmake build-essential pkg-config libvulkan-dev glslc glslang-tools spirv-headers libcurl4-openssl-dev |
| Fedora | (no package) — build from source | cmake gcc-c++ make pkgconf vulkan-headers vulkan-loader-devel glslang libcurl-devel |
| openSUSE | (no package) | similar to Fedora |
| Alpine | (no package) | cmake build-base pkgconfig vulkan-headers vulkan-loader-dev glslang curl-dev |
| macOS | brew install llama.cpp | cmake for source builds (Metal default) |
Source build (Vulkan, the typical path on Debian/Ubuntu/Fedora):
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cmake -B ~/llama.cpp/build -S ~/llama.cpp -DGGML_VULKAN=ON
cmake --build ~/llama.cpp/build -j
export PATH="$HOME/llama.cpp/build/bin:$PATH" # this session
echo 'export PATH="$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc # persist
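The echo ... >> ~/.bashrc line above appends a duplicate entry every time it is re-run. A minimal sketch of an idempotent variant follows; append_once is a name made up for this example, not part of locca or llama.cpp:

```shell
# append_once FILE LINE -> appends LINE to FILE only if no identical line exists.
append_once() {
  file=$1
  line=$2
  touch "$file"
  # -q quiet, -x whole-line match, -F fixed string (no regex interpretation)
  grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (not run here):
#   append_once ~/.bashrc 'export PATH="$HOME/llama.cpp/build/bin:$PATH"'
```

Running it twice leaves a single PATH line in the file, so it is safe to paste into a setup script.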
locca setup already renders the right install hint per detected distro — if the user is fresh, prefer running that over hand-typing instructions. If they hit a build error, see references/install.md for the gotchas.
Common build errors and fixes
These come up reliably on Debian/Ubuntu source builds:
| Error | Missing | Fix |
|---|---|---|
| cmake: command not found | cmake | sudo apt install -y cmake build-essential |
| Could NOT find Vulkan (missing: glslc) | shaderc binary | sudo apt install -y glslc |
| fatal error: spirv/unified1/spirv.hpp | SPIRV headers | sudo apt install -y spirv-headers |
| fatal error: vulkan/vulkan.h | Vulkan dev headers | sudo apt install -y libvulkan-dev |
| Could NOT find CURL | curl dev | sudo apt install -y libcurl4-openssl-dev |
| cc1plus: error: unknown ... -march=native | wrong build platform | rebuild on the deploy host (don't transplant binaries across CPU families) |
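After installing the missing package, re-run the step that failed. Configure-time errors (cmake: command not found and the Could NOT find ... rows) need the cmake -B ~/llama.cpp/build -S ~/llama.cpp -DGGML_VULKAN=ON command again before building. Compile-time errors (the fatal error: rows) only need cmake --build ~/llama.cpp/build -j, which picks up where it left off. No clean is needed in either case.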
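The table above can be applied mechanically to a saved build log. The following is a sketch only: suggest_fix is a hypothetical helper, the match strings are the ones listed in the table, and the package names are the Debian/Ubuntu ones.

```shell
# suggest_fix reads build output on stdin and prints one apt command per
# recognised error from the table above (Debian/Ubuntu package names).
suggest_fix() {
  while IFS= read -r line; do
    case "$line" in
      *"cmake: command not found"*) echo "sudo apt install -y cmake build-essential" ;;
      *"missing: glslc"*)           echo "sudo apt install -y glslc" ;;
      *"spirv/unified1/spirv.hpp"*) echo "sudo apt install -y spirv-headers" ;;
      *"vulkan/vulkan.h"*)          echo "sudo apt install -y libvulkan-dev" ;;
      *"Could NOT find CURL"*)      echo "sudo apt install -y libcurl4-openssl-dev" ;;
    esac
  done | sort -u   # de-duplicate when the same error repeats in the log
}

# Usage (not run here; build.log is a placeholder for captured output):
#   cmake --build ~/llama.cpp/build -j 2>&1 | tee build.log
#   suggest_fix < build.log
```

It prints nothing when no known error is present, which is the cue to read the log by hand.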
When llama-server fails to start
Three frequent causes, in order of likelihood:
1. Port already in use. llama-server's bind error reads couldn't bind HTTP server socket, hostname: 127.0.0.1, port: <N>. Check:
ss -tlnp | grep ':<port> '
curl -sI http://127.0.0.1:<port>/ # often reveals the conflicting service via Server header / page title
Port 8080 is a frequent collision because many self-hosted apps default to it. Fix by changing locca's defaultPort in ~/.config/locca/config.json to a free port (8081 and 18080 are common alternatives). locca has a preflight check (refuseIfPortTaken in src/preflight.ts) that names the conflicting service via its Server header / page <title> — but only fires before spawning, so a llama-server that crashed mid-startup won't show this.
2. Source build is broken / out of date. The build_info line in the log starts with b<NNNN>-<commit> — if it's months old and git log --oneline -1 in ~/llama.cpp shows the source has moved on, rebuild:
cd ~/llama.cpp && git pull && cmake --build build -j
3. Missing GPU drivers. Vulkan needs the Mesa driver on AMD/Intel, the proprietary driver on NVIDIA. Check:
vulkaninfo --summary 2>&1 | head -20 # should list at least one GPU
ls /dev/dri/ # renderD128 is the first GPU render node (often the AMD/Intel iGPU)
No GPU listed → fall back to CPU build by re-running cmake without -DGGML_VULKAN=ON, or install the Vulkan driver (mesa-vulkan-drivers on Debian).
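For the port-conflict case above, picking a replacement port can be scripted against ss output. A minimal sketch, assuming the column layout of ss -tln from iproute2 (local address in the fourth column); first_free_port is a made-up name, and the function reads the ss output from stdin so it can be tried on captured text:

```shell
# first_free_port PORT... -> prints the first candidate that does not appear
# as a listening local port in `ss -tln` output supplied on stdin.
first_free_port() {
  # Column 4 is "address:port"; keep only the part after the last colon.
  listening=$(awk 'NR > 1 { n = split($4, a, ":"); print a[n] }')
  for port in "$@"; do
    # -x forces a whole-line match, so 8080 does not match 18080.
    if ! printf '%s\n' "$listening" | grep -qx -- "$port"; then
      echo "$port"
      return 0
    fi
  done
  return 1
}

# Usage (not run here):
#   ss -tln | first_free_port 8080 8081 18080
```

The result is what would go into defaultPort in ~/.config/locca/config.json.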
When pi can't see the model
pi 0.70+ removed the built-in --provider llamacpp flag. Local OpenAI-compatible servers are now registered as a custom provider in ~/.pi/agent/models.json. locca writes this file automatically before launching pi, but if pi reports Unknown provider "llamacpp" from a script bypassing locca, the registration is missing.
Minimal valid ~/.pi/agent/models.json:
{
  "providers": {
    "locca": {
      "name": "locca (local llama.cpp)",
      "baseUrl": "http://127.0.0.1:8081/v1",
      "api": "openai-completions",
      "apiKey": "unused",
      "models": [
        {
          "id": "<exact-filename-from-/v1/models>",
          "name": "<friendly>",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 32768,
          "maxTokens": 8192,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}
Then launch pi with pi --model locca/<model-id>. The id must match what curl -s http://127.0.0.1:8081/v1/models | jq -r '.data[0].id' returns, which is usually the full GGUF filename.
If the user has other providers registered, only touch the locca key — that's the one locca owns. See src/pi-config.ts in the locca source for the canonical writer.
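One way to honour the "only touch the locca key" rule from a script is a jq assignment on that single path. This is a sketch, not the logic in src/pi-config.ts: set_locca_provider is a hypothetical name, and jq is assumed to be installed.

```shell
# set_locca_provider FILE PROVIDER_JSON
# Replaces only .providers.locca in FILE and leaves every other provider as-is.
# Creates FILE (and its directory) with an empty object if it does not exist.
set_locca_provider() {
  file=$1
  provider=$2
  mkdir -p "$(dirname "$file")"
  [ -s "$file" ] || echo '{}' > "$file"
  tmp=$(mktemp) || return 1
  # Write to a temp file first so a jq error never truncates the original.
  if jq --argjson p "$provider" '.providers.locca = $p' "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Usage (not run here; provider.json holds the "locca" object shown above):
#   set_locca_provider ~/.pi/agent/models.json "$(cat provider.json)"
```

Writing through a temporary file means a malformed provider object leaves the existing registration intact.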
Server lifecycle modes
locca has three ways the server can be running, exposed in locca status as the source field:
| Source | Meaning | locca stop | locca serve |
|---|---|---|---|
| pid | locca spawned and tracks the PID | works | refuses (server already running) |
| external | serverUrl is set in config; locca only talks to it | refuses (not ours) | refuses (would conflict) |
| attached | something else (a manually started llama-server, another supervisor) is on the local port; locca probed /health and uses it as a read-only client | refuses (not ours) | refuses (port taken) |
The most surprising one is attached. If a llama-server is already running on locca's defaultPort (started by hand, by another supervisor, or by an external tool), locca will attach to it and locca pi works without spawning a duplicate that would fight for VRAM. To swap to a locca-spawned server, the existing one has to be stopped first by whatever started it.
For an externally-managed server on a different port or another host, set serverUrl in ~/.config/locca/config.json:
{ "serverUrl": "http://localhost:8081" }
This switches locca into external-mode for all commands; serve/stop/logs start refusing because there's nothing for them to do.
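The three sources can be read as a small decision procedure. The sketch below is an illustration only: the precedence order (serverUrl first, then a tracked PID, then a /health probe) is an assumption rather than something taken from src/server.ts, and both server_source and the "none" label are made up for this example.

```shell
# server_source SERVER_URL PID_ALIVE HEALTH_OK
#   SERVER_URL: value of serverUrl from the config ("" if unset)
#   PID_ALIVE:  1 if a locca-tracked PID is running, else 0
#   HEALTH_OK:  1 if /health on the local port answered, else 0
# Assumed precedence: external > pid > attached > none.
server_source() {
  if [ -n "$1" ]; then
    echo external
  elif [ "$2" = 1 ]; then
    echo pid
  elif [ "$3" = 1 ]; then
    echo attached
  else
    echo none
  fi
}

# Usage: a hand-started llama-server on the port, no serverUrl, no PID file:
#   server_source "" 0 1
```

Always confirm against locca status, which reports the real source field.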
Foreground vs detached
locca serve runs llama-server detached by default (writes to log file, exits with the PID). This is intentional — Ctrl-C'ing locca doesn't kill the server. To stop, use locca stop. To watch logs, use locca logs (which tails ${XDG_RUNTIME_DIR:-/tmp}/locca-server.log).
If the user complains "the server stops when I close my terminal" they're probably running llama-server directly; suggest locca serve or wrap their command in nohup/systemd.
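For users who do run llama-server directly, the nohup route mentioned above can be wrapped in a few lines. A minimal sketch: run_detached is a made-up helper, and the model path in the usage comment is a placeholder.

```shell
# run_detached LOGFILE CMD [ARGS...] -> starts CMD immune to hangups,
# sends its stdout and stderr to LOGFILE, and prints the child's PID.
run_detached() {
  log=$1
  shift
  # stdin comes from /dev/null so the process holds no terminal handle.
  nohup "$@" > "$log" 2>&1 < /dev/null &
  echo "$!"
}

# Usage (not run here):
#   pid=$(run_detached /tmp/llama-server.log llama-server -m /path/to/model.gguf --port 8081)
#   kill "$pid"   # stop it later
```

A systemd user unit is the sturdier choice when the server should also survive reboots; this helper only covers closing the terminal.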
Per-model tuning
locca picks reasonable defaults but several knobs matter for quality on a given GPU:
Context window
src/models.ts:ctxForModel() sets ctx by parameter count regex:
| Class | Auto ctx |
|---|---|
| MoE / *A3B* | 131072 (128k) |
| 30–35B dense | 65536 (64k) |
| 22–27B dense | 32768 (32k) |
| 12–14B dense | 65536 (64k) |
| 3–9B dense | 131072 (128k) |
| Other / unrecognised | 32768 (default) |
Bigger ctx = larger KV cache = more VRAM. A q8_0 KV cache (locca's default) is roughly half the size of f16 (about 8.5 bits per value instead of 16), and that saving is what lets 128k fit on a 16GB iGPU for an 8B model. If the user OOMs, halving ctx is the first lever, then dropping to a smaller quant.
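The VRAM cost of a given ctx can be estimated with integer arithmetic. A rough sketch under stated assumptions: standard attention on every layer (no sliding-window or shared-KV layers), f16 at 2 bytes per element, and q8_0 at 34 bytes per block of 32 elements. kv_cache_mib is a made-up helper; the layer and head counts must come from the model's own metadata.

```shell
# kv_cache_mib N_LAYERS N_KV_HEADS HEAD_DIM N_CTX TYPE -> KV cache size in MiB.
# Elements = 2 (K and V) * layers * kv_heads * head_dim * ctx.
# f16 stores 2 bytes per element; q8_0 stores 34 bytes per 32 elements
# (32 int8 values plus one f16 scale).
kv_cache_mib() {
  elems=$((2 * $1 * $2 * $3 * $4))
  case "$5" in
    f16)  bytes=$((elems * 2)) ;;
    q8_0) bytes=$((elems * 34 / 32)) ;;
    *) echo "unknown cache type: $5" >&2; return 1 ;;
  esac
  echo $((bytes / 1024 / 1024))
}

# Usage with an illustrative 8B-class shape (32 layers, 8 KV heads, head dim 128):
#   kv_cache_mib 32 8 128 131072 f16
#   kv_cache_mib 32 8 128 131072 q8_0
```

For that illustrative shape at 131072 tokens the function prints 16384 for f16 and 8704 for q8_0, before counting the model weights themselves.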
If a model isn't getting matched correctly, check src/models.ts — the regex uses negative lookahead/lookbehind to avoid e.g. "9B" matching "32B" via substring. Add a new bucket if needed.
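To test how a filename would be bucketed without reading the TypeScript, the table can be approximated in shell. This is a sketch, not a port of ctxForModel(): the patterns below are assumptions that mimic the boundary checks with plain extended regexes, and treating any A<n>B tag as MoE is a guess based on the A3B row.

```shell
# ctx_for_model FILENAME -> prints the context size from the bucket table above.
ctx_for_model() {
  name=$1
  # MoE names such as "...-A3B-..." carry an active-parameter tag.
  if printf '%s\n' "$name" | grep -qiE '(^|[^a-z0-9])a[0-9]+b([^a-z0-9]|$)'; then
    echo 131072
    return 0
  fi
  # Capture the whole digit run before "B" so "32B" is never read as "2B".
  size=$(printf '%s\n' "$name" |
    sed -nE 's/^(.*[^0-9.])?([0-9]+)[Bb]([^A-Za-z0-9].*)?$/\2/p')
  case "$size" in
    3[0-5]) echo 65536 ;;    # 30-35B dense
    2[2-7]) echo 32768 ;;    # 22-27B dense
    1[2-4]) echo 65536 ;;    # 12-14B dense
    [3-9])  echo 131072 ;;   # 3-9B dense
    *)      echo 32768 ;;    # other / unrecognised
  esac
}

# Usage:
#   ctx_for_model Qwen3-32B-Q4_K_M.gguf
```

If this sketch and locca disagree on a filename, trust locca and read the regex in src/models.ts.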
Sampling parameters
llama-server reads sampling defaults from the GGUF metadata when --jinja is on (locca always sets it). For example, Gemma 4 GGUFs embed temp=1.0, top_p=0.95, top_k=64 and llama-server picks those up — locca doesn't override. The user can verify via /props:
curl -s http://127.0.0.1:8081/props | jq '.default_generation_settings.params | {temperature, top_p, top_k, min_p}'
If a model card recommends different sampling, the user has two options: (1) override per-request in their client, or (2) modify the GGUF metadata with gguf-py (rare, only worth it for self-hosted finetunes).
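Option (1), the per-request override, looks like this from the command line. A minimal sketch assuming jq is installed and the server is on the default port; chat_payload is a made-up helper, and only temperature and top_p (standard OpenAI-style fields) are set:

```shell
# chat_payload MODEL PROMPT TEMPERATURE TOP_P -> prints a JSON request body.
# jq handles string escaping; --argjson keeps the numbers numeric.
chat_payload() {
  jq -cn --arg model "$1" --arg prompt "$2" \
         --argjson temperature "$3" --argjson top_p "$4" \
    '{model: $model,
      messages: [{role: "user", content: $prompt}],
      temperature: $temperature,
      top_p: $top_p}'
}

# Usage (not run here; <model-id> is whatever /v1/models reports):
#   chat_payload '<model-id>' 'Say "hi"' 0.7 0.8 |
#     curl -s http://127.0.0.1:8081/v1/chat/completions \
#          -H 'Content-Type: application/json' -d @-
```

Values sent this way apply to that request only; the defaults reported by /props stay unchanged.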
Vision (mmproj)
Multimodal models (Gemma 4, Llava, Bakllava, etc.) ship a separate mmproj-*.gguf projector file. Drop it as a sibling in the same directory as the main GGUF and locca auto-detects (see scanModels in src/models.ts) — it tags the model with [vision] in pickers and passes --mmproj <path> to llama-server.
If the user downloaded only the main file, vision won't work. locca download <repo> shows the mmproj as an optional second download.
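When launching llama-server by hand, the sibling lookup can be reproduced with a glob. A sketch only: find_mmproj is a hypothetical helper that returns the first mmproj-*.gguf in the model's directory, which may differ from how scanModels pairs files when several models share a directory.

```shell
# find_mmproj MODEL_PATH -> prints the first mmproj-*.gguf next to the model,
# or returns 1 if there is none.
find_mmproj() {
  dir=$(dirname "$1")
  for candidate in "$dir"/mmproj-*.gguf; do
    # An unmatched glob stays literal, so confirm the file really exists.
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Usage (not run here; the model path is a placeholder):
#   model=/path/to/model.gguf
#   if proj=$(find_mmproj "$model"); then
#     llama-server -m "$model" --mmproj "$proj" --port 8081
#   fi
```

A non-zero return is the "downloaded only the main file" case described above.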
Useful files in the locca source
When debugging behaviour, these are the load-bearing files:
| File | What's there |
|---|---|
| src/server.ts | serverStatus() (the source-detection logic), launchServer(), waitReady() (uses /health), probeServer(), isPortInUse(), describePortOccupant() |
| src/preflight.ts | The "port taken by another service" error (names the conflicting service via Server header / page title) |
| src/models.ts | scanModels() (ggml-vocab and mmproj filtering), ctxForModel() |
| src/pi-config.ts | Writes ~/.pi/agent/models.json |
| src/distro.ts | Detects distro from /etc/os-release and renders apt/pacman/dnf install hints |
| src/commands/*.ts | Each subcommand (serve, pi, switch, status, bench, …) |
Diagnostic commands
When in doubt, run the bundled scripts/diagnose.sh for a one-shot health snapshot covering distro, llama.cpp binaries, locca config, server /health + /v1/models + /props, PID file state, models dir size, pi agent registration, and Vulkan device list. It's read-only — never starts or stops anything.
The script lives next to this SKILL.md; run it with the absolute path the skill was loaded from, or copy-paste the relevant pieces inline. The full set of probes it does (in order):
- /etc/os-release for distro detection
- command -v llama-server etc. for binary discovery
- ~/.config/locca/config.json parse via jq
- curl /health, /v1/models, /props on the configured port
- ss -tln if /health fails, to see if a non-llama service is squatting the port
- ${XDG_RUNTIME_DIR}/locca-server.pid for managed-server status
- Models dir count + size
- ~/.pi/agent/models.json providers list
- vulkaninfo --summary for GPU detection
If you don't have the script handy, the same probes run by hand will surface the same issues.
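When running the probes by hand, a tiny wrapper keeps one failing probe from hiding the rest. This is a sketch and not an excerpt from scripts/diagnose.sh; run_probe is a made-up name, and it discards the probe's output, reporting only success or failure.

```shell
# run_probe LABEL CMD [ARGS...] -> runs a read-only command, prints
# "LABEL: ok" or "LABEL: FAILED (exit N)", and never aborts the caller.
run_probe() {
  label=$1
  shift
  if "$@" > /dev/null 2>&1; then
    echo "$label: ok"
  else
    echo "$label: FAILED (exit $?)"
  fi
  return 0
}

# Usage (not run here):
#   run_probe "llama-server on PATH" command -v llama-server
#   run_probe "server /health" curl -sf --max-time 2 http://127.0.0.1:8081/health
#   run_probe "vulkan device" vulkaninfo --summary
```

Re-run any failing probe without the wrapper to see its actual output.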