llama-cpp-runtime

star 0

llama.cpp runtime/session control: use llama-cli and llama-server commands/flags to run local GGUF models and serve an API in a worker terminal. Trigger when the controller needs to operate llama.cpp like a human.

dickymoore By dickymoore schedule Updated 3/5/2026

name: llama-cpp-runtime description: "llama.cpp runtime/session control: use llama-cli and llama-server commands/flags to run local GGUF models and serve an API in a worker terminal. Trigger when the controller needs to operate llama.cpp like a human."

llama.cpp Runtime

Overview

Operate llama.cpp safely: run local GGUF models via llama-cli or serve an API via llama-server.

Session Safety

  1. Confirm idle state
  • Snapshot and/or status the worker; do not intervene mid-run.
  • Only proceed when the worker is at a prompt or explicitly idle.

Core Commands

llama-cli (interactive/local runs)

  • Run a local model file:
    • llama-cli -m my_model.gguf
  • Download and run directly from Hugging Face:
    • llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
  • Conversation mode (if not auto-enabled):
    • llama-cli -m model.gguf -cnv --chat-template chatml

llama-server (OpenAI-compatible API)

  • Start a local server on port 8080:
    • llama-server -m model.gguf --port 8080
  • Parallel decoding example:
    • llama-server -m model.gguf -c 16384 -np 4

Guardrails

  • Do not restart mid-run.
  • Use llama-server for API-style usage and llama-cli for interactive/local prompts.
  • If the worker is not llama.cpp, switch to the model-specific runtime skill instead.
Install via CLI
npx skills add https://github.com/dickymoore/macs --skill llama-cpp-runtime
Repository Details
star Stars 0
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator