name: llama-cpp
description: Run quantized LLMs locally with llama.cpp - CPU+GPU inference, GGUF format, OpenAI-compatible server, and Python bindings.
version: 1.0.0
author: hermes-CCC (ported from Hermes Agent by NousResearch)
license: MIT
metadata:
  hermes:
    tags: [MLOps, llama-cpp, Local-Inference, GGUF, Quantization, CPU]
    related_skills: []
llama.cpp
Purpose
- Use this skill to run quantized GGUF models on laptops, workstations, and edge systems.
- Prefer it when you need portable local inference without a heavyweight serving stack.
- llama.cpp is especially useful for CPU-first deployments, low-cost GPU offload, and offline workflows.
- Python users typically access it through llama-cpp-python.
Install
- Fastest path for Python users:
pip install llama-cpp-python
- Build from source when you need custom acceleration backends or tighter platform control.
- Source builds are common for CUDA, Metal, ROCm, Vulkan, and CPU-tuned environments.
- Confirm the package imports successfully:
python -c "from llama_cpp import Llama; print('ok')"
Model Format
- llama.cpp primarily uses the GGUF model format.
- GGUF packages tokenizer metadata, architecture settings, and quantized weights in one artifact.
- Choose a GGUF variant that matches your hardware budget and quality target.
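Because everything lives in one file, a quick header read can confirm that a download really is GGUF before you try to load it. The sketch below is a minimal example, not part of llama.cpp: the helper name read_gguf_header is hypothetical, and it assumes the little-endian header layout used by GGUF version 2 and later (4-byte magic, uint32 version, then two uint64 counts).

```python
import struct
from pathlib import Path


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count), or None if not GGUF.

    Assumes the GGUF v2+ little-endian layout; the helper name is hypothetical.
    """
    with Path(path).open("rb") as f:
        header = f.read(24)  # 4-byte magic + uint32 version + two uint64 counts
    if len(header) < 24 or header[:4] != b"GGUF":
        return None
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

A None result usually means the file is truncated, is an HTML error page saved by mistake, or uses a different format.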
Download GGUF Models
Hugging Face is the standard source for GGUF checkpoints.
Common repos include:
- bartowski/*
- TheBloke/*
Typical examples:
- bartowski/Llama-3.1-8B-Instruct-GGUF
- TheBloke/Mistral-7B-Instruct-v0.2-GGUF
- bartowski/Qwen2.5-7B-Instruct-GGUF
Store the downloaded file locally, for example:
models/llama-3.1-8b-instruct-q4_k_m.gguf
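One way to fetch a single GGUF file is the huggingface_hub package, sketched below. This assumes huggingface_hub is installed separately (pip install huggingface_hub). The wrapper name download_gguf is hypothetical, and the filename in the usage comment is an assumption: check the repo's file list for the exact name.

```python
from pathlib import Path


def download_gguf(repo_id, filename, local_dir="models"):
    """Download one file from a Hugging Face repo and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    Path(local_dir).mkdir(parents=True, exist_ok=True)
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)


# Example call (needs network access; the filename is an assumption):
# path = download_gguf(
#     "bartowski/Llama-3.1-8B-Instruct-GGUF",
#     "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
# )
```

Downloading one named file avoids pulling every quantization variant in the repo.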
Quantization Levels
- Q4_K_M: best balance for many local deployments
- Q5_K_M: more quality, more RAM or VRAM
- Q8_0: highest quality among common quantized options
- Q2_K: very small, but quality drops sharply
- Start with Q4_K_M unless you already know the task is quality-sensitive.
- Move to Q5_K_M or Q8_0 for coding, reasoning, or long-form generation where quality matters more.
- Use Q2_K only for extreme memory constraints or experiments.
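To make the memory tradeoff concrete, file size can be roughly estimated as parameter count times bits per weight. The sketch below is a back-of-envelope helper, not a llama.cpp API. The bits-per-weight figures are approximations assumed for illustration, and real files differ by architecture and by which tensors are kept at higher precision.

```python
# Approximate bits per weight for each quantization type. These are rough
# figures assumed for illustration; real files vary by architecture.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.0,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def estimate_file_size_gb(n_params, quant):
    """Estimate GGUF weight size in GB (10^9 bytes) for a parameter count."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


for quant in ("Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0"):
    size = estimate_file_size_gb(8e9, quant)
    print(f"{quant}: ~{size:.1f} GB for an 8B model")
```

Leave extra headroom on top of this estimate for the context cache and runtime buffers.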
Basic Python Usage
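from llama_cpp import Llama

# Load the quantized model; n_gpu_layers=0 keeps every layer on the CPU.
llm = Llama(
    model_path="models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=8192,
    n_threads=8,
    n_gpu_layers=0,
)

# Plain text completion; the response follows the OpenAI completion shape.
output = llm(
    "Explain why GGUF is useful for local inference.",
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["text"])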
Important Init Parameters
- model_path: path to the .gguf file on disk
- n_gpu_layers: number of transformer layers to offload to GPU
- n_ctx: context size, constrained by the model and available memory
- n_threads: CPU worker threads for prompt processing and generation
These four parameters are the first tuning knobs to adjust for almost every deployment.
Initialization Guidance
- Keep model_path on a local SSD when possible.
- Set n_threads close to the number of performant CPU cores, not necessarily total logical threads.
- Increase n_ctx only after checking memory pressure.
- Increase n_gpu_layers gradually, and step back down if the model fails to load or performance becomes unstable.
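As a starting point for n_threads, the sketch below halves the logical CPU count. It assumes two hardware threads per core, which is wrong for CPUs without SMT and for hybrid designs with efficiency cores. The helper name suggest_n_threads is hypothetical; treat its result as a first guess to benchmark, not a recommendation from llama.cpp.

```python
import os


def suggest_n_threads(logical_cpus=None, smt_factor=2):
    """Guess a thread count near the number of physical cores.

    smt_factor=2 assumes two hardware threads per core (an assumption).
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, logical_cpus // smt_factor)


print(suggest_n_threads())
```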
Generation Call
- Basic call shape:
result = llm(
    "Write a short checklist for running a local LLM service.",
    max_tokens=256,
    temperature=0.7,
)
- Access the text with:
print(result["choices"][0]["text"])
- Keep max_tokens bounded for interactive usage.
- Use lower temperatures for summarization, extraction, and tool-style tasks.
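A bounded max_tokens can cut a reply short, so it helps to check how the generation ended. The sketch below reads finish_reason from the response dict, which is "length" when the token limit was reached. The helper name completion_text is hypothetical, and the example dict is a hand-written stand-in for a real response.

```python
def completion_text(result):
    """Return (text, truncated) from a completion-style response dict."""
    choice = result["choices"][0]
    truncated = choice.get("finish_reason") == "length"
    return choice["text"], truncated


# Hand-written stand-in for a real response, for illustration only.
example = {"choices": [{"text": "1. Pick a model.", "finish_reason": "length"}]}
text, truncated = completion_text(example)
if truncated:
    print("Output hit max_tokens; consider raising the limit.")
print(text)
```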
Chat Format
- For instruction-tuned models, prefer the chat API:
from llama_cpp import Llama

# Partial offload: 20 layers on the GPU, the rest on the CPU.
llm = Llama(
    model_path="models/qwen2.5-7b-instruct-q4_k_m.gguf",
    n_ctx=8192,
    n_gpu_layers=20,
    n_threads=8,
)

# Messages use the OpenAI chat format.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are concise and technical."},
        {"role": "user", "content": "List three tradeoffs of 4-bit quantization."},
    ],
    max_tokens=256,
    temperature=0.4,
)
print(response["choices"][0]["message"]["content"])
- Use the chat API when the model card says the checkpoint is chat-tuned or instruct-tuned.
- Keep prompt templates aligned with the model family if outputs seem malformed.
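For multi-turn use, the message list has to be rebuilt on every call. The sketch below is one possible way to assemble it while keeping only recent turns so the prompt stays inside n_ctx. The helper name build_messages and the turn-count cutoff are assumptions for illustration; a token-based budget would be more precise.

```python
def build_messages(system_prompt, history, user_input, max_turns=4):
    """Assemble a chat message list, keeping only the most recent turns.

    history is a list of (user_text, assistant_text) pairs.
    """
    # Guard max_turns <= 0 explicitly: history[-0:] would return everything.
    recent = history[-max_turns:] if max_turns > 0 else []
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in recent:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_input})
    return messages
```

The result can be passed directly as the messages argument of create_chat_completion.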
Streaming Output
- Stream tokens for responsive CLI or web applications:
stream = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Describe GPU offload in llama.cpp."},
    ],
    max_tokens=128,
    temperature=0.3,
    stream=True,
)
# Each chunk carries a delta; only some deltas contain text.
for chunk in stream:
    delta = chunk["choices"][0].get("delta", {})
    text = delta.get("content")
    if text:
        print(text, end="", flush=True)
- Streaming is useful for chat UIs, terminals, and server-sent event bridges.
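When the full reply is also needed after streaming (for logging or chat history), the deltas can be accumulated as they arrive. The sketch below uses a hand-written list in place of a real stream. The helper name collect_stream is hypothetical, and the fake chunks only mimic the shape the loop above relies on.

```python
def collect_stream(chunks):
    """Concatenate the content deltas from a chat-completion stream."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            parts.append(text)
    return "".join(parts)


# Stand-in for a real stream: a role-only chunk, two text chunks, a final chunk.
fake_stream = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "GPU "}}]},
    {"choices": [{"delta": {"content": "offload"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print(collect_stream(fake_stream))
```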
GPU Offload
- n_gpu_layers=-1 means full GPU offload when supported by the backend and hardware.
- You can also offload only the first N layers by passing a positive integer instead.
- Full offload example:
llm = Llama(
    model_path="models/mistral-7b-instruct-q5_k_m.gguf",
    n_ctx=8192,
    n_threads=8,
    n_gpu_layers=-1,
)
- If full offload fails, try a partial value like 20, 30, or 40.
- Partial offload is common on consumer GPUs with limited VRAM.
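The retry-with-fewer-layers advice can be automated. The sketch below tries a descending list of n_gpu_layers values and returns the first that loads. The helper name load_with_fallback and the candidate list are assumptions. It only helps when a failed load raises a Python exception; some backend failures may abort the process instead, which this cannot catch.

```python
def load_with_fallback(make_llm, layer_options=(-1, 40, 30, 20, 0)):
    """Try each n_gpu_layers value in order; return (llm, n_gpu_layers)."""
    last_error = None
    for n_gpu_layers in layer_options:
        try:
            return make_llm(n_gpu_layers), n_gpu_layers
        except Exception as exc:  # broad on purpose: error types vary by backend
            last_error = exc
    raise RuntimeError("no n_gpu_layers option worked") from last_error


# Example call (requires llama-cpp-python and a local model file):
# from llama_cpp import Llama
# llm, used = load_with_fallback(
#     lambda n: Llama(model_path="models/mistral-7b-instruct-q5_k_m.gguf",
#                     n_gpu_layers=n)
# )
```

Passing a factory function keeps the helper independent of the other constructor arguments.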
CPU-Only Example
from llama_cpp import Llama

llm = Llama(
    model_path="models/phi-3-mini-q4_k_m.gguf",
    n_ctx=4096,
    n_threads=12,
    n_gpu_layers=0,
)
- CPU-only mode is viable for smaller models and latency-tolerant tasks.
- It is a good fit for offline assistants, batch summarization, and test environments.
Context Size
- n_ctx controls the context window in tokens.
- Larger values increase RAM or VRAM usage.
- The effective maximum depends on the model architecture, quantization, and rope scaling setup.
- Do not assume every GGUF file supports the same long context as the original FP16 checkpoint.
Context Sizing Rule of Thumb
- Start with 4096 or 8192.
- Increase only after verifying memory headroom and prompt quality.
- Very large n_ctx values can degrade throughput significantly.
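Much of the memory cost of a larger n_ctx comes from the key/value cache, which grows linearly with context length. The sketch below is a rough estimate, not a llama.cpp API. It assumes a 16-bit cache and uses shape values assumed for an 8B Llama-style model with grouped-query attention (32 layers, 8 KV heads, head dimension 128); read the real values from the model card or GGUF metadata.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """Estimate KV cache size: keys and values for every layer and position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element


# Shape values are assumptions for an 8B Llama-style model.
for n_ctx in (4096, 8192, 32768):
    size = kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"n_ctx={n_ctx}: {size / 1024**3:.2f} GiB")
```

This is on top of the model weights, so quadrupling n_ctx quadruples the cache while the weight footprint stays fixed.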
OpenAI-Compatible Server
- llama-cpp-python includes a server mode (install the server extra first with pip install 'llama-cpp-python[server]'):
python -m llama_cpp.server --model model.gguf
- More realistic example:
python -m llama_cpp.server \
  --model models/llama-3.1-8b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --n_ctx 8192
- This is useful for local OpenAI-style integrations, prototypes, and thin service wrappers.
- It is not as throughput-optimized as vLLM, but it is easy to run and distribute.
OpenAI Client Compatibility
- Many local clients can target the server with an OpenAI-compatible base URL:
from openai import OpenAI

# A placeholder key is enough unless the server was started with an API key.
client = OpenAI(
    api_key="dummy",
    base_url="http://localhost:8000/v1",
)

resp = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Summarize Q4_K_M vs Q8_0."},
    ],
    max_tokens=128,
)
print(resp.choices[0].message.content)
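If the openai package is not available, the same endpoint can be called with the standard library alone. The sketch below assumes the server from the previous section is listening on localhost:8000. The helper names build_chat_request and send are hypothetical, and nothing is sent until send() is called.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, model="local-model", max_tokens=128):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# req = build_chat_request("http://localhost:8000/v1",
#                          [{"role": "user", "content": "Hello"}])
# print(send(req))
```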
Build From Source
Build from source when:
- you need CUDA acceleration not available in your wheel
- you want Metal on macOS
- you need a specific compiler or backend flag
- you are packaging for a controlled deployment target
Source builds take more effort but often deliver better hardware utilization.
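With llama-cpp-python, a backend is typically selected by setting the CMAKE_ARGS environment variable before running pip. The sketch below only assembles the command and environment. The flag names are assumptions taken from recent llama-cpp-python releases (older releases used different names), and the helper name source_install_command is hypothetical. Confirm the flags against the README for your version before running it.

```python
import os
import sys

# CMake flags assumed from recent llama-cpp-python releases; older releases
# used different names, so confirm against the README for your version.
BACKEND_CMAKE_ARGS = {
    "cuda": "-DGGML_CUDA=on",
    "metal": "-DGGML_METAL=on",
    "vulkan": "-DGGML_VULKAN=on",
}


def source_install_command(backend):
    """Return (argv, env) for a forced source build with one backend enabled."""
    env = dict(os.environ, CMAKE_ARGS=BACKEND_CMAKE_ARGS[backend])
    argv = [
        sys.executable, "-m", "pip", "install",
        "--no-cache-dir", "--force-reinstall", "llama-cpp-python",
    ]
    return argv, env


# To actually run the build:
# import subprocess
# argv, env = source_install_command("cuda")
# subprocess.run(argv, env=env, check=True)
```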
Model Families Commonly Used With GGUF
- Llama
- Qwen
- Mistral
- Phi
- Gemma
Check the prompt format and tokenizer notes for each family before deploying.
Operational Tips
- Keep a naming convention that encodes model, size, and quantization.
- Store models outside the repo if they are large.
- Benchmark both prompt evaluation speed and generation speed.
- Use the smallest model that meets quality targets.
- Prefer chat-tuned checkpoints for agentic or assistant workloads.
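One simple way to separate the two speeds is to time a streamed response: the wait before the first chunk approximates prompt evaluation, and the chunk rate afterwards approximates generation speed. The sketch below is a rough measurement. It assumes one chunk is roughly one token, and the helper name measure_stream is hypothetical.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Return (seconds_to_first_chunk, chunks_per_second_after_first)."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival time of the first chunk
        count += 1
    end = clock()
    if first is None:
        return None, 0.0  # the stream produced nothing
    decode_time = end - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return first - start, rate
```

Pass it the iterator returned by a call with stream=True, and repeat the run a few times, since the first call often includes one-time warm-up costs.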
Common Errors
Model will not load:
- verify model_path
- verify the file is a GGUF checkpoint
- verify the quantization is supported by your build
Very slow generation:
- increase n_threads
- enable GPU offload
- reduce n_ctx
- use a smaller model
Out-of-memory:
- reduce n_ctx
- choose Q4_K_M instead of Q8_0
- reduce n_gpu_layers or use CPU-only mode
Bad chat formatting:
- use create_chat_completion
- verify the checkpoint is instruct-tuned
- check whether the model expects a specific chat template
When To Use This Skill
- You need fully local inference with minimal infrastructure.
- You want a portable inference path for laptops or edge devices.
- You are testing GGUF quantizations before wider deployment.
- You need CPU inference or partial GPU offload instead of a GPU-only server.
Quick Reference
- Install: pip install llama-cpp-python
- Source build: use when you need custom acceleration
- Load model: from llama_cpp import Llama
- Key params: model_path, n_gpu_layers, n_ctx, n_threads
- Generate: llm("prompt", max_tokens=256, temperature=0.7)
- Chat: llm.create_chat_completion(messages=[...])
- Server: python -m llama_cpp.server --model model.gguf
- Best balance quant: Q4_K_M
- Full GPU offload: n_gpu_layers=-1