name: llama-cpp-1-0-0
description: C/C++ LLM inference library with GGUF support, quantization, GPU acceleration (CUDA/Metal/HIP/Vulkan/SYCL), OpenAI-compatible server, and speculative decoding. Use when building local LLM inference applications, deploying models on edge devices, creating OpenAI-compatible API servers, or working with GGUF models.
llama.cpp b8789
Overview
llama.cpp is a plain C/C++ implementation for LLM inference with minimal setup and state-of-the-art performance across a wide range of hardware — locally and in the cloud. It requires no external dependencies beyond system libraries, making it ideal for edge deployment, embedded systems, and resource-constrained environments.
Key capabilities:
- GGUF model format — native support for GGUF files with extensive quantization options
- Multi-backend GPU acceleration — CUDA (NVIDIA), Metal (Apple Silicon), HIP (AMD), Vulkan, SYCL (Intel), MUSA (Moore Threads), CANN (Ascend NPU)
- OpenAI-compatible HTTP server (llama-server) with full chat completions, embeddings, and tool calling support
- Multimodal input (image and audio) via libmtmd for vision-language models
- Speculative decoding — draft model and n-gram based acceleration
- Function calling — OpenAI-style tool use with native format handlers for Llama 3.x, Hermes, Qwen, Mistral, and more
- Grammar-constrained generation — GBNF grammars and JSON schema support for structured outputs
- CPU optimizations — ARM NEON, AVX/AVX2/AVX512/AMX, RVV/ZVFH, ZenDNN, Arm KleidiAI
When to Use
- Running LLM inference locally without cloud dependencies
- Deploying models on edge devices with limited resources
- Building OpenAI-compatible API endpoints for local model serving
- Implementing multimodal AI (vision + text) applications
- Quantizing large models for reduced memory footprint
- Creating structured output pipelines with grammar constraints
- Developing cross-platform AI applications (Windows, macOS, Linux, Android)
- Building tools that integrate with LLMs via the libllama C API
Core Tools
llama-cli — Interactive CLI for conversation mode, completion, and experimentation. Supports chat templates, multimodal input, speculative decoding, grammar constraints, and all sampling parameters.
# Local model file
llama-cli -m my_model.gguf
# Download from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
# Multimodal model with image
llama-cli -hf ggml-org/gemma-3-4b-it-GGUF --image photo.jpg
llama-server — Lightweight OpenAI-compatible HTTP server for serving LLMs. Supports chat completions, embeddings, reranking, tool calling, multimodal input, router mode (multiple models), and streaming.
# Start server on port 8080
llama-server -m model.gguf --port 8080
# With Hugging Face model
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
# Router mode (multiple models)
llama-server --models-dir ./my_models
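Streamed responses can be consumed with the standard library alone. The sketch below assumes the server follows the OpenAI server-sent-events convention (lines of the form "data: {json}", terminated by "data: [DONE]"), which this document does not spell out. The function name iter_stream_text and the sample lines are illustrative, hand-written to match that shape, and are not captured server output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming lines.

    `lines` is any iterable of already-decoded lines from the response body.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel in the OpenAI format
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Illustrative input only (assumed chunk shape, not real server output).
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints: Hello
```

In a real client, the lines would come from the HTTP response of a request sent with "stream": true in the JSON body.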
llama-quantize — Convert GGUF models between quantization formats. Supports Q4_K_M, Q8_0, IQ2_XS, and many other schemes.
# Quantize to Q4_K_M
llama-quantize input-f32.gguf output-Q4_K_M.gguf Q4_K_M
# With importance matrix for quality
llama-quantize --imatrix imatrix.gguf input-f16.gguf output-Q4_K_M.gguf Q4_K_M
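To get a feel for the memory savings before running a quantization, a back-of-the-envelope estimate is often enough. The sketch below is an approximation under stated assumptions: it only covers formats with a fixed bits-per-weight figure (F32, F16, Q8_0), ignores file metadata, and uses a hypothetical 7B parameter count. Mixed-precision schemes such as Q4_K_M vary by model, so pass a measured bits-per-weight value for those rather than a constant.

```python
# Bits per weight for formats with a fixed layout.
# Q8_0 stores 32 int8 weights plus one 16-bit scale per block:
# (32 * 8 + 16) / 32 = 8.5 bits per weight.
BITS_PER_WEIGHT = {"F32": 32.0, "F16": 16.0, "Q8_0": 8.5}


def estimate_size_gib(n_params, bits_per_weight):
    """Rough tensor-data size in GiB.

    Ignores metadata and any tensors that a scheme keeps at higher precision.
    """
    return n_params * bits_per_weight / 8 / 2**30


n_params = 7_000_000_000  # hypothetical 7B-parameter model
for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: {estimate_size_gib(n_params, bpw):.1f} GiB")
```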
llama-perplexity — Measure model perplexity (quality metric) over text files.
llama-perplexity -m model.gguf -f test-corpus.txt
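For readers unfamiliar with the metric, perplexity is the exponential of the negative mean log-probability the model assigns to the observed tokens, so lower is better. The sketch below shows the textbook definition only; it is not the tool's implementation, which evaluates the corpus in chunks, and the function name is illustrative.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability of the observed tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 1/4 to every token has perplexity of about 4:
# it is as uncertain as a uniform choice among four options.
print(perplexity([math.log(0.25)] * 8))
```

Comparing this number for the same corpus before and after quantization gives a simple measure of how much quality a quantization scheme costs.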
llama-bench — Benchmark inference performance across various parameters.
llama-bench -m model.gguf
Installation / Setup
Pre-built Packages
- Homebrew (macOS/Linux): brew install llama.cpp
- Winget (Windows): winget install llama.cpp
- MacPorts (macOS): sudo port install llama.cpp
- Nix (macOS/Linux): nix profile install nixpkgs#llama-cpp
Docker
Official images are available on GitHub Container Registry:
# Full image with conversion tools
docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:full --run -m /models/model.gguf
# CUDA variant
docker run --gpus all -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:full-cuda --run -m /models/model.gguf -ngl 99
# Server image
docker run -v /path/to/models:/models -p 8080:8080 ghcr.io/ggml-org/llama.cpp:server -m /models/model.gguf --port 8080
Build from Source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# CPU-only build
cmake -B build
cmake --build build --config Release
# CUDA build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Metal (macOS, enabled by default)
cmake -B build
cmake --build build --config Release
Usage Examples
Interactive Chat
# Auto-detects chat template and enters conversation mode
llama-cli -m model.gguf -cnv
# With custom chat template
llama-cli -m model.gguf -cnv --chat-template chatml
# With system prompt
llama-cli -m model.gguf -sys "You are a helpful coding assistant."
Constrained Output with Grammar
# JSON output via grammar file
llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf \
-p 'Request: schedule a call at 8pm; Command:'
# JSON schema (auto-converted to grammar)
llama-cli -m model.gguf -j '{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"integer"}}}' \
-p 'Generate a person object.'
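When the schema lives in application code, building the argument list programmatically avoids shell-quoting mistakes. The sketch below reuses only the flags shown above (-m, -j, -p); the helper names are hypothetical, and the shape check is a hand-rolled test for this one schema, not a general JSON Schema validator.

```python
import json
import shlex

# Same schema as in the command above, as a Python dict.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
}


def build_llama_cli_args(model_path, schema, prompt):
    """Return an argv list; the schema is serialized compactly for the -j flag."""
    return [
        "llama-cli",
        "-m", model_path,
        "-j", json.dumps(schema, separators=(",", ":")),
        "-p", prompt,
    ]


def matches_person_shape(text):
    """Check generated text against the two typed fields of the schema above."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    name_ok = "name" not in obj or isinstance(obj["name"], str)
    # bool is a subclass of int in Python, so exclude it explicitly.
    age_ok = "age" not in obj or (
        isinstance(obj["age"], int) and not isinstance(obj["age"], bool)
    )
    return name_ok and age_ok


args = build_llama_cli_args("model.gguf", schema, "Generate a person object.")
print(shlex.join(args))  # shell-quoted form, suitable for copy-paste
```

The list returned by build_llama_cli_args can be passed directly to subprocess.run, which sidesteps the shell entirely.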
OpenAI-Compatible API
# Start server
llama-server -m model.gguf --port 8080
# Chat completion via curl
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [
{"role": "user", "content": "What is 2+2?"}
]
}'
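The same endpoint also accepts OpenAI-style tool definitions. The sketch below builds such a request with the standard library and extracts tool calls from a response, assuming the server returns them in the OpenAI "tool_calls" shape. The get_weather tool, the helper names, and the base URL are illustrative assumptions; depending on the build, the server may need additional flags for tool calling, which this document does not specify.

```python
import json
import urllib.request

# Assumption: llama-server is listening here (matches the --port 8080 example).
BASE_URL = "http://localhost:8080/v1"

# Hypothetical tool in the OpenAI "tools" format; get_weather is made up.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt, tools, base_url=BASE_URL):
    """Build (but do not send) a chat completion request that offers tools."""
    body = {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-style response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI format, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


request = build_tool_request("What is the weather in Paris?", [WEATHER_TOOL])
# To send it against a running server: urllib.request.urlopen(request)
```

The application is responsible for executing each returned call and sending the result back as a follow-up message.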
# Python with openai library
python3 -c "
import openai
client = openai.OpenAI(base_url='http://localhost:8080/v1', api_key='sk-no-key-required')
resp = client.chat.completions.create(model='local', messages=[{'role':'user','content':'Hello'}])
print(resp.choices[0].message.content)
"
Embeddings
# Serve embedding model
llama-server -m embedding-model.gguf --embedding --pooling cls -ub 8192
# Query embeddings
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "hello world", "model": "embed"}'
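Embedding vectors are usually compared with cosine similarity. The sketch below assumes the response follows the OpenAI embeddings shape (a "data" list whose items carry an "embedding" vector and an "index"); the three-dimensional vectors are made-up placeholders, since real models return much longer vectors.

```python
import math


def embeddings_from_response(response):
    """Pull vectors out of an OpenAI-style embeddings response, ordered by index."""
    items = sorted(response["data"], key=lambda item: item.get("index", 0))
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Placeholder response with tiny vectors (assumed shape, not real server output).
response = {
    "data": [
        {"index": 1, "embedding": [0.0, 1.0, 0.0]},
        {"index": 0, "embedding": [1.0, 0.0, 0.0]},
    ]
}
first, second = embeddings_from_response(response)
print(cosine_similarity(first, second))  # orthogonal vectors, so 0.0
```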
Speculative Decoding
# Draft model (smaller model accelerates larger one)
llama-server -m large-model.gguf -md draft-model.gguf
# N-gram speculative decoding (no draft model needed)
llama-server -m model.gguf --spec-type ngram-simple --draft-max 64
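The idea behind n-gram speculation can be illustrated with a toy example: propose a continuation by copying whatever followed the most recent earlier occurrence of the current suffix, then let the main model keep only the tokens it agrees with. This is a simplified sketch of the general technique, not the algorithm llama.cpp implements; the function names and parameters are hypothetical.

```python
def draft_from_ngrams(tokens, n=2, max_draft=4):
    """Propose up to max_draft tokens by copying what followed the most
    recent earlier occurrence of the last n tokens."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan backwards, starting just before the suffix's own position.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + max_draft]
    return []


def count_accepted(draft, target_tokens):
    """Number of leading draft tokens that match the target model's own choices."""
    accepted = 0
    for drafted, wanted in zip(draft, target_tokens):
        if drafted != wanted:
            break
        accepted += 1
    return accepted


history = ["the", "cat", "sat", "on", "the", "cat"]
draft = draft_from_ngrams(history, n=2, max_draft=2)
print(draft)  # ['sat', 'on']
# Pretend the target model would have produced "sat", then "down".
print(count_accepted(draft, ["sat", "down"]))  # 1
```

Every accepted token is one the main model did not have to generate in its own sequential step, which is why repetitive text such as code or quoted passages tends to benefit most.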
Multimodal (Vision + Text)
# CLI with vision model
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
# Server with multimodal support
llama-server -hf ggml-org/gemma-3-4b-it-GGUF
# Local files
llama-server -m text-model.gguf --mmproj mmproj.gguf
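Once a multimodal server is running, images can be sent through the chat completions endpoint. The sketch below assumes the server accepts the OpenAI content-parts format with a base64 data URI in an "image_url" part; the helper name is hypothetical, and the three placeholder bytes stand in for a real image file.

```python
import base64
import json


def image_message(prompt, image_bytes, mime_type="image/jpeg"):
    """Build a user message with text plus an inline image (OpenAI content parts)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes; in practice read them with open("photo.jpg", "rb").read().
placeholder = b"\xff\xd8\xff"
body = {
    "model": "local",
    "messages": [image_message("Describe this image.", placeholder)],
}
payload = json.dumps(body)  # POST this to /v1/chat/completions
```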
Advanced Topics
Building and Backends: CUDA, Metal, HIP, Vulkan, SYCL, CANN, ZenDNN, KleidiAI, OpenVINO build configurations and runtime tuning → Building and Backends
Model Quantization: Quantization formats (Q4_K_M, Q8_0, IQ2_XS, etc.), importance matrices, quality vs size tradeoffs, perplexity measurement → Model Quantization
Server API Reference: OpenAI-compatible endpoints (/v1/chat/completions, /v1/embeddings, /v1/completions), router mode, multimodal, function calling, streaming → Server API Reference
Multimodal and Speculative Decoding: Vision/audio models, libmtmd integration, draft model decoding, n-gram speculation, performance tuning → Multimodal and Speculative Decoding
Grammar-Constrained Generation: GBNF grammar format, JSON schema to grammar conversion, structured output patterns, LLGuidance integration → Grammar-Constrained Generation
libllama C API: Core data structures (llama_model, llama_context), context management, batch decoding, sampler chains, LoRA adapters, KV cache → libllama C API