llm-deployment - SKILL.md Agent Skill

name: llm-deployment description: LLM deployment and serving — vLLM, Ollama, TGI, llama.cpp. Model quantization, GPU optimization, API serving domain: core

Overview

Running LLMs in production — local inference with Ollama/llama.cpp to high-throughput serving with vLLM/TGI. Quantization, GPU optimization, OpenAI-compatible APIs.

Capabilities

Local deployment (Ollama, llama.cpp, LM Studio)
High-throughput serving (vLLM, TGI)
Quantization (GGUF, GPTQ, AWQ)
GPU memory optimization (FlashAttention, PagedAttention)
OpenAI-compatible API endpoints

When to Use

Self-hosted LLM for privacy/cost
High-throughput API serving
Running on consumer GPUs (24GB or less)

When NOT to Use

Task is outside your authorization scope
You need to implement controls (use implementing-* skills)
Task is about analysis, not action (use analyzing-* skills)
You don't have access to target systems
Task requires compliance expertise (consult professionals)
Task is about defense, not offense (use defensive skills)

Pseudo Code

# Example workflow for this skill
def execute(input_data):
    # Step 1: Validate input
    if not input_data:
        raise ValueError("Input data is required")

    # Step 2: Process core logic
    result = process(input_data)

    # Step 3: Validate output
    validate_output(result)

    return result

Ollama

ollama pull llama3.1:8b && ollama serve
curl http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Hello!","stream":false}'

vLLM

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2
# Use: OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

llama.cpp

cmake -B build -DGGML_METAL=ON && cmake --build build -j
./build/bin/llama-server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

Quantize

# GPTQ
from transformers import AutoModelForCausalLM, GPTQConfig
model = AutoModelForCausalLM.from_pretrained("model", quantization_config=GPTQConfig(bits=4))

Common Patterns

Benchmark: measure tokens/sec, latency p50/p99
Multi-GPU: tensor-parallel-size = GPU count
Memory: gpu-memory-utilization=0.9, max-model-len for context

How to Use

Invoke the skill when relevant domain keywords appear in the request
Provide required inputs as specified in the skill definition
Review the output for correctness before delivering to the user
Combine with related skills for complex multi-step workflows

Verification

After completing this skill, confirm:

Output meets the defined quality and completeness requirements
All prerequisites are verified and documented
Error handling covers edge cases
Results are accurate and actionable