llm-deployment

star 3

LLM deployment and serving — vLLM, Ollama, TGI, llama.cpp. Model quantization, GPU optimization, API serving

oyi77 By oyi77 schedule Updated 6/8/2026

name: llm-deployment description: LLM deployment and serving — vLLM, Ollama, TGI, llama.cpp. Model quantization, GPU optimization, API serving domain: core

Overview

Running LLMs in production — local inference with Ollama/llama.cpp to high-throughput serving with vLLM/TGI. Quantization, GPU optimization, OpenAI-compatible APIs.

Capabilities

  • Local deployment (Ollama, llama.cpp, LM Studio)
  • High-throughput serving (vLLM, TGI)
  • Quantization (GGUF, GPTQ, AWQ)
  • GPU memory optimization (FlashAttention, PagedAttention)
  • OpenAI-compatible API endpoints

When to Use

  • Self-hosted LLM for privacy/cost
  • High-throughput API serving
  • Running on consumer GPUs (24GB or less)

When NOT to Use

  • Task is outside your authorization scope
  • You need to implement controls (use implementing-* skills)
  • Task is about analysis, not action (use analyzing-* skills)
  • You don't have access to target systems
  • Task requires compliance expertise (consult professionals)
  • Task is about defense, not offense (use defensive skills)

Pseudo Code

# Example workflow for this skill
def execute(input_data):
    # Step 1: Validate input
    if not input_data:
        raise ValueError("Input data is required")

    # Step 2: Process core logic
    result = process(input_data)

    # Step 3: Validate output
    validate_output(result)

    return result

Ollama

ollama pull llama3.1:8b && ollama serve
curl http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Hello!","stream":false}'

vLLM

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2
# Use: OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

llama.cpp

cmake -B build -DGGML_METAL=ON && cmake --build build -j
./build/bin/llama-server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

Quantize

# GPTQ
from transformers import AutoModelForCausalLM, GPTQConfig
model = AutoModelForCausalLM.from_pretrained("model", quantization_config=GPTQConfig(bits=4))

Common Patterns

  • Benchmark: measure tokens/sec, latency p50/p99
  • Multi-GPU: tensor-parallel-size = GPU count
  • Memory: gpu-memory-utilization=0.9, max-model-len for context

How to Use

  1. Invoke the skill when relevant domain keywords appear in the request
  2. Provide required inputs as specified in the skill definition
  3. Review the output for correctness before delivering to the user
  4. Combine with related skills for complex multi-step workflows

Verification

After completing this skill, confirm:

  • Output meets the defined quality and completeness requirements
  • All prerequisites are verified and documented
  • Error handling covers edge cases
  • Results are accurate and actionable
Install via CLI
npx skills add https://github.com/oyi77/1ai-skills --skill llm-deployment
Repository Details
star Stars 3
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator