minicpm5-deploy-vllm


Serve MiniCPM5-1B via vLLM as an OpenAI-compatible HTTP server. Use when the user wants high-throughput production serving on NVIDIA GPU, asks for "vLLM", "OpenAI server", "REST API for MiniCPM5", or "production deployment".

By OpenBMB. Updated 5/31/2026


Deploy MiniCPM5-1B with vLLM

OpenAI-compatible server for the BF16 / FP16 MiniCPM5-1B checkpoint.

Required input

Var         Example              Default
MODEL_PATH  openbmb/MiniCPM5-1B  required
PORT        8000                 8000
GPU_ID      0                    0
CTX_LEN     131072 (128 K)       131072; lower if VRAM tight
MEM_FRAC    0.85                 0.85; lower on shared GPUs
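
The variables above map directly onto shell variables. A minimal sketch, assuming a POSIX-style shell; MODEL_PATH has no default in the table, so the table's example value is used here only as a placeholder:

```shell
# Values taken from the table above; set any of them in the environment
# beforehand to override.
# MODEL_PATH is required: the value below is only the table's example.
export MODEL_PATH="${MODEL_PATH:-openbmb/MiniCPM5-1B}"
export PORT="${PORT:-8000}"
export GPU_ID="${GPU_ID:-0}"
export CTX_LEN="${CTX_LEN:-131072}"    # 128 K; lower if VRAM is tight
export MEM_FRAC="${MEM_FRAC:-0.85}"    # lower on shared GPUs
```

With these exported, the launch and validation commands in the steps below can be pasted unchanged.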

Steps

1. Install (once)

pip install "vllm>=0.21"          # latest (CUDA 13.x driver hosts)
# pip install "vllm==0.10.1.1"    # fallback for CUDA 12.x driver hosts

2. Launch

CUDA_VISIBLE_DEVICES=${GPU_ID} vllm serve "${MODEL_PATH}" \
    --served-model-name MiniCPM5-1B \
    --dtype bfloat16 \
    --max-model-len ${CTX_LEN} \
    --gpu-memory-utilization ${MEM_FRAC} \
    --port ${PORT}

Wait for "Application startup complete" in the log.
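
In a script, watching the log by eye can be replaced with a readiness poll. The function below is a sketch, not part of the repo: wait_for_vllm is a hypothetical helper name, and it assumes the server answers GET /health once it is up (if your vLLM version does not, /v1/models is a possible substitute).

```shell
# Poll the server once per second until it responds or the deadline passes.
# Usage: wait_for_vllm PORT TIMEOUT_SECONDS
# Assumption: GET /health returns 2xx once startup is complete.
wait_for_vllm() {
    port="${1:-8000}"
    timeout_s="${2:-300}"
    waited=0
    while [ "$waited" -lt "$timeout_s" ]; do
        # -f: treat non-2xx as failure; -s: silent; --max-time: never hang
        if curl -fs --max-time 2 -o /dev/null "http://localhost:${port}/health"; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

For example, `wait_for_vllm "${PORT}" 300 && echo "server ready"` blocks for up to about five minutes before giving up.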

3. Validate

curl http://localhost:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MiniCPM5-1B",
        "messages": [{"role": "user", "content": "1+1=?"}],
        "temperature": 0.7, "top_p": 0.95, "max_tokens": 64,
        "chat_template_kwargs": {"enable_thinking": false}
    }'

Expected: choices[0].message.content contains "2". If the output contains <think>...</think>, the request is missing "chat_template_kwargs": {"enable_thinking": false}.

Sampling defaults

{"temperature": 0.9, "top_p": 0.95, "chat_template_kwargs": {"enable_thinking": true}}    // think
{"temperature": 0.7, "top_p": 0.95, "chat_template_kwargs": {"enable_thinking": false}}   // nothink
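
The two presets differ only in temperature and the enable_thinking flag, so a request body can be generated per mode. A minimal sketch: build_payload is a hypothetical helper, and it does no JSON escaping, so it assumes the prompt contains no double quotes, backslashes, or newlines.

```shell
# Print a chat-completions request body for mode "think" or "nothink".
# Sampling values are the defaults listed above.
# Assumption: the prompt needs no JSON escaping (no quotes, backslashes, newlines).
build_payload() {
    mode="$1"
    prompt="$2"
    case "$mode" in
        think)   temperature=0.9; thinking=true ;;
        nothink) temperature=0.7; thinking=false ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    printf '{"model": "MiniCPM5-1B", "messages": [{"role": "user", "content": "%s"}], "temperature": %s, "top_p": 0.95, "chat_template_kwargs": {"enable_thinking": %s}}\n' \
        "$prompt" "$temperature" "$thinking"
}
```

The output can be passed straight to curl, for example `-d "$(build_payload nothink '1+1=?')"`.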

Common pitfalls

  • Hard error at startup when the GPU's free / total memory ratio is below MEM_FRAC: lower --gpu-memory-utilization (e.g. 0.5 on a shared GPU).
  • OOM at startup with 128 K: drop --max-model-len to 32768 or 8192.
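
The first pitfall can be checked before launching. The sketch below compares a free / total ratio against a requested fraction; mem_frac_fits is a hypothetical helper, and the numbers in the comment are illustrative rather than measured.

```shell
# Succeed (exit 0) when free / total >= the requested fraction.
# Usage: mem_frac_fits FREE_MIB TOTAL_MIB FRACTION
mem_frac_fits() {
    awk -v free="$1" -v total="$2" -v frac="$3" \
        'BEGIN { exit !(total > 0 && free / total >= frac) }'
}

# On a host with an NVIDIA driver, the two inputs can be read with:
#   nvidia-smi --query-gpu=memory.free,memory.total \
#              --format=csv,noheader,nounits -i "${GPU_ID}"
# which prints a line such as "20000, 24576" (illustrative values, in MiB).
```

If `mem_frac_fits 20000 24576 "${MEM_FRAC}"` fails, pick a lower MEM_FRAC before starting the server.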

Tool calling (plugin)

The vLLM-side MiniCPM5 XML parser (PR #43175) was merged into main on 2026-05-27 but is not yet in any pip release (v0.22.0 was cut before the merge). Until it is, use the bridge plugin shipped at tool_parsers/minicpm5xml_tool_parser.py in this repo:

vllm serve "${MODEL_PATH}" \
    --served-model-name MiniCPM5-1B \
    --dtype bfloat16 --max-model-len ${CTX_LEN} --port ${PORT} \
    --enable-auto-tool-choice \
    --tool-parser-plugin /path/to/MiniCPM/tool_parsers/minicpm5xml_tool_parser.py \
    --tool-call-parser minicpm5

Drop --tool-parser-plugin once vLLM ships a release containing the parser natively.
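
To exercise the parser, a request can carry a tool definition in the standard OpenAI "tools" format. This is a sketch only: get_weather and its schema are hypothetical and not part of the repo, and send_tool_request is an illustrative helper that is defined but not called.

```shell
# Hypothetical tool definition used purely for illustration.
TOOL_PAYLOAD='{
    "model": "MiniCPM5-1B",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            }
        }
    }],
    "tool_choice": "auto"
}'

# Send the payload to the server launched with the tool-calling flags above.
send_tool_request() {
    curl -s "http://localhost:${PORT:-8000}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d "$TOOL_PAYLOAD"
}
```

If the parser is active, the reply would be expected to carry the call under choices[0].message.tool_calls rather than as raw XML in content; treat that as something to verify against your own server.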

When NOT to use

  • One-shot Python script → minicpm5-deploy-transformers
  • Apple Silicon / no NVIDIA GPU → minicpm5-deploy-llama-cpp / minicpm5-deploy-mlx
  • High-concurrency batch eval with prefix cache or tool calling → minicpm5-deploy-sglang

Reference

docs/deployment/vllm.md

Install via CLI
npx skills add https://github.com/OpenBMB/MiniCPM --skill minicpm5-deploy-vllm