name: my-llm-api
version: 1.1.0
description: >
  Two-surface LLM primitive. Raw chat completion against self-hosted open-source models (you pick the model), and objective verbs (classify / extract / summarize / draft) that hide the model behind a task. Pricing in cents per 1M tokens; charged from your MyAPI balance.
triggers: [llm, completion, chat, embed, embedding, inference, classify, extract, summarize, draft, qwen]
checksum: sha256-pending
MyLLMAPI
A two-surface LLM gateway:
- Raw: `complete` / `embed` / `models`. You pick a self-hosted catalog model (today: `Qwen/Qwen3-Coder-30B-A3B-Instruct`). Proprietary models are not callable here.
- Verbs: `classify` / `extract` / `summarize` / `draft`. You ask for a task to be done; the model is an implementation detail and is never named in the response.
Pricing is in cents per 1M tokens at the actual upstream rate; cost is debited from your MyAPI balance.
Capabilities
Use this for workflow tasks — summarize a doc, classify an inbound email, extract structured data, draft a reply. The reply goes to stdout; a one-line usage footer (tokens + cost + finish reason) goes to stderr, so myapi llm complete ... | jq and similar pipelines work as expected.
Don't use this as your own model. If you're an agent reading this, you already have a more capable model than what's exposed here. Reach for the LLM verbs when you're scripting a recurring step where a small/cheap model is the right tool — not for one-shot reasoning you can just do yourself.
Raw vs. verbs
- Raw `complete`: full control. Pick the model, build the `messages` array, set `max_tokens` / `temperature` / `stop`. Use when the exact prompt and output shape matters.
- Verbs: you want a result (label, JSON, summary, draft). Use when you don't care which model runs underneath.
Raw complete request
{
  "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "messages": [
    { "role": "system", "content": "You are a terse editor." },
    { "role": "user", "content": "Tighten this paragraph: ..." }
  ],
  "max_tokens": 200,
  "temperature": 0.2,
  "stop": ["\n\n"]
}
Roles: system | user | assistant. Multiple system messages collapse to one instruction. max_tokens, temperature, and stop are optional — per-model defaults apply.
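As a minimal sketch of the request shape above, the helper below assembles a raw complete body in Python and leaves out any option that was not set, so the per-model defaults apply. The function name `build_complete_request` is hypothetical and not part of any MyAPI SDK; only the field names come from the request example.

```python
import json
from typing import Optional, Sequence


def build_complete_request(
    model: str,
    user: str,
    system: Optional[str] = None,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    stop: Optional[Sequence[str]] = None,
) -> dict:
    """Assemble a raw complete body; unset options are omitted (hypothetical helper)."""
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user})

    body = {"model": model, "messages": messages}
    # Compare against None so that an explicit temperature of 0.0 is still sent.
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    if temperature is not None:
        body["temperature"] = temperature
    if stop:
        body["stop"] = list(stop)
    return body


example = build_complete_request(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "Tighten this paragraph: ...",
    system="You are a terse editor.",
    temperature=0.2,
)
print(json.dumps(example, indent=2))
```

Omitting a key rather than sending `null` is a design choice in this sketch; the document does not say how the server treats explicit nulls.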
Raw complete response
{
  "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "content": "...assistant reply...",
  "finish_reason": "stop",
  "usage": { "input_tokens": 42, "output_tokens": 87, "cost_cents": 0.029 }
}
finish_reason is one of stop (normal), length (hit max_tokens), filter (blocked).
If model isn't in the self-hosted catalog the server returns MODEL_NOT_IN_RAW_CATALOG — that's the signal to use a verb instead, not to retry with a different --model.
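One way a caller might act on `finish_reason` is sketched below. It takes an already-parsed response dict with the fields shown above. The helper name `reply_or_raise` and the choice to raise `ValueError` are assumptions made for illustration, not behaviour defined by MyAPI.

```python
def reply_or_raise(resp: dict) -> str:
    """Return the reply text, or raise if the reply was cut off or blocked.

    Hypothetical helper: it only inspects the documented finish_reason values.
    """
    reason = resp.get("finish_reason")
    if reason == "stop":
        return resp["content"]
    if reason == "length":
        # The model hit max_tokens before finishing.
        raise ValueError("reply truncated: raise max_tokens or shorten the prompt")
    if reason == "filter":
        raise ValueError("reply blocked by the content filter")
    raise ValueError(f"unexpected finish_reason: {reason!r}")


sample = {
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "content": "...assistant reply...",
    "finish_reason": "stop",
    "usage": {"input_tokens": 42, "output_tokens": 87, "cost_cents": 0.029},
}
print(reply_or_raise(sample))
```

Treating `length` as an error suits pipelines that need a complete answer; a caller that can use partial text could return `resp["content"]` in that branch instead.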
Raw embed
No embedding model is served on the raw catalog today; calls return EMBED_NOT_AVAILABLE. Use a dedicated embedding API for now.
Model catalog
- Chat models: `id`, `kind: 'chat'`, `context_window`, `input_cost_per_1m_cents`, `output_cost_per_1m_cents`
- Embed models: `id`, `kind: 'embed'`, `dimensions`, `input_cost_per_1m_cents`
The catalog is live — it reflects what the inference gateway actually serves, refreshed every 15 minutes. Always query models rather than hard-coding ids.
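To illustrate how the catalog fields combine with the cents-per-1M pricing, here is a small sketch that picks the cheapest chat model from a list of catalog entries and estimates the cost of a call. The sample entries, ids and prices are invented for the example, and both function names are hypothetical; only the field names come from the list above.

```python
from __future__ import annotations


def cheapest_chat_model(models: list[dict], min_context: int = 0) -> dict:
    """Pick the chat model with the lowest combined input + output price."""
    chat = [
        m for m in models
        if m["kind"] == "chat" and m["context_window"] >= min_context
    ]
    if not chat:
        raise LookupError("no chat model satisfies the context requirement")
    return min(
        chat,
        key=lambda m: m["input_cost_per_1m_cents"] + m["output_cost_per_1m_cents"],
    )


def estimate_cost_cents(model: dict, input_tokens: int, output_tokens: int) -> float:
    """Prices are cents per 1M tokens, so scale token counts by 1e-6."""
    total = (
        input_tokens * model["input_cost_per_1m_cents"]
        + output_tokens * model["output_cost_per_1m_cents"]
    )
    return total / 1_000_000


# Invented catalog entries, for illustration only.
sample_catalog = [
    {"id": "example/chat-small", "kind": "chat", "context_window": 32_000,
     "input_cost_per_1m_cents": 10, "output_cost_per_1m_cents": 30},
    {"id": "example/chat-large", "kind": "chat", "context_window": 128_000,
     "input_cost_per_1m_cents": 40, "output_cost_per_1m_cents": 120},
]

choice = cheapest_chat_model(sample_catalog)
print(choice["id"], estimate_cost_cents(choice, input_tokens=42, output_tokens=87))
```

The estimate is only a pre-flight guess; the `usage.cost_cents` value in the response remains the figure to bill against.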
Verb requests + responses
Every verb takes an optional tier (fast | reasoning | cheap): an opaque routing hint, with the server picking the model. The response usage block is identical across verbs.
| Verb | Request body | Response data |
|---|---|---|
| `classify` | `{ input, labels: string[], multi?: boolean, tier? }` | `{ label }` or `{ labels: string[] }` (when `multi`) |
| `extract` | `{ input, schema: <json-schema>, tier? }` | `{ data: <object conforming to schema> }` |
| `summarize` | `{ input, style?: 'brief'\|'exec'\|'bullet', tier? }` | `{ summary }` |
| `draft` | `{ input?, kind: string, context?: object, prompt?: string, tier? }` | `{ text }` |
Shared usage block on every verb:
{ "tier_used": "fast", "tokens_in": 65, "tokens_out": 37, "cost_cents": 0.005 }
The model/provider is never named in the verb response. That's the point — the verb is the contract, the model is implementation.
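The verb contract can be sketched from the caller's side as follows: one helper builds a `classify` body with the optional `multi` and `tier` fields, and another normalises the two possible response shapes (`{ label }` or `{ labels }`) into a list. Both helper names are hypothetical, and the client-side tier check is an extra precaution rather than documented server behaviour.

```python
from typing import Optional, Sequence

TIERS = {"fast", "reasoning", "cheap"}


def build_classify_body(
    text: str,
    labels: Sequence[str],
    multi: bool = False,
    tier: Optional[str] = None,
) -> dict:
    """Build a classify request body (hypothetical helper)."""
    if not labels:
        raise ValueError("classify needs at least one label")
    if tier is not None and tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    body = {"input": text, "labels": list(labels)}
    if multi:
        body["multi"] = True
    if tier is not None:
        body["tier"] = tier
    return body


def labels_from_data(data: dict) -> list:
    """Normalise { label } and { labels } responses to a list of labels."""
    if "labels" in data:
        return list(data["labels"])
    return [data["label"]]


print(build_classify_body("I was charged twice.", ["billing", "technical"], tier="fast"))
print(labels_from_data({"label": "billing"}))
```

Normalising to a list lets downstream routing code treat single-label and multi-label calls the same way.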
OpenAI-compatible drop-in
POST /llm/orgs/{org_id}/chat/completions (and /v1/chat/completions alias) accepts the OpenAI request shape and returns the OpenAI response shape — no envelope. Same catalog and pricing as raw complete. Use it when existing OpenAI SDK / LangChain / any base_url-configurable tooling should point at MyAPI without rewriting.
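from openai import OpenAI

# Point the stock OpenAI client at the MyAPI compatibility endpoint.
client = OpenAI(
    api_key="hq_live_…",
    base_url="https://api.myapihq.com/llm/orgs/<org_id>/v1",
)
r = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Hi"}],
)
print(r.choices[0].message.content)  # reply text, in the standard OpenAI response shape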
Use raw complete (MyAPI shape) for first-party integrations; use the OpenAI-compat path for compatibility with existing client code.
Commands
| Command | What it does |
|---|---|
| `myapi llm models [--kind chat\|embed] [--json]` | List the live model catalog with pricing (cents/1M) |
| `myapi llm complete "<prompt>" [--model <id>] [--system "<s>"] [--max-tokens N] [--temperature 0..1] [--stop <csv>] [--file <path>] [--json]` | Raw chat completion; reply to stdout, usage to stderr. Defaults to the first chat model in the catalog |
| `myapi llm embed "<text>" --model <id> [--json]` | Embed a string (no model served today; returns EMBED_NOT_AVAILABLE) |
| `myapi llm classify "<input>" --labels <csv> [--multi] [--tier <t>] [--json]` | Pick a label from a set |
| `myapi llm extract "<input>" --schema <path\|json> [--tier <t>] [--json]` | Pull structured data conforming to a JSON Schema |
| `myapi llm summarize "<input>" [--style brief\|exec\|bullet] [--tier <t>] [--json]` | Summarize text |
| `myapi llm draft --kind <what> [--prompt "<s>"] [--context <json>] ["<src>"] [--tier <t>] [--json]` | Draft an email / reply / message / … |
Pass - as the prompt/input to read from stdin. Pass --file <path> to read longer content from disk.
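For scripted use from Python, the stdin convention can be sketched with `subprocess`: the input text is piped to the CLI and `-` is passed in place of the positional argument. The wrapper names `classify_argv` and `classify` are hypothetical. The flags are the ones listed in the table, and the `data.label` path mirrors the `jq` expression used in this document's webhook example.

```python
import json
import subprocess
from typing import Optional, Sequence


def classify_argv(
    labels: Sequence[str],
    multi: bool = False,
    tier: Optional[str] = None,
) -> list:
    """Build the argv for `myapi llm classify`, reading the input from stdin via '-'."""
    argv = ["myapi", "llm", "classify", "-", "--labels", ",".join(labels)]
    if multi:
        argv.append("--multi")
    if tier is not None:
        argv += ["--tier", tier]
    argv.append("--json")
    return argv


def classify(text: str, labels: Sequence[str]) -> str:
    """Run the CLI with the text on stdin and return the chosen label.

    Requires the myapi CLI on PATH; raises CalledProcessError on a non-zero exit.
    """
    proc = subprocess.run(
        classify_argv(labels),
        input=text,
        capture_output=True,  # keeps stdout (result) and stderr apart
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)["data"]["label"]


print(classify_argv(["support", "sales", "spam"]))
```

Passing the argv as a list avoids shell quoting problems when the input contains quotes or newlines.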
Examples
# List the live catalog
myapi llm models
myapi llm models --kind chat --json | jq '.models[].id'

# Raw completion: picks the first chat model from the catalog
myapi llm complete "Summarize in 12 words: $(cat README.md)"

# Pin a specific model
myapi llm complete "Refactor this function: ..." \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --system "You are a careful Go reviewer." \
  --max-tokens 600

# ── Verbs (recommended for workflow steps) ──────────────────────────────
myapi llm classify "I was charged twice, please refund." \
  --labels billing,technical,sales,spam

myapi llm extract "Acme Corp employs 250 people in Berlin." \
  --schema '{"type":"object","properties":{"company":{"type":"string"},"employees":{"type":"integer"}}}'

myapi llm summarize --file long-thread.txt --style bullet

myapi llm draft --kind email \
  --prompt "Friendly welcome, under 60 words." \
  --context '{"recipient":"a new signup","product":"MyAPI"}'

# Classify + route an inbound webhook delivery
BODY=$(myapi webhook delivery <id> --json | jq -r '.body')
INTENT=$(printf '%s' "$BODY" | myapi llm classify - \
  --labels support,sales,spam --json | jq -r '.data.label')
Notes
- `draft` context safety. `context` fields are quoted into the prompt verbatim; sensitive-named keys (`secret`, `api_key`, `password`, …) are NOT redacted. Two guards sit on top: (a) injection defense strips `instructions` / `system` / `prompt` / `override` keys and surfaces them in `meta.warnings`; (b) an output guardrail substring-scans fact values (length ≥ 4) in the response and lists matches in `meta.guardrails.facts_in_output` (a signal, not redaction). Rule of thumb: never put credentials, PII, or internal metadata in `context`; pass identifiers and reference them indirectly.
- Self-hosted raw, server-picked verbs. Raw runs on MyAPI's TPU; verbs route wherever the server picks.
- Cost + latency. `usage.cost_cents` is authoritative, with no markup. Qwen3-Coder-30B: 200–600 ms to first token, 1–3 s end-to-end.
- Live catalog, no streaming, no BYOK. Don't hard-code ids: `models` is the source of truth (the CLI auto-picks if `--model` is omitted). Responses are returned as a full reply only.
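Because sensitive-named keys in a `draft` context are not redacted server-side, a caller may want to drop them before sending. The sketch below is one possible client-side filter. The name `scrub_context` and the substring match on key names are assumptions for illustration, and the marker list only covers the three names this document mentions.

```python
from __future__ import annotations

# Key-name fragments treated as sensitive; extend to match your own conventions.
SENSITIVE_MARKERS = ("secret", "api_key", "password")


def scrub_context(context: dict) -> tuple[dict, list[str]]:
    """Return (clean_context, dropped_key_paths); nested dicts are scrubbed too."""
    clean: dict = {}
    dropped: list[str] = []
    for key, value in context.items():
        if any(marker in key.lower() for marker in SENSITIVE_MARKERS):
            dropped.append(key)
        elif isinstance(value, dict):
            nested, nested_dropped = scrub_context(value)
            clean[key] = nested
            dropped.extend(f"{key}.{name}" for name in nested_dropped)
        else:
            clean[key] = value
    return clean, dropped


clean, dropped = scrub_context({
    "recipient": "a new signup",
    "API_KEY": "not-a-real-key",
    "billing": {"plan": "pro", "password_hint": "pet name"},
})
print(clean)
print(dropped)
```

Filtering on key names does not catch secrets stored under innocuous keys, so the rule of thumb above (pass identifiers, not values) still applies.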