my-llm-api

name: my-llm-api
version: 1.1.0
description: >
  Two-surface LLM primitive. Raw chat completion against self-hosted
  open-source models (you pick the model), and objective verbs
  (classify / extract / summarize / draft) that hide the model behind a task.
  Pricing in cents per 1M tokens; charged from your MyAPI balance.
triggers: [llm, completion, chat, embed, embedding, inference, classify, extract, summarize, draft, qwen]
checksum: sha256-pending

MyLLMAPI

A two-surface LLM gateway:

Raw — complete / embed / models. You pick a self-hosted catalog model (today: Qwen/Qwen3-Coder-30B-A3B-Instruct). Proprietary models are not callable here.
Verbs — classify / extract / summarize / draft. You ask for a task done; the model is implementation detail and is never named in the response.

Pricing is in cents per 1M tokens at the actual upstream rate; cost is debited from your MyAPI balance.

Capabilities

Use this for workflow tasks: summarize a doc, classify an inbound email, extract structured data, draft a reply. The reply goes to stdout; a one-line usage footer (tokens, cost, finish reason) goes to stderr, so myapi llm complete ... | jq and similar pipelines receive only the reply on stdin.

Don't use this as your own model. If you're an agent reading this, you already have a more capable model than what's exposed here. Reach for the LLM verbs when you're scripting a recurring step where a small/cheap model is the right tool — not for one-shot reasoning you can just do yourself.

Raw vs. verbs

Raw complete — full control: pick the model, build the messages array, set max_tokens/temperature/stop. Use when shape matters.
Verbs — you want a result (label, JSON, summary, draft). Use when you don't care which model runs underneath.

Raw `complete` request

{
  "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "messages": [
    { "role": "system", "content": "You are a terse editor." },
    { "role": "user",   "content": "Tighten this paragraph: ..." }
  ],
  "max_tokens": 200,
  "temperature": 0.2,
  "stop": ["\n\n"]
}

Roles: system | user | assistant. Multiple system messages collapse to one instruction. max_tokens, temperature, and stop are optional — per-model defaults apply.
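A minimal sketch of assembling this request body in Python. `build_complete_request` is a hypothetical helper, not part of any MyAPI SDK. It rejects roles outside the three listed above and leaves unset optional fields out of the body so the per-model defaults apply; collapsing multiple system messages is left to the server.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def build_complete_request(model, messages, max_tokens=None, temperature=None, stop=None):
    """Build a raw `complete` body (hypothetical helper, for illustration only)."""
    for msg in messages:
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {msg.get('role')!r}")
    body = {"model": model, "messages": list(messages)}
    # Omit unset optional fields so the per-model defaults apply.
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    if temperature is not None:
        body["temperature"] = temperature
    if stop is not None:
        body["stop"] = list(stop)
    return body


request = build_complete_request(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    [
        {"role": "system", "content": "You are a terse editor."},
        {"role": "user", "content": "Tighten this paragraph: ..."},
    ],
    max_tokens=200,
)
print(json.dumps(request, indent=2))
```

Here only max_tokens is set, so temperature and stop are absent from the serialized body rather than sent as null.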

Raw `complete` response

{
  "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "content": "...assistant reply...",
  "finish_reason": "stop",
  "usage": { "input_tokens": 42, "output_tokens": 87, "cost_cents": 0.029 }
}

finish_reason is one of stop (normal), length (hit max_tokens), filter (blocked).

If model isn't in the self-hosted catalog the server returns MODEL_NOT_IN_RAW_CATALOG — that's the signal to use a verb instead, not to retry with a different --model.
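One way a caller might act on finish_reason, sketched in Python against the response shape shown above. `reply_or_raise` is a hypothetical name, and treating length as an error is a design choice rather than documented behavior; some callers may prefer to keep a truncated reply. Handling of MODEL_NOT_IN_RAW_CATALOG is left out because the error envelope is not shown in this document.

```python
def reply_or_raise(response):
    """Return the reply text, or raise if the reply was truncated or blocked."""
    reason = response["finish_reason"]
    if reason == "stop":
        return response["content"]
    if reason == "length":
        raise RuntimeError("reply hit max_tokens; raise the limit or shorten the prompt")
    if reason == "filter":
        raise RuntimeError("reply was blocked (finish_reason=filter)")
    raise RuntimeError(f"unexpected finish_reason: {reason!r}")


# Sample copied from the response shown above.
sample = {
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "content": "...assistant reply...",
    "finish_reason": "stop",
    "usage": {"input_tokens": 42, "output_tokens": 87, "cost_cents": 0.029},
}
print(reply_or_raise(sample))
```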

Raw `embed`

No embedding model is served on the raw catalog today; calls return EMBED_NOT_AVAILABLE. Use a dedicated embedding API for now.

Model catalog

Chat models: id, kind: 'chat', context_window, input_cost_per_1m_cents, output_cost_per_1m_cents
Embed models: id, kind: 'embed', dimensions, input_cost_per_1m_cents

The catalog is live — it reflects what the inference gateway actually serves, refreshed every 15 minutes. Always query models rather than hard-coding ids.
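A rough sketch of working with catalog entries in Python, assuming they arrive as a list of dicts carrying the fields listed above. The helper names are hypothetical, and the context window and prices in the sample entry are invented placeholders, not real rates. The arithmetic is only a pre-flight estimate; usage.cost_cents in the response remains the authoritative figure.

```python
def first_chat_model(catalog):
    """Return the first entry with kind == 'chat', or None if there is none."""
    for entry in catalog:
        if entry.get("kind") == "chat":
            return entry
    return None


def estimate_cost_cents(model, input_tokens, output_tokens=0):
    """Estimate cost in cents from the per-1M-token rates of a catalog entry."""
    input_cost = input_tokens * model["input_cost_per_1m_cents"] / 1_000_000
    # Embed entries carry no output rate, so default it to zero.
    output_cost = output_tokens * model.get("output_cost_per_1m_cents", 0) / 1_000_000
    return input_cost + output_cost


# Placeholder catalog: the numbers below are made up for the example.
catalog = [
    {
        "id": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
        "kind": "chat",
        "context_window": 32768,
        "input_cost_per_1m_cents": 10,
        "output_cost_per_1m_cents": 30,
    },
]

model = first_chat_model(catalog)
estimate = estimate_cost_cents(model, input_tokens=1_000_000, output_tokens=500_000)
print(model["id"], estimate)
```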

Verb requests + responses

Every verb takes an optional tier (fast | reasoning | cheap). It is an opaque routing hint; the server still picks the model. The response usage block is identical across verbs.

| Verb | Request body | Response `data` |
|---|---|---|
| `classify` | `{ input, labels: string[], multi?: boolean, tier? }` | `{ label }` or `{ labels: string[] }` (when `multi`) |
| `extract` | `{ input, schema: <json-schema>, tier? }` | `{ data: <object conforming to schema> }` |
| `summarize` | `{ input, style?: 'brief'\|'exec'\|'bullet', tier? }` | `{ summary }` |
| `draft` | `{ input?, kind: string, context?: object, prompt?: string, tier? }` | `{ text }` |

Shared usage block on every verb:

{ "tier_used": "fast", "tokens_in": 65, "tokens_out": 37, "cost_cents": 0.005 }

The model/provider is never named in the verb response. That's the point — the verb is the contract, the model is implementation.
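As an illustration of the verb contract, here is a small Python sketch for classify. Both helpers are hypothetical. The check that returned labels belong to the requested set is a defensive client-side step, not something this document says the server guarantees or requires. The sample response is hand-written, not real output.

```python
TIERS = {"fast", "reasoning", "cheap"}


def classify_body(text, labels, multi=False, tier=None):
    """Build a `classify` request body (hypothetical helper)."""
    if not labels:
        raise ValueError("labels must not be empty")
    if tier is not None and tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    body = {"input": text, "labels": list(labels)}
    if multi:
        body["multi"] = True
    if tier is not None:
        body["tier"] = tier
    return body


def labels_from_data(data, allowed):
    """Normalize `{label}` or `{labels}` to a list and reject unknown labels."""
    picked = data["labels"] if "labels" in data else [data["label"]]
    unknown = [label for label in picked if label not in allowed]
    if unknown:
        raise ValueError(f"labels outside the requested set: {unknown}")
    return picked


body = classify_body(
    "I was charged twice, please refund.",
    ["billing", "technical", "sales", "spam"],
    tier="fast",
)
picked = labels_from_data({"label": "billing"}, body["labels"])  # hand-written sample
print(body, picked)
```

Normalizing both response shapes to a list lets downstream routing code ignore whether multi was set.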

OpenAI-compatible drop-in

POST /llm/orgs/{org_id}/chat/completions (and the /llm/orgs/{org_id}/v1/chat/completions alias, which is the path an OpenAI SDK builds from a base_url ending in /v1) accepts the OpenAI request shape and returns the OpenAI response shape, with no envelope. It uses the same catalog and pricing as raw complete. Use it when an existing OpenAI SDK, LangChain, or any other base_url-configurable tool should point at MyAPI without a rewrite.

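from openai import OpenAI

# Point the stock OpenAI client at the MyAPI gateway instead of api.openai.com.
client = OpenAI(
    api_key="hq_live_…",
    base_url="https://api.myapihq.com/llm/orgs/<org_id>/v1",
)
r = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Hi"}],
)
print(r.choices[0].message.content)  # assistant reply, in the OpenAI response shape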

Use raw complete (MyAPI shape) for first-party integrations; use the OpenAI-compat path for compatibility with existing client code.

Commands

| Command | What it does |
|---|---|
| `myapi llm models [--kind chat\|embed] [--json]` | List the live model catalog with pricing (cents/1M) |
| `myapi llm complete "<prompt>" [--model <id>] [--system "<s>"] [--max-tokens N] [--temperature 0..1] [--stop <csv>] [--file <path>] [--json]` | Raw chat completion; reply to stdout, usage to stderr. Defaults to the first chat model in the catalog |
| `myapi llm embed "<text>" --model <id> [--json]` | Embed a string (no model served today; returns EMBED_NOT_AVAILABLE) |
| `myapi llm classify "<input>" --labels <csv> [--multi] [--tier <t>] [--json]` | Pick a label from a set |
| `myapi llm extract "<input>" --schema <path\|json> [--tier <t>] [--json]` | Pull structured data conforming to a JSON Schema |
| `myapi llm summarize "<input>" [--style brief\|exec\|bullet] [--tier <t>] [--json]` | Summarize text |
| `myapi llm draft --kind <what> [--prompt "<s>"] [--context <json>] ["<src>"] [--tier <t>] [--json]` | Draft an email / reply / message / … |

Pass - as the prompt/input to read from stdin. Pass --file <path> to read longer content from disk.

Examples

# List the live catalog
myapi llm models
myapi llm models --kind chat --json | jq '.models[].id'

# Raw completion — picks the first chat model from the catalog
myapi llm complete "Summarize in 12 words: $(cat README.md)"

# Pin a specific model
myapi llm complete "Refactor this function: ..." \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --system "You are a careful Go reviewer." \
  --max-tokens 600

# ── Verbs (recommended for workflow steps) ──────────────────────────────

myapi llm classify "I was charged twice — please refund." \
  --labels billing,technical,sales,spam

myapi llm extract "Acme Corp employs 250 people in Berlin." \
  --schema '{"type":"object","properties":{"company":{"type":"string"},"employees":{"type":"integer"}}}'

myapi llm summarize --file long-thread.txt --style bullet

myapi llm draft --kind email \
  --prompt "Friendly welcome, under 60 words." \
  --context '{"recipient":"a new signup","product":"MyAPI"}'

# Classify + route an inbound webhook delivery
BODY=$(myapi webhook delivery <id> --json | jq -r '.body')
INTENT=$(printf '%s' "$BODY" | myapi llm classify - \
  --labels support,sales,spam --json | jq -r '.data.label')

Notes

draft context safety. context fields are quoted into the prompt verbatim, and sensitive-named keys (secret, api_key, password, …) are NOT redacted. Two guards sit on top: (a) injection defense strips the keys instructions, system, prompt, and override from context and reports them in meta.warnings; (b) an output guardrail substring-scans the response for fact values of length ≥ 4 and lists matches in meta.guardrails.facts_in_output (a signal, not redaction). Rule of thumb: never put credentials, PII, or internal metadata in context; pass opaque identifiers and refer to the underlying data indirectly.
Self-hosted raw, server-picked verbs. Raw runs on MyAPI's TPU; verbs route wherever the server picks.
Cost + latency. usage.cost_cents is authoritative — no markup. Qwen3-Coder-30B: 200–600 ms to first token, 1–3 s end-to-end.
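Live catalog, no streaming, no BYOK. Don't hard-code model ids: the models listing is the source of truth, and the CLI auto-picks a model if --model is omitted. Replies are returned whole, never streamed, and bring-your-own-key (BYOK) is not supported.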
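Because sensitive-named context keys are not redacted server-side, a caller may want its own pre-flight filter before calling draft. The Python sketch below is a client-side illustration only: the helper names are hypothetical, the hint list covers just the three names given in the note above, and the substring check merely mimics the described guardrail (values of length ≥ 4). It is not the server's implementation, and it reports keys, which may differ from the format of meta.guardrails.facts_in_output.

```python
SENSITIVE_HINTS = ("secret", "api_key", "password")  # extend for your own naming


def split_context(context):
    """Return (safe, dropped_keys), dropping keys whose names look sensitive."""
    safe, dropped = {}, []
    for key, value in context.items():
        if any(hint in key.lower() for hint in SENSITIVE_HINTS):
            dropped.append(key)
        else:
            safe[key] = value
    return safe, dropped


def keys_echoed_in_output(context, text, min_len=4):
    """List context keys whose string value (length >= min_len) appears verbatim in text."""
    return [
        key
        for key, value in context.items()
        if isinstance(value, str) and len(value) >= min_len and value in text
    ]


context = {
    "recipient": "a new signup",
    "product": "MyAPI",
    "api_key": "not-a-real-key",  # placeholder value for the example
}
safe, dropped = split_context(context)
echoed = keys_echoed_in_output(safe, "Welcome to MyAPI! Glad to have you.")
print(safe, dropped, echoed)
```

Filtering by key name catches only obvious mistakes; the rule of thumb above (keep credentials and PII out of context entirely) still applies.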