validate-model

name: validate-model description: Validate a model entry (or every model in a provider) in `apps/sim/providers/models.ts` against the provider's live API docs, reporting pricing and capability drift, dead capability flags, hosting/billing intent, and any field that cannot be verified. Use when auditing or repairing model entries under `apps/sim/providers/models.ts`.

Validate Model Skill

You audit one or more model entries in apps/sim/providers/models.ts against the provider's official live API docs. Hallucinated pricing and capabilities are the #1 failure mode in this file. Every numeric and capability claim must be re-derived from a live web fetch in this session — not from memory, not from training data, not from the user's marketing email.

Hard rules (do not skip)

Live-fetch or report unverified. Each field must be backed by a live WebFetch in this session. If you cannot reach an authoritative URL for a field, mark it UNVERIFIED in the report — do not silently confirm it from memory.
Cite every fact. Every value in the report must show the source URL it was checked against. No URL → mark UNVERIFIED.
Two-source rule for pricing. Cross-check input/output/cached against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice). If sources disagree, the provider's own docs win — flag the disagreement.
Inspect provider implementation before flagging capability mismatches. A capability flag in models.ts is dead unless the provider's code under apps/sim/providers/{provider}/ consumes it (see Consumption Matrix below). Setting a flag the provider ignores is a warning, not a critical.
Never auto-fix without printing the diff. Show the user the proposed diff before applying. Get confirmation.

Your Task

When invoked as /validate-model <provider> [model-id]:

Read the target entries from models.ts
Live-fetch the provider's official models, pricing, and capability/reasoning pages + at least one secondary source for pricing
Inspect the provider implementation to know which flags are actually consumed
Run the checklist below per model
Report findings (critical / warning / suggestion / unverified) with every cell linked to its source URL
Offer to fix; on confirm, edit models.ts in a single pass and re-lint

If model-id is omitted, validate every model in the provider.

Step 1: Read entries from `models.ts`

Capture per model: id, full pricing, full capabilities, contextWindow, releaseDate, recommended, speedOptimized, deprecated.

Step 2: Live-fetch authoritative sources

Use the canonical provider URL table in the add-model skill (.claude/commands/add-model.md, or its mirror .agents/skills/add-model/SKILL.md), Step 1, as the single source of truth — fetch the models index, pricing, and reasoning/parameter caveats pages listed there for the target provider. If you update one table, update the other in the same change.

Secondary cross-check (use at least one): OpenRouter, Artificial Analysis, CloudPrice.

If a fetch fails (404, timeout, paywall), record the URL attempted and mark dependent fields UNVERIFIED.

Step 3: Build the consumption map for this provider

Re-grep before trusting the snapshot below:

rg "reasoningEffort|reasoning_effort" apps/sim/providers/<provider>/
rg "verbosity" apps/sim/providers/<provider>/
rg "request\.thinking|thinking:" apps/sim/providers/<provider>/
rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers/<provider>/

Snapshot (verify before relying):

Capability	Consumed by
`reasoningEffort`	`openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped via thinking), `gemini/core.ts`
`verbosity`	`openai/core.ts`, `azure-openai/index.ts`
`thinking`	`anthropic/core.ts`, `gemini/core.ts`
`nativeStructuredOutputs`	`anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts`
`computerUse`	`anthropic/core.ts`
`temperature`	All providers (passthrough)

A flag set in models.ts but not in the consumption list for this provider = warning: dead flag.

Step 4: Run the checklist

For each model, evaluate every row. Statuses: ✓ matches docs, ✗ disagrees, ⚠️ single-source, ❓ UNVERIFIED (could not fetch).

Identity

id exactly matches provider's API model identifier (case, dots, dashes, prefix for resellers)
releaseDate matches launch announcement
deprecated: true set if provider has announced retirement (or removed from active list)

Pricing (per 1M tokens, USD)

pricing.input matches provider pricing page
pricing.output matches provider pricing page
pricing.cachedInput matches provider's documented cached/prompt-cache rate (or is correctly omitted if no caching offered)
pricing.updatedAt is recent — warn if older than 60 days

Context & output limits

contextWindow matches docs (in tokens)
capabilities.maxOutputTokens matches documented output cap (or is correctly omitted if "no output limit")

Capabilities (each must be DOCUMENTED-AS-SUPPORTED and CONSUMED-BY-PROVIDER-CODE)

temperature — provider accepts it for this model (reasoning-always-on models often reject)
reasoningEffort.values — list matches docs; omitted for always-reasoning models that reject the parameter (e.g., grok-4.3, where xAI docs explicitly state reasoning_effort is not supported). Verify per model — some always-reasoning models (e.g., OpenAI's o-series) DO accept reasoning_effort and should keep the flag.
verbosity.values — only on OpenAI gpt-5.x family; values match docs
thinking.levels + thinking.default — only on Anthropic/Gemini; values match docs
nativeStructuredOutputs — only on anthropic/fireworks/openrouter; provider must document Structured Outputs / JSON-mode for this model
toolUsageControl — provider supports tool_choice semantics
computerUse — provider implements computer-use loop AND model is a computer-use SKU
deepResearch — only on actual deep-research SKUs
memory: false — only when the model genuinely cannot maintain conversation history

Flags

recommended: true — at most one or two per provider; should be current flagship
speedOptimized: true — only on smallest/fastest tier (nano / flash-lite / haiku class)

Hosting / billing

If the model is under openai/anthropic/google, it is automatically in getHostedModels() → served with Sim's rotating key and billed via shouldBillModelUsage(). Confirm that is the intent (a BYOK-only model parked under one of these providers is a billing bug — warning).
If the model is hosted, the deployment is expected to have its {PREFIX}_COUNT / {PREFIX}_1..N env vars set (ops concern; note if it looks unset for a model claiming hosted support).

Step 5: Report (mandatory format)

For each model, emit a table with one row per checklist item. Every row that claims ✓ must have a URL.

### Validation — <model-id>

| Field | Repo | Live docs | Source URL | Status |
|---|---|---|---|---|
| `input` | $1.25/M | $1.25/M | https://docs.x.ai/... | ✓ |
| `cachedInput` | $0.50/M | $0.20/M | https://cloudprice.net/... | ✗ stale (price cut not picked up) |
| `reasoningEffort` | low/medium/high | rejected by API | https://docs.x.ai/.../reasoning | ✗ inert — selecting silently no-ops |
| `contextWindow` | 1,000,000 | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ (2 sources) |
| `releaseDate` | 2026-04-30 | not found in scraped pages | _attempted: docs.x.ai, x.ai/news_ | ❓ UNVERIFIED |

**Findings**
- 🔴 critical — `cachedInput` is wrong: docs say $0.20/M, repo has $0.50/M
- 🟡 warning — `reasoningEffort` is set but provider rejects it for this model (xAI docs explicitly: "reasoning_effort is not supported by grok-4.3")
- 🔵 suggestion — `pricing.updatedAt` is 90 days old; refresh
- ❓ unverified — `releaseDate` could not be confirmed from any fetched page; ask user

**Disagreements between sources**
- _none_ OR _OpenRouter says $X, provider docs say $Y — went with provider docs_

End each multi-model run with a summary count: N models checked · X critical · Y warnings · Z suggestions · W unverified.

Step 6: Offer to fix

After reporting, ask: "Want me to fix the critical and warning items? I'll print the diff first." On yes:

Print the proposed diff (do not apply yet)
Get user confirmation
Edit models.ts in a single pass
Run bun run lint
Re-run only the failed rows of the checklist on the new state

Severity definitions

🔴 critical — wrong number or wrong identifier that misleads users about cost or breaks API calls. Examples: incorrect pricing, wrong model id, wrong context window, capability the API rejects.
🟡 warning — dead code or internal inconsistency. Examples: capability flag the provider ignores, multiple recommended: true per provider, pricing.updatedAt >60 days old, missing deprecated: true on retired model.
🔵 suggestion — style/consistency. Examples: field order, missing speedOptimized on a clearly smallest-tier model.
❓ unverified — could not fetch an authoritative source for this field. Surface it; never silently confirm.

Common bugs this skill catches

Pricing drift after a provider price cut (very common — providers cut quarterly)
reasoningEffort set on always-reasoning models that reject the parameter (grok-4.3, o3-pro pattern)
nativeStructuredOutputs set on providers that don't consume the flag (dead)
thinking set on non-Anthropic/non-Gemini providers
verbosity set on non-gpt-5.x models
Wrong context window (e.g., 128k claimed vs 200k actual)
Stale pricing.updatedAt
Multiple recommended: true per provider after a flagship swap
Missing deprecated: true on retired models (e.g., the xAI batch retiring May 15, 2026)

What "I cannot verify this" looks like

If, after fetching the documented sources, a field cannot be confirmed:

Mark the row ❓ UNVERIFIED with the URL(s) attempted
Surface it in the Findings section with severity ❓
Do NOT mark the validation as passed
Ask the user for a docs URL or guidance before changing anything

The skill is allowed to say "I could not verify the cached input price for grok-4.3 from the official xAI docs in this session — I attempted [URLs] without finding the value. Third-party sources [URL1, URL2] both report $0.20/M. Confirm before I update." That is correct behavior. Hallucinating a number is not.