provider-verification

name: provider-verification
description: >-
  Verify mql provider resource/field changes against real cloud infrastructure. Given a pull request or a commit range, this provisions Terraform infra in the affected cloud(s), runs mql queries against every new or changed resource and field, reports the hourly cost (pausing for approval above $2/hr), opens a fix PR for any provider bugs it uncovers, and tears the infrastructure back down. Use this whenever someone wants to test, verify, smoke-test, or "prove out" a provider PR or a range of commits against live cloud APIs, e.g. "verify PR #7701 works", "spin up infra to test the new GCP resources", "check the azure changes against real infrastructure", "test resources changed between these commits". Trigger it even when the user only says "test this PR" in the context of an mql provider change.

Provider Verification

mql resources are thin wrappers over cloud APIs, so a .lr schema change is only really "done" once it has been run against a real cloud account. This skill automates that loop: figure out what changed, stand up just enough real infrastructure to exercise it, query it with mql run, and clean up.

It exists because the failure modes that matter — a wrong location wildcard, a missing API view, an SDK that lags the service — are invisible to compile checks and unit tests. They only show up against live APIs.

Inputs

One of:

A pull request: a number (7701) or URL. Diff comes from gh pr diff.
A commit range: <base>..<head> (e.g. 7bfc8787a..HEAD). Diff comes from git diff.

Accept multiple PRs at once — verify them together in one infra spin-up.
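
As a rough illustration of the two input forms, the sketch below classifies one argument and picks the matching diff command. The function names `parse_target` and `diff_command` are hypothetical, and the URL pattern assumes a GitHub-style `/pull/<number>` path; the bundled script may handle inputs differently.

```python
import re


def parse_target(arg: str) -> tuple[str, str]:
    """Classify one input as ("pr", "<number>") or ("range", "<base>..<head>")."""
    arg = arg.strip()
    # A bare PR number, optionally written with a leading "#".
    match = re.fullmatch(r"#?(\d+)", arg)
    if match:
        return ("pr", match.group(1))
    # A pull request URL containing /pull/<number> (assumed GitHub-style).
    match = re.search(r"/pull/(\d+)", arg)
    if match:
        return ("pr", match.group(1))
    # A two-dot commit range with a non-empty base and head.
    if ".." in arg and "..." not in arg:
        base, _, head = arg.partition("..")
        if base and head:
            return ("range", arg)
    raise ValueError(f"not a PR number, PR URL, or commit range: {arg!r}")


def diff_command(kind: str, value: str) -> list[str]:
    """Return the argv that produces the diff for one parsed target."""
    if kind == "pr":
        return ["gh", "pr", "diff", value]
    return ["git", "diff", value]
```

With several PRs, each argument would be parsed separately and the resulting diffs verified together in one run.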

Workflow

Work through these steps in order. Track them with TodoWrite — the run is long and the teardown step must not be skipped.

1. Extract what changed

Run the bundled script — it parses the diff and lists every added/changed resource and field, grouped by provider:

python3 .claude/skills/provider-verification/extract_changes.py --pr 7701
python3 .claude/skills/provider-verification/extract_changes.py --range A..B

It reports .lr schema changes (new resources, new fields) and flags the provider .go files that changed. A PR can touch a provider's code without touching its .lr (a pure bugfix); still verify those resources.

Doc-comment-only .lr changes (a PR that just adds doc-comments to existing resources) do not need infrastructure — skip them, but say so in the report.

2. Build the affected providers

The provider code under test must be the code that runs locally. From the repo root:

make providers/build/<provider> && make providers/install/<provider>

Build every affected provider. If the PRs are not yet merged, check out the PR branch first. If verifying a commit range, check out the head commit.

3. Check cloud auth

For each affected cloud, confirm credentials before provisioning anything: aws sts get-caller-identity, az account show, gcloud config get-value project, oci iam region list. If a cloud is not authenticated, stop and tell the user — do not try to provision it.
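
A minimal sketch of this pre-flight check, assuming the four commands above and treating a zero exit code as "authenticated". The cloud keys (`aws`, `azure`, `gcp`, `oci`) and the function names are illustrative labels, not part of the skill, and exit code alone is a coarse signal (a CLI may exit 0 while reporting an unset project).

```python
import subprocess
from typing import Callable, Sequence

# Credential checks named in this step, keyed by an illustrative cloud label.
AUTH_CHECKS = {
    "aws": ["aws", "sts", "get-caller-identity"],
    "azure": ["az", "account", "show"],
    "gcp": ["gcloud", "config", "get-value", "project"],
    "oci": ["oci", "iam", "region", "list"],
}


def run_check(cmd: Sequence[str]) -> bool:
    """Run one check; a missing CLI counts as not authenticated."""
    try:
        result = subprocess.run(list(cmd), capture_output=True, text=True)
    except FileNotFoundError:
        return False
    return result.returncode == 0


def unauthenticated(
    clouds: Sequence[str], run: Callable[[Sequence[str]], bool] = run_check
) -> list[str]:
    """Return the affected clouds whose credential check failed."""
    return [cloud for cloud in clouds if not run(AUTH_CHECKS[cloud])]
```

If `unauthenticated(...)` returns a non-empty list, the run stops and names those clouds to the user before any Terraform is written or applied.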

4. Generate Terraform

Goal: the cheapest real resources that make each changed field return non-empty, non-error data. Smallest SKUs, smallest instances, free tiers where they exist.

Read references/cloud-notes.md before writing any Terraform — it lists the per-cloud gotchas (SKUs that no longer exist, APIs that need a quota project, resources Terraform cannot create) that this process has already hit. It will save you a failed apply.

Write Terraform into a scratch directory outside the repo (e.g. ~/dev/mql-verify-<timestamp>/<cloud>/), one stack per cloud. When more than one cloud is involved, dispatch one subagent per cloud to write its stack in parallel — each agent writes the .tf, runs terraform init/validate/plan, and reports a per-resource hourly cost. Agents must not apply.

Tag every resource with project = mql-pr-verify so leftovers are findable.

5. Cost gate

Sum the 1-hour cost across every cloud's terraform plan. Present a per-resource cost table and the total.

Total ≤ $2/hour: state the cost and proceed.
Total > $2/hour: STOP. Show the table and ask the user to approve before applying. Do not apply until they say yes.
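
The gate itself is simple arithmetic. The sketch below is one way to express it, assuming the per-resource hourly prices have already been collected per cloud as decimal strings; the data shape and the names `cost_gate` and `cost_table` are hypothetical.

```python
from decimal import Decimal

COST_GATE_PER_HOUR = Decimal("2.00")


def cost_gate(hourly_costs: dict[str, dict[str, str]]) -> tuple[Decimal, bool]:
    """Sum every resource's hourly price; return (total, needs_approval)."""
    total = sum(
        (Decimal(price) for stack in hourly_costs.values() for price in stack.values()),
        Decimal("0"),
    )
    # Exactly $2/hour still proceeds; only a total above the gate pauses.
    return total, total > COST_GATE_PER_HOUR


def cost_table(hourly_costs: dict[str, dict[str, str]]) -> str:
    """Render the per-resource table that is shown in both branches."""
    lines = ["cloud | resource | $/hour"]
    for cloud, stack in hourly_costs.items():
        for resource, price in stack.items():
            lines.append(f"{cloud} | {resource} | {Decimal(price):.4f}")
    return "\n".join(lines)
```

Using `Decimal` rather than floats keeps a total that sits exactly on the $2 boundary from being pushed over it by rounding error.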

6. Apply

terraform apply -auto-approve per cloud — run them in parallel in the background; some resources are slow (see cloud-notes). Re-apply on transient failures, but cap it at 2–3 attempts — a failure that survives that many re-applies is not transient. Treat it as a blocker or an environment limitation and stop retrying. If a resource genuinely cannot be created, remove it from the stack, note it, and continue — one bad resource must not block the rest.
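
The retry cap can be sketched as a bounded loop. This assumes a cap of 3 attempts (the upper end of the range above) and uses Terraform's `-chdir` global option to target one stack directory; the function names are illustrative, and running stacks in parallel is left out for brevity.

```python
import subprocess
from typing import Callable, Optional

MAX_ATTEMPTS = 3


def terraform_apply(stack_dir: str) -> bool:
    """Apply one stack; True when terraform exits with status 0."""
    result = subprocess.run(
        ["terraform", f"-chdir={stack_dir}", "apply", "-auto-approve"]
    )
    return result.returncode == 0


def apply_with_retries(
    apply: Callable[[], bool], max_attempts: int = MAX_ATTEMPTS
) -> Optional[int]:
    """Return the attempt number that succeeded, or None once the cap is hit."""
    for attempt in range(1, max_attempts + 1):
        if apply():
            return attempt
    # Still failing after the cap: treat as a blocker, not a transient error.
    return None
```

A call such as `apply_with_retries(lambda: terraform_apply(stack_dir))` returning `None` is the signal to stop retrying and either drop the offending resource from the stack or record an environment limitation.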

7. Verify with `mql run`

For every new or changed resource and field, run a query and confirm two things: it returns no error, and it returns appropriate data.

mql run <provider> -c "<resource> { <changed fields> }"

A new field that resolves to false/""/[] is fine if that genuinely reflects the resource's state (feature disabled, list empty). It is not fine if the field should have data — that is a bug.
A query that errors, or that returns "no data available" because of an underlying API error, is a bug. Capture the exact error.
Typed reference fields (vpc(), kinesisStream(), …) must resolve to the referenced resource, not error.

For resources Terraform cannot provision (ephemeral jobs, etc.), create them best-effort via the cloud CLI (see cloud-notes) so the accessor still gets exercised. If even that is impossible, verify the accessor resolves cleanly (empty, no error) and say so.
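
The sketch below shows how the query string and the pass/bug rule above might be expressed. It assumes the caller has already extracted each field's value and any error text from the `mql run` output (that parsing is not shown), and all function names are hypothetical.

```python
from typing import Any, Optional, Sequence


def build_query(resource: str, fields: Sequence[str]) -> str:
    """Build '<resource> { <changed fields> }' for the -c argument."""
    return f"{resource} {{ {' '.join(fields)} }}"


def mql_argv(provider: str, resource: str, fields: Sequence[str]) -> list[str]:
    """Return the argv for one verification query."""
    return ["mql", "run", provider, "-c", build_query(resource, fields)]


def classify(
    value: Any, error: Optional[str] = None, empty_is_correct: bool = False
) -> str:
    """Apply the rule above: errors are bugs; empty passes only when correct."""
    if error:
        return "bug"
    is_empty = value is False or value == "" or value == []
    if is_empty and not empty_is_correct:
        return "bug"
    return "pass"
```

`empty_is_correct` has to be decided by looking at the provisioned resource (feature actually disabled, list actually empty); it should never default to true just because the query returned cleanly.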

8. Triage bugs

A bug is any verification failure caused by provider code, including preexisting bugs in code outside the PRs under test. For each:

Has a clear, verifiable fix: fix it (see step 9).
No confident fix (e.g. an SDK lagging a new API): do not guess a fix. Record it in the report and offer to open a tracking GitHub issue.

A failure caused by the cloud account (expired trial, missing quota, un-enabled API) is not a provider bug — report it as an environment limitation.
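
The triage rules reduce to a small decision table. A minimal sketch, assuming each failure has already been attributed to either the provider code or the cloud account; the labels `"provider"` and `"account"` and the returned action strings are illustrative.

```python
def triage(cause: str, has_verified_fix: bool = False) -> str:
    """Map one verification failure to the action this step prescribes."""
    if cause == "account":
        # Expired trial, missing quota, un-enabled API: not a provider bug.
        return "report as environment limitation"
    if cause != "provider":
        raise ValueError(f"unknown cause: {cause!r}")
    if has_verified_fix:
        return "fix in the combined PR"
    # No confident fix: record it rather than guess.
    return "record in report and offer a tracking issue"
```

Whether the failing code was touched by the PRs under test does not enter the decision; a preexisting provider bug is triaged the same way.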

9. Fix PR

If there are bugs with verifiable fixes, open one combined PR for all of them, across every provider:

Work in a worktree branched from main.
Apply the fixes. Match existing patterns in the provider (see CLAUDE.md).
gofmt -w changed files; rebuild the provider; re-run the failing queries against the still-live infrastructure to confirm each fix works.
Commit *.permissions.json if a fix changed it. No .lr.versions change unless a fix adds a schema field.
Commit (emoji-prefixed per CLAUDE.md — 🐛), push, gh pr create.

Verify fixes before teardown — the infrastructure is needed to prove them.

10. Teardown — always

Always destroy every stack at the end, even on failure, even when bugs were found (the fix PR was already verified in step 9). Run terraform destroy per cloud.

Destroy is fragile — handle the known failure modes in cloud-notes (orphaned EFS mount targets blocking AWS subnets, OCI API circuit-breakers, resources Terraform dropped from state). Fall back to CLI deletion when terraform destroy cannot finish. Confirm nothing tagged mql-pr-verify remains.

Note anything that genuinely cannot be deleted (e.g. an App Engine app) in the report — do not leave the user guessing.
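
One way to picture the "always" guarantee: teardown runs whether or not the verification work raised, and one stack failing to destroy does not stop the others from being attempted. In this sketch `work` and `destroy` are hypothetical callables supplied by the caller (for example, a wrapper around terraform destroy with the CLI fallback).

```python
from typing import Callable, Optional, Sequence


def teardown_all(
    stacks: Sequence[str], destroy: Callable[[str], bool]
) -> list[str]:
    """Attempt every stack; return the ones that could not be destroyed."""
    leftovers = []
    for stack in stacks:
        try:
            destroyed = destroy(stack)
        except Exception:
            destroyed = False
        if not destroyed:
            leftovers.append(stack)
    return leftovers


def run_then_teardown(
    work: Callable[[], None],
    stacks: Sequence[str],
    destroy: Callable[[str], bool],
) -> tuple[Optional[Exception], list[str]]:
    """Run the verification work, then tear down regardless of its outcome."""
    error: Optional[Exception] = None
    try:
        work()
    except Exception as exc:
        # Keep the failure for the report, but do not skip teardown.
        error = exc
    return error, teardown_all(stacks, destroy)
```

Both return values feed the report: the error goes under bugs or environment limitations, and any leftovers are listed in the Teardown section.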

11. Report

Always end with this structure:

# Verification report — <PRs / commit range>

## Provisioned
<per-cloud resource count + total $/hour for the run>

## Results
| PR / area | Resource / field | Result | Detail |
(✅ pass / ⚠️ partial / ❌ bug, one row per changed resource or field group)

## Bugs
<each bug: file:line, the failing query, observed vs expected, and either
"fixed in PR #NNNN" or "no verified fix — offer to open an issue">

## Environment limitations
<failures caused by the account, not the code>

## Teardown
<confirmation everything was destroyed; anything that could not be>

After the report, if there are unfixable bugs, offer to open a tracking issue.
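
The Results table in the template can be filled in mechanically once each row has a verdict. A small sketch, assuming rows arrive as `(area, target, result, detail)` tuples with `result` being one of `pass`, `partial`, or `bug`; the function name and the separator row are additions for illustration.

```python
ICONS = {"pass": "✅", "partial": "⚠️", "bug": "❌"}


def results_table(rows: list[tuple[str, str, str, str]]) -> str:
    """Render one table row per changed resource or field group."""
    lines = [
        "| PR / area | Resource / field | Result | Detail |",
        "| --- | --- | --- | --- |",
    ]
    for area, target, result, detail in rows:
        lines.append(f"| {area} | {target} | {ICONS[result]} {result} | {detail} |")
    return "\n".join(lines)
```

An unknown verdict raises `KeyError`, which is deliberate: a row without a clear pass, partial, or bug outcome should not reach the report.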

Key rules

Cost gate is $2/hour total. Below it, proceed. Above it, pause for approval. Always show the number.
One combined fix PR for all bugs, regardless of how many providers.
Always tear down — unconditionally, at the end.
Cheapest infra that works. This is a verification run, not a deployment.
Honest reporting. An empty result is a pass only when empty is correct; otherwise it is a bug. Never claim a field works without seeing real data.

Reference

references/cloud-notes.md — per-cloud Terraform/CLI gotchas, slow resources, and what cannot be provisioned or destroyed. Read it before steps 4, 6, 10.
extract_changes.py — diff → changed resources/fields, grouped by provider.
