release-checklist - SKILL.md Agent Skill

name: release-checklist
description: Pre-release safety audit for the Bifrost repo. Scans database migrations changed in a release for high-scale deadlock / lock-contention risks and for work that blocks application boot time, then produces a pass/warn/fail report with a concrete remediation plan. Invoked with /release-checklist [git-ref-range]. Built to grow - new checks are appended to the Checks Registry.
allowed-tools: Read, Grep, Glob, Bash, Task, AskUserQuestion

Release Checklist

A pre-release safety audit. Given a set of changes destined for release, run every check in the Checks Registry below and produce one consolidated report. This skill is read-only: it diagnoses, recommends a concrete fix for every finding, and never edits files. Applying a fix is a separate, explicitly-approved step.

The registry currently holds two migration-safety checks. It is designed to grow - see Adding a new check.

Scope: what "the release" means

Determine the change set to audit, in this order:

If the user passed a git ref or range (e.g. /release-checklist v1.4.0..HEAD or /release-checklist origin/dev), use it.
Otherwise default to everything not yet on the main branch (origin/dev): diff origin/dev...HEAD (three-dot, i.e. the changes on HEAD since its merge base with origin/dev) and also include uncommitted working-tree changes.
If that range is empty, tell the user and ask for an explicit range.
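
The resolution order above can be sketched in Go. This is a minimal illustration, not code from the repo: `resolveRange`, `defaultRange`, `errEmptyRange` and the `hasChanges` callback are hypothetical names, and `hasChanges` stands in for whatever wraps `git diff` / `git status`. Applying the empty-range rule to an explicit range as well is an assumption.

```go
package main

import (
	"errors"
	"strings"
)

// defaultRange is the fallback named in the text: everything not yet on origin/dev.
const defaultRange = "origin/dev...HEAD"

// errEmptyRange signals that the caller should ask the user for an explicit range.
var errEmptyRange = errors.New("change set is empty: ask the user for an explicit range")

// resolveRange (hypothetical helper) picks the range to audit.
// hasChanges reports whether a range contains anything to audit; for the
// default range it is expected to count uncommitted working-tree changes too.
func resolveRange(userArg string, hasChanges func(rangeSpec string) bool) (string, error) {
	rangeSpec := strings.TrimSpace(userArg)
	if rangeSpec == "" {
		rangeSpec = defaultRange // step 2: nothing passed, use the default
	}
	if !hasChanges(rangeSpec) {
		return "", errEmptyRange // step 3: nothing to audit, ask the user
	}
	return rangeSpec, nil
}
```

Keeping the git access behind a callback lets the ordering logic be exercised without a repository.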

Gather the raw material once, up front:

git fetch origin --quiet
git diff --stat origin/dev...HEAD
git diff origin/dev...HEAD -- '**/migrations.go' '**/matviews.go'
git status --porcelain
git diff HEAD -- '**/migrations.go' '**/matviews.go'

Migrations here are Go-defined, not .sql files. They live in:

framework/configstore/migrations.go - config DB (providers, keys, virtual keys, budgets)
framework/logstore/migrations.go - log DB (request logs; the high-volume table)
framework/logstore/matviews.go - materialized views over the log DB

A release with no diff in these files has no migration risk - record both migration checks as PASS (no migrations changed) and move on.
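
The short-circuit above amounts to filtering the changed paths by file name. A minimal Go sketch, assuming the paths come from something like `git diff --name-only`; `migrationFilesTouched` and `migrationBasenames` are hypothetical names, not part of the repo:

```go
package main

import (
	"path"
	"sort"
)

// migrationBasenames lists the file names the audit treats as migration
// sources, matching the three paths listed above.
var migrationBasenames = map[string]bool{
	"migrations.go": true,
	"matviews.go":   true,
}

// migrationFilesTouched (hypothetical helper) returns the changed paths that
// are migration files, sorted for stable output. An empty result means both
// migration checks are recorded as PASS (no migrations changed).
func migrationFilesTouched(changed []string) []string {
	var hits []string
	for _, p := range changed {
		if migrationBasenames[path.Base(p)] {
			hits = append(hits, p)
		}
	}
	sort.Strings(hits)
	return hits
}
```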

Migration system facts (needed by both checks)

All migrations run synchronously at boot. triggerMigrations() executes the full ordered migration list during store init, before the process serves traffic.
Supported databases: PostgreSQL and SQLite only. No MySQL. Lock behavior differs sharply between the two - judge both.
Migrations are cluster-serialized behind Postgres advisory lock 1000001 with a 5-minute acquisition timeout. A slow migration on one pod stalls every other pod's boot.
Each migration func runs in a transaction by default (Options.UseTransaction = true). A transaction holds every lock until the func returns - migration duration is lock duration.
Established escape hatch for heavy work: index builds and materialized views run in post-startup background goroutines under separate advisory locks (1000002 for indexes, 1000005/1000006 for matviews) - see framework/logstore/postgres.go, ensurePerformanceIndexes(), ensureMatViews(). migrationAddProviderHistogramIndex is intentionally a near-no-op that defers the real CREATE INDEX CONCURRENTLY there. This deferral pattern is the correct fix for anything heavy.
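
The boot ordering these facts describe (synchronous migrations first, heavy work deferred to a background goroutine) can be modelled in a few lines of Go. This is a toy model under stated assumptions, not the repo's code: `step` and `boot` are hypothetical, and advisory locks and transactions are left out entirely.

```go
package main

import "sync"

// step is a hypothetical unit of work: a migration or a deferred build.
type step struct {
	name string
	run  func() error
}

// boot runs every migration in order on the calling goroutine, then starts
// the deferred heavy work in the background and only then serves traffic.
// The returned WaitGroup lets callers wait for the background work.
func boot(migrations, background []step, serve func()) (*sync.WaitGroup, error) {
	for _, m := range migrations {
		if err := m.run(); err != nil {
			return nil, err // a failed migration aborts boot before serving
		}
	}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for _, b := range background {
			_ = b.run() // sketch only: errors are ignored here
		}
	}()
	serve()
	return &wg, nil
}
```

The point of the model: anything placed in `migrations` adds directly to time-to-serve, while anything in `background` does not.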

Checks Registry

Check 1 - Migrations that can deadlock or pile up under high-scale data

Goal: catch migrations whose locking is safe on a laptop but catastrophic on a production table with hundreds of millions of rows (notably the logstore request-log table).

For every added/modified migration func in the diff, flag:

| Signal | Why it is dangerous at scale |
|---|---|
| `CREATE INDEX` without `CONCURRENTLY` | `SHARE` lock for the whole build - all writes block for minutes/hours on a large table. |
| `CREATE INDEX CONCURRENTLY` with `UseTransaction = true` | Postgres forbids `CONCURRENTLY` in a transaction - runtime error. Needs `UseTransaction = false` or the background path. |
| `ALTER TABLE` (add/drop column, add constraint, change type) on a hot table | Takes `ACCESS EXCLUSIVE`; queues behind in-flight queries, then every new query queues behind it - a cluster-wide stall. |
| `ADD COLUMN` with a volatile / non-constant `DEFAULT` | Rewrites the whole table under `ACCESS EXCLUSIVE`. A constant default is metadata-only and fine on PG11+. |
| `ADD ... FOREIGN KEY` / `ADD CONSTRAINT` validated immediately | Locks both tables and scans the child. Prefer `NOT VALID` then a separate `VALIDATE CONSTRAINT`. |
| Bulk `UPDATE`/`DELETE`/backfill over the whole table in one transaction | Holds row locks until commit; collides with live writes; bloats the table. Must be batched. |
| Order-dependent backfill, or migrations locking tables A-then-B vs B-then-A | Classic deadlock: two transactions grab the same locks in opposite order. |
| SQLite drop-column path (`CREATE TABLE ... AS SELECT` + `DROP` + `RENAME`) on a large table | Full table copy under SQLite's single global write lock - blocks every writer. |
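
The first two rows of the table can be approximated mechanically. The Go sketch below is a heuristic under stated assumptions: `indexSignal` is a hypothetical helper, it looks at one SQL string at a time, and it does plain keyword matching (no SQL parsing), so review of the surrounding Go code is still required.

```go
package main

import "regexp"

var (
	// Matches CREATE INDEX and CREATE UNIQUE INDEX, case-insensitively.
	createIndexRe = regexp.MustCompile(`(?i)\bCREATE\s+(UNIQUE\s+)?INDEX\b`)
	// Matches the CONCURRENTLY keyword anywhere in the statement.
	concurrentlyRe = regexp.MustCompile(`(?i)\bCONCURRENTLY\b`)
)

// indexSignal (hypothetical helper) classifies one SQL statement against the
// two index rows of the signal table. useTransaction mirrors
// Options.UseTransaction for the migration containing the statement.
// It returns "" when neither signal applies.
func indexSignal(sql string, useTransaction bool) string {
	if !createIndexRe.MatchString(sql) {
		return ""
	}
	if !concurrentlyRe.MatchString(sql) {
		return "CREATE INDEX without CONCURRENTLY"
	}
	if useTransaction {
		return "CREATE INDEX CONCURRENTLY with UseTransaction = true"
	}
	return ""
}
```

A clean result here does not clear Check 2: a `CONCURRENTLY` build outside a transaction can still be too slow for the boot path.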

Report per flag: func name, file:line, the exact signal, realistic production impact, and a concrete remedy (add CONCURRENTLY + UseTransaction = false, batch the backfill, defer to ensurePerformanceIndexes, add the FK as NOT VALID, etc.).
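
For the "batch the backfill" remedy, one common shape is to walk the primary key in fixed-size ranges and run each range in its own short transaction. A minimal Go sketch of the range planning, assuming an integer primary key; `idRange` and `batchRanges` are hypothetical names and the SQL in the comment is illustrative only:

```go
package main

// idRange is a half-open primary-key interval [From, To).
type idRange struct{ From, To int64 }

// batchRanges splits [minID, maxID] into consecutive ranges of at most
// batchSize ids, so each batch can run in its own short transaction, e.g.
//   UPDATE logs SET ... WHERE id >= From AND id < To
// It returns nil for an empty interval or a non-positive batch size.
func batchRanges(minID, maxID, batchSize int64) []idRange {
	if batchSize <= 0 || maxID < minID {
		return nil
	}
	var out []idRange
	for from := minID; from <= maxID; from += batchSize {
		to := from + batchSize
		if to > maxID+1 {
			to = maxID + 1 // clamp the last batch to the end of the interval
		}
		out = append(out, idRange{From: from, To: to})
	}
	return out
}
```

Batching shortens how long each lock is held; it does not shorten total runtime, so a batched backfill that still runs inside boot remains a Check 2 finding.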

Severity: FAIL if writes to a high-volume table (logstore logs) can be blocked or a deadlock is plausible; WARN for config-DB tables (low row counts, but still flag).

Check 2 - Migrations that block boot-up time

Goal: triggerMigrations() runs synchronously before the process serves traffic, so any migration whose runtime grows with row count delays every pod's startup, and breaks it outright once the runtime passes the 5-minute advisory-lock timeout.

Flag any added/modified migration whose cost scales with data volume:

| Signal | Why it blocks boot |
|---|---|
| Any non-`CONCURRENTLY` `CREATE INDEX` | Build time scales with row count; runs inside boot. |
| `CREATE INDEX CONCURRENTLY` placed directly in `triggerMigrations` | Even concurrent builds take minutes/hours on big tables; belongs in the background goroutine path. |
| Table rewrite: volatile-default `ADD COLUMN`, type change, SQLite drop-column copy | Rewrites/copies every row during boot. |
| Data backfill loop / bulk `UPDATE` over an unbounded row set | Runtime = O(rows); unbounded backfills have no ceiling. |
| `matviews.go` - creating or fully refreshing a matview in the synchronous path | Matview build scans the base table; must use the `ensureMatViews()` background path. |
| Anything that could realistically exceed the 5-minute advisory-lock timeout | Other pods fail to acquire lock `1000001` and crash-loop. |

Expected safe pattern: heavy work is a near-no-op migration; the real build runs post-startup in a background goroutine. Compare each new heavy migration against migrationAddProviderHistogramIndex - if it does the heavy lifting inline, that is the finding.

Report per flag: func name, file:line, why runtime scales with data, risk at production row counts, and the remedy (move to the background path / batch it / make the default constant).

Severity: FAIL if the operation can plausibly exceed the 5-minute lock timeout on a production-sized table; WARN otherwise.
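
The severity rule reduces to a comparison against the lock timeout. A minimal Go sketch, assuming the auditor already has a rough runtime estimate for a production-sized table; `bootSeverity` and `lockTimeout` are hypothetical names, and how the estimate is obtained is out of scope here:

```go
package main

import "time"

// lockTimeout is the advisory-lock acquisition timeout stated in the
// migration system facts.
const lockTimeout = 5 * time.Minute

// bootSeverity (hypothetical helper) maps a migration to its Check 2 status.
// scalesWithData is false for O(1) migrations, which pass; estimated is the
// assumed runtime on a production-sized table.
func bootSeverity(scalesWithData bool, estimated time.Duration) string {
	switch {
	case !scalesWithData:
		return "PASS"
	case estimated >= lockTimeout:
		return "FAIL"
	default:
		return "WARN"
	}
}
```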

Report format

Output one consolidated report. Do not edit any files - this skill recommends fixes, it does not apply them.

# Release Checklist - <ref range>

Audited: <N> files changed, <M> migration func(s) added/modified
Migration files touched: <list, or "none">

## Check 1 - High-scale deadlock / lock contention
Status: PASS | WARN | FAIL
<findings: severity - func name - file:line - impact - remedy>

## Check 2 - Boot-time-blocking migrations
Status: PASS | WARN | FAIL
<findings ...>

## Remediation Plan
<one table row per WARN/FAIL finding from every check; see rules below>

## Summary
<overall: SHIP / SHIP WITH WARNINGS / DO NOT SHIP>
<one line per FAIL that must be resolved before release>

Remediation Plan table

Collect every WARN and FAIL finding from all checks into one table - this is the actionable outcome of the audit. If every check is PASS, drop the table and write "No remediation needed - all checks passed." instead.

| # | Impacted migration (func @ file:line) | Check | Severity | Offending operation / query | Recommended change |
|---|---|---|---|---|---|

Impacted migration - the migration func and its file:line.
Offending operation / query - the exact SQL or migrator call that triggers the signal (e.g. CREATE INDEX idx_logs_foo ON logs(foo)), quoted verbatim or tightly paraphrased. This is what is wrong.
Recommended change - the precise fix: the corrected statement, the option to flip (UseTransaction = false), or the path to move to (ensurePerformanceIndexes). This is what to do instead.
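
The table rules above (one row per WARN/FAIL finding, a fixed sentence when there are none) can be sketched as a small renderer. This is an illustration only: the `finding` struct and `remediationPlan` function are hypothetical, with fields named after the table columns.

```go
package main

import (
	"fmt"
	"strings"
)

// finding is a hypothetical record with one field per table column.
type finding struct {
	Migration string // func @ file:line
	Check     string
	Severity  string // "PASS", "WARN" or "FAIL"
	Offending string // the exact SQL or migrator call
	Fix       string // the recommended change
}

// remediationPlan renders every WARN/FAIL finding as a numbered table row.
// PASS findings are skipped; with no rows it returns the fixed sentence.
func remediationPlan(findings []finding) string {
	var b strings.Builder
	n := 0
	for _, f := range findings {
		if f.Severity != "WARN" && f.Severity != "FAIL" {
			continue
		}
		if n == 0 {
			b.WriteString("| # | Impacted migration (func @ file:line) | Check | Severity | Offending operation / query | Recommended change |\n")
			b.WriteString("|---|---|---|---|---|---|\n")
		}
		n++
		fmt.Fprintf(&b, "| %d | %s | %s | %s | %s | %s |\n",
			n, f.Migration, f.Check, f.Severity, f.Offending, f.Fix)
	}
	if n == 0 {
		return "No remediation needed - all checks passed."
	}
	return b.String()
}
```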

When a fix needs more than a table cell (multi-line SQL, a Go code change), keep the table row short and add a ### Fix <#> - <func name> block below the table with a before/after the reader can apply directly:

### Fix 1 - migrationAddFooIndex   (framework/logstore/migrations.go:1234)
- // current - blocks all writes on logs for the whole build
- CREATE INDEX idx_logs_foo ON logs(foo)
+ // append to performanceIndexes; built CONCURRENTLY off the boot path
+ CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_logs_foo ON logs(foo)
Why: keeps the boot path O(1); the real build runs in ensurePerformanceIndexes.

Rules: overall is DO NOT SHIP if any check is FAIL, SHIP WITH WARNINGS if any WARN, else SHIP. Always show every check even when it passes - a visible PASS (no migrations changed) is a real result. Never silently drop a check, and never drop the Remediation Plan when there is at least one WARN/FAIL.
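
The roll-up rule is small enough to state as code. A minimal Go sketch; `overallVerdict` is a hypothetical name and the statuses are the three strings used in the report template:

```go
package main

// overallVerdict applies the roll-up rule to the per-check statuses
// ("PASS", "WARN", "FAIL"): any FAIL wins, then any WARN, else SHIP.
func overallVerdict(statuses []string) string {
	verdict := "SHIP"
	for _, s := range statuses {
		switch s {
		case "FAIL":
			return "DO NOT SHIP" // a single FAIL decides the outcome
		case "WARN":
			verdict = "SHIP WITH WARNINGS"
		}
	}
	return verdict
}
```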

Adding a new check

This skill is meant to grow. To add a check:

Add a ### Check N - <title> subsection under Checks Registry, following Checks 1 and 2: a one-line Goal, a signal table, a per-finding report instruction, and a severity rule.
If it needs background facts, add them once under "Migration system facts" (or a new facts heading) so they are stated once and reused.
Add the check's heading to the Report format template. Its WARN/FAIL findings flow into the shared Remediation Plan table automatically - no per-check table needed.
Keep checks independent - one check failing must not stop the others from running.
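
The independence requirement can be pictured as a registry loop in which each check is isolated from the others. The Go sketch below is illustrative: `check`, `result`, `runAll` and `runOne` are hypothetical names, and reporting a check that errors or panics as FAIL is an assumed, conservative policy that the text above does not specify.

```go
package main

import "fmt"

// check is a hypothetical registry entry.
type check struct {
	Title string
	Run   func() (status string, err error)
}

// result is what the report shows for one check.
type result struct {
	Title  string
	Status string
	Note   string
}

// runAll executes every registered check and returns one result per check,
// in registry order. A broken check never stops the ones after it.
func runAll(registry []check) []result {
	results := make([]result, 0, len(registry))
	for _, c := range registry {
		results = append(results, runOne(c))
	}
	return results
}

// runOne isolates a single check: errors and panics become a FAIL result
// (assumed policy) instead of aborting the whole audit.
func runOne(c check) (r result) {
	r = result{Title: c.Title}
	defer func() {
		if p := recover(); p != nil {
			r.Status = "FAIL"
			r.Note = fmt.Sprintf("check panicked: %v", p)
		}
	}()
	status, err := c.Run()
	if err != nil {
		r.Status = "FAIL"
		r.Note = err.Error()
		return r
	}
	r.Status = status
	return r
}
```

Because every check yields a result, the "always show every check" rule holds even when one of them misbehaves.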