---
name: release-checklist
description: Pre-release safety audit for the Bifrost repo. Scans database migrations changed in a release for high-scale deadlock / lock-contention risks and for work that blocks application boot time, then produces a pass/warn/fail report with a concrete remediation plan. Invoked with /release-checklist [git-ref-range]. Built to grow - new checks are appended to the Checks Registry.
allowed-tools: Read, Grep, Glob, Bash, Task, AskUserQuestion
---
# Release Checklist
A pre-release safety audit. Given a set of changes destined for release, run every check in the Checks Registry below and produce one consolidated report. This skill is read-only: it diagnoses, recommends a concrete fix for every finding, and never edits files. Applying a fix is a separate, explicitly-approved step.
The registry currently holds two migration-safety checks. It is designed to grow - see Adding a new check.
## Scope: what "the release" means
Determine the change set to audit, in this order:
- If the user passed a git ref or range (e.g. `/release-checklist v1.4.0..HEAD` or `/release-checklist origin/dev`), use it.
- Otherwise default to everything not yet on the main branch: diff `origin/dev...HEAD` (three-dot) and also include uncommitted working-tree changes.
- If that range is empty, tell the user and ask for an explicit range.
Gather the raw material once, up front:
git fetch origin --quiet
git diff --stat origin/dev...HEAD
git diff origin/dev...HEAD -- '**/migrations.go' '**/matviews.go'
git status --porcelain
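The scope rules and the commands above can be sketched as a small planning function. This is a minimal illustration in Go, not part of the skill: `gatherCommands`, `defaultRange`, and `migrationPaths` are hypothetical names, and the sketch only builds the argument lists rather than running git.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultRange is the fallback from the scope rules: everything not yet on origin/dev.
const defaultRange = "origin/dev...HEAD"

// migrationPaths are the pathspecs that limit the diff to migration sources.
var migrationPaths = []string{"**/migrations.go", "**/matviews.go"}

// gatherCommands returns the git invocations for one audit run (hypothetical helper).
// An explicit ref or range from the user is passed through unchanged; with no
// argument the default range is used and the working tree is inspected as well.
func gatherCommands(userArg string) [][]string {
	rng := strings.TrimSpace(userArg)
	useDefault := rng == ""
	if useDefault {
		rng = defaultRange
	}
	cmds := [][]string{
		{"git", "fetch", "origin", "--quiet"},
		{"git", "diff", "--stat", rng},
		append([]string{"git", "diff", rng, "--"}, migrationPaths...),
	}
	if useDefault {
		// Uncommitted changes only count toward the default change set.
		cmds = append(cmds, []string{"git", "status", "--porcelain"})
	}
	return cmds
}

func main() {
	for _, cmd := range gatherCommands("v1.4.0..HEAD") {
		fmt.Println(strings.Join(cmd, " "))
	}
}
```

If `git diff --stat` comes back empty for the chosen range, the third scope rule applies: stop and ask the user for an explicit range instead of reporting a PASS.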
Migrations here are Go-defined, not .sql files. They live in:
- `framework/configstore/migrations.go` - config DB (providers, keys, virtual keys, budgets)
- `framework/logstore/migrations.go` - log DB (request logs; the high-volume table)
- `framework/logstore/matviews.go` - materialized views over the log DB
A release with no diff in these files has no migration risk - record both migration
checks as PASS (no migrations changed) and move on.
## Migration system facts (needed by both checks)
- All migrations run synchronously at boot. `triggerMigrations()` executes the full ordered migration list during store init, before the process serves traffic.
- Supported databases: PostgreSQL and SQLite only. No MySQL. Lock behavior differs sharply between the two - judge both.
- Migrations are cluster-serialized behind Postgres advisory lock `1000001` with a 5-minute acquisition timeout. A slow migration on one pod stalls every other pod's boot.
- Each migration func runs in a transaction by default (`Options.UseTransaction = true`). A transaction holds every lock until the func returns - migration duration is lock duration.
- Established escape hatch for heavy work: index builds and materialized views run in post-startup background goroutines under separate advisory locks (`1000002` for indexes, `1000005`/`1000006` for matviews) - see `framework/logstore/postgres.go`, `ensurePerformanceIndexes()`, `ensureMatViews()`. `migrationAddProviderHistogramIndex` is intentionally a near-no-op that defers the real `CREATE INDEX CONCURRENTLY` there. This deferral pattern is the correct fix for anything heavy.
## Checks Registry
### Check 1 - Migrations that can deadlock or pile up under high-scale data
Goal: catch migrations whose locking is safe on a laptop but catastrophic on a production table with hundreds of millions of rows (notably the logstore request-log table).
For every added/modified migration func in the diff, flag:
| Signal | Why it is dangerous at scale |
|---|---|
| `CREATE INDEX` without `CONCURRENTLY` | `SHARE` lock for the whole build - all writes block for minutes/hours on a large table. |
| `CREATE INDEX CONCURRENTLY` with `UseTransaction = true` | Postgres forbids `CONCURRENTLY` in a transaction - runtime error. Needs `UseTransaction = false` or the background path. |
| `ALTER TABLE` (add/drop column, add constraint, change type) on a hot table | Takes `ACCESS EXCLUSIVE`; queues behind in-flight queries, then every new query queues behind it - a cluster-wide stall. |
| `ADD COLUMN` with a volatile / non-constant `DEFAULT` | Rewrites the whole table under `ACCESS EXCLUSIVE`. A constant default is metadata-only and fine on PG11+. |
| `ADD ... FOREIGN KEY` / `ADD CONSTRAINT` validated immediately | Locks both tables and scans the child. Prefer `NOT VALID` then a separate `VALIDATE CONSTRAINT`. |
| Bulk `UPDATE`/`DELETE`/backfill over the whole table in one transaction | Holds row locks until commit; collides with live writes; bloats the table. Must be batched. |
| Order-dependent backfill, or migrations locking tables A-then-B vs B-then-A | Classic deadlock: two transactions grab the same locks in opposite order. |
| SQLite drop-column path (`CREATE TABLE ... AS SELECT` + `DROP` + `RENAME`) on a large table | Full table copy under SQLite's single global write lock - blocks every writer. |
Report per flag: func name, file:line, the exact signal, realistic production impact,
and a concrete remedy (add CONCURRENTLY + UseTransaction = false, batch the backfill,
defer to ensurePerformanceIndexes, add the FK as NOT VALID, etc.).
Severity: FAIL if writes to a high-volume table (logstore logs) can be blocked or a
deadlock is plausible; WARN for config-DB tables (low row counts, but still flag).
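A few of the Check 1 signals are mechanical enough to sketch as a string classifier. The Go below is a minimal illustration, not the skill's implementation: `Finding`, `classify`, and the regular expressions are hypothetical, and only the two index signals and the foreign-key signal are covered. The other signals (hot-table `ALTER TABLE`, lock ordering, backfills) still need a read of the migration func itself.

```go
package main

import (
	"fmt"
	"regexp"
)

// Finding pairs a Check 1 signal with its suggested remedy (hypothetical type).
type Finding struct {
	Signal string
	Remedy string
}

var (
	createIndexRe   = regexp.MustCompile(`(?i)\bCREATE\s+(?:UNIQUE\s+)?INDEX\b`)
	concurrentlyRe  = regexp.MustCompile(`(?i)\bCONCURRENTLY\b`)
	addForeignKeyRe = regexp.MustCompile(`(?i)\bADD\b[^;]*\bFOREIGN\s+KEY\b`)
	notValidRe      = regexp.MustCompile(`(?i)\bNOT\s+VALID\b`)
)

// classify flags one SQL statement. useTransaction mirrors Options.UseTransaction
// for the migration that executes the statement.
func classify(sql string, useTransaction bool) []Finding {
	var findings []Finding
	if createIndexRe.MatchString(sql) {
		switch {
		case !concurrentlyRe.MatchString(sql):
			findings = append(findings, Finding{
				Signal: "CREATE INDEX without CONCURRENTLY",
				Remedy: "add CONCURRENTLY with UseTransaction = false, or defer to ensurePerformanceIndexes",
			})
		case useTransaction:
			findings = append(findings, Finding{
				Signal: "CREATE INDEX CONCURRENTLY with UseTransaction = true",
				Remedy: "set UseTransaction = false or move the build to the background path",
			})
		}
	}
	if addForeignKeyRe.MatchString(sql) && !notValidRe.MatchString(sql) {
		findings = append(findings, Finding{
			Signal: "foreign key validated immediately",
			Remedy: "add the FK as NOT VALID, then run VALIDATE CONSTRAINT separately",
		})
	}
	return findings
}

func main() {
	stmts := []string{
		"CREATE INDEX idx_logs_foo ON logs(foo)",
		"CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_logs_foo ON logs(foo)",
	}
	for _, s := range stmts {
		for _, f := range classify(s, true) {
			fmt.Printf("%s\n  signal: %s\n  remedy: %s\n", s, f.Signal, f.Remedy)
		}
	}
}
```

A text match like this only narrows the search. Severity still depends on which table the statement touches, which the classifier does not know.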
### Check 2 - Migrations that block boot-up time
Goal: because `triggerMigrations()` runs synchronously before the process serves traffic, any migration whose runtime grows with row count delays every pod's startup, and one that runs past the 5-minute advisory-lock timeout breaks it.
Flag any added/modified migration whose cost scales with data volume:
| Signal | Why it blocks boot |
|---|---|
| Any non-`CONCURRENTLY` `CREATE INDEX` | Build time scales with row count; runs inside boot. |
| `CREATE INDEX CONCURRENTLY` placed directly in `triggerMigrations` | Even concurrent builds take minutes/hours on big tables; belongs in the background goroutine path. |
| Table rewrite: volatile-default `ADD COLUMN`, type change, SQLite drop-column copy | Rewrites/copies every row during boot. |
| Data backfill loop / bulk `UPDATE` over an unbounded row set | Runtime = O(rows); unbounded backfills have no ceiling. |
| `matviews.go` - creating or fully refreshing a matview in the synchronous path | Matview build scans the base table; must use the `ensureMatViews()` background path. |
| Anything that could realistically exceed the 5-minute advisory-lock timeout | Other pods fail to acquire lock `1000001` and crash-loop. |
Expected safe pattern: heavy work is a near-no-op migration; the real build runs
post-startup in a background goroutine. Compare each new heavy migration against
migrationAddProviderHistogramIndex - if it does the heavy lifting inline, that is the finding.
Report per flag: func name, file:line, why runtime scales with data, risk at production row counts, and the remedy (move to the background path / batch it / make the default constant).
Severity: FAIL if the operation can plausibly exceed the 5-minute lock timeout on a
production-sized table; WARN otherwise.
## Report format
Output one consolidated report. Do not edit any files - this skill recommends fixes, it does not apply them.
# Release Checklist - <ref range>
Audited: <N> files changed, <M> migration func(s) added/modified
Migration files touched: <list, or "none">
## Check 1 - High-scale deadlock / lock contention
Status: PASS | WARN | FAIL
<findings: severity - func name - file:line - impact - remedy>
## Check 2 - Boot-time-blocking migrations
Status: PASS | WARN | FAIL
<findings ...>
## Remediation Plan
<one table row per WARN/FAIL finding from every check; see rules below>
## Summary
<overall: SHIP / SHIP WITH WARNINGS / DO NOT SHIP>
<one line per FAIL that must be resolved before release>
### Remediation Plan table
Collect every WARN and FAIL finding from all checks into one table - this is the actionable outcome of the audit. If every check is PASS, drop the table and write `No remediation needed - all checks passed.` instead.
| # | Impacted migration (func @ file:line) | Check | Severity | Offending operation / query | Recommended change |
|---|---|---|---|---|---|
- Impacted migration - the migration func and its `file:line`.
- Offending operation / query - the exact SQL or migrator call that triggers the signal (e.g. `CREATE INDEX idx_logs_foo ON logs(foo)`), quoted verbatim or tightly paraphrased. This is what is wrong.
- Recommended change - the precise fix: the corrected statement, the option to flip (`UseTransaction = false`), or the path to move to (`ensurePerformanceIndexes`). This is what to do instead.
When a fix needs more than a table cell (multi-line SQL, a Go code change), keep
the table row short and add a ### Fix <#> - <func name> block below the table
with a before/after the reader can apply directly:
### Fix 1 - migrationAddFooIndex (framework/logstore/migrations.go:1234)
- // current - blocks all writes on logs for the whole build
- CREATE INDEX idx_logs_foo ON logs(foo)
+ // append to performanceIndexes; built CONCURRENTLY off the boot path
+ CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_logs_foo ON logs(foo)
Why: keeps the boot path O(1); the real build runs in ensurePerformanceIndexes.
Rules: overall is DO NOT SHIP if any check is FAIL, SHIP WITH WARNINGS if any
WARN, else SHIP. Always show every check even when it passes - a visible
PASS (no migrations changed) is a real result. Never silently drop a check, and
never drop the Remediation Plan when there is at least one WARN/FAIL.
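The overall-verdict rule reduces to a short fold over the per-check statuses. A minimal Go sketch follows, with hypothetical names (`Status`, `overall`):

```go
package main

import "fmt"

// Status is the result of a single check.
type Status string

const (
	Pass Status = "PASS"
	Warn Status = "WARN"
	Fail Status = "FAIL"
)

// overall applies the report rules: any FAIL means DO NOT SHIP, otherwise any
// WARN means SHIP WITH WARNINGS, otherwise SHIP.
func overall(checks []Status) string {
	verdict := "SHIP"
	for _, s := range checks {
		switch s {
		case Fail:
			return "DO NOT SHIP"
		case Warn:
			verdict = "SHIP WITH WARNINGS"
		}
	}
	return verdict
}

func main() {
	fmt.Println(overall([]Status{Pass, Warn}))
}
```

A release with no migration changes still passes two `PASS` entries into this function; the checks are reported, not skipped.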
## Adding a new check
This skill is meant to grow. To add a check:
- Add a `### Check N - <title>` subsection under Checks Registry, following Checks 1 and 2: a one-line Goal, a signal table, a per-finding report instruction, and a severity rule.
- If it needs background facts, add them once under "Migration system facts" (or a new facts heading) so they are stated once and reused.
- Add the check's heading to the Report format template. Its `WARN`/`FAIL` findings flow into the shared Remediation Plan table automatically - no per-check table needed.
- Keep checks independent - one check failing must not stop the others from running.