name: workflow-failover-migrate
description: Migrate a GitHub Actions workflow in or out of the RUNNER_GENERIC failover pattern, and apply the per-step continue-on-error pattern for informational workflows. Use when a hosted-runner outage is blocking PRs (instant-fail jobs in 2-4s with ubuntu-latest), when a workflow should fail over to Pi runners, when an informational scanner needs to always report green, or when you need to roll a previously-migrated workflow back to hosted because it doesn't run on ARM64. Triggers on "migrate workflow to Pi", "RUNNER_GENERIC failover", "instant-fail", "make check non-blocking", "informational only check", "pin back to ubuntu-latest", "workflow always green".
allowed-tools: Bash, Read, Edit, Grep
Workflow Failover Migration
What this skill knows
This is hard-won from a session where five rounds of fixes were needed before the right pattern stuck. It codifies four lessons:
- Instant-fail = hosted-runner outage. Jobs that complete in 2–4 seconds and report
failurehaven't actually run — the GitHub control plane rejected the runner request before any YAML executed. CLAUDE.md describes the canonical failure: "the job was not started because recent account payments have failed". - Job-level
continue-on-error: truedoes NOT flip the per-job check conclusion that branch protection reads. It only affects the overall workflow conclusion. Empirically verified — was reverted in commit 884b0cf during the same session. - Step-level
continue-on-error: trueDOES flip the job conclusion if applied to every step that can fail. The job ends up withsuccesseven if individual steps fail. - Not every workflow can fail over to Pi. ARM64-incompatible deps (system Chrome on x86_64 for some Playwright matrices, CodeQL bundles older than v4, SST cold deploys that exceed Pi RAM) will hang or fail in a different way. The Preview workflow was documented as failed-on-Pi for months before being re-tried successfully when SST/CDK matured.
The three patterns
1. Migrate a hard-pinned workflow into RUNNER_GENERIC failover
Use when the workflow is currently runs-on: ubuntu-latest and the underlying tooling is ARM64-compatible (pure JS/TS, anything that uses actions/setup-node only, Playwright with bundled Chromium ≥ v1.40, CodeQL action v4+).
# Before
runs-on: ubuntu-latest
# After
runs-on: ${{ fromJSON(vars.RUNNER_GENERIC || '"ubuntu-latest"') }}
When RUNNER_GENERIC is unset, behaviour is identical to before (hosted). When flipped to Pi via .github/scripts/toggle-runner.sh pi, the workflow uses Pi runners.
Always also update docs/runners.md — move the entry from the "stay GitHub-hosted" section to the "opted in" list. Past failed-failover attempts (like preview.yml run 26321031309) belong in a parenthetical note, not a deletion.
If the migration is risky (cold-start SST, CodeQL on heavy projects), bump the timeout-minutes by 50–100% to leave headroom for slower ARM execution.
2. Make an informational workflow always report green
Use when a workflow is intentionally non-blocking but its failure check status still blocks branch protection. Common cases: security scanners that flag false positives, custom audit scripts, anything where the value is in the log not the gate.
jobs:
scan:
name: My Informational Check
# Job-level continue-on-error is NOT enough — see lesson 2 above.
# Apply step-level continue-on-error on EVERY step that can fail.
steps:
- uses: actions/checkout@v4
- name: Setup tool
uses: actions/setup-something@v4
continue-on-error: true # ← needed; setup itself can fail on Pi
with:
version: 'latest'
- name: Install deps
continue-on-error: true # ← needed; npm/pip can fail on ARM
run: npm install
- name: Run the scan
continue-on-error: true # ← needed; the tool itself may exit non-zero
run: npm run scan
actions/checkout@v4 does not need it — checkout failures should still fail the job.
3. Pin a workflow back to hosted (rollback)
Use when a migrated workflow actually doesn't work on Pi and needs to go back. Reverse the YAML edit:
# After (rollback)
runs-on: ubuntu-latest
And update docs/runners.md:
- Move entry back into the "stay GitHub-hosted" list
- Add a one-line note explaining what broke and the run ID, so the next person doesn't re-try blindly
Workflow checklist
For each workflow being migrated:
- Read the workflow file with
Readand locate theruns-on:line. - Check the tooling — Grep the steps for native deps, x86_64-only binaries, Chrome from system package manager (not the bundled Playwright Chromium), or anything with "linux-amd64" in its install URL.
- Edit the file with
Edit— replace exactly one line, theruns-on:. - Bump the timeout if the work is non-trivial (build, test matrices, scans).
- Edit
docs/runners.md— move the entry between the two lists, preserving prior-attempt notes. - Commit with a message describing the intent and the timeout bump rationale.
Decision tree
Workflow failing on PR?
├── Failed in 2-4s? → Hosted-runner outage.
│ ├── ARM64-compatible tooling? → Pattern 1 (migrate to failover)
│ └── x86_64-only tooling? → Either wait for hosted to recover,
│ or remove from required checks in
│ branch protection (admin-only).
│
├── Failed after running for > 30s? → Real failure.
│ ├── Real bug or flake? → Fix it.
│ └── Informational/scanner with false positives?
│ → Pattern 2 (always-green via step-level CoE)
│
└── Was previously migrated to Pi, now hanging?
└── Pattern 3 (rollback to hosted, document)
Files this skill touches
.github/workflows/*.yml— the workflow being migrateddocs/runners.md— the opt-in / stay-hosted lists
Toggling the runner pool itself
This skill does NOT toggle the RUNNER_GENERIC repo variable. That's the job of .github/scripts/toggle-runner.sh (documented in runner-ops skill). This skill assumes you've already decided which runner pool to target.
Anti-patterns observed in the wild
- Setting
continue-on-error: trueat the job level and expecting branch protection to accept it. It doesn't. See lesson 2. - Adding new
runs-on: ubuntu-latestworkflows without considering the failover pattern. Every new workflow without browser/x86_64 needs should opt in from day one. - Migrating a workflow blindly because it's failing. Confirm it failed for an outage reason (instant-fail signature) before changing the runner — a real bug stays a real bug after migration.
- Removing the prior-failed-attempt note from
docs/runners.mdduring a re-try. Always demote to a parenthetical, never delete — institutional memory matters when the next outage hits.