failure-diagnosis - SKILL.md Agent Skill

name: failure-diagnosis
description: >
Subagent-only. Do not load this skill in the orchestrator's transcript: the diagnostic methodology and niche-edge-cases catalogue are heavy enough that inlining them contaminates orchestrator context, so loading the skill there is a methodology violation. The orchestrator detects a failure and dispatches a subagent; the subagent loads this skill. The previous harness-side guard was retired in 0.3.6; respect the convention by dispatching a subagent.

Diagnose failing Playwright tests through structured evidence-based triage. Activates inside subagent context (composer-/probe-/process-validator-/cleanup- prefixes) when a test fails during any mode (authoring, maintenance, test-composer, bug-discovery), when the dispatching brief says "test is failing", "debug this", "why is this failing", or "fix this test", or when another companion skill encounters a test failure. Guides the agent through screenshot analysis, DOM inspection, and root cause hypothesis, then fixes test issues autonomously or reports app bugs with evidence.

Auto-invoked via Skill tool from element-interactions Rule 7 (after the orchestrator dispatches the subagent), test-composer's stabilization loop, bug-discovery's adversarial probes, test-repair's per-cluster diagnosis, and any subagent that runs tests and observes a failure — those callers explicitly route to this skill rather than relying on always-load.
subagent-only: true

Activation banner: The first user-facing reply after this skill loads MUST begin with the line: Protocol Achilles activated. Once per session — skip if already declared in this conversation. Subagents (which return structured data, not user-facing text) are exempt.

Failure Diagnosis

A structured diagnostic protocol for failing Playwright tests. Every failure gets the full pipeline — no "retry and hope."

When This Activates

A test run produces failures (from any mode)
User says "test is failing", "debug this", "why is this failing", "fix this test"
Another companion skill encounters a failure during its workflow (exception: companion-mode treats Phase-4 failures as a first-class bundle outcome — failure-diagnosis is only its Phase-6 handoff, on explicit user assent)

Diagnostic Pipeline

Stage 0 — Context Pre-Read (mandatory)

Before collecting evidence on the failing test, read what the project already documents. Skipping this stage is how confidently-wrong "app bug" classifications get published — you compare the screenshot against your recollection of the page instead of against what the project already specifies.

Methodology rule. Failure-diagnosis edits and bug-report writes that skip the documented context pre-read produce confidently-wrong "app bug" classifications. The previous harness backstop that blocked these writes was retired in the 0.3.6 cleanup; the pre-read remains mandatory.

tests/e2e/docs/app-context.md — page structures, intended modal lifecycles, data-qa selectors, known UI quirks (configuration-dependent option subsets, redirect-vs-popup auth patterns, vendor-aliased payment / shipping methods, async-loaded modal placeholders, documented degradation banners, etc.). Read the section for the page the test was on at the moment of failure. This is where the failing element's intended behaviour is documented.
tests/e2e/docs/test-scenarios.md — the regression / scenario matrix. Confirms whether the failing scenario is even supposed to run on this configuration in the first place.
tests/e2e/docs/journey-map.md (when present) — the user journey map produced by the journey-mapping skill. Tells you whether the app's flow has changed since the test was written.
tests/data/page-repository.json — the locator entries for the page in question. A stale or missing entry is one of the most common true root causes.

Capture, in plain text, the documented expectations relevant to the failing step. The rest of the diagnostic pipeline is then comparing observed state against documented state — not against your recollection of what the page should do.

If any of these files are missing for the project, note that in the evidence package and consider whether the right escalation is to re-run journey-mapping on the relevant page rather than diagnose blind.
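The pre-read check can be sketched as a small helper that lists which documented context files are absent, so the gap can be noted in the evidence package. This is a minimal sketch, not part of the skill's tooling: `missingContextFiles` and the `exists` callback are hypothetical names, and treating `journey-map.md` as optional is an assumption drawn from the "(when present)" wording above.

```typescript
interface ContextFile {
  path: string;
  optional: boolean;
}

// Stage 0 context files, copied from the list above.
const CONTEXT_FILES: ContextFile[] = [
  { path: "tests/e2e/docs/app-context.md", optional: false },
  { path: "tests/e2e/docs/test-scenarios.md", optional: false },
  { path: "tests/e2e/docs/journey-map.md", optional: true }, // "when present"
  { path: "tests/data/page-repository.json", optional: false },
];

// `exists` is injected so the sketch does not depend on a file-system API;
// in a Node script it could be backed by fs.existsSync.
function missingContextFiles(exists: (path: string) => boolean): ContextFile[] {
  return CONTEXT_FILES.filter((f) => !exists(f.path));
}
```

Any required file in the result is a reason to consider re-running journey-mapping before diagnosing.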

Stage 1 — Collect Evidence

Do NOT guess from the error message alone. Collect visual and structural evidence first.

Read the error message and stack trace. Note the test file, line number, step name, and error type.
Open the Playwright HTML report. Run npx playwright show-report and read the failure screenshot from the on-disk report directory (playwright-report/data/...). The base fixture captures a failure-screenshot on every failure automatically — open the file directly with the Read tool.
Describe what the screenshot shows. State explicitly: page state, visible elements, error messages, unexpected UI, loading indicators, overlays. Write this down — it informs every subsequent decision.
If the screenshot is insufficient: use @playwright/cli (see ../element-interactions/references/playwright-cli-protocol.md) to navigate to the failing page URL and take a fresh snapshot — npx playwright-cli -s=fd-<short-slug> open --browser=chromium <URL> then npx playwright-cli -s=fd-<short-slug> snapshot. Inspect the DOM for the element the test was trying to interact with.
Check the error context file. Failed tests produce an error-context.md in test-results/ — read it for additional diagnostic information.

Stage 2 — Group Failures

Before diagnosing individually, look at the big picture:

Scan all failures in the test run output.
Group by likely root cause:
- Same missing page/element in the repository → single repo issue
- Same page failing to load → navigation or app issue
- Same timeout pattern → timing or environment issue
- Same API misuse pattern → test code issue
Prioritize: Fix the root cause that unblocks the most tests first. A single missing page-repository entry might cause 10 failures — fix it once, not 10 times.
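The grouping and prioritization above can be sketched as follows. This is an illustrative sketch only: the `Failure` shape and the `signature` key (a normalized root-cause label such as a missing repository entry) are hypothetical, and how a signature is derived from a real error is left out.

```typescript
interface Failure {
  test: string;
  // Hypothetical normalized root-cause key, e.g. "repo-missing:CheckoutPage.submit".
  signature: string;
}

interface FailureGroup {
  signature: string;
  tests: string[];
}

// Group failures by signature, then order groups so the root cause that
// unblocks the most tests comes first.
function groupFailures(failures: Failure[]): FailureGroup[] {
  const bySignature: Record<string, string[]> = {};
  for (const f of failures) {
    if (bySignature[f.signature] === undefined) {
      bySignature[f.signature] = [];
    }
    (bySignature[f.signature] as string[]).push(f.test);
  }
  return Object.keys(bySignature)
    .map((signature) => ({ signature, tests: bySignature[signature] as string[] }))
    .sort((a, b) => b.tests.length - a.tests.length);
}
```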

Stage 3 — Classify

Determine whether each failure group is a test issue, app bug, or ambiguous. You must meet the burden of proof before classifying.

Test Issue — fix autonomously

All of the following must be true:

Screenshot shows the page loaded correctly and the expected UI matches what app-context.md describes for this page (Stage 0 read this), and the expected element is visible in the DOM at the documented selector
Error is traceable to test code: wrong selector, wrong param order, missing wait, stale repo entry, API misuse, incorrect assertion
DOM inspection confirms the element exists but the test targeted it incorrectly

Common test issues:

Wrong (elementName, pageName) argument order
Missing or stale page-repository.json entry
Missing waitForState or waitForNetworkIdle before interaction
Hardcoded assertion value that doesn't match dynamic content
Test isolation problem — stale cookies/localStorage from prior test
Navigation race — test interacts before page finishes loading

App Bug — hard stop, report to user

At least one of the following must be true:

Screenshot shows unexpected UI state (blank page, error message, broken layout, wrong content displayed)
DOM inspection confirms the element genuinely doesn't exist or the app produces incorrect output
The test logic is correct per the scenario — the app simply doesn't do what it should

Additionally: the bug must be reproducible (not a one-off network blip). Navigate to the page manually via playwright-cli (-s=fd-<short-slug> open ..., then goto/snapshot/click) to confirm.

When you identify an app bug: STOP. Do NOT modify the test to accommodate the bug. Report it (see Stage 6).

Ambiguous — escalate to user

Evidence supports both interpretations
The app changed intentionally but tests weren't updated (is this a test issue or a spec change?)
Present all evidence and ask the user to classify before acting
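The burden-of-proof rules can be written down as a decision function. This is a sketch under stated assumptions: the `Evidence` field names are hypothetical, and returning "ambiguous" when neither set of criteria is met (or when both are) is an interpretation of the rules above, not something the skill spells out.

```typescript
interface Evidence {
  // Test-issue criteria: all must hold.
  uiMatchesAppContext: boolean;
  errorTraceableToTestCode: boolean;
  elementExistsButMistargeted: boolean;
  // App-bug criteria: at least one must hold.
  unexpectedUiState: boolean;
  elementMissingOrOutputIncorrect: boolean;
  testLogicCorrectPerScenario: boolean;
  // Additional app-bug requirement: confirmed by a manual reproduction.
  reproducible: boolean;
}

type Classification = "test-issue" | "app-bug" | "ambiguous";

function classify(e: Evidence): Classification {
  const testIssue =
    e.uiMatchesAppContext && e.errorTraceableToTestCode && e.elementExistsButMistargeted;
  const appBug =
    (e.unexpectedUiState || e.elementMissingOrOutputIncorrect || e.testLogicCorrectPerScenario) &&
    e.reproducible;
  if (testIssue && !appBug) return "test-issue";
  if (appBug && !testIssue) return "app-bug";
  // Assumption: both or neither satisfied means the evidence does not decide.
  return "ambiguous";
}
```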

Stage 4 — Edge Case Checklist

Before finalizing your classification, run through this checklist:

| Edge Case | What to Check | Likely Classification |
| --- | --- | --- |
| Element obscured/overlapped | Screenshot shows overlays, modals, z-index issues blocking the target element | App bug if the overlay shouldn't be there; test issue if the test forgot to dismiss a dialog or close a modal |
| Timing-dependent content | Screenshot shows loading state, spinner, or skeleton instead of the expected content | Test issue — add explicit `waitForState`, `waitForNetworkIdle`, or `waitForResponse` before the interaction |
| Data-dependent failure | Assertion expects a specific count or text value that doesn't match what's displayed | Check whether the assertion is hardcoded to fragile values; may be either a test issue (use dynamic assertion) or app bug (data is wrong) |
| Environment differences | Failure only in CI, passes locally; or vice versa | Note the environment context; check viewport size, network conditions, base URL differences. Often a test issue — add resilience |
| Partial page load | Page loaded but a specific section didn't render (lazy-loaded component, conditional feature flag) | Inspect DOM for presence of the container; app bug if the component is missing from the DOM, test issue if it needs a wait |
| Stale browser state | Cookies, localStorage, or cached data from a previous test contaminating the current one | Test isolation issue — test issue. Ensure tests don't depend on shared state |
| Navigation race | URL shows an intermediate state; page is mid-redirect when the test tries to interact | Test issue — add `verifyUrlContains` or `waitForState` after navigation |
| Third-party dependency | CDN asset failed, external widget didn't load, embedded iframe timed out | Neither test nor app bug — report as infrastructure/external dependency issue |
| Modal opens but content hangs | Frame mounts but content stays on a spinner sentinel — see `references/niche-edge-cases.md` entry (1) for the disambiguating probe and full prose | App bug — apply Stage 4a heal `(h)` (documented-quirk, no heal) |

Niche edge cases

Failure shapes that LLMs routinely misclassify are catalogued in references/niche-edge-cases.md. Read the relevant entry before classifying when the failure shape doesn't fit Stage 4's table cleanly.

The catalogue is meant to grow. When you resolve a failure whose shape isn't already documented there, append a new entry as part of the same diagnostic session — before closing out / handing back to the caller. The entry costs a few minutes; future sessions (yours, other contributors', other consumers of the package) get to skip the wrong-direction work this entry catalogues.

When to append (criteria — must hold ALL):

You actually misclassified at first (or were close to). The catalogue is for shapes that trap the diagnoser — not for failures whose classification was obvious from the screenshot. If Stage 0 + Stage 4 got you to the right answer cleanly, no entry needed.
The disambiguating probe was non-obvious. The thing you ended up doing — the specific tool call, DOM read, or evidence grab that flipped the classification — is what the next diagnoser most needs. If your probe was just "look at the screenshot more carefully", that's not catalogue-worthy.
The shape is reproducible across consumers, not project-specific. A bug in this app's checkout flow is a project finding, not a niche-edges entry. A bug shape that any consumer of the package could plausibly hit (modal-fetch hangs, role-attribute serialisation, page-repo entry resolves but matches a hidden duplicate, etc.) is.

When all three hold, follow the entry shape documented in references/niche-edge-cases.md §"Adding an entry" (Symptom / Why LLMs struggle / Disambiguating probe / Classification / Cross-link). Keep entries tight — one paragraph per field, not a war story.

Contribution path for promoted entries: see skills/contributing-to-element-interactions/SKILL.md §"Contributing to the niche-edge-cases catalogue" — covers the criteria above, the entry template, and how to ship the change as part of either a normal PR or a standalone docs PR.

Stage 4a — Heal strategy selection

Once you've classified the failure as a test issue and checked edge cases, pick a healing strategy. Every heal has a precondition, an autonomy level, and a clear scope — applying the wrong one is how bugs get masked.

| Heal | Autonomy | Precondition | What it does |
| --- | --- | --- | --- |
| a. Selector re-learn | Auto | Page-repo lookup failed; live DOM has a close match by text/role/landmark; screenshot shows correct UI otherwise | Update `page-repository.json` with the re-learned selector; immediate confirmation run |
| b. Timing hardening | Auto | Intermittent timeout on a known-good element; screenshot shows correct UI (no error state); no flow drift detected | Add `waitForState` / `waitForNetworkIdle` before the interaction; bump a bounded timeout |
| c. Flow-step drift | Propose | App shows an extra/missing/reordered step between expected actions; screenshot confirms correct page state at each step the app does reach | Present the detected flow diff to the operator; apply on approval |
| d. Assertion re-baseline | Propose | Hardcoded literal no longer matches; UI state around the assertion is otherwise correct | Present old vs new value to the operator; apply on approval |
| e. State isolation | Auto | Test passes when run alone, fails when run after specific predecessors (verified empirically) | Add fresh context / storage reset / cleanup hook; re-run in suite order |
| f. Flake quarantine | Report | Flake persisted after two heal attempts of different strategies; root cause unclear | Tag test `@flaky`, append an entry to the quarantine ledger (see §Quarantine ledger below), add to repair summary with diagnostic notes; do NOT silently skip |
| g. Whole-test rewrite | Operator-aligned | Flow changed so fundamentally that the scenario no longer maps to the app as-is; no incremental heal applies | Present to operator; on approval, invoke `test-composer` with journey context. Never regenerate without alignment. |
| h. Documented-quirk match — no heal | Report | The observed failure shape exactly matches a documented quirk in `app-context.md` (configuration-dependent option subsets, redirect-vs-popup auth patterns, vendor-aliased options, etc.) OR matches a documented app-degradation signal (a degradation-banner copy string from `app-context.md`'s documented-banners list, the documented hanging spinner-sentinel custom element, 5xx in network capture) | Report observed-vs-documented diff; do NOT modify the test. The skip / failure is correct; the regression is in the app or in the documentation. Cross-link the relevant `app-context.md` section in the report. |

Selection rules (apply in order, stop at first match):

If the observed failure shape exactly matches a documented quirk or app-degradation signal recorded in Stage 0's app-context.md read → (h) documented-quirk match → report; do NOT heal.
If screenshot shows wrong UI (500, error page, broken layout, missing-that-should-be-present component) → app bug, go to Stage 6. Do not heal.
If page-repo lookup failed → (a) selector re-learn → proceed to Stage 4b
If timeout on a known-good element with correct surrounding state → (b) timing hardening
If pattern hypothesis (from test-repair if present) or empirical check says "state leak" → (e) state isolation
If live DOM shows step order does not match test sequence → (c) flow drift → propose
If assertion failure on a specific literal with otherwise-correct surrounding state → (d) re-baseline → propose
If two heal strategies have been attempted and the test still flakes → (f) quarantine
If the test scenario no longer maps to the app flow → (g) rewrite → operator-align

The precondition columns exist to keep you honest: any heal applied without meeting its precondition is a guess, and guesses mask bugs.
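Because the selection rules are ordered and stop at the first match, they map naturally onto a chain of early returns. The sketch below mirrors the rule order above; the `FailureSignals` field names and the `"no-match"` fallback are hypothetical, and establishing each signal still requires the evidence described in the precondition column.

```typescript
type HealOutcome = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "app-bug" | "no-match";

interface FailureSignals {
  matchesDocumentedQuirk: boolean;
  screenshotShowsWrongUi: boolean;
  pageRepoLookupFailed: boolean;
  timeoutOnKnownGoodElement: boolean;
  stateLeakIndicated: boolean;
  stepOrderMismatch: boolean;
  literalAssertionMismatch: boolean;
  twoHealsTriedStillFlaky: boolean;
  scenarioNoLongerMaps: boolean;
}

// Rules are checked in the documented order; the first match wins.
function selectHeal(s: FailureSignals): HealOutcome {
  if (s.matchesDocumentedQuirk) return "h"; // report, do not heal
  if (s.screenshotShowsWrongUi) return "app-bug"; // go to Stage 6
  if (s.pageRepoLookupFailed) return "a";
  if (s.timeoutOnKnownGoodElement) return "b";
  if (s.stateLeakIndicated) return "e";
  if (s.stepOrderMismatch) return "c";
  if (s.literalAssertionMismatch) return "d";
  if (s.twoHealsTriedStillFlaky) return "f";
  if (s.scenarioNoLongerMaps) return "g";
  return "no-match"; // assumption: no rule applies, so re-examine the evidence
}
```

The ordering matters: a failed page-repo lookup on a page that also shows a 500 error resolves to an app bug, not to a selector re-learn.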

Quarantine ledger (heal (f) only)

Heal (f) appends an entry to the quarantine ledger at tests/e2e/docs/flake-quarantine.md. The ledger is committed, not gitignored — quarantine is cross-session state that the next test-repair session must see.

Ledger header (first lines of the file):

# Flake quarantine ledger
<!-- Written by failure-diagnosis heal (f); released by test-repair Stage 5.5.
     Out-of-band shell edits are denied by hooks/protected-artifact-bash-guard.sh. -->

Entry template (one per quarantined test):

### `tests/<file>.spec.ts::<test-name>`
- **Quarantined:** YYYY-MM-DD
- **Failure-shape:** flaky-consistent | flaky-chaotic
- **Heal attempts:** <strategy 1>, <strategy 2> — both destabilized
- **Error signature:** <one-line dominant error when failing>
- **Diagnostic notes:** <what the evidence showed; why root cause is unclear>
- **Observations:** <dated appends from later sessions — e.g. "YYYY-MM-DD: still flaking 1/3 in Stage-1 baseline">
- **Status:** quarantined | unquarantined (YYYY-MM-DD — <evidence: 3/3 baseline + 5/5 suite-order green>)

Ownership is split and the ledger is append-only: failure-diagnosis writes entries (heal (f)); test-repair releases them (its Stage 5.5 quarantine review flips Status: to unquarantined with dated evidence, or appends a still-flaking observation). No other skill edits the ledger, and entries are never deleted; a released entry keeps its history.
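A new ledger entry can be rendered from structured data so that every field in the template is always present. This is a minimal sketch: `QuarantineEntry` and `formatQuarantineEntry` are hypothetical names, and leaving Observations empty on a fresh entry is an assumption based on the template describing it as appends from later sessions.

```typescript
interface QuarantineEntry {
  specPath: string; // e.g. "tests/checkout.spec.ts"
  testName: string;
  quarantinedOn: string; // YYYY-MM-DD
  failureShape: "flaky-consistent" | "flaky-chaotic";
  healAttempts: [string, string]; // the two strategies that destabilized
  errorSignature: string;
  diagnosticNotes: string;
}

// Renders one entry in the template's field order.
function formatQuarantineEntry(e: QuarantineEntry): string {
  return [
    "### `" + e.specPath + "::" + e.testName + "`",
    "- **Quarantined:** " + e.quarantinedOn,
    "- **Failure-shape:** " + e.failureShape,
    // \u2014 is the dash character used in the ledger template.
    "- **Heal attempts:** " + e.healAttempts.join(", ") + " \u2014 both destabilized",
    "- **Error signature:** " + e.errorSignature,
    "- **Diagnostic notes:** " + e.diagnosticNotes,
    "- **Observations:**", // assumption: empty until a later session appends
    "- **Status:** quarantined",
  ].join("\n");
}
```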

Stage 4b — Live DOM re-learning (for heal strategy (a) only)

When the heal strategy is (a) selector re-learn, do NOT guess a replacement selector. Use playwright-cli to open the page at the navigation state where the lookup fails (npx playwright-cli -s=fd-<short-slug> open --browser=chromium <URL> followed by whatever goto / click chain reproduces the failure state), then locate candidates by stable signals.

Exact text match — does the previous selector have known text content? Search the live DOM for an element with the same text.
Role + accessible name fuzzy match — e.g. previous target was a button labeled "Submit"; find a role="button" whose name contains "Submit" (or close variants like "Place Order", "Confirm").
Nearby landmark stability — previous target was "the button inside the section with heading 'Shipping'"; find the current equivalent via the stable landmark.
Attribute overlap — shared data-testid family, shared class prefix, shared id pattern.

Confidence thresholds:

High confidence (text match + role match + landmark match all agree) → update page-repository.json atomically, run the test immediately to confirm.
Multiple competing candidates → escalate to the operator with the candidate list; do not guess between them.
No candidate found → the element likely genuinely disappeared. Re-classify as either (c) flow drift (something replaced it) or app bug (component missing that should be present) using the screenshot evidence as the tiebreaker.
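The confidence thresholds reduce to a small decision over the candidate list. In this sketch the `RelearnCandidate` shape is hypothetical, and escalating a single candidate whose signals do not all agree is an assumption, since the thresholds above only define the all-agree, multiple-candidate, and no-candidate cases.

```typescript
interface RelearnCandidate {
  selector: string;
  textMatch: boolean;
  roleMatch: boolean;
  landmarkMatch: boolean;
}

type RelearnAction = "update-repo" | "escalate" | "reclassify";

function decideRelearn(candidates: RelearnCandidate[]): RelearnAction {
  const first = candidates[0];
  // No candidate: the element likely disappeared, so re-classify
  // as (c) flow drift or app bug using the screenshot.
  if (first === undefined) return "reclassify";
  // Competing candidates: never guess between them.
  if (candidates.length > 1) return "escalate";
  // High confidence requires text, role, and landmark to agree.
  if (first.textMatch && first.roleMatch && first.landmarkMatch) return "update-repo";
  // Assumption: a lone partial match is not enough to update automatically.
  return "escalate";
}
```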

Root cause: fragile selector

If triage attributes the failure to a fragile selector (text drift, position-dependent CSS, role/name collision), check workspace shape before selecting a heal strategy:

Frontend source in workspace — package.json lists the UI framework as a dependency and a src/-style tree of .tsx/.jsx/.vue/.svelte/.html files is present:

→ Dispatch selector-development (mode: "jit", scope = the element-key whose locator failed). After it returns, replace the test's locator with the new test-attribute selector and re-run. Then continue from Stage 5 (stability validation) as normal.

Frontend source NOT in workspace:

→ Report the fragile selector to the user as an actionable test-debt item. Do NOT attempt to harden the locator with compound selectors, nth-child chains, or XPath depth — that adds brittleness without adding stability. The report should name the element-key, the fragile signal (text drift / CSS position / role collision), and that selector-development cannot help because the source files are not available in this workspace.
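The workspace-shape check can be sketched as a pure function over the dependency names and file paths. The framework list and the restriction to paths under `src/` are assumptions made for illustration; the skill only says "the UI framework" and "a src/-style tree", so a real check may need to be broader.

```typescript
// Hypothetical framework list; adjust to the project's actual UI framework.
const UI_FRAMEWORKS = ["react", "vue", "svelte", "@angular/core", "preact", "solid-js"];

// Assumption: frontend source lives under src/ with one of the listed extensions.
const FRONTEND_SOURCE = /^src\/.*\.(tsx|jsx|vue|svelte|html)$/;

type FragileSelectorRoute = "dispatch-selector-development" | "report-test-debt";

function fragileSelectorRoute(
  dependencies: string[],
  workspaceFiles: string[],
): FragileSelectorRoute {
  const hasFramework = dependencies.some((d) => UI_FRAMEWORKS.indexOf(d) !== -1);
  const hasSource = workspaceFiles.some((f) => FRONTEND_SOURCE.test(f));
  // Both conditions must hold before selector-development can help.
  return hasFramework && hasSource ? "dispatch-selector-development" : "report-test-debt";
}
```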

Stage 5 — Fix and stability (test issues only)

Apply the fix per the heal strategy selected in Stage 4a. Use the Steps API correctly — refer to ../element-interactions/references/api-reference.md for all method signatures.
If the fix requires new selectors: Stage 4b has produced the proposal. For Auto strategies the update applies directly; for Propose strategies confirm with the operator first.
Run the test 3-5 times to confirm stability. A single pass is not sufficient — flaky tests are worse than failing tests.
```
# Run the specific test file five times; stop at the first failing run
for i in {1..5}; do
  npx playwright test <test-file> --reporter=line || { echo "stability run $i failed"; break; }
done
```
Only commit after all stability runs pass.
If any stability run fails: revert the heal, then re-enter the diagnostic pipeline from Stage 1, because the heal was incomplete. If a second strategy also destabilizes, escalate to (f) flake quarantine rather than trying a third heal: two failed strategies are a signal that single-failure mode is insufficient for this test.

Stage 6 — Report (app bugs only)

Present the bug report to the user with this structure:

Application Bug Report

Test: tests/example.spec.ts — TC_001: Login flow
Step: "Verify dashboard loads after login"

Expected: Dashboard page loads with welcome message and user stats
Actual: Page shows "500 Internal Server Error"

Screenshot: —
DOM evidence: <error-context.md or snapshot path>
Severity:
Environment: <base URL + browser>
Journey: j- (when journey-map.md present)
Reproducible: Yes — confirmed by navigating manually via playwright-cli

This is an application bug. The test has NOT been modified.

Hard rule: every app-bug report MUST cite at least one on-disk artifact path (screenshot, error-context.md, snapshot, or capture file). Prose supplements the artifact, never replaces it.

Do NOT modify the test to work around the bug. Do NOT skip the test. Do NOT add try/catch blocks to swallow the error. Report and stop.
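The hard rule on artifacts, together with the reproducibility requirement from Stage 3, can be checked before a report is presented. This is an illustrative sketch; `AppBugReport` and `reportProblems` are hypothetical names, and the function only checks that a path is cited, not that the file exists on disk.

```typescript
interface AppBugReport {
  test: string;
  step: string;
  expected: string;
  actual: string;
  artifactPaths: string[]; // screenshot, error-context.md, snapshot, or capture file
  reproducible: boolean;
}

// Returns the reasons a report is not ready; an empty list means it can be presented.
function reportProblems(r: AppBugReport): string[] {
  const problems: string[] = [];
  const cited = r.artifactPaths.filter((p) => p.trim().length > 0);
  if (cited.length === 0) {
    problems.push("no on-disk artifact path cited");
  }
  if (!r.reproducible) {
    problems.push("bug not confirmed reproducible");
  }
  return problems;
}
```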

Stability Validation Protocol

A fix is confirmed only when the test passes 3-5 consecutive runs without failure. This catches:

Race conditions that pass 80% of the time
Timing-sensitive tests that work on fast machines but fail under load
State leakage between tests that only manifests on repeated runs

If any run in the stability check fails, the fix is incomplete. Do not commit — re-diagnose.
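A short calculation shows why one green run proves little. Assuming runs are independent (a simplification, since real flakes often correlate with load or ordering), a test that passes a fraction p of the time survives n consecutive runs with probability p^n, so the chance of catching it is 1 - p^n.

```typescript
// Probability that at least one of `runs` independent runs fails,
// for a test whose per-run pass rate is `passRate`.
function probabilityFlakeCaught(passRate: number, runs: number): number {
  return 1 - Math.pow(passRate, runs);
}

// For the 80% race condition mentioned above:
console.log(probabilityFlakeCaught(0.8, 1).toFixed(3)); // 0.200
console.log(probabilityFlakeCaught(0.8, 3).toFixed(3)); // 0.488
console.log(probabilityFlakeCaught(0.8, 5).toFixed(3)); // 0.672
```

Under this assumption even five runs miss such a flake about a third of the time, which is why a later failure on the same test should be treated as new evidence and not dismissed.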

Integration

Skills that call this one

| Calling Skill | Activation Point | What Happens Next |
| --- | --- | --- |
| `maintenance` | First step when a test failure is reported | After heal + stability → return for compliance review + commit |
| `authoring` | When a newly written test fails in Stage 3 | After heal + stability → return for compliance review + commit |
| `test-composer` | When a test run produces failures | After heal + stability → return for next scenario |
| `bug-discovery` | When adversarial tests fail | After heal + stability OR bug report → return to caller |
| `test-repair` | Per cluster in its Stage 4 (batch repair pipeline) | Diagnose the cluster's representative, apply heal once for the whole cluster, return outcome (Healed / App bug / Operator-pending / Quarantined) |

After a successful heal + stability confirmation, control returns to the calling skill.

Escalating up to test-repair

Sometimes single-failure mode isn't the right shape. Hand off to test-repair when the failure is not really a single event:

| Condition | Why escalate |
| --- | --- |
| The current run has ≥5 failures or ≥30% of executed tests failed | Per-failure diagnosis doesn't scale; batch clustering finds the shared root cause faster |
| You have been invoked 3+ times in this session on distinct tests | The pattern across failures is likely worth detecting before healing more in isolation |
| A heal you applied caused previously-passing tests to start failing | Cross-test interaction is invisible from here; `test-repair`'s post-heal verification stage is designed for it |
| Two different heal strategies on the same test have both destabilized | Before trying a third, bump up to batch mode — the test's behavior may be coupled to sibling tests |

Announce the escalation once to the operator and start batch mode:

Detected — handing off to the test-repair batch pipeline so we can cluster root causes before continuing to heal individually. Reply "stay single-failure" to override.

The operator can override back to single-failure mode if they have a reason to keep the narrower scope.
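The four escalation conditions can be evaluated from a handful of session counters. The `DiagnosisSession` shape and the reason labels are hypothetical; the thresholds (5 failures, 30% of executed tests, 3 invocations, 2 destabilized strategies) come from the conditions above. The 30% check uses integer arithmetic to avoid floating-point edge cases at exactly 30%.

```typescript
interface DiagnosisSession {
  failedTests: number;
  executedTests: number;
  distinctTestsDiagnosed: number; // invocations on distinct tests this session
  healBrokePassingTests: boolean;
  destabilizedStrategiesOnSameTest: number;
}

// Returns every condition that currently justifies handing off to test-repair.
function escalationReasons(s: DiagnosisSession): string[] {
  const reasons: string[] = [];
  const thirtyPercentFailed =
    s.executedTests > 0 && s.failedTests * 10 >= s.executedTests * 3;
  if (s.failedTests >= 5 || thirtyPercentFailed) reasons.push("run-wide failures");
  if (s.distinctTestsDiagnosed >= 3) reasons.push("repeated invocations");
  if (s.healBrokePassingTests) reasons.push("heal caused regressions");
  if (s.destabilizedStrategiesOnSameTest >= 2) reasons.push("two strategies destabilized");
  return reasons;
}
```

A non-empty result is the cue to announce the escalation once; the operator can still reply "stay single-failure" to override.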

API Reference

Refer to ../element-interactions/references/api-reference.md for all method signatures, argument orders, and types. All Steps methods use (elementName, pageName) order.