name: failure-diagnosis
description: >
  Subagent-only. Do not load in the orchestrator's transcript: the diagnostic methodology and niche-edge-cases catalogue are heavy enough that inlining them contaminates orchestrator context. The orchestrator detects a failure and dispatches a subagent; the subagent loads this skill. Loading this skill into orchestrator context is a methodology violation. The previous harness-side guard was retired in 0.3.6; respect the convention by dispatching a subagent.

  Diagnose failing Playwright tests through structured evidence-based triage. Activates inside subagent context (composer-/probe-/process-validator-/cleanup- prefixes) when a test fails during any mode (authoring, maintenance, test-composer, bug-discovery), or when the dispatching brief says "test is failing", "debug this", "why is this failing", "fix this test", or when another companion skill encounters a test failure. Guides the agent through screenshot analysis, DOM inspection, root cause hypothesis, then fixes test issues autonomously or reports app bugs with evidence.

  Auto-invoked via Skill tool from element-interactions Rule 7 (after the orchestrator dispatches the subagent), test-composer's stabilization loop, bug-discovery's adversarial probes, test-repair's per-cluster diagnosis, and any subagent that runs tests and observes a failure; those callers explicitly route to this skill rather than relying on always-load.
subagent-only: true
Activation banner: The first user-facing reply after this skill loads MUST begin with the line: Protocol Achilles activated. Once per session — skip if already declared in this conversation. Subagents (which return structured data, not user-facing text) are exempt.
Failure Diagnosis
A structured diagnostic protocol for failing Playwright tests. Every failure gets the full pipeline — no "retry and hope."
When This Activates
- A test run produces failures (from any mode)
- User says "test is failing", "debug this", "why is this failing", "fix this test"
- Another companion skill encounters a failure during its workflow (exception: `companion-mode` treats Phase-4 failures as a first-class bundle outcome; failure-diagnosis is only its Phase-6 handoff, on explicit user assent)
Diagnostic Pipeline
Stage 0 — Context Pre-Read (mandatory)
Before collecting evidence on the failing test, read what the project already documents. Skipping this stage is how confidently-wrong "app bug" classifications get published — you compare the screenshot against your recollection of the page instead of against what the project already specifies.
Methodology rule. Failure-diagnosis edits and bug-report writes that skip the documented context pre-read produce confidently-wrong "app bug" classifications. The previous harness backstop that blocked these writes was retired in the 0.3.6 cleanup; the pre-read remains mandatory.
- `tests/e2e/docs/app-context.md`: page structures, intended modal lifecycles, `data-qa` selectors, known UI quirks (configuration-dependent option subsets, redirect-vs-popup auth patterns, vendor-aliased payment / shipping methods, async-loaded modal placeholders, documented degradation banners, etc.). Read the section for the page the test was on at the moment of failure. This is where the failing element's intended behaviour is documented.
- `tests/e2e/docs/test-scenarios.md`: the regression / scenario matrix. Confirms whether the failing scenario is even supposed to run on this configuration in the first place.
- `tests/e2e/docs/journey-map.md` (when present): the user journey map produced by the `journey-mapping` skill. Tells you whether the app's flow has changed since the test was written.
- `tests/data/page-repository.json`: the locator entries for the page in question. A stale or missing entry is one of the most common true root causes.
Capture, in plain text, the documented expectations relevant to the failing step. The rest of the diagnostic pipeline is then comparing observed state against documented state — not against your recollection of what the page should do.
If any of these files are missing for the project, note that in the evidence package and consider whether the right escalation is to re-run journey-mapping on the relevant page rather than diagnose blind.
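The missing-file check above can be sketched as a small pure function. This is a minimal illustration, not part of the skill: `PRE_READ_DOCS` and `missingPreReadDocs` are hypothetical names, and treating only the journey map as optional is an assumption drawn from its "when present" note.

```typescript
// Stage 0 documents. Only the journey map is described as optional ("when present").
const PRE_READ_DOCS = [
  { path: "tests/e2e/docs/app-context.md", optional: false },
  { path: "tests/e2e/docs/test-scenarios.md", optional: false },
  { path: "tests/e2e/docs/journey-map.md", optional: true },
  { path: "tests/data/page-repository.json", optional: false },
];

// Hypothetical helper: given the paths that exist in the project,
// return the required pre-read documents that are missing.
function missingPreReadDocs(existingPaths: string[]): string[] {
  return PRE_READ_DOCS
    .filter((doc) => !doc.optional && existingPaths.indexOf(doc.path) === -1)
    .map((doc) => doc.path);
}
```

Any path this returns belongs in the evidence package, alongside the decision on whether to re-run journey-mapping before diagnosing.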
Stage 1 — Collect Evidence
Do NOT guess from the error message alone. Collect visual and structural evidence first.
- Read the error message and stack trace. Note the test file, line number, step name, and error type.
- Open the Playwright HTML report. Run `npx playwright show-report` and read the failure screenshot from the on-disk report directory (`playwright-report/data/...`). The base fixture captures a `failure-screenshot` on every failure automatically; open the file directly with the Read tool.
- Describe what the screenshot shows. State explicitly: page state, visible elements, error messages, unexpected UI, loading indicators, overlays. Write this down, because it informs every subsequent decision.
- If the screenshot is insufficient: use `@playwright/cli` (see `../element-interactions/references/playwright-cli-protocol.md`) to navigate to the failing page URL and take a fresh snapshot: `npx playwright-cli -s=fd-<short-slug> open --browser=chromium <URL>`, then `npx playwright-cli -s=fd-<short-slug> snapshot`. Inspect the DOM for the element the test was trying to interact with.
- Check the error context file. Failed tests produce an `error-context.md` in `test-results/`; read it for additional diagnostic information.
Stage 2 — Group Failures
Before diagnosing individually, look at the big picture:
- Scan all failures in the test run output.
- Group by likely root cause:
- Same missing page/element in the repository → single repo issue
- Same page failing to load → navigation or app issue
- Same timeout pattern → timing or environment issue
- Same API misuse pattern → test code issue
- Prioritize: Fix the root cause that unblocks the most tests first. A single missing page-repository entry might cause 10 failures — fix it once, not 10 times.
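The grouping and prioritization steps above can be sketched as follows. This is a minimal sketch under assumptions: the `Failure` shape and the `rootCauseHint` string are hypothetical, standing in for whatever signature you extract from the run output (missing repository entry, page that failed to load, timeout pattern, API misuse).

```typescript
// Hypothetical shape for one failure parsed from the test run output.
interface Failure {
  test: string;
  rootCauseHint: string; // e.g. "missing-repo-entry:CheckoutPage.submitButton"
}

interface FailureGroup {
  rootCauseHint: string;
  tests: string[];
}

// Group failures by shared root-cause hint, largest group first,
// so the fix that unblocks the most tests is handled first.
function groupFailures(failures: Failure[]): FailureGroup[] {
  const byHint = new Map<string, string[]>();
  for (const failure of failures) {
    const tests = byHint.get(failure.rootCauseHint) ?? [];
    tests.push(failure.test);
    byHint.set(failure.rootCauseHint, tests);
  }
  return Array.from(byHint.entries())
    .map(([rootCauseHint, tests]) => ({ rootCauseHint, tests }))
    .sort((a, b) => b.tests.length - a.tests.length);
}
```

The first group in the result is the one to diagnose first; ten failures sharing one missing repository entry collapse into a single item of work.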
Stage 3 — Classify
Determine whether each failure group is a test issue, app bug, or ambiguous. You must meet the burden of proof before classifying.
Test Issue — fix autonomously
All of the following must be true:
- Screenshot shows the page loaded correctly and the expected UI matches what `app-context.md` describes for this page (Stage 0 read this), and the expected element is visible in the DOM at the documented selector
- Error is traceable to test code: wrong selector, wrong param order, missing wait, stale repo entry, API misuse, incorrect assertion
- DOM inspection confirms the element exists but the test targeted it incorrectly
Common test issues:
- Wrong `(elementName, pageName)` argument order
- Missing or stale `page-repository.json` entry
- Missing `waitForState` or `waitForNetworkIdle` before interaction
- Hardcoded assertion value that doesn't match dynamic content
- Test isolation problem — stale cookies/localStorage from prior test
- Navigation race — test interacts before page finishes loading
App Bug — hard stop, report to user
At least one of the following must be true:
- Screenshot shows unexpected UI state (blank page, error message, broken layout, wrong content displayed)
- DOM inspection confirms the element genuinely doesn't exist or the app produces incorrect output
- The test logic is correct per the scenario — the app simply doesn't do what it should
Additionally: the bug must be reproducible (not a one-off network blip). Navigate to the page manually via playwright-cli (-s=fd-<short-slug> open ..., then goto/snapshot/click) to confirm.
When you identify an app bug: STOP. Do NOT modify the test to accommodate the bug. Report it (see Stage 6).
Ambiguous — escalate to user
- Evidence supports both interpretations
- The app changed intentionally but tests weren't updated (is this a test issue or a spec change?)
- Present all evidence and ask the user to classify before acting
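The burden-of-proof logic for the three classifications can be written out as a sketch. The `Evidence` flags are hypothetical names, one per criterion listed above. Returning "ambiguous" when neither burden is met is an assumption; the text only states it for the case where evidence supports both.

```typescript
type Classification = "test-issue" | "app-bug" | "ambiguous";

// Hypothetical evidence flags, one per criterion in the Stage 3 lists.
interface Evidence {
  // Test-issue criteria (all three required)
  uiMatchesAppContext: boolean;
  errorTraceableToTestCode: boolean;
  elementExistsButMistargeted: boolean;
  // App-bug criteria (at least one required)
  unexpectedUiState: boolean;
  elementMissingOrWrongOutput: boolean;
  testLogicCorrectPerScenario: boolean;
  // Required on top of any app-bug criterion
  reproducible: boolean;
}

function classify(e: Evidence): Classification {
  const testIssue =
    e.uiMatchesAppContext &&
    e.errorTraceableToTestCode &&
    e.elementExistsButMistargeted;
  const appBug =
    (e.unexpectedUiState ||
      e.elementMissingOrWrongOutput ||
      e.testLogicCorrectPerScenario) &&
    e.reproducible;
  if (testIssue && !appBug) return "test-issue";
  if (appBug && !testIssue) return "app-bug";
  // Both burdens met, or neither: escalate rather than guess (assumption for "neither").
  return "ambiguous";
}
```

Note the asymmetry: a test issue needs every criterion, while an app bug needs one criterion plus reproducibility.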
Stage 4 — Edge Case Checklist
Before finalizing your classification, run through this checklist:
| Edge Case | What to Check | Likely Classification |
|---|---|---|
| Element obscured/overlapped | Screenshot shows overlays, modals, z-index issues blocking the target element | App bug if the overlay shouldn't be there; test issue if the test forgot to dismiss a dialog or close a modal |
| Timing-dependent content | Screenshot shows loading state, spinner, or skeleton instead of the expected content | Test issue — add explicit waitForState, waitForNetworkIdle, or waitForResponse before the interaction |
| Data-dependent failure | Assertion expects a specific count or text value that doesn't match what's displayed | Check whether the assertion is hardcoded to fragile values; may be either a test issue (use dynamic assertion) or app bug (data is wrong) |
| Environment differences | Failure only in CI, passes locally; or vice versa | Note the environment context; check viewport size, network conditions, base URL differences. Often a test issue — add resilience |
| Partial page load | Page loaded but a specific section didn't render (lazy-loaded component, conditional feature flag) | Inspect DOM for presence of the container; app bug if the component is missing from the DOM, test issue if it needs a wait |
| Stale browser state | Cookies, localStorage, or cached data from a previous test contaminating the current one | Test isolation issue — test issue. Ensure tests don't depend on shared state |
| Navigation race | URL shows an intermediate state; page is mid-redirect when the test tries to interact | Test issue — add verifyUrlContains or waitForState after navigation |
| Third-party dependency | CDN asset failed, external widget didn't load, embedded iframe timed out | Neither test nor app bug — report as infrastructure/external dependency issue |
| Modal opens but content hangs | Frame mounts but content stays on a spinner sentinel; see `references/niche-edge-cases.md` entry (1) for the disambiguating probe and full prose | App bug: apply Stage 4a heal (h) (documented-quirk, no heal) |
Niche edge cases
Failure shapes that LLMs routinely misclassify are catalogued in references/niche-edge-cases.md. Read the relevant entry before classifying when the failure shape doesn't fit Stage 4's table cleanly.
The catalogue is meant to grow. When you resolve a failure whose shape isn't already documented there, append a new entry as part of the same diagnostic session — before closing out / handing back to the caller. The entry costs a few minutes; future sessions (yours, other contributors', other consumers of the package) get to skip the wrong-direction work this entry catalogues.
When to append (criteria — must hold ALL):
- You actually misclassified at first (or were close to). The catalogue is for shapes that trap the diagnoser — not for failures whose classification was obvious from the screenshot. If Stage 0 + Stage 4 got you to the right answer cleanly, no entry needed.
- The disambiguating probe was non-obvious. The thing you ended up doing — the specific tool call, DOM read, or evidence grab that flipped the classification — is what the next diagnoser most needs. If your probe was just "look at the screenshot more carefully", that's not catalogue-worthy.
- The shape is reproducible across consumers, not project-specific. A bug in this app's checkout flow is a project finding, not a niche-edges entry. A bug shape that any consumer of the package could plausibly hit (modal-fetch hangs, role-attribute serialisation, page-repo entry resolves but matches a hidden duplicate, etc.) is.
When all three hold, follow the entry shape documented in references/niche-edge-cases.md §"Adding an entry" (Symptom / Why LLMs struggle / Disambiguating probe / Classification / Cross-link). Keep entries tight — one paragraph per field, not a war story.
Contribution path for promoted entries: see skills/contributing-to-element-interactions/SKILL.md §"Contributing to the niche-edge-cases catalogue" — covers the criteria above, the entry template, and how to ship the change as part of either a normal PR or a standalone docs PR.
Stage 4a — Heal strategy selection
Once you've classified the failure as a test issue and checked edge cases, pick a healing strategy. Every heal has a precondition, an autonomy level, and a clear scope — applying the wrong one is how bugs get masked.
| Heal | Autonomy | Precondition | What it does |
|---|---|---|---|
| a. Selector re-learn | Auto | Page-repo lookup failed; live DOM has a close match by text/role/landmark; screenshot shows correct UI otherwise | Update page-repository.json with the re-learned selector; immediate confirmation run |
| b. Timing hardening | Auto | Intermittent timeout on a known-good element; screenshot shows correct UI (no error state); no flow drift detected | Add waitForState / waitForNetworkIdle before the interaction; bump a bounded timeout |
| c. Flow-step drift | Propose | App shows an extra/missing/reordered step between expected actions; screenshot confirms correct page state at each step the app does reach | Present the detected flow diff to the operator; apply on approval |
| d. Assertion re-baseline | Propose | Hardcoded literal no longer matches; UI state around the assertion is otherwise correct | Present old vs new value to the operator; apply on approval |
| e. State isolation | Auto | Test passes when run alone, fails when run after specific predecessors (verified empirically) | Add fresh context / storage reset / cleanup hook; re-run in suite order |
| f. Flake quarantine | Report | Flake persisted after two heal attempts of different strategies; root cause unclear | Tag test @flaky, append an entry to the quarantine ledger (see §Quarantine ledger below), add to repair summary with diagnostic notes; do NOT silently skip |
| g. Whole-test rewrite | Operator-aligned | Flow changed so fundamentally that the scenario no longer maps to the app as-is; no incremental heal applies | Present to operator; on approval, invoke test-composer with journey context. Never regenerate without alignment. |
| h. Documented-quirk match (no heal) | Report | The observed failure shape exactly matches a documented quirk in `app-context.md` (configuration-dependent option subsets, redirect-vs-popup auth patterns, vendor-aliased options, etc.) OR matches a documented app-degradation signal (a degradation-banner copy string from `app-context.md`'s documented-banners list, the documented hanging spinner-sentinel custom element, 5xx in network capture) | Report observed-vs-documented diff; do NOT modify the test. The skip / failure is correct; the regression is in the app or in the documentation. Cross-link the relevant `app-context.md` section in the report. |
Selection rules (apply in order, stop at first match):
- If the observed failure shape exactly matches a documented quirk or app-degradation signal recorded in Stage 0's `app-context.md` read → (h) documented-quirk match → report; do NOT heal.
- If screenshot shows wrong UI (500, error page, broken layout, missing-that-should-be-present component) → app bug, go to Stage 6. Do not heal.
- If page-repo lookup failed → (a) selector re-learn → proceed to Stage 4b
- If timeout on a known-good element with correct surrounding state → (b) timing hardening
- If pattern hypothesis (from `test-repair` if present) or empirical check says "state leak" → (e) state isolation
- If live DOM shows step order does not match test sequence → (c) flow drift → propose
- If assertion failure on a specific literal with otherwise-correct surrounding state → (d) re-baseline → propose
- If two heal strategies have been attempted and the test still flakes → (f) quarantine
- If the test scenario no longer maps to the app flow → (g) rewrite → operator-align
The precondition columns exist to keep you honest: any heal applied without meeting its precondition is a guess, and guesses mask bugs.
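The "apply in order, stop at first match" rule can be sketched as an ordered rule table. The `Signals` field names are hypothetical; each one stands for the evidence behind the corresponding rule, and the order of the array is the order of the selection rules.

```typescript
type Outcome = "h" | "app-bug" | "a" | "b" | "e" | "c" | "d" | "f" | "g";

// Hypothetical evidence flags gathered in Stages 0 to 4.
interface Signals {
  matchesDocumentedQuirk: boolean;
  screenshotShowsWrongUi: boolean;
  pageRepoLookupFailed: boolean;
  timeoutOnKnownGoodElement: boolean;
  stateLeakSuspected: boolean;
  stepOrderDrift: boolean;
  literalAssertionMismatch: boolean;
  healStrategiesAttempted: number;
  stillFlaking: boolean;
  scenarioNoLongerMaps: boolean;
}

// Order matters: the documented-quirk and app-bug checks come before any heal.
const SELECTION_RULES: Array<[(s: Signals) => boolean, Outcome]> = [
  [(s) => s.matchesDocumentedQuirk, "h"],
  [(s) => s.screenshotShowsWrongUi, "app-bug"],
  [(s) => s.pageRepoLookupFailed, "a"],
  [(s) => s.timeoutOnKnownGoodElement, "b"],
  [(s) => s.stateLeakSuspected, "e"],
  [(s) => s.stepOrderDrift, "c"],
  [(s) => s.literalAssertionMismatch, "d"],
  [(s) => s.healStrategiesAttempted >= 2 && s.stillFlaking, "f"],
  [(s) => s.scenarioNoLongerMaps, "g"],
];

// Returns the first matching outcome, or null when no rule applies.
function selectHeal(signals: Signals): Outcome | null {
  for (const [matches, outcome] of SELECTION_RULES) {
    if (matches(signals)) return outcome;
  }
  return null;
}
```

Because evaluation stops at the first match, a failure that looks like a stale selector but also matches a documented quirk resolves to (h) and is reported, not healed.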
Quarantine ledger (heal (f) only)
Heal (f) appends an entry to the quarantine ledger at
tests/e2e/docs/flake-quarantine.md. The ledger is committed, not
gitignored — quarantine is cross-session state that the next
test-repair session must see.
Ledger header (first lines of the file):
# Flake quarantine ledger
<!-- Written by failure-diagnosis heal (f); released by test-repair Stage 5.5.
Out-of-band shell edits are denied by hooks/protected-artifact-bash-guard.sh. -->
Entry template (one per quarantined test):
### `tests/<file>.spec.ts::<test-name>`
- **Quarantined:** YYYY-MM-DD
- **Failure-shape:** flaky-consistent | flaky-chaotic
- **Heal attempts:** <strategy 1>, <strategy 2> — both destabilized
- **Error signature:** <one-line dominant error when failing>
- **Diagnostic notes:** <what the evidence showed; why root cause is unclear>
- **Observations:** <dated appends from later sessions — e.g. "YYYY-MM-DD: still flaking 1/3 in Stage-1 baseline">
- **Status:** quarantined | unquarantined (YYYY-MM-DD — <evidence: 3/3 baseline + 5/5 suite-order green>)
Ownership is write-only and split: failure-diagnosis writes
entries (heal (f)); test-repair releases them (its Stage 5.5
quarantine review flips Status: to unquarantined with dated
evidence, or appends a still-flaking observation). No other skill
edits the ledger, and entries are never deleted — a released entry
keeps its history.
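A new ledger entry in the template's shape could be produced along these lines. This is a sketch only: `QuarantineEntry` and `renderQuarantineEntry` are hypothetical names, and a fresh entry is assumed to start with an empty Observations field and a `quarantined` status.

```typescript
// Hypothetical input for a heal (f) ledger entry.
interface QuarantineEntry {
  specFile: string; // file name under tests/, e.g. "checkout.spec.ts"
  testName: string;
  quarantinedOn: string; // YYYY-MM-DD
  failureShape: "flaky-consistent" | "flaky-chaotic";
  healAttempts: [string, string]; // the two strategies that destabilized
  errorSignature: string;
  diagnosticNotes: string;
}

// Render one entry following the ledger's entry template.
// "\u2014" is the dash character the template uses before "both destabilized".
function renderQuarantineEntry(e: QuarantineEntry): string {
  return [
    `### \`tests/${e.specFile}::${e.testName}\``,
    `- **Quarantined:** ${e.quarantinedOn}`,
    `- **Failure-shape:** ${e.failureShape}`,
    `- **Heal attempts:** ${e.healAttempts[0]}, ${e.healAttempts[1]} \u2014 both destabilized`,
    `- **Error signature:** ${e.errorSignature}`,
    `- **Diagnostic notes:** ${e.diagnosticNotes}`,
    `- **Observations:**`,
    `- **Status:** quarantined`,
  ].join("\n");
}
```

The function only ever appends a `quarantined` entry, which mirrors the ownership split: flipping the status belongs to test-repair.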
Stage 4b — Live DOM re-learning (for heal strategy (a) only)
When the heal strategy is (a) selector re-learn, do NOT guess a replacement selector. Use playwright-cli to open the page at the navigation state where the lookup fails (npx playwright-cli -s=fd-<short-slug> open --browser=chromium <URL> followed by whatever goto / click chain reproduces the failure state), then locate candidates by stable signals.
- Exact text match: does the previous selector have known text content? Search the live DOM for an element with the same text.
- Role + accessible name fuzzy match: e.g. previous target was a button labeled "Submit"; find a `role="button"` whose name contains "Submit" (or close variants like "Place Order", "Confirm").
- Nearby landmark stability: previous target was "the button inside the section with heading 'Shipping'"; find the current equivalent via the stable landmark.
- Attribute overlap: shared `data-testid` family, shared class prefix, shared `id` pattern.
Confidence thresholds:
- High confidence (text match + role match + landmark match all agree) → update `page-repository.json` atomically, run the test immediately to confirm.
- Multiple competing candidates → escalate to the operator with the candidate list; do not guess between them.
- No candidate found → the element likely genuinely disappeared. Re-classify as either (c) flow drift (something replaced it) or app bug (component missing that should be present) using the screenshot evidence as the tiebreaker.
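The three thresholds above map onto a small decision function. The `Candidate` shape is hypothetical. Two cases are not spelled out in the text and are handled conservatively here as assumptions: a single candidate with only partial agreement is escalated, and exactly one fully agreeing candidate among weaker ones is treated as high confidence.

```typescript
// Hypothetical record for one candidate found in the live DOM.
interface Candidate {
  selector: string;
  textMatch: boolean;
  roleMatch: boolean;
  landmarkMatch: boolean;
}

type RelearnDecision =
  | { action: "update-repo"; selector: string }
  | { action: "escalate"; candidates: string[] }
  | { action: "reclassify" };

function decideRelearn(candidates: Candidate[]): RelearnDecision {
  // No candidate: the element likely disappeared, so re-classify (flow drift or app bug).
  if (candidates.length === 0) return { action: "reclassify" };

  const highConfidence = candidates.filter(
    (c) => c.textMatch && c.roleMatch && c.landmarkMatch
  );
  // Exactly one candidate on which all three signals agree: safe to update.
  if (highConfidence.length === 1) {
    return { action: "update-repo", selector: highConfidence[0].selector };
  }
  // Competing or partially matching candidates: hand the list to the operator.
  return { action: "escalate", candidates: candidates.map((c) => c.selector) };
}
```

An `update-repo` decision still has to be followed by the immediate confirmation run; the function only decides whether an automatic update is allowed at all.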
Root cause: fragile selector
If triage attributes the failure to a fragile selector (text drift, position-dependent CSS, role/name collision), check workspace shape before selecting a heal strategy:
Frontend source in workspace — package.json lists the UI framework as a dependency and a src/-style tree of .tsx/.jsx/.vue/.svelte/.html files is present:
→ Dispatch selector-development (mode: "jit", scope = the element-key whose locator failed). After it returns, replace the test's locator with the new test-attribute selector and re-run. Then continue from Stage 5 (stability validation) as normal.
Frontend source NOT in workspace:
→ Report the fragile selector to the user as an actionable test-debt item. Do NOT attempt to harden the locator with compound selectors, nth-child chains, or XPath depth — that adds brittleness without adding stability. The report should name the element-key, the fragile signal (text drift / CSS position / role collision), and that selector-development cannot help because the source files are not available in this workspace.
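The workspace-shape check can be sketched as a pure function over the dependency names and file paths. The framework list and the `src/` prefix test are assumptions made for illustration; the text only says "the UI framework" and "a `src/`-style tree".

```typescript
// Assumed, non-exhaustive list of UI framework package names.
const UI_FRAMEWORKS = ["react", "vue", "svelte", "@angular/core"];
const FRONTEND_SOURCE = /\.(tsx|jsx|vue|svelte|html)$/;

// True when package.json lists a UI framework and frontend source files sit under src/.
function hasFrontendSource(dependencies: string[], files: string[]): boolean {
  const listsFramework = dependencies.some(
    (name) => UI_FRAMEWORKS.indexOf(name) !== -1
  );
  const hasSourceTree = files.some(
    (path) => path.indexOf("src/") === 0 && FRONTEND_SOURCE.test(path)
  );
  return listsFramework && hasSourceTree;
}

// Hypothetical routing on top of the check.
function fragileSelectorRoute(dependencies: string[], files: string[]): string {
  return hasFrontendSource(dependencies, files)
    ? "dispatch selector-development (mode: jit)"
    : "report as test debt";
}
```

Both conditions are required: a framework dependency without the source tree (for example, a tests-only repository) still routes to the test-debt report.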
Stage 5 — Fix and stability (test issues only)
- Apply the fix per the heal strategy selected in Stage 4a. Use the Steps API correctly; refer to `../element-interactions/references/api-reference.md` for all method signatures.
- If the fix requires new selectors: Stage 4b has produced the proposal. For Auto strategies the update applies directly; for Propose strategies confirm with the operator first.
- Run the test 3-5 times to confirm stability. A single pass is not sufficient: flaky tests are worse than failing tests.

      # Run the specific test file multiple times
      for i in {1..5}; do npx playwright test <test-file> --reporter=line; done

- Only commit after all stability runs pass.
- If any stability run fails: revert the heal, then re-enter the diagnostic pipeline from Stage 1. The heal was incomplete. If a second strategy also destabilizes, escalate to (f) flake quarantine rather than trying a third heal — two failed strategies is a signal that single-failure mode is insufficient for this test.
Stage 6 — Report (app bugs only)
Present the bug report to the user with this structure:
Application Bug Report
Test: `tests/example.spec.ts`, TC_001: Login flow
Step: "Verify dashboard loads after login"
Expected: Dashboard page loads with welcome message and user stats
Actual: Page shows "500 Internal Server Error"
Screenshot:
DOM evidence: <error-context.md or snapshot path>
Severity:
Environment: <base URL + browser>
Journey: j- (when journey-map.md present)
Reproducible: Yes, confirmed by navigating manually via playwright-cli
This is an application bug. The test has NOT been modified.
Hard rule: every app-bug report MUST cite at least one on-disk artifact path (screenshot, error-context.md, snapshot, or capture file). Prose supplements the artifact, never replaces it.
Do NOT modify the test to work around the bug. Do NOT skip the test. Do NOT add try/catch blocks to swallow the error. Report and stop.
Stability Validation Protocol
A fix is confirmed only when the test passes 3-5 consecutive runs without failure. This catches:
- Race conditions that pass 80% of the time
- Timing-sensitive tests that work on fast machines but fail under load
- State leakage between tests that only manifests on repeated runs
If any run in the stability check fails, the fix is incomplete. Do not commit — re-diagnose.
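A quick calculation shows why a single pass proves little. Assuming runs are independent (an idealization; real flakes often correlate with load or ordering), a test that passes with probability p passes n consecutive runs with probability p^n:

```typescript
// Probability that a test with the given per-run pass rate
// passes every one of `runs` consecutive runs, assuming independent runs.
function chanceAllRunsPass(passRate: number, runs: number): number {
  return Math.pow(passRate, runs);
}

// A race condition that passes 80% of the time:
console.log(chanceAllRunsPass(0.8, 1).toFixed(3)); // "0.800"
console.log(chanceAllRunsPass(0.8, 3).toFixed(3)); // "0.512"
console.log(chanceAllRunsPass(0.8, 5).toFixed(3)); // "0.328"
```

Under that assumption, an 80% flake slips through one run four times out of five, but through five consecutive runs only about one time in three.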
Integration
Skills that call this one
| Calling Skill | Activation Point | What Happens Next |
|---|---|---|
| `maintenance` | First step when a test failure is reported | After heal + stability → return for compliance review + commit |
| `authoring` | When a newly written test fails in Stage 3 | After heal + stability → return for compliance review + commit |
| `test-composer` | When a test run produces failures | After heal + stability → return for next scenario |
| `bug-discovery` | When adversarial tests fail | After heal + stability OR bug report → return to caller |
| `test-repair` | Per cluster in its Stage 4 (batch repair pipeline) | Diagnose the cluster's representative, apply heal once for the whole cluster, return outcome (Healed / App bug / Operator-pending / Quarantined) |
After a successful heal + stability confirmation, control returns to the calling skill.
Escalating up to test-repair
Sometimes single-failure mode isn't the right shape. Hand off to test-repair when the failure is not really a single event:
| Condition | Why escalate |
|---|---|
| The current run has ≥5 failures or ≥30% of executed tests failed | Per-failure diagnosis doesn't scale; batch clustering finds the shared root cause faster |
| You have been invoked 3+ times in this session on distinct tests | The pattern across failures is likely worth detecting before healing more in isolation |
| A heal you applied caused previously-passing tests to start failing | Cross-test interaction is invisible from here; test-repair's post-heal verification stage is designed for it |
| Two different heal strategies on the same test have both destabilized | Before trying a third, bump up to batch mode — the test's behavior may be coupled to sibling tests |
Announce the escalation once to the operator and start batch mode:
Detected: handing off to the `test-repair` batch pipeline so we can cluster root causes before continuing to heal individually. Reply "stay single-failure" to override.
The operator can override back to single-failure mode if they have a reason to keep the narrower scope.
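The four escalation conditions can be checked mechanically. The `SessionState` shape is a hypothetical way to track them; the thresholds are the ones from the table above, and the operator override is modeled as a simple flag.

```typescript
// Hypothetical per-session counters for the escalation conditions.
interface SessionState {
  failedTests: number;
  executedTests: number;
  distinctTestsDiagnosed: number; // invocations on distinct tests this session
  healBrokePassingTests: boolean;
  destabilizedStrategiesOnSameTest: number;
  operatorSaidStaySingleFailure: boolean;
}

// Returns the reasons to hand off to test-repair; an empty list means stay in single-failure mode.
function escalationReasons(s: SessionState): string[] {
  if (s.operatorSaidStaySingleFailure) return [];
  const reasons: string[] = [];
  // Integer comparison for the 30% threshold avoids floating-point edge cases.
  const thirtyPercentFailed =
    s.executedTests > 0 && s.failedTests * 10 >= s.executedTests * 3;
  if (s.failedTests >= 5 || thirtyPercentFailed) reasons.push("failure volume");
  if (s.distinctTestsDiagnosed >= 3) reasons.push("repeated invocations");
  if (s.healBrokePassingTests) reasons.push("heal caused regressions");
  if (s.destabilizedStrategiesOnSameTest >= 2) reasons.push("two strategies destabilized");
  return reasons;
}
```

Any non-empty result is enough to announce the hand-off once; the reasons list doubles as the text for that announcement.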
API Reference
Refer to ../element-interactions/references/api-reference.md for all method signatures, argument orders, and types. All Steps methods use (elementName, pageName) order.