e2e-failure-analyzer

name: e2e-failure-analyzer
description: Analyze e2e test failures from a GitHub Actions run. Provide a run ID or URL to download reports, extract traces/screenshots/logs, identify root causes, and get suggested actions. Works with both posit-dev/positron and posit-dev/positron-builds repos.
disable-model-invocation: true

E2E Failure Analyzer

Analyzes Playwright e2e test failures from a GitHub Actions run using JSON reports, trace files, screenshots, and test logs to identify root causes and suggest next actions.

When to Use

A CI run has failed and you want to understand why
Triaging e2e test failures from Test: Merge to branch, Test: Full Suite, or Positron Build: Daily Release
Investigating flaky tests from a specific run

Prerequisites

GitHub CLI (gh) authenticated
@playwright/test available via npx (for merging blob reports, positron repo only)

Helper Scripts

Scripts live alongside this skill in scripts/. Use the base directory path shown above when the skill loads (the "Base directory for this skill: ..." line) as $SKILL_DIR. Scripts require Node.js and are cross-platform (Windows via Git Bash, macOS, Linux). Scripts that extract from zip files require unzip to be available in PATH (included in Git Bash on Windows).

Consolidated scripts (preferred -- fewer tool calls)

e2e-gather-run-info.js - Gathers all run metadata, failed jobs, artifacts, non-e2e job log excerpts, and commit info in one call. Replaces multiple gh api invocations.
e2e-process-project.js - Path A. Processes a merged blob report project end-to-end: extracts failures, scans blobs, extracts/parses traces, extracts screenshots and error-context. Replaces multiple script + unzip invocations.
e2e-process-s3.js - Path B. Processes a CloudFront-hosted Playwright HTML report end-to-end: fetches index.html, decodes the embedded base64 report.zip, downloads trace + error-context attachments from S3, parses traces, and extracts screencast frames. Produces the same JSON shape as e2e-process-project.js so the downstream analyzer treats both paths identically.

Standalone scripts (used by consolidated scripts internally, or for ad-hoc debugging)

e2e-extract-failures.js - Extracts failures from a merged Playwright JSON report
e2e-parse-trace.js - Parses a trace.trace file into an action timeline with errors and last screenshot hash
e2e-inspect-blobs.js - Scans blob report zips to find failed test IDs and their trace/log resource hashes
e2e-query-history.js - Queries the e2e-test-insights API for historical test health data (requires E2E_INSIGHTS_API_KEY env var)

Input

Run ID or URL from either repo:

https://github.com/posit-dev/positron/actions/runs/23610137774
https://github.com/posit-dev/positron-builds/actions/runs/23938334846
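As a rough illustration of how such input can be split into a repo and a run ID, here is a minimal sketch. The function name parseRunInput is hypothetical, and the actual parsing inside e2e-gather-run-info.js may differ.

```javascript
// Sketch only: the real parsing lives in e2e-gather-run-info.js and may differ.
// Accepts either a bare numeric run ID or a full GitHub Actions run URL.
function parseRunInput(input) {
  const trimmed = String(input).trim();

  // A bare run ID carries no repo information; the caller must supply the repo.
  if (/^\d+$/.test(trimmed)) {
    return { repo: null, runId: trimmed };
  }

  const match = trimmed.match(
    /^https:\/\/github\.com\/([^/]+\/[^/]+)\/actions\/runs\/(\d+)/
  );
  if (!match) {
    throw new Error(`Unrecognized run ID or URL: ${input}`);
  }
  return { repo: match[1], runId: match[2] };
}

console.log(
  parseRunInput('https://github.com/posit-dev/positron/actions/runs/23610137774')
);
```

Run IDs are kept as strings so they can be passed straight back to command-line flags without numeric formatting surprises.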

Step 1: Gather Run Info (single script call)

The consolidated e2e-gather-run-info.js script handles everything: run metadata, failed jobs, blob report artifacts, non-e2e job log excerpts, and commit info.

node "$SKILL_DIR/scripts/e2e-gather-run-info.js" <RUN_URL>

Output JSON contains:

repo, runId - parsed from URL
run - metadata (name, conclusion, html_url, head_sha, branch)
failedJobs - array of {id, name, isE2e} for all failed jobs
nonE2eJobLogs - map of job ID to failure log excerpts (for non-e2e jobs)
artifacts - sorted list of blob report artifact names
projects - unique project names extracted from artifacts (e.g., e2e-chromium, e2e-windows)
commit - {message, author, files} for the head commit

Use projects to determine what to process:

If projects list is non-empty -> use Path A (positron repo flow) for each project
If empty -> use Path B (positron-builds flow)

The two repos have different data access patterns:

posit-dev/positron: Uses sharded blob reports uploaded as GitHub artifacts. Requires downloading and merging.
posit-dev/positron-builds: Non-sharded single-job runs. HTML reports are uploaded to S3 and served through CloudFront. No blob report artifacts.
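The path decision above can be sketched as a small function over the Step 1 output. The function name planProcessing and its return shape are hypothetical; only the projects and failedJobs fields come from the output described above.

```javascript
// Sketch: choose Path A or Path B from the Step 1 JSON.
// `runInfo` is the parsed output of e2e-gather-run-info.js.
function planProcessing(runInfo) {
  const projects = Array.isArray(runInfo.projects) ? runInfo.projects : [];

  // Non-empty projects list: sharded blob reports exist, so use Path A per project.
  if (projects.length > 0) {
    return { path: 'A', targets: projects };
  }

  // Empty projects list: Path B, processed once per failed e2e job.
  const failedE2eJobs = (runInfo.failedJobs || []).filter((job) => job.isE2e);
  return { path: 'B', targets: failedE2eJobs.map((job) => job.name) };
}

console.log(planProcessing({ projects: ['e2e-chromium', 'e2e-windows'], failedJobs: [] }));
```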

Path A: posit-dev/positron (Sharded Blob Reports)

A1+A2: Download, Merge, and Process (single script call)

The e2e-process-project.js script handles everything in one call: downloads blob report artifacts, copies shards into a merged directory, runs npx playwright merge-reports, then extracts failures, scans blobs, extracts/parses traces, and extracts screenshots. Use --cleanup to remove intermediate download/merge artifacts automatically.

For each project from Step 1, run:

node "$SKILL_DIR/scripts/e2e-process-project.js" \
  --download --run-id <RUN_ID> --repo <REPO> --project <PROJECT> \
  --output-dir /tmp/e2e-analysis-<PROJECT> --cleanup

If there are multiple projects, run them sequentially (each call uses npx internally).
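One way to drive the sequential runs from Node is sketched below. It assumes that the script prints its output JSON on stdout and that --repo accepts whatever repo string Step 1 returned; neither is confirmed here. The function names are hypothetical.

```javascript
const { execFileSync } = require('child_process');
const path = require('path');

// Build the argument list for one e2e-process-project.js invocation,
// mirroring the flags shown in the command above.
function buildProjectArgs(skillDir, runId, repo, project) {
  return [
    path.join(skillDir, 'scripts', 'e2e-process-project.js'),
    '--download',
    '--run-id', String(runId),
    '--repo', repo,
    '--project', project,
    '--output-dir', `/tmp/e2e-analysis-${project}`,
    '--cleanup',
  ];
}

// Run each project one after another. execFileSync blocks until the child
// exits, so two npx-based merges never overlap.
function processProjectsSequentially(skillDir, runId, repo, projects) {
  const results = [];
  for (const project of projects) {
    const stdout = execFileSync('node', buildProjectArgs(skillDir, runId, repo, project), {
      encoding: 'utf8',
      maxBuffer: 64 * 1024 * 1024, // output JSON can be large
    });
    // Assumption: the script writes a single JSON document to stdout.
    results.push({ project, output: JSON.parse(stdout) });
  }
  return results;
}
```

Passing arguments as an array avoids shell quoting problems with project names on Windows and POSIX shells alike.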

Fallback: If blob reports were already downloaded and merged (e.g., for debugging), you can skip --download and pass the directories directly:

node "$SKILL_DIR/scripts/e2e-process-project.js" \
  /tmp/blob-merged-<PROJECT> /tmp/report-<PROJECT>.json \
  --output-dir /tmp/e2e-analysis-<PROJECT>

Output JSON contains:

outputDir - path where screenshots and error-context files were saved
failures - array of final failures (tests that failed all retries) with title, file, tags, suite, project, errors
failedTests - array of all failed test attempts (including those that passed on retry) with testId, title, file, status, blob
testDetails - array of per-test objects, each containing:
- testId, title, file, status, blob, attemptCount
- attempts - array of per-attempt objects with:
  - trace - parsed trace data: timeline (human-readable string), errors (array), screenshotShas (array of {sha1, timestamp} in chronological order), lastScreenshotSha1 (legacy: same as last entry of screenshotShas)
  - screenshotPaths - chronological array of paths to extracted screenshot JPEGs (view with Read tool); the last entry is the failure-state frame, earlier entries show the moments before it
  - screenshotPath - legacy alias pointing to the last entry of screenshotPaths
  - errorContextPath - path to the extracted page snapshot markdown: Playwright's accessibility-tree snapshot of the page at the moment of failure (including content inside same-origin webview iframes), plus the failing selector and the relevant test source. Primary evidence for locator-not-found / not-visible / element-count / text-or-attribute failures -- Read it to tell a stale test selector from a real product regression (see the analysis rubric)
- logHashes - array of {resourceHash, blob} for logs (extract manually if needed)

IMPORTANT: View screenshots using the screenshotPaths arrays with the Read tool. You MUST Read all screenshots in a single message with multiple parallel Read tool calls -- this results in only one approval prompt instead of one per screenshot. View all attempts and all frames per attempt; comparing across retries reveals whether a failure is consistent or intermittent, and comparing the trailing frames within an attempt often shows where the test went wrong before the visible error. Screenshots are the most revealing evidence for diagnosing failures. Default frame count per attempt is 3 (configurable via --screenshots N on e2e-process-project.js).

View the error-context page snapshot with the Read tool using errorContextPath paths. For any locator-not-found, "not visible", element-count, or text/attribute failure, Read it FIRST (not as a last resort): it captures the failure-state accessibility tree -- the only evidence that distinguishes a stale test selector from a real product regression, since a screenshot cannot. See the analysis rubric.
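To batch the reads, it helps to gather every evidence path up front. The following is a minimal sketch over the output fields listed above; the function name collectEvidencePaths and its return shape are hypothetical.

```javascript
// Sketch: flatten testDetails into two lists of files to read,
// error-context snapshots first, then screenshots in chronological order.
function collectEvidencePaths(result) {
  const errorContexts = [];
  const screenshots = [];

  for (const test of result.testDetails || []) {
    (test.attempts || []).forEach((attempt, index) => {
      const attemptNumber = index + 1;
      if (attempt.errorContextPath) {
        errorContexts.push({
          title: test.title,
          attempt: attemptNumber,
          path: attempt.errorContextPath,
        });
      }
      // screenshotPaths is chronological; the last entry is the failure-state frame.
      for (const screenshotPath of attempt.screenshotPaths || []) {
        screenshots.push({ title: test.title, attempt: attemptNumber, path: screenshotPath });
      }
    });
  }
  return { errorContexts, screenshots };
}
```

Keeping the attempt number next to each path makes it easier to compare the same frame position across retries.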

Path B: posit-dev/positron-builds (S3 HTML Reports)

Process the HTML report (single script call)

The e2e-process-s3.js script handles everything in one call: fetches the report's index.html, decodes the embedded base64 report.zip, walks failures + per-file detail JSONs, downloads trace and error-context attachments from S3, parses traces, and extracts trailing screencast frames.

For each failed e2e job from Step 1, resolve the job's REPORT_DIR from its logs (the workflow logs both an unresolved template line containing literal ${IDENTIFIER} / ${OS_SUFFIX} and the expanded value -- ignore the template), then run:

node "$SKILL_DIR/scripts/e2e-process-s3.js" \
  --report-url https://d38p2avprg8il3.cloudfront.net/<REPORT_DIR>/ \
  --output-dir /tmp/e2e-analysis-<JOB_LABEL> \
  --cleanup

For interactive / ad-hoc use, you can call the script directly with any CloudFront-hosted Playwright HTML report URL -- no run ID required.

Output JSON is identical to Path A's e2e-process-project.js (see the field list above), so the same screenshot-reading and analysis flow applies. The blob field is the report directory name (last path segment of the S3 URL) rather than a zip filename, since Path B has no blob zips.
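The URL handling described for Path B can be sketched as two small helpers: one builds the --report-url value from a resolved REPORT_DIR and rejects an unexpanded template, the other derives the report directory name used for the blob field. Both function names are hypothetical, and the script's own logic may differ.

```javascript
// CloudFront base taken from the command shown above.
const REPORT_BASE = 'https://d38p2avprg8il3.cloudfront.net/';

// Build the report URL from a resolved REPORT_DIR value.
function buildReportUrl(reportDir) {
  const cleaned = String(reportDir).trim().replace(/^\/+|\/+$/g, '');
  // The workflow also logs an unexpanded template line; never use that one.
  if (cleaned === '' || cleaned.includes('${')) {
    throw new Error(`REPORT_DIR is empty or an unresolved template: ${reportDir}`);
  }
  return `${REPORT_BASE}${cleaned}/`;
}

// Last non-empty path segment of the report URL (the Path B `blob` value).
function reportDirName(reportUrl) {
  const segments = new URL(reportUrl).pathname.split('/').filter(Boolean);
  return segments.length > 0 ? segments[segments.length - 1] : '';
}
```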

IMPORTANT: View screenshots the same way as Path A -- Read all screenshotPaths arrays in a single message with multiple parallel Read tool calls. Read the errorContextPath page snapshot FIRST for any locator-not-found / not-visible / attribute / text failure (it is the primary evidence for stale-selector vs product-regression -- see the analysis rubric), not just when screenshots and traces fall short.

Step 6: Query Historical Test Health (optional)

If the E2E_INSIGHTS_API_KEY environment variable is set, query the e2e-test-insights dashboard for historical failure data. This step is optional -- if the API is unavailable, skip it and proceed with analysis.

The repo identifier for the API is always positron, whether the run comes from posit-dev/positron or posit-dev/positron-builds. Both repos run the same tests (positron-builds uses positron as a submodule), and test results are stored under the positron repo ID in the dashboard.

Option 1: Query by workflow run ID (preferred)

If the GitHub run ID is available, use --run-id to get history for all tests that failed or flaked in this run:

node "$SKILL_DIR/scripts/e2e-query-history.js" --repo positron --run-id <RUN_ID> --lookback-days 14 --branch <BRANCH>

The branch is important -- a test may be rare_flake on main but known_flaky on a release branch. Get the branch from the run metadata (Step 1) or the onProject event in blob reports. Common branches: main, release/YYYY.MM.

Option 2: Query by test keys

If the run isn't in the dashboard yet, construct test keys manually from extracted failures:

node "$SKILL_DIR/scripts/e2e-query-history.js" --repo positron \
  --test-keys "testName1|||specPath1,testName2|||specPath2" --lookback-days 14
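A minimal sketch for building the --test-keys value from the extracted failures follows. It assumes that the test name corresponds to each failure's title and the spec path to its file; that mapping is not confirmed here, and buildTestKeys is a hypothetical helper.

```javascript
// Sketch: join failures into "name|||spec" pairs separated by commas.
// Assumption: testName = failure.title and specPath = failure.file.
function buildTestKeys(failures) {
  const seen = new Set();
  const keys = [];
  for (const failure of failures) {
    const key = `${failure.title}|||${failure.file}`;
    // The same test can appear once per project; send each key only once.
    if (!seen.has(key)) {
      seen.add(key);
      keys.push(key);
    }
  }
  return keys.join(',');
}

console.log(buildTestKeys([{ title: 'testName1', file: 'specPath1' }]));
```

Because commas separate the keys, a test title that itself contains a comma would be ambiguous in this format and may need to be queried on its own.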

Using the history in analysis

The response includes per-test data. Use it to enhance the analysis:

insight.type: "new" = first-time failure (likely regression), "recurring" / "known_flaky" = known pattern, "rare_flake" = infrequent
history.pass_rate: a low pass rate indicates a known flaky test; a 100% pass rate before this run indicates a regression
failure_patterns: Compare today's error message against historical patterns -- same pattern = recurring, new pattern = potential regression even for known-flaky tests
insight.first_failure_sha / insight.timing_value: When the failures started -- useful for bisecting

Interpreting `environment_breakdown` -- look across environments

The environment_breakdown array is often more informative than the aggregate history stats. Always check per-environment pass rates before concluding a test is "flaky":

0% pass rate on one environment, 100% on others = deterministic regression on that platform, NOT flaky. Example: a test failing on all chromium runs but passing on all electron runs is a chromium-specific bug, even if the aggregate pass rate is 58%.
Low pass rate across all environments = genuinely flaky
Low pass rate on one environment only = platform-specific flakiness (e.g., "worse on win/electron")
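The three rules above can be expressed as a small classifier. This is a sketch under stated assumptions: the entry fields environment, passed, and total are hypothetical (the real API shape may differ), and the 0.9 cutoff for a "low" pass rate is an arbitrary illustration, not a value from the rubric.

```javascript
// Sketch: classify an environment_breakdown array.
// Assumed entry shape (hypothetical): { environment, passed, total }.
function classifyEnvironments(breakdown, lowPassRate = 0.9) {
  const rates = (breakdown || [])
    .filter((env) => env.total > 0)
    .map((env) => ({ environment: env.environment, passRate: env.passed / env.total }));
  const names = (list) => list.map((r) => r.environment);

  if (rates.length === 0) return { verdict: 'no-data', environments: [] };

  const alwaysFail = rates.filter((r) => r.passRate === 0);
  const alwaysPass = rates.filter((r) => r.passRate === 1);
  const low = rates.filter((r) => r.passRate < lowPassRate);

  // 0% on some environments and 100% on all the others: deterministic, not flaky.
  if (
    alwaysFail.length > 0 &&
    alwaysPass.length > 0 &&
    alwaysFail.length + alwaysPass.length === rates.length
  ) {
    return { verdict: 'deterministic-platform-regression', environments: names(alwaysFail) };
  }
  // Low everywhere: genuinely flaky.
  if (low.length === rates.length) {
    return { verdict: 'flaky-everywhere', environments: names(low) };
  }
  // Low on only some environments: platform-specific flakiness.
  if (low.length > 0) {
    return { verdict: 'platform-specific-flakiness', environments: names(low) };
  }
  return { verdict: 'no-pattern', environments: [] };
}
```

The verdict is only a starting point; the per-environment counts should still be quoted in the History line so a reader can check the call.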

When the breakdown reveals an environment-specific pattern, call it out explicitly:

"History: 100% failure on chromium (0/4 passed), 100% pass on electron (6/6) -- deterministic regression on chromium, not flaky"
"History: known flaky across all platforms, worst on win/electron (88% pass rate)"

History line format

Include a History line in each failure's analysis, e.g.:

"History: failed 4/18 runs (22%) over last 14 days, same error pattern -- known flaky"
"History: passed 15/15 runs over last 14 days -- new regression"
"History: 0% pass rate on chromium (10/10 failed since Apr 02), 100% on electron -- deterministic platform regression"
"History: no data available (API unreachable)"

Step 7: Analyze and Present Results

For each failure (or group of related failures), apply the shared analysis rubric to determine its root-cause category, a 1-2 sentence evidence-based explanation, and a suggested action. rubric.md is the single source of truth for the root-cause categories, the evidence-reading order (screenshots, trace timeline, test source, and the error-context page snapshot -- read FIRST for any locator/visibility/attribute/text failure), the locator-drift-vs-product-regression decision, historical-data interpretation, and head-commit correlation. The same file is injected verbatim into the analyzer Action's system prompt, so local skill runs and the Action reason identically -- edit the rubric there, not here.

Include a Commit line in the detailed analysis when the head commit is relevant (per the rubric), e.g. "Commit: modified notebookCellList.ts (notebook cell rendering) -- plausible cause" or "Commit: no files related to this test's feature area -- unlikely cause".

Additional repo context

Also use context from the repo when helpful:

Read the failing test file to understand what it does
Check git log for recent changes to the test or related product code beyond the head commit
Search for related issues

Key log files to check:

window1/renderer.log - Main window renderer process logs
window1/exthost/exthost.log - Extension host logs
window1/exthost/positron.positron-supervisor/Python Kernel.log - Python kernel logs
window1/exthost/positron.positron-r/R Language Pack.log - R runtime logs
e2e-test-runner.log - Test runner output
main.log - Electron main process logs

For each failure, include the platform (OS and project/browser) where it occurred. This information comes from:

Path A: The project name (e.g., e2e-windows, e2e-electron, e2e-chromium) and the workflow name
Path B: The job name (e.g., "electron (macOS)", "electron (ubuntu)") and Playwright project in the test output (e.g., [e2e-macOS-ci])

When multiple projects/platforms are analyzed in a single run, note which platforms each failure occurred on and whether the same test passed on other platforms.

Present the analysis in a summary table that includes columns for: test name, platform, root cause category, and severity. In the severity column, clearly distinguish tests that failed all retries (hard failures) from tests that passed on retry (flaky). This distinction comes from comparing failures (final failures after all retries) vs failedTests (all attempts including those that recovered). Then provide detailed analysis for each failure below the table.
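The hard-failure versus flaky distinction can be derived mechanically, as in this sketch. It assumes that entries in failures and failedTests can be matched on file plus title, which is not confirmed here; the function name and severity labels are hypothetical.

```javascript
// Sketch: one row per test, labelled by whether it failed all retries.
// Assumption: `file` + `title` identify the same test in both arrays.
function classifySeverity(result) {
  const keyOf = (test) => `${test.file}::${test.title}`;
  const hardKeys = new Set((result.failures || []).map(keyOf));
  const rows = new Map();

  // failedTests holds every failed attempt, so the same test can repeat.
  for (const test of result.failedTests || []) {
    const key = keyOf(test);
    if (!rows.has(key)) {
      rows.set(key, {
        title: test.title,
        file: test.file,
        severity: hardKeys.has(key) ? 'hard failure' : 'flaky (passed on retry)',
      });
    }
  }
  // Keep any final failure that did not show up in failedTests.
  for (const failure of result.failures || []) {
    const key = keyOf(failure);
    if (!rows.has(key)) {
      rows.set(key, { title: failure.title, file: failure.file, severity: 'hard failure' });
    }
  }
  return [...rows.values()];
}
```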

Include non-e2e job failures (unit tests, integration tests, build failures) in the summary table as well, with the job name as the test name and a brief description of the failure extracted from the job logs.

Offer to:

Open the relevant test files
Search for related recent changes
Create GitHub issues

Cleanup

Path A and Path B: If you used --cleanup with e2e-process-project.js / e2e-process-s3.js, the intermediate download/unzip dirs are already removed. Only the --output-dir remains (screenshots and error-context). Remove each one by its exact path (no globs):

rm -rf /tmp/e2e-analysis-<PROJECT_OR_JOB_LABEL>
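If cleanup is scripted from Node rather than typed by hand, a guard like the following sketch enforces the exact-path rule before deleting anything. The helper names are hypothetical, and the /tmp/e2e-analysis- prefix is taken from the commands in this document; adjust it if a different --output-dir was used.

```javascript
const fs = require('fs');

// Reject glob patterns and anything outside the expected output-dir naming.
function assertSafeOutputDir(dir) {
  if (/[*?[\]{}]/.test(dir)) {
    throw new Error(`Refusing glob pattern: ${dir}`);
  }
  if (!/^\/tmp\/e2e-analysis-[^/]+\/?$/.test(dir)) {
    throw new Error(`Unexpected output dir: ${dir}`);
  }
  return dir;
}

// Remove one output directory; `force` makes a missing directory a no-op.
function removeOutputDir(dir) {
  fs.rmSync(assertSafeOutputDir(dir), { recursive: true, force: true });
}
```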