name: e2e-failure-analyzer
description: Analyze e2e test failures from a GitHub Actions run. Provide a run ID or URL to download reports, extract traces/screenshots/logs, identify root causes, and get suggested actions. Works with both posit-dev/positron and posit-dev/positron-builds repos.
disable-model-invocation: true
E2E Failure Analyzer
Analyzes Playwright e2e test failures from a GitHub Actions run using JSON reports, trace files, screenshots, and test logs to identify root causes and suggest next actions.
When to Use
- A CI run has failed and you want to understand why
- Triaging e2e test failures from Test: Merge to branch, Test: Full Suite, or Positron Build: Daily Release
- Investigating flaky tests from a specific run
Prerequisites
- GitHub CLI (gh) authenticated
- @playwright/test available via npx (for merging blob reports, positron repo only)
Helper Scripts
Scripts live alongside this skill in scripts/. Use the base directory path shown above when the skill loads (the "Base directory for this skill: ..." line) as $SKILL_DIR. Scripts require Node.js and are cross-platform (Windows via Git Bash, macOS, Linux). Scripts that extract from zip files require unzip to be available in PATH (included in Git Bash on Windows).
Consolidated scripts (preferred -- fewer tool calls)
- e2e-gather-run-info.js - Gathers all run metadata, failed jobs, artifacts, non-e2e job log excerpts, and commit info in one call. Replaces multiple gh api invocations.
- e2e-process-project.js - Path A. Processes a merged blob report project end-to-end: extracts failures, scans blobs, extracts/parses traces, extracts screenshots and error-context. Replaces multiple script + unzip invocations.
- e2e-process-s3.js - Path B. Processes a CloudFront-hosted Playwright HTML report end-to-end: fetches index.html, decodes the embedded base64 report.zip, downloads trace + error-context attachments from S3, parses traces, and extracts screencast frames. Produces the same JSON shape as e2e-process-project.js so the downstream analyzer treats both paths identically.
Standalone scripts (used by consolidated scripts internally, or for ad-hoc debugging)
- e2e-extract-failures.js - Extracts failures from a merged Playwright JSON report
- e2e-parse-trace.js - Parses a trace.trace file into an action timeline with errors and last screenshot hash
- e2e-inspect-blobs.js - Scans blob report zips to find failed test IDs and their trace/log resource hashes
- e2e-query-history.js - Queries the e2e-test-insights API for historical test health data (requires E2E_INSIGHTS_API_KEY env var)
Input
Run ID or URL from either repo:
https://github.com/posit-dev/positron/actions/runs/23610137774
https://github.com/posit-dev/positron-builds/actions/runs/23938334846
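As a rough illustration of how the two input forms map to a repo and run ID, here is a minimal sketch. The function name parseRunInput and the default repo used for a bare run ID are assumptions made for this example; this is not the parser that e2e-gather-run-info.js actually uses.

```javascript
// Hypothetical helper (not part of the skill's scripts): splits a run URL
// into { repo, runId }. The defaultRepo fallback for a bare numeric ID is
// an assumption; a bare ID alone does not say which repo it belongs to.
function parseRunInput(input, defaultRepo = "posit-dev/positron") {
  const text = String(input).trim();

  // Bare numeric run ID: the repo cannot be inferred, so use the default.
  if (/^\d+$/.test(text)) {
    return { repo: defaultRepo, runId: text };
  }

  // Full GitHub Actions run URL: owner/repo and run ID are in the path.
  const match = text.match(
    /^https:\/\/github\.com\/([^/]+\/[^/]+)\/actions\/runs\/(\d+)/
  );
  if (!match) {
    throw new Error(`Unrecognized run ID or URL: ${text}`);
  }
  return { repo: match[1], runId: match[2] };
}
```

Run IDs are kept as strings so that large IDs are passed to later commands unchanged.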
Step 1: Gather Run Info (single script call)
The consolidated e2e-gather-run-info.js script handles everything: run metadata, failed jobs, blob report artifacts, non-e2e job log excerpts, and commit info.
node "$SKILL_DIR/scripts/e2e-gather-run-info.js" <RUN_URL>
Output JSON contains:
- repo, runId - parsed from URL
- run - metadata (name, conclusion, html_url, head_sha, branch)
- failedJobs - array of {id, name, isE2e} for all failed jobs
- nonE2eJobLogs - map of job ID to failure log excerpts (for non-e2e jobs)
- artifacts - sorted list of blob report artifact names
- projects - unique project names extracted from artifacts (e.g., e2e-chromium, e2e-windows)
- commit - {message, author, files} for the head commit
Use projects to determine what to process:
- If projects list is non-empty -> use Path A (positron repo flow) for each project
- If empty -> use Path B (positron-builds flow)
The two repos have different data access patterns:
- posit-dev/positron: Uses sharded blob reports uploaded as GitHub artifacts. Requires downloading and merging.
- posit-dev/positron-builds: Non-sharded single-job runs. HTML reports are uploaded to S3 and served via CloudFront. No blob report artifacts.
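The Path A / Path B decision can be sketched as a small function over the Step 1 output. The function name choosePath and the shape of its return value are assumptions for illustration; only the projects and failedJobs fields come from the output described above.

```javascript
// Hypothetical helper: picks the processing path from the Step 1 JSON.
// Non-empty `projects` means blob report artifacts exist (Path A);
// otherwise fall back to the S3 HTML report flow (Path B).
function choosePath(runInfo) {
  const projects = Array.isArray(runInfo.projects) ? runInfo.projects : [];
  if (projects.length > 0) {
    // Path A is run once per project.
    return { path: "A", projects };
  }
  // Path B is run once per failed e2e job.
  const jobs = (runInfo.failedJobs || []).filter((job) => job.isE2e);
  return { path: "B", jobs };
}
```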
Path A: posit-dev/positron (Sharded Blob Reports)
A1+A2: Download, Merge, and Process (single script call)
The e2e-process-project.js script handles everything in one call: downloads blob report artifacts, copies shards into a merged directory, runs npx playwright merge-reports, then extracts failures, scans blobs, extracts/parses traces, and extracts screenshots. Use --cleanup to remove intermediate download/merge artifacts automatically.
For each project from Step 1, run:
node "$SKILL_DIR/scripts/e2e-process-project.js" \
--download --run-id <RUN_ID> --repo <REPO> --project <PROJECT> \
--output-dir /tmp/e2e-analysis-<PROJECT> --cleanup
If there are multiple projects, run them sequentially (each call uses npx internally).
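When driving this from Node rather than a shell, the sequential per-project loop might look like the sketch below. The helper names are hypothetical, and the code assumes the script prints its output JSON on stdout and accepts the repo value in whatever form Step 1 reported it; neither is confirmed here.

```javascript
const { execFileSync } = require("node:child_process");
const path = require("node:path");

// Hypothetical helper: builds the argument list for one project using only
// the flags shown in the command above.
function buildProcessProjectArgs(skillDir, runId, repo, project) {
  return [
    path.join(skillDir, "scripts", "e2e-process-project.js"),
    "--download",
    "--run-id", String(runId),
    "--repo", repo,
    "--project", project,
    "--output-dir", `/tmp/e2e-analysis-${project}`,
    "--cleanup",
  ];
}

// Runs the projects one at a time. execFileSync blocks until each call
// finishes, which keeps the npx-based merges from overlapping.
// Assumption: the script writes its JSON result to stdout.
function processProjects(skillDir, runId, repo, projects, exec = execFileSync) {
  const results = {};
  for (const project of projects) {
    const stdout = exec(
      "node",
      buildProcessProjectArgs(skillDir, runId, repo, project),
      { encoding: "utf8", maxBuffer: 64 * 1024 * 1024 }
    );
    results[project] = JSON.parse(stdout);
  }
  return results;
}
```

The exec parameter exists only so the loop can be exercised with a stub; in normal use the default execFileSync is what runs the script.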
Fallback: If blob reports were already downloaded and merged (e.g., for debugging), you can skip --download and pass the directories directly:
node "$SKILL_DIR/scripts/e2e-process-project.js" \
/tmp/blob-merged-<PROJECT> /tmp/report-<PROJECT>.json \
--output-dir /tmp/e2e-analysis-<PROJECT>
Output JSON contains:
- outputDir - path where screenshots and error-context files were saved
- failures - array of final failures (tests that failed all retries) with title, file, tags, suite, project, errors
- failedTests - array of all failed test attempts (including those that passed on retry) with testId, title, file, status, blob
- testDetails - array of per-test objects, each containing:
  - testId, title, file, status, blob, attemptCount
  - attempts - array of per-attempt objects with:
    - trace - parsed trace data: timeline (human-readable string), errors (array), screenshotShas (array of {sha1, timestamp} in chronological order), lastScreenshotSha1 (legacy: same as last entry of screenshotShas)
    - screenshotPaths - chronological array of paths to extracted screenshot JPEGs (view with Read tool); the last entry is the failure-state frame, earlier entries show the moments before it
    - screenshotPath - legacy alias pointing to the last entry of screenshotPaths
    - errorContextPath - path to the extracted page snapshot markdown: Playwright's accessibility-tree snapshot of the page at the moment of failure (including content inside same-origin webview iframes), plus the failing selector and the relevant test source. Primary evidence for locator-not-found / not-visible / element-count / text-or-attribute failures -- Read it to tell a stale test selector from a real product regression (see the analysis rubric)
  - logHashes - array of {resourceHash, blob} for logs (extract manually if needed)
IMPORTANT: View screenshots using the screenshotPaths arrays with the Read tool. You MUST Read all screenshots in a single message with multiple parallel Read tool calls -- this results in only one approval prompt instead of one per screenshot. View all attempts and all frames per attempt; comparing across retries reveals whether a failure is consistent or intermittent, and comparing the trailing frames within an attempt often shows where the test went wrong before the visible error. Screenshots are the most revealing evidence for diagnosing failures. Default frame count per attempt is 3 (configurable via --screenshots N on e2e-process-project.js).
View the error-context page snapshot with the Read tool using errorContextPath paths. For any locator-not-found, "not visible", element-count, or text/attribute failure, Read it FIRST (not as a last resort): it captures the failure-state accessibility tree -- the only evidence that distinguishes a stale test selector from a real product regression, since a screenshot cannot. See the analysis rubric.
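To issue all Read calls in one message, it helps to flatten every evidence path out of the output JSON first. The following is a minimal sketch; the function name collectEvidencePaths is hypothetical, and the field names are those listed in the output description above.

```javascript
// Hypothetical helper: gathers every error-context and screenshot path
// across all tests and all attempts, in report order.
function collectEvidencePaths(output) {
  const errorContexts = [];
  const screenshots = [];
  for (const test of output.testDetails ?? []) {
    for (const attempt of test.attempts ?? []) {
      // Prefer the chronological array; fall back to the legacy single path.
      const frames =
        attempt.screenshotPaths ??
        (attempt.screenshotPath ? [attempt.screenshotPath] : []);
      screenshots.push(...frames);
      if (attempt.errorContextPath) {
        errorContexts.push(attempt.errorContextPath);
      }
    }
  }
  // Error-context snapshots come first because they are read first for
  // locator, visibility, count, and text/attribute failures.
  return { errorContexts, screenshots };
}
```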
Path B: posit-dev/positron-builds (S3 HTML Reports)
Process the HTML report (single script call)
The e2e-process-s3.js script handles everything in one call: fetches the report's index.html, decodes the embedded base64 report.zip, walks failures + per-file detail JSONs, downloads trace and error-context attachments from S3, parses traces, and extracts trailing screencast frames.
For each failed e2e job from Step 1, resolve the job's REPORT_DIR from its logs (the workflow logs both an unresolved template line containing literal ${IDENTIFIER} / ${OS_SUFFIX} and the expanded value -- ignore the template), then run:
node "$SKILL_DIR/scripts/e2e-process-s3.js" \
--report-url https://d38p2avprg8il3.cloudfront.net/<REPORT_DIR>/ \
--output-dir /tmp/e2e-analysis-<JOB_LABEL> \
--cleanup
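Picking the expanded REPORT_DIR out of a job log, while skipping the unresolved template line, could be sketched as follows. The exact log line format is not documented here, so the REPORT_DIR=value / REPORT_DIR: value pattern below is an assumption, as is the function name; adjust the pattern to what the job log actually shows.

```javascript
// Hypothetical helper. ASSUMPTION: the log contains lines of the form
// "REPORT_DIR=<value>" or "REPORT_DIR: <value>". The real format may differ.
function resolveReportDir(logText) {
  for (const line of logText.split(/\r?\n/)) {
    if (!line.includes("REPORT_DIR")) continue;
    // Skip the unresolved template line, which still contains "${...}".
    if (line.includes("${")) continue;
    const match = line.match(/REPORT_DIR\s*[=:]\s*["']?([^\s"']+)/);
    if (match) return match[1];
  }
  return null; // no expanded value found
}
```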
For interactive / ad-hoc use, you can call the script directly with any CloudFront-hosted Playwright HTML report URL -- no run ID required.
Output JSON is identical to Path A's e2e-process-project.js (see the field list above), so the same screenshot-reading and analysis flow applies. The blob field is the report directory name (last path segment of the S3 URL) rather than a zip filename, since Path B has no blob zips.
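The relationship between REPORT_DIR, the report URL, and the blob field can be sketched like this. Both function names are hypothetical; the CloudFront host is the one shown in the command above, and the "last path segment" rule is the one just described.

```javascript
// Base URL taken from the command in this section.
const REPORT_BASE = "https://d38p2avprg8il3.cloudfront.net/";

// Hypothetical helper: builds the --report-url value from a REPORT_DIR,
// normalizing stray slashes and ensuring a trailing slash.
function buildReportUrl(reportDir) {
  const trimmed = reportDir.replace(/^\/+|\/+$/g, "");
  return new URL(`${trimmed}/`, REPORT_BASE).toString();
}

// Hypothetical helper: derives the value Path B reports in `blob`,
// i.e. the last non-empty path segment of the report URL.
function reportDirName(reportUrl) {
  const segments = new URL(reportUrl).pathname.split("/").filter(Boolean);
  return segments[segments.length - 1] ?? null;
}
```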
IMPORTANT: View screenshots the same way as Path A -- Read all screenshotPaths arrays in a single message with multiple parallel Read tool calls. Read the errorContextPath page snapshot FIRST for any locator-not-found / not-visible / attribute / text failure (it is the primary evidence for stale-selector vs product-regression -- see the analysis rubric), not just when screenshots and traces fall short.
Step 6: Query Historical Test Health (optional)
If the E2E_INSIGHTS_API_KEY environment variable is set, query the e2e-test-insights dashboard for historical failure data. This step is optional -- if the API is unavailable, skip it and proceed with analysis.
The repo identifier for the API is always positron for both posit-dev/positron and posit-dev/positron-builds. Both repos run the same tests (positron-builds uses positron as a submodule) and test results are stored under the positron repo ID in the dashboard.
Option 1: Query by workflow run ID (preferred)
If the GitHub run ID is available, use --run-id to get history for all tests that failed or flaked in this run:
node "$SKILL_DIR/scripts/e2e-query-history.js" --repo positron --run-id <RUN_ID> --lookback-days 14 --branch <BRANCH>
The branch is important -- a test may be rare_flake on main but known_flaky on a release branch. Get the branch from the run metadata (Step 1) or the onProject event in blob reports. Common branches: main, release/YYYY.MM.
Option 2: Query by test keys
If the run isn't in the dashboard yet, construct test keys manually from extracted failures:
node "$SKILL_DIR/scripts/e2e-query-history.js" --repo positron \
--test-keys "testName1|||specPath1,testName2|||specPath2" --lookback-days 14
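Building the --test-keys value from extracted failures might look like the sketch below. It assumes that a failure's title and file fields correspond to the testName and specPath halves of a key, which is not confirmed here; the function name is also hypothetical.

```javascript
// Hypothetical helper. ASSUMPTION: failure.title maps to testName and
// failure.file maps to specPath in the "testName|||specPath" key format.
function buildTestKeys(failures) {
  const keys = new Set(); // de-duplicates tests that failed on several projects
  for (const failure of failures) {
    keys.add(`${failure.title}|||${failure.file}`);
  }
  return [...keys].join(",");
}
```

Because keys are comma-separated, a test title that itself contains a comma would need separate handling; this sketch does not attempt it.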
Using the history in analysis
The response includes per-test data. Use it to enhance the analysis:
- insight.type: "new" = first-time failure (likely regression), "recurring" / "known_flaky" = known pattern, "rare_flake" = infrequent
- history.pass_rate: Low pass rate = known flaky test, 100% pass rate before this run = regression
- failure_patterns: Compare today's error message against historical patterns -- same pattern = recurring, new pattern = potential regression even for known-flaky tests
- insight.first_failure_sha / insight.timing_value: When the failures started -- useful for bisecting
Interpreting environment_breakdown -- look across environments
The environment_breakdown array is often more informative than the aggregate history stats. Always check per-environment pass rates before concluding a test is "flaky":
- 0% pass rate on one environment, 100% on others = deterministic regression on that platform, NOT flaky. Example: a test failing on all chromium runs but passing on all electron runs is a chromium-specific bug, even if the aggregate pass rate is 58%.
- Low pass rate across all environments = genuinely flaky
- Low pass rate on one environment only = platform-specific flakiness (e.g., "worse on win/electron")
When the breakdown reveals an environment-specific pattern, call it out explicitly:
- "History: 100% failure on chromium (0/4 passed), 100% pass on electron (6/6) -- deterministic regression on chromium, not flaky"
- "History: known flaky across all platforms, worst on win/electron (88% pass rate)"
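The per-environment rules above can be expressed as a small classifier. This is only a sketch: the entry shape {environment, passed, total}, the verdict labels, and the 0.9 cutoff for a "low" pass rate are all assumptions, since the real environment_breakdown schema is not shown here.

```javascript
// Hypothetical helper. ASSUMPTIONS: each breakdown entry looks like
// { environment, passed, total }, and "low" means a pass rate below
// lowThreshold (0.9 is an arbitrary illustrative cutoff).
function classifyEnvironmentBreakdown(breakdown, lowThreshold = 0.9) {
  const rates = breakdown
    .filter((entry) => entry.total > 0)
    .map((entry) => ({
      environment: entry.environment,
      passRate: entry.passed / entry.total,
    }));
  const names = (list) => list.map((r) => r.environment);

  if (rates.length === 0) return { verdict: "no_data", environments: [] };

  const failing = rates.filter((r) => r.passRate === 0);
  const rest = rates.filter((r) => r.passRate > 0);
  const low = rates.filter((r) => r.passRate < lowThreshold);

  // Not covered by the rules above: failing everywhere is consistent, not flaky.
  if (failing.length === rates.length) {
    return { verdict: "failing_everywhere", environments: names(failing) };
  }
  // 0% on some environments, 100% on all others: deterministic regression.
  if (failing.length > 0 && rest.every((r) => r.passRate === 1)) {
    return { verdict: "deterministic_platform_regression", environments: names(failing) };
  }
  // Low everywhere: genuinely flaky.
  if (low.length === rates.length) {
    return { verdict: "flaky_everywhere", environments: names(low) };
  }
  // Low on only some environments: platform-specific flakiness.
  if (low.length > 0) {
    return { verdict: "platform_specific_flakiness", environments: names(low) };
  }
  return { verdict: "stable", environments: [] };
}
```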
History line format
Include a History line in each failure's analysis, e.g.:
- "History: failed 4/18 runs (22%) over last 14 days, same error pattern -- known flaky"
- "History: passed 15/15 runs over last 14 days -- new regression"
- "History: 0% pass rate on chromium (10/10 failed since Apr 02), 100% on electron -- deterministic platform regression"
- "History: no data available (API unreachable)"
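A formatter that produces lines in the style of the examples above could be sketched as follows. The function name and its input fields (failed, total, lookbackDays, pattern, verdict) are hypothetical and are not fields of the API response.

```javascript
// Hypothetical helper: renders a History line in the style of the examples.
// `stats` fields are illustrative names, not the API's response schema.
function formatHistoryLine(stats, reason = "API unreachable") {
  if (!stats || !stats.total) {
    return `History: no data available (${reason})`;
  }
  const { failed, total, lookbackDays, pattern, verdict } = stats;
  const tail =
    (pattern ? `, ${pattern}` : "") + (verdict ? ` -- ${verdict}` : "");
  if (failed === 0) {
    return `History: passed ${total}/${total} runs over last ${lookbackDays} days${tail}`;
  }
  const percent = Math.round((failed / total) * 100);
  return `History: failed ${failed}/${total} runs (${percent}%) over last ${lookbackDays} days${tail}`;
}
```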
Step 7: Analyze and Present Results
For each failure (or group of related failures), apply the shared analysis rubric to determine its root-cause category, a 1-2 sentence evidence-based explanation, and a suggested action. rubric.md is the single source of truth for the root-cause categories, the evidence-reading order (screenshots, trace timeline, test source, and the error-context page snapshot -- read FIRST for any locator/visibility/attribute/text failure), the locator-drift-vs-product-regression decision, historical-data interpretation, and head-commit correlation. The same file is injected verbatim into the analyzer Action's system prompt, so local skill runs and the Action reason identically -- edit the rubric there, not here.
Include a Commit line in the detailed analysis when the head commit is relevant (per the rubric), e.g. "Commit: modified notebookCellList.ts (notebook cell rendering) -- plausible cause" or "Commit: no files related to this test's feature area -- unlikely cause".
Additional repo context
Also use context from the repo when helpful:
- Read the failing test file to understand what it does
- Check git log for recent changes to the test or related product code beyond the head commit
- Search for related issues
Key log files to check:
- window1/renderer.log - Main window renderer process logs
- window1/exthost/exthost.log - Extension host logs
- window1/exthost/positron.positron-supervisor/Python Kernel.log - Python kernel logs
- window1/exthost/positron.positron-r/R Language Pack.log - R runtime logs
- e2e-test-runner.log - Test runner output
- main.log - Electron main process logs
For each failure, include the platform (OS and project/browser) where it occurred. This information comes from:
- Path A: The project name (e.g., e2e-windows, e2e-electron, e2e-chromium) and the workflow name
- Path B: The job name (e.g., "electron (macOS)", "electron (ubuntu)") and Playwright project in the test output (e.g., [e2e-macOS-ci])
When multiple projects/platforms are analyzed in a single run, note which platforms each failure occurred on and whether the same test passed on other platforms.
Present the analysis in a summary table that includes columns for: test name, platform, root cause category, and severity. In the severity column, clearly distinguish tests that failed all retries (hard failures) from tests that passed on retry (flaky). This distinction comes from comparing failures (final failures after all retries) vs failedTests (all attempts including those that recovered). Then provide detailed analysis for each failure below the table.
Include non-e2e job failures (unit tests, integration tests, build failures) in the summary table as well, with the job name as the test name and a brief description of the failure extracted from the job logs.
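The hard-failure vs flaky distinction for the severity column can be derived by comparing the two arrays, roughly as sketched here. The function name and severity labels are hypothetical, and the sketch assumes file plus title identifies the same test in both failures and failedTests.

```javascript
// Hypothetical helper. ASSUMPTION: file + title is a stable identity for a
// test across the `failures` and `failedTests` arrays.
function classifySeverity(output) {
  const keyOf = (t) => `${t.file}::${t.title}`;
  const hardKeys = new Set((output.failures ?? []).map(keyOf));
  const rows = new Map();

  // Every test with at least one failed attempt gets exactly one row.
  for (const t of [...(output.failedTests ?? []), ...(output.failures ?? [])]) {
    const key = keyOf(t);
    if (rows.has(key)) continue;
    rows.set(key, {
      title: t.title,
      file: t.file,
      // In `failures` means it failed all retries; otherwise it recovered.
      severity: hardKeys.has(key) ? "hard failure" : "flaky (passed on retry)",
    });
  }
  return [...rows.values()];
}
```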
Offer to:
- Open the relevant test files
- Search for related recent changes
- Create GitHub issues
Cleanup
Path A and Path B: If you used --cleanup with e2e-process-project.js / e2e-process-s3.js, the intermediate download/unzip dirs are already removed. Only the --output-dir remains (screenshots and error-context). Remove it with exact paths (no globs):
rm -rf /tmp/e2e-analysis-<PROJECT_OR_JOB_LABEL>
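If cleanup is scripted rather than typed, a guard that enforces the "exact paths, no globs" rule could look like this sketch. Both helper names are hypothetical; the /tmp/e2e-analysis- prefix is the output directory convention used by the commands in this document.

```javascript
const fs = require("node:fs");

// Hypothetical guard: accepts only a literal /tmp/e2e-analysis-<label>
// directory and rejects anything containing glob characters.
function assertExactOutputDir(dir) {
  if (/[*?\[\]{}]/.test(dir)) {
    throw new Error(`Refusing glob pattern: ${dir}`);
  }
  if (!/^\/tmp\/e2e-analysis-[^/]+\/?$/.test(dir)) {
    throw new Error(`Unexpected output dir: ${dir}`);
  }
  return dir;
}

// Removes one validated output directory; a missing directory is not an error.
function removeOutputDir(dir) {
  fs.rmSync(assertExactOutputDir(dir), { recursive: true, force: true });
}
```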