ci-diagnostics - SKILL.md Agent Skill

name: ci-diagnostics
description: Diagnose Proton CI failures and performance comparison results from GitHub checks and uploaded reports. Make sure to use this skill whenever CI checks are mentioned, a PR has red or failing checks, the user pastes a CI URL, or asks about test failures in the pipeline, even if they just ask 'why is CI failing'.

CI Diagnostics

Inputs

PR number or URL
optional check name substring
optional direct report URL

First step: collect check status

For a PR:

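# PR metadata and the overall check table
gh pr view "$PR" --json title,body,url
gh pr checks "$PR"
# Repository slug (owner/name) and the SHA of the PR's latest commit
REPO=$(gh repo view --json nameWithOwner --jq .nameWithOwner)
SHA=$(gh pr view "$PR" --json commits --jq '.commits[-1].oid')
# Combined commit status, including the per-context target_url links
gh api "repos/$REPO/commits/$SHA/status"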

Use the commit status payload to find:

failing or pending contexts
target_url links for uploaded HTML reports, raw logs, or performance artifacts
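
A minimal sketch of narrowing that payload to the contexts that need attention. It assumes the response has the usual combined-status shape: a `statuses` array whose entries carry `state`, `context`, and `target_url`. The function name `list_bad_contexts` is hypothetical; `--jq` is the filter flag built into `gh api`.

```shell
# Hypothetical helper: print "state<TAB>context<TAB>target_url" for every
# context on a commit whose state is not "success" (failure, error, pending).
# $1 = owner/name repository slug, $2 = commit SHA
list_bad_contexts() {
    gh api "repos/$1/commits/$2/status" \
        --jq '.statuses[] | select(.state != "success") | [.state, .context, .target_url] | @tsv'
}

# Usage (needs an authenticated gh):
#   list_bad_contexts "$REPO" "$SHA"
```

Tab-separated output keeps each context on one line, so it can be piped to `grep` when an optional check name substring was given.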

Proton report layout

CI uploads reports under:

<pr-number>/<commit-sha>/<normalized-check-name>...

The normalization logic is defined in:

To normalize a check name, lowercase it and apply the replacements for spaces, "(", ")", and ",".
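
As a rough sketch, the normalization and the resulting report prefix could look like the following. The replacement character is not stated above, so the underscore used here is an assumption to verify against the real normalization code. Both function names and the sample check name are hypothetical.

```shell
# Hypothetical helper: lowercase a check name and replace spaces, "(", ")"
# and "," with underscores. The underscore is an ASSUMPTION; the source
# does not state the replacement character.
normalize_check_name() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr ' (),' '____'
}

# Build the <pr-number>/<commit-sha>/<normalized-check-name> prefix.
# $1 = PR number, $2 = commit SHA, $3 = check name as shown by gh
report_prefix() {
    printf '%s/%s/%s\n' "$1" "$2" "$(normalize_check_name "$3")"
}

# Usage:
#   normalize_check_name 'Stateless Tests (release, debug)'
#   prints: stateless_tests__release__debug_
```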

Failure triage workflow

List failing contexts from gh pr checks or commit statuses.
Open each target_url report first.
If the report is sparse, inspect the linked raw log.
For test reports, summarize:
- failing test names
- first common error signature
- whether the failure looks deterministic, flaky, infra, or environment-specific
Map failures back to touched areas in the diff.
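
One way to pull a first error signature out of a downloaded raw log is sketched below. The function name and the pattern list are assumptions, not part of the Proton tooling. Adjust the patterns to the messages the failing check actually emits.

```shell
# Hypothetical helper: print the first log line that looks like an error,
# prefixed with its line number. The pattern list is an ASSUMPTION.
# $1 = path to a downloaded raw log
first_error_signature() {
    grep -n -m 1 -E 'Exception|ERROR|FAILED|Segmentation fault|Timeout' "$1"
}

# Usage:
#   first_error_signature raw.log
```

Comparing the signature across several failing tests helps tell one shared root cause apart from unrelated failures.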

Performance comparison workflow

Performance comparison runs upload these artifacts:

report.html
all-queries.html
all-query-metrics.tsv
queries.rep
images/flamegraphs

If you have a report.html URL, inspect sibling artifacts by replacing the filename in the same prefix.
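
The filename swap can be sketched with plain parameter expansion. The function name is hypothetical, and the sketch assumes the URL carries no query string or fragment after the filename.

```shell
# Hypothetical helper: replace the last path segment of a URL.
# $1 = URL of a known artifact (for example .../report.html)
# $2 = sibling filename to fetch instead
sibling_artifact_url() {
    printf '%s/%s\n' "${1%/*}" "$2"
}

# Usage:
#   sibling_artifact_url "$REPORT_URL" all-query-metrics.tsv
```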

When reviewing perf results:

start from the summary in report.html
inspect all-query-metrics.tsv for the biggest client_time regressions
distinguish broad regressions from a few outlier queries
correlate with touched execution paths, joins, windows, aggregations, or storage reads
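
A hedged sketch of ranking client_time rows and counting how many exceed a threshold follows. The column layout of all-query-metrics.tsv is not described here, so the column positions are assumptions passed as parameters: metric name in column 1 and relative difference in column 4. It also assumes that a positive difference means slower. Check the real file before trusting these defaults. Both function names are hypothetical.

```shell
# Hypothetical helper: print the client_time rows with the largest difference.
# $1 = path to all-query-metrics.tsv
# $2 = metric-name column (ASSUMED 1), $3 = difference column (ASSUMED 4)
# $4 = number of rows to print (default 10)
top_client_time_regressions() {
    awk -F '\t' -v m="${2:-1}" '$m == "client_time"' "$1" |
        LC_ALL=C sort -t "$(printf '\t')" -k "${3:-4},${3:-4}nr" |
        head -n "${4:-10}"
}

# Hypothetical helper: count client_time rows whose difference exceeds a
# threshold, to help judge whether a regression is broad or narrow.
# $1 = file, $2 = metric column, $3 = difference column, $4 = threshold
count_client_time_regressions() {
    awk -F '\t' -v m="${2:-1}" -v d="${3:-4}" -v t="${4:-0.1}" \
        '$m == "client_time" && $d + 0 > t + 0 { n++ } END { print n + 0 }' "$1"
}
```

A high count relative to the total number of queries suggests a broad regression. A low count with a few large differences points to outlier queries.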

Output expectations

Always report:

failing checks
best report/log URL for each failing check
likely failure class: code bug, flaky test, infra, dependency, or timeout/resource limit
smallest next debugging action

For performance changes also report:

whether the regression is broad or narrow
the most affected workload family
whether more local benchmarking is needed before code changes