close-flaky-issues - SKILL.md Agent Skill

name: close-flaky-issues description: Audit open "flaky test" GitHub issues and close those whose tests are no longer failing on master. Cross-references CI history from play.clickhouse.com with git log to attribute fixes. argument-hint: "[threshold-days] [dry-run]" disable-model-invocation: false allowed-tools: Bash, Read, Grep, Glob

Close Flaky Test Issues Skill

Arguments

$0 (optional): number of days the test must be failure-free on master before it is eligible for closing. Default: 14.
$1 (optional): pass dry-run to print the table of close candidates and stop without calling gh issue close. Any other value (or none) runs the full sweep.

Goal

Sweep open flaky test issues and close ones where:

The test has 0 failures on master within the threshold window, AND
(Ideally) a fix commit can be linked to the closure.

These issues are auto-generated by the CI bot — closing one is safe because the bot reopens automatically if the test fails again on master.

Steps

1. List open issues

mkdir -p tmp/flaky
gh issue list --repo ClickHouse/ClickHouse --label "flaky test" --state open \
  --limit 200 --json number,title,body > tmp/flaky/issues.json

Each body has the format:

Test name: <test_name>
Failure reason: <reason>
CI report: <url>
Failing test history: <play.clickhouse.com link with base64 SQL>

Extract (number, test_name) pairs:

jq -r '.[] | [.number, ((.body | capture("Test name: (?<n>.+)")? | .n) // "")] | @tsv' tmp/flaky/issues.json

The trailing ? after capture is required: without it, an issue with a null body or one that doesn't contain Test name: raises an error that aborts the whole pipeline and silently drops every subsequent issue. With ?, those rows produce an empty string instead, and you can fall back to the title or to the base64 SQL after # in the Failing test history URL — some old issues use the title rather than Test name: in the body.

2. Query CI failure history

Run one batch query against play.clickhouse.com for all extracted test names. The checks table is publicly readable via the play user.

Before running the query, guard against an empty list. If jq extracted no non-empty test names (zero open flaky test issues, or all bodies missing Test name:), test_name IN () is invalid SQL and aborts the sweep. Skip the query in that case and report "nothing to close".

Escape single quotes in test names before interpolation. Per ClickHouse SQL rules, a literal ' inside a string literal is written as ''. Names that contain a quote (e.g. parametrized integration tests like test_foo[a'b]) will otherwise produce invalid SQL or, worse, parse as a different list. Apply s/'/''/g to each extracted name before joining them with , .

curl -sS 'https://play.clickhouse.com/?user=play' --data-binary "
SELECT test_name,
       countIf(test_status IN ('FAIL','ERROR')) AS failures_90d,
       countIf(test_status IN ('FAIL','ERROR') AND check_start_time >= now() - INTERVAL 30 DAY) AS failures_30d,
       countIf(test_status IN ('FAIL','ERROR') AND check_start_time >= now() - INTERVAL 14 DAY) AS failures_14d,
       countIf(test_status IN ('FAIL','ERROR') AND check_start_time >= now() - INTERVAL 7 DAY)  AS failures_7d,
       maxIf(check_start_time, test_status IN ('FAIL','ERROR'))                                  AS last_fail
FROM checks
WHERE check_start_time >= now() - INTERVAL 90 DAY
  AND test_name IN ( '<test1>', '<test2>', ... )
  AND (pull_request_number = 0 OR base_ref IN ('master',''))
GROUP BY test_name
ORDER BY failures_7d DESC, failures_14d DESC
FORMAT TabSeparatedWithNames
"

Important:

pull_request_number = 0 OR base_ref IN ('master','') filters to master commits and PRs targeting master — fork-PR noise is dropped. The empty-string case covers older rows where base_ref was not populated for direct-master runs.
Tests with 0 rows returned have never failed on master in the lookback window — those are also closeable.
last_fail defaults to 1970-01-01 if the test never failed.

3. Categorize

For each issue, decide based on failures_<threshold-days>d:

0 failures in the window → close candidate
>= 1 failure in the window → keep open

Default threshold is 14 days. Use 7 days for an aggressive sweep, 30 days for a conservative one.

4. Find the fix commit (best-effort)

For each close candidate, search git log:

# Look for "fix flaky" commits naming the test
git log origin/master --since="<date 90 days ago>" --format="%H %s" \
  --grep="<test_name_fragment>" -i

# Look at recent direct modifications to the test file
git log origin/master --since="<date 90 days ago>" --format="%H %s" -- "<test_path>"

Not every close candidate has an explicit fix commit — some stabilize via surrounding infra changes. That's fine; close those too, just without a commit reference.

For integration tests (test_X/test.py::...), search the directory: tests/integration/test_X/. For stateless tests (NNNNN_name), search: tests/queries/0_stateless/NNNNN_name.*.

5. Print the scope, then proceed

Before issuing gh issue close, print a table showing:

Issues to close (with linked fix commit if found)
Issues to keep open (with failure count)

The user already authorized closing when they invoked the skill, so do not ask follow-up questions — proceed to step 6 immediately after printing. The table is for after-the-fact spot-checking, not for interactive approval. If the user wants a preview without changes, they can pass dry-run as a second argument and the skill should stop here.

6. Close issues

For each closeable issue, post a closing comment that names what fixed it and how recent the last failure was:

# With fix commit linked
gh issue close <N> --repo ClickHouse/ClickHouse \
  --comment "Fixed by <SHA> (\`<commit subject>\`). No failures on master in the last <D> days (last seen <YYYY-MM-DD>). Closing."

# Without a specific fix
gh issue close <N> --repo ClickHouse/ClickHouse \
  --comment "No failures on master in the last <D> days (last seen <YYYY-MM-DD>). Closing as no longer flaky; will reopen automatically by CI bot if it fails again."

Use full 40-char SHAs in comments — short SHAs can become ambiguous later.

Wrap test names, commit subjects, and identifiers in backticks per the project's CLAUDE.md style rule.

7. Report

End-of-task summary: how many were closed (split into "with fix linked" vs "stabilized without fix"), how many remain open, and whether any high-frequency offenders are left (e.g. tests still failing >5 times per week — those may need an owner ping rather than a sweep).

Rules

Never modify the issue body or labels. The body says DO NOT modify the body content. DO NOT remove labels. because CI matches against them. Closing with a comment is the only allowed action.
Don't close issues with any failure inside the threshold window. Even one recent failure means it's still flaky.
Don't reopen issues that were closed manually with rationale — let the CI bot do that.
Use tmp/flaky/ for working files, never /tmp (per CLAUDE.md).
Default to failures_14d == 0. Tests can have multi-week dormancy and then reappear — 14 days is a reasonable balance between cleaning up cruft and avoiding premature closure.
Don't dismiss issues based on the test file existing/not existing alone. A test can exist and still be passing; a test can be removed and the issue not yet closed. Failure history is the source of truth.