name: close-flaky-issues description: Audit open "flaky test" GitHub issues and close those whose tests are no longer failing on master. Cross-references CI history from play.clickhouse.com with git log to attribute fixes. argument-hint: "[threshold-days] [dry-run]" disable-model-invocation: false allowed-tools: Bash, Read, Grep, Glob
Close Flaky Test Issues Skill
Arguments
$0(optional): number of days the test must be failure-free on master before it is eligible for closing. Default:14.$1(optional): passdry-runto print the table of close candidates and stop without callinggh issue close. Any other value (or none) runs the full sweep.
Goal
Sweep open flaky test issues and close ones where:
- The test has 0 failures on master within the threshold window, AND
- (Ideally) a fix commit can be linked to the closure.
These issues are auto-generated by the CI bot — closing one is safe because the bot reopens automatically if the test fails again on master.
Steps
1. List open issues
mkdir -p tmp/flaky
gh issue list --repo ClickHouse/ClickHouse --label "flaky test" --state open \
--limit 200 --json number,title,body > tmp/flaky/issues.json
Each body has the format:
Test name: <test_name>
Failure reason: <reason>
CI report: <url>
Failing test history: <play.clickhouse.com link with base64 SQL>
Extract (number, test_name) pairs:
jq -r '.[] | [.number, ((.body | capture("Test name: (?<n>.+)")? | .n) // "")] | @tsv' tmp/flaky/issues.json
The trailing ? after capture is required: without it, an issue with a null body or one that doesn't contain Test name: raises an error that aborts the whole pipeline and silently drops every subsequent issue. With ?, those rows produce an empty string instead, and you can fall back to the title or to the base64 SQL after # in the Failing test history URL — some old issues use the title rather than Test name: in the body.
2. Query CI failure history
Run one batch query against play.clickhouse.com for all extracted test names. The checks table is publicly readable via the play user.
Before running the query, guard against an empty list. If jq extracted no non-empty test names (zero open flaky test issues, or all bodies missing Test name:), test_name IN () is invalid SQL and aborts the sweep. Skip the query in that case and report "nothing to close".
Escape single quotes in test names before interpolation. Per ClickHouse SQL rules, a literal ' inside a string literal is written as ''. Names that contain a quote (e.g. parametrized integration tests like test_foo[a'b]) will otherwise produce invalid SQL or, worse, parse as a different list. Apply s/'/''/g to each extracted name before joining them with , .
curl -sS 'https://play.clickhouse.com/?user=play' --data-binary "
SELECT test_name,
countIf(test_status IN ('FAIL','ERROR')) AS failures_90d,
countIf(test_status IN ('FAIL','ERROR') AND check_start_time >= now() - INTERVAL 30 DAY) AS failures_30d,
countIf(test_status IN ('FAIL','ERROR') AND check_start_time >= now() - INTERVAL 14 DAY) AS failures_14d,
countIf(test_status IN ('FAIL','ERROR') AND check_start_time >= now() - INTERVAL 7 DAY) AS failures_7d,
maxIf(check_start_time, test_status IN ('FAIL','ERROR')) AS last_fail
FROM checks
WHERE check_start_time >= now() - INTERVAL 90 DAY
AND test_name IN ( '<test1>', '<test2>', ... )
AND (pull_request_number = 0 OR base_ref IN ('master',''))
GROUP BY test_name
ORDER BY failures_7d DESC, failures_14d DESC
FORMAT TabSeparatedWithNames
"
Important:
pull_request_number = 0 OR base_ref IN ('master','')filters to master commits and PRs targeting master — fork-PR noise is dropped. The empty-string case covers older rows wherebase_refwas not populated for direct-master runs.- Tests with 0 rows returned have never failed on master in the lookback window — those are also closeable.
last_faildefaults to1970-01-01if the test never failed.
3. Categorize
For each issue, decide based on failures_<threshold-days>d:
0failures in the window → close candidate>= 1failure in the window → keep open
Default threshold is 14 days. Use 7 days for an aggressive sweep, 30 days for a conservative one.
4. Find the fix commit (best-effort)
For each close candidate, search git log:
# Look for "fix flaky" commits naming the test
git log origin/master --since="<date 90 days ago>" --format="%H %s" \
--grep="<test_name_fragment>" -i
# Look at recent direct modifications to the test file
git log origin/master --since="<date 90 days ago>" --format="%H %s" -- "<test_path>"
Not every close candidate has an explicit fix commit — some stabilize via surrounding infra changes. That's fine; close those too, just without a commit reference.
For integration tests (test_X/test.py::...), search the directory: tests/integration/test_X/.
For stateless tests (NNNNN_name), search: tests/queries/0_stateless/NNNNN_name.*.
5. Print the scope, then proceed
Before issuing gh issue close, print a table showing:
- Issues to close (with linked fix commit if found)
- Issues to keep open (with failure count)
The user already authorized closing when they invoked the skill, so do not ask follow-up questions — proceed to step 6 immediately after printing. The table is for after-the-fact spot-checking, not for interactive approval. If the user wants a preview without changes, they can pass dry-run as a second argument and the skill should stop here.
6. Close issues
For each closeable issue, post a closing comment that names what fixed it and how recent the last failure was:
# With fix commit linked
gh issue close <N> --repo ClickHouse/ClickHouse \
--comment "Fixed by <SHA> (\`<commit subject>\`). No failures on master in the last <D> days (last seen <YYYY-MM-DD>). Closing."
# Without a specific fix
gh issue close <N> --repo ClickHouse/ClickHouse \
--comment "No failures on master in the last <D> days (last seen <YYYY-MM-DD>). Closing as no longer flaky; will reopen automatically by CI bot if it fails again."
Use full 40-char SHAs in comments — short SHAs can become ambiguous later.
Wrap test names, commit subjects, and identifiers in backticks per the project's CLAUDE.md style rule.
7. Report
End-of-task summary: how many were closed (split into "with fix linked" vs "stabilized without fix"), how many remain open, and whether any high-frequency offenders are left (e.g. tests still failing >5 times per week — those may need an owner ping rather than a sweep).
Rules
- Never modify the issue body or labels. The body says
DO NOT modify the body content. DO NOT remove labels.because CI matches against them. Closing with a comment is the only allowed action. - Don't close issues with any failure inside the threshold window. Even one recent failure means it's still flaky.
- Don't reopen issues that were closed manually with rationale — let the CI bot do that.
- Use
tmp/flaky/for working files, never/tmp(per CLAUDE.md). - Default to
failures_14d == 0. Tests can have multi-week dormancy and then reappear — 14 days is a reasonable balance between cleaning up cruft and avoiding premature closure. - Don't dismiss issues based on the test file existing/not existing alone. A test can exist and still be passing; a test can be removed and the issue not yet closed. Failure history is the source of truth.