investigate-flaky-test - SKILL.md Agent Skill

name: investigate-flaky-test description: Investigate a RediSearch flaky-test report from a Jira key, CI failure, test id, or local logs. Use this to collect evidence, identify a supported root cause, and propose a real fix; if the evidence is insufficient, say so and ask for the missing data instead of suggesting workaround-only fixes or skip_until.

Investigate Flaky Test

Investigate a flaky-test report and propose a proper fix only when the evidence supports it.

Arguments

$ARGUMENTS may contain any of:

Jira issue key, usually MOD-...
GitHub Actions run URL, job URL, PR URL, or run id
Test id, usually test_file:testName or test_file.py::test_name
Local log file or failure excerpt

Instructions

1. Gather Evidence

Collect enough context to reason from facts:

Jira description and comments, if a Jira key is provided
GitHub Actions run/job logs and Test Logs ... artifacts, if a CI URL/run id is provided
Failure excerpt, stack trace, Redis server logs, Rust panic/backtrace, C backtrace, and INFO sections
Failed test source and nearby helpers
Relevant production code touched by the test path
Similar Jira issues or failures for the same test or same failure signature
Recent related changes, when they can explain a regression

For GitHub Actions failures, prefer gh:

gh run view <run_id> --repo RediSearch/RediSearch --json url,headBranch,headSha,event,jobs
gh run view <run_id> --repo RediSearch/RediSearch --log-failed
mkdir -p /tmp/redisearch-flaky-<run_id>
gh run download <run_id> --repo RediSearch/RediSearch --dir /tmp/redisearch-flaky-<run_id>

CI failed-test artifacts commonly contain tests/**/logs/*.log*, bin/**/redisearch.so, and bin/**/redisearch.so.debug.

2. Classify The Failure

Classify the failure before proposing a fix:

Assertion race or nondeterministic ordering
Timeout or performance budget issue
Redis/RLTest lifecycle issue, such as save/reload, expiration, cursor reap, or server shutdown
Coordinator/cluster-only behavior
Sanitizer, crash, panic, or memory safety issue
Environment or runner issue
Unknown or insufficient evidence

Separate proven facts from inference. Use language like "the logs show" for facts and "likely" or "possible" for hypotheses.

3. Find The Root Cause

Trace from the failing assertion or exception to the code path that can produce it:

For Python flow tests, inspect the exact test, fixtures, helper decorators, Env() settings, cluster skips, and waits.
For C/C++ unit tests, inspect the failing assertion, setup/teardown, thread usage, and decoded stack traces.
For Rust tests or panics, inspect the panic message, backtrace, unsafe boundaries, and FFI wrappers.
For coordinator failures, compare standalone and coordinator paths and check shard placement.
For timeouts, distinguish slow-but-correct behavior from a hang, deadlock, polling bug, or excessive test data size.

Avoid stopping at "probably flaky" when the logs identify a specific race, missing wait, lifecycle conflict, or environment failure.

4. Propose A Real Fix Or Say Evidence Is Insufficient

Propose a root-cause fix only when the cause is supported well enough to justify a code or test change. If the logs, source path, and failure signature do not support a specific cause, do not claim one.

A good fix proposal includes:

Root cause, with evidence
Exact target files/functions
Behavior change
Why this is preferable to a workaround
Verification commands
Remaining risk

If the cause is unclear, say so plainly. Do not suggest workaround-only fixes, quarantine, or skip_until from this skill. Instead, list the missing data needed, such as:

Full Test Logs ... artifact
Failed job log with timestamps
Reproduction command and seed/config
Server log around the assertion or crash
A second occurrence to compare signatures

When the current evidence is not enough but the next failure could be made more informative, recommend a diagnostic PR instead of a workaround. A valid recommendation is to add focused debug logs, counters, state dumps, or richer test assertion messages, merge or run that instrumentation in CI for a few days, and use the next flaky occurrence to identify the root cause. Do not present this as "add logs and rerun once to find the root cause right away"; a focused local rerun can help, but the main goal is better evidence when the intermittent failure happens again.

5. Verification Plan

Choose verification based on the proposed fix:

Python flow test:

./build.sh RUN_PYTEST ENABLE_ASSERT=1 TEST_TIMEOUT=20 TEST="<test_file>:<test_name>"

Coordinator-specific flow test:

REDIS_STANDALONE=0 ./build.sh RUN_PYTEST ENABLE_ASSERT=1 TEST_TIMEOUT=20 TEST="<test_file>:<test_name>"

C/C++ unit test:

./build.sh RUN_UNIT_TESTS ENABLE_ASSERT=1 TEST=<unit_test_name>

Rust test:

cargo nextest run --manifest-path src/redisearch_rs/Cargo.toml -p <crate_name> <test_filter>

If the failure is timing-sensitive, recommend repeating the focused test enough times to gain confidence, and include any required environment variables from the CI failure.

Report Back

End with:

Classification
Evidence-backed root cause, or "insufficient evidence"
Proposed real fix, if supported
Verification plan
Missing data, if blocked