investigate-flaky-test


Investigate a RediSearch flaky-test report from a Jira key, CI failure, test id, or local logs. Use this to collect evidence, identify a supported root cause, and propose a real fix; if the evidence is insufficient, say so and ask for the missing data instead of suggesting workaround-only fixes or skip_until.

RediSearch, by RediSearch. Updated 5/29/2026


Investigate Flaky Test

Investigate a flaky-test report and propose a proper fix only when the evidence supports it.

Arguments

$ARGUMENTS may contain any of:

  • Jira issue key, usually MOD-...
  • GitHub Actions run URL, job URL, PR URL, or run id
  • Test id, usually test_file:testName or test_file.py::test_name
  • Local log file or failure excerpt
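
The input kinds listed above can be told apart with simple pattern matching. The following is a minimal sketch, not part of the skill itself: the function name `classify_arg`, the labels it prints, and the exact patterns are assumptions derived from the examples in the list. Ambiguous inputs (for example, a failure excerpt that happens to contain a colon) will be mislabeled, so treat the result as a first guess.

```shell
# Hypothetical helper: guess which kind of input a single argument is.
# Patterns are assumptions based on the examples in the Arguments list.
classify_arg() {
  case "$1" in
    https://github.com/*/actions/runs/*/job/*) echo "job-url" ;;
    https://github.com/*/actions/runs/*)       echo "run-url" ;;
    https://github.com/*/pull/*)               echo "pr-url" ;;
    MOD-[0-9]*)                                echo "jira-key" ;;
    "")                                        echo "unknown" ;;
    *[!0-9]*)
      # Not purely numeric: an existing file is treated as a local log,
      # and anything containing a colon is treated as a test id.
      if [ -f "$1" ]; then
        echo "log-file"
      else
        case "$1" in
          *:*) echo "test-id" ;;
          *)   echo "excerpt-or-unknown" ;;
        esac
      fi ;;
    *) echo "run-id" ;;   # digits only
  esac
}
```

For example, `classify_arg MOD-1234` prints `jira-key`, and `classify_arg test_file.py::test_name` prints `test-id`.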

Instructions

1. Gather Evidence

Collect enough context to reason from facts:

  • Jira description and comments, if a Jira key is provided
  • GitHub Actions run/job logs and Test Logs ... artifacts, if a CI URL/run id is provided
  • Failure excerpt, stack trace, Redis server logs, Rust panic/backtrace, C backtrace, and INFO sections
  • Failed test source and nearby helpers
  • Relevant production code touched by the test path
  • Similar Jira issues or failures for the same test or same failure signature
  • Recent related changes, when they can explain a regression

For GitHub Actions failures, prefer the GitHub CLI (gh):

gh run view <run_id> --repo RediSearch/RediSearch --json url,headBranch,headSha,event,jobs
gh run view <run_id> --repo RediSearch/RediSearch --log-failed
mkdir -p /tmp/redisearch-flaky-<run_id>
gh run download <run_id> --repo RediSearch/RediSearch --dir /tmp/redisearch-flaky-<run_id>

CI failed-test artifacts commonly contain tests/**/logs/*.log*, bin/**/redisearch.so, and bin/**/redisearch.so.debug.
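
Once the artifacts are downloaded, the log files can be located and searched for a failure signature. This is a rough sketch under the assumption that logs sit in a `logs` directory somewhere below the download directory, as the path pattern above suggests; the function names are hypothetical.

```shell
# Hypothetical helper: list log files inside a downloaded artifact directory.
list_artifact_logs() {
  find "$1" -type f -path '*/logs/*' -name '*.log*' | sort
}

# Hypothetical helper: print the log files that contain a given signature.
# $1 is the artifact directory, $2 is a basic regular expression for grep.
logs_with_signature() {
  list_artifact_logs "$1" | while IFS= read -r f; do
    grep -l -e "$2" "$f"
  done
}
```

A typical call would be `logs_with_signature /tmp/redisearch-flaky-<run_id> 'panicked at'`, using whatever signature the failed job log actually shows.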

2. Classify The Failure

Classify the failure before proposing a fix:

  • Assertion race or nondeterministic ordering
  • Timeout or performance budget issue
  • Redis/RLTest lifecycle issue, such as save/reload, expiration, cursor reap, or server shutdown
  • Coordinator/cluster-only behavior
  • Sanitizer, crash, panic, or memory safety issue
  • Environment or runner issue
  • Unknown or insufficient evidence

Separate proven facts from inference. Use language like "the logs show" for facts and "likely" or "possible" for hypotheses.
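
A first-pass hint can be derived from a log while keeping fact and hypothesis apart. The sketch below is only a heuristic: the function name is hypothetical and the signatures it searches for (AddressSanitizer reports, Rust "panicked at" messages, the Redis crash-report banner, and timeout wording) are assumptions about what the logs may contain. It prints the matching line as the fact and labels the class as "possible"; it covers only two of the classes above and is no substitute for reading the log.

```shell
# Hypothetical helper: print the first matching log line (fact) and a
# tentative classification (hypothesis). Runs in a subshell so that its
# variables and the early exits stay local to the function.
hint_from_log() (
  log=$1
  line=$(grep -n -E 'AddressSanitizer|panicked at|REDIS BUG REPORT START' "$log" | head -n 1)
  if [ -n "$line" ]; then
    printf 'the logs show: %s\n' "$line"
    printf 'possible: sanitizer, crash, panic, or memory safety issue\n'
    exit 0
  fi
  line=$(grep -n -i -E 'timed out|timeout' "$log" | head -n 1)
  if [ -n "$line" ]; then
    printf 'the logs show: %s\n' "$line"
    printf 'possible: timeout or performance budget issue\n'
    exit 0
  fi
  printf 'unknown or insufficient evidence\n'
)
```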

3. Find The Root Cause

Trace from the failing assertion or exception to the code path that can produce it:

  • For Python flow tests, inspect the exact test, fixtures, helper decorators, Env() settings, cluster skips, and waits.
  • For C/C++ unit tests, inspect the failing assertion, setup/teardown, thread usage, and decoded stack traces.
  • For Rust tests or panics, inspect the panic message, backtrace, unsafe boundaries, and FFI wrappers.
  • For coordinator failures, compare standalone and coordinator paths and check shard placement.
  • For timeouts, distinguish slow-but-correct behavior from a hang, deadlock, polling bug, or excessive test data size.

Avoid stopping at "probably flaky" when the logs identify a specific race, missing wait, lifecycle conflict, or environment failure.

4. Propose A Real Fix Or Say Evidence Is Insufficient

Propose a root-cause fix only when the cause is supported well enough to justify a code or test change. If the logs, source path, and failure signature do not support a specific cause, do not claim one.

A good fix proposal includes:

  • Root cause, with evidence
  • Exact target files/functions
  • Behavior change
  • Why this is preferable to a workaround
  • Verification commands
  • Remaining risk

If the cause is unclear, say so plainly. Do not suggest workaround-only fixes, quarantine, or skip_until from this skill. Instead, list the missing data needed, such as:

  • Full Test Logs ... artifact
  • Failed job log with timestamps
  • Reproduction command and seed/config
  • Server log around the assertion or crash
  • A second occurrence to compare signatures
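
Before asking for more data, it can help to check what the downloaded artifacts already contain. This is a minimal sketch with a hypothetical function name; the three path patterns are taken from the artifact contents described in step 1, and the check says nothing about whether the files are complete.

```shell
# Hypothetical helper: report which commonly expected artifact files are
# absent from a download directory. Exits non-zero if anything is missing.
# Runs in a subshell so its variables do not leak into the caller.
check_artifacts() (
  dir=$1
  missing=0
  for pattern in '*/logs/*.log*' '*/redisearch.so' '*/redisearch.so.debug'; do
    if [ -z "$(find "$dir" -type f -path "$pattern" | head -n 1)" ]; then
      echo "missing: $pattern"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)
```

Anything reported as missing can go straight into the list of data to request.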

When the current evidence is not enough but the next failure could be made more informative, recommend a diagnostic PR instead of a workaround. A valid recommendation is to add focused debug logs, counters, state dumps, or richer test assertion messages, merge or run that instrumentation in CI for a few days, and use the next flaky occurrence to identify the root cause. Do not present this as "add logs and rerun once to find the root cause right away"; a focused local rerun can help, but the main goal is better evidence when the intermittent failure happens again.

5. Verification Plan

Choose verification based on the proposed fix:

  • Python flow test:
    ./build.sh RUN_PYTEST ENABLE_ASSERT=1 TEST_TIMEOUT=20 TEST="<test_file>:<test_name>"
    
  • Coordinator-specific flow test:
    REDIS_STANDALONE=0 ./build.sh RUN_PYTEST ENABLE_ASSERT=1 TEST_TIMEOUT=20 TEST="<test_file>:<test_name>"
    
  • C/C++ unit test:
    ./build.sh RUN_UNIT_TESTS ENABLE_ASSERT=1 TEST=<unit_test_name>
    
  • Rust test:
    cargo nextest run --manifest-path src/redisearch_rs/Cargo.toml -p <crate_name> <test_filter>
    

If the failure is timing-sensitive, recommend repeating the focused test enough times to gain confidence, and include any required environment variables from the CI failure.
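
Repeating a focused test can be scripted with a small loop. The helper below is a sketch, not something the repository provides: the name `repeat_test` is hypothetical, and the number of runs is up to the investigator. It keeps each run's output in a temporary directory so that a failing iteration can be inspected afterwards.

```shell
# Hypothetical helper: run a command N times, keep per-run logs, and report
# how many runs failed. Exits non-zero if any run failed.
# Runs in a subshell so its variables do not leak into the caller.
repeat_test() (
  runs=$1
  shift
  logdir=$(mktemp -d)
  failures=0
  i=1
  while [ "$i" -le "$runs" ]; do
    # Capture stdout and stderr of each run in its own file.
    if ! "$@" >"$logdir/run-$i.log" 2>&1; then
      failures=$((failures + 1))
      echo "run $i failed: $logdir/run-$i.log"
    fi
    i=$((i + 1))
  done
  echo "$failures/$runs runs failed (logs in $logdir)"
  [ "$failures" -eq 0 ]
)
```

For example, `repeat_test 20 ./build.sh RUN_PYTEST ENABLE_ASSERT=1 TEST_TIMEOUT=20 TEST="<test_file>:<test_name>"` runs the flow test 20 times (an arbitrary count). Environment variables from the CI failure need to go through `env`, as in `repeat_test 20 env REDIS_STANDALONE=0 ./build.sh ...`, because a bare `VAR=value` prefix is not a command.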

Report Back

End with:

  • Classification
  • Evidence-backed root cause, or "insufficient evidence"
  • Proposed real fix, if supported
  • Verification plan
  • Missing data, if blocked