diagnose - SKILL.md Agent Skill

name: diagnose description: > Root cause analysis for a Quay bug. Traces code paths through Quay subsystems, analyzes git history, forms and tests hypotheses, and enumerates complete state space. allowed-tools: - Bash(git log ) - Bash(git blame ) - Bash(git show ) - Bash(gh pr view ) - Read - Glob - Grep - Agent - AskUserQuestion

Diagnose Root Cause

Determine why the bug exists. This is the most critical phase — a wrong diagnosis leads to a wrong fix.

Your Role

Perform thorough root cause analysis that provides clear, evidence-based conclusions. Identify the underlying issue, not just the symptom.

Process

Step 1: Review Reproduction

Read the reproduction report (artifacts/quay-bugfix/reports/reproduction.md if it exists):

Understand the exact conditions that trigger the bug
Note any patterns or edge cases discovered
Identify the entry point for investigation

Step 2: Trace the Code Path

Starting from the reproduction steps, trace the execution flow through Quay's subsystems. Use the subsystem map from CLAUDE.md:

For API bugs: Entry point in endpoints/ -> business logic in data/model/ -> database operations in data/database.py -> query execution

For worker bugs: Worker class in workers/ -> queue operations -> data layer -> database

For UI bugs: React component in web/src/ -> API hook in web/src/hooks/ -> REST endpoint in endpoints/

For auth bugs: Auth middleware in auth/ -> permission checks -> data layer

For storage bugs: Storage backend in storage/ -> blob operations -> configuration

Use file:line notation for every code reference (e.g., endpoints/api/repository.py:245).

Step 3: Historical Analysis

git log --oneline -20 -- <affected-files>
git blame <file> -L <start>,<end>

Look for: recent changes that introduced the bug, related PRs, patterns of similar fixes.

Step 4: Hypothesis Formation

List all potential root causes based on evidence:

Rank hypotheses by likelihood (high/medium/low confidence)
Consider: logic errors, race conditions, edge cases, missing validation, incorrect state transitions, ORM query issues
Document reasoning for each hypothesis

Step 5: Hypothesis Testing

Identify where targeted logging or debugging would confirm/refute each hypothesis (note locations for the fix phase)
Design minimal test cases that would validate or disprove each hypothesis (document them for the test phase)
Use binary search (git bisect) if the change was introduced gradually
Narrow down to the definitive root cause

Step 6: State Enumeration

CRITICAL: If the bug involves state-dependent logic (status fields, phase transitions, feature flags, configuration options):

Search the codebase for the complete list of possible values
Don't assume you know all states — verify by searching
Document feature interactions that affect the bug

Example: If a worker stops processing on "terminal" statuses, search for ALL statuses used in the codebase, not just the ones in the bug report.

Step 7: Impact Assessment

What other code paths are affected by the same root cause?
Could the fix cause regressions in related functionality?
Are there similar patterns elsewhere that have the same bug?
Does this affect database migrations or schema?

Step 8: Solution Approach

Recommend fix strategy based on root cause
Consider multiple approaches and their trade-offs
Document why the recommended approach is best

Output

Save to artifacts/quay-bugfix/analysis/root-cause.md:

# Root Cause Analysis: <TICKET>

## Root Cause
<Clear, specific explanation of why the bug exists>

## Code References
- `file:line` — <what this code does wrong>
- ...

## Evidence
- <supporting evidence from code, git history, reproduction>

## Timeline
<When the bug was introduced — commit/PR reference>

## Impact
- **Scope:** <what else is affected>
- **Risk:** <could the fix cause regressions>
- **Related patterns:** <similar code that may have the same issue>

## States/Conditions Enumerated
<Complete list of states/values found by searching, not assumed>

## Recommended Fix
<Specific approach with code locations to change>

## Alternative Approaches
<Other solutions with pros/cons>

Confidence Assessment

Include this section at the end of your root cause artifact:

## Confidence Assessment
- **Level**: high | medium | low
- **Score**: <0-100 integer>
- **Score rationale**: <1-2 sentences>
- **Open questions**: <bullet list, or "None">

Confidence signals for this phase:

High: Single root cause identified with strong code evidence and git history
Medium: Primary root cause identified but alternative explanations not fully ruled out
Low: Multiple plausible causes, insufficient evidence to distinguish, or root cause in unfamiliar subsystem

When This Phase Is Done

Report: the identified root cause, confidence level, and where the analysis was written.