name: diagnosis
description: "Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged. Provides a triage-first diagnostic routing framework: classify the failure, collect the right evidence, choose a technique, track confidence, and escalate when stuck. Do NOT use for executing scientific debugging after triage (use debugging), code-quality review (use code-review), or proactive observability setup. Do NOT use for actually execute scientific-method debugging on this stack trace. Do NOT use for review this AI-generated PR for correctness. Do NOT use for scan this repo for OWASP top 10 vulnerabilities. Do NOT use for design observability instrumentation for this service. Do NOT use for decide which agent should pick up this ticket. Do NOT use for what's the right test pyramid for this feature."
license: MIT
compatibility: "Language- and stack-agnostic. The classification taxonomy, evidence protocol, and confidence ladder apply to any software failure investigation; specific technique names (git bisect, EXPLAIN plans, HMAC verification) are illustrative — substitute the equivalents of your stack."
allowed-tools: Read Grep
metadata:
relations: "{"related":["code-review","error-tracking","owasp-security","testing-strategy","debugging"],"suppresses":["debugging"],"verify_with":["debugging","a11y"]}"
subject: software-engineering-method
public: "true"
scope: "Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged. Provides a triage-first diagnostic routing framework: classify the failure, collect the right evidence, choose a technique, track confidence, and escalate when stuck. Do NOT use for executing scientific debugging after triage (use debugging), code-quality review (use code-review), or proactive observability setup."
taxonomy_domain: engineering/debugging
stability: experimental
keywords: "["diagnostic triage software failure","symptom classification taxonomy","what kind of bug is this","which debugging approach","diagnostic routing framework","evidence collection before hypothesis","confidence ladder debugging","escalation criteria debugging","cascade vs coincidence failure","environment ghost"]"
examples: "["the agent has been chasing this bug for 30 minutes — what's the structural fix?","the symptoms span data integrity and UI rendering — which is the root cause?","the build fails locally but passes in CI — how do I diagnose that class first?","I have a stack trace and an unhandled exception — what's the cheapest technique?","intermittent failure that doesn't reproduce on retry — which class is this?","we ran profiling, instrumentation, and bisect — none converge. What did we misclassify?","two engineers disagree on whether this is a config issue or a logic error — what evidence settles it?"]"
anti_examples: "["actually execute scientific-method debugging on this stack trace","review this AI-generated PR for correctness","scan this repo for OWASP top 10 vulnerabilities","design observability instrumentation for this service","decide which agent should pick up this ticket","what's the right test pyramid for this feature"]"
grounding: "{"subject_matter":"Portable software-failure diagnostic triage: evidence collection, symptom classification, technique selection, confidence tracking, escalation, and sensitive diagnostic evidence handling","grounding_mode":"universal","truth_sources":["https://sre.google/sre-book/effective-troubleshooting/","https://git-scm.com/docs/git-bisect","https://stackoverflow.com/help/minimal-reproducible-example","https://developer.chrome.com/docs/devtools/performance/overview","https://www.postgresql.org/docs/current/sql-explain.html","https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html","https://opentelemetry.io/docs/security/handling-sensitive-data/"],"failure_modes":["fixing_before_classification","hypothesis_without_baseline_evidence","wrong_technique_for_problem_class","confidence_inflation_without_verification","stuck_state_not_escalated_or_reclassified","diagnostic_evidence_captures_sensitive_or_secret_data","eval_or_routing_claim_inflated_without_run"],"evidence_priority":"equal"}"
skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
skill_graph_project: Skill Graph
skill_graph_canonical_skill: skills/software-engineering-method/diagnosis/SKILL.md
skill_graph_export_description_projection: anti_examples
Diagnosis
Concept of the skill
Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged.
Coverage
The triage-first framework that classifies a software failure into a problem class and routes it to the right diagnostic technique before root-cause investigation begins. Names nine symptom classes — Logic Error, Runtime Crash, Data Integrity, Timing / Race, Performance, Configuration, Security, Integration, Tooling / Build / Script-path — and provides a classification decision tree that walks from "is there a stack trace?" to a single class. Specifies a universal evidence-collection protocol (exact error message, reproduction steps, last-known-good state, environment facts) and class-specific evidence checklists. Lays out the technique-selection matrix — stack-trace reading, data-flow tracing, git bisect, differential comparison, instrumentation, MRE isolation, profiling, boundary probing — with each technique's time cost, best-case class, and evidence prerequisite. Defines the diagnostic confidence ladder (level 0 Symptom → 1 Classified → 2 Localized → 3 Root Cause → 4 Verified Fix) with explicit "you can say / you cannot say" boundaries at each level and stuck-state checkpoints (5-min, 10-min, 15-min, oscillation). Names escalation criteria for switching approach, switching class, or escalating to a human. Covers three cross-domain patterns where multiple classes apply simultaneously: the Cascade (one root cause, many symptoms), the Coincidence (two unrelated bugs that look like one), the Environment Ghost (works in one environment, fails in another). Catalogues diagnostic anti-patterns and ships a structured diagnostic-session template.
Philosophy of the skill
Debugging fails most often not because the engineer lacks skill, but because the wrong methodology is applied to the problem class. A timing bug needs different tools than a data-integrity bug. A scope leak needs different thinking than a rendering glitch. The most expensive debugging mistake is spending 30 minutes applying scientific-method debugging to what is actually a configuration error discoverable in 2 minutes.
This skill is the triage nurse, not the surgeon. A nurse does not treat the patient — they take vital signs, route to cardiology or neurology, and escalate to the attending physician when criteria are met. Software diagnosis works the same way: collect evidence, classify the symptom, route to the right specialist technique, and pivot when convergence stalls. The small cost of triage is almost always smaller than the cost of chasing a plausible but wrong cause. Skipping triage because "the cause is obvious" is a confirmation-bias trap; even seasoned engineers benefit from making the classification step explicit.
1. The Diagnostic Triage Protocol
Before debugging, diagnose which kind of problem you have. The class determines the technique and the technique determines the time-to-fix.
1. Collect baseline evidence (Section 3)
2. Classify the symptom (Section 2)
3. Select the diagnostic technique (Section 4)
4. Execute using the routed technique
5. If not converging after 3 attempts, escalate (Section 6)
Rule: never start fixing before completing steps 1–3. The cost of misclassification often exceeds the cost of a short triage pass, and the written classification gives the next person something concrete to challenge.
Diagnosis vs debugging handoff
| Surface | Scope of work | Handoff signal | Next owner |
|---|---|---|---|
| Failure triage | Evidence collection, symptom class, technique choice, confidence level, escalation trigger | The failure has a primary class, a chosen technique, and enough evidence to run it | debugging |
| Root-cause execution | Reproduction, scope reduction, instrumentation, hypothesis testing, fix verification, regression test | The selected technique has started producing falsifiable evidence | debugging |
| Error capture pipeline | Whether the failure was captured, sanitized, and made observable | The problem is "this error was not reported or was reported unsafely" | error-tracking |
| Pre-merge quality review | Whether the code is risky before a known failure exists | The question is about correctness risk, maintainability, or review feedback rather than an observed symptom | code-review |
| Security investigation | Threat-model-specific analysis against an attack class | Evidence points at auth, authorization, injection, secret exposure, or data exposure | owasp-security |
Treat the handoff as a contract, not a vague recommendation. Diagnosis does not fix the bug; it decides which investigation path is justified by evidence.
2. Symptom-Classification Taxonomy
Most failures fall primarily into one of nine classes; failures that span several classes are covered in Section 7. Each class has a primary diagnostic technique.
| Class | Symptoms | Primary technique |
|---|---|---|
| Logic Error | Wrong output, wrong calculation, wrong state transition | Trace data flow; compare expected vs actual at each stage |
| Runtime Crash | Unhandled exception, process exit, 500 error | Read stack trace; find the throwing line; check preconditions |
| Data Integrity | Missing records, wrong totals, duplicate entries, cross-tenant leak | Compare source data to derived data at each transform stage |
| Timing / Race | Intermittent failure, works on retry, order-dependent | Add timestamps to logs; look for concurrent mutations; check locks |
| Performance | Slow response, timeout, memory growth, CPU spike | Profile first (measure before hypothesizing); find the hot path |
| Configuration | Works locally but not in staging / prod, env-dependent | Diff environments — env vars, versions, feature flags, DNS, SSL |
| Security | Auth bypass, data exposure, HMAC failure, injection | Follow data flow from untrusted input to sensitive operation |
| Integration | Webhook not arriving, API returning unexpected shape, sync drift | Check both sides of the boundary independently, then compare |
| Tooling / Build / Script-path | Cannot find module, wrong cwd, stale script paths, read EIO, ENOENT on a script | Verify path resolution; check cwd; verify dependency install; compare referenced path vs actual filesystem path |
Classification decision tree
Is there a stack trace or error message?
  YES → Does it point to a specific line?
    YES → Runtime Crash (read the line; check preconditions)
    NO  → Is it a timeout or OOM?
      YES → Performance
      NO  → Logic Error (the error is a symptom of wrong state)
  NO  → Is the output wrong but no error thrown?
    YES → Is the wrongness in calculated numbers or records?
      YES → Data Integrity
      NO  → Logic Error
    NO  → Is it intermittent?
      YES → Timing / Race
      NO  → Does it depend on environment?
        YES → Configuration
        NO  → Does the error message contain a file/module path?
          YES → Tooling / Build / Script-path
          NO  → Does it involve external services?
            YES → Integration
            NO  → Are there security signals (auth failure, permission error, unexpected data exposure, HMAC failure, access-control bypass)?
              YES → Security
              NO  → Unknown / Unclassified → restart evidence collection; run a fresh investigative sweep
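As a rough illustration, the tree above can be read as a chain of early returns. The sketch below is a minimal Python rendering under that reading; the `Signals` field names and the `classify` function are hypothetical names introduced here, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Yes/no answers to the tree's questions (field names are illustrative)."""
    has_error_or_trace: bool = False
    points_to_line: bool = False
    timeout_or_oom: bool = False
    wrong_output: bool = False
    wrong_numbers_or_records: bool = False
    intermittent: bool = False
    env_dependent: bool = False
    mentions_file_or_module_path: bool = False
    external_service: bool = False
    security_signal: bool = False


def classify(s: Signals) -> str:
    """Walk the questions in the same order as the decision tree."""
    if s.has_error_or_trace:
        if s.points_to_line:
            return "Runtime Crash"
        return "Performance" if s.timeout_or_oom else "Logic Error"
    if s.wrong_output:
        return "Data Integrity" if s.wrong_numbers_or_records else "Logic Error"
    if s.intermittent:
        return "Timing / Race"
    if s.env_dependent:
        return "Configuration"
    if s.mentions_file_or_module_path:
        return "Tooling / Build / Script-path"
    if s.external_service:
        return "Integration"
    if s.security_signal:
        return "Security"
    # No branch matched: go back to evidence collection.
    return "Unknown / Unclassified"
```

The result is a provisional primary class; the later reclassification rules still apply when evidence contradicts it.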
3. Evidence-Collection Protocol
Before forming any hypothesis, collect baseline evidence. The class determines the additional evidence needed beyond the universal set.
Evidence safety rule
Diagnostic notes, logs, screenshots, and repro snippets often contain more sensitive information than the final fix. Collect enough evidence to classify the failure, but redact or replace personal data, credentials, session tokens, raw request bodies, and secret-bearing headers before copying evidence into a shared note, issue, audit artifact, or skill. Prefer internal opaque IDs, hashes, synthetic examples, and minimal reproductions over real payload dumps.
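A minimal sketch of what redaction before sharing might look like, assuming evidence arrives as a header mapping plus free text. The header list, the email pattern, and the helper names are assumptions made for illustration and are not exhaustive; a keyed hash is used because an unkeyed hash of a guessable value can be reversed by lookup.

```python
import hashlib
import hmac
import re

# Illustrative list only; extend it for the headers your stack actually uses.
SECRET_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}
# Deliberately simple email pattern; it will not catch every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def opaque_id(value: str, key: bytes) -> str:
    """Map a sensitive value to a stable opaque tag using a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "id_" + digest[:12]


def redact_headers(headers: dict) -> dict:
    """Blank out secret-bearing headers, keep the rest for diagnosis."""
    return {
        name: ("[REDACTED]" if name.lower() in SECRET_HEADERS else value)
        for name, value in headers.items()
    }


def redact_text(text: str, key: bytes) -> str:
    """Replace email addresses in free text with opaque IDs."""
    return EMAIL_RE.sub(lambda m: opaque_id(m.group(0), key), text)
```

Because the same input and key always yield the same tag, two log lines about the same user remain correlatable after redaction.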
Universal evidence (always collect)
| Evidence | How to collect | Why |
|---|---|---|
| Exact error message or wrong output | Copy from logs, terminal, or UI | Prevents paraphrasing errors |
| Reproduction steps | The minimal sequence that triggers the failure | Proves the bug exists and is testable |
| Last-known-good state | git log --oneline -10, recent deploys, recent data changes | Brackets the introduction window |
| Environment facts | Runtime version, env vars, database state, running services | Eliminates the Configuration class early |
Class-specific evidence
| Class | Additional evidence to collect |
|---|---|
| Logic Error | Input data, expected output, actual output, intermediate values at key transform points |
| Runtime Crash | Full stack trace, request payload, database state at crash time |
| Data Integrity | Source record count vs derived count, sample rows from each stage, tenant / scope identifiers |
| Timing / Race | Timestamps of concurrent operations, lock state, retry behaviour, whether it reproduces under load |
| Performance | Response-time baseline, CPU / memory profile, query plans (EXPLAIN), N+1 query check |
| Configuration | Env-var diff (local vs staging vs prod), package-version diff, feature-flag state |
| Security | Auth state, session-token contents, role / permission, request headers, HMAC comparison |
| Integration | Request / response pair from both sides, delivery logs, timestamp alignment |
| Tooling / Build / Script-path | Module-resolution output, current working directory at failure, dependency-install verification, referenced path vs filesystem path |
Rule: if you cannot fill the universal evidence table, you are not ready to hypothesize. Collect first, think second.
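One way to make this rule mechanical is a small readiness check over the four universal evidence items. This is a sketch; the `UniversalEvidence` class and `missing_evidence` helper are hypothetical names, not part of the skill.

```python
from dataclasses import dataclass, fields


@dataclass
class UniversalEvidence:
    """The four universal evidence items; empty string means not yet collected."""
    exact_error_or_output: str = ""
    reproduction_steps: str = ""
    last_known_good: str = ""
    environment_facts: str = ""


def missing_evidence(ev: UniversalEvidence) -> list:
    """Return the names of evidence items that are still blank."""
    return [f.name for f in fields(ev) if not getattr(ev, f.name).strip()]
```

An empty result means the universal table is filled and hypothesizing can start; anything else names what to collect next.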
Evidence ledger
Use an evidence ledger when the investigation has more than one plausible class. This keeps assumptions separate from observations and prevents confidence inflation.
| Field | Record | Example |
|---|---|---|
| Observation | Raw fact, redacted if sensitive | POST /webhook returns 401 in staging only |
| Source | Where the fact came from | Deployment log, stack trace, profile, sanitized request sample |
| Class signal | Which class it supports | Configuration, Integration, Security |
| Contradiction | Which class it weakens | Logic Error: same code path passes locally |
| Next test | Cheapest falsification step | Compare staging and local signing secret metadata without exposing the secret |
If an observation changes the likely class, update the class explicitly. Silent reclassification is how investigations drift into mythology.
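The ledger can be kept as plain records with a tally of which classes each observation supports or weakens. The sketch below assumes a simple plus-one and minus-one count, which is a crude bookkeeping aid rather than a probability; `LedgerEntry` and `class_scores` are illustrative names.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    observation: str                               # raw fact, already redacted
    source: str                                    # where the fact came from
    supports: list = field(default_factory=list)   # class signals
    weakens: list = field(default_factory=list)    # contradictions
    next_test: str = ""                            # cheapest falsification step


def class_scores(ledger: list) -> Counter:
    """Tally support minus contradiction per class across all entries."""
    scores = Counter()
    for entry in ledger:
        for cls in entry.supports:
            scores[cls] += 1
        for cls in entry.weakens:
            scores[cls] -= 1
    return scores


ledger = [
    LedgerEntry(
        observation="POST /webhook returns 401 in staging only",
        source="Deployment log",
        supports=["Configuration", "Integration", "Security"],
        weakens=["Logic Error"],
        next_test="Compare signing secret metadata without exposing the secret",
    )
]
scores = class_scores(ledger)
```

When the top-scoring class changes after a new entry, that is the moment to record the reclassification explicitly.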
4. Technique-Selection Matrix
Once the symptom is classified, pick the cheapest technique that could resolve the class.
| Technique | Best for | Time cost | Evidence required |
|---|---|---|---|
| Stack-trace reading | Runtime crashes, unhandled exceptions | 1–2 min | Stack trace |
| Data-flow tracing | Logic errors, data integrity | 5–15 min | Input + output at each stage |
| Binary search (git bisect) | Regressions with known-good state | 3–10 min | Known-good commit + reproducible test |
| Differential comparison | Configuration, environment-dependent failure | 2–5 min | Two environments to compare |
| Instrumentation (logging) | Timing / race, intermittent failures | 5–10 min setup | Hypothesis about where to instrument |
| Isolation (MRE) | Complex failures with many variables | 10–20 min | Reproducible failure |
| Profiling | Performance, memory, CPU | 5–15 min | Running system under load |
| Boundary probing | Integration failures | 5–10 min | Access to both sides of the integration |
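The binary-search row is cheap because each test halves the suspect range. The sketch below shows that idea over an ordered list of commit labels; it is not how git bisect is implemented, and `first_bad` and the `is_bad` callback are hypothetical names. It assumes the failure, once introduced, persists in every later commit.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first failing commit.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    and commits[-1] is known bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least a known-good and a known-bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
culprit = first_bad(history, lambda c: int(c[1:]) >= 5)
```

This is why the evidence column asks for both a known-good commit and a reproducible test: without either, the invariant cannot be established.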
Technique-ordering principle
Always start with the cheapest technique that could resolve the class:
- Read the error (~30 s) — cheapest first pass for runtime crashes
- Check the environment (~1 min) — cheapest first pass for configuration issues
- Trace the data flow (~5 min) — cheapest first pass for logic / data errors
- Isolate with MRE (~10 min) — useful when too many variables remain in play
- Instrument and observe (~10+ min) — necessary when timing / intermittent failures cannot be reproduced directly
Success-rate percentages are intentionally absent from this ordering. This skill is a routing framework, not a benchmark claim. Use local incident history or an actual eval corpus before making quantified success-rate claims.
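The ordering principle can be sketched as a lookup: filter the matrix by class and by which evidence is already in hand, then take the lowest time cost. The tags in `requires` and the names `Technique`, `MATRIX`, and `pick_technique` are assumptions made for this example; bisect and MRE isolation are left out because the matrix keys them by situation rather than by class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Technique:
    name: str
    best_for: frozenset   # classes from the taxonomy
    min_minutes: int      # lower bound of the time-cost column
    requires: str         # evidence prerequisite as a short tag


MATRIX = [
    Technique("Stack-trace reading", frozenset({"Runtime Crash"}), 1, "stack trace"),
    Technique("Differential comparison", frozenset({"Configuration"}), 2, "two environments"),
    Technique("Data-flow tracing", frozenset({"Logic Error", "Data Integrity"}), 5, "stage inputs and outputs"),
    Technique("Instrumentation", frozenset({"Timing / Race"}), 5, "instrumentation hypothesis"),
    Technique("Profiling", frozenset({"Performance"}), 5, "system under load"),
    Technique("Boundary probing", frozenset({"Integration"}), 5, "access to both sides"),
]


def pick_technique(problem_class: str, evidence: set) -> Optional[Technique]:
    """Cheapest technique for the class whose evidence prerequisite is met."""
    candidates = [
        t for t in MATRIX
        if problem_class in t.best_for and t.requires in evidence
    ]
    return min(candidates, key=lambda t: t.min_minutes, default=None)
```

A `None` result means the prerequisite evidence is missing, which points back to Section 3 rather than to a different technique.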
5. The Diagnostic Confidence Ladder
As evidence accumulates, confidence in the diagnosis should increase monotonically. If it stalls or drops, treat that as a signal that the symptom may have been misclassified.
| Level | Confidence | You can say | You cannot say |
|---|---|---|---|
| 0 — Symptom | 0% | "Something is wrong" | Anything about the cause |
| 1 — Classified | 20% | "This is a [class] problem" | Where specifically |
| 2 — Localized | 50% | "The failure is in [module / file / function]" | What exactly is wrong |
| 3 — Root cause | 80% | "The cause is [specific condition]" | That the fix will work |
| 4 — Verified fix | 95% | "This fix resolves the root cause and does not regress" | Nothing — ship it |
Stuck-state checkpoints
- Stuck at level 0 for > 5 min → you need more evidence; restart Section 3
- Stuck at level 1 for > 10 min → likely misclassification; re-run the classification tree
- Stuck at level 2 for > 15 min → the problem may be cross-domain; check whether multiple classes apply
- Oscillating between levels → stop. Write down what you know vs what you're assuming. At least one of the assumptions is probably wrong.
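The checkpoints above can be expressed as a small timer check. In this sketch the thresholds come from the list, while the definition of oscillation (two or more reversals of direction in the recorded levels) is an assumption of this example; `stuck_action` and `is_oscillating` are hypothetical names.

```python
from typing import Optional

# level -> (minutes allowed at that level, action once exceeded)
STUCK_CHECKPOINTS = {
    0: (5, "collect more evidence; restart Section 3"),
    1: (10, "likely misclassification; re-run the classification tree"),
    2: (15, "possibly cross-domain; check whether multiple classes apply"),
}


def stuck_action(level: int, minutes_at_level: float) -> Optional[str]:
    """Return the checkpoint action if the time limit for this level is exceeded."""
    checkpoint = STUCK_CHECKPOINTS.get(level)
    if checkpoint is None:
        return None
    limit, action = checkpoint
    return action if minutes_at_level > limit else None


def is_oscillating(levels: list) -> bool:
    """Assumed definition: confidence changed direction at least twice."""
    moves = [b - a for a, b in zip(levels, levels[1:]) if b != a]
    reversals = sum(1 for x, y in zip(moves, moves[1:]) if (x > 0) != (y > 0))
    return reversals >= 2
```

A single drop after a falsified hypothesis does not count as oscillation under this definition; repeated up-and-down movement does.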
Reclassification rule
Classification is provisional until the evidence keeps moving the confidence ladder upward. Re-run the classification tree when any of these happens:
| Signal | Meaning | Required action |
|---|---|---|
| The selected technique produces no new evidence | The class may be wrong or the evidence prerequisite is missing | Re-check Section 3, then choose the next cheapest class-compatible technique |
| A contradiction appears | The current class does not explain all observations | Split observation from assumption in the evidence ledger and reclassify |
| Confidence decreases after a test | The hypothesis was falsified, not "almost right" | Record the falsification and move down the ladder before continuing |
| Two classes stay equally plausible | The failure may be a Cascade or Coincidence | Test the earliest shared data-flow point, then split symptoms if one fix does not affect both |
6. Escalation Criteria
Switch diagnostic approach when
| Signal | Action |
|---|---|
| Three hypotheses tested, none confirmed | Re-classify the symptom from scratch |
| Fix works locally but not in target env | Switch to Configuration-class techniques |
| Multiple symptoms that don't share a root cause | You may have 2+ bugs; triage each independently |
| Evidence contradicts the classification | Trust the evidence; re-classify |
| Confidence has decreased over the last 3 steps | Stop. You're making it worse. Fresh context needed. |
Escalate to human when
| Signal | Why a human is needed |
|---|---|
| Requires access you don't have (production DB, third-party dashboard) | Authorization boundary |
| Business-logic ambiguity ("should this return 0 or null?") | Product decision, not technical |
| Fix requires a breaking change to a public API | Stakeholder alignment needed |
| Reproduction requires real user data you cannot access | Privacy / compliance boundary |
| 30 minutes of investigation with no progress | Fresh perspective needed |
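Three of the signals in this section are countable and can be checked from the hypotheses log. The sketch below covers only those three; the remaining signals need human judgment. Reading "decreased over the last 3 steps" as three strictly decreasing recorded levels is an assumption, and `Attempt` and `next_move` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    hypothesis: str
    confirmed: bool
    confidence_after: int  # ladder level 0 to 4


def next_move(attempts: list, minutes_without_progress: float) -> str:
    """Map the countable escalation signals to an action."""
    if minutes_without_progress >= 30:
        return "escalate to human"
    recent = [a.confidence_after for a in attempts[-3:]]
    if len(recent) == 3 and recent[0] > recent[1] > recent[2]:
        return "stop: fresh context needed"
    if len(attempts) >= 3 and not any(a.confirmed for a in attempts):
        return "re-classify from scratch"
    return "continue"
```

The order of the checks reflects severity: the time limit wins over falling confidence, which wins over the three-hypotheses rule.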
7. Cross-Domain Patterns
Some failures span multiple classes simultaneously. These compound failures are the hardest to diagnose.
Pattern: the Cascade
A single root cause triggers symptoms across multiple classes.
Root cause: missing null-check in a data transform
→ Data Integrity symptom: wrong totals
→ Logic Error symptom: UI shows negative values
→ Integration symptom: webhook payload rejected by partner
Diagnostic approach: find the earliest symptom in the data flow. That's closest to the root cause.
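Finding the earliest symptom amounts to walking the pipeline in order and stopping at the first stage whose actual value differs from the expected one. The stage names and values below are invented for illustration, and `earliest_divergence` is a hypothetical helper.

```python
from typing import Mapping, Optional, Sequence


def earliest_divergence(
    stages: Sequence[str],
    expected: Mapping[str, object],
    actual: Mapping[str, object],
) -> Optional[str]:
    """Return the first stage, in data-flow order, where actual != expected."""
    for stage in stages:
        if actual.get(stage) != expected.get(stage):
            return stage
    return None


# Hypothetical pipeline mirroring the cascade example above.
stages = ["transform", "totals", "ui_value", "webhook_status"]
expected = {"transform": [10, 5, 0], "totals": 15, "ui_value": 15, "webhook_status": "accepted"}
actual = {"transform": [10, None, 0], "totals": -1, "ui_value": -1, "webhook_status": "rejected"}

origin = earliest_divergence(stages, expected, actual)
```

Here all four stages look wrong, but only the first one needs investigating; the rest are downstream of it.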
Pattern: the Coincidence
Two unrelated bugs appear simultaneously, creating a misleading compound symptom.
Bug A: CSS regression from a recent deploy (Logic Error)
Bug B: slow API from an unrelated query change (Performance)
Combined symptom: "the page is broken and slow"
Diagnostic approach: separate the symptoms. Test each independently. If fixing one doesn't affect the other, they're independent bugs.
Pattern: the Environment Ghost
Works in one environment, fails in another, with no code difference.
Local: works (runtime 20.11, .env.local, fresh DB)
Staging: fails (runtime 20.9, CI env vars, migrated DB)
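Diagnostic approach: diff everything (runtime versions, env vars, DB state, feature flags, DNS, SSL, headers). Treat each difference as a candidate cause and confirm it by changing that one variable; the first difference found is often the cause, but not always.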
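A differential comparison can start as a key-by-key diff of two fact sheets, one per environment. The sketch below assumes each environment has been summarized as a flat mapping; the keys and values are illustrative and `diff_env` is a hypothetical helper. For secrets, feed in fingerprints or metadata rather than the values themselves.

```python
MISSING = "<missing>"


def diff_env(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    result = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key, MISSING), b.get(key, MISSING)
        if left != right:
            result[key] = (left, right)
    return result


local = {"runtime": "20.11", "db": "fresh", "env_source": ".env.local"}
staging = {"runtime": "20.9", "db": "migrated", "env_source": "CI env vars", "region": "eu"}

differences = diff_env(local, staging)
```

Each entry in the result is one variable to change and retest in isolation, which keeps the one-variable-at-a-time rule intact.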
8. Anti-Patterns
| Anti-pattern | Why it fails | Correct |
|---|---|---|
| Fixing before diagnosing | Treats the symptom; root cause persists | Complete the triage protocol first |
| Hypothesis without evidence | Confirmation bias drives you toward your guess | Collect universal evidence before any hypothesis |
| Changing multiple variables at once | Cannot determine which change had the effect | One variable at a time |
| Assuming the obvious cause | "Obvious" often means "familiar," not "verified" | Verify with evidence even when "obvious" |
| Copying raw sensitive data into evidence | The diagnostic artifact becomes a privacy or secret leak | Redact, synthesize, hash, or replace with opaque IDs |
| Debugging by printf without a hypothesis | Random instrumentation wastes time | Instrument to test a specific hypothesis |
| Applying the wrong class's technique | Performance profiling won't find a logic error | Re-classify if the technique isn't converging |
| Escalating too early | Hasn't gathered enough evidence for a useful escalation | Fill the evidence table before escalating |
| Escalating too late | Spent 45 minutes on what a human could resolve in 5 | Follow the time-based escalation triggers |
9. Diagnostic-Session Template
Use this template to structure a diagnostic session. It prevents skipping steps.
## Diagnostic Session: [Brief description]
### 1. Symptom
- What: [exact error or wrong behavior]
- Where: [route / component / job]
- When: [always / intermittent / environment-specific]
- Since: [commit / deploy / data change]
### 2. Classification
- Primary class: [from taxonomy]
- Confidence: [0–4 level]
- Technique: [from technique matrix]
### 3. Evidence Collected
- [ ] Error message / wrong output (exact)
- [ ] Reproduction steps (minimal)
- [ ] Last-known-good state
- [ ] Environment facts
- [ ] Sensitive evidence redacted or replaced with safe identifiers
- [ ] Class-specific evidence: [list]
### 4. Evidence Ledger
| Observation | Source | Class signal | Contradiction | Next test |
| ----------- | ------ | ------------ | ------------- | --------- |
| | | | | |
### 5. Hypotheses Tested
| # | Hypothesis | Test | Result | Confidence after |
| --- | ---------- | ---- | ------ | ---------------- |
| 1 | | | | |
### 6. Resolution
- Root cause: [one sentence]
- Fix: [what was changed]
- Prevention: [test / guard / doc added]
Grounding and Evaluation State
This skill is grounded in public troubleshooting and diagnostic-practice references: Google SRE troubleshooting guidance, git bisect documentation for regression bisection, Stack Overflow MRE guidance for isolation, Chrome DevTools and PostgreSQL EXPLAIN docs for measurement/profiling examples, OWASP logging guidance for diagnostic event capture, and OpenTelemetry sensitive-data guidance for safe telemetry handling.
The current eval metadata remains intentionally conservative: eval_artifacts: planned, eval_state: unverified, and routing_eval: absent. Do not mark this skill verified or routing-present until a real comprehension eval and routing eval include diagnosis and pass in the same change.
Verification
- The symptom was classified before any debugging technique was chosen
- Baseline evidence was collected before any hypothesis was formed
- Sensitive or secret-bearing evidence was redacted, synthesized, hashed, or replaced with opaque IDs before sharing
- The cheapest technique that could resolve this class was tried first
- Confidence increased monotonically — or the symptom was re-classified the moment it didn't
- If the approach was changed, the reason was documented (which signal triggered the switch)
- The time-based stuck-state checkpoints were respected (5-min / 10-min / 15-min triggers)
- If the failure spanned multiple classes, the cross-domain pattern (Cascade / Coincidence / Environment Ghost) was named explicitly
Do NOT Use When
| Use instead | When |
|---|---|
| debugging | Actually executing scientific-method debugging on a failure that has already been classified; this skill routes to debugging and does not replace it |
| code-review | Reviewing code for quality / correctness before a failure exists; diagnosis is downstream |
| owasp-security | A focused security audit against a known threat list; diagnosis only routes here when symptoms point at security |
| testing-strategy | Deciding what to test proactively; diagnosis is for reactive investigation after a failure |
| error-tracking | Setting up the production-error-capture / sampling / alerting stack; diagnosis investigates a specific failure already in front of you |
| skill-router | Choosing which agent skill activates for an arbitrary query; that is cross-skill dispatch, not failure triage |
Skill Graph context
Classification
- Subject: software-engineering-method
- Public: true
- Domain: engineering/debugging
- Scope: Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged. Provides a triage-first diagnostic routing framework: classify the failure, collect the right evidence, choose a technique, track confidence, and escalate when stuck. Do NOT use for executing scientific debugging after triage (use debugging), code-quality review (use code-review), or proactive observability setup.
When to use
- the agent has been chasing this bug for 30 minutes — what's the structural fix?
- the symptoms span data integrity and UI rendering — which is the root cause?
- the build fails locally but passes in CI — how do I diagnose that class first?
- I have a stack trace and an unhandled exception — what's the cheapest technique?
- intermittent failure that doesn't reproduce on retry — which class is this?
- we ran profiling, instrumentation, and bisect — none converge. What did we misclassify?
- two engineers disagree on whether this is a config issue or a logic error — what evidence settles it?
Not for
- actually execute scientific-method debugging on this stack trace
- review this AI-generated PR for correctness
- scan this repo for OWASP top 10 vulnerabilities
- design observability instrumentation for this service
- decide which agent should pick up this ticket
- what's the right test pyramid for this feature
Related skills
- Verify with: debugging, a11y
- Related: code-review, error-tracking, owasp-security, testing-strategy, debugging
Grounding
- Mode: universal
- Truth sources: https://sre.google/sre-book/effective-troubleshooting/, https://git-scm.com/docs/git-bisect, https://stackoverflow.com/help/minimal-reproducible-example, https://developer.chrome.com/docs/devtools/performance/overview, https://www.postgresql.org/docs/current/sql-explain.html, https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html, https://opentelemetry.io/docs/security/handling-sensitive-data/
Keywords
diagnostic triage software failure, symptom classification taxonomy, what kind of bug is this, which debugging approach, diagnostic routing framework, evidence collection before hypothesis, confidence ladder debugging, escalation criteria debugging, cascade vs coincidence failure, environment ghost