diagnosis

name: diagnosis
description: "Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged. Provides a triage-first diagnostic routing framework: classify the failure, collect the right evidence, choose a technique, track confidence, and escalate when stuck. Do NOT use for executing scientific debugging after triage (use `debugging`), code-quality review (use `code-review`), or proactive observability setup. Do NOT use for actually execute scientific-method debugging on this stack trace. Do NOT use for review this AI-generated PR for correctness. Do NOT use for scan this repo for OWASP top 10 vulnerabilities. Do NOT use for design observability instrumentation for this service. Do NOT use for decide which agent should pick up this ticket. Do NOT use for what's the right test pyramid for this feature."
license: MIT
compatibility: "Language- and stack-agnostic. The classification taxonomy, evidence protocol, and confidence ladder apply to any software failure investigation; specific technique names (git bisect, EXPLAIN plans, HMAC verification) are illustrative — substitute the equivalents of your stack."
allowed-tools: Read Grep
metadata:
  relations:
    related: ["code-review", "error-tracking", "owasp-security", "testing-strategy", "debugging"]
    suppresses: ["debugging"]
    verify_with: ["debugging", "a11y"]
  subject: software-engineering-method
  public: "true"
  scope: "Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged. Provides a triage-first diagnostic routing framework: classify the failure, collect the right evidence, choose a technique, track confidence, and escalate when stuck. Do NOT use for executing scientific debugging after triage (use `debugging`), code-quality review (use `code-review`), or proactive observability setup."
  taxonomy_domain: engineering/debugging
  stability: experimental
  keywords:
    - "diagnostic triage software failure"
    - "symptom classification taxonomy"
    - "what kind of bug is this"
    - "which debugging approach"
    - "diagnostic routing framework"
    - "evidence collection before hypothesis"
    - "confidence ladder debugging"
    - "escalation criteria debugging"
    - "cascade vs coincidence failure"
    - "environment ghost"
  examples:
    - "the agent has been chasing this bug for 30 minutes — what's the structural fix?"
    - "the symptoms span data integrity and UI rendering — which is the root cause?"
    - "the build fails locally but passes in CI — how do I diagnose that class first?"
    - "I have a stack trace and an unhandled exception — what's the cheapest technique?"
    - "intermittent failure that doesn't reproduce on retry — which class is this?"
    - "we ran profiling, instrumentation, and bisect — none converge. What did we misclassify?"
    - "two engineers disagree on whether this is a config issue or a logic error — what evidence settles it?"
  anti_examples:
    - "actually execute scientific-method debugging on this stack trace"
    - "review this AI-generated PR for correctness"
    - "scan this repo for OWASP top 10 vulnerabilities"
    - "design observability instrumentation for this service"
    - "decide which agent should pick up this ticket"
    - "what's the right test pyramid for this feature"
  grounding:
    subject_matter: "Portable software-failure diagnostic triage: evidence collection, symptom classification, technique selection, confidence tracking, escalation, and sensitive diagnostic evidence handling"
    grounding_mode: universal
    truth_sources:
      - https://sre.google/sre-book/effective-troubleshooting/
      - https://git-scm.com/docs/git-bisect
      - https://stackoverflow.com/help/minimal-reproducible-example
      - https://developer.chrome.com/docs/devtools/performance/overview
      - https://www.postgresql.org/docs/current/sql-explain.html
      - https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html
      - https://opentelemetry.io/docs/security/handling-sensitive-data/
    failure_modes:
      - fixing_before_classification
      - hypothesis_without_baseline_evidence
      - wrong_technique_for_problem_class
      - confidence_inflation_without_verification
      - stuck_state_not_escalated_or_reclassified
      - diagnostic_evidence_captures_sensitive_or_secret_data
      - eval_or_routing_claim_inflated_without_run
    evidence_priority: equal
  skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
  skill_graph_project: Skill Graph
  skill_graph_canonical_skill: skills/software-engineering-method/diagnosis/SKILL.md
  skill_graph_export_description_projection: anti_examples

Concept of the skill

Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged.

Coverage

The triage-first framework that classifies a software failure into a problem class and routes it to the right diagnostic technique before root-cause investigation begins. Names nine symptom classes — Logic Error, Runtime Crash, Data Integrity, Timing / Race, Performance, Configuration, Security, Integration, Tooling / Build / Script-path — and provides a classification decision tree that walks from "is there a stack trace?" to a single class. Specifies a universal evidence-collection protocol (exact error message, reproduction steps, last-known-good state, environment facts) and class-specific evidence checklists. Lays out the technique-selection matrix — stack-trace reading, data-flow tracing, git bisect, differential comparison, instrumentation, MRE isolation, profiling, boundary probing — with each technique's time cost, best-case class, and evidence prerequisite. Defines the diagnostic confidence ladder (level 0 Symptom → 1 Classified → 2 Localized → 3 Root Cause → 4 Verified Fix) with explicit "you can say / you cannot say" boundaries at each level and stuck-state checkpoints (5-min, 10-min, 15-min, oscillation). Names escalation criteria for switching approach, switching class, or escalating to a human. Covers three cross-domain patterns where multiple classes apply simultaneously: the Cascade (one root cause, many symptoms), the Coincidence (two unrelated bugs that look like one), the Environment Ghost (works in one environment, fails in another). Catalogues diagnostic anti-patterns and ships a structured diagnostic-session template.

Philosophy of the skill

Debugging fails most often not because the engineer lacks skill, but because the wrong methodology is applied to the problem class. A timing bug needs different tools than a data-integrity bug. A scope leak needs different thinking than a rendering glitch. The most expensive debugging mistake is spending 30 minutes applying scientific-method debugging to what is actually a configuration error discoverable in 2 minutes.

This skill is the triage nurse, not the surgeon. A nurse does not treat the patient — they take vital signs, route to cardiology or neurology, and escalate to the attending physician when criteria are met. Software diagnosis works the same way: collect evidence, classify the symptom, route to the right specialist technique, and pivot when convergence stalls. The small cost of triage is almost always smaller than the cost of chasing a plausible but wrong cause. Skipping triage because "the cause is obvious" is a confirmation-bias trap; even seasoned engineers benefit from making the classification step explicit.

1. The Diagnostic Triage Protocol

Before debugging, diagnose which kind of problem you have. The class determines the technique and the technique determines the time-to-fix.

1. Collect baseline evidence (Section 3)
2. Classify the symptom            (Section 2)
3. Select the diagnostic technique (Section 4)
4. Execute using the routed technique
5. If not converging after 3 attempts, escalate (Section 6)

Rule: never start fixing before completing steps 1–3. The cost of misclassification often exceeds the cost of a short triage pass, and the written classification gives the next person something concrete to challenge.

Diagnosis vs debugging handoff

Surface	Diagnosis owns	Handoff signal	Next owner
Failure triage	Evidence collection, symptom class, technique choice, confidence level, escalation trigger	The failure has a primary class, a chosen technique, and enough evidence to run it	`debugging`
Root-cause execution	Reproduction, scope reduction, instrumentation, hypothesis testing, fix verification, regression test	The selected technique has started producing falsifiable evidence	`debugging`
Error capture pipeline	Whether the failure was captured, sanitized, and made observable	The problem is "this error was not reported or was reported unsafely"	`error-tracking`
Pre-merge quality review	Whether the code is risky before a known failure exists	The question is about correctness risk, maintainability, or review feedback rather than an observed symptom	`code-review`
Security investigation	Threat-model-specific analysis against an attack class	Evidence points at auth, authorization, injection, secret exposure, or data exposure	`owasp-security`

Treat the handoff as a contract, not a vague recommendation. Diagnosis does not fix the bug; it decides which investigation path is justified by evidence.

2. Symptom-Classification Taxonomy

Every failure is assigned one primary class out of nine, and each class has a primary diagnostic technique. Failures that appear to span several classes are covered in Section 7.

Class	Symptoms	Primary technique
Logic Error	Wrong output, wrong calculation, wrong state transition	Trace data flow; compare expected vs actual at each stage
Runtime Crash	Unhandled exception, process exit, 500 error	Read stack trace; find the throwing line; check preconditions
Data Integrity	Missing records, wrong totals, duplicate entries, cross-tenant leak	Compare source data to derived data at each transform stage
Timing / Race	Intermittent failure, works on retry, order-dependent	Add timestamps to logs; look for concurrent mutations; check locks
Performance	Slow response, timeout, memory growth, CPU spike	Profile first (measure before hypothesizing); find the hot path
Configuration	Works locally but not in staging / prod, env-dependent	Diff environments — env vars, versions, feature flags, DNS, SSL
Security	Auth bypass, data exposure, HMAC failure, injection	Follow data flow from untrusted input to sensitive operation
Integration	Webhook not arriving, API returning unexpected shape, sync drift	Check both sides of the boundary independently, then compare
Tooling / Build / Script-path	`Cannot find module`, wrong cwd, stale script paths, `read EIO`, `ENOENT` on a script	Verify path resolution; check cwd; verify dependency install; compare referenced path vs actual filesystem path

Classification decision tree

Is there a stack trace or error message?
  YES → Does it point to a specific line?
          YES → Runtime Crash (read the line; check preconditions)
          NO  → Is it a timeout or OOM?
                  YES → Performance
                  NO  → Logic Error (the error is a symptom of wrong state)
  NO  → Is the output wrong but no error thrown?
          YES → Is the wrong output in calculated numbers or records?
                  YES → Data Integrity
                  NO  → Logic Error
          NO  → Is it intermittent?
                  YES → Timing / Race
                  NO  → Does it depend on environment?
                          YES → Configuration
                          NO  → Does the failure output mention a file/module path?
                                  YES → Tooling / Build / Script-path
                                  NO  → Does it involve external services?
                                          YES → Integration
                                          NO  → Are there security signals
                                                (auth failure, permission error,
                                                unexpected data exposure, HMAC failure,
                                                access-control bypass)?
                                                  YES → Security
                                                  NO  → Unknown / Unclassified
                                                          → restart evidence collection;
                                                            run a fresh investigative sweep
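
The tree above can be sketched as a plain function. This is a minimal illustration, not part of the skill itself: the `Signals` field names are hypothetical, and each one stands for the yes/no answer to one question in the tree, asked in the same order.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Yes/no answers to the tree's questions (field names are illustrative)."""
    has_error_or_trace: bool = False
    points_to_line: bool = False
    timeout_or_oom: bool = False
    wrong_output: bool = False
    wrong_numbers_or_records: bool = False
    intermittent: bool = False
    env_dependent: bool = False
    mentions_file_or_module_path: bool = False
    external_service: bool = False
    security_signal: bool = False


def classify(s: Signals) -> str:
    """Walk the decision tree top to bottom and return the first matching class."""
    if s.has_error_or_trace:
        if s.points_to_line:
            return "Runtime Crash"
        return "Performance" if s.timeout_or_oom else "Logic Error"
    if s.wrong_output:
        return "Data Integrity" if s.wrong_numbers_or_records else "Logic Error"
    if s.intermittent:
        return "Timing / Race"
    if s.env_dependent:
        return "Configuration"
    if s.mentions_file_or_module_path:
        return "Tooling / Build / Script-path"
    if s.external_service:
        return "Integration"
    if s.security_signal:
        return "Security"
    # No branch matched: restart evidence collection rather than guessing.
    return "Unknown / Unclassified"
```

Because the function returns on the first match, a failure that is both intermittent and environment-dependent is classified as Timing / Race, which mirrors the order of the questions in the tree.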

3. Evidence-Collection Protocol

Before forming any hypothesis, collect baseline evidence. The class determines the additional evidence needed beyond the universal set.

Evidence safety rule

Diagnostic notes, logs, screenshots, and repro snippets often contain more sensitive information than the final fix. Collect enough evidence to classify the failure, but redact or replace personal data, credentials, session tokens, raw request bodies, and secret-bearing headers before copying evidence into a shared note, issue, audit artifact, or skill. Prefer internal opaque IDs, hashes, synthetic examples, and minimal reproductions over real payload dumps.
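
A redaction pass of the kind this rule describes might look like the sketch below. It is deliberately narrow and rests on assumptions: the `SENSITIVE_KEYS` list, the salt value, and the email pattern are all hypothetical, and it only handles flat records. Real evidence usually needs more patterns (phone numbers, tokens embedded in URLs, nested payloads).

```python
import hashlib
import re

# Hypothetical key list; extend it for the headers and fields your stack uses.
SENSITIVE_KEYS = {"authorization", "cookie", "set-cookie", "password", "token", "api_key", "secret"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def opaque_id(value: str, salt: str = "per-investigation-salt") -> str:
    """Replace a value with a short salted hash so repeated values still correlate."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "id_" + digest[:12]


def redact(record: dict) -> dict:
    """Return a copy of a flat record that is safer to paste into a shared note."""
    safe = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            safe[key] = "[REDACTED]"  # drop secret-bearing values entirely
        elif isinstance(value, str):
            # Swap email addresses for opaque IDs; leave the rest of the text intact.
            safe[key] = EMAIL_RE.sub(lambda m: opaque_id(m.group(0)), value)
        else:
            safe[key] = value
    return safe
```

Hashing instead of deleting keeps the ability to see that two log lines refer to the same user. Keep the salt out of the shared artifact, since a short hash of a guessable value can otherwise be reversed by trying candidates.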

Universal evidence (always collect)

Evidence	How to collect	Why
Exact error message or wrong output	Copy from logs, terminal, or UI	Prevents paraphrasing errors
Reproduction steps	The minimal sequence that triggers the failure	Proves the bug exists and is testable
Last-known-good state	`git log --oneline -10`, recent deploys, recent data changes	Brackets the introduction window
Environment facts	Runtime version, env vars, database state, running services	Eliminates the Configuration class early

Class-specific evidence

Class	Additional evidence to collect
Logic Error	Input data, expected output, actual output, intermediate values at key transform points
Runtime Crash	Full stack trace, request payload, database state at crash time
Data Integrity	Source record count vs derived count, sample rows from each stage, tenant / scope identifiers
Timing / Race	Timestamps of concurrent operations, lock state, retry behaviour, whether it reproduces under load
Performance	Response-time baseline, CPU / memory profile, query plans (EXPLAIN), N+1 query check
Configuration	Env-var diff (local vs staging vs prod), package-version diff, feature-flag state
Security	Auth state, session-token claims (never the raw token), role / permission, request headers with secret-bearing values redacted, HMAC comparison
Integration	Request / response pair from both sides, delivery logs, timestamp alignment
Tooling / Build / Script-path	Module-resolution output, current working directory at failure, dependency-install verification, referenced path vs filesystem path

Rule: if you cannot fill the universal evidence table, you are not ready to hypothesize. Collect first, think second.
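
The rule can be enforced mechanically before any hypothesis is written down. The sketch below assumes the four universal items are stored under hypothetical keys in a plain dictionary; the key names are not defined by this skill.

```python
# Hypothetical keys for the four rows of the universal evidence table.
UNIVERSAL_EVIDENCE = (
    "exact_error_or_output",
    "reproduction_steps",
    "last_known_good",
    "environment_facts",
)


def missing_evidence(collected: dict) -> list:
    """List the universal evidence items that are absent or blank."""
    return [
        name for name in UNIVERSAL_EVIDENCE
        if not (collected.get(name) or "").strip()
    ]


def ready_to_hypothesize(collected: dict) -> bool:
    """True only when every universal evidence item has been filled in."""
    return not missing_evidence(collected)
```

A blank or whitespace-only entry counts as missing, so a placeholder row in the session template does not pass the check.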

Evidence ledger

Use an evidence ledger when the investigation has more than one plausible class. This keeps assumptions separate from observations and prevents confidence inflation.

Field	Record	Example
Observation	Raw fact, redacted if sensitive	`POST /webhook` returns 401 in staging only
Source	Where the fact came from	Deployment log, stack trace, profile, sanitized request sample
Class signal	Which class it supports	Configuration, Integration, Security
Contradiction	Which class it weakens	Logic Error: same code path passes locally
Next test	Cheapest falsification step	Compare staging and local signing secret metadata without exposing the secret

If an observation changes the likely class, update the class explicitly. Silent reclassification is how investigations drift into mythology.
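
One way to keep the ledger honest is to store it as structured records and tally which classes each observation supports or weakens. This is a sketch under assumptions: the `LedgerEntry` type and the plus-one / minus-one tally are illustrative, and the tally is a bookkeeping aid, not a probability.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    """One row of the evidence ledger (field names mirror the table above)."""
    observation: str
    source: str
    supports: list = field(default_factory=list)  # "Class signal" column
    weakens: list = field(default_factory=list)   # "Contradiction" column
    next_test: str = ""


def class_scores(ledger: list) -> dict:
    """Count +1 for each supporting observation and -1 for each contradiction."""
    scores = Counter()
    for entry in ledger:
        for cls in entry.supports:
            scores[cls] += 1
        for cls in entry.weakens:
            scores[cls] -= 1
    return dict(scores)


ledger = [
    LedgerEntry(
        observation="POST /webhook returns 401 in staging only",
        source="Deployment log",
        supports=["Configuration", "Integration", "Security"],
        weakens=["Logic Error"],
        next_test="Compare staging and local signing secret metadata",
    )
]
```

When two classes end up with the same score, the ledger has not yet separated them, and the next test should be chosen to split them.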

4. Technique-Selection Matrix

Once the symptom is classified, pick the cheapest technique that could resolve the class.

Technique	Best for	Time cost	Evidence required
Stack-trace reading	Runtime crashes, unhandled exceptions	1–2 min	Stack trace
Data-flow tracing	Logic errors, data integrity	5–15 min	Input + output at each stage
Binary search (`git bisect`)	Regressions with known-good state	3–10 min	Known-good commit + reproducible test
Differential comparison	Configuration, environment-dependent failure	2–5 min	Two environments to compare
Instrumentation (logging)	Timing / race, intermittent failures	5–10 min setup	Hypothesis about where to instrument
Isolation (MRE)	Complex failures with many variables	10–20 min	Reproducible failure
Profiling	Performance, memory, CPU	5–15 min	Running system under load
Boundary probing	Integration failures	5–10 min	Access to both sides of the integration

Technique-ordering principle

Always start with the cheapest technique that could resolve the class:

Read the error (~30 s) — cheapest first pass for runtime crashes
Check the environment (~1 min) — cheapest first pass for configuration issues
Trace the data flow (~5 min) — cheapest first pass for logic / data errors
Isolate with MRE (~10 min) — useful when too many variables remain in play
Instrument and observe (~10+ min) — necessary when timing / intermittent failures cannot be reproduced directly

Success-rate percentages are intentionally absent from this ordering. This skill is a routing framework, not a benchmark claim. Use local incident history or an actual eval corpus before making quantified success-rate claims.
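
The ordering principle can be sketched as a lookup over the matrix, sorted by the lower bound of each technique's time cost. Two choices here are assumptions rather than statements from the matrix: `git bisect` is treated as applicable to any class once a known-good commit exists, and MRE isolation is treated as a fallback for every class.

```python
from typing import NamedTuple


class Technique(NamedTuple):
    name: str
    classes: frozenset       # empty set = not tied to a single class (assumption)
    min_minutes: int         # lower bound of the matrix's time-cost column
    needs_known_good: bool = False


MATRIX = [
    Technique("Stack-trace reading", frozenset({"Runtime Crash"}), 1),
    Technique("Data-flow tracing", frozenset({"Logic Error", "Data Integrity"}), 5),
    Technique("Binary search (git bisect)", frozenset(), 3, needs_known_good=True),
    Technique("Differential comparison", frozenset({"Configuration"}), 2),
    Technique("Instrumentation (logging)", frozenset({"Timing / Race"}), 5),
    Technique("Isolation (MRE)", frozenset(), 10),
    Technique("Profiling", frozenset({"Performance"}), 5),
    Technique("Boundary probing", frozenset({"Integration"}), 5),
]


def ranked_techniques(problem_class: str, known_good_commit: bool = False) -> list:
    """Return applicable technique names, cheapest first."""
    candidates = [
        t for t in MATRIX
        if (not t.classes or problem_class in t.classes)
        and (known_good_commit or not t.needs_known_good)
    ]
    # sorted() is stable, so ties keep the matrix order.
    return [t.name for t in sorted(candidates, key=lambda t: t.min_minutes)]
```

For a Data Integrity failure with a known-good commit, bisect sorts ahead of data-flow tracing because its lower bound is smaller. Whether that holds in practice depends on how quickly the reproduction can be run at each commit.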

5. The Diagnostic Confidence Ladder

As evidence accumulates, confidence in the diagnosis should increase monotonically. If it stalls or drops, suspect missing evidence or a misclassified symptom and apply the stuck-state checkpoints below.

Level	Confidence	You can say	You cannot say
0 — Symptom	0%	"Something is wrong"	Anything about the cause
1 — Classified	20%	"This is a [class] problem"	Where specifically
2 — Localized	50%	"The failure is in [module / file / function]"	What exactly is wrong
3 — Root cause	80%	"The cause is [specific condition]"	That the fix will work
4 — Verified fix	95%	"This fix resolves the root cause and does not regress"	Nothing — ship it

Stuck-state checkpoints

Stuck at level 0 for > 5 min → you need more evidence; restart Section 3
Stuck at level 1 for > 10 min → likely misclassification; re-run the classification tree
Stuck at level 2 for > 15 min → the problem may be cross-domain; check whether multiple classes apply
Oscillating between levels → stop. Write down what you know versus what you are assuming. At least one of those assumptions is probably wrong.
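
The time-based checkpoints translate directly into a small lookup. This is a sketch: the function name and the wording of the returned actions are illustrative, and the oscillation checkpoint is left out because it depends on history rather than elapsed time.

```python
from typing import Optional

# level -> (minutes allowed at that level, action once the limit is exceeded)
CHECKPOINTS = {
    0: (5, "Collect more evidence; restart Section 3"),
    1: (10, "Likely misclassification; re-run the classification tree"),
    2: (15, "Possibly cross-domain; check whether multiple classes apply"),
}


def stuck_action(level: int, minutes_at_level: float) -> Optional[str]:
    """Return the checkpoint action if the time limit for this level is exceeded."""
    checkpoint = CHECKPOINTS.get(level)
    if checkpoint is None:
        return None  # levels 3 and 4 have no time-based checkpoint
    limit, action = checkpoint
    return action if minutes_at_level > limit else None
```

The comparison is strictly greater-than, matching the "> 5 min" style of the checkpoints, so exactly ten minutes at level 1 does not trigger yet.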

Reclassification rule

Classification is provisional until the evidence keeps moving the confidence ladder upward. Re-run the classification tree when any of these happens:

Signal	Meaning	Required action
The selected technique produces no new evidence	The class may be wrong or the evidence prerequisite is missing	Re-check Section 3, then choose the next cheapest class-compatible technique
A contradiction appears	The current class does not explain all observations	Split observation from assumption in the evidence ledger and reclassify
Confidence decreases after a test	The hypothesis was falsified, not "almost right"	Record the falsification and move down the ladder before continuing
Two classes stay equally plausible	The failure may be a Cascade or Coincidence	Test the earliest shared data-flow point, then split symptoms if one fix does not affect both

6. Escalation Criteria

Switch diagnostic approach when

Signal	Action
Three hypotheses tested, none confirmed	Re-classify the symptom from scratch
Fix works locally but not in target env	Switch to Configuration-class techniques
Multiple symptoms that don't share a root cause	You may have 2+ bugs; triage each independently
Evidence contradicts the classification	Trust the evidence; re-classify
Confidence has decreased over the last 3 steps	Stop. You're making it worse. Fresh context needed.
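
Two of these signals can be checked from the session's own records. The sketch below assumes the ladder level is logged after every step and each hypothesis result is logged as a boolean. It reads "decreased over the last 3 steps" as "lower now than three steps ago", which is one interpretation among several.

```python
def confidence_dropped(history: list, window: int = 3) -> bool:
    """True if the current ladder level is lower than it was `window` steps ago."""
    if len(history) <= window:
        return False  # not enough steps recorded to judge a trend
    return history[-1] < history[-1 - window]


def needs_reclassification(results: list, limit: int = 3) -> bool:
    """True once at least `limit` hypotheses were tested and none was confirmed."""
    return len(results) >= limit and not any(results)
```

For example, a level history of `[2, 3, 2, 2, 1]` trips the first check, and three results of `False` trip the second.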

Escalate to human when

Signal	Why a human is needed
Requires access you don't have (production DB, third-party dashboard)	Authorization boundary
Business-logic ambiguity ("should this return 0 or null?")	Product decision, not technical
Fix requires a breaking change to a public API	Stakeholder alignment needed
Reproduction requires real user data you cannot access	Privacy / compliance boundary
30 minutes of investigation with no progress	Fresh perspective needed

7. Cross-Domain Patterns

Some failures span multiple classes simultaneously. These compound failures are the hardest to diagnose.

Pattern: the Cascade

A single root cause triggers symptoms across multiple classes.

Root cause: missing null-check in a data transform
  → Data Integrity symptom: wrong totals
  → Logic Error symptom:    UI shows negative values
  → Integration symptom:    webhook payload rejected by partner

Diagnostic approach: find the earliest symptom in the data flow. That's closest to the root cause.
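
Finding the earliest symptom amounts to ordering symptoms by where they appear in the data flow. The sketch below assumes the pipeline stages are known and listed in order; the stage names are hypothetical and only loosely follow the example above.

```python
def earliest_symptom(pipeline: list, symptoms: dict) -> str:
    """Return the symptom whose stage comes first in the pipeline.

    `symptoms` maps a symptom description to the stage where it was observed.
    Raises ValueError if a symptom names a stage that is not in the pipeline.
    """
    return min(symptoms, key=lambda s: pipeline.index(symptoms[s]))


# Hypothetical stage names, listed in data-flow order.
pipeline = ["transform", "ui_render", "webhook_send"]
symptoms = {
    "UI shows negative values": "ui_render",
    "wrong totals": "transform",
    "webhook payload rejected by partner": "webhook_send",
}
```

Here `earliest_symptom(pipeline, symptoms)` returns `"wrong totals"`, the symptom closest to the transform where the missing null-check sits.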

Pattern: the Coincidence

Two unrelated bugs appear simultaneously, creating a misleading compound symptom.

Bug A: CSS regression from a recent deploy        (Logic Error)
Bug B: slow API from an unrelated query change    (Performance)
Combined symptom: "the page is broken and slow"

Diagnostic approach: separate the symptoms. Test each independently. If fixing one doesn't affect the other, they're independent bugs.

Pattern: the Environment Ghost

Works in one environment, fails in another, with no code difference.

Local:    works   (runtime 20.11, .env.local, fresh DB)
Staging:  fails   (runtime 20.9,  CI env vars, migrated DB)

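Diagnostic approach: diff everything (runtime versions, env vars, DB state, feature flags, DNS, SSL, headers). Treat each difference as a candidate and change one at a time; the cause is the difference whose removal makes the failure disappear, which is not always the first one found.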
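
A differential comparison can start from two dictionaries of environment facts. This is a minimal sketch: the keys below are hypothetical, and secret values are assumed to be excluded or replaced by fingerprints before they reach the diff.

```python
def diff_environments(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ.

    A key present on only one side shows up with None for the other side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hypothetical facts based on the example above.
local = {"runtime": "20.11", "env_source": ".env.local", "db": "fresh", "region": "eu"}
staging = {"runtime": "20.9", "env_source": "CI env vars", "db": "migrated", "region": "eu"}

differences = diff_environments(local, staging)
```

In this example `differences` holds three entries (`db`, `env_source`, `runtime`), and `region` is omitted because it matches. Each remaining entry is a candidate to test on its own.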

8. Anti-Patterns

Anti-pattern	Why it fails	Correct
Fixing before diagnosing	Treats the symptom; root cause persists	Complete the triage protocol first
Hypothesis without evidence	Confirmation bias drives you toward your guess	Collect universal evidence before any hypothesis
Changing multiple variables at once	Cannot determine which change had the effect	One variable at a time
Assuming the obvious cause	"Obvious" often means "familiar," not "verified"	Verify with evidence even when "obvious"
Copying raw sensitive data into evidence	The diagnostic artifact becomes a privacy or secret leak	Redact, synthesize, hash, or replace with opaque IDs
Debugging by `printf` without a hypothesis	Random instrumentation wastes time	Instrument to test a specific hypothesis
Applying the wrong class's technique	Performance profiling won't find a logic error	Re-classify if the technique isn't converging
Escalating too early	Hasn't gathered enough evidence for a useful escalation	Fill the evidence table before escalating
Escalating too late	Spent 45 minutes on what a human could resolve in 5	Follow the time-based escalation triggers

9. Diagnostic-Session Template

Use this template to structure a diagnostic session. It prevents skipping steps.

## Diagnostic Session: [Brief description]

### 1. Symptom

- What: [exact error or wrong behavior]
- Where: [route / component / job]
- When: [always / intermittent / environment-specific]
- Since: [commit / deploy / data change]

### 2. Classification

- Primary class: [from taxonomy]
- Confidence: [0–4 level]
- Technique: [from technique matrix]

### 3. Evidence Collected

- [ ] Error message / wrong output (exact)
- [ ] Reproduction steps (minimal)
- [ ] Last-known-good state
- [ ] Environment facts
- [ ] Sensitive evidence redacted or replaced with safe identifiers
- [ ] Class-specific evidence: [list]

### 4. Evidence Ledger

| Observation | Source | Class signal | Contradiction | Next test |
| ----------- | ------ | ------------ | ------------- | --------- |
|             |        |              |               |           |

### 5. Hypotheses Tested

| #   | Hypothesis | Test | Result | Confidence after |
| --- | ---------- | ---- | ------ | ---------------- |
| 1   |            |      |        |                  |

### 6. Resolution

- Root cause: [one sentence]
- Fix: [what was changed]
- Prevention: [test / guard / doc added]

Grounding and Evaluation State

This skill is grounded in public troubleshooting and diagnostic-practice references: Google SRE troubleshooting guidance, git bisect documentation for regression bisection, Stack Overflow MRE guidance for isolation, Chrome DevTools and PostgreSQL EXPLAIN docs for measurement/profiling examples, OWASP logging guidance for diagnostic event capture, and OpenTelemetry sensitive-data guidance for safe telemetry handling.

The current eval metadata remains intentionally conservative: eval_artifacts: planned, eval_state: unverified, and routing_eval: absent. Do not mark this skill verified or routing-present until a real comprehension eval and routing eval include diagnosis and pass in the same change.

Verification

The symptom was classified before any debugging technique was chosen
Baseline evidence was collected before any hypothesis was formed
Sensitive or secret-bearing evidence was redacted, synthesized, hashed, or replaced with opaque IDs before sharing
The cheapest technique that could resolve this class was tried first
Confidence increased monotonically — or the symptom was re-classified the moment it didn't
If the approach was changed, the reason was documented (which signal triggered the switch)
The time-based stuck-state checkpoints were respected (5-min / 10-min / 15-min triggers)
If the failure spanned multiple classes, the cross-domain pattern (Cascade / Coincidence / Environment Ghost) was named explicitly

Do NOT Use When

Use instead	When
`debugging`	Actually executing scientific-method debugging on a failure that has already been classified — this skill routes to debugging; it does not replace it
`code-review`	Reviewing code for quality / correctness before a failure exists — diagnosis is downstream
`owasp-security`	A focused security audit against a known threat list — diagnosis only routes here when symptoms point at security
`testing-strategy`	Deciding what to test proactively — diagnosis is for reactive investigation after a failure
`error-tracking`	Setting up the production-error-capture / sampling / alerting stack — diagnosis investigates a specific failure already in front of you
`skill-router`	Choosing which agent skill activates for an arbitrary query — that's cross-skill dispatch, not failure triage

Skill Graph context

Classification

Subject: software-engineering-method
Public: true
Domain: engineering/debugging
Scope: Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged. Provides a triage-first diagnostic routing framework: classify the failure, collect the right evidence, choose a technique, track confidence, and escalate when stuck. Do NOT use for executing scientific debugging after triage (use debugging), code-quality review (use code-review), or proactive observability setup.

When to use

the agent has been chasing this bug for 30 minutes — what's the structural fix?
the symptoms span data integrity and UI rendering — which is the root cause?
the build fails locally but passes in CI — how do I diagnose that class first?
I have a stack trace and an unhandled exception — what's the cheapest technique?
intermittent failure that doesn't reproduce on retry — which class is this?
we ran profiling, instrumentation, and bisect — none converge. What did we misclassify?
two engineers disagree on whether this is a config issue or a logic error — what evidence settles it?

Not for

actually execute scientific-method debugging on this stack trace
review this AI-generated PR for correctness
scan this repo for OWASP top 10 vulnerabilities
design observability instrumentation for this service
decide which agent should pick up this ticket
what's the right test pyramid for this feature

Related skills

Verify with: debugging, a11y
Related: code-review, error-tracking, owasp-security, testing-strategy, debugging

Grounding

Mode: universal
Truth sources: https://sre.google/sre-book/effective-troubleshooting/, https://git-scm.com/docs/git-bisect, https://stackoverflow.com/help/minimal-reproducible-example, https://developer.chrome.com/docs/devtools/performance/overview, https://www.postgresql.org/docs/current/sql-explain.html, https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html, https://opentelemetry.io/docs/security/handling-sensitive-data/

Keywords

diagnostic triage software failure, symptom classification taxonomy, what kind of bug is this, which debugging approach, diagnostic routing framework, evidence collection before hypothesis, confidence ladder debugging, escalation criteria debugging, cascade vs coincidence failure, environment ghost