diagnosis

Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged. Provides a triage-first diagnostic routing framework: classify the failure, collect the right evidence, choose a technique, track confidence, and escalate when stuck. Do NOT use for executing scientific debugging after triage (use `debugging`), code-quality review (use `code-review`), or proactive observability setup. Do NOT use for requests such as "actually execute scientific-method debugging on this stack trace", "review this AI-generated PR for correctness", "scan this repo for OWASP top 10 vulnerabilities", "design observability instrumentation for this service", "decide which agent should pick up this ticket", or "what's the right test pyramid for this feature".

By jacob-balslev. Updated 6/4/2026

name: diagnosis
description: >-
  Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged. Provides a triage-first diagnostic routing framework: classify the failure, collect the right evidence, choose a technique, track confidence, and escalate when stuck. Do NOT use for executing scientific debugging after triage (use debugging), code-quality review (use code-review), or proactive observability setup. Do NOT use for actually execute scientific-method debugging on this stack trace. Do NOT use for review this AI-generated PR for correctness. Do NOT use for scan this repo for OWASP top 10 vulnerabilities. Do NOT use for design observability instrumentation for this service. Do NOT use for decide which agent should pick up this ticket. Do NOT use for what's the right test pyramid for this feature.
license: MIT
compatibility: >-
  Language- and stack-agnostic. The classification taxonomy, evidence protocol, and confidence ladder apply to any software failure investigation; specific technique names (git bisect, EXPLAIN plans, HMAC verification) are illustrative: substitute the equivalents of your stack.
allowed-tools: Read Grep
metadata:
  relations: >-
    {"related":["code-review","error-tracking","owasp-security","testing-strategy","debugging"],"suppresses":["debugging"],"verify_with":["debugging","a11y"]}
  subject: software-engineering-method
  public: "true"
  scope: >-
    Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged. Provides a triage-first diagnostic routing framework: classify the failure, collect the right evidence, choose a technique, track confidence, and escalate when stuck. Do NOT use for executing scientific debugging after triage (use debugging), code-quality review (use code-review), or proactive observability setup.
  taxonomy_domain: engineering/debugging
  stability: experimental
  keywords: >-
    ["diagnostic triage software failure","symptom classification taxonomy","what kind of bug is this","which debugging approach","diagnostic routing framework","evidence collection before hypothesis","confidence ladder debugging","escalation criteria debugging","cascade vs coincidence failure","environment ghost"]
  examples: >-
    ["the agent has been chasing this bug for 30 minutes: what's the structural fix?","the symptoms span data integrity and UI rendering: which is the root cause?","the build fails locally but passes in CI: how do I diagnose that class first?","I have a stack trace and an unhandled exception: what's the cheapest technique?","intermittent failure that doesn't reproduce on retry: which class is this?","we ran profiling, instrumentation, and bisect: none converge. What did we misclassify?","two engineers disagree on whether this is a config issue or a logic error: what evidence settles it?"]
  anti_examples: >-
    ["actually execute scientific-method debugging on this stack trace","review this AI-generated PR for correctness","scan this repo for OWASP top 10 vulnerabilities","design observability instrumentation for this service","decide which agent should pick up this ticket","what's the right test pyramid for this feature"]
  grounding: >-
    {"subject_matter":"Portable software-failure diagnostic triage: evidence collection, symptom classification, technique selection, confidence tracking, escalation, and sensitive diagnostic evidence handling","grounding_mode":"universal","truth_sources":["https://sre.google/sre-book/effective-troubleshooting/","https://git-scm.com/docs/git-bisect","https://stackoverflow.com/help/minimal-reproducible-example","https://developer.chrome.com/docs/devtools/performance/overview","https://www.postgresql.org/docs/current/sql-explain.html","https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html","https://opentelemetry.io/docs/security/handling-sensitive-data/"],"failure_modes":["fixing_before_classification","hypothesis_without_baseline_evidence","wrong_technique_for_problem_class","confidence_inflation_without_verification","stuck_state_not_escalated_or_reclassified","diagnostic_evidence_captures_sensitive_or_secret_data","eval_or_routing_claim_inflated_without_run"],"evidence_priority":"equal"}
  skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
  skill_graph_project: Skill Graph
  skill_graph_canonical_skill: skills/software-engineering-method/diagnosis/SKILL.md
  skill_graph_export_description_projection: anti_examples

Diagnosis

Concept of the skill

Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged.

Coverage

The triage-first framework that classifies a software failure into a problem class and routes it to the right diagnostic technique before root-cause investigation begins. It covers:

  • Nine symptom classes (Logic Error, Runtime Crash, Data Integrity, Timing / Race, Performance, Configuration, Security, Integration, Tooling / Build / Script-path) and a classification decision tree that walks from "is there a stack trace?" to a single class.
  • A universal evidence-collection protocol (exact error message, reproduction steps, last-known-good state, environment facts) and class-specific evidence checklists.
  • The technique-selection matrix (stack-trace reading, data-flow tracing, git bisect, differential comparison, instrumentation, MRE isolation, profiling, boundary probing) with each technique's time cost, best-case class, and evidence prerequisite.
  • The diagnostic confidence ladder (level 0 Symptom → 1 Classified → 2 Localized → 3 Root Cause → 4 Verified Fix) with explicit "you can say / you cannot say" boundaries at each level and stuck-state checkpoints (5-min, 10-min, 15-min, oscillation).
  • Escalation criteria for switching approach, switching class, or escalating to a human.
  • Three cross-domain patterns where multiple classes apply simultaneously: the Cascade (one root cause, many symptoms), the Coincidence (two unrelated bugs that look like one), and the Environment Ghost (works in one environment, fails in another).
  • A catalogue of diagnostic anti-patterns and a structured diagnostic-session template.

Philosophy of the skill

Debugging fails most often not because the engineer lacks skill, but because the wrong methodology is applied to the problem class. A timing bug needs different tools than a data-integrity bug. A scope leak needs different thinking than a rendering glitch. The most expensive debugging mistake is spending 30 minutes applying scientific-method debugging to what is actually a configuration error discoverable in 2 minutes.

This skill is the triage nurse, not the surgeon. A nurse does not treat the patient — they take vital signs, route to cardiology or neurology, and escalate to the attending physician when criteria are met. Software diagnosis works the same way: collect evidence, classify the symptom, route to the right specialist technique, and pivot when convergence stalls. The small cost of triage is almost always smaller than the cost of chasing a plausible but wrong cause. Skipping triage because "the cause is obvious" is a confirmation-bias trap; even seasoned engineers benefit from making the classification step explicit.

1. The Diagnostic Triage Protocol

Before debugging, diagnose which kind of problem you have. The class determines the technique and the technique determines the time-to-fix.

1. Collect baseline evidence (Section 3)
2. Classify the symptom            (Section 2)
3. Select the diagnostic technique (Section 4)
4. Execute using the routed technique
5. If not converging after 3 attempts, escalate (Section 6)

Rule: never start fixing before completing steps 1–3. The cost of misclassification often exceeds the cost of a short triage pass, and the written classification gives the next person something concrete to challenge.

Diagnosis vs debugging handoff

| Surface | Diagnosis owns | Handoff signal | Next owner |
| --- | --- | --- | --- |
| Failure triage | Evidence collection, symptom class, technique choice, confidence level, escalation trigger | The failure has a primary class, a chosen technique, and enough evidence to run it | debugging |
| Root-cause execution | Reproduction, scope reduction, instrumentation, hypothesis testing, fix verification, regression test | The selected technique has started producing falsifiable evidence | debugging |
| Error capture pipeline | Whether the failure was captured, sanitized, and made observable | The problem is "this error was not reported or was reported unsafely" | error-tracking |
| Pre-merge quality review | Whether the code is risky before a known failure exists | The question is about correctness risk, maintainability, or review feedback rather than an observed symptom | code-review |
| Security investigation | Threat-model-specific analysis against an attack class | Evidence points at auth, authorization, injection, secret exposure, or data exposure | owasp-security |

Treat the handoff as a contract, not a vague recommendation. Diagnosis does not fix the bug; it decides which investigation path is justified by evidence.

2. Symptom-Classification Taxonomy

Every failure falls into one of nine classes. Each class has a primary diagnostic technique.

| Class | Symptoms | Primary technique |
| --- | --- | --- |
| Logic Error | Wrong output, wrong calculation, wrong state transition | Trace data flow; compare expected vs actual at each stage |
| Runtime Crash | Unhandled exception, process exit, 500 error | Read stack trace; find the throwing line; check preconditions |
| Data Integrity | Missing records, wrong totals, duplicate entries, cross-tenant leak | Compare source data to derived data at each transform stage |
| Timing / Race | Intermittent failure, works on retry, order-dependent | Add timestamps to logs; look for concurrent mutations; check locks |
| Performance | Slow response, timeout, memory growth, CPU spike | Profile first (measure before hypothesizing); find the hot path |
| Configuration | Works locally but not in staging / prod, env-dependent | Diff environments: env vars, versions, feature flags, DNS, SSL |
| Security | Auth bypass, data exposure, HMAC failure, injection | Follow data flow from untrusted input to sensitive operation |
| Integration | Webhook not arriving, API returning unexpected shape, sync drift | Check both sides of the boundary independently, then compare |
| Tooling / Build / Script-path | Cannot find module, wrong cwd, stale script paths, read EIO, ENOENT on a script | Verify path resolution; check cwd; verify dependency install; compare referenced path vs actual filesystem path |

Classification decision tree

Is there a stack trace or error message?
  YES → Does it point to a specific line?
          YES → Runtime Crash (read the line; check preconditions)
          NO  → Is it a timeout or OOM?
                  YES → Performance
                  NO  → Logic Error (the error is a symptom of wrong state)
  NO  → Is the output wrong but no error thrown?
          YES → Is the wrongness in calculated numbers or records?
                  YES → Data Integrity
                  NO  → Logic Error
          NO  → Is it intermittent?
                  YES → Timing / Race
                  NO  → Does it depend on environment?
                          YES → Configuration
                          NO  → Does the error message contain a file/module path?
                                  YES → Tooling / Build / Script-path
                                  NO  → Does it involve external services?
                                          YES → Integration
                                          NO  → Are there security signals
                                                (auth failure, permission error,
                                                unexpected data exposure, HMAC failure,
                                                access-control bypass)?
                                                  YES → Security
                                                  NO  → Unknown / Unclassified
                                                          → restart evidence collection;
                                                            run a fresh investigative sweep
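
The tree above can be read as a chain of early returns. The following is a minimal sketch of that reading, not a canonical implementation; the `Symptom` field names are hypothetical and simply stand for the yes/no answers to each question in the tree.

```python
from dataclasses import dataclass


@dataclass
class Symptom:
    """Yes/no answers to the tree's questions (hypothetical field names)."""
    has_error_or_trace: bool = False
    points_to_line: bool = False
    is_timeout_or_oom: bool = False
    wrong_output_no_error: bool = False
    wrong_numbers_or_records: bool = False
    intermittent: bool = False
    env_dependent: bool = False
    mentions_file_or_module_path: bool = False
    involves_external_service: bool = False
    has_security_signal: bool = False


def classify(s: Symptom) -> str:
    """Walk the tree top to bottom and return the first class that matches."""
    if s.has_error_or_trace:
        if s.points_to_line:
            return "Runtime Crash"
        return "Performance" if s.is_timeout_or_oom else "Logic Error"
    if s.wrong_output_no_error:
        return "Data Integrity" if s.wrong_numbers_or_records else "Logic Error"
    if s.intermittent:
        return "Timing / Race"
    if s.env_dependent:
        return "Configuration"
    if s.mentions_file_or_module_path:
        return "Tooling / Build / Script-path"
    if s.involves_external_service:
        return "Integration"
    if s.has_security_signal:
        return "Security"
    return "Unknown / Unclassified"


print(classify(Symptom(has_error_or_trace=True, points_to_line=True)))
print(classify(Symptom(intermittent=True, env_dependent=True)))
```

Because the questions are asked in a fixed order, a failure that is both intermittent and environment-dependent lands in Timing / Race first; the reclassification rule in Section 5 is what corrects that if the evidence later disagrees.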

3. Evidence-Collection Protocol

Before forming any hypothesis, collect baseline evidence. The class determines the additional evidence needed beyond the universal set.

Evidence safety rule

Diagnostic notes, logs, screenshots, and repro snippets often contain more sensitive information than the final fix. Collect enough evidence to classify the failure, but redact or replace personal data, credentials, session tokens, raw request bodies, and secret-bearing headers before copying evidence into a shared note, issue, audit artifact, or skill. Prefer internal opaque IDs, hashes, synthetic examples, and minimal reproductions over real payload dumps.
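
One way to apply this rule before pasting a log excerpt into a shared note is sketched below. It is an illustration under assumptions: the two regular expressions, the header names they match, and the `opaque_id` helper are hypothetical and will not cover every secret a real stack emits.

```python
import hashlib
import hmac
import re

# Hypothetical patterns; extend them for the secrets your own stack emits.
SECRET_HEADER = re.compile(r"(?im)^(authorization|cookie|x-api-key):\s*.+$")
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def opaque_id(value: str, key: bytes) -> str:
    """Keyed hash: stable for correlation, not guessable without the key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "id_" + digest[:10]


def redact(text: str, key: bytes) -> str:
    """Mask secret-bearing headers and swap email addresses for opaque IDs."""
    text = SECRET_HEADER.sub(lambda m: f"{m.group(1)}: [REDACTED]", text)
    return EMAIL.sub(lambda m: opaque_id(m.group(0), key), text)


sample = "Authorization: Bearer abc123\nuser=jane@example.com failed checkout"
print(redact(sample, key=b"local-only-key"))
```

The keyed hash keeps the same user recognisable across log lines without exposing the address itself. Keep the key out of the shared artifact.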

Universal evidence (always collect)

| Evidence | How to collect | Why |
| --- | --- | --- |
| Exact error message or wrong output | Copy from logs, terminal, or UI | Prevents paraphrasing errors |
| Reproduction steps | The minimal sequence that triggers the failure | Proves the bug exists and is testable |
| Last-known-good state | `git log --oneline -10`, recent deploys, recent data changes | Brackets the introduction window |
| Environment facts | Runtime version, env vars, database state, running services | Eliminates the Configuration class early |

Class-specific evidence

| Class | Additional evidence to collect |
| --- | --- |
| Logic Error | Input data, expected output, actual output, intermediate values at key transform points |
| Runtime Crash | Full stack trace, request payload, database state at crash time |
| Data Integrity | Source record count vs derived count, sample rows from each stage, tenant / scope identifiers |
| Timing / Race | Timestamps of concurrent operations, lock state, retry behaviour, whether it reproduces under load |
| Performance | Response-time baseline, CPU / memory profile, query plans (EXPLAIN), N+1 query check |
| Configuration | Env-var diff (local vs staging vs prod), package-version diff, feature-flag state |
| Security | Auth state, session-token contents, role / permission, request headers, HMAC comparison |
| Integration | Request / response pair from both sides, delivery logs, timestamp alignment |
| Tooling / Build / Script-path | Module-resolution output, current working directory at failure, dependency-install verification, referenced path vs filesystem path |

Rule: if you cannot fill the universal evidence table, you are not ready to hypothesize. Collect first, think second.

Evidence ledger

Use an evidence ledger when the investigation has more than one plausible class. This keeps assumptions separate from observations and prevents confidence inflation.

| Field | Record | Example |
| --- | --- | --- |
| Observation | Raw fact, redacted if sensitive | POST /webhook returns 401 in staging only |
| Source | Where the fact came from | Deployment log, stack trace, profile, sanitized request sample |
| Class signal | Which class it supports | Configuration, Integration, Security |
| Contradiction | Which class it weakens | Logic Error: same code path passes locally |
| Next test | Cheapest falsification step | Compare staging and local signing secret metadata without exposing the secret |

If an observation changes the likely class, update the class explicitly. Silent reclassification is how investigations drift into mythology.
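
A ledger like this can be kept as plain records. The sketch below is one possible shape, reusing the example row from the table; the `LedgerEntry` type and the simple +1 / -1 tally in `class_scores` are assumptions made for illustration, not part of the skill.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    observation: str                                    # raw fact, already redacted
    source: str                                         # where the fact came from
    supports: list[str] = field(default_factory=list)   # class signal
    weakens: list[str] = field(default_factory=list)    # contradiction
    next_test: str = ""                                 # cheapest falsification step


def class_scores(ledger: list[LedgerEntry]) -> Counter:
    """Net support per class: +1 per supporting observation, -1 per contradiction."""
    scores: Counter = Counter()
    for entry in ledger:
        for cls in entry.supports:
            scores[cls] += 1
        for cls in entry.weakens:
            scores[cls] -= 1
    return scores


ledger = [
    LedgerEntry(
        observation="POST /webhook returns 401 in staging only",
        source="Deployment log",
        supports=["Configuration", "Integration", "Security"],
        weakens=["Logic Error"],
        next_test="Compare staging and local signing secret metadata",
    )
]
print(class_scores(ledger).most_common())
```

The tally is only a prompt for the next test. When the leading class changes after a new entry, that is the moment to write the reclassification down explicitly.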

4. Technique-Selection Matrix

Once the symptom is classified, pick the cheapest technique that could resolve the class.

| Technique | Best for | Time cost | Evidence required |
| --- | --- | --- | --- |
| Stack-trace reading | Runtime crashes, unhandled exceptions | 1–2 min | Stack trace |
| Data-flow tracing | Logic errors, data integrity | 5–15 min | Input + output at each stage |
| Binary search (git bisect) | Regressions with known-good state | 3–10 min | Known-good commit + reproducible test |
| Differential comparison | Configuration, environment-dependent failure | 2–5 min | Two environments to compare |
| Instrumentation (logging) | Timing / race, intermittent failures | 5–10 min setup | Hypothesis about where to instrument |
| Isolation (MRE) | Complex failures with many variables | 10–20 min | Reproducible failure |
| Profiling | Performance, memory, CPU | 5–15 min | Running system under load |
| Boundary probing | Integration failures | 5–10 min | Access to both sides of the integration |

Technique-ordering principle

Always start with the cheapest technique that could resolve the class:

  1. Read the error (~30 s) — cheapest first pass for runtime crashes
  2. Check the environment (~1 min) — cheapest first pass for configuration issues
  3. Trace the data flow (~5 min) — cheapest first pass for logic / data errors
  4. Isolate with MRE (~10 min) — useful when too many variables remain in play
  5. Instrument and observe (~10+ min) — necessary when timing / intermittent failures cannot be reproduced directly

The percentages are intentionally absent. This skill is a routing framework, not a benchmark claim. Use local incident history or an actual eval corpus before making quantified success-rate claims.
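
The "cheapest technique that could resolve the class" rule can be sketched as a lookup over the matrix. This is an illustrative reading only: the mapping from each technique to taxonomy class names, the evidence labels, and the use of each range's lower bound as the cost are assumptions. Git bisect and MRE isolation are left out because the matrix does not tie them to a single class.

```python
from typing import NamedTuple, Optional


class Technique(NamedTuple):
    name: str
    min_minutes: int      # lower bound of the matrix's time-cost range
    best_for: frozenset   # taxonomy class names (an interpretation of "Best for")
    requires: str         # evidence prerequisite (hypothetical label)


MATRIX = [
    Technique("Stack-trace reading", 1, frozenset({"Runtime Crash"}), "stack trace"),
    Technique("Differential comparison", 2, frozenset({"Configuration"}), "two environments"),
    Technique("Data-flow tracing", 5, frozenset({"Logic Error", "Data Integrity"}), "stage input/output"),
    Technique("Instrumentation", 5, frozenset({"Timing / Race"}), "instrumentation hypothesis"),
    Technique("Profiling", 5, frozenset({"Performance"}), "system under load"),
    Technique("Boundary probing", 5, frozenset({"Integration"}), "access to both sides"),
]


def cheapest_technique(problem_class: str, evidence: set) -> Optional[Technique]:
    """Cheapest technique for the class whose evidence prerequisite is already met."""
    candidates = [
        t for t in MATRIX
        if problem_class in t.best_for and t.requires in evidence
    ]
    return min(candidates, key=lambda t: t.min_minutes, default=None)


choice = cheapest_technique("Runtime Crash", {"stack trace"})
print(choice.name if choice else "collect more evidence first")
```

A `None` result means the prerequisite evidence is missing, which sends the investigation back to Section 3 rather than on to a more expensive technique.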

5. The Diagnostic Confidence Ladder

As evidence accumulates, confidence in the diagnosis should increase monotonically. If it stalls or drops, treat the classification as suspect and re-run the classification tree.

| Level | Confidence | You can say | You cannot say |
| --- | --- | --- | --- |
| 0: Symptom | 0% | "Something is wrong" | Anything about the cause |
| 1: Classified | 20% | "This is a [class] problem" | Where specifically |
| 2: Localized | 50% | "The failure is in [module / file / function]" | What exactly is wrong |
| 3: Root cause | 80% | "The cause is [specific condition]" | That the fix will work |
| 4: Verified fix | 95% | "This fix resolves the root cause and does not regress" | Nothing further; ship it |

Stuck-state checkpoints

  • Stuck at level 0 for > 5 min → you need more evidence; restart Section 3
  • Stuck at level 1 for > 10 min → likely misclassification; re-run the classification tree
  • Stuck at level 2 for > 15 min → the problem may be cross-domain; check whether multiple classes apply
  • Oscillating between levels → stop. Write down what you know vs what you're assuming. The assumption is wrong.
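
The checkpoints above are mechanical enough to encode. The sketch below assumes the session records its ladder level over time; the function names are hypothetical, and treating any drop in level as "oscillation" is an assumption about how to detect it.

```python
from typing import Optional

# Minutes allowed at each ladder level before the checkpoint fires.
STUCK_LIMITS = {
    0: (5, "Collect more evidence; restart Section 3"),
    1: (10, "Likely misclassification; re-run the classification tree"),
    2: (15, "Possibly cross-domain; check whether multiple classes apply"),
}


def checkpoint(level: int, minutes_at_level: float) -> Optional[str]:
    """Return the required action if the time limit for this level is exceeded."""
    limit = STUCK_LIMITS.get(level)
    if limit and minutes_at_level > limit[0]:
        return limit[1]
    return None


def is_oscillating(level_history: list[int]) -> bool:
    """True when the recorded ladder level dropped at any point."""
    return any(later < earlier for earlier, later in zip(level_history, level_history[1:]))


print(checkpoint(1, 12))
print(is_oscillating([1, 2, 1, 2]))
```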

Reclassification rule

Classification is provisional until the evidence keeps moving the confidence ladder upward. Re-run the classification tree when any of these happens:

| Signal | Meaning | Required action |
| --- | --- | --- |
| The selected technique produces no new evidence | The class may be wrong or the evidence prerequisite is missing | Re-check Section 3, then choose the next cheapest class-compatible technique |
| A contradiction appears | The current class does not explain all observations | Split observation from assumption in the evidence ledger and reclassify |
| Confidence decreases after a test | The hypothesis was falsified, not "almost right" | Record the falsification and move down the ladder before continuing |
| Two classes stay equally plausible | The failure may be a Cascade or Coincidence | Test the earliest shared data-flow point, then split symptoms if one fix does not affect both |

6. Escalation Criteria

Switch diagnostic approach when

| Signal | Action |
| --- | --- |
| Three hypotheses tested, none confirmed | Re-classify the symptom from scratch |
| Fix works locally but not in target env | Switch to Configuration-class techniques |
| Multiple symptoms that don't share a root cause | You may have 2+ bugs; triage each independently |
| Evidence contradicts the classification | Trust the evidence; re-classify |
| Confidence has decreased over the last 3 steps | Stop. You're making it worse. Fresh context needed. |

Escalate to human when

| Signal | Why a human is needed |
| --- | --- |
| Requires access you don't have (production DB, third-party dashboard) | Authorization boundary |
| Business-logic ambiguity ("should this return 0 or null?") | Product decision, not technical |
| Fix requires a breaking change to a public API | Stakeholder alignment needed |
| Reproduction requires real user data you cannot access | Privacy / compliance boundary |
| 30 minutes of investigation with no progress | Fresh perspective needed |

7. Cross-Domain Patterns

Some failures span multiple classes simultaneously. These compound failures are the hardest to diagnose.

Pattern: the Cascade

A single root cause triggers symptoms across multiple classes.

Root cause: missing null-check in a data transform
  → Data Integrity symptom: wrong totals
  → Logic Error symptom:    UI shows negative values
  → Integration symptom:    webhook payload rejected by partner

Diagnostic approach: find the earliest symptom in the data flow. That's closest to the root cause.
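
If the data flow can be written down as an ordered list of stages, "find the earliest symptom" becomes a simple scan. The sketch below is illustrative; the stage names are hypothetical labels for the example above.

```python
def earliest_symptom(pipeline: list[str], symptoms: dict[str, str]) -> str:
    """Return the symptomatic stage that appears first in the data flow."""
    for stage in pipeline:
        if stage in symptoms:
            return stage
    raise ValueError("no symptom maps to a known pipeline stage")


# Hypothetical stage names, ordered from upstream to downstream.
pipeline = ["transform", "totals", "ui", "webhook"]
symptoms = {
    "webhook": "payload rejected by partner",
    "ui": "negative values shown",
    "totals": "wrong totals",
}
print(earliest_symptom(pipeline, symptoms))
```

Here the scan returns `totals`, so the investigation starts there and works upstream toward the transform rather than at the webhook, where the symptom was reported most loudly.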

Pattern: the Coincidence

Two unrelated bugs appear simultaneously, creating a misleading compound symptom.

Bug A: CSS regression from a recent deploy        (Logic Error)
Bug B: slow API from an unrelated query change    (Performance)
Combined symptom: "the page is broken and slow"

Diagnostic approach: separate the symptoms. Test each independently. If fixing one doesn't affect the other, they're independent bugs.

Pattern: the Environment Ghost

Works in one environment, fails in another, with no code difference.

Local:    works   (runtime 20.11, .env.local, fresh DB)
Staging:  fails   (runtime 20.9,  CI env vars, migrated DB)

Diagnostic approach: diff everything: runtime versions, env vars, DB state, feature flags, DNS, SSL, headers. Treat each difference as a candidate cause and test them one at a time, starting with the one closest to the failing code path.
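
A differential comparison can start from two flat dictionaries of environment facts. The sketch below reuses the example values above; the `flag_x` key is hypothetical, and real secrets should be compared by hash or metadata rather than printed.

```python
def diff_environments(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    unset = "<unset>"
    return {
        key: (a.get(key, unset), b.get(key, unset))
        for key in sorted(set(a) | set(b))
        if a.get(key, unset) != b.get(key, unset)
    }


local = {"runtime": "20.11", "env_file": ".env.local", "db": "fresh", "flag_x": "on"}
staging = {"runtime": "20.9", "env_file": "CI env vars", "db": "migrated", "flag_x": "on"}

for key, (local_value, staging_value) in diff_environments(local, staging).items():
    print(f"{key}: local={local_value} staging={staging_value}")
```

Each reported key is a candidate cause. Changing one of them at a time in the failing environment keeps the result attributable to a single variable.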

8. Anti-Patterns

| Anti-pattern | Why it fails | Correct |
| --- | --- | --- |
| Fixing before diagnosing | Treats the symptom; root cause persists | Complete the triage protocol first |
| Hypothesis without evidence | Confirmation bias drives you toward your guess | Collect universal evidence before any hypothesis |
| Changing multiple variables at once | Cannot determine which change had the effect | One variable at a time |
| Assuming the obvious cause | "Obvious" often means "familiar," not "verified" | Verify with evidence even when "obvious" |
| Copying raw sensitive data into evidence | The diagnostic artifact becomes a privacy or secret leak | Redact, synthesize, hash, or replace with opaque IDs |
| Debugging by printf without a hypothesis | Random instrumentation wastes time | Instrument to test a specific hypothesis |
| Applying the wrong class's technique | Performance profiling won't find a logic error | Re-classify if the technique isn't converging |
| Escalating too early | Hasn't gathered enough evidence for a useful escalation | Fill the evidence table before escalating |
| Escalating too late | Spent 45 minutes on what a human could resolve in 5 | Follow the time-based escalation triggers |

9. Diagnostic-Session Template

Use this template to structure a diagnostic session. It prevents skipping steps.

## Diagnostic Session: [Brief description]

### 1. Symptom

- What: [exact error or wrong behavior]
- Where: [route / component / job]
- When: [always / intermittent / environment-specific]
- Since: [commit / deploy / data change]

### 2. Classification

- Primary class: [from taxonomy]
- Confidence: [0–4 level]
- Technique: [from technique matrix]

### 3. Evidence Collected

- [ ] Error message / wrong output (exact)
- [ ] Reproduction steps (minimal)
- [ ] Last-known-good state
- [ ] Environment facts
- [ ] Sensitive evidence redacted or replaced with safe identifiers
- [ ] Class-specific evidence: [list]

### 4. Evidence Ledger

| Observation | Source | Class signal | Contradiction | Next test |
| ----------- | ------ | ------------ | ------------- | --------- |
|             |        |              |               |           |

### 5. Hypotheses Tested

| #   | Hypothesis | Test | Result | Confidence after |
| --- | ---------- | ---- | ------ | ---------------- |
| 1   |            |      |        |                  |

### 6. Resolution

- Root cause: [one sentence]
- Fix: [what was changed]
- Prevention: [test / guard / doc added]

Grounding and Evaluation State

This skill is grounded in public troubleshooting and diagnostic-practice references: Google SRE troubleshooting guidance, git bisect documentation for regression bisection, Stack Overflow MRE guidance for isolation, Chrome DevTools and PostgreSQL EXPLAIN docs for measurement/profiling examples, OWASP logging guidance for diagnostic event capture, and OpenTelemetry sensitive-data guidance for safe telemetry handling.

The current eval metadata remains intentionally conservative: eval_artifacts: planned, eval_state: unverified, and routing_eval: absent. Do not mark this skill verified or routing-present until a real comprehension eval and routing eval include diagnosis and pass in the same change.

Verification

  • The symptom was classified before any debugging technique was chosen
  • Baseline evidence was collected before any hypothesis was formed
  • Sensitive or secret-bearing evidence was redacted, synthesized, hashed, or replaced with opaque IDs before sharing
  • The cheapest technique that could resolve this class was tried first
  • Confidence increased monotonically — or the symptom was re-classified the moment it didn't
  • If the approach was changed, the reason was documented (which signal triggered the switch)
  • The time-based stuck-state checkpoints were respected (5-min / 10-min / 15-min triggers)
  • If the failure spanned multiple classes, the cross-domain pattern (Cascade / Coincidence / Environment Ghost) was named explicitly

Do NOT Use When

| Use instead | When |
| --- | --- |
| debugging | Actually executing scientific-method debugging on a failure that has already been classified. This skill routes to debugging; it does not replace it |
| code-review | Reviewing code for quality / correctness before a failure exists; diagnosis is downstream |
| owasp-security | A focused security audit against a known threat list; diagnosis only routes here when symptoms point at security |
| testing-strategy | Deciding what to test proactively; diagnosis is for reactive investigation after a failure |
| error-tracking | Setting up the production-error-capture / sampling / alerting stack; diagnosis investigates a specific failure already in front of you |
| skill-router | Choosing which agent skill activates for an arbitrary query; that's cross-skill dispatch, not failure triage |

Skill Graph context

Classification

  • Subject: software-engineering-method
  • Public: true
  • Domain: engineering/debugging
  • Scope: Use when facing an unknown software failure, when symptoms point to different root causes, or when an initial debugging attempt has not converged. Provides a triage-first diagnostic routing framework: classify the failure, collect the right evidence, choose a technique, track confidence, and escalate when stuck. Do NOT use for executing scientific debugging after triage (use debugging), code-quality review (use code-review), or proactive observability setup.

When to use

  • the agent has been chasing this bug for 30 minutes — what's the structural fix?
  • the symptoms span data integrity and UI rendering — which is the root cause?
  • the build fails locally but passes in CI — how do I diagnose that class first?
  • I have a stack trace and an unhandled exception — what's the cheapest technique?
  • intermittent failure that doesn't reproduce on retry — which class is this?
  • we ran profiling, instrumentation, and bisect — none converge. What did we misclassify?
  • two engineers disagree on whether this is a config issue or a logic error — what evidence settles it?

Not for

  • actually execute scientific-method debugging on this stack trace
  • review this AI-generated PR for correctness
  • scan this repo for OWASP top 10 vulnerabilities
  • design observability instrumentation for this service
  • decide which agent should pick up this ticket
  • what's the right test pyramid for this feature

Related skills

  • Verify with: debugging, a11y
  • Related: code-review, error-tracking, owasp-security, testing-strategy, debugging

Grounding

  • Mode: universal
  • Truth sources: https://sre.google/sre-book/effective-troubleshooting/, https://git-scm.com/docs/git-bisect, https://stackoverflow.com/help/minimal-reproducible-example, https://developer.chrome.com/docs/devtools/performance/overview, https://www.postgresql.org/docs/current/sql-explain.html, https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html, https://opentelemetry.io/docs/security/handling-sensitive-data/

Keywords

  • diagnostic triage software failure, symptom classification taxonomy, what kind of bug is this, which debugging approach, diagnostic routing framework, evidence collection before hypothesis, confidence ladder debugging, escalation criteria debugging, cascade vs coincidence failure, environment ghost
Install via CLI
npx skills add https://github.com/jacob-balslev/skill-graph --skill diagnosis