thinking-kepner-tregoe - SKILL.md Agent Skill

name: thinking-kepner-tregoe
description: Use when a defect is selective (some endpoints/regions/users/times affected, not all) and the cause is unclear. Map what IS vs IS-NOT affected; the boundary contrast points at the root cause.

Kepner-Tregoe Problem Analysis

Overview

Kepner-Tregoe (KT) is a structured root-cause method. This skill focuses on Problem Analysis (PA) — the IS/IS-NOT boundary contrast — which is the high-value KT process for debugging. When a defect is selective (some cases affected, others not), the boundary between IS and IS-NOT reveals the distinction that points at the root cause.

Decision Analysis (DA) and Potential Problem Analysis (PPA) are de-emphasized here. For pure decision-making among alternatives, use thinking-opportunity-cost. For risk anticipation before a change, use thinking-pre-mortem. Those skills are purpose-built for those tasks; KT's DA and PPA add overhead without offering a distinct mechanism.

Situation Analysis (SA), which KT itself calls Situation Appraisal, is retained as a lightweight triage step when facing multiple concerns, but it is not a required preamble. Jump directly to PA when the problem is already clear.

Core Principle: The boundary between what IS affected and what IS NOT affected encodes the root cause. Find the distinction, find the cause.

When to Use

A defect is selective: affects some endpoints/regions/users/times but NOT others — there is an IS-vs-IS-NOT boundary to contrast
The cause is unclear and not obvious from a stack trace, error message, or a single recent change
Multiple possible causes exist and you need a systematic way to narrow them
A complex situation has multiple concerns that need triage before diving in

Decision flow:

Defect is selective (not 100%)? → No → IS/IS-NOT has no signal; use direct debugging or thinking-systems
                                → Yes → Cause obvious from stack trace/recent change? → Yes → Just fix it
                                                                                      → No → APPLY KT PROBLEM ANALYSIS
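
The decision flow above can be sketched as a small function. This is only an illustration: the function name `choose_approach` and its return strings are assumptions made for the example, not part of the skill.

```python
def choose_approach(selective: bool, cause_obvious: bool) -> str:
    """Mirror the decision flow: uniform failures and obvious causes skip PA."""
    if not selective:
        # No IS-vs-IS-NOT boundary exists, so the matrix carries no signal.
        return "direct debugging or thinking-systems"
    if cause_obvious:
        # A stack trace or a single recent change already points at the cause.
        return "just fix it"
    return "apply KT Problem Analysis"


print(choose_approach(selective=True, cause_obvious=False))
```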

When NOT to Use

The failure is uniform (affects 100% of requests/everything) — there is no IS-vs-IS-NOT boundary to contrast; PA gives no signal. Use thinking-systems or direct debugging.
The cause is already obvious from a stack trace, error message, or a single recent change — just fix it; IS/IS-NOT is overhead here.
A quick hypothesis is cheaply testable — test it (thinking-occams-razor) before building a full specification matrix.
Pure decision-making with no deviation to diagnose — use thinking-opportunity-cost, not KT's Decision Analysis.
Risk assessment for a planned change — use thinking-pre-mortem, not KT's Potential Problem Analysis.

Trigger Card

When a defect is selective (some cases affected, others not) and the cause is unclear:

State the problem precisely — what is the deviation? In what object? Where/when does it occur?
Map IS vs IS-NOT — what IS affected vs what IS NOT, side by side. The boundary is the signal.
Find the distinction — what is different about the IS cases vs the IS-NOT cases? That distinction points at the cause.

Skip if the failure is uniform (100%) — there's no boundary to contrast; use direct debugging. If the cause is obvious from a stack trace or recent change, just fix it. For a single cheaply-testable hypothesis, test it first.

Procedure

Step 1 (optional): Situation Analysis — Triage Multiple Concerns

Only when facing several problems at once. List all concerns, separate them if compound, and prioritize by Timing/Impact/Trend (how soon it must be handled, how severe it is, and whether it is getting worse):

Concern	Timing	Impact	Trend	Priority
API latency spike	Urgent	High	Worsening	P0
Checkout errors	Soon	High	Stable	P1

For each concern, decide: Problem Analysis (PA), or delegate to another skill.
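
One way to make the triage repeatable is to rank concerns on ordinal scales, as in the sketch below. The scales are assumptions: only "Urgent", "Soon", "High", "Worsening", and "Stable" appear in the table above, and the remaining levels, the `Concern` class, and the `triage` function are hypothetical. Sorting compares Timing first, then Impact, then Trend, which avoids inventing numeric weights.

```python
from dataclasses import dataclass

# Hypothetical ordinal scales; levels not shown in the table are assumptions.
TIMING = {"Urgent": 2, "Soon": 1, "Later": 0}
IMPACT = {"High": 2, "Medium": 1, "Low": 0}
TREND = {"Worsening": 2, "Stable": 1, "Improving": 0}


@dataclass
class Concern:
    name: str
    timing: str
    impact: str
    trend: str


def triage(concerns):
    """Return (priority_label, concern) pairs, most pressing first."""
    ranked = sorted(
        concerns,
        key=lambda c: (TIMING[c.timing], IMPACT[c.impact], TREND[c.trend]),
        reverse=True,
    )
    return [(f"P{i}", c) for i, c in enumerate(ranked)]


concerns = [
    Concern("Checkout errors", "Soon", "High", "Stable"),
    Concern("API latency spike", "Urgent", "High", "Worsening"),
]
for label, concern in triage(concerns):
    print(label, concern.name)
```

With the two concerns from the table, the latency spike ranks first and the checkout errors second, matching the P0/P1 assignment shown.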

Step 2: State the Problem Precisely

Describe the deviation from expected behavior with specificity:

"API response time increased from 200ms to 800ms for /checkout endpoint,
US-East only, starting Monday 9 AM, affecting ~30% of requests."

Step 3: Build the IS/IS-NOT Matrix

Specify the problem across four dimensions: WHAT, WHERE, WHEN, and EXTENT. The power is in the Distinction column: what is unique about the IS side?

Dimension	IS (affected)	IS NOT (not affected)	Distinction
WHAT — object	/checkout endpoint	/cart, /product, /user	Payment processing
WHAT — defect	4x latency increase	Errors, timeouts, data corruption	Performance only
WHERE — location	Production US-East	EU, US-West, staging	Single region
WHERE — on object	Database query phase	Auth, validation, serialization	DB layer
WHEN — first seen	Monday 9:00 AM	Before Monday, after 6 PM	Business hours
WHEN — pattern	During checkout submit	During browsing, cart add	Write operations
EXTENT — how many	~30% of requests	100% of requests	Intermittent
EXTENT — trend	Stable since Tuesday	Getting worse	Plateaued
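
If the matrix is kept as data rather than prose, rows with an empty Distinction cell are easy to spot. The sketch below assumes a hypothetical `Row` type and a helper named `rows_missing_distinction`; the third row's distinction is deliberately left blank here to show the check, and is not how the table above reads.

```python
from typing import NamedTuple


class Row(NamedTuple):
    dimension: str
    is_: str
    is_not: str
    distinction: str


matrix = [
    Row("WHAT: object", "/checkout endpoint", "/cart, /product, /user", "Payment processing"),
    Row("WHERE: location", "Production US-East", "EU, US-West, staging", "Single region"),
    # Distinction intentionally blank for illustration.
    Row("WHEN: first seen", "Monday 9:00 AM", "Before Monday", ""),
]


def rows_missing_distinction(rows):
    """List the dimensions whose Distinction cell has not been filled in."""
    return [row.dimension for row in rows if not row.distinction.strip()]


print(rows_missing_distinction(matrix))
```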

Step 4: Extract Distinctions

For each row, ask: "What's unique or distinctive about the IS side compared to the IS-NOT side?"

Distinctions:
- Only /checkout (payment processing) — not other endpoints
- Only US-East (specific DB replica) — not other regions
- Only during business hours (load-related?) — not off-peak
- Only ~30% of requests (specific query pattern?) — not all
- Started Monday 9 AM — what changed?
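
When affected and unaffected cases are available as labeled samples, a first pass at the distinctions can be automated. The following is a minimal sketch under assumptions: the `distinctions` function and the sample observations are hypothetical, and the approach simply reports attribute values that appear among affected cases and never among unaffected ones.

```python
from collections import defaultdict


def distinctions(observations):
    """For each attribute, return the values seen only among affected cases.

    `observations` is an iterable of (attributes_dict, affected_bool) pairs.
    """
    is_vals, is_not_vals = defaultdict(set), defaultdict(set)
    for attrs, affected in observations:
        bucket = is_vals if affected else is_not_vals
        for key, value in attrs.items():
            bucket[key].add(value)

    result = {}
    for key, values in is_vals.items():
        unique = values - is_not_vals[key]
        if unique:
            result[key] = unique
    return result


# Hypothetical samples loosely based on the matrix above.
observations = [
    ({"endpoint": "/checkout", "region": "US-East"}, True),
    ({"endpoint": "/cart", "region": "US-West"}, False),
    ({"endpoint": "/product", "region": "EU"}, False),
]
print(distinctions(observations))
```

This checks one attribute at a time. If the defect only appears for a combination (for example, /checkout and US-East together, while /checkout elsewhere is healthy), the per-attribute result comes back empty and the combination has to be examined by hand.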

Step 5: Identify Changes

What changed in, on, around, or about the distinctions near the first observation time?

Changes near Monday 9 AM:
- Payment provider SDK updated (Sunday night deploy)
- Database index rebuild scheduled (Sunday maintenance)
- New fraud detection rules enabled (Monday 8:45 AM)
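
Given a change log with timestamps, the candidates can be narrowed to a window before the first observation. The sketch below is illustrative only: the calendar dates, the Sunday clock times, the 48-hour lookback, the extra older entry, and the `changes_near` function are all assumptions, since the example above gives only day names and the 8:45 AM and 9 AM times.

```python
from datetime import datetime, timedelta


def changes_near(changes, first_seen, lookback=timedelta(hours=48)):
    """Return (when, what) changes made within `lookback` before first_seen,
    newest first."""
    window_start = first_seen - lookback
    recent = [
        (when, what)
        for when, what in changes
        if window_start <= when <= first_seen
    ]
    return sorted(recent, reverse=True)


# Hypothetical dates: Monday is taken to be 2024-01-08.
first_seen = datetime(2024, 1, 8, 9, 0)
changes = [
    (datetime(2024, 1, 7, 23, 0), "Payment provider SDK updated"),
    (datetime(2024, 1, 7, 2, 0), "Database index rebuild"),
    (datetime(2024, 1, 8, 8, 45), "New fraud detection rules enabled"),
    (datetime(2024, 1, 2, 10, 0), "Older change outside the window (hypothetical)"),
]
for when, what in changes_near(changes, first_seen):
    print(when.isoformat(), what)
```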

Step 6: Generate and Test Possible Causes

Each candidate cause must explain BOTH the IS and the IS-NOT:

Possible Cause	Explains IS?	Explains IS-NOT?	Verdict
Fraud rules adding DB queries	✓ Only checkout, only write ops	✓ Not other endpoints; regional rollout still to confirm	Pursue
Payment SDK change	✓ Only checkout	✗ Would affect all regions	Ruled out
Index rebuild	✓ DB layer	✗ Would affect all queries	Ruled out
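
The rule that a cause must explain both sides can be expressed as a check over sample cases, as sketched below. Each candidate is modeled as a predicate that says whether the cause would affect a given case. The predicates are assumptions made for illustration; in particular, the fraud-rules predicate assumes the rules were rolled out to US-East only, which the example above does not state. The `evaluate_causes` name is hypothetical.

```python
IS_CASES = [{"endpoint": "/checkout", "region": "US-East"}]
IS_NOT_CASES = [
    {"endpoint": "/cart", "region": "US-East"},
    {"endpoint": "/checkout", "region": "EU"},
]

# Each predicate returns True if the cause would affect the given case.
CANDIDATES = {
    # Assumption: rules enabled in US-East only (not stated in the example).
    "Fraud rules adding DB queries": lambda c: (
        c["endpoint"] == "/checkout" and c["region"] == "US-East"
    ),
    "Payment SDK change": lambda c: c["endpoint"] == "/checkout",
    "Index rebuild": lambda c: True,
}


def evaluate_causes(candidates, is_cases, is_not_cases):
    """Keep a cause only if it predicts every IS case and no IS-NOT case."""
    verdicts = {}
    for name, affects in candidates.items():
        explains_is = all(affects(case) for case in is_cases)
        explains_is_not = not any(affects(case) for case in is_not_cases)
        verdicts[name] = "Pursue" if explains_is and explains_is_not else "Ruled out"
    return verdicts


for name, verdict in evaluate_causes(CANDIDATES, IS_CASES, IS_NOT_CASES).items():
    print(f"{name}: {verdict}")
```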

Step 7: Verify the True Cause

Design a test to confirm or rule out the leading candidate:

Verification for "Fraud detection rules":
1. Check: Rules enabled 8:45 AM (matches timeline)
2. Check: Rules only on checkout (matches scope)
3. Test: Disable rules in canary, measure latency
4. Examine: Query logs for fraud check queries
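
The canary test in step 3 needs a pass/fail criterion decided before the measurement. A minimal sketch follows, assuming a hypothetical helper `rules_explain_latency`, a 25% tolerance around the 200ms baseline from the problem statement, and invented latency samples used purely to exercise the function.

```python
from statistics import median


def rules_explain_latency(with_rules_ms, without_rules_ms, baseline_ms, tolerance=0.25):
    """True if disabling the rules returns median latency to near baseline
    while the rules-enabled group stays clearly above it."""
    limit = baseline_ms * (1 + tolerance)
    return median(without_rules_ms) <= limit < median(with_rules_ms)


# Invented samples for illustration only; not measurements.
with_rules = [780, 810, 795, 820]
without_rules = [195, 205, 210, 190]
print(rules_explain_latency(with_rules, without_rules, baseline_ms=200))
```

If both groups stay slow, the function returns False and the candidate should go back to the cause test table rather than be treated as confirmed.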

Output Contract

A completed KT Problem Analysis produces:

Problem Statement — specific, measurable deviation
IS/IS-NOT Matrix — all four dimensions with distinctions extracted
Changes List — what changed near the distinctions around the first observation
Cause Test Table — each candidate tested against IS and IS-NOT
Confirmed Root Cause — with verification evidence
If used, SA Triage — prioritized concern list with assigned processes

Anti-Patterns

Anti-Pattern	Symptom	Correction
KT on uniform failure	Running PA when 100% of requests fail	No boundary to contrast; use direct debugging or `thinking-systems`
Over-specifying the matrix	Filling every IS/IS-NOT cell for a simple bug	Stop when the distinction is clear; don't ritualize
DA/PPA sprawl	Running full Decision Analysis or Potential Problem Analysis for routine tasks	Redirect to `thinking-opportunity-cost` (decisions) or `thinking-pre-mortem` (risks)
Skipping cause testing	Pursuing the first plausible cause without testing against IS-NOT	Every cause must explain BOTH IS and IS-NOT
SA as mandatory preamble	Running full Situation Analysis before every PA	Jump directly to PA when the problem is already clear
Ignoring the distinction	Building the matrix but not extracting what's unique about IS	The distinction IS the signal; without it, the matrix is just a table