causal-tree-analysis - SKILL.md Agent Skill

name: Causal Tree Analysis description: Hierarchical root cause investigation technique that maps all necessary and sufficient conditions leading to an adverse event using tree-structured logic

Overview

Causal Tree Analysis (also called Causal Factor Tree Analysis) is a root cause investigation methodology that creates a visual tree diagram showing the logical relationships between an undesired event and all contributing factors. Unlike brainstorming-based techniques like fishbones, Causal Tree Analysis applies rigorous logic checks to ensure each cause is both necessary and sufficient for the effect to occur.

The tree starts with a single consequence at the top (or root) and branches downward to show immediate causes, then their causes, continuing until reaching foundational factors that cannot be decomposed further. Each branch represents a causal relationship validated by evidence and logical necessity. This structure exposes gaps in knowledge and ensures investigators don't stop at superficial "obvious" causes.

Key distinction: Causal Trees emphasize logical rigor (necessity/sufficiency testing) versus Fishbone's brainstorming approach. Causal Trees work well as a master investigation technique for complex incidents where multiple factors interact.

When to Use

Apply Causal Tree Analysis when you need to:

Investigate serious incidents where comprehensive understanding is critical (workplace injuries, system outages, product recalls)
Analyze complex multi-factor failures where simple linear cause-effect models fall short
Document investigation rigor for regulatory compliance or legal defense
Expose knowledge gaps systematically during investigation (tree structure reveals missing evidence)
Communicate causation to stakeholders using familiar tree visualization
Validate completeness of other RCA methods by testing necessity/sufficiency of identified causes
Handle multiple potential scenarios where it's unclear which causal path actually occurred

Don't use when: Incident is simple enough for 5 Whys, time dependencies are critical (tree structure struggles with sequence), or you need quantitative probability analysis (use Fault Tree Analysis instead).

Process

1. Define the Consequence Precisely

Specify the single adverse event you're investigating at the top/root of the tree. Must be specific, observable, and well-defined.

Example: "Customer data exposed to unauthorized users on Nov 15, 14:32 UTC" not "security incident"

2. Identify Immediate Causes

Determine all events or conditions that directly and immediately caused the consequence. Apply logic tests:

Necessity test: If this cause had NOT occurred, would the consequence still have happened? If yes, it's not necessary (exclude or refine).

Sufficiency test: Given all identified immediate causes together, was the consequence inevitable? If no, you're missing causes.

Existence test: Do you have evidence this cause actually occurred, or is it speculation? Mark speculation clearly.

3. Decompose Each Cause Recursively

For each immediate cause, repeat step 2 to identify its immediate causes. Continue until reaching "base events" that are:

Outside your system boundaries (external dependencies)
Well-understood failure mechanisms with known root causes
Human actions where you've reached individual decision points
Events too foundational to decompose further

4. Validate Logical Consistency

Review each branch with the logic tests. Every cause must satisfy necessity and sufficiency relative to its parent effect. If you find gaps, it signals missing investigation.

Red flag: If removing one cause doesn't change the outcome, it's not necessary (remove it or you've missed the real connection).

5. Identify Root Causes and Contributing Factors

Examine the bottom-level events (leaves of the tree). Root causes are fundamental system weaknesses. Contributing factors enabled root causes but aren't sufficient alone.

Classification heuristic: If fixing this would prevent similar incidents, it's a root cause. If it only prevented this specific instance, it's a contributing factor.

6. Develop Corrective Actions

Focus actions on root causes at the base of the tree. Changes here prevent entire branches of failure. Validate that proposed actions actually break causal chains.

7. Document and Review

Present the completed tree to domain experts and stakeholders. Use their feedback to validate completeness and catch missing branches or invalid causal relationships.

Example

Scenario: Production database deleted during deployment

Consequence (Root): Production database "orders_prod" dropped at 02:15 UTC, 45,000 orders lost

Level 1 Immediate Causes (all necessary, together sufficient):

Deployment script executed "DROP DATABASE" command
Script targeted production environment instead of staging
Database lacked deletion protection safeguards

Level 2 Decomposition:

"Deployment script executed DROP DATABASE" requires:

Engineer ran deployment script manually
Script contained database migration rollback code
Rollback code included DROP command (intended for dev environments only)

"Script targeted production environment" requires:

Environment variable DEPLOY_ENV set to "production"
No confirmation prompt in script for production deployments
Engineer believed they were deploying to staging (mental model mismatch)

"Database lacked deletion protection" requires:

AWS RDS deletion protection disabled
Infrastructure-as-code didn't enforce deletion protection
Previous incident (6 months ago) didn't trigger policy change

Root Causes Identified:

No separation of rollback scripts by environment risk level
Manual deployment process vulnerable to human error
Infrastructure policies not enforced via automation
Incident learning system failed to prevent recurrence

Corrective Actions:

Remove DROP commands from all migration scripts (breaks first cause)
Implement automated deployments with staging → prod promotion (breaks manual execution)
Enforce deletion protection via Terraform with required=true (breaks third cause)
Create incident review board with enforcement authority (addresses learning failure)

Anti-Patterns

Stopping Too Early: Finding an "obvious" cause (human error, tool malfunction) and declaring victory. Causal Trees force you to ask "why was that possible?"

Confusing Correlation with Causation: Including factors that occurred near the incident but didn't actually cause it. Apply the necessity test rigorously.

Blame Focus: Creating a tree that terminates at "Bob made a mistake" instead of "system allowed mistake without safeguards." Causal Trees should reveal systemic issues.

Speculation Without Evidence: Adding causes based on assumptions. Mark speculative branches clearly and prioritize gathering evidence before building on them.

Paralysis by Perfection: Trying to decompose every branch to atomic levels. Accept pragmatic stopping points like "vendor component failed" unless root cause is strategic.

Ignoring Time Sequences: Forcing sequential dependencies into a tree structure. If "Event A must occur before Event B," Causal Trees struggle. Consider Event Tree Analysis instead.

Single-Perspective Construction: Building the tree solo without input from operators, users, or others who witnessed the system behavior. Diverse perspectives catch missing branches.

Related Frameworks

Fault Tree Analysis (FTA): Similar tree structure with Boolean logic gates (AND/OR); FTA is more formalized for quantitative analysis, Causal Tree more flexible for investigation
5 Whys: Simpler linear RCA technique; escalate to Causal Tree when 5 Whys reveals multiple interacting causes
Fishbone Diagram: Alternative visualization; Fishbone better for brainstorming, Causal Tree better for validating logical relationships
Event Tree Analysis: Forward-looking complement; Event Trees start with initiating event and map possible outcomes, Causal Trees work backward from outcome
Bow-Tie Analysis: Combines Causal Tree (left side: causes) with Event Tree (right side: consequences)
Swiss Cheese Model: Conceptual framework often used with Causal Tree to show how multiple defensive layers failed