name: fmea
description: Systematically identify potential failure modes, assess their severity and likelihood, and prioritize preventive actions using risk priority numbers
FMEA (Failure Mode and Effects Analysis)
Overview
FMEA is a systematic, proactive method for evaluating a process or design to identify where and how it might fail and to assess the relative impact of each failure. Developed in military and aerospace reliability engineering and later adopted widely by the automotive industry, FMEA assigns each failure mode a Risk Priority Number (RPN = Severity × Occurrence × Detection) to decide which failures to prevent first.
When to Use
- Designing new products, processes, or systems
- Launching critical features with high failure cost
- Incident post-mortems to prevent recurrence
- Evaluating vendor/supplier reliability
- Improving manufacturing quality
- Pre-launch risk assessment
The Process
Step 1: Identify Potential Failure Modes
List all ways the system could fail. For each component/process step, ask: "What could go wrong?"
Example: Failure modes for an e-commerce checkout process include payment gateway timeout, inventory out of sync, address validation failure, credit card decline, and email confirmation failure.
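A minimal sketch of how the Step 1 list could be recorded so that each failure mode stays tied to the process step it belongs to. The `FailureMode` class and the step names (`payment`, `inventory`, and so on) are assumptions made for illustration; only the failure mode descriptions come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    step: str         # component or process step being analyzed (assumed names)
    description: str  # answer to "What could go wrong?"


# Failure modes from the checkout example, assigned to hypothetical steps.
checkout_failure_modes = [
    FailureMode("payment", "payment gateway timeout"),
    FailureMode("inventory", "inventory out of sync"),
    FailureMode("shipping", "address validation fails"),
    FailureMode("payment", "credit card decline"),
    FailureMode("notification", "email confirmation fails"),
]

# Group by step to make it easy to spot steps with no failure modes listed yet.
by_step = {}
for fm in checkout_failure_modes:
    by_step.setdefault(fm.step, []).append(fm.description)

for step, modes in by_step.items():
    print(f"{step}: {', '.join(modes)}")
```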
Step 2: Assess Severity (1-10 Scale)
Rate impact if failure occurs. 1 = no impact, 10 = catastrophic. Focus on customer/business impact, not technical complexity.
Example: Payment gateway timeout: Severity = 9 (lost revenue, customer frustration). Email confirmation fails: Severity = 3 (inconvenient but order proceeds).
Step 3: Assess Occurrence (1-10 Scale)
Rate likelihood of failure happening. 1 = rare, 10 = inevitable. Base the rating on data if available, on an educated guess if not.
Example: Payment timeout: Occurrence = 2 (gateway 99.5% uptime). Email fails: Occurrence = 4 (email service less reliable).
Step 4: Assess Detection (1-10 Scale)
Rate how hard the failure is to catch before it reaches the customer. The scale is inverted: 1 = always detected before customer impact, 10 = never detected until a customer complains.
Example: Payment timeout: Detection = 3 (monitoring alerts immediately). Email fails: Detection = 8 (no monitoring, customer must report).
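The three ratings from Steps 2 to 4 can be kept together and checked against the 1-10 scale. This is a sketch under stated assumptions: the `Rating` class and its validation rule are hypothetical helpers, not part of any FMEA standard; the values are the ones used in the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    name: str
    severity: int    # 1 = no impact, 10 = catastrophic
    occurrence: int  # 1 = rare, 10 = inevitable
    detection: int   # 1 = always detected, 10 = found only when a customer complains

    def __post_init__(self):
        # Reject anything outside the 1-10 integer scale used in this document.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not isinstance(value, int) or not 1 <= value <= 10:
                raise ValueError(
                    f"{field_name} must be an integer from 1 to 10, got {value!r}"
                )


payment_timeout = Rating("Payment gateway timeout", severity=9, occurrence=2, detection=3)
email_fails = Rating("Email confirmation fails", severity=3, occurrence=4, detection=8)
```

Validating at construction time catches a common slip: entering Detection as "how good is our monitoring" (high is good) rather than on the inverted scale (high is bad) does not raise an error, so the field comments are worth keeping next to the numbers.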
Step 5: Calculate RPN and Prioritize
RPN = Severity × Occurrence × Detection. Higher RPN = higher priority. Focus mitigations on highest RPNs first.
Example:
- Payment timeout: 9 × 2 × 3 = 54 RPN
- Email fails: 3 × 4 × 8 = 96 RPN → Prioritize email monitoring despite lower severity
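The calculation and ranking can be sketched in a few lines. The `rpn` function name and the dictionary layout are assumptions for illustration; the ratings are the checkout values from Steps 2 to 4.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: the product of the three 1-10 ratings (range 1-1000)."""
    return severity * occurrence * detection


# (severity, occurrence, detection) from the checkout example
ratings = {
    "Payment timeout": (9, 2, 3),
    "Email fails": (3, 4, 8),
}

# Highest RPN first: this is the order in which to plan mitigations.
ranked = sorted(ratings.items(), key=lambda item: rpn(*item[1]), reverse=True)
for name, (s, o, d) in ranked:
    print(f"{name}: {s} x {o} x {d} = {rpn(s, o, d)}")
# Email fails: 3 x 4 x 8 = 96
# Payment timeout: 9 x 2 x 3 = 54
```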
Step 6: Implement Actions and Reassess
Design mitigations that lower the Severity rating (safer failure mode), the Occurrence rating (prevent the failure), or the Detection rating (catch it earlier). Recalculate RPN after each mitigation.
Example: Add email delivery monitoring (Detection 8 → 2). New RPN: 3 × 4 × 2 = 24 (acceptable).
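A small sketch of the reassessment step, assuming a hypothetical `FailureMode` record with an `rpn` property. It changes only the rating that the mitigation affects and recomputes the score.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int
    occurrence: int
    detection: int

    @property
    def rpn(self) -> int:
        return self.severity * self.occurrence * self.detection


before = FailureMode("Email fails", severity=3, occurrence=4, detection=8)

# Mitigation from the example: delivery monitoring lowers only the Detection rating.
after = replace(before, detection=2)

reduction = 1 - after.rpn / before.rpn
print(f"{before.name}: RPN {before.rpn} -> {after.rpn} ({reduction:.0%} lower)")
# Email fails: RPN 96 -> 24 (75% lower)
```

Keeping the original record immutable and deriving the mitigated one preserves the before/after pair, which is useful when the FMEA is revisited later.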
Example Application
Situation: SaaS company launching new API with high-value enterprise customers.
Failure Modes Analysis:
- API returns 500 errors: S=10, O=3, D=2 → RPN=60
- Response time >5sec: S=7, O=5, D=4 → RPN=140
- Authentication fails: S=9, O=2, D=3 → RPN=54
- Rate limiting blocks requests: S=8, O=6, D=7 → RPN=336
Prioritization: Address rate limiting first (RPN=336), then response time (RPN=140).
Mitigations:
- Rate limiting: Per-customer limits (O: 6→2), monitoring alerts (D: 7→2) → New RPN=32
- Response time: Database indexing (O: 5→2), performance monitoring (D: 4→1) → New RPN=14
Outcome: Launch succeeds, no customer-facing failures. Prevented potential churn.
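The API analysis above can be reproduced as a before/after comparison. This is an illustrative sketch: the dictionary layout and the helper names `rpn` and `ranked` are assumptions, while the ratings are taken directly from the example.

```python
# name: (severity, occurrence, detection) as rated in the example
baseline = {
    "API returns 500 errors": (10, 3, 2),
    "Response time >5sec": (7, 5, 4),
    "Authentication fails": (9, 2, 3),
    "Rate limiting blocks requests": (8, 6, 7),
}

# Post-mitigation ratings: only occurrence and detection change.
mitigated = dict(baseline)
mitigated["Rate limiting blocks requests"] = (8, 2, 2)
mitigated["Response time >5sec"] = (7, 2, 1)


def rpn(ratings):
    s, o, d = ratings
    return s * o * d


def ranked(table):
    """Failure mode names ordered from highest to lowest RPN."""
    return sorted(table, key=lambda name: rpn(table[name]), reverse=True)


for name in ranked(baseline):
    print(f"{name}: {rpn(baseline[name])} -> {rpn(mitigated[name])}")
# Rate limiting blocks requests: 336 -> 32
# Response time >5sec: 140 -> 14
# API returns 500 errors: 60 -> 60
# Authentication fails: 54 -> 54
```

After the two mitigations, `ranked(mitigated)` puts the 500 errors (RPN=60) and authentication failures (RPN=54) at the top, which is why the ranking is worth recomputing rather than treating the first pass as final.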
Anti-Patterns
- ❌ Skipping low-severity, high-occurrence issues (death by a thousand cuts)
- ❌ Rating severity by technical complexity instead of customer impact
- ❌ Ignoring detection rating (undetected failures multiply impact)
- ❌ Treating FMEA as a one-time exercise instead of a living document
- ❌ Analysis paralysis (don't FMEA every trivial process)
- ❌ Confusing RPN with absolute risk (RPN is a relative prioritization tool)
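One way to see why RPN is not an absolute measure of risk: very different rating profiles can yield the same number. The sketch below uses the 500-error ratings from the example application and a second, purely hypothetical profile chosen to match its score.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    return severity * occurrence * detection


# From the example application: catastrophic, but rare and quickly detected.
api_500_errors = (10, 3, 2)

# Hypothetical profile for comparison: minor impact, but never caught internally.
minor_but_hidden = (2, 3, 10)

print(rpn(*api_500_errors), rpn(*minor_but_hidden))  # 60 60
```

Both score 60, yet few teams would treat them as equally urgent, so the individual ratings are worth reading alongside the product.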
Related
- five-whys
- fishbone-diagram
- root-cause-analysis
- risk-management
- pre-mortem