fail-safes

star 7

Design systems to automatically transition to safe state when failures occur, preventing harm through built-in protective responses

lev-os By lev-os schedule Updated 3/7/2026

name: fail-safes description: Design systems to automatically transition to safe state when failures occur, preventing harm through built-in protective responses

Fail-Safes

Overview

Fail-safe design is an engineering principle where systems respond to failures by automatically entering a safe state, preventing or minimizing harm to people, equipment, or the environment. Unlike inherent safety (where hazards don't exist), fail-safes acknowledge that failures will occur and embed protective mechanisms that activate when malfunctions are detected. The core insight: if and when a fail-safe system fails, it remains at least as safe as it was before the failure.

Common examples: air brakes on trains (loss of air pressure applies brakes automatically), circuit breakers (overload trips switch to open position), deadman switches (loss of operator input stops machine). Fail-safe design is critical in aviation, nuclear power, medical devices, industrial machinery, and transportation systems where failures can cause catastrophic consequences.

When to Use

  • High-consequence failures: Aviation, nuclear, medical devices where malfunction risks lives
  • Unattended operations: Industrial processes, autonomous systems without constant human supervision
  • Public safety systems: Transportation infrastructure, building safety, emergency systems
  • Regulatory requirements: Industries with mandatory safety standards (FDA, FAA, nuclear)
  • Complexity beyond human monitoring: Systems too fast or intricate for manual intervention

The Process

Step 1: Identify Failure Modes

Systematically enumerate ways the system can fail using FMEA (Failure Mode Effects Analysis) or fault tree analysis.

Categories of failures:

  • Component failures: Motor burns out, sensor drifts, wire breaks
  • Power loss: Electrical outage, battery depletion, pneumatic pressure drop
  • Software errors: Infinite loops, race conditions, memory overflow
  • Human errors: Operator input mistakes, maintenance oversights
  • Environmental: Overheating, freezing, electromagnetic interference

Example: Elevator system:

  • Cable breaks
  • Motor fails
  • Control system crashes
  • Power outage
  • Overload (too many passengers)

Step 2: Assess Consequences of Each Failure

Determine what happens if failure occurs without protective mechanisms.

Severity classification:

  • Catastrophic: Loss of life, major environmental damage
  • Critical: Severe injury, equipment destruction
  • Moderate: Minor injury, significant downtime
  • Negligible: No safety impact, minor inconvenience

Example consequences:

  • Elevator cable breaks → Catastrophic (free fall)
  • Control system crashes → Critical (stuck between floors)
  • Overload → Moderate (mechanical strain)

Prioritize fail-safe design for catastrophic and critical failures.

Step 3: Design Default-Safe State

Define what "safe" means for each failure mode - the system configuration that minimizes harm.

Safe state design patterns:

  • Shutdown: Stop all motion, cut power (industrial presses, saws)
  • Lock in place: Prevent further movement (elevator brakes, parking brakes)
  • Open circuit: Break electrical connection (circuit breakers, fuses)
  • Vent to atmosphere: Release pressure safely (pressure relief valves)
  • Activate backup: Switch to redundant system (dual power supplies, backup generators)
  • Alert and isolate: Trigger alarm, quarantine affected subsystem

Elevator safe state examples:

  • Cable breaks → Emergency brakes engage (friction clamps on guide rails)
  • Power loss → Brakes engage by default (held open by power, not closed)
  • Overload → Refuse to move, sound alarm

Step 4: Implement Automatic Trigger Mechanisms

Design sensors and logic to detect failures and activate safe state without human intervention.

Detection methods:

  • Sensor monitoring: Temperature, pressure, speed, voltage thresholds
  • Watchdog timers: Software heartbeat monitoring (if no signal in X seconds, assume crash)
  • Comparison logic: Redundant sensors cross-checked (if disagreement, assume failure)
  • Rate-of-change limits: Sudden spikes indicate malfunction
  • Deadman switches: Require continuous operator input (release = shutdown)

Automatic activation:

  • Mechanical: Spring-loaded brakes, pressure-activated valves (no electronics needed)
  • Electrical: Relay logic, circuit breakers with trip coils
  • Software: Exception handlers, interrupt routines
  • Hydraulic/pneumatic: Pilot valves, air-brake systems

Example: Train air brakes:

  • Detection: Air pressure drop (brake line split or decoupling)
  • Trigger: Spring applies brakes when air pressure lost
  • Safe state: Train stops (brakes engaged by default)

Step 5: Test Failure Scenarios

Validate that fail-safe mechanisms work as designed through simulated failures.

Testing approach:

  • Component removal: Disconnect sensor, cut wire, remove fuse
  • Power interruption: Kill power mid-operation
  • Software fault injection: Trigger exceptions, memory corruption
  • Environmental stress: Overheat, vibration, electromagnetic pulses
  • Human error simulation: Wrong button sequence, maintenance mistakes

Verification criteria:

  • Safe state achieved within acceptable time (milliseconds to seconds)
  • No overshoot or oscillation (system doesn't bounce between states)
  • Safe state maintained until manual reset or problem fixed
  • Failure clearly indicated to operators (lights, alarms, displays)

Step 6: Add Redundancy for Critical Paths

Combine fail-safe with backup systems for highest-risk scenarios.

Redundancy types:

  • Parallel redundancy: Multiple components performing same function (dual sensors, triple-redundant computers)
  • Standby redundancy: Backup activates when primary fails (emergency generator, spare pump)
  • Diverse redundancy: Different technologies for same function (mechanical + electronic brakes)

Example: Commercial aircraft:

  • Multiple engines (if one fails, others maintain flight)
  • Hydraulic + electric + mechanical backup for flight controls
  • Dual/triple redundant computers with voting logic
  • Emergency power from ram air turbine (windmill in slipstream)

Critical: Redundancy alone isn't fail-safe - need automatic switchover and safe state if all redundancy exhausted.

Step 7: Document and Train

Ensure operators understand fail-safe behavior and don't override safety mechanisms.

Documentation requirements:

  • Failure modes covered by fail-safes
  • Safe state for each failure
  • Trigger conditions and timing
  • Operator actions during fail-safe activation
  • Reset procedures after safe state triggered

Training focus:

  • Don't bypass: Operators sometimes disable fail-safes for "efficiency" (catastrophic)
  • Recognize activation: Understand alarms/indicators mean fail-safe triggered
  • Trust the system: Don't override during emergency (designed response is safest)

Common Pitfalls

Fail-deadly design - Opposite of fail-safe. Example: electric door lock that remains locked when power fails (traps people in fire). Design should default to unlocked.

Single point of failure - Fail-safe mechanism itself can fail. Need redundant detection/activation or diverse methods (mechanical + electronic).

Mode confusion - Operators don't know if fail-safe is active or system is malfunctioning normally. Clear indicators essential.

Delay in detection - Sensor lag or algorithm processing time allows unsafe state to persist. Critical systems need sub-second response.

False positives - Overly sensitive fail-safes trigger unnecessarily, frustrating operators who then disable them. Balance sensitivity vs. nuisance trips.

Untested fail-safes - Mechanisms not exercised regularly can corrode, seize, or fail when needed. Periodic testing mandatory.

Real-World Applications

Aviation - Multiple engines: If one engine flames out, others maintain flight. Planes can land safely on single engine.

Nuclear reactors - Control rod insertion: Power loss causes electromagnets to release, dropping control rods into reactor core via gravity (stops fission).

Railway - Air brakes: Brakes held open by air pressure. Brake line split or carriage uncoupling drops pressure, applying brakes automatically.

Medical - Infusion pumps: Software crash, sensor error, or power loss stops drug delivery (safer than uncontrolled delivery).

Industrial - Emergency stops: Red mushroom buttons cut all power, stop motion, engage brakes on lathes, mills, presses.

Automotive - Anti-lock brakes (ABS): Sensor failure defaults to normal braking (safe but sub-optimal vs. catastrophic lockup).

Key Insights

Fail-safe design inverts traditional engineering mindset. Instead of optimizing for normal operation, prioritize worst-case behavior. The question isn't "How well does this work?" but "What happens when it breaks?"

Three design philosophies:

  1. Inherent safety: Eliminate hazard entirely (replace toxic chemical with non-toxic)
  2. Fail-safe: Automatic safe response when failure occurs (air brakes, circuit breakers)
  3. Procedural safety: Rely on human vigilance and protocols (least reliable)

Fail-safe works best for predictable failure modes with clear safe states. Struggles when:

  • Failures are novel/unanticipated
  • No single safe state exists (trade-offs between hazards)
  • System must remain operational despite failure (use fail-operational design instead)

Critical distinction: Fail-safe vs. Fail-operational

  • Fail-safe: System shuts down safely
  • Fail-operational: System continues functioning despite failure (aircraft, autonomous vehicles, medical life support)

Modern trend: Move from fail-safe to fail-operational using redundancy, but retain fail-safe as ultimate backstop when all redundancy exhausted.

Install via CLI
npx skills add https://github.com/lev-os/agents --skill fail-safes
Repository Details
star Stars 7
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator