agent-safety-architect - SKILL.md Agent Skill

name: agent-safety-architect
description: Design safety architectures for AI agents (autonomy tiers, permission zones, command approval gates, secret handling, escalation paths, and observability). Use when building agents that execute code, modify files, access networks, handle credentials, or make consequential decisions. Covers three autonomy tiers (full-auto, supervised, human-led), container security models, tool safety classifications, and audit logging. Based on patterns from Kimi's 4-layer container model, Claude Code's approval workflows, Devin's data security, and Windsurf's safety protocols.

Agent Safety Architect

Design safety boundaries, permission models, and observability for AI agents.

Workflow

Safety Design Workflow

1. Classify agent actions by risk level (safe, moderate, dangerous)
2. Assign an autonomy tier to each action class
3. Define permission zones for file system and network access
4. Implement approval gates for high-risk operations
5. Add audit logging and observability

Safety Audit Workflow

1. Map all agent capabilities to risk levels
2. Check for missing approval gates on dangerous operations
3. Verify secret handling (no credentials in prompts or logs)
4. Test escalation paths end-to-end
5. Score against the safety checklist
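
The first two audit steps can be sketched as a small check over a capability inventory. This is a minimal illustration, not part of the skill's scripts: the `Capability` record, the `find_ungated_dangerous` helper, and the sample capability names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical inventory entry for one agent capability."""
    name: str
    risk: str                # "safe", "moderate", or "dangerous"
    has_approval_gate: bool  # True if an approval gate guards this capability


def find_ungated_dangerous(capabilities):
    """Return the names of dangerous capabilities that have no approval gate."""
    return [
        c.name
        for c in capabilities
        if c.risk == "dangerous" and not c.has_approval_gate
    ]


# Illustrative inventory; the names are made up for this example.
capabilities = [
    Capability("read_file", "safe", False),
    Capability("write_file", "moderate", True),
    Capability("delete_file", "dangerous", True),
    Capability("drop_table", "dangerous", False),
]

print(find_ungated_dangerous(capabilities))  # ['drop_table']
```

Any name returned here is an audit finding: a dangerous operation the agent could run without a human seeing it first.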

Autonomy Tiers

Three levels of agent autonomy, assigned according to the risk of the action. Each tier's reference file contains its template.

Tier	Risk Level	Agent Role	Reference
Full Auto	Low	Execute without approval	`references/01-full-auto.md`
Supervised	Medium	Execute after approval	`references/02-supervised.md`
Human-Led	High	Recommend only, human executes	`references/03-human-led.md`
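
One way to encode the table above is a lookup from risk level to tier. The sketch below is illustrative only; the `Tier` enum, `tier_for`, and `agent_may_execute` are hypothetical names, and falling back to Human-Led for an unrecognized risk level is an assumption (fail closed), not something the table specifies.

```python
from enum import Enum


class Tier(Enum):
    FULL_AUTO = "full-auto"
    SUPERVISED = "supervised"
    HUMAN_LED = "human-led"


# Mirrors the Tier / Risk Level columns of the table.
TIER_BY_RISK = {
    "low": Tier.FULL_AUTO,
    "medium": Tier.SUPERVISED,
    "high": Tier.HUMAN_LED,
}


def tier_for(risk_level: str) -> Tier:
    """Look up the tier for a risk level.

    Assumption: unknown risk levels map to the most restrictive tier.
    """
    return TIER_BY_RISK.get(risk_level.lower(), Tier.HUMAN_LED)


def agent_may_execute(tier: Tier, approved: bool) -> bool:
    """Decide whether the agent itself may run the action."""
    if tier is Tier.FULL_AUTO:
        return True          # execute without approval
    if tier is Tier.SUPERVISED:
        return approved      # execute only after approval
    return False             # human-led: agent recommends, human executes


print(tier_for("medium"))                          # Tier.SUPERVISED
print(agent_may_execute(Tier.HUMAN_LED, True))     # False
```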

Permission Zones

Zone	Access	Reference
File System	Read/write/execute boundaries	`references/04-permission-zones.md`
Network	Allowed endpoints and protocols	`references/04-permission-zones.md`
Secrets	Environment variables, credential vaults	`references/05-secret-handling.md`
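
The file system and network zones come down to two membership checks. The following is a minimal sketch under stated assumptions: the workspace root `/workspace`, the host `api.example.com`, and the HTTPS-only rule are placeholders chosen for illustration, and the referenced files may define the zones differently.

```python
from pathlib import Path
from urllib.parse import urlparse

WORKSPACE = Path("/workspace")         # hypothetical workspace root
ALLOWED_HOSTS = {"api.example.com"}    # hypothetical endpoint allowlist
ALLOWED_SCHEMES = {"https"}            # assumption: HTTPS only


def path_in_workspace(candidate: str, workspace: Path = WORKSPACE) -> bool:
    """True if the path stays inside the workspace after resolving '..' and symlinks."""
    root = workspace.resolve()
    resolved = (root / candidate).resolve()
    try:
        resolved.relative_to(root)
        return True
    except ValueError:
        return False


def endpoint_allowed(url: str) -> bool:
    """True if both the scheme and the host are on the allowlists."""
    parsed = urlparse(url)
    return parsed.scheme in ALLOWED_SCHEMES and parsed.hostname in ALLOWED_HOSTS


print(path_in_workspace("src/main.py"))        # True
print(path_in_workspace("../etc/passwd"))      # False
print(endpoint_allowed("https://api.example.com/v1/items"))  # True
print(endpoint_allowed("http://api.example.com/"))           # False
```

Resolving the path before comparing is the important design choice: a plain string-prefix check would accept `/workspace/../etc/passwd`.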

Action Risk Classification

<risk_classification>
  <safe auto_approve="true">
    - Read files within workspace
    - Search codebase
    - View file outlines
    - List directories
    - Run read-only database queries
  </safe>

  <moderate requires_approval="first_time">
    - Write/modify files within workspace
    - Run shell commands (non-destructive)
    - Install dependencies
    - Create branches
    - Make API calls to allowed endpoints
  </moderate>

  <dangerous requires_approval="always">
    - Delete files or directories
    - Run commands with sudo/root
    - Push to main/production branches
    - Drop database tables
    - Modify environment variables
    - Access external APIs not in allowlist
    - Execute arbitrary network requests
  </dangerous>
</risk_classification>
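
The three approval attributes in the classification above (`auto_approve`, `requires_approval="first_time"`, `requires_approval="always"`) can be sketched as a small policy object. This is a hypothetical illustration: it assumes that "first_time" means asking once per action name and remembering the answer for the session, and that unknown risk levels are treated as dangerous.

```python
# Maps each risk class to its approval rule, following the classification block.
POLICY = {"safe": "auto", "moderate": "first_time", "dangerous": "always"}


class ApprovalPolicy:
    """Hypothetical in-memory policy; remembers first-time approvals per action name."""

    def __init__(self):
        self._approved_once = set()

    def needs_approval(self, action: str, risk: str) -> bool:
        rule = POLICY.get(risk, "always")  # assumption: unknown risk fails closed
        if rule == "auto":
            return False
        if rule == "first_time":
            return action not in self._approved_once
        return True  # "always"

    def record_approval(self, action: str, risk: str) -> None:
        """Remember an approval only where the rule allows it to be reused."""
        if POLICY.get(risk) == "first_time":
            self._approved_once.add(action)


policy = ApprovalPolicy()
print(policy.needs_approval("install_dependencies", "moderate"))  # True
policy.record_approval("install_dependencies", "moderate")
print(policy.needs_approval("install_dependencies", "moderate"))  # False
print(policy.needs_approval("delete_directory", "dangerous"))     # True
```

Approvals for dangerous actions are deliberately never cached, so one approved deletion does not authorize the next.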

Approval Gate Template

<approval_gate>
  <trigger>[Action that requires approval]</trigger>
  <display>
    - Exact command/action to be performed
    - One-sentence purpose explanation
    - Risk assessment (what could go wrong)
  </display>
  <options>
    <approve>Execute the action</approve>
    <modify>Suggest alternative</modify>
    <reject>Cancel the action</reject>
  </options>
  <timeout>[Auto-reject after N minutes of no response]</timeout>
</approval_gate>
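
A gate built from this template could look roughly like the sketch below. It is illustrative only: `ApprovalRequest` and `request_approval` are hypothetical names, and a `queue.Queue` stands in for whatever channel delivers the reviewer's decision. Treating an unrecognized answer as a rejection is an added assumption.

```python
import queue
from dataclasses import dataclass

VALID_OPTIONS = {"approve", "modify", "reject"}


@dataclass
class ApprovalRequest:
    command: str   # exact command/action to be performed
    purpose: str   # one-sentence purpose explanation
    risk: str      # what could go wrong


def request_approval(request: ApprovalRequest,
                     responses: "queue.Queue[str]",
                     timeout_seconds: float) -> str:
    """Show the request, wait for a decision, and auto-reject on timeout."""
    print(f"Command: {request.command}")
    print(f"Purpose: {request.purpose}")
    print(f"Risk:    {request.risk}")
    try:
        answer = responses.get(timeout=timeout_seconds)
    except queue.Empty:
        return "reject"  # no response within the timeout
    # Assumption: anything other than the three options counts as a rejection.
    return answer if answer in VALID_OPTIONS else "reject"


request = ApprovalRequest(
    command="git push origin main",
    purpose="Publish the reviewed fix to the main branch.",
    risk="Pushes directly to a production branch.",
)
decisions = queue.Queue()
decisions.put("approve")
print(request_approval(request, decisions, timeout_seconds=0.1))  # approve
print(request_approval(request, decisions, timeout_seconds=0.1))  # reject (timed out)
```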

Audit Logging Requirements

Every consequential agent action MUST log:

- Timestamp
- Action taken (tool name + parameters)
- Approval status (auto-approved, user-approved, system-approved)
- Outcome (success, failure, partial)
- State changes caused (files modified, commands run)
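
The required fields can be captured as one JSON object per action. The sketch below is a hypothetical illustration: the `audit_record` function and the JSON key names are assumptions, not a schema defined by this skill. Parameters should have secrets removed before they reach this function.

```python
import json
from datetime import datetime, timezone

APPROVAL_STATUSES = {"auto-approved", "user-approved", "system-approved"}
OUTCOMES = {"success", "failure", "partial"}


def audit_record(tool: str, parameters: dict, approval_status: str,
                 outcome: str, state_changes: list) -> str:
    """Build one JSON log line containing the five required fields."""
    if approval_status not in APPROVAL_STATUSES:
        raise ValueError(f"unknown approval status: {approval_status!r}")
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": {"tool": tool, "parameters": parameters},
        "approval_status": approval_status,
        "outcome": outcome,
        "state_changes": state_changes,
    }
    return json.dumps(record, sort_keys=True)


line = audit_record(
    tool="write_file",
    parameters={"path": "src/app.py"},
    approval_status="user-approved",
    outcome="success",
    state_changes=["modified src/app.py"],
)
print(line)
```

Rejecting unknown status and outcome values at write time keeps the log queryable later, since every record uses the same small vocabulary.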

Anti-Patterns

- Blanket Trust: auto-approving all actions regardless of risk
- Security Theater: approval gates on safe actions, none on dangerous ones
- Credential Leaking: API keys in prompts, logs, or generated code
- Silent Failure: agent fails destructively with no audit trail
- Privilege Creep: agent gradually escalates permissions without review
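
As one guard against Credential Leaking, text can be passed through a redaction step before it is logged or placed in a prompt. This is a minimal sketch with a single illustrative pattern; the `redact` helper and the key names it looks for are assumptions, and a real deployment would need a broader, tested pattern set.

```python
import re

# Illustrative pattern only: matches "key=value" or "key: value" pairs whose
# key is one of a few common secret names. It misses prefixed keys such as
# MY_TOKEN and is not a complete detector.
SECRET_PATTERN = re.compile(
    r"\b(api[_-]?key|token|password|secret)\b(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Replace the value of any matched secret assignment with [REDACTED]."""
    return SECRET_PATTERN.sub(
        lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text
    )


print(redact("API_KEY=abc123 ./deploy.sh"))   # API_KEY=[REDACTED] ./deploy.sh
print(redact("password: hunter2"))            # password: [REDACTED]
print(redact("no credentials in this line"))  # no credentials in this line
```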

Validation Scripts

Validate safety architecture with automated scoring (0-10):

python3 scripts/validate_safety.py <config_file> [--strict]

The script checks autonomy tier definitions and five safety mechanisms (secret handling, permission zones, audit logging, escalation, input validation), detects hardcoded credentials, and flags unsafe patterns (bypass instructions, elevated defaults).