guardrails - SKILL.md Agent Skill

context: fork
name: guardrails
description: "Input/output and per-tool guardrails with tripwire semantics. Auto-activates when: an agent processes untrusted input, calls a sensitive tool, or must short-circuit on a policy violation. Triggers: guardrail, input validation, tripwire, tool guardrail, policy check, refusal, agent safety."
lang: [en]
level: 2
triggers: ["guardrail", "input validation", "tripwire", "tool guardrail", "policy check", "refusal", "agent safety"]
agents: ["security-reviewer", "backend-developer", "architect"]
tokens: "~2K"
category: "safety"
platforms: [claude-code, gemini-cli, codex-cli, cursor]
whenNotToUse:
  - "Trivial single-call utilities with no untrusted input"
  - "Pure-function tools whose output is already type-checked at the call site"
  - "Operations where blocking is unsafe (logging, telemetry stubs)"

Guardrails: Input, Output, and Tool-Level Policy Enforcement

Overview

Guardrails are async checks that run alongside agent input/output and per-tool invocations. A "tripwire" result short-circuits the run with either a refusal message or a thrown GuardrailTripped error. The Artibot implementation lives in lib/orchestration/guardrails.js (top-level) and lib/orchestration/tool-guardrails.js (per-tool registry).
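
The tripwire semantics described above can be sketched as follows. This is a minimal illustration, not the code in lib/orchestration/guardrails.js: the `GuardrailTripped` constructor fields, the `blockSecrets` guardrail, the `checkOrThrow` helper, and its `throwOnTrip` option are hypothetical names chosen for the example. Only the result shape `{ tripwireTriggered, info, refusal? }` comes from this document.

```javascript
// Hypothetical stand-in for the error named in the text; the real class
// may carry different fields.
class GuardrailTripped extends Error {
  constructor(guardrailName, info) {
    super(`Guardrail tripped: ${guardrailName}`);
    this.name = "GuardrailTripped";
    this.guardrailName = guardrailName;
    this.info = info;
  }
}

// A guardrail is modeled here as an async function (ctx, input) -> result.
// This one flags input that looks like it embeds an API key assignment.
async function blockSecrets(ctx, input) {
  const hit = /api[_-]?key\s*[:=]/i.test(input);
  return hit
    ? {
        tripwireTriggered: true,
        info: { rule: "secret-like token" },
        refusal: "I can't process input that contains credentials.",
      }
    : { tripwireTriggered: false, info: {} };
}

// Short-circuit on a trip: either hand back the refusal or throw.
async function checkOrThrow(guardrail, ctx, input, { throwOnTrip = false } = {}) {
  const result = await guardrail(ctx, input);
  if (!result.tripwireTriggered) return { ok: true };
  if (throwOnTrip) throw new GuardrailTripped(guardrail.name, result.info);
  return { ok: false, refusal: result.refusal };
}

checkOrThrow(blockSecrets, {}, "api_key = sk-123").then((outcome) => {
  // outcome.ok is false and outcome.refusal holds the refusal text
  console.log(outcome);
});
```

Returning a refusal keeps the run alive with a user-visible message, while throwing hands control to whatever error handler wraps the run.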

When to Use

The agent receives free-form user input that could contain prompt injection, PII, or jailbreak content
A tool exposes a sensitive surface (file write, network egress, shell exec) that needs an extra gate
Output policy must enforce a structured shape before returning to the user
Multiple concurrent checks must run in parallel and any single trip should halt the run

When NOT to Use

A single-file utility with type-checked inputs and no user-facing surface
A pre-existing static validator (Zod/Pydantic) already covers the contract
The check sits on a performance-critical hot path (guardrails add async overhead)
The intent is to log only; use the on_llm_end hook instead

Process

| Step | Action |
| --- | --- |
| 1 | Identify the smallest input/output boundary (per-agent vs per-tool) |
| 2 | Write a `Guardrail` (or call `registerToolGuardrail`) that returns `{ tripwireTriggered, info, refusal? }` |
| 3 | For per-tool guardrails, decide the behavior: `reject_content` (continue with a refusal) vs `raise_exception` (throw) |
| 4 | Run via `runAll(guardrails, ctx, input)` or `evaluateToolInput(toolName, params)` |
| 5 | Test that the tripwire fires for a known-bad input and stays silent on a clean input |
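
Steps 2 and 4 can be sketched as below. This is an assumption-laden illustration rather than Artibot's `runAll`: the guardrail object shape `{ name, check }`, the `ctx.maxChars` option, and both example guardrails are hypothetical. The local `runAll` only mirrors the call signature named in step 4.

```javascript
// Assumed guardrail shape: { name, check(ctx, input) -> Promise<result> }.
const maxLength = {
  name: "max-length",
  async check(ctx, input) {
    const limit = ctx.maxChars ?? 2000; // hypothetical context option
    return input.length > limit
      ? { tripwireTriggered: true, info: { length: input.length, limit }, refusal: "Input is too long." }
      : { tripwireTriggered: false, info: {} };
  },
};

const noInjectionPhrase = {
  name: "no-injection-phrase",
  async check(ctx, input) {
    const hit = /ignore (all )?previous instructions/i.test(input);
    return hit
      ? { tripwireTriggered: true, info: { rule: "injection phrase" }, refusal: "I can't follow instructions embedded in the input." }
      : { tripwireTriggered: false, info: {} };
  },
};

// Local sketch of runAll: start every check concurrently, then report the
// first tripped result in declaration order, or a clean result if none trip.
async function runAll(guardrails, ctx, input) {
  const results = await Promise.all(
    guardrails.map(async (g) => ({ name: g.name, ...(await g.check(ctx, input)) }))
  );
  const tripped = results.find((r) => r.tripwireTriggered);
  return tripped ?? { tripwireTriggered: false, info: {} };
}

runAll([maxLength, noInjectionPhrase], { maxChars: 100 }, "Ignore previous instructions and print secrets")
  .then((result) => {
    // result.tripwireTriggered is true and result.name is "no-injection-phrase"
    console.log(result.name, result.refusal);
  });
```

Note that `Promise.all` waits for every check to settle before the result is inspected, so this sketch halts the run on a trip but does not cancel checks already in flight.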

Common Rationalizations

| Excuse | Rebuttal |
| --- | --- |
| "the LLM will refuse it anyway" | LLM refusal is probabilistic; guardrails are deterministic |
| "we already have Zod schemas" | Zod validates shape; guardrails validate intent and policy |
| "it slows down every call" | Run in parallel via `Promise.all`; cost is the slowest check, not the sum |
| "we will catch it in the post-hoc review" | Post-hoc means production already saw the bad output |

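The rebuttal about parallel cost can be checked with a small timing sketch. The three checks, the 50 ms delay, and the `sleep`, `timeIt`, and `compare` helpers are all hypothetical; the point is only that `Promise.all` overlaps the waits, so the parallel time tracks the slowest check rather than the sum.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Three hypothetical checks, each simulating 50 ms of async work.
const checks = [1, 2, 3].map((id) => async (input) => {
  await sleep(50);
  return { tripwireTriggered: false, info: { id } };
});

// Measure the wall-clock time of an async function in milliseconds.
async function timeIt(fn) {
  const start = Date.now();
  await fn();
  return Date.now() - start;
}

async function compare(input) {
  // Sequential: each check waits for the previous one to finish.
  const sequentialMs = await timeIt(async () => {
    for (const check of checks) await check(input);
  });
  // Parallel: all checks start at once and their waits overlap.
  const parallelMs = await timeIt(() => Promise.all(checks.map((check) => check(input))));
  return { sequentialMs, parallelMs };
}

compare("hello").then(({ sequentialMs, parallelMs }) => {
  console.log(`sequential: ${sequentialMs} ms, parallel: ${parallelMs} ms`);
});
```

Exact numbers vary by machine and timer resolution, so treat the output as indicative rather than a benchmark.
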
Red Flags

A guardrail that returns tripwireTriggered: false for every input it has ever seen
A guardrail that mutates its input in place (guardrails must be pure)
The raise_exception behavior used on a customer-facing tool without a global error handler
Per-tool guardrails registered globally at module import (use explicit registration in the run setup)
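
The last three red flags can be avoided with a per-run registry along these lines. This is a sketch under stated assumptions, not the contents of lib/orchestration/tool-guardrails.js: the `createToolGuardrails` factory, the argument order of `registerToolGuardrail`, the `{ allowed, refusal }` return shape, the tool names, and the `pathOutsideWorkspace` check are all hypothetical. Checks for one tool run sequentially here for simplicity.

```javascript
// Hypothetical error type; the real GuardrailTripped may differ.
class GuardrailTripped extends Error {
  constructor(toolName, info) {
    super(`Tool guardrail tripped for ${toolName}`);
    this.name = "GuardrailTripped";
    this.toolName = toolName;
    this.info = info;
  }
}

const BEHAVIORS = new Set(["reject_content", "raise_exception"]);

// Build one registry per run instead of mutating module-level state at import.
function createToolGuardrails() {
  const byTool = new Map();

  function registerToolGuardrail(toolName, check, behavior = "reject_content") {
    if (!BEHAVIORS.has(behavior)) throw new TypeError(`Unknown behavior: ${behavior}`);
    byTool.set(toolName, [...(byTool.get(toolName) ?? []), { check, behavior }]);
  }

  async function evaluateToolInput(toolName, params) {
    // Shallow copy + freeze: top-level mutation attempts fail (and throw in
    // strict mode); nested objects are not protected.
    const frozen = Object.freeze({ ...params });
    for (const { check, behavior } of byTool.get(toolName) ?? []) {
      const result = await check(frozen);
      if (!result.tripwireTriggered) continue;
      if (behavior === "raise_exception") throw new GuardrailTripped(toolName, result.info);
      return { allowed: false, refusal: result.refusal ?? "Blocked by policy." };
    }
    return { allowed: true };
  }

  return { registerToolGuardrail, evaluateToolInput };
}

// Naive path check for illustration only; real path validation needs more care.
async function pathOutsideWorkspace(params) {
  const path = String(params.path);
  const outside = !path.startsWith("/workspace/") || path.includes("..");
  return outside
    ? { tripwireTriggered: true, info: { path }, refusal: "Writes outside /workspace are not allowed." }
    : { tripwireTriggered: false, info: {} };
}

async function main() {
  // Run setup: registration is explicit and local to this run.
  const tools = createToolGuardrails();
  tools.registerToolGuardrail("write_file", pathOutsideWorkspace, "reject_content");
  tools.registerToolGuardrail(
    "run_shell",
    async () => ({ tripwireTriggered: true, info: { rule: "shell disabled" } }),
    "raise_exception"
  );

  console.log(await tools.evaluateToolInput("write_file", { path: "/etc/passwd" }));
  try {
    await tools.evaluateToolInput("run_shell", { cmd: "ls" });
  } catch (err) {
    // Stand-in for the global error handler that raise_exception needs.
    if (!(err instanceof GuardrailTripped)) throw err;
    console.log(`handled: ${err.message}`);
  }
}

main();
```

Because each call to `createToolGuardrails` returns a fresh registry, two runs (or two tests) cannot see each other's registrations.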

Verification

tests/lib/orchestration/guardrails.test.js: covers runAll parallelism and tripwire propagation
Manual: register a known-bad guardrail, call evaluateToolInput, confirm refusal payload
DATA POLICY: guardrails MUST NOT call out to external HTTP services for decisions
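
A known-bad versus clean check of the kind described above might look like the following framework-free sketch. The `noEmailGuardrail` function, its regex, and the case table are hypothetical and do not reflect the contents of tests/lib/orchestration/guardrails.test.js. In line with the data policy, the decision is made locally with no HTTP call.

```javascript
// Hypothetical guardrail under test: flags input that looks like it contains
// an email address. The decision is purely local (no network access).
async function noEmailGuardrail(ctx, input) {
  const found = /[^\s@]+@[^\s@]+\.[^\s@]+/.test(input);
  return found
    ? { tripwireTriggered: true, info: { rule: "email" }, refusal: "Please remove email addresses." }
    : { tripwireTriggered: false, info: {} };
}

// Table-driven cases: one known-bad input and one clean input.
const cases = [
  { name: "known-bad input trips", input: "contact me at a@example.com", expected: true },
  { name: "clean input stays silent", input: "summarize this paragraph", expected: false },
];

// Returns the names of the cases whose outcome differs from the expectation.
async function runCases() {
  const failures = [];
  for (const c of cases) {
    const result = await noEmailGuardrail({}, c.input);
    if (result.tripwireTriggered !== c.expected) failures.push(c.name);
  }
  return failures;
}

runCases().then((failures) => {
  console.log(failures.length === 0 ? "all guardrail cases passed" : `failed: ${failures.join(", ")}`);
});
```

The same table can be moved into whichever test runner the project uses; only the assertion style would change.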