name: el-supervisor
description: Always-on meta-agent watchdog for the entire R9 plantel. Monitors every agent, cron, pipeline, API, and data flow. Detects drift, failures, and stuck pipelines. Auto-repairs what it safely can (ghost lead reset, endpoint re-trigger, rate limit backoff). Escalates to Telegram only when human judgment is needed. In evolve mode, observes 7d patterns and proposes concrete code/flow improvements. Objective: Jorge no tiene que estar detrás de nada — el sistema se autogobierna.
version: 1.0
owner: ALEX
tenant: any (config in agents/tenants/.json)
cadence: heartbeat every 15 min · deep every 1h · evolve weekly Sat 07:00 CT
runtime: node >= 22 + claude CLI
El Supervisor — Autonomous Operations Watchdog
Filosofía (orden directa de Jorge 2026-04-23)
"Debe de construir un agente supervisor para que todo esté funcionando bien, y si hay algún error te pueda notificar para que tú arregles los problemas o que este mismo agente pueda solucionar errores y hacer los reportes necesarios. La idea es de tener un sistema autonomista e inteligente que sepa evolucionar por sí mismo sin que yo tenga que estar detrás de ello."
Traducción operacional:
- Monitorear continuamente. No esperar a que Jorge pregunte.
- Auto-reparar lo trivial. Solo molestar a Jorge con decisiones de alto impacto.
- Aprender de los errores. Si algo falla 3 veces, proponer fix permanente.
- Evolucionar el sistema. Weekly evolve mode busca mejoras proactivas.
- Reportar con criterio. Critical=Telegram inmediato. Warning=digest. Info=silent log.
Modes
| Mode | Frequency | Scope | Tokens |
|---|---|---|---|
heartbeat |
every 15 min | quick health: APIs alive, log freshness, pipeline sanity counters | 1-3K |
deep |
every 1 hour | full component health + ghost detection + auto-repair + drift analysis | 5-12K |
evolve |
weekly Sat 07:00 CT | 7-day pattern recognition, root cause hypotheses, fix proposals, Ops_Insights writes | 15-30K |
incident |
manual / auto-triggered | immediate forensic dive when heartbeat detects red | 5-15K |
What it monitors
Layer 1 — Infrastructure
- Hostinger crons responding (HTTP 200 + recent log entries)
fer_first_contact.php→ expectfc_sms_sententries within last 4h in 9am-7pm CST windowfer_seguimiento.php→ expectseg_sms_sententry within last 25h (daily cron)fer_stale_cron.php/fer_morning_brief.php→ daily
- API keys alive: Airtable, OpenPhone/Quo, Telegram bot, Anthropic
- Deploy workflow health (last successful run < 48h)
- DKIM/DMARC/SPF DNS records valid (monthly drift check in evolve)
Layer 2 — R9 Agents
- Last run of each agent (Mercader, Posicionador, Escriba, Remitente, Cazador, Cartógrafo, Clasificador, Analista, Espía, Auditor)
- Cadence adherence — if agent supposed to run weekly but last run >10d → flag
- Score trajectory — if Mercader score dropped >10 points week-over-week → flag
- Token usage spike (>2x 7-day avg) → cost alert
Layer 3 — Pipeline sanity (Contacts/Leads funnel)
- Ghost contacts: Stage=Contacted step ≥1 with Last contact date >= 2 days stale → auto-reset to To Be Contacted
- Ghost Seguimiento: Stage=Seguimiento, Next follow up date past AND Last contact date >= 5 days → auto-push step
- Bottleneck detection: Stage="New" > 100 AND no change in 7d → notify Jorge (he is the gating hand)
- Revenue loss canary: leads scored Hot by Clasificador but not contacted within 1h → alert
- Fer inbound coverage: every
message.receivedwebhook should have a matchingclaude_ok+sms_sentwithin 2 min. If orphaned → flag
Layer 4 — Data quality
- Duplicate detection (same Phone1 across multiple Contacts)
- Missing required fields on Hot leads (no email, no property address)
- Airtable HTTP 422 errors > 3/day → schema drift, alert
- Email bounce rate > 5% → deliverability issue
Layer 5 — Business metrics
- SMS sent per day baseline (alert if -50% vs 7d avg)
- Contacts entering pipeline per week
- Conversion rate Warm → Hot → Deal
- If Mercader/Posicionador/Cazador scores diverge sharply from trend → investigate
Auto-repair actions (no human needed)
| Condition | Action | Safety rail |
|---|---|---|
| Ghost contacts in "Contacted" stuck 2+ days | Reset to "To Be Contacted" + clear steps + clear dates | Max 25/run, log every reset to Ops_Health.auto_fixes_log |
| Ghost "Seguimiento" without Next date, Last > 5d | Set Next = today → cron will pick up | Max 25/run |
| Cron endpoint stale >4h in-window | HTTP POST to endpoint to force a tick | Max 1 per endpoint per hour (anti-flood) |
| Hostinger 504 during manual probe | Retry with ignore_user_abort pattern in next tick |
— |
| OpenPhone returns 429 rate limit | Exponential backoff, resume next heartbeat | Max 3 consecutive backoffs before alert |
| Airtable 422 on known field | Log + skip record, do NOT loop | — |
| Duplicate Contact detected (same Phone1) | Merge: keep oldest, link newer data as Notes | Requires Notes & Activity table |
Alert tiers (Telegram)
- 🚨 CRITICAL (instant): API key revoked, cron dead for >6h, pipeline frozen, revenue-lost hot lead, security event
- ⚠️ WARN (digest hourly): single agent failed, token spike, score dropped, email bounce climb
- 🟡 NOTICE (daily 8am CT digest): auto-repairs applied, minor drift, weekly pattern previews
- ✅ GREEN (silent): everything OK, only logged to Ops_Health
Evolve mode — el corazón del "sistema autoevolutivo"
Weekly Saturdays 07:00 CT, El Supervisor hace:
- Lee últimos 7 días de
Ops_Health(168 heartbeats + 168 deep checks) - Cuenta top 10 errores recurrentes + top 5 warnings
- Para cada patrón, usa Claude para generar:
- Hipótesis de causa raíz (por qué sucede)
- Fix concreto (file path + diff + test plan)
- Impacto estimado (tiempo ahorrado, $ recuperado, leads salvados)
- Clasifica:
- Fix trivial + seguro → aplica automáticamente + log a Ops_Insights con status=auto-applied
- Fix complejo o riesgoso → escribe a Ops_Insights con status=open + Telegram digest al Jefe
- Genera reporte semanal en Markdown → PDF via make-pdf skill → Telegram a Jorge (domingo 8am)
Ejemplos de fix auto-aplicables:
- Si ghost-reset se ejecutó >10 veces en una semana → el cron está roto, escala via Telegram
- Si
Phone1422 aparece 5 veces → genera fix de data validation en fer_first_contact.php + crea PR automático
Ejemplos de fix escalados:
- "Clasificador está scoring 80% Warm y 0% Hot — el prompt puede estar mal calibrado. Sugiero ajustar thresholds (evidencia adjunta)."
- "Remitente tiene bounce rate 8%. Investigar lista, o agregar double opt-in."
Tablas Airtable
Ops_Health[REDACTED_AIRTABLE_TABLE_ID]— 1 row per run, rolling state of healthOps_Insights[REDACTED_AIRTABLE_TABLE_ID]— accumulative knowledge base de mejoras detectadas
Invoking
node agents/supervisor/supervisor.mjs --tenant pinnacle --mode heartbeat
node agents/supervisor/supervisor.mjs --tenant pinnacle --mode deep
node agents/supervisor/supervisor.mjs --tenant pinnacle --mode evolve
node agents/supervisor/supervisor.mjs --tenant pinnacle --mode heartbeat --dry-run
R9 compliance
- ✅ Always-on (heartbeat every 15 min = most frequent R9 agent)
- ✅ Dedicated domain (meta-ops / system watchdog)
- ✅ Tenant-aware (R8) — cada tenant tiene su propio Supervisor
- ✅ Mobile-first alerts (Telegram formateado para celular)
- ✅ Billing-ready (tokens_used per run separated per mode)
- ✅ Audit trail (every check, every auto-fix, every alert → Ops_Health + Ops_Insights)
- ✅ Self-improving (evolve mode writes fixes, some auto-apply)
Author log
- 2026-04-23: v1 shipped. 10mo agente del plantel R9. Crítico para la visión "sistema autónomo" ordenada por Jorge.