name: pipeline-improve
description: Continuous-improvement audit for the tech-leads pipeline. Reviews the last 24h for issues (n8n silent failures, stuck/mistargeted leads, zombies, opt-out misses, cost drift, throughput gaps), fixes the ROOT CAUSE, encodes a guard so it can't recur, and documents the lesson. Use when the user says "audit the pipeline", "what issues", "improve the process", "fix what you find", or on any periodic review. Embodies the CONTINUOUS IMPROVEMENT mandate in CLAUDE.md.
Cost note: every LLM call routes through the LiteLLM gateway on the cheapest capable model (CLAUDE.md → Cost Discipline). This is an ops/mechanical skill, so for a long run prefer `/model` Sonnet or Haiku; never burn Opus on rote work.
Pipeline continuous-improvement audit
Goal: leave the pipeline measurably better than you found it, and leave behind a guard + a lesson so the same problem never recurs. Fix root causes, not symptoms. Make the harness smarter rather than doing the work by hand again.
Step 1 — DETECT (read the real state, don't assume)
Run these and look for anomalies:
- n8n health (the silent killer): for every tech-leads workflow, check the latest execution STATUS, not just `active`. `GET https://n8n.ai.technijian.com/api/v1/executions?workflowId=<id>&limit=1` (key in `keys/te-dc-ai-n8n.md`). Any `status != success` → diagnose (`&includeData=true`) and fix the node (auth must be `none` + inline Bearer; `options.allowUnauthorizedCerts: true` for the self-signed IIS cert), then re-fire the webhook to recover the missed batch.
- Sends: today's count by pipeline (git `send:` commits + webhook logs); per-fire volume; pre-dispatch utilization. Flat during the window = stuck.
- Stuck/mistargeted leads: generic mailboxes (`is_generic_mailbox`), wrong-fit titles (`is_wrong_fit_title`), paused leads with fixable reasons, deliverable-but-no-body, leads not on the highest-authority IT contact (`find_decision_maker.py`).
- Processes: phase-A zombies (>20 min), engine stacking, gateway saturation (>20 concurrent LLM jobs).
- Opt-outs / replies: `tracking/_optout_log.md` SKIP/SUPPRESS lines; honor any missed opt-out (including cross-address).
- Cost & timezone: `python scripts/model_config.py` (all calls must be litellm); latest `logs/model-review/`; confirm time logic uses `America/Los_Angeles` (DST), never a fixed offset.
- Git: branch not far ahead of origin (push not stuck / no rebase-merge left behind).
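The n8n health check in the list above can be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline's actual code: the helper names are hypothetical, the `X-N8N-API-KEY` header is the one the n8n public API uses for authentication, and the response shape (`{"data": [{"status": ...}]}`) should be confirmed against the live instance before relying on it.

```python
import json
import urllib.parse
import urllib.request

N8N_BASE = "https://n8n.ai.technijian.com/api/v1"  # base URL taken from the step above


def latest_execution_url(workflow_id: str, include_data: bool = False) -> str:
    """Build the 'latest execution' URL for one workflow."""
    params = {"workflowId": workflow_id, "limit": 1}
    if include_data:
        params["includeData"] = "true"  # only needed when diagnosing a failure
    return f"{N8N_BASE}/executions?{urllib.parse.urlencode(params)}"


def needs_attention(payload: dict) -> bool:
    """True when the latest execution is missing or did not succeed.

    Assumption: the response looks like {"data": [{"status": "..."}]}.
    """
    executions = payload.get("data") or []
    if not executions:
        return True  # design choice: a workflow with no executions is flagged too
    return executions[0].get("status") != "success"


def fetch_latest_execution(workflow_id: str, api_key: str) -> dict:
    """Network call; api_key is read by the caller from the key file."""
    request = urllib.request.Request(
        latest_execution_url(workflow_id),
        headers={"X-N8N-API-KEY": api_key, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping `needs_attention` separate from the network call means the status logic can be checked offline with canned payloads, which fits the "active is not the same as healthy" point above.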
Step 2 — FIX THE ROOT CAUSE
For each issue, change the code/config/guardrail so the class of problem is resolved — not just the one instance. Test the fix (dry-run or a small batch) before trusting it, especially on the send path (the 2026-05-19 dry-run incident lived there).
Step 3 — ENCODE A GUARD
Add a guardrail / validation / idempotency check / self-healing step so it's caught automatically next time. Prefer adding the check to pipeline_watchdog.py (durable, runs every 30 min via n8n) or check_guardrails in the engine.
Step 4 — DOCUMENT & TEACH
- Append a dated lesson to `tracking/lessons-learned.md`.
- If a durable rule emerged, update `CLAUDE.md` (and write a memory entry).
- New capability → add it to the "Self-healing systems & guardrails" inventory in CLAUDE.md.
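Appending a dated lesson could look like the sketch below. The entry layout (date, title, root cause, guard) and the function names are assumptions for illustration; match whatever format `tracking/lessons-learned.md` already uses.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")  # date the entry in pipeline-local time


def format_lesson(title: str, root_cause: str, guard: str,
                  now: Optional[datetime] = None) -> str:
    """Render one lesson entry (hypothetical layout)."""
    now = now or datetime.now(PACIFIC)
    return (
        f"\n{now:%Y-%m-%d}: {title}\n"
        f"- Root cause: {root_cause}\n"
        f"- Guard: {guard}\n"
    )


def append_lesson(path: Path, title: str, root_cause: str, guard: str) -> None:
    """Append, never rewrite, so earlier lessons are preserved."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(format_lesson(title, root_cause, guard))
```

Opening the file in append mode keeps the log strictly additive, so a failed or repeated run cannot clobber earlier entries.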
Step 5 — IMPROVE THE EXECUTORS
If a task is still manual, push it into a script + n8n cron, an agent, the watchdog, or a skill so it runs autonomously. Update autonomous_harness.py / the n8n schedule accordingly.
Report
Per issue: detected → root-cause fix → guard added → documented, plus what still needs the user. Be honest about what's verified vs trusted. Never claim "fixed" without evidence.
Guardrails on yourself
- Shared infra (n8n workflows you didn't create, the server config, mass mailbox deletes, killing scheduled engines) needs explicit user authorization — surface it, don't force it.
- Every LLM call goes through the LiteLLM gateway. Every time/schedule is `America/Los_Angeles` (DST). Never email a generic mailbox or non-buyer title.
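Why the timezone rule insists on `America/Los_Angeles` instead of a fixed offset can be shown with a short sketch. The helper name and sample dates are illustrative only. It assumes the IANA timezone database is available to Python's `zoneinfo` (on Windows this usually means installing the `tzdata` package).

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")   # DST-aware zone
FIXED_PST = timezone(timedelta(hours=-8))   # anti-pattern: fixed offset


def pacific_hour(utc_moment: datetime) -> int:
    """Local Pacific hour for a timezone-aware UTC moment."""
    return utc_moment.astimezone(PACIFIC).hour


# Same UTC clock time, one date in standard time and one in daylight time.
winter = datetime(2026, 1, 15, 17, 0, tzinfo=timezone.utc)
summer = datetime(2026, 7, 15, 17, 0, tzinfo=timezone.utc)

print(pacific_hour(winter))                  # 9  (PST, UTC-8)
print(pacific_hour(summer))                  # 10 (PDT, UTC-7)
print(summer.astimezone(FIXED_PST).hour)     # 9  (an hour off in summer)
```

With a fixed `-08:00` offset, every send window and schedule drifts by an hour for the part of the year when daylight time is in effect.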