pipeline-improve - SKILL.md Agent Skill

name: pipeline-improve
description: Continuous-improvement audit for the tech-leads pipeline. Reviews the last 24h for issues (n8n silent failures, stuck/mistargeted leads, zombies, opt-out misses, cost drift, throughput gaps), fixes the ROOT CAUSE, encodes a guard so it can't recur, and documents the lesson. Use when the user says "audit the pipeline", "what issues", "improve the process", "fix what you find", or on any periodic review. Embodies the CONTINUOUS IMPROVEMENT mandate in CLAUDE.md.

Cost note: every LLM call routes through the LiteLLM gateway on the cheapest capable model (CLAUDE.md → Cost Discipline). This is an ops/mechanical skill — for a long run, prefer /model Sonnet or Haiku; never burn Opus on rote work.

Pipeline continuous-improvement audit

Goal: leave the pipeline measurably better than you found it, and leave behind a guard + a lesson so the same problem never recurs. Fix root causes, not symptoms. Make the harness smarter rather than doing the work by hand again.

Step 1 — DETECT (read the real state, don't assume)

Run these and look for anomalies:

n8n health (the silent killer): for every tech-leads workflow, check the STATUS of the latest execution, not just whether the workflow is active. GET https://n8n.ai.technijian.com/api/v1/executions?workflowId=<id>&limit=1 (key in keys/te-dc-ai-n8n.md). Any status != success → repeat the request with &includeData=true to diagnose, fix the failing node (auth must be none with an inline Bearer token; options.allowUnauthorizedCerts: true for the self-signed IIS cert), then re-fire the webhook to recover the missed batch.
Sends: today's count by pipeline (git send: commits + webhook logs); per-fire volume; pre-dispatch utilization. A count that stays flat during the window means sends are stuck.
Stuck/mistargeted leads: generic mailboxes (is_generic_mailbox), wrong-fit titles (is_wrong_fit_title), paused leads with fixable reasons, deliverable-but-no-body, leads not on the highest-authority IT contact (find_decision_maker.py).
Processes: phase-A zombies (>20 min), engine stacking, gateway saturation (>20 concurrent LLM jobs).
Opt-outs / replies: tracking/_optout_log.md SKIP/SUPPRESS lines; honor any missed opt-out (including cross-address).
Cost & timezone: python scripts/model_config.py (all calls must be litellm); latest logs/model-review/; confirm time logic uses America/Los_Angeles (DST), never a fixed offset.
Git: confirm the branch is not far ahead of origin (no stuck push, no rebase-merge state left behind).
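The n8n health check in the list above can be sketched as two small helpers: one builds the "latest execution" query, the other decides whether the result needs diagnosis. This is a minimal sketch, not the pipeline's actual code. The function names are hypothetical, and it assumes the response wraps executions in a top-level "data" list whose items carry a "status" field; adjust if your n8n version returns a different shape.

```python
from urllib.parse import urlencode

# Base URL taken from the n8n health step above.
N8N_BASE = "https://n8n.ai.technijian.com/api/v1"


def latest_execution_url(workflow_id: str, include_data: bool = False) -> str:
    """Build the query for the most recent execution of one workflow."""
    params = {"workflowId": workflow_id, "limit": 1}
    if include_data:
        # Only needed when diagnosing a failed execution.
        params["includeData"] = "true"
    return f"{N8N_BASE}/executions?{urlencode(params)}"


def needs_diagnosis(payload: dict) -> bool:
    """Return True when the latest execution is missing or not 'success'.

    Assumption: executions arrive in payload["data"], newest first,
    each with a "status" field.
    """
    executions = payload.get("data") or []
    if not executions:
        return True  # a workflow with no executions is itself an anomaly
    return executions[0].get("status") != "success"
```

Keeping the decision in a pure function means it can be exercised against saved responses without touching the live n8n instance.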

Step 2 — FIX THE ROOT CAUSE

For each issue, change the code/config/guardrail so the class of problem is resolved — not just the one instance. Test the fix (dry-run or a small batch) before trusting it, especially on the send path (the 2026-05-19 dry-run incident lived there).

Step 3 — ENCODE A GUARD

Add a guardrail, validation, idempotency check, or self-healing step so the problem is caught automatically next time. Prefer adding the check to pipeline_watchdog.py (durable, runs every 30 min via n8n) or check_guardrails in the engine.
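As an illustration of what such a guard might look like, the sketch below encodes two thresholds from Step 1 (phase-A jobs older than 20 minutes, more than 20 concurrent LLM jobs) as pure checks. The function names and the shape of the input are hypothetical; how pipeline_watchdog.py actually gathers process data is not shown here.

```python
from datetime import datetime, timedelta

ZOMBIE_AFTER = timedelta(minutes=20)  # phase-A zombie threshold from Step 1
MAX_CONCURRENT_LLM_JOBS = 20          # gateway saturation threshold from Step 1


def find_zombies(started_at: dict[str, datetime], now: datetime) -> list[str]:
    """Return the names of jobs that have been running for more than 20 minutes.

    `started_at` maps a job name to its start time (hypothetical input shape).
    """
    return sorted(
        name for name, start in started_at.items() if now - start > ZOMBIE_AFTER
    )


def gateway_saturated(concurrent_jobs: int) -> bool:
    """Return True when more than 20 LLM jobs are running at once."""
    return concurrent_jobs > MAX_CONCURRENT_LLM_JOBS
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the guard deterministic and easy to test.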

Step 4 — DOCUMENT & TEACH

Append a dated lesson to tracking/lessons-learned.md.
If a durable rule emerged, update CLAUDE.md (and write a memory entry).
New capability → add it to the "Self-healing systems & guardrails" inventory in CLAUDE.md.
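Appending a dated lesson could be scripted along these lines. This is a sketch under assumptions: the entry layout (a dated heading followed by root cause and guard) is invented for illustration, since the actual format of tracking/lessons-learned.md is not specified here, and both function names are hypothetical.

```python
from datetime import date
from pathlib import Path


def format_lesson(day: date, issue: str, root_cause: str, guard: str) -> str:
    """Render one lesson entry. The layout is an assumption, not the file's real format."""
    return (
        f"\n## {day.isoformat()}: {issue}\n"
        f"- Root cause: {root_cause}\n"
        f"- Guard: {guard}\n"
    )


def append_lesson(path: Path, entry: str) -> None:
    """Append the entry to the lessons file, creating the file if it is missing."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(entry)
```

Opening the file in append mode leaves earlier lessons untouched, which matches the "append" instruction above.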

Step 5 — IMPROVE THE EXECUTORS

If a task is still manual, push it into a script + n8n cron, an agent, the watchdog, or a skill so it runs autonomously. Update autonomous_harness.py / the n8n schedule accordingly.

Report

Per issue: detected → root-cause fix → guard added → documented, plus what still needs the user. Be honest about what's verified vs trusted. Never claim "fixed" without evidence.
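One way to keep the verified/trusted distinction honest is to make it a function of recorded evidence. The record below is a hypothetical structure, not an existing part of the pipeline: an item only reports as "verified" when an evidence string (a log line, a dry-run result) is attached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IssueReport:
    issue: str                        # what was detected
    root_cause_fix: str               # what was changed
    guard_added: str                  # the check that catches it next time
    documented_in: str                # where the lesson was written
    evidence: Optional[str] = None    # proof the fix works, if any
    needs_user: Optional[str] = None  # anything still requiring the user

    @property
    def status(self) -> str:
        """'verified' only when evidence is recorded, otherwise 'trusted'."""
        return "verified" if self.evidence else "trusted"

    def line(self) -> str:
        """Render the detected -> fix -> guard -> documented chain as one line."""
        parts = [self.issue, self.root_cause_fix, self.guard_added, self.documented_in]
        text = " -> ".join(parts) + f" [{self.status}]"
        if self.needs_user:
            text += f" (needs user: {self.needs_user})"
        return text
```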

Guardrails on yourself

Shared infra (n8n workflows you didn't create, the server config, mass mailbox deletes, killing scheduled engines) needs explicit user authorization — surface it, don't force it.
Every LLM call goes through the LiteLLM gateway. Every time/schedule is America/Los_Angeles (DST). Never email a generic mailbox or non-buyer title.
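The timezone rule matters because a fixed offset is wrong for about half the year. The sketch below compares the named zone with a hard-coded UTC-8 offset using Python's standard zoneinfo module; the helper name is hypothetical, and the example dates are arbitrary.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")  # DST-aware, as the rule requires
FIXED_PST = timezone(timedelta(hours=-8))  # anti-pattern: ignores DST


def local_hour(utc_moment: datetime, tz) -> int:
    """Convert a UTC moment into the given zone and return the local hour."""
    return utc_moment.astimezone(tz).hour


summer = datetime(2026, 7, 15, 16, 0, tzinfo=timezone.utc)
winter = datetime(2026, 1, 15, 16, 0, tzinfo=timezone.utc)

# In winter the two agree; in summer the fixed offset is one hour behind.
winter_gap = local_hour(winter, PACIFIC) - local_hour(winter, FIXED_PST)
summer_gap = local_hour(summer, PACIFIC) - local_hour(summer, FIXED_PST)
```

A schedule built on the fixed offset would therefore fire an hour off local time during daylight saving time.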