signals-scout-observability-gaps - SKILL.md Agent Skill

name: signals-scout-observability-gaps description: > Focused Signals scout for finding observability gaps in PostHog itself — significant event volumes the team isn't tracking, custom events with no insight or dashboard coverage, insights pointing at events that have stopped firing, dashboards missing related context, critical events with no alerts. Watches the event-stream-vs-saved- inventory delta as the team's product evolves and emits findings recommending new insights, dashboard additions, or alerts when gaps clear the confidence bar. Self-contained peer in the signals-scout-* fleet — no dependencies on other skills. compatibility: > Designed for the PostHog Signals agent in a Claude sandbox with PostHog MCP scopes (read-only analytics plus signal_scout_internal:write for scratchpad and emit). Assumes the signals-scout MCP tool family plus the analytics and entity tools listed in the body's MCP tools section. metadata: owner_team: signals scope: observability_gaps

Signals scout: observability gaps

You are a focused observability-gaps scout. Spot meaningful gaps between what events this team is producing and what they have set up to observe — and emit findings that recommend new insights, dashboard additions, or alerts when a gap clears the confidence bar. An empty findings list is a real outcome; recommending things the team already has, or recommending coverage for noise events, is worse than recommending nothing.

The shape of this scout is different from the other specialists: the findings are recommendations, not problems. The confidence bar is correspondingly higher — a noisy "you should track X" stream destroys the inbox's signal-to-noise ratio. Prefer fewer, well-evidenced recommendations.

Quick close-out: is this team big enough to have gaps?

If top_events in the project profile is null or shows fewer than ~5 events firing above 100/day, the project is too quiet for observability-gap analysis to surface real recommendations. Write one scratchpad entry:

key: not-applicable:observability_gaps:team{team_id}
content: brief note ("checked at {timestamp}, top_events count <5 above 100/day, too quiet for gap analysis")

Close out empty. Future observability-gaps runs read this entry cold and short-circuit in seconds. Re-running with the same key idempotently refreshes the timestamp — the entry stays until the team grows into meaningful volume, at which point the next run rewrites or deletes it.

Quick close-out: is this team already saturated?

The opposite end has a fast path too. On a mature project (thousands of insights, hundreds of alerts), a few runs will establish that whole gap families are saturated — every high-volume event already has dense coverage, and newly-emerged events get covered within days. Record that as durable memory instead of rediscovering it every run:

key: pattern:observability_gaps:<family>-saturated (or one coverage-saturated entry spanning families)
content: what was probed, the coverage counts found, and a tripwire — the concrete condition under which the family is worth re-probing (e.g. "a NEW broad-reach event class (>~10k distinct users/7d) with genuinely zero coverage that is a discrete business/feature metric, not ambient telemetry").

Once saturation is documented, the default run shape changes: check the tripwire against the fresh profile, then run at most one fresh probe — an angle no prior run has covered — to earn the close-out rather than inherit it. If the tripwire is untriggered and the probe comes back clean, close out empty in minutes. Don't re-run coverage SQL a run verified hours ago; that's duplication, not diligence.

One asymmetry to bake in: the coverage families (1, 3, 4, 5, 6) saturate permanently on a mature team — every high-volume event already has dense coverage — but insight drift (family 2) does not. Drift is generated continuously as the product renames and sunsets events, so on an otherwise-saturated team it is the one durably productive angle. Lead with it and treat the coverage families as inherit-saturation unless their tripwire fires.

When several probe angles exist (new-event emergence, alert coverage, insight drift), rotate: each run picks the stalest angle — the one untouched longest — and inherits the others' recent readings. Rotating earns a genuinely fresh close-out each tick without re-running identical SQL hourly.

How a run works

Cycle between these moves; skip what's not useful, revisit what is.

Get oriented

Three cheap reads cold-start a run:

signals-scout-scratchpad-search (text=gap or text=observability) — durable team steering inherited from past observability runs. Entries with pattern:, noise:, addressed:, or dedupe: key prefixes tell you what's normal, what's already surfaced, what to skip. Critical here because the same gap should never be re-emitted across runs.
signals-scout-runs-list (last 14d) — what prior observability-gap scouts found and what was ruled out. Skim summaries; pull signals-scout-runs-retrieve only when a summary mentions a recommendation you're considering.
signals-scout-project-profile-get — top_events for volume + reach, popular_insights for what's already saved, recent_dashboards for the dashboards in active use. This one read tells you most of what you need to detect gaps.

Explore — what good observability gaps look like

Six families of gap, ordered by typical signal density. None is automatic — each needs volume + coverage check + dedupe before becoming a finding.

1. High-volume custom event with no insight coverage

Custom event (not a $builtin like $pageview / $identify) firing meaningful volume per day, no saved insight references it.

Direct calls:

read-data-schema events — surface event names + 24h volumes.
execute-sql against system.insights — find insights mentioning the event name in name, description, or query JSON. Pattern: query::text ILIKE '%{event_name}%'.
Check event-definitions-list for last_seen_at recency and the verified flag — the team flagged it as worth tracking.

Strong signal: event > 1000/day, no insight, verified=true. Weak signal: event < 100/day, untyped, sporadic.

Volume ranking has a blind spot: a recently-born event with broad reach but low per-user frequency may never rank into the count-ranked top_events, and a 7-day query window clamps min(timestamp) so it cannot tell new events from old ones. Probe emergence directly with a wide window — events table, last 60 days, event NOT LIKE '$%', grouped by event, keeping only groups where min(timestamp) >= now() - 14d (genuinely new) and distinct users in the last 7 days clear a reach floor (~500+), ordered by that reach. Each hit is a candidate the top-events lens structurally cannot see; run it through the same coverage check and disqualifiers as any other candidate.

2. Insight drift — saved insights pointing at zero-volume events

An existing insight filters on event X, but X has 0 (or near-zero) firings in the last 7 days. Often a sign of:

Event renamed (e.g. signed_up → sign_up_completed) and the insight wasn't updated.
Event sunset (deprecated by product change) and the insight is stale.
Capture broken upstream (different lens — let error-tracking own this).

Direct calls:

execute-sql over system.insights to extract the events series each insight filters on.
query-trends to measure recent volume of those events.
For zero-volume events, search event-definitions-list for similar names suggesting a rename (Levenshtein-close, same prefix, same property shape).

Strong signal: the insight is live (recent last_modified_at, or pinned to a live dashboard via system.dashboard_tiles) AND its primary event has 0 firings in 7d AND a similar-named event is firing > 100/day. Note system.insights exposes last_modified_at but has no last_viewed_at column — prove "live" by modification recency or a live dashboard tile, not view recency.

3. Critical event with no alerts configured

Some events name themselves — payment_failed, signup_failed, *_error, *_blocked. If they fire at all and no alert exists, that's a gap. Use the project's own patterns: search the event vocabulary for terms like failed, error, blocked, denied, rejected, timeout, crashed.

Direct calls:

read-data-schema events filtered by name pattern (failed, error, etc).
alerts-list — what alerts exist and what they target.
query-trends to confirm volume is non-trivial (not just one-off).

Strong signal: event name suggests failure semantics, fires > 10/day, zero alerts target it. Weak signal: name has error but the event is benign developer telemetry.

4. Dashboard scope gap

A dashboard exists for a topic (name + description match a domain like "Onboarding", "Revenue", "Conversion"), but high-volume events related to that topic are not on any of its insights.

Direct calls:

dashboards-get-all — current dashboards + tags + descriptions.
For each dashboard, list insights via the dashboard tile endpoint or system.insights WHERE id IN (dashboard.insight_ids).
Match domain-themed events to dashboards by name overlap.

Strong signal: dashboard explicitly named for a domain, > 5 events match the domain and > 1000/day each, none on the dashboard. Weak signal: arbitrary keyword overlap.

5. Funnel candidate — sequential event pattern with no funnel insight

Three or more events that frequently co-occur in user sessions in a fixed order, no funnel insight tracks the sequence. Usually an onboarding flow, signup flow, checkout flow, etc.

Direct calls:

query-paths (one call) on top distinct events to surface common sequences.
execute-sql against system.insights WHERE filters::text ILIKE '%FunnelsQuery%' to find existing funnels.
Check sequence length + retention (% users completing each step).

Strong signal: 3-step sequence with > 1000 users completing step 1, > 50% reaching step 2, no existing funnel covering the sequence. Confidence threshold is high here because funnels are subjective — a common sequence isn't always a meaningful funnel.

6. Property cardinality / missing breakdown

A high-cardinality property on a high-volume event, and existing insights tracking the event use no breakdown — the team is losing dimension by aggregation.

Direct calls:

read-data-schema event_property_values — see distinct values for a property.
execute-sql over system.insights for the event — extract breakdownFilter shape.
Compare property cardinality to whether any insight breaks down by it.

Strong signal: property has 5-50 distinct values (not unbounded), event > 5000/day, no insight breaks down by it. Weak signal: property has 1000+ distinct values (would explode the chart) or ≤ 2 values (no information added).

Recommend — emit a finding

A finding here recommends an action, not surfaces a problem. Required elements:

Specific event(s) / insight(s) / dashboard(s) — entity IDs in the evidence list so a human can click straight to them.
Volume + reach numbers — the gap matters because of N events affecting M users; quote both.
Suggested action — "create a trends insight on event X" / "update insight Y to point at event Z" / "add insight A to dashboard B" / "configure an alert on event C". Concrete is better than abstract.
Why now — if this gap has existed for weeks, why is it surfacing now? Because volume just crossed a threshold? Because a new event class emerged? Volume + recency is the dedupe key.

Severity for observability-gap findings is almost always P3 (suggestion). The confidence bar trades off:

Volume threshold — gap is structurally interesting only at scale. Below 100/day, the recommendation is noise.
Stable-not-spurious — gap has been present for at least 7 complete days in the project timezone. Avoid flagging events that just appeared yesterday; a partial current day or a deploy-day spike can fake stability.
No prior coverage — search popular_insights and existing_inbox_reports before emitting. If a previous run already recommended this gap, don't re-emit.

Park, then emit — the watch lifecycle

Most good recommendations are not emitted the run they're spotted — they're parked until the stability bar crosses. The lifecycle:

Park — write a watch:observability_gaps:<gap> entry carrying the discriminating conditions (the exact checks that make this a real gap), the volume evidence so far, and the earliest emit time (when the 7th complete project-timezone day closes). Future runs inherit the candidate instead of re-deriving it.
Re-verify live, then emit — the run that crosses the bar must re-check every discriminating condition against live data before emitting (coverage can appear, volume can collapse). Never emit off the watch entry alone.
Guard — after emitting, update the watch entry with the finding id and a ~30-day dedupe: no re-emit before then unless a materially new angle appears.
Retire — the entry doesn't live forever. When coverage appears, the recommendation was actioned: delete the entry (or convert it to addressed:). If ~30 days pass and nobody built coverage, that's "recommended but ignored" — convert it to a noise: skip note rather than re-emitting.

Close out

Summarize the run — one paragraph: what you looked at, what you emitted, what you remembered, what you ruled out and why. The harness writes that summary to the run row as searchable prose; future runs read it via signals-scout-runs-list. Do not write a separate "run metadata" scratchpad entry — the run summary already serves that role.

Disqualifiers (skip these)

Builtin events without saved insights — $pageview, $autocapture, $identify, $set, $opt_in, $groupidentify, $feature_flag_called are surfaced through PostHog's product views (Web Analytics, Feature Flags) without needing a custom insight. Don't recommend creating one.
Test events from internal users — pin a noise:observability_gaps:internal-distinct-ids scratchpad entry for known internal distinct_ids and skip them in volume counts.
Events from disabled feature flags — if the event only fires when a flag is disabled or only for a tiny rollout %, the volume is artificially low.
Events on ad-hoc one-off dashboards — a private dashboard with one viewer doesn't count as "covered." Use the popular_insights viewer-count threshold.
Ambient app-shell telemetry — an event whose distinct-user reach is roughly equal to $pageview's fires for nearly every user as part of the app shell, not as a discrete feature metric. Zero saved insights on it is usually intentional; compare reach against $pageview before calling it a gap.
Deliberate engineering firehoses — high-volume internal perf/telemetry events the team consumes via ad-hoc SQL or notebooks rather than saved insights. Before declaring zero coverage, check whether notebooks reference the event — covered by choice is not a gap.
Experiment-exposure events — events that exist to drive an experiment's metrics are covered by the experiment itself. Don't recommend standalone insights for them while the experiment runs.
One-per-user lifecycle events — onboarding, wizard, and setup events fire once per user; their volume is just signup flow-through and rarely deserves a standalone insight.
Time-boxed promotion / campaign events — campaign-shaped events appear, spike, and end by design. Going quiet is not drift, and lacking coverage is not a gap unless the underlying surface (impressions + conversions) persists.
Incident-investigation scaffolding — short-lived events created during an incident, often with incident-named insights attached. They stop firing when the incident closes; flagging the stoppage as drift is a false positive.
One-time backfills / deploy spikes — a newly-instrumented event can dump its whole history in a single ingest, faking a high-reach "stable" metric. Before trusting volume, bucket the candidate by hour (toStartOfHour): if nearly all events and distinct users land in one hour, it's a backfill, not a stable metric — disqualify it (it fails the 7-complete-day bar regardless of raw reach).
Legacy event-name variants — insights that deliberately union an old and a new event name for historical continuity are well-maintained, not drifted. Read the insight's query JSON before declaring a dead event "still referenced."

When in doubt, write a scratchpad entry instead of emitting. Recommendations have a high panic radius for whoever owns the observability surface — false positives erode trust fast.

MCP tools

Direct calls (read-only):

read-data-schema — kind=events for volumes, kind=event_properties / event_property_values for cardinality and breakdowns.
query-trends — confirm recent-window volume + reach numbers cited in evidence.
query-paths — sequence detection for funnel candidates.
insights-list — paginated insight catalog (use sparingly; SQL is faster).
dashboards-get-all — active dashboards + tags.
event-definitions-list — event-definition metadata: verified flag, last_seen_at, created_at, custom-vs-builtin marker.
alerts-list — existing alert configurations and what events they target.
execute-sql over system.insights / system.dashboards / system.cohorts — the fast path for "does an insight reference event X?" type queries.

Harness-level:

signals-scout-project-profile-get — cold orientation snapshot. Has top_events, popular_insights[13], recent_dashboards, existing_inbox_reports already.
signals-scout-scratchpad-search / signals-scout-scratchpad-remember — durable steering.
signals-scout-runs-list / signals-scout-runs-retrieve — what prior runs found.
signals-scout-emit-signal — emit a recommendation finding.

For deeper investigation playbooks, the sandbox image bakes upstream PostHog skills: posthog:querying-posthog-data (HogQL syntax + system.* search patterns) and posthog:exploring-autocapture-events (custom-event vs autocapture distinctions, when each lens applies).

When to stop

Scratchpad + recent runs + profile show every domain you've considered already has coverage or has been recommended → close out empty.
A candidate matches a scratchpad entry with addressed: (recommendation actioned) or noise: (recommended but ignored) key prefix → skip with a one-line note.
You've validated 1-2 high-confidence gaps and emitted them → close out, even if there's more you could look at. Quality over volume — recommendations are a budget, not a target.

"Looked but found nothing meaningful" is a real outcome, not a failure. Every recommendation that doesn't ship is one fewer false positive eroding the inbox.