authoring-log-alerts - SKILL.md Agent Skill

name: authoring-log-alerts description: > Author useful, low-noise log alerts on services in a PostHog project. Use when the user asks to set up alerts for their logs, suggest alerts they should add, or evaluate whether a service is worth monitoring. Covers service triage, baseline characterisation, threshold drafting, back-testing via simulate, and shipping with a notification destination.

Authoring log alerts

Authoring an alert is a measurement problem, not a guessing problem. You are not trying to be exhaustive — you are trying to land thresholds that fire 0–3 times per week on real production patterns, on services that matter.

When to use this skill

The user asks to "set up alerts" / "suggest alerts" for their project.
The user wants to evaluate whether a service is producing alertable signal.
The user has just enabled log alerting and wants a starter set.

When not to use this skill

Tuning an alert that already exists — that's a different job (use posthog:logs-alerts-events-list to inspect fire/resolve cadence and posthog:logs-alerts-partial-update to adjust).
Investigating an active incident — pull rows with posthog:query-logs, don't author an alert mid-incident.

Tools

Tool	Job	Where it fits
`posthog:logs-services`	Top-25 services in window with log_count, error_count, error_rate, sparkline.	Step 1 — triage.
`posthog:logs-attributes-list` / `posthog:logs-attribute-values-list`	Discover keys/values for narrower filters.	Step 2, optional.
`posthog:logs-count-ranges`	Adaptive time-bucketed counts for a filter.	Step 3 — baseline.
`posthog:logs-alerts-simulate-create`	Replay a draft config against `-7d` history with full state machine.	Step 4 — validate.
`posthog:logs-alerts-create`	Persist the alert.	Step 5 — ship.
`posthog:logs-alerts-destinations-create`	Wire the alert to Slack or webhook.	Step 5 — ship.

Do not call posthog:query-logs during authoring. You need distributions, not rows. Reserve posthog:query-logs for the very end if the user asks "show me a sample of what would have fired" — limit: 10 is plenty.

Workflow

1. Triage — pick candidate services

Call posthog:logs-services for the last 24h with no filters. The response is capped at 25 services and includes a sparkline, so it is small and bounded.

A service is a candidate when both are true:

log_count is non-trivial (≥ ~1k in 24h — quieter services produce too little signal to alert on).
error_rate is non-zero, or the user has named the service explicitly.

Skip services with high volume but error_rate == 0 unless the user wants a volume-shape alert (e.g. "warn me if api-gateway suddenly stops producing logs"). Volume-floor alerts use threshold_operator: below and need different reasoning — see references/volume-floor-alerts.md.

If the user names a service, treat it as a candidate even without error signal.

2. (Optional) Narrow the filter

If a service has many error sub-types, an alert on "all errors" is usually too broad. Use posthog:logs-attributes-list (try attribute_type: log) and posthog:logs-attribute-values-list to find a discriminator — common ones are http.status_code, error.type, k8s.container.name. Add the narrowing filter to your draft.

Keep it simple: one severity filter + one or two attribute filters is plenty. Multi-clause filters are harder to reason about and rarely improve precision.

3. Baseline — characterise the candidate over 7 days

Call posthog:logs-count-ranges with the candidate's filters, dateRange: { date_from: "-7d" }, and targetBuckets: 24 (one bucket ≈ 7h). The response gives you bucket counts.

Do not eyeball the percentiles or scale the threshold to the alert window manually. Pipe the count-ranges response into the helper script:

echo '<count-ranges JSON>' | python3 scripts/baseline_stats.py --window-minutes 5

The script returns:

{
  "n_buckets": 12,
  "bucket_minutes": 420.0,
  "alert_window_minutes": 5,
  "stats": { "p50": 12.0, "p95": 71.25, "p99": 126.25, "max": 140 },
  "suggested_threshold_count": 5,
  "rationale": "max(p99=126.25, median*3=36.0, floor=5) scaled from 420m bucket to 5m window",
  "health": []
}

Use suggested_threshold_count as your starting threshold. Read health:

`health` flag	What it means	What to do
`sparse:N_of_M_buckets`	Too few non-empty buckets for a 7d baseline.	Widen filter, extend to `-30d`, or skip.
`empty`	All buckets are zero.	Skip — no signal.
`spiky`	`max` is 10×+ `p95`.	Count-threshold alerts work well. Proceed.
`flat`	`p95` ≈ `p50`.	Be cautious — either no incidents in lookback, or the metric is too smooth. Try a longer lookback or skip.
`[]` (empty)	Healthy distribution.	Proceed.

4. Draft and simulate

Pick a starter draft from these defaults — see references/threshold-defaults.md for the reasoning:

Setting	Default	Notes
`threshold_count`	`suggested_threshold_count` from the script	Already scaled to the alert window.
`threshold_operator`	`above`	Use `below` only for volume-floor alerts.
`window_minutes`	`5`	Allowed: 5, 10, 15, 30, 60. Must match what you passed to the script.
`evaluation_periods`	`3`	M in N-of-M.
`datapoints_to_alarm`	`2`	N in N-of-M. 2-of-3 reduces flap from a single noisy bucket.
`cooldown_minutes`	`30`	Minimum time between repeat fires.

Call posthog:logs-alerts-simulate-create with these settings and date_from: "-7d". The response gives you fire_count and resolve_count.

5. Iterate — three rounds, then ship or skip

Target: fire_count between 0 and ~3 over -7d. If outside the band:

Outcome	Adjustment
`fire_count` = 0 over 7d and the baseline was spiky	Lower `threshold_count` toward `stats.p95` from the script, or drop to 1-of-2.
`fire_count` = 0 and the baseline was flat	The service has no alertable signal. Skip it; log why.
`fire_count` > 5	Raise `threshold_count` toward `stats.max` from the script, or move to 3-of-5 for a smoother window.
`fire_count` is fine but resolve_count never matches fire_count	Cooldown is too long, or the underlying state is genuinely sticky. Acceptable for now.

When adjusting the threshold, read values from the script's stats block — never recompute percentiles by hand.

Cap iteration at 3 simulate calls per candidate. If you can't land in the band in 3 rounds, the metric is wrong — either the filter is too broad, the window is wrong, or the service genuinely doesn't have a threshold-shape signal. Note it and move on.

6. Ship — create + attach destination

Once a draft simulates cleanly:

Call posthog:logs-alerts-create with the validated config. Use a name like <service> error rate (auto) so the user can see at a glance which alerts came from this skill.
Call posthog:logs-alerts-destinations-create to wire it to a notification target. An alert with no destination is silent. Always confirm the channel name or webhook URL with the user before attaching — never wire an auto-generated alert to a production channel without explicit confirmation. If the user is unsure, suggest a low-traffic testing channel for the first few alerts.

If the user wants alerts created in enabled: false state for review-then-flip, pass enabled: false to -create and tell them how many drafts you produced.

Filter shape — required

The filters field on posthog:logs-alerts-create takes a subset of LogsViewerFilters and must contain at least one of:

severityLevels — list of ["trace","debug","info","warn","error","fatal"]
serviceNames — list of service name strings
filterGroup — property filter group

The same shape goes into posthog:logs-alerts-simulate-create's filters field. Match the simulate filters to the alert filters exactly — otherwise the simulation is testing a different alert than the one you ship.

Example minimum:

{
  "severityLevels": ["error", "fatal"],
  "serviceNames": ["api-gateway"]
}

Token-economy rules

One posthog:logs-services call at the start, not per-candidate.
One posthog:logs-count-ranges call per candidate at targetBuckets: 24. Don't go above 30 during authoring.
≤ 3 posthog:logs-alerts-simulate-create calls per candidate.
Zero posthog:query-logs calls during the authoring loop.
Prefer reporting a small set of well-validated alerts over a long list of unvalidated drafts.

Output

Report what you did, in this shape:

For each shipped alert: name, filters, threshold, simulated fire_count over 7d, destination.
For each skipped candidate: service name + why (flat baseline, can't land threshold, low volume).
Total simulate calls made, total alerts created.

The user should be able to read this and decide whether to disable any drafts before they go live.