name: diagnose-cdc description: "Diagnose PeerDB Postgres-to-ClickHouse CDC health (replication slot status, mirror row counts vs Postgres source, and dbt read-model freshness) for the three Dofek flows: dofek_metric_stream_analytics, dofek_fitness_raw_analytics, dofek_provider_inventory_raw_analytics. Use when the web shows partial/empty data for recent activities, when read models look stale, or after a large migration."
Diagnose CDC
Use this skill when activity pages, dashboards, or any ClickHouse-backed view show empty or stale data for rows that exist in Postgres. The most common cause is a broken PeerDB CDC flow, not the read-model SQL.
Signals that point here
- A specific activity page (e.g.
/activity/<id>) loadsactivity.byIdbutactivity.stream,activity.hrZones,activity.powerZonesreturn empty / all-null aggregates. - "No heart rate zone data" / missing Performance / missing Route Map for a recent activity that has data on the provider side (Strava, Garmin, WHOOP).
- Dashboards that read from
analytics.v_activity,analytics.activity_summary,analytics.deduped_sensor,analytics.deduped_locationshow stale or missing rows. - The deploy that introduced the regression involved a large migration / backfill that churned WAL.
1) Check Postgres replication slot health
Lost slots are the most common failure mode. WAL files are gone, so the slot can never recover on its own.
ssh dofek-server 'docker exec $(docker ps --format "{{.Names}}" | grep -E "dofek[_-]db" | head -1) bash -lc "psql -U \$POSTGRES_USER -d \$POSTGRES_DB -c \"SELECT slot_name, active, wal_status, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots ORDER BY slot_name;\""'
Healthy looks like active = t, wal_status = 'reserved', non-null restart_lsn. Broken looks like active = f, wal_status IN ('lost', 'unreserved'), empty restart_lsn.
For repeat slot-loss incidents, also capture the WAL budget and lag:
ssh dofek-server 'docker exec $(docker ps --format "{{.Names}}" | grep -E "^dofek_db" | head -1) bash -lc "psql -U \$POSTGRES_USER -d \$POSTGRES_DB -P pager=off -c \"SHOW max_slot_wal_keep_size; SELECT slot_name, active, wal_status, safe_wal_size, pg_size_pretty(safe_wal_size) AS safe_wal_size_pretty, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS retained_lag FROM pg_replication_slots ORDER BY slot_name;\""'
If retained_lag exceeds max_slot_wal_keep_size, the lost slot is explained by a PeerDB outage or write burst outlasting the configured WAL budget. Treat that as a recurring durability problem, not just a mirror-resync task.
The three Dofek slots:
peerflow_slot_dofek_metric_stream_analytics— feedspostgres_fitness.metric_stream.peerflow_slot_dofek_fitness_raw_analytics— feedspostgres_fitness.activity, plus the rest offitness.*mirror tables.peerflow_slot_dofek_provider_inventory_raw_analytics— feedspostgres_fitness.provider_inventory_*mirrors.
2) Check PeerDB worker logs for the actual error
ssh dofek-server 'docker logs $(docker ps --format "{{.Names}}" | grep peerdb-flow-worker | head -1) --since 24h 2>&1 | grep -iE "error|SQLSTATE" | tail -30'
Look for SQLSTATE 55000 ("can no longer access replication slot") — confirms the slot is lost. Other PeerDB-specific errors (peer connection failures, ClickHouse insert errors, schema drift) point to different remediation paths.
3) Compare row counts: Postgres source vs ClickHouse mirror
For each suspected mirror table:
# Postgres (source of truth)
ssh dofek-server 'docker exec $(docker ps --format "{{.Names}}" | grep -E "dofek[_-]db" | head -1) bash -lc "psql -U \$POSTGRES_USER -d \$POSTGRES_DB -c \"SELECT count(*), max(started_at) FROM fitness.activity;\""'
# ClickHouse mirror
ssh dofek-server "docker exec \$(docker ps --format '{{.Names}}' | grep -E 'dofek_clickhouse' | head -1) clickhouse-client -q \"SELECT count(), toString(max(_peerdb_synced_at)) FROM postgres_fitness.activity FINAL WHERE _peerdb_is_deleted = 0\""
If max(_peerdb_synced_at) is hours/days behind max(started_at) from Postgres, CDC is stalled or stopped for that flow. Run the same comparison for fitness.metric_stream ↔ postgres_fitness.metric_stream and any other affected mirror.
4) Check whether ClickHouse read models are catching up
Some read models are normal ClickHouse views, while expensive derived tables are
incremental dbt models populated by the analytics-worker service. First check
the worker logs:
ssh dofek-server 'docker service logs --tail 100 dofek_analytics-worker'
Then inspect the dbt-owned target tables directly:
-- ClickHouse
SELECT max(refreshed_at) FROM analytics.deduped_sensor;
SELECT max(refreshed_at) FROM analytics.resting_heart_rate_sleep_window;
If these timestamps are current but rows are missing, the upstream mirror or raw
source data is the problem. If timestamps are stale, debug analytics-worker
and run dbt build --project-dir analytics --profiles-dir analytics --select sensor_scalar_sample deduped_sensor resting_heart_rate_sleep_window from a
configured app container.
The relevant read models:
analytics.v_activity— readspostgres_fitness.activity,provider_priority,device_priority.analytics.v_activity_members— readsanalytics.v_activity.analytics.sensor_scalar_sample— dbt microbatch staging table over scalarpostgres_fitness.metric_stream, event-timed byrecorded_at.analytics.deduped_sensor— dbt microbatch table overanalytics.sensor_scalar_sample, event-timed byrecorded_at.analytics.deduped_location— same, location channel only.analytics.activity_summary— aggregates the above peractivity_id.analytics.resting_heart_rate_sleep_window— dbt incremental table for resting heart rate.
Because every read model INNER JOINs v_activity_members, a missing row in postgres_fitness.activity cascades to empty results in all of them, even if postgres_fitness.metric_stream has the samples.
5) Decide between slot recreate vs full mirror resync
wal_status = 'lost': no recovery option from Postgres side. The slot must be dropped and recreated. Whether that requires a full ClickHouse mirror resnapshot depends on data volume and how stale the mirror is.wal_status = 'unreserved'(slot still has retention room but is inactive): the slot may recover on its own once PeerDB reconnects — try restartingdofek_peerdb-flow-workerfirst.- PeerDB peer errors (network, auth, schema drift): no slot recreate needed; fix the underlying connectivity / schema issue and restart the flow.
6) Remediation (lost slot)
This is destructive on the PeerDB side — confirm with the user before running.
For each broken flow:
Drop the broken slot in Postgres:
SELECT pg_drop_replication_slot('peerflow_slot_dofek_<flow_name>');In the PeerDB UI / API, "Resync" the corresponding mirror. This will:
- Create a fresh logical replication slot.
- Take an initial snapshot of the source tables into the ClickHouse mirror.
- Resume CDC from the new slot's start LSN.
For the
metric_streamflow specifically, prefer the bounded catch-up script before a fullmetric_streamresnapshot when the missing range is known:pnpm catch-up:metric-stream -- --start <utc-start> --end <utc-end> --executeThis direct-inserts non-IMU
fitness.metric_streamrows intopostgres_fitness.metric_streamfor an explicit half-open window and skips ids already present in ClickHouse. It repairs recent analytics input but does not recreate the lost PeerDB slot; the mirror still needs a fresh slot.Verify the mirrors fill in (
count(*)matches Postgres within tolerance) and thatanalytics.v_activity/analytics.deduped_sensorrepopulate within ~1 refresh interval.
7) Prevent the next occurrence
- Production currently caps each slot at
max_slot_wal_keep_size=64GB. Investigate before increasing it again; unlimited (-1) requires reliable disk monitoring and slot-lag alerting. - Keep the
cdc-healthservice running sopg_replication_slots.wal_status IN ('lost', 'unreserved'), inactive slots, stale mirrors, and high retained-WAL thresholds are continuously surfaced to logs/Sentry. - Add a heartbeat that compares Postgres
fitness.activityrow count topostgres_fitness.activity FINALrow count and alarms if they diverge. - Record the incident in
docs/production-incident-baseline.mdwith timestamps, evidence, root cause, and follow-up actions.
If this is not the first lost-slot incident, do not stop at recreating the mirror. Record the observed max_slot_wal_keep_size, slot lag, and the first PeerDB fatal line, then propose one of:
- Increase the bounded WAL budget if the host disk budget can absorb it.
- Set
max_slot_wal_keep_size=-1only if disk monitoring and slot-lag alerting are already reliable enough to prevent filling the data volume. - Replace or supplement PeerDB for high-volume
metric_streamwith a bounded direct backfill/catch-up path that can restore recent rows without relying on an old logical slot. - Add a scheduled CDC health check/alert using
scripts/check-clickhouse-cdc.tsor an equivalent monitor, because dashboard correctness depends on this mirror freshness.