ssmd-dq-checks

star 1

Catalog of ssmd DQ checks — what each measures, pass/fail criteria, common failure modes and fixes, fanout rules, and accountability SLOs. Use when investigating DQ failures, understanding check behavior, or adding new checks.

aaronwald By aaronwald schedule Updated 2/18/2026

name: ssmd-dq-checks description: Catalog of ssmd DQ checks — what each measures, pass/fail criteria, common failure modes and fixes, fanout rules, and accountability SLOs. Use when investigating DQ failures, understanding check behavior, or adding new checks.

ssmd-dq-checks

Reference catalog for all 13 DQ checks in data/dq.py.

Check Catalog

Check 1: data_completeness

What: Verifies parquet files exist and the archiver manifest is present.

Status Criteria
pass Parquet files found AND manifest present with files listed
warn Parquet files found but no manifest (or manifest has no files)
fail No parquet files found

Common failures: Archiver not running, GCS permissions, wrong date.

Check 2: record_counts

What: Reports row counts per message type from parquet files. Uses archiver manifest records_by_type or falls back to parquet-gen manifest records_written.

Status Criteria
pass At least one parquet file has rows
fail All parquet files are empty (0 rows)

Note: Polymarket archiver manifest has files: [] — the check falls back to pq_manifest for type breakdown.

Check 3: schema

What: Validates parquet column names match expected schema per feed/type.

Status Criteria
pass All files match expected schema
warn Extra columns present (forward-compatible)
fail Missing required columns

Check 4: null_rates

What: Checks null percentage for required (non-nullable) columns.

Status Criteria
pass All required columns have < 1% nulls
warn 1-5% nulls in required columns
fail > 5% nulls in required columns

Check 5: duplicates

What: Checks for duplicate rows by primary key (feed-specific).

Status Criteria
pass No duplicates found
warn < 1% duplicate rate
fail >= 1% duplicate rate

Primary keys: Kalshi trade: trade_id. Kalshi ticker: (market_ticker, ts, _nats_seq).

Check 6: nats_continuity

What: Checks for gaps in _nats_seq (NATS JetStream sequence numbers).

Status Criteria
pass No gaps in _nats_seq
warn Gaps present but < 1% of total
fail Gaps >= 1% of total

Common failures: NATS consumer restart, archiver reconnection.

Check 7: ts_continuity

What: Checks for time gaps in exchange timestamps. Divides the day into 15-minute slots and checks for empty slots.

Status Criteria
pass < 5% of 15-min slots empty
warn 5-15% empty
fail > 15% empty

Check 8: per_ticker_stats

What: Per-ticker row counts and time coverage. Reports top/bottom tickers by volume.

Status Criteria
pass Always passes (informational)

Check 9: parquet_vs_jsonl (accountability)

What: Verifies every JSONL line is accounted for in parquet. The zero-tolerance accountability check.

Sources (in priority order):

  1. parquet-gen manifest (parquet-manifest.json) — preferred, scoped to what was actually processed
  2. Archiver manifest (manifest.json) — fallback for pre-v2.0.0 data

Comparison logic:

  • Data types: parquet_rows == jsonl_lines (exact match), or parquet_rows >= jsonl_lines for fanout types
  • Control types: counted and reported, NOT converted to parquet
  • Unknown types: any unrecognized type = FAIL (needs new schema)
Status Criteria
pass All data types accounted for
warn Minor discrepancies or coverage < 100%
fail Missing data or unknown types

Pipeline stats surfaced: lines_json_error, lines_type_unknown, lines_no_schema, parse_batch_dropped

u64 underflow guard: Values > 2^63 in parse_batch_dropped are treated as 0 (known Rust underflow from fanout).

Check 10: exchange_seq_gaps

What: Gaps in exchange-provided sequence numbers per file.

Grouping:

  • Kalshi: group by _shard_id (v1.3.0+), falls back to sid (v1.2.0), then ticker_col
  • Kraken Futures: group by product_id
Status Criteria
pass No gaps in exchange sequences
warn Gaps present but coverage > 95%
fail Coverage < 95%

Common failures: Kalshi pre-1.3.0 data groups by sid which collides across shards — expect false gaps on old data. Post-1.3.0 groups by _shard_id for accurate per-shard analysis.

Check 11: data_coverage

What: Percentage of 15-minute time slots (96/day) that have exchange-timestamped data.

Status Criteria
pass >= 95% of slots have data
warn 80-95%
fail < 80%

Feed-specific timestamp columns: Kalshi ts, Kraken time, Polymarket timestamp_ms.

Note: Polymarket coverage may be lower due to activity concentrated in US hours.

Check 12: connection_uptime

What: WebSocket connection uptime from Cloud Monitoring websocket_connected gauge.

Status Criteria
pass >= 99% uptime (min across all shards)
warn 95-99%
fail < 95%

Known issue: Ghost shards from before connector restarts have low data point counts, dragging down the min. The check should consider only the latest shard generation.

Requires: google-cloud-monitoring package, Monitoring Viewer IAM role on the SA.

Check 13: schema_version

What: Verifies parquet files report their schema version via metadata.

Status Criteria
pass Schema version present and recognized
warn Schema version missing (pre-versioning data)
fail Unknown schema version

Fanout Rules

Some message types produce multiple parquet rows per JSONL line (1:N fan-out):

Feed Type Reason
kraken ticker Batch array: one WS message contains multiple symbol updates
kraken trade Batch array: one WS message contains multiple trades
polymarket price_change Array-wrapped messages may contain multiple updates

For fanout types: parquet_rows >= jsonl_lines (not exact match). For non-fanout types: parquet_rows == jsonl_lines (exact match).

Per-Feed Expected Message Types

Feed Data Types (have parquet schemas) Control Types (skipped)
kalshi ticker, trade, market_lifecycle_v2 subscribed, ok, error
kraken ticker, trade heartbeat, status
kraken-futures ticker, trade control
polymarket book, last_trade_price, price_change, best_bid_ask new_market, market_resolved

Trade Message Schemas

Kalshi NATS message (post-connector injection)

{
  "type": "trade",
  "sid": 2,
  "seq": 5724,
  "msg": {
    "trade_id": "uuid",
    "market_ticker": "KXBTCD-26FEB0317-T76999.99",
    "yes_price": 17,
    "count": 130,
    "taker_side": "no",
    "ts": 1770153448
  },
  "_shard_id": 3
}

Kalshi REST API trade

{
  "trade_id": "uuid",
  "ticker": "KXBTCD-26FEB0317-T76999.99",
  "yes_price": 17,
  "count": 130,
  "taker_side": "no",
  "created_time": "2026-02-03T21:17:28.18002Z"
}

Adding a New Check

  1. Add def check_<name>(con, base, ...) -> dict to data/dq.py
  2. Return dict with at minimum: {"check": "<name>", "status": "pass|warn|fail|skip", "detail": "..."}
  3. Add to check_fns list in DQRunner.run()
  4. Update this skill with the check's criteria and common failures
  5. Deploy: tag dq-v*, update dq-daily.yaml manifest (see ssmd-deploy skill)

GCS Data Layout

gs://ssmd-data/{prefix}/{feed}/{stream}/{date}/
  manifest.json           # archiver manifest (records_by_type, files list)
  parquet-manifest.json   # parquet-gen manifest (parse_batch_input, records_written)
  ticker_000000.parquet   # parquet files per type
  trade_000000.parquet
  ...
Install via CLI
npx skills add https://github.com/aaronwald/dlawskillz --skill ssmd-dq-checks
Repository Details
star Stars 1
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator