ssmd-dq-run

How to run ssmd DQ checks locally and in-cluster, interpret scores, trigger email reports, and verify results. Use when running data quality checks, re-sending DQ emails, or verifying pipeline health after deployments or backfills.

By aaronwald. Updated 2/18/2026.

Procedures for running ssmd Data Quality checks and interpreting results.

Source Files

| File | Purpose |
| --- | --- |
| data/dq.py | DQRunner engine: 13 checks, scoring, CLI |
| data/dq_email.py | Email report wrapper: runs all feeds, HTML output |
| data/Dockerfile | DQ image: python:3.12-slim + duckdb + gcloud monitoring |

Running DQ Locally

GCS access requires application-default credentials. Run gcloud auth application-default login once before the commands below.

# Single feed
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto

# With verbose progress
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto --verbose

# JSON output (for programmatic use)
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto --json

# Non-default prefix (--prefix defaults to kalshi, so other feeds pass it explicitly)
uv run data/dq.py --date 2026-02-17 --feed kraken-futures --stream futures --prefix kraken-futures
uv run data/dq.py --date 2026-02-17 --feed polymarket --stream markets --prefix polymarket

All Three Feeds

For a full pipeline verification, run all three feeds. The runs are independent, so they can go in parallel (for example, one per terminal):

uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto
uv run data/dq.py --date 2026-02-17 --feed kraken-futures --stream futures --prefix kraken-futures
uv run data/dq.py --date 2026-02-17 --feed polymarket --stream markets --prefix polymarket

Feed Parameters

| Feed | --feed | --stream | --prefix |
| --- | --- | --- | --- |
| Kalshi | kalshi | crypto | (default: kalshi) |
| Kraken Futures | kraken-futures | futures | kraken-futures |
| Polymarket | polymarket | markets | polymarket |
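
The table maps directly onto a small driver script. The following is a minimal sketch, not part of the ssmd repo: `FEEDS`, `build_cmd`, and `run_all` are hypothetical names, and it assumes `uv` is on the PATH and the script is started from the repo root. It launches the three documented commands concurrently and collects each exit code.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# (feed, stream, prefix) rows copied from the Feed Parameters table.
# prefix=None means "omit --prefix and rely on the default".
FEEDS = [
    ("kalshi", "crypto", None),
    ("kraken-futures", "futures", "kraken-futures"),
    ("polymarket", "markets", "polymarket"),
]


def build_cmd(date, feed, stream, prefix=None):
    """Return the argv list for one dq.py run, using only the flags documented here."""
    cmd = ["uv", "run", "data/dq.py", "--date", date, "--feed", feed, "--stream", stream]
    if prefix is not None:
        cmd += ["--prefix", prefix]
    return cmd


def run_all(date):
    """Run every feed concurrently and return {feed: exit code}."""
    cmds = {feed: build_cmd(date, feed, stream, prefix) for feed, stream, prefix in FEEDS}
    with ThreadPoolExecutor(max_workers=len(cmds)) as pool:
        futures = {feed: pool.submit(subprocess.run, cmd) for feed, cmd in cmds.items()}
        return {feed: fut.result().returncode for feed, fut in futures.items()}


# Only runs when a date is passed, e.g.: python run_all_feeds.py 2026-02-17
if __name__ == "__main__" and len(sys.argv) > 1:
    codes = run_all(sys.argv[1])
    for feed, code in codes.items():
        print(f"{feed}: exit {code}")
    # Fail this script if any feed returned a nonzero exit code.
    sys.exit(1 if any(codes.values()) else 0)
```

Because the three processes share one terminal, their output will interleave. Adding the documented `--json` flag and capturing each process's stdout separately would be one way to keep the reports apart.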

Running DQ In-Cluster

The DQ CronJob runs at 03:30 UTC daily (after parquet-gen at 02:00 UTC).

Manifest: clusters/gke-prod/apps/ssmd/cronjobs/dq-daily.yaml

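Trigger a manual DQ email run

In the job names below, MMDD is a placeholder for a month-day suffix (for example 0218) that keeps each manual job name unique.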

kubectl create job --from=cronjob/ssmd-dq-daily ssmd-dq-manual-MMDD -n ssmd

Watch progress

kubectl logs -n ssmd job/ssmd-dq-manual-MMDD -f

Re-run for a specific date

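The CronJob checks yesterday's data by default. To run for a different date, render the Job manifest without creating it, inject --date, and apply the result: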

kubectl create job --from=cronjob/ssmd-dq-daily ssmd-dq-rerun-MMDD -n ssmd --dry-run=client -o yaml | \
  sed 's|dq_email.py|dq_email.py --date 2026-02-17|' | \
  kubectl apply -f -
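
The sed substitution edits raw text, so its effect depends on how the manifest spells the command. As an alternative, here is a hedged sketch that edits the Job as structured data. The manifest is not reproduced in this document, so the sketch assumes the container uses an exec-form `command` or `args` list in which one element contains `dq_email.py`. If the manifest wraps the call in a shell string, this approach does not apply. `add_date_arg` and the script name `add_date.py` are hypothetical, not part of ssmd.

```python
import json
import sys


def add_date_arg(job, date):
    """Append `--date <date>` to the argv list that invokes dq_email.py.

    Assumes exec-form argv (a list of strings) under `command` or `args`.
    Mutates and returns the Job dict.
    """
    for container in job["spec"]["template"]["spec"]["containers"]:
        for key in ("command", "args"):
            argv = container.get(key) or []
            if any("dq_email.py" in part for part in argv):
                container[key] = argv + ["--date", date]
                break  # patch each container at most once
    return job


# Acts as a stdin-to-stdout filter, but only when a date argument is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    json.dump(add_date_arg(json.load(sys.stdin), sys.argv[1]), sys.stdout)
```

To use it, replace the sed stage of the pipeline: emit the Job with `-o json` instead of `-o yaml`, pipe it through `python add_date.py 2026-02-17`, and pipe the result into `kubectl apply -f -`, which accepts JSON as well as YAML.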

Interpreting Scores

Grades

| Grade | Score range | Meaning |
| --- | --- | --- |
| GREEN | >= 98 | Pipeline healthy, all checks passing |
| YELLOW | >= 85 and < 98 | Minor issues, investigate when convenient |
| RED | < 85 | Significant issues, investigate promptly |

Check Statuses

| Status | Weight | Meaning |
| --- | --- | --- |
| pass | 1.0 | Check passed |
| warn | 0.7 | Threshold exceeded but not critical |
| fail | 0.0 | Check failed |
| skip | excluded | Not enough data to run, excluded from score |

Score = (sum of the weights of all non-skipped checks / number of non-skipped checks) * 100.
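
To make the arithmetic concrete, here is a small sketch that recomputes a score and grade from a list of check statuses, using the weights under Check Statuses and the bands under Grades. It is an illustration, not the code in data/dq.py: `WEIGHTS`, `score`, and `grade` are hypothetical names, and returning `None` when every check is skipped is an assumption, since this document does not say what dq.py does in that case.

```python
# Weights from the Check Statuses table; "skip" is left out on purpose.
WEIGHTS = {"pass": 1.0, "warn": 0.7, "fail": 0.0}


def score(statuses):
    """Mean weight of the non-skipped checks, scaled to 0-100.

    Returns None when every check was skipped (an assumption, see lead-in).
    """
    counted = [WEIGHTS[s] for s in statuses if s != "skip"]
    if not counted:
        return None
    return 100.0 * sum(counted) / len(counted)


def grade(value):
    """Map a score to the grade bands listed under Grades."""
    if value >= 98:
        return "GREEN"
    if value >= 85:
        return "YELLOW"
    return "RED"


# 13 checks, none skipped: a single warn already drops the grade to YELLOW.
one_warn = ["pass"] * 12 + ["warn"]
print(round(score(one_warn), 1), grade(score(one_warn)))  # 97.7 YELLOW

# Two fails out of 13 fall just below the 85 threshold.
two_fail = ["pass"] * 11 + ["fail"] * 2
print(round(score(two_fail), 1), grade(score(two_fail)))  # 84.6 RED
```

One consequence worth noting: with 13 counted checks, GREEN requires every one of them to pass, because even one warn gives (12 + 0.7) / 13 * 100, which is about 97.7.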

Exit Codes

  • dq.py exits 1 if any check has status fail
  • dq_email.py always exits 0 (email is the alert mechanism)

Notebook / Programmatic Usage

from dq import DQRunner

runner = DQRunner(bucket="ssmd-data", feed="kalshi", stream="crypto")
results = runner.run("2026-02-12")
results.summary()       # print human-readable report
results.score()         # float 0-100
results.to_json()       # JSON string

# Ad-hoc queries via the shared DuckDB connection
runner.con.execute(
    "SELECT * FROM read_parquet('gcs://ssmd-data/kalshi/crypto/2026-02-12/ticker_*.parquet') LIMIT 10"
).fetchdf()

# Date range
all_results = runner.run_range("2026-02-10", "2026-02-17")

Email Report

dq_email.py runs all 3 feeds, generates an HTML email with per-feed grades and check details, and sends it via SMTP.

Required env vars: SMTP_USER, SMTP_PASS, SMTP_TO

Optional env vars: SMTP_HOST (default: smtp.gmail.com), SMTP_PORT (default: 587)

These are provided in-cluster via the ssmd-smtp-credentials Secret.
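
When running dq_email.py outside the cluster, those variables have to be set by hand. A preflight check along the following lines can catch a missing one before any feed is processed. This is a sketch under stated assumptions, not the logic inside dq_email.py: `load_smtp_config` is a hypothetical helper, and the variable names and defaults are the ones listed in this section.

```python
import os

REQUIRED = ("SMTP_USER", "SMTP_PASS", "SMTP_TO")
DEFAULTS = {"SMTP_HOST": "smtp.gmail.com", "SMTP_PORT": "587"}


def load_smtp_config(env=None):
    """Collect SMTP settings, failing fast if a required variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    # Fall back to the documented defaults for the optional settings.
    for name, default in DEFAULTS.items():
        config[name] = env.get(name, default)
    config["SMTP_PORT"] = int(config["SMTP_PORT"])
    return config
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment, for example `load_smtp_config({"SMTP_USER": "u", "SMTP_PASS": "p", "SMTP_TO": "ops@example.com"})`.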

Post-Deploy / Post-Backfill Verification

After deploying a new DQ version or backfilling parquet data:

  1. Run DQ locally for all 3 feeds (see commands above)
  2. Verify target checks show PASS
  3. Optionally trigger in-cluster email: kubectl create job --from=cronjob/ssmd-dq-daily ...
  4. Verify email arrives with corrected scores

Image Build

The DQ image is built from data/Dockerfile. Builds are triggered by dq-v* tags in the 899bushwick repo (not ssmd).

See the ssmd-deploy skill for full deployment procedure.

Install via CLI
npx skills add https://github.com/aaronwald/dlawskillz --skill ssmd-dq-run