dev-data

star 4

Data engineering and analysis guide for orchestrated sub-agents. Data pipelines, ETL/ELT design, data quality validation, SQL optimization, and analysis patterns. Injected when role=data.

lidge-jun By lidge-jun schedule Updated 6/8/2026

name: dev-data description: "Data engineering and analysis guide for orchestrated sub-agents. Data pipelines, ETL/ELT design, data quality validation, SQL optimization, and analysis patterns. Injected when role=data." metadata: { "keywords": ["dashboard-db", "sqlite", "audit-log-schema", "connector-data"]

}

Dev-Data — Data Engineering & Analysis Guide

Production-grade data engineering patterns for building reliable data systems.

C0/C1 work (small local patches): See dev §0.0 Work Classifier + §0.1 Patch Fast-Path before reading references.

When to Activate

  • Building data pipelines or ETL/ELT processes
  • Processing CSV, JSON, Parquet, or Excel files
  • Writing SQL queries or designing database schemas
  • Setting up data quality checks or validation
  • Performing data analysis, aggregation, or reporting
  • Choosing between batch and streaming architectures

External/current data evidence

For current external dataset contracts, source freshness, pipeline/tool version behavior, provider data API changes, or public benchmark/source claims, read the active search skill and follow its query-rewrite, source-fetch, and evidence-status rules. Use browser fetch/open/text/get-dom/snapshot only after candidate URLs exist and the claim needs browser-verifiable source evidence.


1. Data Processing Principles

Five rules that apply to every data task:

Principle What It Means
Pipeline thinking Every pipeline is Extract → Transform → Load. Keep each stage as an independent, testable function.
Schema-first Define expected columns, types, and constraints BEFORE writing transformation logic.
Defensive parsing External data will have nulls, wrong types, extra columns, missing columns, and encoding issues. Assume all of these.
Idempotent operations Running the same pipeline twice on the same input must produce the same output. Use upsert patterns, not blind inserts.
Fail fast, fail loud Raise errors at pipeline boundaries immediately. Internal transforms propagate errors; dead-letter queues handle row-level quarantine at the boundary (see §3).

2. Data Ingestion Patterns

Format-Specific Guidance

Format Best For Watch Out For
CSV Simple tabular data, human-readable Encoding (UTF-8 BOM), delimiter ambiguity, multiline values, inconsistent quoting
JSON Nested structures, API responses Large files (stream, don't load all at once), deeply nested objects, encoding
Parquet Large analytical datasets, columnar queries Requires library support, not human-readable, schema evolution
Excel Business user handoffs Multiple sheets, merged cells, formulas vs. values, date formatting
Database Production system access Connection pooling, query timeouts, use read replicas for analytics

Incremental Loading

For large or frequently updated data sources:

  1. Use a watermark column (e.g., updated_at, id) to track the last processed record.
  2. Store the watermark after successful load. On failure, restart from the last saved watermark.
  3. Process in batches (tune based on source limits and memory), not all-at-once.
  4. Validate row counts: loaded_rows should equal source_rows_since_watermark.

Schema Validation on Ingest

Before any transformation, validate incoming data:

✅ Check: Expected columns exist
✅ Check: Data types match (string, number, date, boolean)
✅ Check: Required fields are not null
✅ Check: Values are within expected ranges
✅ Check: No unexpected duplicate keys
❌ Fail: If any check fails, write to error log with row details. Don't silently drop.

3. ETL/ELT Pipeline Design

Layer Architecture

Rules:

  • Keep staging immutable. Copy first, transform in a separate step — this enables replay and debugging.
  • One transformation per step. Don't combine cleaning + joining + aggregating in one function. Chain separate steps.
  • Incremental processing. Process only new/changed records when possible. Full reloads only when schema changes.

dbt Integration Patterns

When using dbt for transformations, follow the staging → intermediate → mart layer architecture:

Rules:

  • Staging models: rename, cast, filter NULLs — no joins, no business logic
  • Intermediate models: joins across staging, deduplication, business transforms
  • Mart models: aggregations, final business entities consumed by BI/analytics
  • Every model has a schema.yml with tests (not_null, unique, relationships, custom SQL).
  • Run validation tests in CI and after significant changes — treat test failures as pipeline failures.
  • Use dbt source freshness to monitor upstream data staleness

Error Handling in Pipelines

Scenario Pattern
Invalid records Write to dead-letter table/file for manual review. Preserve every record for debugging.
Source unavailable Retry with exponential backoff (1s, 2s, 4s). Alert after 3 failures.
Schema mismatch Halt pipeline. Log expected vs. actual schema. Don't attempt partial loads.
Duplicate records Use upsert (INSERT ON CONFLICT UPDATE) or deduplicate with window functions.

Orchestration Basics

When pipelines have multiple steps with dependencies:

  • Define tasks as a DAG (Directed Acyclic Graph). Each task depends on its upstream tasks.
  • Each task must be independently retryable. If step 3 fails, you restart step 3, not step 1.
  • Set reasonable retries (2-3) with delay (5 min between attempts).
  • Add timeout per task to prevent hung pipelines.
  • Alert on failure: email, Slack, or monitoring dashboard.

4. Data Quality

Validation Checks

Run these after every pipeline step, not just at the end:

Check What It Validates Example
Not null Required fields have values WHERE order_id IS NULL → 0 rows
Unique No duplicates on key columns COUNT(*) = COUNT(DISTINCT id)
Range Numeric values within bounds amount BETWEEN 0 AND 1,000,000
Categorical Values in allowed set status IN ('pending', 'active', 'closed')
Freshness Data is recent enough MAX(updated_at) > NOW() - INTERVAL '24 hours'
Row count No unexpected data loss or explosion Within ±10% of previous run
Referential Foreign keys point to existing records customer_id EXISTS IN customers

Quality Tool Integration

Use a layered quality strategy — different tools at different pipeline stages:

Stage Tool Purpose
Ingest Great Expectations Validate raw data against expectations before staging
Transform dbt tests Assert model-level quality (not_null, unique, relationships, custom SQL)
Production Soda / Monte Carlo Real-time monitoring, anomaly detection, SLA enforcement

Validate data dimensions: completeness, uniqueness, range, format, referential integrity, freshness.

Rule: Run validation on every pipeline step — skipping "because the data looks fine" leads to silent downstream corruption.

Data Contracts

For datasets shared between teams, define a contract:

A data contract must include:

  • name, owner, version
  • schema: column name, type, nullability, uniqueness, allowed values
  • SLA: freshness threshold, minimum completeness percentage
  • consumers: list of downstream teams/systems

Changes to a contracted schema require versioning and consumer notification.


5. Analysis & Reporting

Always Start with Summary Statistics

Before any deep analysis, provide:

Metric What to Report
Row count Total records in dataset
Column inventory Name, type, null count per column
Numeric summary min, max, mean, median, std dev
Categorical summary Unique values, top 5 most frequent
Time range Earliest and latest timestamp
Data quality Null percentage, duplicate percentage

Output Formats

Format When to Use
Markdown tables Inline reports, ≤50 rows, quick summaries
JSON Programmatic consumption, API responses
CSV export Handoff to spreadsheet users, large datasets
HTML + charts Dashboards, visual reports (Chart.js, Mermaid diagrams)

Statistical Reporting

When analysis involves statistics:

  • State the method used and its assumptions.
  • Report confidence intervals, not just point estimates.
  • Visualize distributions (histograms, box plots), not just averages.
  • Distinguish correlation from causation explicitly.

6. Architecture Decisions

Batch vs. Streaming

Condition Choose
Real-time insight required (sub-minute latency) Streaming (Kafka + Flink, Spark Structured Streaming, or Kafka Streams depending on complexity)
Exactly-once semantics needed Kafka transactional producers + Flink/Spark
Latency >1 min acceptable, volume >1TB/day Distributed batch (Spark, Databricks)
Latency >1 min acceptable, volume <1TB/day Single-node batch (SQL, Python, dbt)

Default to batch. Streaming adds significant complexity in error handling, state management, and debugging. Only use streaming when latency requirements genuinely demand it.

Streaming Decision Tiers (heuristic guidance)

Latency Requirement Framework Complexity
Sub-100ms, complex stateful Apache Flink High (dedicated cluster)
Sub-second, existing Spark infra Spark Structured Streaming Medium
Sub-second, Kafka-centric Kafka Streams (embedded library) Low-Medium
Minutes acceptable Batch with frequent scheduling Low

Kafka essentials for data engineers:

  • Partition by expected throughput — avoid excessive partitions
  • Use Schema Registry for backwards-compatible evolution
  • Default to at-least-once delivery + idempotent consumers
  • Use exactly-once only for financial/billing (transactional producers + consumers)
  • Monitor consumer lag via Prometheus/Grafana

See references/streaming.md for Kafka configuration, CDC patterns, and windowing.

Storage Selection

Need Choose
SQL analytics, BI dashboards, structured queries Data warehouse (Snowflake, BigQuery, PostgreSQL)
ML training, unstructured data, large-scale storage Data lake (S3/GCS + Parquet or Delta format)
Both SQL and ML needs Lakehouse (Delta Lake, Apache Iceberg)
Real-time key-value lookups, caching Redis, DynamoDB
Graph relationships Neo4j, Neptune

Tool Selection

Category Options
Orchestration Airflow, Prefect, Dagster
Transformation dbt, Spark, plain SQL
Streaming Kafka, Kinesis, Pub/Sub
Quality Great Expectations, dbt tests, Soda, custom validators
Monitoring Prometheus, Grafana, Datadog, Monte Carlo
Local analysis DuckDB (in-process SQL), Polars (fast DataFrame), pandas (exploration/ML)

Tool Decision Matrix

Factor pandas Polars DuckDB
Best for <100MB, exploration, ML prep >100MB, batch ETL, performance SQL analytics, ad-hoc queries
Execution Single-threaded, eager Multi-threaded Rust, lazy eval Vectorized, auto disk spill
Speed (groupby/join) Baseline 5-10x faster Matches Polars on SQL-native
Memory Full load into RAM Streaming, lazy chains Spill-to-disk for out-of-core
API style DataFrame (imperative) DataFrame (expression-based) SQL-first
ML interop Excellent (scikit-learn, etc.) Good (.to_pandas()) Good (.fetchdf())
File format CSV, JSON, Excel CSV, Parquet, Arrow-native CSV, Parquet, JSON, S3 direct

Decision rule:

Data size / workflow Recommended tool
Small (<100MB), interactive exploration pandas
Medium (100MB-10GB), batch transforms Polars
SQL-first analytics, any size DuckDB
Blended workflow Polars transforms, DuckDB aggregations (zero-copy via Arrow)

See references/tools.md for full patterns and code examples. See references/ml-pipeline.md for ML training pipelines, experiment tracking (MLflow 3.x), feature stores (Feast), and data versioning (DVC/Delta Lake).


7. Data Governance & PII

Data Classification

Level Examples Handling
Public Aggregated metrics, public reports No restrictions
Internal Business KPIs, operational data Access controls, no external sharing
Confidential Customer data, financial records Encryption at rest, column-level masking
Restricted SSN, payment data, health records Tokenization, row-level security, audit logging

PII Handling Checklist

Before building any pipeline that touches PII:

  • Classify all columns by sensitivity level
  • Apply masking/tokenization for non-production environments (static masking)
  • Implement dynamic masking for production queries (role-based)
  • Set data retention TTL — don't keep PII longer than needed
  • Support right-to-erasure (GDPR Article 17): cascading delete across all pipeline stages
  • Log all PII access for audit trail
  • Mask raw PII values before logs and traces — use structured logging with redaction

GDPR/CCPA Quick Reference

Requirement Engineering Pattern
Right to erasure Soft delete → batch purge → propagate to downstream stores including data lake
Data minimization Collect only necessary fields; TTL on non-essential data
Consent tracking Consent event store with versioned preferences; consent-aware pipeline branches
Data portability Standardized export endpoint (JSON/CSV) per user request

See references/governance.md for detailed implementation patterns, row-level security, and retention policies.


8. Query Performance Guidelines

  • Every query that runs in production: EXPLAIN ANALYZE before deploy
  • Slow query threshold: > 100ms for OLTP, > 5s for OLAP/analytics
  • Index strategy: B-tree for equality/range, GIN for array/JSONB, GiST for geo
  • Missing index detection: pg_stat_user_tables → seq_scan / idx_scan ratio
  • Partition tables > 10M rows if query patterns allow time-range or hash partitioning
  • Never SELECT * in production code — specify columns

For pipeline observability, follow the OpenTelemetry patterns in dev-backend/references/core/observability.md. Instrument pipeline stages as spans, data quality checks as events.

When pipeline errors surface through APIs, use the AppError taxonomy from dev-backend/SKILL.md §3. Map pipeline failures to appropriate HTTP status codes (422 for validation, 502 for upstream failures, 503 for capacity).

For data API patterns (pagination of large datasets, cursor-based access, streaming responses), see dev-backend/references/core/api-design.md.


9. Companion Skills

Data engineering does not exist in isolation. Cross-reference these skills when your pipeline connects to other systems:

Companion When to Consult Key Sections
dev-backend Exposing data via API, response envelope shape, pagination §5 API Response Contract, §2 Layered Architecture
dev-security PII handling, data classification, access controls, audit logging, input validation policy (per dev-security §10 ownership matrix) §1 Input Validation, §4 Secrets, §8 Pre-Flight
dev-testing Pipeline validation, contract tests for data APIs, CI gates §2 Backend & API Testing, §3 Contract Testing
dev-frontend Downstream reporting/dashboard consumers, data format expectations §15 Backend Contract & Security Alignment

Integration patterns:

  • Data APIs serving frontend dashboards must use the standard response envelope (dev-backend §5)
  • PII pipelines must classify columns and apply masking per dev-security guidance before this skill's §7 rules
  • Data contract changes (§4 Data Contracts) must notify downstream consumers including frontend teams

Install via CLI
npx skills add https://github.com/lidge-jun/cli-jaw-skills --skill dev-data
Repository Details
star Stars 4
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator