name: dev-data description: "Data engineering and analysis guide for orchestrated sub-agents. Data pipelines, ETL/ELT design, data quality validation, SQL optimization, and analysis patterns. Injected when role=data." metadata: { "keywords": ["dashboard-db", "sqlite", "audit-log-schema", "connector-data"]
}
Dev-Data — Data Engineering & Analysis Guide
Production-grade data engineering patterns for building reliable data systems.
C0/C1 work (small local patches): See
dev§0.0 Work Classifier + §0.1 Patch Fast-Path before reading references.
When to Activate
- Building data pipelines or ETL/ELT processes
- Processing CSV, JSON, Parquet, or Excel files
- Writing SQL queries or designing database schemas
- Setting up data quality checks or validation
- Performing data analysis, aggregation, or reporting
- Choosing between batch and streaming architectures
External/current data evidence
For current external dataset contracts, source freshness, pipeline/tool version
behavior, provider data API changes, or public benchmark/source claims, read the
active search skill and follow its query-rewrite, source-fetch, and
evidence-status rules. Use browser fetch/open/text/get-dom/snapshot only after
candidate URLs exist and the claim needs browser-verifiable source evidence.
1. Data Processing Principles
Five rules that apply to every data task:
| Principle | What It Means |
|---|---|
| Pipeline thinking | Every pipeline is Extract → Transform → Load. Keep each stage as an independent, testable function. |
| Schema-first | Define expected columns, types, and constraints BEFORE writing transformation logic. |
| Defensive parsing | External data will have nulls, wrong types, extra columns, missing columns, and encoding issues. Assume all of these. |
| Idempotent operations | Running the same pipeline twice on the same input must produce the same output. Use upsert patterns, not blind inserts. |
| Fail fast, fail loud | Raise errors at pipeline boundaries immediately. Internal transforms propagate errors; dead-letter queues handle row-level quarantine at the boundary (see §3). |
2. Data Ingestion Patterns
Format-Specific Guidance
| Format | Best For | Watch Out For |
|---|---|---|
| CSV | Simple tabular data, human-readable | Encoding (UTF-8 BOM), delimiter ambiguity, multiline values, inconsistent quoting |
| JSON | Nested structures, API responses | Large files (stream, don't load all at once), deeply nested objects, encoding |
| Parquet | Large analytical datasets, columnar queries | Requires library support, not human-readable, schema evolution |
| Excel | Business user handoffs | Multiple sheets, merged cells, formulas vs. values, date formatting |
| Database | Production system access | Connection pooling, query timeouts, use read replicas for analytics |
Incremental Loading
For large or frequently updated data sources:
- Use a watermark column (e.g.,
updated_at,id) to track the last processed record. - Store the watermark after successful load. On failure, restart from the last saved watermark.
- Process in batches (tune based on source limits and memory), not all-at-once.
- Validate row counts:
loaded_rowsshould equalsource_rows_since_watermark.
Schema Validation on Ingest
Before any transformation, validate incoming data:
✅ Check: Expected columns exist
✅ Check: Data types match (string, number, date, boolean)
✅ Check: Required fields are not null
✅ Check: Values are within expected ranges
✅ Check: No unexpected duplicate keys
❌ Fail: If any check fails, write to error log with row details. Don't silently drop.
3. ETL/ELT Pipeline Design
Layer Architecture
Rules:
- Keep staging immutable. Copy first, transform in a separate step — this enables replay and debugging.
- One transformation per step. Don't combine cleaning + joining + aggregating in one function. Chain separate steps.
- Incremental processing. Process only new/changed records when possible. Full reloads only when schema changes.
dbt Integration Patterns
When using dbt for transformations, follow the staging → intermediate → mart layer architecture:
Rules:
- Staging models: rename, cast, filter NULLs — no joins, no business logic
- Intermediate models: joins across staging, deduplication, business transforms
- Mart models: aggregations, final business entities consumed by BI/analytics
- Every model has a
schema.ymlwith tests (not_null, unique, relationships, custom SQL). - Run validation tests in CI and after significant changes — treat test failures as pipeline failures.
- Use
dbt source freshnessto monitor upstream data staleness
Error Handling in Pipelines
| Scenario | Pattern |
|---|---|
| Invalid records | Write to dead-letter table/file for manual review. Preserve every record for debugging. |
| Source unavailable | Retry with exponential backoff (1s, 2s, 4s). Alert after 3 failures. |
| Schema mismatch | Halt pipeline. Log expected vs. actual schema. Don't attempt partial loads. |
| Duplicate records | Use upsert (INSERT ON CONFLICT UPDATE) or deduplicate with window functions. |
Orchestration Basics
When pipelines have multiple steps with dependencies:
- Define tasks as a DAG (Directed Acyclic Graph). Each task depends on its upstream tasks.
- Each task must be independently retryable. If step 3 fails, you restart step 3, not step 1.
- Set reasonable retries (2-3) with delay (5 min between attempts).
- Add timeout per task to prevent hung pipelines.
- Alert on failure: email, Slack, or monitoring dashboard.
4. Data Quality
Validation Checks
Run these after every pipeline step, not just at the end:
| Check | What It Validates | Example |
|---|---|---|
| Not null | Required fields have values | WHERE order_id IS NULL → 0 rows |
| Unique | No duplicates on key columns | COUNT(*) = COUNT(DISTINCT id) |
| Range | Numeric values within bounds | amount BETWEEN 0 AND 1,000,000 |
| Categorical | Values in allowed set | status IN ('pending', 'active', 'closed') |
| Freshness | Data is recent enough | MAX(updated_at) > NOW() - INTERVAL '24 hours' |
| Row count | No unexpected data loss or explosion | Within ±10% of previous run |
| Referential | Foreign keys point to existing records | customer_id EXISTS IN customers |
Quality Tool Integration
Use a layered quality strategy — different tools at different pipeline stages:
| Stage | Tool | Purpose |
|---|---|---|
| Ingest | Great Expectations | Validate raw data against expectations before staging |
| Transform | dbt tests | Assert model-level quality (not_null, unique, relationships, custom SQL) |
| Production | Soda / Monte Carlo | Real-time monitoring, anomaly detection, SLA enforcement |
Validate data dimensions: completeness, uniqueness, range, format, referential integrity, freshness.
Rule: Run validation on every pipeline step — skipping "because the data looks fine" leads to silent downstream corruption.
Data Contracts
For datasets shared between teams, define a contract:
A data contract must include:
- name, owner, version
- schema: column name, type, nullability, uniqueness, allowed values
- SLA: freshness threshold, minimum completeness percentage
- consumers: list of downstream teams/systems
Changes to a contracted schema require versioning and consumer notification.
5. Analysis & Reporting
Always Start with Summary Statistics
Before any deep analysis, provide:
| Metric | What to Report |
|---|---|
| Row count | Total records in dataset |
| Column inventory | Name, type, null count per column |
| Numeric summary | min, max, mean, median, std dev |
| Categorical summary | Unique values, top 5 most frequent |
| Time range | Earliest and latest timestamp |
| Data quality | Null percentage, duplicate percentage |
Output Formats
| Format | When to Use |
|---|---|
| Markdown tables | Inline reports, ≤50 rows, quick summaries |
| JSON | Programmatic consumption, API responses |
| CSV export | Handoff to spreadsheet users, large datasets |
| HTML + charts | Dashboards, visual reports (Chart.js, Mermaid diagrams) |
Statistical Reporting
When analysis involves statistics:
- State the method used and its assumptions.
- Report confidence intervals, not just point estimates.
- Visualize distributions (histograms, box plots), not just averages.
- Distinguish correlation from causation explicitly.
6. Architecture Decisions
Batch vs. Streaming
| Condition | Choose |
|---|---|
| Real-time insight required (sub-minute latency) | Streaming (Kafka + Flink, Spark Structured Streaming, or Kafka Streams depending on complexity) |
| Exactly-once semantics needed | Kafka transactional producers + Flink/Spark |
| Latency >1 min acceptable, volume >1TB/day | Distributed batch (Spark, Databricks) |
| Latency >1 min acceptable, volume <1TB/day | Single-node batch (SQL, Python, dbt) |
Default to batch. Streaming adds significant complexity in error handling, state management, and debugging. Only use streaming when latency requirements genuinely demand it.
Streaming Decision Tiers (heuristic guidance)
| Latency Requirement | Framework | Complexity |
|---|---|---|
| Sub-100ms, complex stateful | Apache Flink | High (dedicated cluster) |
| Sub-second, existing Spark infra | Spark Structured Streaming | Medium |
| Sub-second, Kafka-centric | Kafka Streams (embedded library) | Low-Medium |
| Minutes acceptable | Batch with frequent scheduling | Low |
Kafka essentials for data engineers:
- Partition by expected throughput — avoid excessive partitions
- Use Schema Registry for backwards-compatible evolution
- Default to at-least-once delivery + idempotent consumers
- Use exactly-once only for financial/billing (transactional producers + consumers)
- Monitor consumer lag via Prometheus/Grafana
See references/streaming.md for Kafka configuration, CDC patterns, and windowing.
Storage Selection
| Need | Choose |
|---|---|
| SQL analytics, BI dashboards, structured queries | Data warehouse (Snowflake, BigQuery, PostgreSQL) |
| ML training, unstructured data, large-scale storage | Data lake (S3/GCS + Parquet or Delta format) |
| Both SQL and ML needs | Lakehouse (Delta Lake, Apache Iceberg) |
| Real-time key-value lookups, caching | Redis, DynamoDB |
| Graph relationships | Neo4j, Neptune |
Tool Selection
| Category | Options |
|---|---|
| Orchestration | Airflow, Prefect, Dagster |
| Transformation | dbt, Spark, plain SQL |
| Streaming | Kafka, Kinesis, Pub/Sub |
| Quality | Great Expectations, dbt tests, Soda, custom validators |
| Monitoring | Prometheus, Grafana, Datadog, Monte Carlo |
| Local analysis | DuckDB (in-process SQL), Polars (fast DataFrame), pandas (exploration/ML) |
Tool Decision Matrix
| Factor | pandas | Polars | DuckDB |
|---|---|---|---|
| Best for | <100MB, exploration, ML prep | >100MB, batch ETL, performance | SQL analytics, ad-hoc queries |
| Execution | Single-threaded, eager | Multi-threaded Rust, lazy eval | Vectorized, auto disk spill |
| Speed (groupby/join) | Baseline | 5-10x faster | Matches Polars on SQL-native |
| Memory | Full load into RAM | Streaming, lazy chains | Spill-to-disk for out-of-core |
| API style | DataFrame (imperative) | DataFrame (expression-based) | SQL-first |
| ML interop | Excellent (scikit-learn, etc.) | Good (.to_pandas()) |
Good (.fetchdf()) |
| File format | CSV, JSON, Excel | CSV, Parquet, Arrow-native | CSV, Parquet, JSON, S3 direct |
Decision rule:
| Data size / workflow | Recommended tool |
|---|---|
| Small (<100MB), interactive exploration | pandas |
| Medium (100MB-10GB), batch transforms | Polars |
| SQL-first analytics, any size | DuckDB |
| Blended workflow | Polars transforms, DuckDB aggregations (zero-copy via Arrow) |
See references/tools.md for full patterns and code examples.
See references/ml-pipeline.md for ML training pipelines, experiment tracking (MLflow 3.x), feature stores (Feast), and data versioning (DVC/Delta Lake).
7. Data Governance & PII
Data Classification
| Level | Examples | Handling |
|---|---|---|
| Public | Aggregated metrics, public reports | No restrictions |
| Internal | Business KPIs, operational data | Access controls, no external sharing |
| Confidential | Customer data, financial records | Encryption at rest, column-level masking |
| Restricted | SSN, payment data, health records | Tokenization, row-level security, audit logging |
PII Handling Checklist
Before building any pipeline that touches PII:
- Classify all columns by sensitivity level
- Apply masking/tokenization for non-production environments (static masking)
- Implement dynamic masking for production queries (role-based)
- Set data retention TTL — don't keep PII longer than needed
- Support right-to-erasure (GDPR Article 17): cascading delete across all pipeline stages
- Log all PII access for audit trail
- Mask raw PII values before logs and traces — use structured logging with redaction
GDPR/CCPA Quick Reference
| Requirement | Engineering Pattern |
|---|---|
| Right to erasure | Soft delete → batch purge → propagate to downstream stores including data lake |
| Data minimization | Collect only necessary fields; TTL on non-essential data |
| Consent tracking | Consent event store with versioned preferences; consent-aware pipeline branches |
| Data portability | Standardized export endpoint (JSON/CSV) per user request |
See references/governance.md for detailed implementation patterns, row-level security, and retention policies.
8. Query Performance Guidelines
- Every query that runs in production: EXPLAIN ANALYZE before deploy
- Slow query threshold: > 100ms for OLTP, > 5s for OLAP/analytics
- Index strategy: B-tree for equality/range, GIN for array/JSONB, GiST for geo
- Missing index detection:
pg_stat_user_tables→ seq_scan / idx_scan ratio - Partition tables > 10M rows if query patterns allow time-range or hash partitioning
- Never
SELECT *in production code — specify columns
For pipeline observability, follow the OpenTelemetry patterns in dev-backend/references/core/observability.md. Instrument pipeline stages as spans, data quality checks as events.
When pipeline errors surface through APIs, use the AppError taxonomy from dev-backend/SKILL.md §3. Map pipeline failures to appropriate HTTP status codes (422 for validation, 502 for upstream failures, 503 for capacity).
For data API patterns (pagination of large datasets, cursor-based access, streaming responses), see dev-backend/references/core/api-design.md.
9. Companion Skills
Data engineering does not exist in isolation. Cross-reference these skills when your pipeline connects to other systems:
| Companion | When to Consult | Key Sections |
|---|---|---|
dev-backend |
Exposing data via API, response envelope shape, pagination | §5 API Response Contract, §2 Layered Architecture |
dev-security |
PII handling, data classification, access controls, audit logging, input validation policy (per dev-security §10 ownership matrix) | §1 Input Validation, §4 Secrets, §8 Pre-Flight |
dev-testing |
Pipeline validation, contract tests for data APIs, CI gates | §2 Backend & API Testing, §3 Contract Testing |
dev-frontend |
Downstream reporting/dashboard consumers, data format expectations | §15 Backend Contract & Security Alignment |
Integration patterns:
- Data APIs serving frontend dashboards must use the standard response envelope (
dev-backend§5) - PII pipelines must classify columns and apply masking per
dev-securityguidance before this skill's §7 rules - Data contract changes (§4 Data Contracts) must notify downstream consumers including frontend teams