name: data-checking-eda
description: Check, profile, tidy, and explore tabular datasets before modeling or reporting. Use when Codex needs to inspect CSV, TSV, JSON lines, Parquet metadata, spreadsheet exports, database query results, joins, missingness, duplicates, outliers, row counts, column types, units, exploratory plots, or first-pass descriptive summaries for a data science analysis.
Data Checking And EDA
Run cheap checks before trusting any analysis. Prefer deterministic profiling scripts and project-native validation over visual inspection alone.
Domain Context Requirement
Use the Domain Context Contract to decide which columns, units, row entities, time windows, labels, leakage paths, missingness patterns, and data quality failures matter most. EDA must be adapted to the domain decision and success metric, not limited to generic column summaries. If the contract is missing for a business/domain task, request or create it before declaring EDA complete.
Minimum Checks
- Restate the domain, unit of analysis, target/KPI, success metric, and prohibited claims from the domain contract.
- Confirm file paths, row counts, column counts, encoding, delimiter, and parse errors.
- Inspect column names, inferred types, example values, units, and codebook availability.
- Check missingness by column and by row.
- Check duplicates and candidate keys.
- Check categorical cardinality and unexpected levels.
- Check numeric ranges, impossible values, zeros, negative values, and extreme values.
- Check date/time parsing, time zones, gaps, and sortedness.
- Check join keys before and after joins: one-to-one, one-to-many, dropped rows, duplicated rows.
- Check target leakage before predictive modeling: confirm that each feature is available at prediction time in the domain, and cross-check against the contract's prohibited claims.
- Make first plots and tables that match the analytic question and stakeholder decision.
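Several of the checks above (missingness by column, exact duplicates, candidate keys, and join cardinality) can be sketched with the Python standard library alone. This is a minimal illustration, not the behavior of `dataset_profile.py`: the function names, the `MISSING_TOKENS` set, and the sample `orders`/`customers` data are all hypothetical, and a project-native profiler should take precedence where one exists.

```python
import csv
import io
from collections import Counter

# Assumption: which strings count as "missing" depends on the codebook.
MISSING_TOKENS = {"", "NA", "N/A", "null", "None"}


def load_rows(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def _is_missing(value):
    # DictReader yields None for fields absent from a short row.
    return value is None or value.strip() in MISSING_TOKENS


def missing_by_column(rows):
    """Return {column: fraction of rows whose value is missing}."""
    if not rows:
        return {}
    columns = list(rows[0])
    counts = Counter()
    for row in rows:
        for col in columns:
            if _is_missing(row.get(col)):
                counts[col] += 1
    return {col: counts[col] / len(rows) for col in columns}


def duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    if not rows:
        return 0
    columns = list(rows[0])
    seen = Counter(tuple(row.get(c) for c in columns) for row in rows)
    return sum(n - 1 for n in seen.values())


def is_candidate_key(rows, columns):
    """True if the column combination is unique and never missing."""
    keys = [tuple(row.get(c) for c in columns) for row in rows]
    has_missing = any(_is_missing(v) for key in keys for v in key)
    return not has_missing and len(set(keys)) == len(keys)


def join_report(left, right, key):
    """Summarise what a left join on `key` would do to row counts."""
    right_counts = Counter(row[key] for row in right)
    matched = [right_counts.get(row[key], 0) for row in left]
    return {
        "left_rows": len(left),
        "left_unmatched": sum(1 for m in matched if m == 0),
        # An unmatched left row survives once; a matched row fans out.
        "rows_after_left_join": sum(max(m, 1) for m in matched),
        "right_key_unique": all(n == 1 for n in right_counts.values()),
    }


# Hypothetical sample data, for illustration only.
orders = load_rows("order_id,customer_id,amount\n1,a,10\n2,b,\n2,b,\n3,z,5\n")
customers = load_rows("customer_id,region\na,north\nb,south\nb,east\n")

print(missing_by_column(orders))
print(duplicate_rows(orders), is_candidate_key(orders, ["order_id"]))
print(join_report(orders, customers, "customer_id"))
```

Comparing `left_rows` with `rows_after_left_join` before running the real join shows whether a supposedly one-to-one key will silently fan out, and `left_unmatched` shows how many rows an inner join would drop.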
Script
Use ../../scripts/dataset_profile.py for quick local CSV/TSV/JSONL profiling when no project-specific profiler exists.
Example:
python3 handy-data-science/scripts/dataset_profile.py data.csv --max-rows 100000
EDA Output
Return a short evidence block:
Domain context used:
Data checked:
Rows/columns:
Key columns:
Missingness:
Duplicates:
Type/range issues:
Join risks:
Leakage risks:
Useful first plots/tables:
Next analysis step:
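One way to keep the evidence block consistent is to render it from a dictionary in a fixed field order, so that any check that was skipped shows up explicitly instead of disappearing. The sketch below is an assumption about how this could be done; `EVIDENCE_FIELDS`, `render_evidence_block`, and the `NOT CHECKED` placeholder are hypothetical names, not part of the skill or its script.

```python
# Field names copied from the evidence block template, in the same order.
EVIDENCE_FIELDS = [
    "Domain context used",
    "Data checked",
    "Rows/columns",
    "Key columns",
    "Missingness",
    "Duplicates",
    "Type/range issues",
    "Join risks",
    "Leakage risks",
    "Useful first plots/tables",
    "Next analysis step",
]


def render_evidence_block(findings):
    """Render findings in template order; flag anything not checked."""
    unknown = set(findings) - set(EVIDENCE_FIELDS)
    if unknown:
        # Fail loudly on typos rather than silently dropping a finding.
        raise KeyError(f"unexpected evidence fields: {sorted(unknown)}")
    return "\n".join(
        f"{field}: {findings.get(field, 'NOT CHECKED')}"
        for field in EVIDENCE_FIELDS
    )


# Hypothetical partial findings, for illustration only.
print(render_evidence_block({"Rows/columns": "4 rows, 3 columns"}))
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled field name would otherwise produce a block that looks complete while omitting a result.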