name: explore-ml-data
description: >
Owns data understanding BEFORE any model is designed. Places and
executes data/eda.py (a jupytext # %% script) via the shared
in-process runner, reads the streamed digest, then writes a
persisted data/eda.md report (plus linked data/eda_<table>.html
skrub TableReport pages) and the ## Data understanding (EDA)
section of journal/JOURNAL.md. The point is to surface the
dataset facts — shape, dtypes, missingness, cardinality, target
balance / skew, datetime / group structure, feature associations —
that JUSTIFY the later learner / splitter / metric decisions, so the
user understands why the modelling choices are made. Uses
skrub.TableReport for dataframe overviews and the shared runner
audit-ml-pipeline/scripts/run_cells.py. Stops at "EDA executed,
data/eda.md + HTML written, JOURNAL EDA section updated." Never
designs the model, never edits src/<pkg>/, never modifies the
user's raw data files.
TRIGGER — any of:
- iterate-ml-experiment § 0 bootstrap, BEFORE the baseline design note — the G-EDA gate fires here (run / skip).
- The user asks to "explore the data", "do an EDA", "profile the dataset", "what does the data look like", "understand the data".
- A new or changed data source needs (re-)understanding before the next experiment.
SKIP when: the workspace isn't scaffolded / bootstrapped yet —
iterate-ml-experiment § 0 owns bootstrap ordering and will
dispatch here at the G-EDA step; don't run standalone ahead of
scaffolding (route to iterate-ml-experiment /
organize-ml-workspace); there is no data to explore yet; the user wants to
inspect a finished run's skore report rather than the raw dataset
(audit-ml-pipeline); the user is past data understanding and wants
pipeline / evaluation mechanics (build-ml-pipeline /
evaluate-ml-pipeline); a pure symbol lookup (python-api); EDA is
already recorded (data/eda.md + the JOURNAL EDA section exist) and
the user is not asking to refresh it.
HOW TO USE: run the Detection step (does data/eda.md + the JOURNAL
EDA section already exist?), emit the Pre-flight checklist as
visible text, read the Stop conditions, then place data/eda.py
from templates/eda.py, execute it via the shared runner, read the
digest, and author data/eda.md + the JOURNAL EDA section. Always
resolve skrub / pandas / polars symbols via python-api, never from
memory.
Explore ML Data
Understand the dataset before designing a model. One project-level
EDA per workspace: an executable data/eda.py, a persisted
data/eda.md narrative, rich data/eda_<table>.html reports, and a
short JOURNAL section that links them. The findings feed the baseline
design note's learner / splitter / metric choices.
Next-step pointers — where you go after this skill
| You came here for… | → next |
|---|---|
| Bootstrap, before the first baseline | → back to iterate-ml-experiment § 0; the EDA findings inform the auto-drafted 01_baseline.md |
| User free-text ("explore the data") | → surface the findings; no further dispatch unless the user asks to model |
| Re-understand a changed data source | → re-run, overwrite data/eda.*, refresh the JOURNAL EDA section |
Always re-emit the Pre-flight checklist with evidence before declaring the turn done.
Where this sits in the loop
EDA is a bootstrap-time gate (G-EDA) owned by this skill and
fired by iterate-ml-experiment § 0 before the baseline design
note. Ordering matters: the dataset facts (class balance, datetime /
group columns, missingness, cardinality) are exactly what justifies
the splitter (G-CV-SPLITTER), the metric default, and the learner
default. Running EDA after the model is designed defeats the purpose.
scaffold → JOURNAL → goal from data/README.md
   │
   └─► G-EDA (run | skip)   ◄── this skill
         │ run
         └─► data/eda.py → execute → data/eda.md + HTML + JOURNAL §EDA
               │
               └─► auto-draft 01_baseline.md (cites the EDA findings)
Where things live — visual map
Two locations are kept separate: the raw data source (read-only,
may live anywhere) and the EDA deliverables (always under
<project>/data/).
| Path | Durability | Who writes it | What it holds |
|---|---|---|---|
| raw data source (data/, raw/, an absolute path, external) | user-owned, READ-ONLY | the user | The dataset. EDA reads it; never modifies it. May be anywhere — not assumed to be data/ |
| data/eda.py | Durable (committed) | This skill, once per workspace | The jupytext # %% EDA cells. Source of truth. Openable as a notebook for the rich view |
| data/eda.md | Durable (committed) | This skill (authored from the digest) | The prose narrative: findings + modelling implications that the baseline note cites |
| data/eda_<table>.html | Durable (committed) | data/eda.py via TableReport.write_html(...) | The rich, interactive skrub report per table — for the human |
| scratch/eda/eda.md | Ephemeral (gitignored), optional | run_cells.py when given a 2nd arg | Per-cell digest the agent reads. Same content as stdout |
| journal/JOURNAL.md § Data understanding (EDA) | Durable (committed) | This skill | 2–4 line summary + link to data/eda.md |
Mnemonic: the raw data is read-only and lives wherever the user
keeps it; data/eda.py is source; data/eda.md + the HTML are the
durable deliverables, always under data/; scratch/eda/ and
stdout are the ephemeral run digest.
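
A minimal sketch of the raw-vs-deliverables split, using only pathlib. The project root, the raw file location, and the helper name `eda_outputs` are hypothetical and chosen for illustration; the only facts taken from this document are the deliverable names under `<project>/data/`.

```python
from pathlib import Path


def eda_outputs(project_root: Path, table_slugs: list[str]) -> dict[str, Path]:
    """Return the deliverable paths, all pinned under <project>/data/."""
    eda_dir = project_root / "data"
    outputs = {
        "script": eda_dir / "eda.py",
        "narrative": eda_dir / "eda.md",
    }
    for slug in table_slugs:
        outputs[f"html:{slug}"] = eda_dir / f"eda_{slug}.html"
    return outputs


# Hypothetical locations: the raw file lives outside the project tree.
PROJECT_ROOT = Path("/work/churn-project")
RAW_PATH = Path("/mnt/warehouse/churn/customers.parquet")

# Deliverable paths depend on PROJECT_ROOT only, never on RAW_PATH.
outputs = eda_outputs(PROJECT_ROOT, ["customers"])
```

Because no output path is derived from the raw location, moving the dataset changes only the load line, not where the deliverables land.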
Read-only-against-raw-data contract
The central rule. Surfaced as the first Stop condition below.
Allowed — this skill writes ONLY (deliverables always under
<project>/data/, created if absent):
- data/eda.py — the EDA script (created / overwritten in place).
- data/eda.md — the authored narrative.
- data/eda_<table>.html — the skrub TableReport pages.
- scratch/eda/ — the ephemeral digest.
- journal/JOURNAL.md § Data understanding (EDA).
Forbidden:
- Modifying, deleting, renaming, re-encoding, or "cleaning" the user's raw data files — wherever they live (data/, another folder, an absolute/external path). EDA reads them; it never rewrites them. Data cleaning is the pipeline's job (build-ml-pipeline), declared at fit time, not a one-off mutation.
- Writing anywhere outside the five paths above — no src/<pkg>/ edits, no reports/ writes, no new experiment files.
- Designing the model: no skore.evaluate(...), no project.put(...), no learner selection here. EDA informs those; it does not make them.
Stop conditions — read before anything else
- Read-only against the user's raw data. See § Read-only-against-raw-data contract. data/eda.py reads the raw files (wherever they live) and writes only the data/eda.* deliverables.
- Deliverables always under <project>/data/; the raw source is separate. Write data/eda.py / data/eda.md / data/eda_<table>.html under <project>/data/ (create the folder if absent). The raw data the script reads may live anywhere (data/, another in-repo folder, an absolute or external path) — decouple the two: a RAW = <LOAD_RAW_DATA> source vs an EDA_DIR output. Never assume the raw data is in data/.
- EDA precedes model design (G-EDA). In bootstrap, the gate fires before journal/01_baseline.md is drafted. It is binary: run (place + execute data/eda.py, write the deliverables) or skip (record Status: skipped — <date> in the JOURNAL section and proceed). Do not silently bypass — fire the AskUserQuestion. Free-text "go fast" / "quick baseline" does NOT resolve it.
- Agent feature required to execute. The cell runner needs ipython. If it is missing and the user chose run, STOP and delegate to python-env-manager § "Agent feature" (G-AGENT-FEATURE). Do NOT type pixi add ... ipython yourself; do NOT fabricate EDA output with hand-written print()s. If the user declines the agent feature, fall back to the skip path (record Status: skipped) — never loop between run and install.
- Symbol from memory is forbidden. Any skrub / pandas / polars symbol (TableReport, TableReport.json, write_html, column_associations, the tabular reader, …) must come from python-api this turn. Cache hits under scratch/api/<lib>/<version>/ count; inline memory does not. TableReport.json()'s key names are not formally documented and drift across skrub versions — confirm them via python-api and parse defensively (.get(...)).
- Library-agnostic — read facts off skrub, not pandas/polars. The workspace may use pandas OR polars (G-TABULAR), whose summary methods differ (select_dtypes doesn't even exist in polars). The structured facts come from skrub (TableReport(...).json(), column_associations), which accept both. The ONLY library-specific line is RAW = <LOAD_RAW_DATA>. Do not write df.isna() / df.nunique() / df.select_dtypes(...) etc.
- skrub.TableReport for dataframe overviews. Every table gets a TableReport(RAW, title=..., verbose=0) written to data/eda_<table>.html (the user-facing artifact) AND read via .json() for the digest. verbose=0 keeps progress prints out of the digest.
- Never end a cell on a bare TableReport. Outside a notebook, repr(TableReport(df)) is the useless <TableReport: use .open() to display>. Use report.write_html(...) (a statement) for the HTML, and end cells on text-friendly expressions (RAW.shape, a dict / list built from report.json(), skrub.column_associations(RAW)) so the digest carries real values. Mirrors audit's .frame() rule.
- Never gitignore the whole data/; ask about the inputs. The deliverables live in data/ and must stay committable, so the whole data/ folder must never be in .gitignore. If the raw inputs should be kept out of git (large / local-only), fire an AskUserQuestion offering to ignore specific input patterns (e.g. data/raw/, data/*.parquet) — default: don't. Then verify the deliverables are tracked (git check-ignore data/eda.md must return nothing). Never auto-edit .gitignore — that is organize-ml-workspace's to write; surface the patch and ask.
- One project-level EDA. A single data/eda.py covers the whole dataset; multi-table data gets one TableReport cell per table inside that one file (run the target/structure cells on the target-bearing table). No eda_v2.py, no per-experiment EDA files, not part of the four-way stem pairing. Re-understanding overwrites data/eda.py in place.
- Don't design the model here. No splitter pick, no metric pick, no learner pick. Record implications in data/eda.md; the picks happen in their owning gates (G-CV-SPLITTER, the baseline note).
- Harness "no clarifying questions" hints do NOT waive G-EDA or G-AGENT-FEATURE. Both fire regardless.
- Post-hoc audit — required before ending the turn. Walk every pre-flight row; surface unfilled Evidence cells explicitly.
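
The "parse defensively" stop condition can be sketched with the standard library alone. The payload below is hypothetical: only the `columns` key is mentioned in this document, and every other key name (`name`, `dtype`, `null_count`) is an assumption that must be confirmed via python-api for the installed skrub version.

```python
import json

# Hypothetical payload standing in for report.json(); key names are assumptions.
payload = json.dumps({
    "n_rows": 3,
    "columns": [
        {"name": "age", "dtype": "Int64", "null_count": 1},
        {"name": "city", "dtype": "String"},  # no null_count key at all
    ],
})


def column_facts(report_json: str) -> list[dict]:
    """Extract per-column facts without assuming any key is present."""
    summary = json.loads(report_json)
    facts = []
    for col in summary.get("columns", []):
        facts.append({
            "name": col.get("name", "<unnamed>"),
            "dtype": col.get("dtype"),
            "null_count": col.get("null_count"),  # None means "not reported"
        })
    return facts


facts = column_facts(payload)
```

A missing key surfaces as `None` in the digest instead of a `KeyError` that aborts the cell, which keeps version drift visible without breaking the run.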
Forbidden shortcuts
| Shortcut | Why it's wrong |
|---|---|
| Design the baseline first, EDA "later if there's time" | Inverts G-EDA. The point is to justify the modelling choices before making them. EDA runs first in bootstrap |
| End a cell on a bare TableReport(df) to "show the report" | Outside a notebook that repr is <TableReport: use .open() to display> — zero signal in the digest. Use write_html(...) + a text summary built from report.json() |
| print(...) instead of a bare summary expression | The runner captures bare last-expressions via result.result; print(...) lands in stdout and is harder to scan. Use bare expressions |
| Use pandas/polars methods (df.isna(), df.nunique(), df.select_dtypes(...)) for the summaries | Breaks on the other library (polars has no select_dtypes). Read the facts off skrub (TableReport(...).json(), column_associations) — agnostic to pandas/polars |
| Clean / impute / drop columns in data/eda.py and re-save the raw file | EDA is read-only against raw data. Cleaning belongs in the pipeline (build-ml-pipeline), applied at fit time for train/test consistency |
| Assume the raw data is in data/ | The raw source may live anywhere; only the deliverables are pinned to data/. Set RAW = <LOAD_RAW_DATA> to wherever the data actually is |
| Gitignore the whole data/ folder | The committed deliverables (data/eda.*) live there. Ignore only specific input patterns, and ask the user first |
| Run EDA without the agent feature by hand-writing the expected output | Fabricated EDA is worse than none. Missing runner → G-AGENT-FEATURE (install) or the skip path |
| pixi add ipython directly from this skill | Install is owned by python-env-manager. This skill requests via G-AGENT-FEATURE |
| Drop the authored data/eda.md and leave only the HTML | The .md carries the modelling implications the baseline note cites and the JOURNAL section links. Both are required |
| Invent column meanings not visible in the data | Report what the data shows. Domain semantics the user didn't state go in an explicit "open questions" list, not as asserted fact |
| Forget the JOURNAL § Data understanding update | The section is the index entry; without it later sessions can't find the EDA. It is part of "done" |
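
To see why a bare last expression is easier to read back than `print(...)`, here is a toy cell executor built on `ast`. It is an illustrative sketch only, not the implementation of run_cells.py; the function name `run_cell` is hypothetical.

```python
import ast
import contextlib
import io


def run_cell(source: str, namespace: dict) -> tuple[object, str]:
    """Execute a cell; return (value of a trailing bare expression, stdout)."""
    tree = ast.parse(source)
    last_expr = None
    # Peel off a trailing expression statement so its value can be captured.
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last_expr = ast.Expression(tree.body.pop().value)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(tree, "<cell>", "exec"), namespace)
        result = None
        if last_expr is not None:
            result = eval(compile(last_expr, "<cell>", "eval"), namespace)
    return result, buffer.getvalue()


ns: dict = {}
bare_result, bare_stdout = run_cell("rows = [1, 2, 3]\nlen(rows)", ns)
print_result, print_stdout = run_cell("print(len(rows))", ns)
```

The bare form yields a structured value (`bare_result` is the integer itself), while the `print` form yields `None` as its result and leaves the number buried in a stdout string.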
Pre-flight — emit before any write or execution
Pre-flight (explore-ml-data):
- [ ] Trigger: bootstrap-G-EDA | user-request | data-changed
      Evidence: caller + rule that matched
- [ ] Detection: EDA already present? data/eda.md + JOURNAL §EDA
      Evidence: ls / Glob on data/eda.md + Read JOURNAL §EDA
              | "n/a — first EDA"
- [ ] G-EDA resolved: run | skip
      Evidence: AskUserQuestion id=<id>, answer=<run|skip>
              | user free-text quote turn N
      If skip: JOURNAL §EDA records "Status: skipped — <date>"; STOP here.
- [ ] Tabular library known (G-TABULAR): pandas | polars
      Evidence: JOURNAL.md Status (Workspace decisions) | AskUserQuestion
                via data-science-python-stack
- [ ] Raw data located (may be outside data/): <paths / loader>
      Evidence: ls / Glob on the data location + the RAW load call placed
                in data/eda.py | user-quoted path turn N
- [ ] data/ not gitignored as a whole; deliverables will be tracked
      Evidence: `git check-ignore data/eda.md` returns nothing
              | AskUserQuestion id=<id> on ignoring specific inputs
              | "n/a — no .gitignore yet"
- [ ] Agent feature available (run path only):
      `pixi run -e agent ipython -c "print(0)"` exit 0
      Evidence: tool output | JOURNAL.md Status `agent feature: installed`
      Missing → STOP, delegate to python-env-manager G-AGENT-FEATURE
                (decline → fall back to skip path)
- [ ] python-api consulted for symbols used:
        skrub.TableReport, TableReport.write_html, TableReport.json,
        skrub.column_associations, the tabular reader (load cell only)
      Evidence: Read/Write scratch/api/<lib>/<version>/<topic>.md (this turn)
              | "n/a — cache hit + Read this turn"
- [ ] Template copy + substitution decided:
        <pkg>            → package name from src/<pkg>/
        <LOAD_RAW_DATA>  → the real loader, pointing wherever the data lives
        <TARGET_COLUMN>  → the target (from goal / data/README.md), or n/a
        <table>          → short slug per table for eda_<table>.html
      Evidence: Read templates/eda.py this turn before Write data/eda.py
- [ ] Execution command shape confirmed:
        pixi run -e agent python \
          .agents/skills/audit-ml-pipeline/scripts/run_cells.py \
          data/eda.py [scratch/eda/eda.md]
      Evidence: command emitted before running
- [ ] Deliverables written: data/eda.md (prose + implications),
      data/eda_<table>.html (≥1), JOURNAL §Data understanding
      Evidence: Write of each | "n/a — skip path"
- [ ] Pre-flight re-emitted with evidence before final message.
      Evidence: this checklist appears in the end-of-turn summary.
EDA file contract — overview
data/eda.py is jupytext percent format (# %%), executed by
the shared runner. Template: templates/eda.py. Full cell-by-cell
anatomy with right / wrong shapes: → references/cell_anatomy.md.
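
For readers unfamiliar with the percent format, the sketch below splits a script into cells on `# %%` marker lines and treats `# %% [markdown]` as a markdown cell. It is a simplified illustration, not jupytext's parser and not the shared runner: it ignores cell metadata and drops any text before the first marker. The name `split_cells` and the sample script are hypothetical.

```python
def split_cells(script: str) -> list[dict]:
    """Split a percent-format script into cells on lines starting with '# %%'."""
    cells: list[dict] = []
    current = None
    for line in script.splitlines():
        if line.startswith("# %%"):
            kind = "markdown" if "[markdown]" in line else "code"
            current = {"kind": kind, "lines": []}
            cells.append(current)
        elif current is not None:
            current["lines"].append(line)
    return cells


example = """\
# %% [markdown]
# # EDA for the dataset
# %%
RAW = [(1, "a"), (2, "b")]
len(RAW)
"""
cells = split_cells(example)
```

Each code cell's last line is the bare expression whose value ends up in the digest, which is why the cell sequence in this section specifies what every cell ends on.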
Substitutions
| Placeholder | Replaced with |
|---|---|
| <pkg> | The importable package name (from src/<pkg>/); used for from <pkg> import PROJECT_ROOT (only to locate EDA_DIR = PROJECT_ROOT / "data") |
| <LOAD_RAW_DATA> | The real load of the raw file(s), pointing wherever the data lives (in data/, another folder, an absolute path, or external). Uses the workspace tabular lib (pandas/polars); skrub accepts both. The one library-specific line |
| <TARGET_COLUMN> | The target column name (from the goal / data/README.md), or remove the target cell if unsupervised / unknown |
| <table> | A short slug per table for the HTML filename (eda_<table>.html) — for a single table use the dataset name |
Cell sequence (what each cell does)
Brief outline; concrete examples → references/cell_anatomy.md.
1. Module docstring (markdown) — what this file is, the read-only-against-raw-data rule, raw-vs-deliverables split, how it is executed.
2. Imports + paths (code) — import json, import skrub, from <pkg> import PROJECT_ROOT, EDA_DIR = PROJECT_ROOT / "data" (+ EDA_DIR.mkdir(parents=True, exist_ok=True)). No pandas/polars import here.
3. Load raw data (code, bare expression) — RAW = <LOAD_RAW_DATA> pointing wherever the data lives; end on RAW.shape.
4. Per-table overview (code) — report = skrub.TableReport(RAW, title=..., verbose=0); report.write_html(EDA_DIR / "eda_<table>.html"); then summary = json.loads(report.json()) and end on a dict / list of per-column dtype / null / cardinality facts. One such cell per table.
5. Target analysis (code, bare expression) — pick the target's entry out of summary["columns"]; it carries value counts (classification) or a distribution summary (regression). Drives the metric default and whether the splitter should stratify.
6. Structure signals (code, bare expression) — datetime columns (from skrub's inferred dtypes, catches string dates) and high unique-ratio id/group columns. Drives the G-CV-SPLITTER choice (TimeSeriesSplit / GroupKFold).
7. Associations (code, bare expression) — skrub.column_associations(RAW) to flag strong predictors and possible leakage.
8. End (markdown) — reminder that the agent now authors data/eda.md + the JOURNAL section from this digest.
write_html(...) is load-bearing on the overview cells (the human
artifact). verbose=0 and the bare report.json()-derived
expressions are load-bearing for a clean, library-agnostic digest.
For multi-table data, run cells 5–7 on the target-bearing table; for
very large data, load a row sample (see references/cell_anatomy.md).
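
The unique-ratio check in the structure-signals cell can be sketched as a pure function over per-column facts, so it stays independent of pandas or polars. Everything specific here is an assumption for illustration: the key names `name` and `n_unique`, the 0.95 threshold, and the sample columns.

```python
def id_like_columns(columns: list[dict], n_rows: int, threshold: float = 0.95) -> list[str]:
    """Names of columns whose distinct-value count is close to the row count."""
    flagged = []
    for col in columns:
        n_unique = col.get("n_unique")  # hypothetical key name
        if n_unique is None or n_rows <= 0:
            continue  # missing fact: skip rather than guess
        if n_unique / n_rows >= threshold:
            flagged.append(col.get("name", "<unnamed>"))
    return flagged


# Hypothetical per-column facts for a 1000-row table.
columns = [
    {"name": "order_id", "n_unique": 1000},
    {"name": "user_id", "n_unique": 240},
    {"name": "country", "n_unique": 12},
    {"name": "notes"},  # cardinality not reported
]
flagged = id_like_columns(columns, n_rows=1000)
```

A column flagged this way is only a candidate signal for data/eda.md; whether it matters for the splitter is decided later in G-CV-SPLITTER.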
Execution contract — one command
pixi run -e agent python \
.agents/skills/audit-ml-pipeline/scripts/run_cells.py \
data/eda.py
The runner (shared with audit-ml-pipeline) streams the digest to
stdout — the agent reads it directly from the bash tool output. Pass
a second arg scratch/eda/eda.md to also write the digest to a file.
For non-pixi workspaces, swap the activation prefix per
python-env-manager § "Agent feature".
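
As a small aid for the "command emitted before running" evidence row, the command shape can be assembled as an argument list and printed. This sketch only builds the string; it does not run anything. The helper name `runner_command` is hypothetical, while the runner path and arguments are the ones given in this section.

```python
import shlex
from typing import Optional

RUNNER = ".agents/skills/audit-ml-pipeline/scripts/run_cells.py"


def runner_command(script: str, digest_path: Optional[str] = None) -> list[str]:
    """Build the pixi invocation; the digest file argument is optional."""
    argv = ["pixi", "run", "-e", "agent", "python", RUNNER, script]
    if digest_path is not None:
        argv.append(digest_path)
    return argv


cmd = runner_command("data/eda.py", "scratch/eda/eda.md")
printable = shlex.join(cmd)  # one shell-safe line to emit as evidence
```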
This skill ships no runner of its own — there is no
explore-ml-data/scripts/. Always invoke the shared
audit-ml-pipeline/scripts/run_cells.py at the path above; don't
look for or fork a local copy.
Prerequisites for the run path: the workspace package must be
importable (from <pkg> import PROJECT_ROOT — editable install done
during scaffold) and skrub installed (Tier 1). If either import
fails, the digest shows the ImportError; route to
python-env-manager for the missing piece rather than working around
it.
Re-execution semantics
- A changed / added data source → overwrite data/eda.py, re-run, re-author data/eda.md + HTML, refresh the JOURNAL section.
- scratch/eda/ is overwritten on every run. The durable record is data/eda.py + data/eda.md + git history.
Authoring data/eda.md
After the run, read the digest and write data/eda.md from
templates/eda.md. It is prose, grounded in the digest — no invented
facts. Required sections:
- Dataset at a glance — tables, rows × columns, target.
- Per-column findings — dtypes, missingness, cardinality highlights, anything surprising.
- Target — balance / skew; class counts or distribution summary.
- Structure — datetime ordering, groups / ids (or "none found").
- Associations — strong feature↔target / feature↔feature links; flag possible leakage explicitly.
- Modelling implications — the payoff section. Translate findings into candidate picks the baseline note will weigh: e.g. "imbalanced target → StratifiedKFold + look at ROC-AUC / PR-AUC, not accuracy"; "user_id repeats across rows → consider GroupKFold"; "timestamp present → TimeSeriesSplit if forecasting". These are implications, not decisions — the gates own the picks.
- Open questions — domain ambiguities for the user to confirm.
Link each data/eda_<table>.html from the relevant section.
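
The findings-to-implications step can be pictured as a simple mapping. The sketch below is illustrative only: the `findings` keys, the 0.2 imbalance cutoff, and the function name are hypothetical, and the output strings are candidate notes for data/eda.md, not decisions.

```python
def modelling_implications(findings: dict) -> list[str]:
    """Translate digest findings into candidate picks (not decisions)."""
    notes = []
    share = findings.get("minority_class_share")
    if share is not None and share < 0.2:  # hypothetical imbalance cutoff
        notes.append(
            "imbalanced target: consider StratifiedKFold and ROC-AUC / PR-AUC, not accuracy"
        )
    for col in findings.get("repeating_id_columns", []):
        notes.append(f"{col} repeats across rows: consider GroupKFold")
    for col in findings.get("datetime_columns", []):
        notes.append(f"{col} present: consider TimeSeriesSplit if forecasting")
    return notes


# Hypothetical findings read off a digest.
findings = {
    "minority_class_share": 0.07,
    "repeating_id_columns": ["user_id"],
    "datetime_columns": ["timestamp"],
}
notes = modelling_implications(findings)
```

In practice the section is authored as prose from the digest; the point of the sketch is that every implication traces back to a recorded finding.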
JOURNAL § Data understanding (EDA)
iterate-ml-experiment's JOURNAL.md carries a top-level
## Data understanding (EDA) section (placed right after ## Status). This skill owns its content:
## Data understanding (EDA)
- **Status:** done — <YYYY-MM-DD> <!-- or: skipped — <YYYY-MM-DD> -->
- **Summary:** <2–4 lines: dataset shape, target balance/skew, the
one or two findings that most shape the modelling choices>
- **Report:** [data/eda.md](../data/eda.md)
Keep it to a few lines — it is an index entry, not the report. The
detail lives in data/eda.md. On the skip path, only the
Status: skipped line is required.
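
A minimal sketch of rendering the section for both paths, assuming the template shown above. The function name `journal_section` is hypothetical; the skip path emits only the Status line, as described.

```python
from datetime import date


def journal_section(status: str, day: date, summary: str = "") -> str:
    """Render the JOURNAL index entry; the skip path emits only the Status line."""
    # \u2014 is the dash used in the section template's Status line.
    lines = [
        "## Data understanding (EDA)",
        f"- **Status:** {status} \u2014 {day.isoformat()}",
    ]
    if status == "done":
        lines.append(f"- **Summary:** {summary}")
        lines.append("- **Report:** [data/eda.md](../data/eda.md)")
    return "\n".join(lines)


skipped = journal_section("skipped", date(2024, 1, 2))
done = journal_section("done", date(2024, 1, 2), "hypothetical summary text")
```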
Dispatching in and out
Called from
| Caller | When |
|---|---|
| iterate-ml-experiment § 0 bootstrap | Automatic; G-EDA fires before the baseline design note |
| User free-text | "explore the data", "do an EDA", "profile the dataset" — resolves directly |
Calls into
| Callee | Why |
|---|---|
| python-env-manager § Agent feature | When ipython is missing on the run path — G-AGENT-FEATURE |
| python-api | Every skrub / pandas / polars symbol. Cache hits first |
| data-science-python-stack | G-TABULAR (pandas / polars) if not yet recorded; skrub TableReport reference |
| python-code-style | After writing data/eda.py — ruff format / check + contextualize the comments to this dataset (strip any leftover workflow/process prose) |
What this skill does NOT do
- Design, select, or evaluate a model (build-ml-pipeline / evaluate-ml-pipeline / iterate-ml-experiment).
- Pick the CV splitter or metric — it only surfaces the evidence for those picks.
- Edit src/<pkg>/ or the experiment / audit files.
- Clean, transform, or re-save the user's raw data.
- Install ipython / pyright (python-env-manager owns).
- Open or write the skore Project.
- Render commits or PRs.
Companion skills
| Skill | Relationship |
|---|---|
| iterate-ml-experiment | Caller. § 0 fires G-EDA before the baseline note; the EDA findings seed the note's Method / Risks |
| audit-ml-pipeline | Owns the shared cell runner scripts/run_cells.py this skill executes; same bare-expression discipline |
| organize-ml-workspace | Workspace layout; data/ is user-owned — this skill is the one exception that writes data/eda.* into it |
| python-env-manager | Agent feature install (G-AGENT-FEATURE). This skill requests; that skill installs |
| python-api | skrub / pandas / polars symbol lookups. Cache hits first |
| data-science-python-stack | G-TABULAR; skrub TableReport is catalogued there |
| python-code-style | ruff after writing data/eda.py |
Templates and assets
- templates/eda.py — the data/eda.py skeleton. Copy + substitute; don't rewrite from memory.
- templates/eda.md — the data/eda.md report skeleton.
The cell runner is not owned here — it is
audit-ml-pipeline/scripts/run_cells.py (shared). Don't fork it.
References (load on demand)
- references/cell_anatomy.md — concrete cell examples (right / wrong shapes), the TableReport repr trap, the full cell sequence, and how each finding maps to a downstream gate.