name: build-ml-pipeline
description: >
Declare the pipeline from data source to predictor as a skrub
DataOps graph (not as a bare sklearn.Pipeline). Every step is
either a pure-Python function (stateless) attached via
.skb.apply_func, or a sklearn-compatible estimator (stateful)
attached via .skb.apply. Stops at the declared object — no fit,
split, tuning, persistence, or evaluation.
TRIGGER — any of:
- Writing or editing code that declares any link in the chain
data source → predictor: loaders, preprocessing, encoders /
imputers / scalers, feature steps, composition objects
(
Pipeline,ColumnTransformer, skrubtabular_pipeline,nn.Module), or the final estimator. - A pure-Python data-processing function destined for the
pipeline path (cleans / derives / reshapes) — whether wrapped
via
FunctionTransformer,skrub.@deferred/skrub.var, a customBaseEstimatorsubclass, or just called in the training path before the estimator. - A step is added, removed, swapped, or reordered inside an existing pipeline declaration.
- A bare
sklearn.Pipeline/make_pipelineis being used as the top-level — fire to redirect into a skrub DataOps graph. - The user asks to build / declare / set up a pipeline / classifier / regressor for X.
SKIP when: .fit(...) calls / training loops / Trainer.fit /
epoch loops; train/test split or cross-validation splitting;
hyperparameter search; persistence (joblib.dump, checkpointing);
evaluation / metrics / scoring; inference over a pre-trained
model; pure EDA; library-choice questions with no concrete
declaration in play.
HOW TO USE: consult before the first declarative line and on
every structural edit (added/swapped step, changed input columns,
changed estimator family). Don't re-consult for cosmetic edits.
First, read the Stop conditions and emit the Pre-flight
checklist as visible text before any code. Always invoke
python-api to confirm skrub / sklearn symbol names and
signatures before typing — don't guess from memory.
Build ML Pipeline (Declaration)
Declarative shape of a Python ML pipeline from data source to predictor.
Terms used in this skill
Read these once; they're referenced throughout.
- X marker — the
.skb.mark_as_X()call that anchors the predict-time slice. Everything upstream runs identically at fit and predict; everything downstream is per-prediction work. - Predict grid — the rows you want predictions for at predict
time. For IID flat tables: the loaded frame itself. For
time-series / panels: a
(group, time)set. - Cold-start row — a predict-grid row that has no in-slice history available (typical for lags at the start of the slice).
- Predict-time replay — re-binding the graph to a fresh source
identifier at predict (e.g.
learner.predict({"data_dir": …})). - Cross-row step — a feature whose output for a row reads values from other rows (lag, rolling window, group aggregation, side-table join by time/group, drop_nulls on a shifted column).
- Layers 1 / 2 / 3 — source / predict-grid + X-marker / features after the marker. Defined in Rule 2.
Next-step pointers — where you go after this skill
| You came here for… | → next |
|---|---|
| Declared pipeline → CV strategy | → evaluate-ml-pipeline (the G-CV-SPLITTER gate, rule 3) |
| Declared pipeline → smoke test | → test-ml-pipeline → smoke-test-ml-pipeline |
| Symbol lookup mid-declaration | → python-api (Shape 1 / 1b / 3) |
| Missing skrub/sklearn import | → python-env-manager § install |
Modified pipeline.py / features.py / data.py |
→ python-code-style (ruff + NumPyDoc) |
Always re-emit the Pre-flight checklist with evidence before declaring the turn done.
Canonical pipeline shape — IID flat-table
The 90% case. Copy + adapt; replace TARGET_COL and the regressor.
import skrub
from sklearn.ensemble import HistGradientBoostingRegressor
from <pkg>.data import TARGET_COL, load_raw
def build_learner(data_dir_preview=None):
"""Return the unfit learner (skrub SkrubLearner)."""
data_dir = (
skrub.var("data_dir", value=str(data_dir_preview))
if data_dir_preview is not None
else skrub.var("data_dir")
)
# Layer 1 + 2: load + mark X / y on the source frame.
# No cross-row feature steps → marker sits here.
data = data_dir.skb.apply_func(load_raw)
X = data.drop(columns=[TARGET_COL]).skb.mark_as_X()
y = data[TARGET_COL].skb.mark_as_y()
# Layer 3: estimator at the tail. Feature engineering (if any)
# chains between mark_as_X and the final .skb.apply.
predictions = X.skb.apply(
HistGradientBoostingRegressor(random_state=0), y=y
)
return predictions.skb.make_learner()
For history-dependent / panel / cold-start cases (≠ IID):
→ references/layer_examples.md § history-dependent.
For loader-baked-shift counter-example (what NOT to do):
→ references/layer_examples.md § counter-example.
Stop conditions — read before anything else
Each Stop condition: rule → symptom → recovery. Scan top to bottom; any match means STOP.
S1. Missing dependency
- Rule:
import skrubraising meanspython-env-manageris next, not a substitute library. - Symptom:
ModuleNotFoundError: No module named 'skrub'. - Recovery: invoke
python-env-managerfor the install command. Do NOT substitute withsklearn.Pipeline/make_pipeline/FunctionTransformer— that silently rewrites this skill out of the project.
S2. Symbol from memory is forbidden
- Rule: every skrub / scikit-learn / skore name must come from
a
python-apilookup this turn. - Symptom: you type
tabular_learner(renamed in 0.7+),mark_as_y(col)(signature dropped the positional in 0.9+), or any name "you remember". - Recovery: invoke
python-api. Recognition is not a lookup; names drift between releases.
S3. Splitter selection is out of scope
- Rule: no
KFold/StratifiedKFold/train_test_split/ any splitter import in pipeline code. - Symptom: you're about to type
from sklearn.model_selection import KFoldinpipeline.py. - Recovery: that's
evaluate-ml-pipeline's territory. This skill only wiressplit_kwargsAT the X marker (see Rule 2).
S4. skrub.X(...) / skrub.y(...) are not acceptable graph roots
- Rule: root on
skrub.var("<source>", value=preview)instead. - Symptom: code starts with
skrub.X(df)/skrub.y(s). - Recovery: rewrite to
skrub.var("data_dir", value=...)→.skb.apply_func(load_fn)→.skb.mark_as_X(). The shortcuts (1) bake the marker at the source — defeating Layer 1; (2) force a pre-loaded binding, breaking predict-time replay; (3) silently re-enable the late-mark_as_Xbug for cross-row features.
S5. Late mark_as_X is forbidden when any feature step is cross-row
- Rule: for any cross-row step (lag, rolling, group-agg,
target shift, side-join,
drop_nullson shifted col), the X marker goes UPSTREAM of that step. The step references the cross-row source as an additionalapply_funcargument (Layer 1 source → Layer 3 feature, via the marker bypass). - Symptom: the smoke test fails on
len(predictions) != n_predict_grid_rows; OR afeature_steps=[]toggle appears inbuild_learner"to make predict work for cold-start"; OR a temp-dir gymnastic at predict time to fake history; OR a wrapper estimator whose only job is to filter NaN rows the pipeline itself produced. (Don't be misled by syntax —pl.col("x").shift(k)IS cross-row.) - Recovery: fix the graph topology via Rule 2's three-layer
model. Don't loosen the smoke-test assertion. Don't wrap the
predictor. Don't
feature_steps=[]. - Proof: smoke test (
smoke-test-ml-pipeline) — pipeline with marker in the right place passes by construction.
S6. Layer 1 doesn't know the question
- Rule: Layer 1 (sources + loaders) describes what data exists. Anything that requires knowing which rows we want predictions for — any horizon / lag / window / shift — belongs to Layer 2 or downstream, never Layer 1.
- Symptom: the loader's body contains a
target.shift(-HORIZON), adrop_nulls("y"), or any task-specific filter. - Recovery: push the task-specific operation past the marker into Layer 2 (target derivation via a stateful estimator) or Layer 3 (history-dependent feature). The smoke test passes trivially when the bug is fused into Layer 1 — CV looks fine, and the structural debt only surfaces when the next experiment composes against the raw source.
- Constructive test: would an external consumer — a SQL view, a feature store, a second model — derive this same output without knowing your task? No → push it past the marker.
S7. All Python execution goes to scratch/
- Rule: every Python command (version check, signature
lookup, data inspection, loader sanity-check, anything) lands
in
scratch/<YYYY-MM-DD>_<HHMMSS>_<short>.pyand runs viapixi run python scratch/<ts>_<short>.py. - Symptom: you catch yourself typing
pixi run python -corpython -c. - Recovery: write the file first, then execute. Inline is
forbidden regardless of length (see
python-api§ Stop conditions). No 2-line carve-out.
S8. Don't filter warnings
- Rule: no
warnings.filterwarnings(...)inpipeline.pyor scratch probes unless the user explicitly asks. Seepython-code-style§ Stop conditions.
Forbidden shortcuts
| Shortcut | Why it's wrong |
|---|---|
tabular_learner from memory |
Renamed to tabular_pipeline in skrub 0.7+. Memory typed → ImportError on modern installs |
mark_as_y(target_column) positional arg |
Dropped in 0.9+. Use .skb.select("...") BEFORE the mark |
skrub.X(df) / skrub.y(s) as roots |
Forbidden (S4). Use skrub.var("<source>", value=...) |
value="data/train.parquet" literal in pipeline.py |
Resolves against CWD; breaks runs from non-root dirs. Expose data_dir_preview as kwarg; caller passes PROJECT_ROOT / "data" |
feature_steps=[] toggle "to make predict work" |
S5 symptom. Fix the graph, not the predict-time bypass |
skore.evaluate(learner, X, y, ...) |
SkrubLearner takes an env-dict. Use data={"data_dir": ..., ...} |
bare sklearn.Pipeline as top-level |
Rewrite as skrub DataOps graph (Rule 1) |
Inline pixi run python -c "..." |
S7. Write to scratch/<ts>_*.py instead |
Pre-flight — emit before any code
Each ticked box requires an actual tool call this turn. Empty Evidence = unchecked.
Pre-flight (build-ml-pipeline):
- [ ] Tier 1 mandatory libs importable: sklearn, skrub, skore
Evidence: scratch/<ts>_check_tier1.py + `pixi run python …` output.
**Inline `python -c` is NOT evidence.**
- [ ] Tabular library identified: pandas | polars
Evidence: JOURNAL.md Status (Workspace decisions) | user quote
| "n/a — pandas already in loader signature"
- [ ] python-api consulted for skrub symbols this turn
Evidence: Read scratch/api/skrub/<v>/<topic>.md (this turn)
| "n/a — no new skrub symbol this turn"
- [ ] python-api consulted for sklearn symbols this turn
Evidence: Read scratch/api/sklearn/<v>/<topic>.md (this turn)
| "n/a — no new sklearn symbol this turn"
- [ ] Source-binding pattern chosen
Evidence: list each planned `skrub.var("<name>")` and state
whether it's a source identifier (e.g. `data_dir`)
or a predict-grid descriptor. IID: one `skrub.var`
rooted on the loaded frame is enough.
- [ ] X-marker placement decided
Evidence: name the DataOp node where `.skb.mark_as_X()` lands.
IID: on the loaded source frame. Panel / cold-start:
on the predict-grid node, BEFORE any history-dep step.
- [ ] (Cross-row pipelines only) Each cross-row step references the
upstream history DataOp as an extra `apply_func` arg
Evidence: name each step + its history-DataOp argument
| "n/a — no cross-row steps"
- [ ] Layer 1 audit — every `apply_func` upstream of `mark_as_X`
passes the constructive test (S6)
Evidence: per-step "external consumer would derive this: yes/no"
- [ ] Preview value handling
Evidence: `build_learner` exposes `data_dir_preview=None` kwarg;
no relative-path literal baked into `pipeline.py`
- [ ] split_kwargs at the X marker decided: groups | time | none
Evidence: name the column(s) wired OR "n/a — i.i.d., no group
or time structure"
- [ ] Smoke test wired (`tests/smoke/test_NN_<short_name>.py`)
Evidence: per `smoke-test-ml-pipeline`; trivial assertions if no
history-dep
- [ ] Pre-flight re-emitted with evidence before final message.
Evidence: this checklist appears in the end-of-turn summary.
Scope
- In scope: how the pipeline object is composed — source wiring, preprocessing/feature steps, estimator at the tail.
- Out of scope: fitting, splitting, tuning, persisting, evaluating — those have their own skills.
Core rules
Rule 1 — Skrub DataOps is the pipeline entry point
Declare the pipeline as a skrub DataOps graph rooted at one or
more skrub.var(...) calls — not as a bare
sklearn.Pipeline. The skrub.X(...) / skrub.y(...) shortcuts
are not acceptable roots (see S4). Look up the underlying
signatures via python-api.
Reference: https://skrub-data.org/stable/data_ops.html
→ next: Rule 2 (where the marker goes).
Rule 2 — Mark X early; featurize after
The marker is the shared-vs-predict-specific boundary.
One question to place the marker: does any feature step look at rows other than the one currently being processed?
| Answer | Placement | Pattern |
|---|---|---|
| No (per-row math, stateful encoders that learn at fit and apply per-row) | Marker on the loaded source frame | Canonical IID example above |
| Yes (lag / rolling / cross-row join / target-shift) | Marker UPSTREAM of every cross-row step | Three-layer model below |
The three logical layers:
Layer 1 — Sources. One
skrub.var(...)per input identifier: raw history file(s) / URL(s) / table name(s), side tables, and — for time-series / cold-start panels — the predict-time-grid description (start/endrange, list of(group_id, time)). The loader for each source is its first.skb.apply_func. Loaders are pure functions of a single source identifier. Do not load + featurize in oneapply_func— that fuses Layers 2 + 3 with the loader and breaks predict-time replay.Layer 2 — Predict-time grid + X marker. A DataOp whose rows are exactly the predict grid.
- IID flat tables: this IS the loaded source frame.
- Time-series / panel: the
(group, time)grid derived from Layer 1's predict-time bounds.
mark_as_Xandmark_as_ygo here. Target derivation that requires history (anddrop_nullsony) belongs to a small statefulBaseEstimatorwithfit_transform → {X, y}/transform → {X, y=None}, attached at this layer.Layer 3 — Feature engineering.
apply_funcchained on the X-branch aftermark_as_X. History-dependent steps take the X DataOp as their first argument and the relevant Layer-1 source DataOp(s) as additional arguments — history is referenced, not bound to X. The same history node materializes the full available history at fit and at predict, so a backward lag computed for a row in the predict grid sees real values from the train history — no cold-start NaN.
Worked examples (full code, IID + history-dependent +
counter-example): → references/layer_examples.md. Also see
python-api/references/pre_mark_alignment.md for the
production-style three-layer walkthrough drawn from this
workspace's 01_baseline.
Preview value is a caller-supplied parameter, not a literal in
pipeline.py. value= controls what learner.skb.preview()
sees during interactive iteration — nothing else. A literal like
value="data/train.parquet" resolves against CWD and silently
breaks runs not started from the project root. Expose the preview
as an optional kwarg on build_learner and leave it None for
production fit / cross-validate.
Downstream evaluation contract. A SkrubLearner does NOT
implement sklearn's fit(X, y) signature — it takes an
environment dict. Pair with
skore.evaluate(learner, data={"data_dir": ..., ...}, splitter=...),
never with skore.evaluate(learner, X, y, ...) (raises). See
evaluate-ml-pipeline; confirm signatures via python-api.
Cross-validation metadata at the X marker. If the data has
group structure (subjects, sessions, customer IDs, repeated
measures) or temporal ordering, attach the relevant column at
.skb.mark_as_X(split_kwargs={...}):
X = data.drop(columns=[...]).skb.mark_as_X(
split_kwargs={"groups": data["customer_id"]},
)
Keys map to the cross-validator's split(X, y, **split_kwargs)
(e.g. groups). Ask the user when you can't tell from data
alone whether such structure exists — name suspect columns
(anything ending in _id, columns called subject / session /
region, any date / timestamp for temporal ordering) and
ask whether to wire them. Don't silently leave split_kwargs
empty when group structure is plausible — that produces optimistic
CV downstream. Choosing the splitter itself is
evaluate-ml-pipeline's job; this skill only wires the metadata.
When editing an existing pipeline that uses skrub.X /
skrub.y or binds materialized data: do not auto-rewrite.
Surface the source-bound alternative and ask whether to refactor.
Full catalogue: → references/source-binding.md.
→ next: Rule 3 (attach mechanism).
Rule 3 — Attach data modifications via .skb
Two attach points:
.skb.apply_func(fn)— wraps a callable that transforms data..skb.apply(estimator)— wraps any sklearn-compatible estimator (transformer in the middle, or the final predictor).
When to use skrub.deferred instead of apply_func: rare —
only when the callable must combine multiple DataOps at once
(e.g. a custom join over two tables). Even then, check whether a
skrub joiner (Joiner / AggJoiner / MultiAggJoiner) covers it
first. Default: .skb.apply_func. Details:
→ references/source-binding.md.
→ next: Rule 4 (function vs estimator).
Rule 4 — Stateless → function. Stateful → estimator.
The only decision rule for picking apply_func vs apply:
- Stateless — output for a row depends only on that row (and constants). No info borrowed across rows.
- Stateful — needs statistics / vocabulary / learned parameters fit on training data and re-applied unchanged to test data.
# Stateless — pure function + apply_func
import numpy as np
X = X.skb.apply_func(lambda df: df.assign(log_price=np.log1p(df["price"])))
# Stateful — estimator + apply
from sklearn.preprocessing import StandardScaler
X = X.skb.apply(StandardScaler())
If a step would silently learn from the test set when called as a plain function, it is stateful — promote it.
→ next: Rule 5 (leakage check).
Rule 5 — Leakage rule
Any computation using statistics learned from the data (means, medians, quantiles, vocabularies, target distribution) MUST be stateful. Calling such a computation as a plain function over the whole frame leaks test into training.
# WRONG — pct rank fits on the full frame, leaks test into training
X = X.skb.apply_func(lambda df: df.assign(p=df["x"].rank(pct=True)))
# RIGHT — quantile transformer learns on training fold only
from sklearn.preprocessing import QuantileTransformer
X = X.skb.apply(QuantileTransformer(output_distribution="uniform"))
Classic traps by name:
- target encoding (must
fiton training y only), - target-aware or quantile-based imputation,
- quantile binning /
KBinsDiscretizer(strategy="quantile"), OrdinalEncoder/LabelEncoderwhose categories come from the full dataset rather thanfiton training only,- vocabulary-building text tokenizers, TF-IDF, IDF weights.
Litmus test: would this output change if I called it on the
training subset alone vs the whole frame? If yes → stateful →
.skb.apply with an estimator, never .skb.apply_func.
→ next: Decision flow.
Decision flow for a new step
- Does the operation only need the current row (and constants)?
→ stateless → pure Python function +
.skb.apply_func. - Otherwise it must learn from training data and reapply on test.
→ stateful → sklearn-compatible estimator +
.skb.apply.
→ next: Reproducibility (when touching shared modules).
Reproducibility — extending without breaking prior experiments
iterate-ml-experiment enforces a hard rule: every done row in
JOURNAL.md History must stay runnable on main and produce the
same result. When touching a shared module under src/<pkg>/,
default behavior must preserve prior experiments' shape.
Three options, picked by judgment (full procedures + worked
examples: → references/reproducibility_mechanics.md):
- Option 1 — parametrize the existing function (with a default-preserving flag). Pick when the change is small and scoped: a step appended at the end, a single conditional, a stateless transform that adds columns without reshaping existing ones. The flag's default mirrors prior behavior.
- Option 2 — add a new function called only from the new experiment. Pick when the change doesn't fit cleanly behind a flag: new estimator at the tail, a step that reshapes the graph, or Option 1 would grow ugly internal branching.
- Option 3 — branch the module. Last resort. Only when the change touches enough internal structure that Options 1 and 2 would obscure the diff. Usually a signal of a deeper layering issue worth surfacing to the user.
Tripwires (load-bearing)
- 3+ flags in one function → parametrization is leaking; reach for Option 2 next.
- Visible branching in the function body that makes it hard to read → reach for Option 2.
- A flag changes default behavior of an existing caller → STOP. Rule broken. Either keep the default preserving, or use Option 2.
Cheap executable check
iterate-ml-experiment § 3's smoke-test gate runs all of
tests/smoke/, not just the new one. A prior smoke test going
red after a change = default behavior not preserved. Fix before
declaring the new experiment ready.
→ next: Common patterns (for recurring shapes).
Common patterns
Short catalogue. Look up exact symbols in python-api. Full
catalogue with code: → references/common_patterns.md.
- Heterogeneous columns — skrub column selectors with
cols=on.skb.apply(one apply per group), notColumnTransformer. - Default starting point for tabular data — reach for
skrub.tabular_pipeline(...)orTableVectorizer+ estimator first; specialize column-by-column only when default is insufficient. - Multi-table inputs — one
skrub.var(...)per table; join with skrubJoiner/AggJoiner/MultiAggJoinervia.skb.apply(...). - Meta-estimator at the tail —
StackingClassifier,CalibratedClassifierCV,TransformedTargetRegressor. Wrap the predictor first, then attach via.skb.applyas the final step. - Mark hyperparameter knobs in place — wrap with
skrub.choose_from/choose_int/choose_float/optionalinside the declaration. Don't importGridSearchCVhere; the tuning skill owns search. - Custom sklearn transformer — author only when (a) no
built-in fits and (b) the operation is stateful. Subclass
BaseEstimator+TransformerMixin. For a stateless op, write a function and use.skb.apply_func.
Companion skills
| Skill | Relationship |
|---|---|
python-api |
Authoritative lookup of sklearn / skrub / skore. Invoke whenever picking a symbol; cache hits first (Shape 0) |
evaluate-ml-pipeline |
Owns skore.evaluate, CV selection, metric defaults. Consumes the split_kwargs wired at the X marker |
smoke-test-ml-pipeline |
Executable proof of Rule 2's early-mark. Smoke failure → route back here; fix the topology, don't loosen the assertion |
test-ml-pipeline |
Router for tests/. Smoke test pairs 1:1 with the experiment script |
python-env-manager |
Detection + install commands. Invoke when import skrub raises |
python-code-style |
Must be invoked after writing or editing pipeline.py / features.py / data.py. Direct pixi run ruff check drops the NumPyDoc convention |
References (load on demand)
references/source-binding.md— full catalogue of source-binding patterns (encouraged / discouraged / OK-but-offer-upgrade) + theapply_funcvsdeferreddecision.references/layer_examples.md— worked code for the IID flat-table case, the loader-baked-shift counter-example, and the history-dependent three-layer pattern.references/reproducibility_mechanics.md— full Option 1 / 2 / 3 procedures with code, plus the tripwire criterion.references/common_patterns.md— full catalogue of recurring pipeline shapes with code snippets.
Companion skill (planned):
review-ml-pipeline— methodological review of an existing declaration (leakage audit, statelessness check, step ordering, scope creep). When it flags a problem, return here to fix.