name: missing-data-imputation description: Use this skill when the user wants to fill missing values in a tabular dataset and obtain a reusable, fitted scikit-learn imputer plus an auditable report of what was imputed. Triggers include "impute missing values", "fill the NaNs", "handle missing data", "KNN imputation", "iterative / MICE imputation", "remplir les valeurs manquantes", "impute mes données", "gérer les données manquantes", "imputation KNN", "imputation itérative". Supports numeric strategies (mean, median, KNN, iterative/MICE) and categorical strategies (mode, constant), exports a pickled fitted imputer for reuse on new data, and produces a JSON report mapping each column to its strategy, fill value, and missing counts. Aligned to the DAMA-DMBOK2 Completeness dimension.
Missing Data Imputation
A completeness-remediation skill. Takes a dataset with missing values and produces a fitted, pickled scikit-learn imputer that can be reused on new data, plus a machine-readable JSON report documenting every imputation decision. Aligned to the DAMA-DMBOK2 Completeness dimension.
When to use this skill
Activate when the user has a dataset with missing values and wants to fill them in a principled, reproducible way — not just a one-off fillna. Typical signals:
- "Impute the missing values in this dataset"
- "Fill the NaNs so I can train a model"
- "Use KNN / iterative / MICE imputation on the numeric columns"
- "Remplis les valeurs manquantes de ce dataset"
- "Gère les données manquantes avant le feature engineering"
Pre-conditions:
- The dataset is loaded and tabular (CSV, Excel, Parquet).
- The user has decided to impute rather than drop. If missingness is very high (>50% on many columns), warn that imputation may inject noise and suggest dropping those columns instead — but let the user decide.
Do NOT activate this skill for:
- Auditing quality without fixing it → use
data-quality-report - General exploration → use
eda-explorer - Time-series gap filling that needs forward/backward fill or interpolation along the time axis → use
time-series-features(lag/rolling logic); this skill is for cross-sectional imputation.
Position in the pipeline: this skill typically runs after eda-explorer / data-quality-report (which surface the missingness) and before feature-engineer (which assumes a complete matrix). Fit the imputer on the training split only, then reuse the pickled object on validation/test/production data.
Workflow
For every request, follow these 6 phases in order. Never skip Phase 1 (leakage check) or Phase 6 (export) — they are what make the result reusable and auditable.
Phase 1 — Profile missingness & guard against leakage
- Count missingness per column: absolute count and percentage.
- Classify each column as numeric or categorical (use
select_dtypes; treatobject/category/boolas categorical, numeric dtypes as numeric). - Flag fully-missing columns (100% NaN) — these cannot be imputed; recommend dropping them.
- Flag high-missingness columns (>50%) — warn that imputation is low-confidence.
- CRITICAL — leakage guard: the imputer MUST be
fiton training data only. If the user provides a single dataset, ask whether a train/test split exists; if they intend to model, recommend splitting first and fitting the imputer on train, thentransformon test. Neverfit_transformthe test set. - Optional missingness indicators: offer to add
<col>_was_missingboolean columns (SimpleImputer(add_indicator=True)or a manual mask) so the model can learn from missingness itself. Off by default; recommend ON when missingness may be informative (MNAR).
Output a profile table:
| Column | Type | n_missing | pct_missing | Proposed strategy |
|---|---|---|---|---|
| age | numeric | 120 | 6.0% | median |
| income | numeric | 450 | 22.5% | iterative |
| region | categorical | 30 | 1.5% | mode |
| comment | categorical | 1800 | 90.0% | DROP (too sparse) |
Phase 2 — Choose a strategy per column
Default decision rules (the user can override any of them):
Numeric columns:
| Condition | Default strategy | Rationale |
|---|---|---|
| Symmetric / low-skew distribution | mean |
Unbiased center for roughly normal data |
| Skewed or has outliers | median |
Robust to outliers |
| Missingness depends on other numeric columns, moderate # of features | KNN |
Uses neighbours in feature space |
| Complex multivariate relationships, want MICE | iterative |
Models each column from the others |
Default for numeric when unspecified: median (robust, safe baseline).
Categorical columns:
| Condition | Default strategy | Rationale |
|---|---|---|
| A dominant category exists / missingness is incidental | mode (most_frequent) |
Preserves the modal class |
| Missingness is meaningful, or you want it explicit | constant (fill value "Missing" / "Unknown") |
Makes "no value" a first-class category |
Default for categorical when unspecified: mode.
⚠️ KNN and iterative imputers require numeric input. Apply them only to numeric columns. Categorical columns are always handled by SimpleImputer (mode/constant). Build the per-type pipelines separately and combine with ColumnTransformer.
⚠️ IterativeImputer is experimental in scikit-learn — you MUST enable it explicitly:
from sklearn.experimental import enable_iterative_imputer # noqa: F401
from sklearn.impute import IterativeImputer
Phase 3 — Build the imputer (ColumnTransformer)
Construct a single ColumnTransformer so the whole thing is one picklable object. Group numeric columns by their chosen strategy (mean/median can share a SimpleImputer; KNN and iterative each need their own transformer), and categorical columns by strategy (mode vs constant).
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer
transformers = []
if median_cols:
transformers.append(("num_median", SimpleImputer(strategy="median"), median_cols))
if mean_cols:
transformers.append(("num_mean", SimpleImputer(strategy="mean"), mean_cols))
if knn_cols:
transformers.append(("num_knn", KNNImputer(n_neighbors=5), knn_cols))
if iterative_cols:
transformers.append(("num_iter", IterativeImputer(random_state=42, max_iter=10), iterative_cols))
if mode_cols:
transformers.append(("cat_mode", SimpleImputer(strategy="most_frequent"), mode_cols))
if constant_cols:
transformers.append(("cat_const", SimpleImputer(strategy="constant", fill_value="Missing"), constant_cols))
imputer = ColumnTransformer(transformers, remainder="passthrough", verbose_feature_names_out=False)
imputer.set_output(transform="pandas") # keep column names / DataFrame output
Always set random_state=42 on IterativeImputer for reproducibility. Always set_output(transform="pandas") so the transformed result keeps column names.
Phase 4 — Fit, transform & capture fill values
imputer.fit(X_train)— on training data only.X_imputed = imputer.transform(X)— apply to the target frame.- Extract the learned fill values for the report:
SimpleImputerexposes.statistics_(one value per column it handled).KNNImputer/IterativeImputerare model-based — there is no single fill value; record the strategy + hyperparameters and the per-column count of values actually filled (compare the pre/post missing mask).
- Reorder columns back to the original order if
remainder="passthrough"reshuffled them, so downstream code is unaffected. - Sanity check: assert
X_imputed.isna().sum().sum() == 0for the imputed columns (excluding any intentionally-dropped fully-missing columns). If NaNs remain, stop and report — do not export a broken imputer.
Phase 5 — Validate the imputation
Quick post-imputation diagnostics, reported to the user:
- No NaNs remain (per Phase 4 sanity check).
- Distribution drift (numeric): compare mean and std before vs after for each imputed numeric column. Large shifts (e.g. mean moves > 0.5 std) are a yellow flag — note them; mean/median imputation always shrinks variance, so call that out.
- Category balance (categorical): report how many rows went to each fill category; if a
constantfill ("Missing") now dominates a column, flag it. - Row coverage: total cells filled and the resulting Completeness score (see Phase 6).
Phase 6 — Export artifacts
Save 3 files in outputs/imputation/:
imputer_<YYYY-MM-DD_HHMMSS>.pkl— the fittedColumnTransformer, pickled withjoblib.dump. This is the reusable deliverable:joblib.load(...).transform(new_df).imputation_report_<YYYY-MM-DD_HHMMSS>.json— machine-readable report (schema below).imputation_log.txt— append-only log: timestamp + skill_version + source + completeness_before → completeness_after.
Never overwrite the log — append only (consistency with data-quality-report auditability).
DAMA-DMBOK2 alignment — Completeness
This skill operationalises the Completeness dimension (the same dimension scored by data-quality-report). Report the score before and after, using the identical formula so the two skills are comparable:
completeness = 100 * (1 - mean_missing_pct_across_columns / 100)
- Compute
completeness_beforeon the raw frame andcompleteness_afteron the imputed frame (should be ~100 for imputed columns). - Imputation raises measured completeness but not necessarily accuracy — state this explicitly. Filling a value makes the cell present (completeness ✓) but the filled value may be wrong (accuracy ✗). Recommend keeping the missingness indicators (Phase 1.6) and validating downstream model performance.
JSON schema (machine-readable export)
{
"skill_version": "0.1.0",
"generated_at": "2026-06-02T14:05:11Z",
"source": {
"filename": "insurance_sample.csv",
"shape": [2000, 10],
"fit_on": "train_split"
},
"dama_completeness": {
"completeness_before": 91.3,
"completeness_after": 100.0,
"cells_filled": 1740,
"note": "Completeness raised; accuracy of filled values not guaranteed."
},
"columns": [
{
"column": "age",
"type": "numeric",
"strategy": "median",
"fill_value": 42.0,
"n_missing": 120,
"pct_missing": 6.0,
"n_filled": 120
},
{
"column": "income",
"type": "numeric",
"strategy": "iterative",
"fill_value": null,
"hyperparameters": {"max_iter": 10, "random_state": 42},
"n_missing": 450,
"pct_missing": 22.5,
"n_filled": 450
},
{
"column": "region",
"type": "categorical",
"strategy": "mode",
"fill_value": "Île-de-France",
"n_missing": 30,
"pct_missing": 1.5,
"n_filled": 30
}
],
"dropped_columns": [
{"column": "comment", "reason": "90.0% missing — too sparse to impute"}
],
"missingness_indicators_added": false,
"validation": {
"no_nan_remaining": true,
"distribution_flags": [
{"column": "income", "metric": "std", "before": 18200.0, "after": 16100.0, "note": "Variance shrunk by imputation"}
]
},
"artifacts": {
"imputer_pkl": "outputs/imputation/imputer_2026-06-02_140511.pkl",
"report_json": "outputs/imputation/imputation_report_2026-06-02_140511.json",
"log": "outputs/imputation/imputation_log.txt"
}
}
Output format (response to user)
After generation, respond with a concise summary:
# Missing Data Imputation — <filename>
**Completeness DAMA : XX.X → 100.0 / 100** (X cells filled)
## Stratégie par colonne
| Colonne | Type | n manquants | Stratégie | Valeur de remplissage |
| age | numeric | 120 (6.0%) | median | 42.0 |
| income | numeric | 450 (22.5%) | iterative | modèle (MICE) |
| region | categorical | 30 (1.5%) | mode | "Île-de-France" |
## Colonnes écartées
- comment : 90.0% manquant — trop creux pour imputer
## Validation
- ✅ Aucun NaN restant sur les colonnes imputées
- ⚠️ income : variance réduite (std 18200 → 16100) — effet attendu de l'imputation
- Indicateurs de manquant ajoutés : non
## Artefacts générés
- 🧩 Imputer (réutilisable) : `outputs/imputation/imputer_<timestamp>.pkl`
- 📋 Rapport JSON : `outputs/imputation/imputation_report_<timestamp>.json`
- 📜 Log mis à jour : `outputs/imputation/imputation_log.txt`
## Réutilisation
```python
import joblib
imputer = joblib.load("outputs/imputation/imputer_<timestamp>.pkl")
X_test_imputed = imputer.transform(X_test) # même imputer, pas de re-fit
Prochaines étapes recommandées
- Passer le dataset complété à
feature-engineer - Conserver les indicateurs de manquant si la missingness est informative
- Valider que l'imputation n'a pas dégradé la performance du modèle
## Constraints
- **Always** fit the imputer on training data only; never `fit_transform` the test/validation set (leakage).
- **Always** build a single `ColumnTransformer` so the deliverable is one picklable object, and call `.set_output(transform="pandas")` to preserve column names.
- **Always** set `random_state=42` on `IterativeImputer` and enable it via `enable_iterative_imputer` before importing.
- **Always** restrict KNN and iterative imputation to numeric columns; categorical columns use `SimpleImputer` (mode/constant) only.
- **Always** export the pickled imputer AND the JSON report (reusability + auditability both required).
- **Always** append to `imputation_log.txt` — never overwrite.
- **Always** report completeness before and after, and state that imputation raises completeness but not accuracy.
- **Never** impute fully-missing (100% NaN) columns — recommend dropping them instead.
- **Never** silently drop high-missingness columns — flag them and let the user decide.
- **Output language**: French if the user wrote in French, English otherwise (match the user's language).
## Code skeleton
Use this as the structural backbone. Adapt to actual data, but keep the 6-phase structure intact.
```python
import json
from datetime import datetime, timezone
from pathlib import Path
import numpy as np
import pandas as pd
import joblib
from sklearn.experimental import enable_iterative_imputer # noqa: F401 (must precede IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.compose import ColumnTransformer
# ─── Setup ──────────────────────────────────────────────
SKILL_VERSION = "0.1.0"
OUTPUT_DIR = Path("outputs/imputation")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
# ─── Phase 1: Profile missingness ───────────────────────
def profile_missingness(df: pd.DataFrame) -> dict:
n = len(df)
miss = {
c: {
"type": "numeric" if pd.api.types.is_numeric_dtype(df[c]) else "categorical",
"n_missing": int(df[c].isna().sum()),
"pct_missing": float(df[c].isna().mean() * 100),
}
for c in df.columns
}
fully_missing = [c for c, m in miss.items() if m["pct_missing"] == 100.0]
high_missing = [c for c, m in miss.items() if 50.0 < m["pct_missing"] < 100.0]
return {"per_col": miss, "fully_missing": fully_missing, "high_missing": high_missing}
def completeness_score(df: pd.DataFrame) -> float:
return float(100 * (1 - df.isna().mean().mean()))
# ─── Phase 2: Choose strategy per column ────────────────
def choose_strategies(df, profile, overrides=None):
"""Return {col: strategy}. Defaults: numeric->median, categorical->mode.
`overrides` lets the user force a strategy per column."""
overrides = overrides or {}
strat = {}
for c, m in profile["per_col"].items():
if c in profile["fully_missing"] or m["n_missing"] == 0:
continue
if c in overrides:
strat[c] = overrides[c]
elif m["type"] == "numeric":
skew = df[c].dropna().skew()
strat[c] = "median" if abs(skew) > 1 else "mean"
else:
strat[c] = "mode"
return strat
# ─── Phase 3: Build the ColumnTransformer ───────────────
def build_imputer(strategies: dict) -> ColumnTransformer:
by = lambda s: [c for c, v in strategies.items() if v == s]
transformers = []
if by("median"): transformers.append(("num_median", SimpleImputer(strategy="median"), by("median")))
if by("mean"): transformers.append(("num_mean", SimpleImputer(strategy="mean"), by("mean")))
if by("KNN"): transformers.append(("num_knn", KNNImputer(n_neighbors=5), by("KNN")))
if by("iterative"):transformers.append(("num_iter", IterativeImputer(random_state=42, max_iter=10), by("iterative")))
if by("mode"): transformers.append(("cat_mode", SimpleImputer(strategy="most_frequent"), by("mode")))
if by("constant"): transformers.append(("cat_const", SimpleImputer(strategy="constant", fill_value="Missing"), by("constant")))
ct = ColumnTransformer(transformers, remainder="passthrough", verbose_feature_names_out=False)
ct.set_output(transform="pandas")
return ct
# ─── Phase 4: Fit, transform, capture fill values ───────
def fit_transform_capture(imputer, X_train, X_target, strategies):
imputer.fit(X_train)
X_out = imputer.transform(X_target)[X_target.columns] # restore original order
# SimpleImputer fill values live in .statistics_ per fitted sub-transformer
fill_values = {}
for name, trans, cols in imputer.transformers_:
if isinstance(trans, SimpleImputer) and hasattr(trans, "statistics_"):
for col, val in zip(cols, trans.statistics_):
fill_values[col] = val
assert X_out[list(strategies)].isna().sum().sum() == 0, "NaNs remain after imputation"
return X_out, fill_values
# ─── Phase 5: Validation diagnostics ────────────────────
def validate(X_before, X_after, strategies):
flags = []
for c, s in strategies.items():
if s in ("mean", "median", "KNN", "iterative"):
std_b, std_a = X_before[c].std(), X_after[c].std()
if std_b and abs(std_a - std_b) / std_b > 0.1:
flags.append({"column": c, "metric": "std", "before": float(std_b),
"after": float(std_a), "note": "Variance changed by imputation"})
return flags
# ─── Phase 6: Export artifacts ──────────────────────────
def export_artifacts(imputer, report: dict, source: Path) -> dict:
ts = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H%M%S")
pkl_path = OUTPUT_DIR / f"imputer_{ts}.pkl"
json_path = OUTPUT_DIR / f"imputation_report_{ts}.json"
log_path = OUTPUT_DIR / "imputation_log.txt"
joblib.dump(imputer, pkl_path)
report["artifacts"] = {"imputer_pkl": str(pkl_path), "report_json": str(json_path), "log": str(log_path)}
json_path.write_text(json.dumps(report, indent=2, default=str))
with open(log_path, "a") as f:
cb = report["dama_completeness"]["completeness_before"]
ca = report["dama_completeness"]["completeness_after"]
f.write(f"{ts} | v{SKILL_VERSION} | {source} | completeness {cb:.1f} -> {ca:.1f}\n")
return {"pkl": pkl_path, "json": json_path, "log": log_path}