missing-data-imputation - SKILL.md Agent Skill

name: missing-data-imputation description: Use this skill when the user wants to fill missing values in a tabular dataset and obtain a reusable, fitted scikit-learn imputer plus an auditable report of what was imputed. Triggers include "impute missing values", "fill the NaNs", "handle missing data", "KNN imputation", "iterative / MICE imputation", "remplir les valeurs manquantes", "impute mes données", "gérer les données manquantes", "imputation KNN", "imputation itérative". Supports numeric strategies (mean, median, KNN, iterative/MICE) and categorical strategies (mode, constant), exports a pickled fitted imputer for reuse on new data, and produces a JSON report mapping each column to its strategy, fill value, and missing counts. Aligned to the DAMA-DMBOK2 Completeness dimension.

Missing Data Imputation

A completeness-remediation skill. Takes a dataset with missing values and produces a fitted, pickled scikit-learn imputer that can be reused on new data, plus a machine-readable JSON report documenting every imputation decision. Aligned to the DAMA-DMBOK2 Completeness dimension.

When to use this skill

Activate when the user has a dataset with missing values and wants to fill them in a principled, reproducible way — not just a one-off fillna. Typical signals:

"Impute the missing values in this dataset"
"Fill the NaNs so I can train a model"
"Use KNN / iterative / MICE imputation on the numeric columns"
"Remplis les valeurs manquantes de ce dataset"
"Gère les données manquantes avant le feature engineering"

Pre-conditions:

The dataset is loaded and tabular (CSV, Excel, Parquet).
The user has decided to impute rather than drop. If missingness is very high (>50% on many columns), warn that imputation may inject noise and suggest dropping those columns instead — but let the user decide.

Do NOT activate this skill for:

Auditing quality without fixing it → use data-quality-report
General exploration → use eda-explorer
Time-series gap filling that needs forward/backward fill or interpolation along the time axis → use time-series-features (lag/rolling logic); this skill is for cross-sectional imputation.

Position in the pipeline: this skill typically runs after eda-explorer / data-quality-report (which surface the missingness) and before feature-engineer (which assumes a complete matrix). Fit the imputer on the training split only, then reuse the pickled object on validation/test/production data.

Workflow

For every request, follow these 6 phases in order. Never skip Phase 1 (leakage check) or Phase 6 (export) — they are what make the result reusable and auditable.

Phase 1 — Profile missingness & guard against leakage

Count missingness per column: absolute count and percentage.
Classify each column as numeric or categorical (use select_dtypes; treat object/category/bool as categorical, numeric dtypes as numeric).
Flag fully-missing columns (100% NaN) — these cannot be imputed; recommend dropping them.
Flag high-missingness columns (>50%) — warn that imputation is low-confidence.
CRITICAL — leakage guard: the imputer MUST be fit on training data only. If the user provides a single dataset, ask whether a train/test split exists; if they intend to model, recommend splitting first and fitting the imputer on train, then transform on test. Never fit_transform the test set.
Optional missingness indicators: offer to add <col>_was_missing boolean columns (SimpleImputer(add_indicator=True) or a manual mask) so the model can learn from missingness itself. Off by default; recommend ON when missingness may be informative (MNAR).

Output a profile table:

| Column | Type | n_missing | pct_missing | Proposed strategy |
|---|---|---|---|---|
| age | numeric | 120 | 6.0% | median |
| income | numeric | 450 | 22.5% | iterative |
| region | categorical | 30 | 1.5% | mode |
| comment | categorical | 1800 | 90.0% | DROP (too sparse) |

Phase 2 — Choose a strategy per column

Default decision rules (the user can override any of them):

Numeric columns:

Condition	Default strategy	Rationale
Symmetric / low-skew distribution	`mean`	Unbiased center for roughly normal data
Skewed or has outliers	`median`	Robust to outliers
Missingness depends on other numeric columns, moderate # of features	`KNN`	Uses neighbours in feature space
Complex multivariate relationships, want MICE	`iterative`	Models each column from the others

Default for numeric when unspecified: median (robust, safe baseline).

Categorical columns:

Condition	Default strategy	Rationale
A dominant category exists / missingness is incidental	`mode` (most_frequent)	Preserves the modal class
Missingness is meaningful, or you want it explicit	`constant` (fill value `"Missing"` / `"Unknown"`)	Makes "no value" a first-class category

Default for categorical when unspecified: mode.

⚠️ KNN and iterative imputers require numeric input. Apply them only to numeric columns. Categorical columns are always handled by SimpleImputer (mode/constant). Build the per-type pipelines separately and combine with ColumnTransformer.

⚠️ IterativeImputer is experimental in scikit-learn — you MUST enable it explicitly:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

Phase 3 — Build the imputer (ColumnTransformer)

Construct a single ColumnTransformer so the whole thing is one picklable object. Group numeric columns by their chosen strategy (mean/median can share a SimpleImputer; KNN and iterative each need their own transformer), and categorical columns by strategy (mode vs constant).

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer

transformers = []
if median_cols:
    transformers.append(("num_median", SimpleImputer(strategy="median"), median_cols))
if mean_cols:
    transformers.append(("num_mean", SimpleImputer(strategy="mean"), mean_cols))
if knn_cols:
    transformers.append(("num_knn", KNNImputer(n_neighbors=5), knn_cols))
if iterative_cols:
    transformers.append(("num_iter", IterativeImputer(random_state=42, max_iter=10), iterative_cols))
if mode_cols:
    transformers.append(("cat_mode", SimpleImputer(strategy="most_frequent"), mode_cols))
if constant_cols:
    transformers.append(("cat_const", SimpleImputer(strategy="constant", fill_value="Missing"), constant_cols))

imputer = ColumnTransformer(transformers, remainder="passthrough", verbose_feature_names_out=False)
imputer.set_output(transform="pandas")  # keep column names / DataFrame output

Always set random_state=42 on IterativeImputer for reproducibility. Always set_output(transform="pandas") so the transformed result keeps column names.

Phase 4 — Fit, transform & capture fill values

imputer.fit(X_train) — on training data only.
X_imputed = imputer.transform(X) — apply to the target frame.
Extract the learned fill values for the report:
- SimpleImputer exposes .statistics_ (one value per column it handled).
- KNNImputer / IterativeImputer are model-based — there is no single fill value; record the strategy + hyperparameters and the per-column count of values actually filled (compare the pre/post missing mask).
Reorder columns back to the original order if remainder="passthrough" reshuffled them, so downstream code is unaffected.
Sanity check: assert X_imputed.isna().sum().sum() == 0 for the imputed columns (excluding any intentionally-dropped fully-missing columns). If NaNs remain, stop and report — do not export a broken imputer.

Phase 5 — Validate the imputation

Quick post-imputation diagnostics, reported to the user:

No NaNs remain (per Phase 4 sanity check).
Distribution drift (numeric): compare mean and std before vs after for each imputed numeric column. Large shifts (e.g. mean moves > 0.5 std) are a yellow flag — note them; mean/median imputation always shrinks variance, so call that out.
Category balance (categorical): report how many rows went to each fill category; if a constant fill ("Missing") now dominates a column, flag it.
Row coverage: total cells filled and the resulting Completeness score (see Phase 6).

Phase 6 — Export artifacts

Save 3 files in outputs/imputation/:

imputer_<YYYY-MM-DD_HHMMSS>.pkl — the fitted ColumnTransformer, pickled with joblib.dump. This is the reusable deliverable: joblib.load(...).transform(new_df).
imputation_report_<YYYY-MM-DD_HHMMSS>.json — machine-readable report (schema below).
imputation_log.txt — append-only log: timestamp + skill_version + source + completeness_before → completeness_after.

Never overwrite the log — append only (consistency with data-quality-report auditability).

DAMA-DMBOK2 alignment — Completeness

This skill operationalises the Completeness dimension (the same dimension scored by data-quality-report). Report the score before and after, using the identical formula so the two skills are comparable:

completeness = 100 * (1 - mean_missing_pct_across_columns / 100)

Compute completeness_before on the raw frame and completeness_after on the imputed frame (should be ~100 for imputed columns).
Imputation raises measured completeness but not necessarily accuracy — state this explicitly. Filling a value makes the cell present (completeness ✓) but the filled value may be wrong (accuracy ✗). Recommend keeping the missingness indicators (Phase 1.6) and validating downstream model performance.

JSON schema (machine-readable export)

{
  "skill_version": "0.1.0",
  "generated_at": "2026-06-02T14:05:11Z",
  "source": {
    "filename": "insurance_sample.csv",
    "shape": [2000, 10],
    "fit_on": "train_split"
  },
  "dama_completeness": {
    "completeness_before": 91.3,
    "completeness_after": 100.0,
    "cells_filled": 1740,
    "note": "Completeness raised; accuracy of filled values not guaranteed."
  },
  "columns": [
    {
      "column": "age",
      "type": "numeric",
      "strategy": "median",
      "fill_value": 42.0,
      "n_missing": 120,
      "pct_missing": 6.0,
      "n_filled": 120
    },
    {
      "column": "income",
      "type": "numeric",
      "strategy": "iterative",
      "fill_value": null,
      "hyperparameters": {"max_iter": 10, "random_state": 42},
      "n_missing": 450,
      "pct_missing": 22.5,
      "n_filled": 450
    },
    {
      "column": "region",
      "type": "categorical",
      "strategy": "mode",
      "fill_value": "Île-de-France",
      "n_missing": 30,
      "pct_missing": 1.5,
      "n_filled": 30
    }
  ],
  "dropped_columns": [
    {"column": "comment", "reason": "90.0% missing — too sparse to impute"}
  ],
  "missingness_indicators_added": false,
  "validation": {
    "no_nan_remaining": true,
    "distribution_flags": [
      {"column": "income", "metric": "std", "before": 18200.0, "after": 16100.0, "note": "Variance shrunk by imputation"}
    ]
  },
  "artifacts": {
    "imputer_pkl": "outputs/imputation/imputer_2026-06-02_140511.pkl",
    "report_json": "outputs/imputation/imputation_report_2026-06-02_140511.json",
    "log": "outputs/imputation/imputation_log.txt"
  }
}

Output format (response to user)

After generation, respond with a concise summary:

# Missing Data Imputation — <filename>

**Completeness DAMA : XX.X → 100.0 / 100** (X cells filled)

## Stratégie par colonne
| Colonne | Type | n manquants | Stratégie | Valeur de remplissage |
| age | numeric | 120 (6.0%) | median | 42.0 |
| income | numeric | 450 (22.5%) | iterative | modèle (MICE) |
| region | categorical | 30 (1.5%) | mode | "Île-de-France" |

## Colonnes écartées
- comment : 90.0% manquant — trop creux pour imputer

## Validation
- ✅ Aucun NaN restant sur les colonnes imputées
- ⚠️ income : variance réduite (std 18200 → 16100) — effet attendu de l'imputation
- Indicateurs de manquant ajoutés : non

## Artefacts générés
- 🧩 Imputer (réutilisable) : `outputs/imputation/imputer_<timestamp>.pkl`
- 📋 Rapport JSON : `outputs/imputation/imputation_report_<timestamp>.json`
- 📜 Log mis à jour : `outputs/imputation/imputation_log.txt`

## Réutilisation
```python
import joblib
imputer = joblib.load("outputs/imputation/imputer_<timestamp>.pkl")
X_test_imputed = imputer.transform(X_test)   # même imputer, pas de re-fit

Prochaines étapes recommandées

Passer le dataset complété à feature-engineer
Conserver les indicateurs de manquant si la missingness est informative
Valider que l'imputation n'a pas dégradé la performance du modèle


## Constraints

- **Always** fit the imputer on training data only; never `fit_transform` the test/validation set (leakage).
- **Always** build a single `ColumnTransformer` so the deliverable is one picklable object, and call `.set_output(transform="pandas")` to preserve column names.
- **Always** set `random_state=42` on `IterativeImputer` and enable it via `enable_iterative_imputer` before importing.
- **Always** restrict KNN and iterative imputation to numeric columns; categorical columns use `SimpleImputer` (mode/constant) only.
- **Always** export the pickled imputer AND the JSON report (reusability + auditability both required).
- **Always** append to `imputation_log.txt` — never overwrite.
- **Always** report completeness before and after, and state that imputation raises completeness but not accuracy.
- **Never** impute fully-missing (100% NaN) columns — recommend dropping them instead.
- **Never** silently drop high-missingness columns — flag them and let the user decide.
- **Output language**: French if the user wrote in French, English otherwise (match the user's language).

## Code skeleton

Use this as the structural backbone. Adapt to actual data, but keep the 6-phase structure intact.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
import pandas as pd
import joblib

from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (must precede IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.compose import ColumnTransformer

# ─── Setup ──────────────────────────────────────────────
SKILL_VERSION = "0.1.0"
OUTPUT_DIR = Path("outputs/imputation")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# ─── Phase 1: Profile missingness ───────────────────────
def profile_missingness(df: pd.DataFrame) -> dict:
    n = len(df)
    miss = {
        c: {
            "type": "numeric" if pd.api.types.is_numeric_dtype(df[c]) else "categorical",
            "n_missing": int(df[c].isna().sum()),
            "pct_missing": float(df[c].isna().mean() * 100),
        }
        for c in df.columns
    }
    fully_missing = [c for c, m in miss.items() if m["pct_missing"] == 100.0]
    high_missing = [c for c, m in miss.items() if 50.0 < m["pct_missing"] < 100.0]
    return {"per_col": miss, "fully_missing": fully_missing, "high_missing": high_missing}

def completeness_score(df: pd.DataFrame) -> float:
    return float(100 * (1 - df.isna().mean().mean()))

# ─── Phase 2: Choose strategy per column ────────────────
def choose_strategies(df, profile, overrides=None):
    """Return {col: strategy}. Defaults: numeric->median, categorical->mode.
    `overrides` lets the user force a strategy per column."""
    overrides = overrides or {}
    strat = {}
    for c, m in profile["per_col"].items():
        if c in profile["fully_missing"] or m["n_missing"] == 0:
            continue
        if c in overrides:
            strat[c] = overrides[c]
        elif m["type"] == "numeric":
            skew = df[c].dropna().skew()
            strat[c] = "median" if abs(skew) > 1 else "mean"
        else:
            strat[c] = "mode"
    return strat

# ─── Phase 3: Build the ColumnTransformer ───────────────
def build_imputer(strategies: dict) -> ColumnTransformer:
    by = lambda s: [c for c, v in strategies.items() if v == s]
    transformers = []
    if by("median"):   transformers.append(("num_median", SimpleImputer(strategy="median"), by("median")))
    if by("mean"):     transformers.append(("num_mean", SimpleImputer(strategy="mean"), by("mean")))
    if by("KNN"):      transformers.append(("num_knn", KNNImputer(n_neighbors=5), by("KNN")))
    if by("iterative"):transformers.append(("num_iter", IterativeImputer(random_state=42, max_iter=10), by("iterative")))
    if by("mode"):     transformers.append(("cat_mode", SimpleImputer(strategy="most_frequent"), by("mode")))
    if by("constant"): transformers.append(("cat_const", SimpleImputer(strategy="constant", fill_value="Missing"), by("constant")))
    ct = ColumnTransformer(transformers, remainder="passthrough", verbose_feature_names_out=False)
    ct.set_output(transform="pandas")
    return ct

# ─── Phase 4: Fit, transform, capture fill values ───────
def fit_transform_capture(imputer, X_train, X_target, strategies):
    imputer.fit(X_train)
    X_out = imputer.transform(X_target)[X_target.columns]  # restore original order
    # SimpleImputer fill values live in .statistics_ per fitted sub-transformer
    fill_values = {}
    for name, trans, cols in imputer.transformers_:
        if isinstance(trans, SimpleImputer) and hasattr(trans, "statistics_"):
            for col, val in zip(cols, trans.statistics_):
                fill_values[col] = val
    assert X_out[list(strategies)].isna().sum().sum() == 0, "NaNs remain after imputation"
    return X_out, fill_values

# ─── Phase 5: Validation diagnostics ────────────────────
def validate(X_before, X_after, strategies):
    flags = []
    for c, s in strategies.items():
        if s in ("mean", "median", "KNN", "iterative"):
            std_b, std_a = X_before[c].std(), X_after[c].std()
            if std_b and abs(std_a - std_b) / std_b > 0.1:
                flags.append({"column": c, "metric": "std", "before": float(std_b),
                              "after": float(std_a), "note": "Variance changed by imputation"})
    return flags

# ─── Phase 6: Export artifacts ──────────────────────────
def export_artifacts(imputer, report: dict, source: Path) -> dict:
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H%M%S")
    pkl_path = OUTPUT_DIR / f"imputer_{ts}.pkl"
    json_path = OUTPUT_DIR / f"imputation_report_{ts}.json"
    log_path = OUTPUT_DIR / "imputation_log.txt"

    joblib.dump(imputer, pkl_path)
    report["artifacts"] = {"imputer_pkl": str(pkl_path), "report_json": str(json_path), "log": str(log_path)}
    json_path.write_text(json.dumps(report, indent=2, default=str))
    with open(log_path, "a") as f:
        cb = report["dama_completeness"]["completeness_before"]
        ca = report["dama_completeness"]["completeness_after"]
        f.write(f"{ts} | v{SKILL_VERSION} | {source} | completeness {cb:.1f} -> {ca:.1f}\n")
    return {"pkl": pkl_path, "json": json_path, "log": log_path}