name: agentic-kaggle-skill description: Kaggle-first end-to-end competition workflow for scored submissions. Use when Codex must run Kaggle or competitive ML workflows through scored submission, including code competitions, validation, metrics, policy-safe public notebook/discussion intel, tabular/text/image modeling, tuning, ensembling/stacking, proactive multi-notebook architectures, producer notebooks that train models and export private Kaggle artifact datasets, downstream consumer notebooks, Kaggle GPU offload, kagglehub access, hidden-test debugging, and public score retrieval.
Agentic Kaggle Skill
Operating Loop
Treat every competition as a validation problem first and a modeling problem second. The default target platform is Kaggle, so prefer Kaggle-native notebooks/scripts, datasets, model artifacts, competition submissions, and score receipts. For code competitions, assume the final notebook/kernel will be rerun by Kaggle against hidden data unless the competition docs prove otherwise.
- Read the competition page, classify the submission mode as classic file submission or code/notebook scoring, then inspect the rules, data-use terms, sharing policy, data dictionary, metric, submission format, train/test construction hints, and leakage warnings.
- If live competition intelligence tools are available, inspect top public open notebook solutions and relevant discussion activity before major architecture choices; treat them as clues, not authority.
- Identify the task type: binary, multiclass, multilabel, regression, ranking, image, segmentation, text, time series, grouped entities, or a hybrid.
- Design folds before feature engineering or modeling. Prefer a fold column saved into the training data so every experiment uses the same comparison surface.
- Build the simplest metric-correct baseline and produce out-of-fold (OOF) predictions plus a valid submission.
- Proactively plan a stronger architecture once the baseline is trustworthy: diverse model families, feature/embedding producers, augmentation/pseudo-label/distillation stages, calibration/postprocessing, and an ensemble or stacker.
- Iterate with validation gates: test one meaningful change at a time when possible, but launch several independent producer notebooks in parallel when they create diverse artifacts that can be compared by OOF score or ensemble diversity.
- Offload heavy training, inference, embedding generation, image/text experiments, or memory-risky jobs to Kaggle notebooks/scripts when local compute may OOM or take too long.
- For sophisticated architectures, split work into a small Kaggle pipeline: run several independent producer notebooks/scripts first, save each useful producer output as a private Kaggle dataset, then run one consumer notebook/script that attaches those datasets and creates the final OOF/test/submission outputs. When a producer trains a model, its checkpoint, tokenizer/config, fold metadata, OOF/test predictions, and manifest should be exported as a Kaggle dataset; downstream notebooks load the model from
/kaggle/input/.... - Submit the final submission-producing artifact to Kaggle for scoring and retrieve the resulting submission status/score.
- If Kaggle returns a code-competition error or vague scoring failure, enter the debugging loop: retrieve available logs, classify likely failure mode, patch defensively, rerun the final kernel, resubmit, and repeat until scored or concretely blocked.
- Track local CV, remote Kaggle run status, Kaggle submission score, public LB, private-risk notes, seed, code version, data version, and artifact paths for every run.
- Ensemble only with OOF predictions generated without in-fold leakage.
Resource Map
- Read
references/method-map.mdfor the neutral workflow map behind the skill. - Read
references/information-sharing-policy.mdbefore publishing competition code, notebooks, datasets, models, artifacts, or reports outside the active team or workspace. - Read
references/competition-intel.mdwhen a Kaggle competition slug, URL, title, public leaderboard context, open solutions, or discussion activity may inform the approach. - Read
references/cross-validation-and-metrics.mdwhen choosing folds, metrics, thresholds, or leakage checks. - Read
references/tabular-workflow.mdfor categorical variables, feature engineering, selection, and hyperparameter tuning. - Read
references/image-text-workflow.mdfor image, segmentation, and NLP competition approaches. - Read
references/kaggle-code-competition-pipeline.mdwhen the target is a Kaggle code competition, hidden rerun, final notebook scoring flow, or model artifact handoff between producer and consumer notebooks. - Read
references/advanced-notebook-architecture.mdwhen a stronger solution may need multiple model families, staged feature/embedding/model producers, pseudo-labeling, distillation, postprocessing, blending, stacking, or parallel Kaggle GPU notebooks. - Read
references/kaggle-offload.mdwhen local runs are heavy, GPU is useful, Kaggle data access is needed, or remote notebook results must be collected. - Read
references/kaggle-pipeline-datasets.mdwhen using multiple Kaggle notebooks, intermediate datasets, notebook output sources, orkagglehub. - Read
references/submission-endgame.mdbefore stopping work; this skill is not done until Kaggle scoring has been attempted and the result has been collected or a concrete blocker is documented. - Read
references/code-competition-debugging.mdwhen a Kaggle code competition submission fails, times out, OOMs, produces no score, or reports a vague hidden-run/scoring error. - Read
references/ensembling-and-reproducibility.mdfor project layout, OOF artifacts, stacking, blending, and repeatability. - Read
references/research/orexamples/only when the user asks for historical case studies, Hermes-era patterns, or concrete competition lessons from this repository. - Run
scripts/scaffold_competition.pyto create a competition workspace. - Run
scripts/make_folds.pyto add a fold column to a training CSV. - Run
scripts/prepare_kaggle_kernel.pyto create a Kaggle kernel folder with metadata and retrievable experiment logs. - Run
scripts/prepare_kaggle_dataset.pyto create metadata and commands for versioned intermediate artifact datasets.
Default Kaggle Procedure
Start with the metric and validation:
- Reproduce the competition metric locally, including clipping, transforms, averaging mode, thresholds, and sample weights.
- Check whether the competition allows public code, external data, pretrained weights, generated artifacts, and public repositories before sharing anything outside the team.
- Classify the Kaggle submission path early. If it is a code competition or notebook-scored workflow, design the final consumer notebook as the scoring artifact from the start.
- Use
kaggle-competition-intelMCP tools when available:top_open_solutionsfor score-ranked plus latest high-vote notebooks, andcompetition_discussionsfor top-voted plus recently active discussion topics before choosing expensive baselines. - Choose folds to match the hidden test distribution. Use stratification for classification, binned stratification for regression, group splits for repeated entities, time-aware splits for temporal data, and multilabel-aware splits when label co-occurrence matters.
- Save OOF predictions for every model. OOF files are the currency for error analysis, threshold tuning, blending, and stacking.
- Compare mean fold score and fold variance, not only the best fold.
- Treat a public LB jump that disagrees with CV as a validation question before treating it as a modeling victory.
Then build baselines in this order:
- Metric-only or constant baseline to verify scoring and submission format.
- Fast classical baseline: LightGBM/CatBoost/XGBoost for tabular, TF-IDF plus linear model for text, pretrained backbone for images.
- Strong baseline with stable folds, clean artifacts, and OOF predictions.
- Feature/model experiments with run logs and one primary metric.
- Ensembling or stacking after several diverse, individually validated models exist.
Do not wait for the user to explicitly ask for sophistication when the competition warrants it. After a stable baseline, propose and execute a stronger staged architecture if public solutions, discussion intel, data modality, metric pressure, or CV plateau suggests that a single notebook/model will underperform. Keep the architecture gated by OOF evidence, artifact manifests, and final Kaggle scoring.
When a run is likely to exceed local RAM/VRAM, require a GPU/TPU, or needs Kaggle-only data mounts, prepare a Kaggle kernel run instead of forcing it locally. The remote run must emit experiment_log.json, metrics.jsonl, and an artifact manifest so results can be retrieved and interpreted after kaggle kernels output.
When the solution has multiple heavy stages, keep the remote graph shallow and inspectable:
- Use fewer than five independent producer notebooks/scripts per wave.
- Give each producer a single responsibility such as feature generation, embedding extraction, fold training, checkpoint training, model distillation, or inference.
- For code competitions, prefer producer notebooks that train models/checkpoints and export them as private Kaggle datasets; the final consumer should attach those datasets, load models from
/kaggle/input/..., and perform hidden-test-safe inference. - Save producer notebook outputs as versioned private Kaggle datasets for durable reuse and as stable inputs to downstream notebooks; use direct
kernel_sourcesonly for short-lived chains where durability/versioning is unnecessary. - Run one final consumer notebook/script that attaches the produced datasets, reads their artifacts, writes the final OOF, test predictions, blend/stack report, and submission, then submit that final artifact/kernel to Kaggle for scoring.
- Use
kagglehubinside Python when it is more convenient to download datasets, competition files, notebook outputs, or upload dataset versions programmatically.
Definition Of Done
Do not stop at a plan, scaffold, trained model, notebook run, downloaded output, or local validation score. Continue until one of these is true:
- A competition submission has been sent to Kaggle, Kaggle has processed it, and the public score or submission status has been retrieved and recorded.
- A code-competition final notebook/kernel has consumed all required producer artifacts, has been submitted with its kernel/version reference, and the resulting submission status/score has been retrieved and recorded.
- A concrete external blocker prevents scoring: missing Kaggle credentials, rules not accepted, quota exhausted after retry attempts, competition closed, scoring disabled, required manual UI-only action, unavailable kernel version, or Kaggle service failure. Record the exact blocker and the next command/action needed.
Validation Choices
Use this quick mapping:
- Balanced classification:
StratifiedKFold. - Imbalanced classification: stratified folds plus metric-sensitive thresholding or class weights.
- Regression:
KFoldfor ordinary targets, binned stratified folds for skewed or multimodal targets. - Same user, patient, product, session, document, or image source appears multiple times:
GroupKFoldor stratified group splitting. - Ordered events or forecasting: time split, expanding-window validation, or competition-specific period split.
- Images from same subject or scene: group by subject/scene before augmentation.
- Text with duplicated sources, authors, questions, products, or prompts: group by source key.
- Tiny data: repeated folds or more folds, but keep the final model selection discipline strict.
Modeling Priorities
- For tabular competitions, prefer tree boosting first unless the data is mostly sparse text or high-cardinality categorical interactions. Add CatBoost when categorical columns are central.
- For linear models and neural networks, scale numeric features and encode categoricals carefully.
- For tree models, start with label/count/frequency encodings and sensible missing values; scaling is usually unnecessary.
- For target encoding, always compute encodings out-of-fold, with smoothing and no validation-row target visibility.
- For image tasks, get dataset, transforms, labels, masks, and metric correct before changing architecture.
- For text tasks, keep a TF-IDF baseline even when using transformers; it is fast, interpretable, and often blends well.
- For hyperparameter search, tune only after folds and baselines are trustworthy. Search learning rate, depth/leaves, regularization, sampling, and number of estimators with early stopping.
Competition Hygiene
- Keep raw data read-only.
- Keep competition data, private datasets, model checkpoints, downloaded outputs, submissions, credentials, and data-derived artifacts out of public Git repositories unless the competition rules and artifact licenses explicitly allow public redistribution.
- Put all experiment-changing values in config or command-line args.
- Seed Python, NumPy, framework, and model libraries when possible.
- Save model files, OOF predictions, test predictions, fold scores, feature lists, and submission files with run IDs.
- For Kaggle-offloaded jobs, write retrievable logs and artifacts into the notebook working directory, including run ID, git commit, config, fold, metric, hardware, elapsed time, OOM notes, and output file paths.
- For multi-notebook pipelines, maintain a pipeline manifest with producer kernel refs, produced dataset refs, dataset versions, artifact schemas, metrics, and the final consumer run ID.
- Keep producer datasets, kernels, and model artifacts private by default. Make them public only after verifying the target competition rules, data license, third-party IP, and user intent.
- Record submission message, submission command or kernel/version, Kaggle submission timestamp, status, public score if available, and any error returned by Kaggle.
- For vague code-competition failures, record the observed status, available logs, suspected failure class, patch attempted, and next retry command.
- Never train a stacker on in-sample base predictions.
- Never select features, target encoders, scalers, thresholds, or augmentations using validation targets outside the fold boundary.
- Prefer scripts for repeatable training and notebooks for exploration and plots.
Useful Commands
Create a competition skeleton:
python3 <skill-dir>/scripts/scaffold_competition.py --root .
Create stratified folds:
python3 <skill-dir>/scripts/make_folds.py \
--input input/train.csv \
--output input/train_folds.csv \
--target target \
--strategy stratified \
--n-splits 5
Create grouped folds:
python3 <skill-dir>/scripts/make_folds.py \
--input input/train.csv \
--output input/train_folds.csv \
--target target \
--group-col patient_id \
--strategy group
Prepare a Kaggle GPU kernel experiment folder:
python3 <skill-dir>/scripts/prepare_kaggle_kernel.py \
--output kaggle_kernels/exp_lgbm_gpu \
--username kaggle-user \
--slug exp-lgbm-gpu \
--title "exp lgbm gpu" \
--competition playground-series-sample \
--accelerator NvidiaTeslaT4
Prepare a private Kaggle dataset folder for intermediate artifacts:
python3 <skill-dir>/scripts/prepare_kaggle_dataset.py \
--output kaggle_datasets/exp_features_v1 \
--username kaggle-user \
--slug exp-features-v1 \
--title "exp features v1" \
--description "OOF-safe feature artifacts for experiment exp_features_v1"
Prepare a private Kaggle dataset folder for trained model artifacts:
python3 <skill-dir>/scripts/prepare_kaggle_dataset.py \
--output kaggle_datasets/exp_model_v1 \
--username kaggle-user \
--slug exp-model-v1 \
--title "exp model v1" \
--description "Trained model artifacts for experiment exp_model_v1" \
--artifact-kind model