tabular-ml-modeling - SKILL.md Agent Skill

name: tabular-ml-modeling description: State-of-the-art machine learning for large-scale tabular data (millions of rows, thousands of features). Use when building predictive models on tabular/structured data including regression and classification tasks. Covers GPU-accelerated gradient boosting (XGBoost, LightGBM, CatBoost), RAPIDS acceleration (cuDF, cuML), validation strategies, feature engineering, ensembling (hill climbing, stacking), and competition-winning techniques. Includes Numerai-specific optimizations with deep model parameters, era-based validation, multi-target ensembling, and payout scoring. Optimized for NVIDIA A100 GPU with CUDA 12.4.

Tabular ML Modeling

Battle-tested techniques for large-scale tabular data modeling, refined from Kaggle Grandmaster competition experience and optimized for GPU acceleration.

Environment Specifications

GPU: NVIDIA A100-SXM4-80GB (80GB VRAM), CUDA 12.4
CPU: 8 cores x64, 160GB RAM
Disk: 200GB free
Dataset: 4.5M rows, 2748 int8 features [0-4], era-based structure

Platform Persistence (CRITICAL)

The platform is ephemeral. It shuts down when idle and all work is lost.

Required Practices

Save models frequently to disk after each training run
Maintain a log file (numerai.log) with progress, findings, decisions
Checkpoint OOF predictions after each model completes
Save ensemble weights when optimizing

import logging
import joblib

# Setup logging
logging.basicConfig(
    filename='numerai.log',
    level=logging.INFO,
    format='%(asctime)s - %(message)s'
)

def log(msg):
    logging.info(msg)
    print(msg)

# Save model after training
def save_checkpoint(model, name, oof_preds, scores):
    joblib.dump(model, f'{name}_model.pkl')
    np.save(f'{name}_oof.npy', oof_preds)
    log(f"Saved {name}: NC={scores['nc']:.4f}, CC={scores['cc']:.4f}")

Core Principles

Fast Experimentation

Maximize high-quality experiments. GPU acceleration transforms day-long iterations into minutes:

Use RAPIDS cuDF for dataframe operations (up to 150x faster than pandas)
Use GPU backends for XGBoost, LightGBM, CatBoost
Use cuML for sklearn-compatible ML algorithms

Robust Validation

Never trust a single train/test split. Use k-fold cross-validation matched to data structure:

Standard: StratifiedKFold for classification, KFold for regression
Time-series: TimeSeriesSplit
Grouped data: GroupKFold (e.g., era-based predictions)

Ensemble Everything

Single models leave performance on the table. Combine aggressively:

Diverse frameworks: CatBoost + XGBoost + LightGBM capture different patterns
Multiple targets: Auxiliary targets provide orthogonal signal
Hill climbing: Optimize weights on OOF predictions
Stacking: Meta-models can learn non-linear combinations

Smarter EDA

Go beyond basic statistics. These checks catch problems that sink models:

Train vs Test Distribution

# Check for distribution shift that breaks generalization
import matplotlib.pyplot as plt

for col in important_features[:10]:
    fig, ax = plt.subplots()
    train[col].hist(ax=ax, alpha=0.5, label='train', bins=50)
    test[col].hist(ax=ax, alpha=0.5, label='test', bins=50)
    ax.legend()
    ax.set_title(f'{col} distribution')

Temporal Patterns in Target

# Check for trends/seasonality that require time-aware validation
df.groupby('era')['target_ender_20'].mean().plot()
plt.title('Target mean by era - look for trends')

Diverse Baselines

Don't commit to one model type early. Train multiple frameworks to understand your data:

# Quick baseline comparison across frameworks
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

baselines = {
    'xgb': XGBRegressor(device='cuda', n_estimators=1000, learning_rate=0.05),
    'lgb': LGBMRegressor(device='gpu', n_estimators=1000, learning_rate=0.05),
    'cat': CatBoostRegressor(task_type='GPU', iterations=1000, learning_rate=0.05, verbose=0),
}

for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
    print(f"{name}: RMSE = {np.sqrt(-scores.mean()):.4f} (+/- {np.sqrt(-scores).std():.4f})")

Why this matters: Different model types reveal different aspects of your data. If linear models do surprisingly well, interactions may not matter. If one GBDT framework dominates, that guides where to focus.

Workflow Overview

Smarter EDA - Check train/test distribution shift, temporal patterns in target
Diverse Baselines - Compare CatBoost, XGBoost, LightGBM, neural nets
Setup Era-Based CV - GroupKFold by era (no temporal leakage)
Feature Engineering - Interactions, aggregations, combinations (GPU-accelerated)
Train Deep Models - Multiple frameworks × multiple targets × multiple seeds
Ensemble: Hill Climbing - Optimize weights on OOF predictions
Ensemble: Stacking - Meta-model on OOF if beneficial
Pseudo-Labeling - Expand training with confident predictions (if unlabeled data available)
Extra Training - Seed averaging + retrain on 100% data
Evaluate Payout - numerai_corr + 3× correlation_contribution

Quick Start

Installation

# RAPIDS suite (cuDF, cuML) - requires CUDA 12.x
pip install cudf-cu12 cuml-cu12 --extra-index-url=https://pypi.nvidia.com

# Gradient boosting frameworks with GPU
pip install xgboost catboost lightgbm

# Optional: For neural approaches
pip install pytorch-tabnet tabpfn

GPU-Accelerated Data Loading

import cudf

# Load data on GPU (150x faster than pandas for large files)
train = cudf.read_parquet("train.parquet")
test = cudf.read_parquet("test.parquet")

# Or use pandas accelerator mode (zero code change)
# %load_ext cudf.pandas
# import pandas as pd  # now GPU-accelerated

Baseline Model Template

import xgboost as xgb
import numpy as np

# Start with LIGHT parameters, use early stopping
params = {
    "device": "cuda",
    "tree_method": "hist",
    "objective": "reg:squarederror",
    "learning_rate": 0.01,       # Start higher
    "max_depth": 8,
    "n_estimators": 2000,        # Start smaller
    "colsample_bytree": 0.1,
    "subsample": 0.8,
    "early_stopping_rounds": 100,
}

# Time-ordered CV with embargo (NOT GroupKFold!)
splits = make_embargo_splits(df['era'], n_splits=5, embargo_eras=4)
oof_preds = np.zeros(len(df))

for fold, (train_idx, val_idx) in enumerate(splits):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]
    
    model = xgb.XGBRegressor(**params)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=100)
    oof_preds[val_idx] = model.predict(X_val)
    
    # Save checkpoint
    joblib.dump(model, f'xgb_fold{fold}.pkl')

Reference Materials

For detailed guidance on specific topics, see:

references/numerai_guide.md: Numerai competition specifics - era validation, payout scoring, multi-target ensembling
references/gbdt_frameworks.md: XGBoost vs LightGBM vs CatBoost selection and GPU configs
references/validation_strategies.md: Cross-validation strategies and leakage prevention
references/feature_engineering.md: GPU-accelerated feature generation patterns
references/ensembling.md: Hill climbing, stacking, and pseudo-labeling
references/gpu_acceleration.md: RAPIDS cuDF/cuML optimization guide
references/neural_approaches.md: TabNet, TabPFN, and deep learning for tabular

Scripts

scripts/numerai_trainer.py: Numerai-optimized deep model training with era-based CV
scripts/baseline_trainer.py: Multi-framework baseline training
scripts/cv_validator.py: Cross-validation with proper OOF generation
scripts/hill_climb_ensemble.py: GPU-accelerated ensemble optimization

Deep Model Configuration

This dataset rewards DEEP models. But start light and let early stopping guide you deeper.

Progressive Training Strategy

Start light: n_estimators=2,000, learning_rate=0.01
Use early stopping to find optimal tree count
Gradually deepen: Reduce LR, increase trees based on findings
Target: n_estimators=30,000, learning_rate=0.001 for final models

GPU Availability

Framework	GPU Support	Device Setting
XGBoost	✅ CUDA	`device="cuda"`
CatBoost	✅ GPU	`task_type="GPU"`
LightGBM	❌ CPU only	Keep smaller models (2K trees)

Strategy: Let XGBoost and CatBoost bring deep nuance (30K trees). Use LightGBM with lighter params for ensemble diversity.

Light vs Deep Parameters

Parameter	Light (Start Here)	Deep (Target)
n_estimators	2,000	30,000
learning_rate	0.01	0.001
max_depth	5	10
num_leaves (LGB)	15	1,024
min_data_in_leaf	1,000	10,000
early_stopping	100 rounds	500 rounds

XGBoost Deep Config (GPU)

params = {
    "device": "cuda",
    "tree_method": "hist",
    "objective": "reg:squarederror",
    "n_estimators": 30_000,
    "learning_rate": 0.001,
    "max_depth": 10,
    "max_leaves": 1024,
    "min_child_weight": 10_000,
    "colsample_bytree": 0.1,
    "subsample": 0.8,
    "early_stopping_rounds": 500,
}

CatBoost Deep Config (GPU)

params = {
    "task_type": "GPU",
    "devices": "0",
    "iterations": 30_000,
    "learning_rate": 0.001,
    "depth": 10,
    "min_data_in_leaf": 10_000,
    "rsm": 0.1,
    "subsample": 0.8,
    "early_stopping_rounds": 500,
}

LightGBM Config (CPU - Keep Light)

params = {
    "device": "cpu",  # GPU not supported on this platform
    "objective": "regression",
    "n_estimators": 2_000,  # Keep smaller for CPU
    "learning_rate": 0.01,
    "max_depth": 8,
    "num_leaves": 255,
    "min_data_in_leaf": 1_000,
    "feature_fraction": 0.1,
    "bagging_fraction": 0.8,
    "early_stopping_rounds": 100,
}

Numerai-Style Workflow

For datasets with era-based structure and custom scoring:

1. Time-Ordered CV with Embargo (CRITICAL)

# WARNING: GroupKFold causes severe temporal leakage!
# Adjacent eras have ~99% target correlation (overlapping forward returns)
# Must use time-ordered CV with embargo gap

def make_embargo_splits(
    era: pd.Series, 
    n_splits: int = 5, 
    embargo_eras: int = 4,
    min_train_ratio: float = 0.5,
) -> list[tuple[np.ndarray, np.ndarray]]:
    """
    Time-ordered CV with embargo between train and val.
    - Training data always comes BEFORE validation data temporally
    - Embargo gap prevents target leakage from overlapping returns
    """
    unique_eras = np.sort(era.unique())
    n_eras = len(unique_eras)
    
    min_train_eras = int(n_eras * min_train_ratio)
    remaining_eras = n_eras - min_train_eras - embargo_eras
    test_size = remaining_eras // n_splits
    
    era_to_idx = {e: np.where(era == e)[0] for e in unique_eras}
    
    splits = []
    for i in range(n_splits):
        train_end_idx = min_train_eras + (i * test_size)
        test_start_idx = train_end_idx + embargo_eras
        test_end_idx = test_start_idx + test_size
        
        if test_end_idx > n_eras:
            test_end_idx = n_eras
        if test_start_idx >= n_eras:
            break
            
        train_eras = unique_eras[:train_end_idx]
        test_eras = unique_eras[test_start_idx:test_end_idx]
        
        train_idx = np.concatenate([era_to_idx[e] for e in train_eras])
        test_idx = np.concatenate([era_to_idx[e] for e in test_eras])
        
        splits.append((np.sort(train_idx), np.sort(test_idx)))
    
    return splits

# Usage
splits = make_embargo_splits(df['era'], n_splits=5, embargo_eras=4)
for train_idx, val_idx in splits:
    # Train always comes BEFORE val temporally
    pass

2. Multi-Target Training

Train separate models on auxiliary targets for ensemble diversity:

targets = ["target_ender_20", "target_ender_60", "target_jasper_20", ...]
oof_preds = {}
for target in targets:
    oof_preds[target] = train_model(X, df[target], groups=df['era'])

3. Rank-Gaussian Ensembling

from scipy import stats

def rank_gauss(s):
    """Rank-normalize and gaussianize."""
    ranks = s.rank(method='average') / (len(s) + 1)
    return stats.norm.ppf(ranks) / stats.norm.ppf(ranks).std()

# Apply per-era, then combine
for col in pred_cols:
    df[col] = df.groupby('era')[col].transform(rank_gauss)
final = df[pred_cols].dot(weights)
final = df.groupby('era')['final'].transform(rank_gauss)

4. Payout Scoring

from numerai_tools.scoring import numerai_corr, correlation_contribution

# IMPORTANT: Compute per-era, then average
# numerai_corr(predictions: DataFrame, targets: Series)
# correlation_contribution(predictions: DataFrame, meta_model: Series, live_targets: Series)

pred_col = 'prediction'  # or list of prediction columns

per_era_nc = df.groupby('era').apply(
    lambda d: numerai_corr(d[[pred_col]], d['target_ender_20'])
)
per_era_cc = df.groupby('era').apply(
    lambda d: correlation_contribution(d[[pred_col]], d['v52_lgbm_ender20'], d['target_ender_20'])
)

# Mean across eras
nc = per_era_nc[pred_col].mean()
cc = per_era_cc[pred_col].mean()

# Payout = 0.75 * NC + 2.25 * CC (CC weighted 3x more!)
payout = 0.75 * nc + 2.25 * cc

Model Selection Decision Tree

For this Numerai-style dataset:

All three frameworks are viable - explore freely:
├─ CatBoost GPU (known strong performer in Numerai)
├─ XGBoost GPU (robust, well-documented)
└─ LightGBM CPU (fast training, ensemble diversity)

Ensemble diversity is key:
└─ Train each framework on all 6 targets
   └─ Different tree-building strategies capture different patterns

Parameter regime:
├─ Exploration: 2K trees, LR=0.01 (quick iteration)
└─ Production: 30K trees, LR=0.001 (GPU models only)

Key insight: Deep GPU models >> shallow models for this data

Memory Management for 4.5M Rows

# Dataset: 4.5M rows × 2748 int8 features
# Memory: 4.5M × 2748 × 1 byte ≈ 12.4 GB (very efficient!)
# A100 80GB + 160GB RAM = abundant resources

# Features are pre-quantized int8 [0,1,2,3,4] - no conversion needed
df = pd.read_parquet("dataset.parquet")
feature_cols = [c for c in df.columns if c.startswith("feature_")]

# Keep int8! Tree models handle it natively
# Do NOT convert to float32 - wastes 4x memory
X = df[feature_cols].values  # Stays as int8

# For GPU training with cuDF
import cudf
gdf = cudf.read_parquet("dataset.parquet")

Competition-Winning Checklist

1. Smarter EDA

Check train vs test distribution for important features
Analyze target variable for temporal patterns across eras
Identify data leakage risks

2. Data Preparation

Load dataset.parquet and benchmark.parquet, merge on id
Identify feature columns (start with "feature_")
Verify features are int8 [0-4] - do NOT scale or transform
Never use "era" as a feature (grouping only)

3. Diverse Baselines

Train quick baselines with CatBoost, XGBoost, LightGBM
Compare performance to understand data characteristics

4. Feature Engineering

Explore feature interactions (pairwise combinations of strong features)
Try aggregation features (mean/std of feature groups per era)
Test dimensionality reduction (PCA components as additional features)
Use GPU-accelerated generation with cuDF

5. Validation Setup

Use time-ordered CV with embargo (NOT GroupKFold - causes leakage!)
Set embargo_eras=4 for 20-day targets
Verify train eras always come BEFORE val eras
Evaluate with groupby('era').apply() pattern for numerai_corr and correlation_contribution
Track payout = 0.75NC + 2.25CC
Realistic OOF scores: NC ~0.03-0.055, CC ~0.01-0.035

6. Model Training

Start light: n_estimators=2000, LR=0.01, early_stopping=100
Use early stopping to find optimal tree count
GPU models (XGBoost, CatBoost): Progress to deep params (30K trees)
CPU models (LightGBM): Keep at 2K trees for efficiency
Train on main target: target_ender_20
Train on auxiliary targets for ensemble diversity
Save models and OOF predictions after each run

7. Ensembling

Hill climbing: Optimize weights on OOF predictions
Stacking: Meta-model on diverse OOF predictions
Multi-target blend: Rank-gaussianize per era, weighted combine

8. Pseudo-Labeling (if unlabeled data available)

Generate soft pseudo-labels
Multi-round for best results
Avoid leakage: separate pseudo-labels per CV fold

9. Extra Training

Seed averaging: Train with 5-10 seeds, average predictions
Retrain on 100% data: Final model uses all training data

10. Final Evaluation

Monitor correlation_contribution (3x weight in payout!)
Compare ensemble vs single best model
Try neural approaches (TabNet) for diversity

Advanced Techniques

Beyond basic model training, these techniques often separate top performers:

Feature Engineering

With 2748 features, strategic engineering can unlock hidden signal:

# See references/feature_engineering.md for full patterns
# Example: pairwise interactions of top features
top_features = get_top_n_by_importance(model, n=50)
for f1, f2 in combinations(top_features, 2):
    df[f'{f1}_x_{f2}'] = df[f1] * df[f2]

Hill Climbing Ensemble

Optimize weights across all OOF predictions (targets × frameworks):

# See scripts/hill_climb_ensemble.py
from hill_climb_ensemble import HillClimbingEnsemble

# Collect OOF predictions from all models
oof_matrix = np.column_stack([
    oof_catboost_ender20, oof_xgb_ender20, oof_lgb_ender20,
    oof_catboost_ender60, oof_xgb_ender60, ...
])

hc = HillClimbingEnsemble(metric='numerai_corr')
weights = hc.fit(oof_matrix, y_true)
ensemble_pred = oof_matrix @ weights

Stacking

Train a meta-model on OOF predictions:

# See references/ensembling.md for full stacking guide
# Level 1: Base model OOF predictions become meta-features
meta_features = np.column_stack([oof_cat, oof_xgb, oof_lgb, oof_tabnet])

# Level 2: Meta-model (often simple linear or shallow tree)
from sklearn.linear_model import Ridge
meta_model = Ridge(alpha=1.0)
meta_model.fit(meta_features[train_idx], y[train_idx])
final_pred = meta_model.predict(meta_features)

Neural Approaches for Diversity

Add TabNet or MLP predictions to ensemble for different inductive bias:

# See references/neural_approaches.md
from pytorch_tabnet.tab_model import TabNetRegressor

tabnet = TabNetRegressor(
    n_d=32, n_a=32, n_steps=5,
    gamma=1.5, lambda_sparse=1e-4,
    optimizer_params=dict(lr=2e-2),
    scheduler_params=dict(step_size=50, gamma=0.9),
    mask_type='entmax'
)

Pseudo-Labeling

Turn unlabeled data into training signal (see references/ensembling.md):

# Use soft labels (probabilities) for regularization
# Multi-round pseudo-labeling often outperforms single-pass
# Avoid leakage: use k separate pseudo-label sets for k-fold CV

Extra Training (Final Polish)

Two techniques that squeeze out extra performance after optimization:

Seed Averaging: Train identical models with different random seeds, average predictions:

# Reduces variance, improves robustness
seeds = [42, 123, 456, 789, 1000]
predictions = []

for seed in seeds:
    model = CatBoostRegressor(**deep_params, random_seed=seed)
    model.fit(X_train, y_train)
    predictions.append(model.predict(X_test))

final_pred = np.mean(predictions, axis=0)

Retrain on 100% Data: After finding optimal hyperparameters via CV, retrain final model on all data:

# After hyperparameter search with CV
best_params = {...}  # Found via CV

# Final model uses ALL training data
final_model = CatBoostRegressor(**best_params)
final_model.fit(X_full, y_full)  # No holdout
final_predictions = final_model.predict(X_test)