xgboost

name: xgboost description: "[Applies to: **/*.py] Definitive guidelines for writing robust, performant, and maintainable Python code using the xgboost library, focusing on modern best practices and common pitfalls." source: "cursor_mdc"

xgboost Best Practices

This document outlines the definitive coding standards and best practices for developing with xgboost in Python. Adhering to these guidelines ensures your code is efficient, scalable, maintainable, and aligned with the project's official recommendations and modern ML workflows.

1. Code Organization and Structure

Organize your xgboost code for clarity, reusability, and maintainability. Separate concerns: data preparation, model definition, training, evaluation, and logging.

1.1. Modularity with `DMatrix` and `Booster`

For optimal performance and control, especially with large datasets or custom objectives, use xgboost.DMatrix for data handling and xgboost.Booster for the core model. This separates data management from model logic.

❌ BAD: Repeatedly converting data

import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split

# Assume X, y are loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Inefficient for repeated operations or large datasets
model = xgb.train(
    {'objective': 'binary:logistic'},
    xgb.DMatrix(X_train, label=y_train),
    num_boost_round=100
)
# Later, if you need to retrain or predict:
model.predict(xgb.DMatrix(X_test)) # Another conversion

✅ GOOD: Pre-process data into DMatrix once

import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split

# Assume X, y are loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert data to DMatrix once for efficiency
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'tree_method': 'hist', # Modern, faster tree method
    'device': 'cuda' # Leverage GPU if available
}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, 'train'), (dtest, 'eval')]
)
preds = model.predict(dtest)

1.2. Adhere to Python Style Guides

Follow PEP 8 for code style and use NumPy-style docstrings for all functions and classes. This is enforced by xgboost's own contribution guidelines.

❌ BAD: Inconsistent formatting, missing docstrings

def train_model(X,y):
    model=xgb.XGBClassifier()
    model.fit(X,y)
    return model

✅ GOOD: PEP 8 compliant, NumPy-style docstrings

import xgboost as xgb
import numpy as np
from typing import Tuple

def train_xgboost_classifier(
    X_train: np.ndarray, y_train: np.ndarray, n_estimators: int = 100
) -> xgb.XGBClassifier:
    """
    Trains an XGBoost classifier model.

    Parameters
    ----------
    X_train : np.ndarray
        Training features.
    y_train : np.ndarray
        Training labels.
    n_estimators : int, optional
        Number of boosting rounds, by default 100.

    Returns
    -------
    xgb.XGBClassifier
        The trained XGBoost classifier model.
    """
    model = xgb.XGBClassifier(n_estimators=n_estimators, tree_method='hist', random_state=42)
    model.fit(X_train, y_train)
    return model

2. Common Patterns and Anti-patterns

2.1. Scikit-learn API vs. Native API

Use the Scikit-learn API (XGBClassifier, XGBRegressor) for seamless integration into standard ML pipelines (e.g., GridSearchCV, Pipeline). Opt for the native xgboost.train API when you need fine-grained control, custom objectives/metrics, or to leverage DMatrix features like external memory.

✅ GOOD: Scikit-learn API for standard use cases

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    tree_method='hist', # Always prefer 'hist'
    enable_categorical=True, # Enable native categorical support
    random_state=42
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

✅ GOOD: Native API for advanced control (e.g., custom objective, DMatrix features)

import xgboost as xgb
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from typing import Tuple