ds-instrumental-variables - SKILL.md Agent Skill

name: ds-instrumental-variables
compatibility: opencode
completeness: 95
content-types:
- code
- guidance
- do-dont
- examples
description: '"Uses instrumental variables (IV), two-stage least squares (2SLS), and IV estimation to identify causal effects in observational data"'
license: MIT
maturity: stable
metadata:
  domain: coding
  output-format: code
  related-skills: ds-causal-inference, ds-linear-regression, ds-observational-studies
  role: implementation
  scope: implementation
  triggers: instrumental variables, IV, 2SLS, endogeneity, causal effect, how do i handle endogeneity
  archetypes:
  - tactical
  - generation
  anti_triggers:
  - brainstorming
  - vague ideation
  - code golf
  - over-engineering
  response_profile:
    verbosity: low
    directive_strength: high
    abstraction_level: operational
  version: "1.0.0"

Instrumental Variables

A guide to instrumental variables (IV) estimation, including two-stage least squares (2SLS), for identifying causal effects in observational data.

When to Use This Skill

Estimating a causal effect from observational data when a regressor is endogenous (correlated with the error term)
Building data science pipelines that include an IV or 2SLS estimation step
Implementing best practices for instrumental variables
Reducing endogeneity bias in coefficient estimates using instrumental variables techniques
Learning industry-standard approaches to instrumental variables

When NOT to Use This Skill

When using pre-built libraries without understanding underlying concepts
For toy problems that don't require instrumental variables rigor
When domain expertise in a specific problem calls for a different approach
If your problem doesn't require the complexity this skill provides

Purpose and Key Concepts

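Instrumental variables (IV) estimation recovers a causal effect when a regressor is endogenous, using an instrument: a variable that is correlated with the regressor but affects the outcome only through it. This skill covers: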

Theoretical foundations — Mathematical principles and statistical concepts
Practical implementation — Working code examples and patterns
Common pitfalls — Mistakes to avoid and how to recover from them
Best practices — Industry-standard approaches and optimization techniques
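For reference, the linear IV setup behind the patterns in this skill can be written compactly. The notation below (y, X, Z, beta, u, P_Z, and k for the number of columns of X) is introduced here for illustration and does not appear elsewhere in the skill.

```latex
\begin{aligned}
y &= X\beta + u, \qquad \mathbb{E}[Z^\top u] = 0 \ \text{(exogeneity)}, \qquad \operatorname{rank}\,\mathbb{E}[Z^\top X] = k \ \text{(relevance)} \\
P_Z &= Z (Z^\top Z)^{-1} Z^\top, \qquad \hat{X} = P_Z X \\
\hat{\beta}_{\text{2SLS}} &= (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top y = (X^\top P_Z X)^{-1} X^\top P_Z y
\end{aligned}
```

The second equality holds because P_Z is symmetric and idempotent, so regressing y on the first-stage fitted values gives the same coefficients as the closed form.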

Core Workflow

Understand the problem — Clearly define what you're solving for
Select approach — Choose the right technique for your data and constraints
Implement solution — Write clean, tested code following best practices
Validate results — Verify your implementation with tests and validation
Optimize performance — Improve efficiency and accuracy incrementally
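The workflow above can be sketched end to end on simulated data. This is a minimal illustration, assuming a single instrument and a single endogenous regressor. The helper names `ols_slope` and `iv_slope`, the data-generating process, and the true effect of 2.0 are assumptions made for the example, not part of this skill.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from regressing y on x with an intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def iv_slope(z, x, y):
    """Single-instrument IV (Wald) estimate: cov(z, y) / cov(z, x)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# 1. Understand the problem: x is endogenous because the unobserved c drives x and y
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)                      # instrument (assumed valid by construction)
c = rng.normal(size=n)                      # unobserved confounder
x = 0.8 * z + c + rng.normal(size=n)        # treatment depends on z and c
y = 2.0 * x + 1.5 * c + rng.normal(size=n)  # true causal effect of x is 2.0

# 2-3. Select and implement the approach: compare naive OLS with IV
beta_ols = ols_slope(x, y)
beta_iv = iv_slope(z, x, y)

# 4. Validate: check instrument relevance through the first-stage correlation
first_stage_corr = np.corrcoef(z, x)[0, 1]

print(f"OLS slope: {beta_ols:.3f}")
print(f"IV slope:  {beta_iv:.3f}")
print(f"corr(z, x): {first_stage_corr:.3f}")
```

In this setup OLS overstates the effect because the confounder pushes x and y in the same direction, while the IV estimate should land close to the simulated value of 2.0.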

Implementation Patterns

Pattern 1: Basic Instrumental Variables

import pandas as pd
import numpy as np
import statsmodels.api as sm

def basic_iv_2sls(X: pd.DataFrame, Z: pd.DataFrame, y: pd.Series) -> dict:
    """
    Perform Two-Stage Least Squares (2SLS) estimation.
    X: Endogenous regressors (n, k)
    Z: Instruments (n, l), with l >= k
    y: Dependent variable (n,)
    """
    if X.shape[0] != Z.shape[0] or X.shape[0] != y.shape[0]:
        raise ValueError("Sample sizes must match")
    if Z.shape[1] < X.shape[1]:
        raise ValueError("Need at least as many instruments as endogenous regressors")

    # Stage 1: regress each endogenous regressor on the instruments (plus a constant)
    Z_with_const = sm.add_constant(Z)
    X_hat = pd.DataFrame(index=X.index)
    stage1_f = {}
    for col in X.columns:
        stage1_results = sm.OLS(X[col], Z_with_const).fit()
        X_hat[col] = stage1_results.fittedvalues
        stage1_f[col] = float(stage1_results.fvalue)

    # Stage 2: regress y on a constant and the predicted X_hat
    stage2_results = sm.OLS(y, sm.add_constant(X_hat)).fit()

    # The OLS standard errors from stage 2 are not valid for 2SLS: the residual
    # variance must be computed with the actual X rather than X_hat
    resid = y.values - sm.add_constant(X).values @ stage2_results.params.values
    sigma2 = float(resid @ resid) / stage2_results.df_resid
    std_errors = np.sqrt(sigma2 * np.diag(stage2_results.normalized_cov_params))

    # R-squared and the overall F-statistic of the second-stage OLS are not
    # meaningful for IV, so they are not reported
    return {
        'coefficients': stage2_results.params.values,
        'standard_errors': std_errors,
        'stage1_f_statistic': stage1_f,
    }

# Generate synthetic data with endogeneity
np.random.seed(42)
n = 500
Z = np.random.randn(n, 2)  # Instruments
v = np.random.randn(n)  # Unobserved confounder that drives both X and y
X = 0.5 * Z[:, 0] + 0.3 * Z[:, 1] + v  # Endogenous regressor
u = 0.8 * v + np.random.randn(n) * 0.5  # Error term, correlated with X through v
y = 2.0 * X + u  # True effect is 2.0; Z affects y only through X
df_X = pd.DataFrame(X, columns=['x1'])
df_Z = pd.DataFrame(Z, columns=['z1', 'z2'])
series_y = pd.Series(y, name='y')

results = basic_iv_2sls(df_X, df_Z, series_y)
print(f"IV Coefficients: {results['coefficients']}")
print(f"Stage 1 F-stats: {results['stage1_f_statistic']}")
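As a cross-check on the two-step procedure, the same estimate can be obtained from the closed-form 2SLS expression. The sketch below uses NumPy only; the function names `two_stage` and `closed_form` and the simulated data are assumptions made for illustration. Both design matrices are assumed to already contain a constant column.

```python
import numpy as np

def two_stage(y, X, Z):
    """2SLS via two explicit least-squares fits (X and Z include a constant column)."""
    first_stage_coefs = np.linalg.lstsq(Z, X, rcond=None)[0]
    X_hat = Z @ first_stage_coefs                      # fitted values of every column of X
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

def closed_form(y, X, Z):
    """2SLS via (X' P_Z X)^{-1} X' P_Z y without forming the n x n projection matrix."""
    ZtZ_inv_ZtX = np.linalg.solve(Z.T @ Z, Z.T @ X)    # (Z'Z)^{-1} Z'X
    ZtZ_inv_Zty = np.linalg.solve(Z.T @ Z, Z.T @ y)    # (Z'Z)^{-1} Z'y
    A = X.T @ Z @ ZtZ_inv_ZtX                          # X' P_Z X
    b = X.T @ Z @ ZtZ_inv_Zty                          # X' P_Z y
    return np.linalg.solve(A, b)

# Simulated over-identified case: two instruments, one endogenous regressor
rng = np.random.default_rng(1)
n = 5_000
z = rng.normal(size=(n, 2))
v = rng.normal(size=n)                                 # unobserved confounder
x = 0.5 * z[:, 0] + 0.3 * z[:, 1] + v
y = 1.0 + 2.0 * x + 0.8 * v + 0.5 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_two_stage = two_stage(y, X, Z)
beta_closed_form = closed_form(y, X, Z)
print(beta_two_stage, beta_closed_form)
```

The two routes should agree to numerical precision, which makes this a cheap regression test for a hand-written 2SLS implementation.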

Pattern 2: Production-Ready Instrumental Variables

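import logging
from typing import Any, Dict, List, Optional

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

logger = logging.getLogger(__name__)

class InstrumentalVariables:
    """Production implementation of Instrumental Variables (2SLS)"""

    def __init__(self, instrument_cols: List[str], endog_cols: List[str],
                 exog_cols: Optional[List[str]] = None, target_col: str = 'target',
                 alpha: float = 0.05):
        self.instrument_cols = instrument_cols
        self.endog_cols = endog_cols
        self.exog_cols = exog_cols or []
        self.target_col = target_col  # Outcome column; defaults to 'target'
        self.alpha = alpha
        self.stage1_models: Dict[str, Any] = {}
        self.model = None
        self.results = None      # Second-stage OLS fit (point estimates only)
        self.std_errors = None   # 2SLS-corrected standard errors

    def _validate_data(self, data: pd.DataFrame, need_target: bool = True) -> None:
        required = self.instrument_cols + self.endog_cols + self.exog_cols
        if need_target:
            required = required + [self.target_col]
        missing = [col for col in required if col not in data.columns]
        if missing:
            raise ValueError(f"Missing columns: {missing}")
        if data[required].isnull().any().any():
            raise ValueError("Input data contains NaN values")
        if len(self.instrument_cols) < len(self.endog_cols):
            raise ValueError("Need at least as many instruments as endogenous regressors")

    def _stage1_design(self, data: pd.DataFrame) -> pd.DataFrame:
        # First-stage regressors: constant, instruments, exogenous controls
        return sm.add_constant(data[self.instrument_cols + self.exog_cols], has_constant='add')

    def _stage2_design(self, data: pd.DataFrame, endog: pd.DataFrame) -> pd.DataFrame:
        # Second-stage regressors: constant, endogenous columns, exogenous controls
        return sm.add_constant(pd.concat([endog, data[self.exog_cols]], axis=1), has_constant='add')

    def fit(self, data: pd.DataFrame) -> 'InstrumentalVariables':
        self._validate_data(data)
        y = data[self.target_col]
        Z_const = self._stage1_design(data)

        # Stage 1: one regression per endogenous column; keep the fits for predict()
        self.stage1_models = {c: sm.OLS(data[c], Z_const).fit() for c in self.endog_cols}
        X_hat = pd.DataFrame({c: m.fittedvalues for c, m in self.stage1_models.items()})

        # Stage 2: regress y on the fitted endogenous values plus exogenous controls
        self.results = sm.OLS(y, self._stage2_design(data, X_hat)).fit()
        self.model = self.results

        # Second-stage OLS standard errors are invalid for 2SLS: recompute the
        # residual variance from the actual (not fitted) endogenous values
        X_actual = self._stage2_design(data, data[self.endog_cols])
        resid = y.values - X_actual.values @ self.results.params.values
        sigma2 = float(resid @ resid) / self.results.df_resid
        self.std_errors = pd.Series(
            np.sqrt(sigma2 * np.diag(self.results.normalized_cov_params)),
            index=self.results.params.index)
        logger.info("IV model fitted on %d rows", len(data))
        return self

    def predict(self, data: pd.DataFrame) -> np.ndarray:
        if self.results is None:
            raise RuntimeError("Model must be fitted before prediction")
        self._validate_data(data, need_target=False)
        # Reuse the stored first-stage fits; never refit stage 1 on prediction data
        Z_const = self._stage1_design(data)
        X_hat = pd.DataFrame(
            {c: np.asarray(m.predict(Z_const)) for c, m in self.stage1_models.items()},
            index=data.index)
        return np.asarray(self.results.predict(self._stage2_design(data, X_hat)))

    def execute(self, data: pd.DataFrame) -> Dict[str, Any]:
        self.fit(data)
        params, se, dof = self.results.params, self.std_errors, self.results.df_resid
        t_crit = stats.t.ppf(1 - self.alpha / 2, dof)
        p_values = pd.Series(2 * stats.t.sf(np.abs(params / se), dof), index=params.index)
        # R-squared, AIC and BIC of the second-stage OLS are not meaningful for 2SLS
        # and are therefore not reported
        return {
            'status': 'success',
            'coefficients': params.to_dict(),
            'standard_errors': se.to_dict(),
            'p_values': p_values.to_dict(),
            'confidence_intervals': {k: (params[k] - t_crit * se[k], params[k] + t_crit * se[k])
                                     for k in params.index},
            'predictions': self.predict(data).tolist(),
        }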
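When exogenous controls enter the first stage, as `exog_cols` does in the class above, the overall first-stage F-statistic mixes the explanatory power of the controls with that of the instruments. A common remedy is a partial F-test on the instruments alone. The following is a minimal NumPy sketch of that test under homoskedastic errors; the function names `rss` and `partial_first_stage_f` and the simulated data are hypothetical and chosen for illustration.

```python
import numpy as np

def rss(y, A):
    """Residual sum of squares from a least-squares regression of y on A."""
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(resid @ resid)

def partial_first_stage_f(x, instruments, controls):
    """F-statistic for excluding the instruments from the first stage, given the controls."""
    n = x.shape[0]
    const = np.ones((n, 1))
    restricted = np.hstack([const, controls])                 # controls only
    unrestricted = np.hstack([const, controls, instruments])  # controls + instruments
    q = instruments.shape[1]                                  # number of restrictions
    dof = n - unrestricted.shape[1]
    rss_r, rss_u = rss(x, restricted), rss(x, unrestricted)
    return ((rss_r - rss_u) / q) / (rss_u / dof)

rng = np.random.default_rng(2)
n = 2_000
w = rng.normal(size=(n, 1))              # exogenous control that strongly predicts x
z_strong = rng.normal(size=(n, 1))       # relevant instrument
z_irrelevant = rng.normal(size=(n, 1))   # unrelated to x by construction
x = 0.5 * z_strong[:, 0] + 2.0 * w[:, 0] + rng.normal(size=n)

f_strong = partial_first_stage_f(x, z_strong, w)
f_irrelevant = partial_first_stage_f(x, z_irrelevant, w)
print(f"partial F (relevant instrument):   {f_strong:.1f}")
print(f"partial F (irrelevant instrument): {f_irrelevant:.1f}")
```

Because the control explains most of the variation in x here, an overall first-stage F would look large even for the irrelevant instrument, whereas the partial F separates the two cases.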

Pattern 3: Anti-Patterns vs Best Practices (BAD vs GOOD)

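import logging
from typing import Any, Dict

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

logger = logging.getLogger(__name__)

# BAD: Hardcoded column names, no validation, ignores instrument relevance
def bad_iv_implementation(df):
    X = df['x']
    Z = df['z']
    y = df['y']
    # No intercept (variables are not demeaned) and no first-stage check
    beta = np.sum(y * Z) / np.sum(X * Z)
    return beta

# GOOD: Validates assumptions, checks first-stage strength, returns structured output
def good_iv_implementation(df: pd.DataFrame, z_col: str, x_col: str, y_col: str) -> Dict[str, Any]:
    # Use the absolute correlation: a strong negative first stage is still relevant
    if abs(df[z_col].corr(df[x_col])) < 0.3:
        raise ValueError("Weak instrument detected: |correlation| < 0.3")
    # Stage 1: regress the endogenous regressor on a constant and the instrument
    stage1 = sm.OLS(df[x_col], sm.add_constant(df[z_col])).fit()
    if stage1.fvalue < 10:
        logger.warning("Weak instrument warning: First-stage F-stat < 10")
    # Stage 2: regress y on a constant and the first-stage fitted values
    X_hat = sm.add_constant(stage1.fittedvalues.rename(x_col))
    iv_res = sm.OLS(df[y_col], X_hat).fit()
    # Correct the standard error using residuals computed from the actual regressor
    resid = df[y_col].values - sm.add_constant(df[x_col]).values @ iv_res.params.values
    sigma2 = float(resid @ resid) / iv_res.df_resid
    se = np.sqrt(sigma2 * np.diag(iv_res.normalized_cov_params))[1]
    coef = iv_res.params.iloc[1]
    p_value = 2 * stats.t.sf(abs(coef / se), iv_res.df_resid)
    return {'coefficient': coef, 'p_value': p_value}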
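One concrete way the BAD version fails is that its ratio of raw cross-products omits the intercept, so it is only correct when the variables happen to have mean zero. The sketch below illustrates this on simulated data; the function names and the data-generating process (instrument mean 1.0, intercept 5.0, true effect 2.0) are assumptions made for the example.

```python
import numpy as np

def ratio_without_intercept(z, x, y):
    """Ratio of raw cross-products, as in the BAD implementation."""
    return np.sum(y * z) / np.sum(x * z)

def ratio_with_intercept(z, x, y):
    """Same ratio after demeaning the instrument, equivalent to including a constant."""
    zc = z - z.mean()
    return np.sum(zc * y) / np.sum(zc * x)

rng = np.random.default_rng(3)
n = 10_000
z = 1.0 + rng.normal(size=n)                            # instrument with non-zero mean
v = rng.normal(size=n)                                  # unobserved confounder
x = 0.5 * z + v
y = 5.0 + 2.0 * x + 0.8 * v + 0.5 * rng.normal(size=n)  # true effect is 2.0

beta_raw = ratio_without_intercept(z, x, y)
beta_demeaned = ratio_with_intercept(z, x, y)
print(f"without intercept: {beta_raw:.2f}")
print(f"with intercept:    {beta_demeaned:.2f}")
```

With a non-zero intercept and a non-zero instrument mean, the raw ratio absorbs the intercept into the slope, while the demeaned version should stay close to the simulated effect.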

Best Practices

✅ Always validate your implementation on test data
✅ Document your assumptions and methodology
✅ Use version control for reproducibility
✅ Monitor performance metrics in production
✅ Periodically review and update your approach
✅ Test with edge cases and outliers
✅ Log all significant operations for debugging

Common Pitfalls

Constraints

MUST DO

Validate all data preprocessing steps are fit-only on training data, never on validation or test sets
Implement reproducible pipelines with fixed random seeds and deterministic operations where possible
Report model performance with confidence intervals via bootstrapping or cross-validation across multiple runs
Log all experiments with parameters, metrics, and artifacts using MLflow or equivalent tracking system

MUST NOT DO

Do not evaluate a model on the same data used for training — always hold out a proper test set
Avoid overfitting to the validation set by limiting hyperparameter search iterations
Never use features that can only be computed at inference time (look-ahead bias)
Do not report single-run accuracy without statistical significance testing or error bars
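Applied to an IV estimate, the requirement to report uncertainty can be met with a nonparametric bootstrap over observations. The following is a minimal sketch, assuming a single instrument, independent observations, and a percentile interval; `iv_slope`, `bootstrap_iv_ci`, and the simulated data are hypothetical names and values chosen for illustration.

```python
import numpy as np

def iv_slope(z, x, y):
    """Single-instrument IV estimate: cov(z, y) / cov(z, x)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

def bootstrap_iv_ci(z, x, y, n_boot=300, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the IV slope."""
    rng = np.random.default_rng(seed)   # fixed seed for reproducibility
    n = len(y)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        estimates[b] = iv_slope(z[idx], x[idx], y[idx])
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

rng = np.random.default_rng(4)
n = 5_000
z = rng.normal(size=n)
c = rng.normal(size=n)                      # unobserved confounder
x = 0.8 * z + c + rng.normal(size=n)
y = 2.0 * x + 1.5 * c + rng.normal(size=n)  # true effect is 2.0

point = iv_slope(z, x, y)
ci_low, ci_high = bootstrap_iv_ci(z, x, y)
print(f"IV estimate: {point:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Bootstrap intervals can be unreliable when the instrument is weak, so a first-stage strength check is worth running before trusting the interval.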

Live References

Authoritative documentation links for this skill's domain. The model follows markdown links at load time to resolve external references and inline content.