name: evaluating-machine-learning-models description: | Evaluate trained machine learning models with the right metrics and comparison logic. Use for benchmark review, threshold selection, calibration, validation, and model comparison; not for feature engineering or leakage auditing. allowed-tools: Read, Write, Edit, Grep, Glob, Bash(cmd:*) version: 1.0.0 author: Jeremy Longshore jeremy@intentsolutions.io license: MIT
Model Evaluation Suite
Use this skill when the model exists and the question is whether it is good enough.
Overview
This skill focuses on choosing and interpreting the right evaluation metrics for the problem, then comparing candidate models or thresholds.
When to Use This Skill
- Comparing candidate models with consistent metrics
- Reviewing precision/recall/F1/AUC, regression error, calibration, or ranking quality
- Stress-testing validation strategy before deployment or publication
Not For / Boundaries
- Building the training pipeline itself: use
scikit-learnfor classical modeling orml-pipeline-workflowfor end-to-end workflow ownership - Engineering features: use
preprocessing-data-with-automated-pipelines - Checking train/test contamination: use
ml-data-leakage-guard
Typical Outputs
- Metric suite recommendations
- Model comparison tables
- Notes on threshold tradeoffs, calibration, and validation weaknesses
Related Skills
scikit-learnfor class-level error breakdowns and confusion matricesscientific-reportingwhen the evaluation must become a deliverable