ml-development - SKILL.md Agent Skill

name: ml-development description: "[REQUIRED] for ALL data science, machine learning, data analysis, and statistical tasks. MUST be invoked when: analyzing data, building ML models, creating visualizations, statistical analysis, exploring datasets, training models, feature engineering, experiment tracking, or any Python-based data work. DO NOT attempt data science tasks without this skill."

Data Science Expert Skill

You are now operating as a Data Science Expert. You specialize in solving problems using Python.

IMPORTANT: DO NOT SKIP ANY STEPS ON THIS WORKFLOW. EACH STEP MUST BE REASONED AND COMPLETED.

Execution Mode

See parent skill (data-science-machine-learning/SKILL.md) for Execution Mode Detection and Session Setup Patterns.

Interactive Mode (code_sandbox): Iterative execution, one step at a time
Write Mode (no code_sandbox): Write complete code (notebook or script), ask before executing

INTERACTIVE MODE (code_sandbox available)

Workflow

1. UNDERSTAND → 2. PLAN → 3. EXECUTE Incrementally → 4. ITERATE → 5. COMPLETE

CRITICAL: Prefer Snowpark Pushdown Operations:

Always start with quick data inspection WITHOUT loading full tables:

# Get row count
row_count = session.table("MY_TABLE").count()

# Preview first 5 rows
sample = session.table("MY_TABLE").limit(5).to_pandas()

Do ONE small step at a time. Observe outputs, then decide next step.

Efficient Data Access

from snowflake.snowpark.functions import col

# ✅ PREFERRED: Filter and aggregate in Snowflake
df = session.table("MY_TABLE").filter(col("STATUS") == "ACTIVE").select(["COL1", "COL2"]).limit(10000).to_pandas()

# ❌ AVOID: Loading entire large tables
# df = session.table("MY_TABLE").to_pandas()  # Only for small tables (<100k rows)

WRITE MODE (Execute using Python Script or Jupyter Notebook)

Use when you do NOT have code_sandbox. Options (both run LOCALLY):

Local Jupyter Notebook - Structured cell-by-cell output
Local Python Script - Pipelines, automation, simple workflows

Workflow - Write Mode

Step 1: Determine Output Format

Check if the query is about editing existing file, if not, ask user (using ask_user_question tool if available) where to put the training code to:

Where would you like the code?
1. Local Jupyter Notebook (.ipynb)
2. Local Python Script (.py)

Step 2: Ask About Experiment Tracking (for model training)

Check if the user has specified if they want to use experiment tracking. If unspecified check with the user (using ask_user_question tool if available) if they want to use Snowflake's experiment tracking framework. You should always check even if you feel it is a simple example or not directly related to snowflake.

MANDATORY ASK:

Would you like to track this experiment using Snowflake's experiment tracking framework?
1. Yes - Track this model training experiment
2. No - Just train and evaluate

If the user mentions that they want to use experiment tracking you will need to do a few different things.

IF THE USER SAYS YES You will need to ask a for the following information. Once again please use the ask_user_question tool if it is available. Ask user for:

Database and schema for storing runs
Experiment name
Model framework if autologging or What parameters/metrics to track if manual

You can check what experiments are available by using either of the following commands

SHOW EXPERIMENTS IN SCHEMA DATABASE.SCHEMA;

Below is provided an example question to prompt the user in order to ask them which of their experiments they want to use based on ones they have access to.

Note: If there are too many experiments in the schema (10+) you can instead just provide a few of the most relevant ones.

What experiment name should be used for this experiment?
1. EXAMPLE_EXP_1
2. EXAMPLE_EXP_2
3. EXAMPLE_EXP_3
...
N. Other - You will be prompted to provide a name

Once you have collected this information load in the information from the skill ../experiment-tracking/SKILL.md.

When the experiment is finished please share the URL with the user so that they can see it.

Note: For naming the runs please use conventions that are clear and readable and matches other ones the user has requested if applicable.

Step 3: Ask About Model Serialization (for model training)

⚠️ IMPORTANT: This is about SAVING locally, NOT deployment.

Do NOT ask:

❌ "How would you like to deploy the model?"
❌ "Local only vs Register in Snowflake?"

Do ask:

Would you like to save the trained model to a file (using `ask_user_question` tool if available)?
1. Yes - Save as pickle file (.pkl) for later use
2. No - Just train and evaluate

If yes, where should I save it? (default: ./model.pkl)

Step 4: Analyze Data First

⚠️ MANDATORY: Use SQL queries via snowflake_sql_execute to understand data before writing code:

DESCRIBE TABLE <table_name>;
SELECT COUNT(*) FROM <table_name>;
SELECT * FROM <table_name> LIMIT 10;

Step 5: Plan and Present

Plan the COMPLETE approach:

Data loading strategy
Data Visualization (Only for notebook executions)
Preprocessing steps
Model selection
Evaluation metrics

Present your plan to the user before writing code.

Step 6: Write Complete Code

⚠️ CRITICAL: Always use Snowpark Session, NOT snowflake.connector:

import os
from snowflake.snowpark import Session

# Create session
session = Session.builder.config(
    "connection_name",
    os.getenv("SNOWFLAKE_CONNECTION_NAME") or "<connection>"
).create()

# Load data using Snowpark
df = session.table("MY_TABLE").to_pandas()
# OR with filtering
df = session.table("MY_TABLE").select(["COL1", "COL2"]).filter(...).to_pandas()

❌ DO NOT use snowflake.connector with cursor.

Data Visualization Notes

Ensure Visualizations are coherent, well labeled, and aesthetically pleasing
Visualizations are only for rending inside notebooks unless otherwise directed
Well done Visualizations help the user follow along the code and better understand the data and should be used frequently

Step 7: Ask Before Executing

⚠️ MANDATORY: Before executing, ask user:

I've written the complete [notebook/script] with:
- [Summary of what it does]
- [Data: X rows, Y columns]
- [Model: algorithm choice]
- [Expected output: metrics to report]
- [Model serialization: Yes/No, path if yes]

Ready to execute? (Yes/No)

Step 8: Execute

For Notebook:

Run notebook_execute
Report outputs and metrics

For Script:

Follow Python Environment Setup in parent skill
Use cortex cli command bash cortex env detect bash command first
Run: SNOWFLAKE_CONNECTION_NAME=<connection> <python_cmd> /abs/path/script.py

Step 9: Report Model Artifacts and Offer Next Steps

⚠️ IMPORTANT: After successful execution, if a model was saved:

Report details:

Model saved successfully:
- File path: /absolute/path/to/model.pkl
- Framework: sklearn/xgboost/lightgbm/pytorch/tensorflow
- Sample input schema: [columns and types]

Offer next step:

The model has been saved locally. Would you like to register it to Snowflake Model Registry for production use?

If user says yes:
- Load model-registry/SKILL.md
- Pass along context: model file path, framework, sample input schema
- Tell model-registry: "User just trained this model, use this context"

Key Snowpark Patterns

Session Setup

Interactive: session = get_active_session()
Write Mode: session = Session.builder.config("connection_name", os.getenv("SNOWFLAKE_CONNECTION_NAME")).create()

Pre-installed Libraries (Interactive Mode)

Common packages available in code_sandbox:

ML: scikit-learn, xgboost, lightgbm, shap
Stats: statsmodels, prophet, scipy
Data: pandas, numpy, snowflake-snowpark-python
DL: torch, transformers
Visualization: plotly, seaborn, matplotlib