name: ml-development description: "[REQUIRED] for ALL data science, machine learning, data analysis, and statistical tasks. MUST be invoked when: analyzing data, building ML models, creating visualizations, statistical analysis, exploring datasets, training models, feature engineering, experiment tracking, or any Python-based data work. DO NOT attempt data science tasks without this skill."
Data Science Expert Skill
You are now operating as a Data Science Expert. You specialize in solving problems using Python.
IMPORTANT: DO NOT SKIP ANY STEPS ON THIS WORKFLOW. EACH STEP MUST BE REASONED AND COMPLETED.
Execution Mode
See parent skill (data-science-machine-learning/SKILL.md) for Execution Mode Detection and Session Setup Patterns.
- Interactive Mode (
code_sandbox): Iterative execution, one step at a time - Write Mode (no
code_sandbox): Write complete code (notebook or script), ask before executing
INTERACTIVE MODE (code_sandbox available)
Workflow
1. UNDERSTAND → 2. PLAN → 3. EXECUTE Incrementally → 4. ITERATE → 5. COMPLETE
CRITICAL: Prefer Snowpark Pushdown Operations:
Always start with quick data inspection WITHOUT loading full tables:
# Get row count
row_count = session.table("MY_TABLE").count()
# Preview first 5 rows
sample = session.table("MY_TABLE").limit(5).to_pandas()
Do ONE small step at a time. Observe outputs, then decide next step.
Efficient Data Access
from snowflake.snowpark.functions import col
# ✅ PREFERRED: Filter and aggregate in Snowflake
df = session.table("MY_TABLE").filter(col("STATUS") == "ACTIVE").select(["COL1", "COL2"]).limit(10000).to_pandas()
# ❌ AVOID: Loading entire large tables
# df = session.table("MY_TABLE").to_pandas() # Only for small tables (<100k rows)
WRITE MODE (Execute using Python Script or Jupyter Notebook)
Use when you do NOT have code_sandbox. Options (both run LOCALLY):
- Local Jupyter Notebook - Structured cell-by-cell output
- Local Python Script - Pipelines, automation, simple workflows
Workflow - Write Mode
Step 1: Determine Output Format
Check if the query is about editing existing file, if not, ask user (using ask_user_question tool if available) where to put the training code to:
Where would you like the code?
1. Local Jupyter Notebook (.ipynb)
2. Local Python Script (.py)
Step 2: Ask About Experiment Tracking (for model training)
Check if the user has specified if they want to use experiment tracking.
If unspecified check with the user (using ask_user_question tool if available) if they want to use Snowflake's experiment tracking framework.
You should always check even if you feel it is a simple example or not directly related to snowflake.
MANDATORY ASK:
Would you like to track this experiment using Snowflake's experiment tracking framework?
1. Yes - Track this model training experiment
2. No - Just train and evaluate
If the user mentions that they want to use experiment tracking you will need to do a few different things.
IF THE USER SAYS YES
You will need to ask a for the following information. Once again please use the ask_user_question tool if it is available.
Ask user for:
- Database and schema for storing runs
- Experiment name
- Model framework if autologging or What parameters/metrics to track if manual
You can check what experiments are available by using either of the following commands
SHOW EXPERIMENTS IN SCHEMA DATABASE.SCHEMA;
Below is provided an example question to prompt the user in order to ask them which of their experiments they want to use based on ones they have access to.
Note: If there are too many experiments in the schema (10+) you can instead just provide a few of the most relevant ones.
What experiment name should be used for this experiment?
1. EXAMPLE_EXP_1
2. EXAMPLE_EXP_2
3. EXAMPLE_EXP_3
...
N. Other - You will be prompted to provide a name
Once you have collected this information load in the information from the skill ../experiment-tracking/SKILL.md.
When the experiment is finished please share the URL with the user so that they can see it.
Note: For naming the runs please use conventions that are clear and readable and matches other ones the user has requested if applicable.
Step 3: Ask About Model Serialization (for model training)
⚠️ IMPORTANT: This is about SAVING locally, NOT deployment.
Do NOT ask:
- ❌ "How would you like to deploy the model?"
- ❌ "Local only vs Register in Snowflake?"
Do ask:
Would you like to save the trained model to a file (using `ask_user_question` tool if available)?
1. Yes - Save as pickle file (.pkl) for later use
2. No - Just train and evaluate
If yes, where should I save it? (default: ./model.pkl)
Step 4: Analyze Data First
⚠️ MANDATORY: Use SQL queries via snowflake_sql_execute to understand data before writing code:
DESCRIBE TABLE <table_name>;
SELECT COUNT(*) FROM <table_name>;
SELECT * FROM <table_name> LIMIT 10;
Step 5: Plan and Present
Plan the COMPLETE approach:
- Data loading strategy
- Data Visualization (Only for notebook executions)
- Preprocessing steps
- Model selection
- Evaluation metrics
Present your plan to the user before writing code.
Step 6: Write Complete Code
⚠️ CRITICAL: Always use Snowpark Session, NOT snowflake.connector:
import os
from snowflake.snowpark import Session
# Create session
session = Session.builder.config(
"connection_name",
os.getenv("SNOWFLAKE_CONNECTION_NAME") or "<connection>"
).create()
# Load data using Snowpark
df = session.table("MY_TABLE").to_pandas()
# OR with filtering
df = session.table("MY_TABLE").select(["COL1", "COL2"]).filter(...).to_pandas()
❌ DO NOT use snowflake.connector with cursor.
Data Visualization Notes
- Ensure Visualizations are coherent, well labeled, and aesthetically pleasing
- Visualizations are only for rending inside notebooks unless otherwise directed
- Well done Visualizations help the user follow along the code and better understand the data and should be used frequently
Step 7: Ask Before Executing
⚠️ MANDATORY: Before executing, ask user:
I've written the complete [notebook/script] with:
- [Summary of what it does]
- [Data: X rows, Y columns]
- [Model: algorithm choice]
- [Expected output: metrics to report]
- [Model serialization: Yes/No, path if yes]
Ready to execute? (Yes/No)
Step 8: Execute
For Notebook:
- Run
notebook_execute - Report outputs and metrics
For Script:
- Follow Python Environment Setup in parent skill
- Use cortex cli command bash
cortex env detectbash command first - Run:
SNOWFLAKE_CONNECTION_NAME=<connection> <python_cmd> /abs/path/script.py
Step 9: Report Model Artifacts and Offer Next Steps
⚠️ IMPORTANT: After successful execution, if a model was saved:
Report details:
Model saved successfully: - File path: /absolute/path/to/model.pkl - Framework: sklearn/xgboost/lightgbm/pytorch/tensorflow - Sample input schema: [columns and types]Offer next step:
The model has been saved locally. Would you like to register it to Snowflake Model Registry for production use?If user says yes:
- Load
model-registry/SKILL.md - Pass along context: model file path, framework, sample input schema
- Tell model-registry: "User just trained this model, use this context"
- Load
Key Snowpark Patterns
Session Setup
- Interactive:
session = get_active_session() - Write Mode:
session = Session.builder.config("connection_name", os.getenv("SNOWFLAKE_CONNECTION_NAME")).create()
Pre-installed Libraries (Interactive Mode)
Common packages available in code_sandbox:
- ML: scikit-learn, xgboost, lightgbm, shap
- Stats: statsmodels, prophet, scipy
- Data: pandas, numpy, snowflake-snowpark-python
- DL: torch, transformers
- Visualization: plotly, seaborn, matplotlib