csc311-data-exploration - SKILL.md Agent Skill

name: csc311-data-exploration
description: >
  Perform thorough, structured data exploration for the CSC311 ML classification project. Use this skill whenever the user asks to explore a dataset for a machine learning project, perform EDA (exploratory data analysis), analyze features, check for data issues, plan preprocessing, or generate data exploration reports. Also trigger when the user mentions CSC311, project proposal, data exploration section, or asks about class balance, missing values, outliers, text features, or data splitting strategies. This skill produces publication-ready analysis with figures, following the CSC311 report template requirements.

CSC311 Data Exploration Skill

You are an expert data scientist performing exploratory data analysis for a university ML classification project. The dataset contains student responses that must be classified into one of three categories. Your analysis must be rigorous enough for a graded academic report.

Overview

This skill guides a complete data exploration workflow that produces:

A structured analysis covering all 6 required sections
Publication-quality figures saved as PNG files
A markdown summary report suitable for inclusion in a LaTeX document
Concrete recommendations connecting findings to model selection

Workflow

Execute these phases in order. Complete each phase fully before moving to the next.

Phase 1: Initial Data Audit

Load the CSV and produce a comprehensive first look:

import pandas as pd
import numpy as np

df = pd.read_csv("<path_to_csv>")

# Shape and structure
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print(f"\nColumn names:\n{list(df.columns)}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nFirst 5 rows:\n{df.head()}")
print(f"\nBasic statistics:\n{df.describe(include='all')}")

Classify every column into one of:

Numerical: continuous or discrete numeric values
Categorical: finite set of labels/categories
Text: free-form text responses (likely the main features)
Identifier: student IDs, row indices (not features)

Record your classification in a summary table.
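One way to get a first-pass classification table is a small heuristic like the sketch below. The thresholds (`max_categories`, `min_avg_words`), the helper name `classify_columns`, and the toy column names are all illustrative assumptions; review the result by hand against the actual data.

```python
import pandas as pd

def classify_columns(df, id_cols=(), max_categories=20, min_avg_words=3):
    """Heuristically label each column; thresholds are illustrative assumptions."""
    rows = []
    for col in df.columns:
        series = df[col].dropna()
        if col in id_cols:
            kind = "Identifier"
        elif pd.api.types.is_numeric_dtype(series):
            kind = "Numerical"
        else:
            as_text = series.astype(str)
            avg_words = as_text.str.split().str.len().mean() if len(as_text) else 0
            # Few distinct, short values look categorical; everything else is free text
            if series.nunique() <= max_categories and avg_words < min_avg_words:
                kind = "Categorical"
            else:
                kind = "Text"
        rows.append({"column": col, "type": kind, "n_unique": series.nunique()})
    return pd.DataFrame(rows)

# Toy data with hypothetical column names, only to show the output shape
toy = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "rating": [3, 5, 4, 2],
    "setting": ["home", "school", "home", "school"],
    "response": [
        "I would use this tool to draft an essay outline",
        "Mostly for checking my math homework step by step",
        "It helps me summarize long readings before class",
        "I use it to debug small pieces of code quickly",
    ],
})
summary = classify_columns(toy, id_cols=("student_id",))
print(summary)
```

Identifier columns are passed in explicitly because numeric IDs are otherwise indistinguishable from numeric features.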

Phase 2: Dataset Summary (Report Section 1)

For each feature type, analyze distributions:

Numerical features:

Compute mean, median, std, min, max, skewness
Generate histograms with KDE overlays
Note any bimodal or heavily skewed distributions
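The numerical statistics above can be gathered in one table, as in this minimal sketch. The `|skew| > 1` cutoff is a common rule of thumb used here as an assumption, and the toy columns are hypothetical.

```python
import pandas as pd

def numeric_summary(df, skew_threshold=1.0):
    """Per-column summary statistics for numeric columns, plus a skew flag."""
    numeric = df.select_dtypes(include="number")
    stats = numeric.agg(["mean", "median", "std", "min", "max", "skew"]).T
    # |skew| > 1 is a rule-of-thumb threshold, not a requirement of the course
    stats["heavily_skewed"] = stats["skew"].abs() > skew_threshold
    return stats

# Toy data: "rating" is symmetric, "hours" has one extreme value
toy = pd.DataFrame({"rating": [1, 2, 3, 4, 5], "hours": [1, 2, 2, 3, 40]})
stats = numeric_summary(toy)
print(stats.round(2))
```

Columns flagged as heavily skewed are candidates for a log transform or robust scaling in the preprocessing plan.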

Categorical features:

Compute value counts and proportions
Generate bar charts

Text features:

Compute document lengths (word count, character count)
Compute vocabulary size and most frequent terms
Generate word frequency bar plots
Optionally generate word clouds (if wordcloud is installed)
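A minimal sketch of the text statistics above, using only pandas and the standard library. The regex tokenizer is a simplifying assumption (it keeps lowercase letters, digits, and apostrophes); swap in whatever tokenizer the preprocessing plan settles on.

```python
import re
from collections import Counter

import pandas as pd

TOKEN_RE = re.compile(r"[a-z0-9']+")  # simple word pattern, an assumption

def text_profile(texts, top_k=10):
    """Return per-document lengths, vocabulary size, and the top_k most frequent terms."""
    docs = pd.Series(texts).fillna("").astype(str)
    tokens = docs.str.lower().apply(TOKEN_RE.findall)
    vocab = Counter(tok for doc in tokens for tok in doc)
    lengths = pd.DataFrame({"n_words": tokens.apply(len), "n_chars": docs.str.len()})
    return lengths, len(vocab), vocab.most_common(top_k)

# Toy responses; None stands in for a missing response
toy_texts = ["The cat sat", "the dog sat down", None]
lengths, vocab_size, top_terms = text_profile(toy_texts, top_k=2)
print(lengths)
print(vocab_size, top_terms)
```

The `top_terms` list feeds directly into a word frequency bar plot, and `lengths` into a histogram of document lengths per class.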

Class balance:

Compute class distribution (counts and percentages)
Generate a class distribution bar chart
Assess whether classes are balanced, slightly imbalanced, or heavily imbalanced
If imbalanced, note implications for model training and evaluation metrics
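The class balance check can be sketched as follows. The ratio cutoffs (1.5 and 3) that separate "balanced", "slightly imbalanced", and "heavily imbalanced" are assumptions for illustration; state whichever thresholds you actually use in the report.

```python
import pandas as pd

def class_balance(labels, slight=1.5, heavy=3.0):
    """Return a count/percentage table, the max/min class ratio, and a verdict."""
    counts = pd.Series(labels).value_counts()
    report = pd.DataFrame({
        "count": counts,
        "pct": (counts / counts.sum() * 100).round(1),
    })
    ratio = counts.max() / counts.min()
    if ratio <= slight:
        verdict = "balanced"
    elif ratio <= heavy:
        verdict = "slightly imbalanced"
    else:
        verdict = "heavily imbalanced"
    return report, ratio, verdict

# Toy labels, deliberately imbalanced
toy_labels = ["A"] * 12 + ["B"] * 10 + ["C"] * 3
report, ratio, verdict = class_balance(toy_labels)
print(report)
print(f"max/min ratio = {ratio:.2f} -> {verdict}")
```

When the verdict is not "balanced", prefer macro-averaged F1 or per-class recall over plain accuracy in later evaluation.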

Save all figures to an outputs/figures/ directory with descriptive filenames.

Phase 3: Data Issues (Report Section 2)

Systematically check for and document:

Missing values:

missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_report = pd.DataFrame({'count': missing, 'pct': missing_pct})
print(missing_report[missing_report['count'] > 0])

For each column with missing values, recommend a handling strategy:

Drop rows (if very few)
Impute with mean/median/mode (if moderate)
Create a "missing" indicator feature (if missingness is informative)

Outliers:

For numerical columns, apply both the IQR method and z-scores, and compare which values each one flags
Visualize with box plots
Recommend: keep, clip, or remove — with justification
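Both outlier checks can be run side by side, as in this sketch. The 1.5 x IQR fence and the |z| > 3 cutoff are the conventional defaults, used here as assumptions; the toy values are made up.

```python
import pandas as pd

def outlier_flags(series, iqr_k=1.5, z_thresh=3.0):
    """Flag each non-missing value under the IQR rule and the z-score rule."""
    s = series.dropna().astype(float)
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    iqr_mask = (s < q1 - iqr_k * iqr) | (s > q3 + iqr_k * iqr)
    std = s.std(ddof=0)
    # Guard against constant columns, where the z-score is undefined
    z = (s - s.mean()) / std if std > 0 else pd.Series(0.0, index=s.index)
    return pd.DataFrame({
        "value": s,
        "iqr_outlier": iqr_mask,
        "z_outlier": z.abs() > z_thresh,
    })

# Toy column with one extreme value at the end
toy = pd.Series([10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 95])
flags = outlier_flags(toy)
print(flags[flags["iqr_outlier"] | flags["z_outlier"]])
```

The two rules can disagree on small or skewed samples, because a single extreme value inflates the standard deviation used by the z-score. Report both counts and justify the one you act on.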

Inconsistencies:

For text: check for encoding issues, extra whitespace, HTML artifacts
For categorical: check for near-duplicate labels (case differences, typos)
For all: check for duplicate rows
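The consistency checks above might look like the following sketch. The helper names and the specific patterns (HTML tags, HTML entities, the Unicode replacement character) are assumptions about what is worth looking for, not an exhaustive list.

```python
import pandas as pd

def near_duplicate_labels(series):
    """Group raw labels that collapse to the same value after strip + lowercase."""
    raw = series.dropna().astype(str)
    normalized = raw.str.strip().str.lower()
    groups = raw.groupby(normalized).unique()
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}

def text_artifact_counts(texts):
    """Count responses containing common extraction or encoding artifacts."""
    t = pd.Series(texts).fillna("").astype(str)
    padded = t != t.str.strip()
    return {
        "html_tags": int(t.str.contains(r"<[^>]+>", regex=True).sum()),
        "html_entities": int(t.str.contains(r"&[a-z]+;|&#\d+;", regex=True).sum()),
        "extra_whitespace": int((t.str.contains(r"\s{2,}", regex=True) | padded).sum()),
        "replacement_char": int(t.str.contains("\ufffd", regex=False).sum()),
    }

# Toy data with hypothetical values
labels = pd.Series(["Yes", "yes ", " YES", "No", "no", "Maybe"])
texts = ["fine text", "has  double space", "<p>html</p>", "bad \ufffd char", " leading"]
toy_df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

dupes = near_duplicate_labels(labels)
artifacts = text_artifact_counts(texts)
n_duplicate_rows = int(toy_df.duplicated().sum())
print(dupes, artifacts, n_duplicate_rows)
```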

Noise:

For text: check for very short or empty responses that may be low-effort
Flag any suspicious patterns
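A simple way to surface low-effort responses is sketched below. The `min_words=3` cutoff and the idea of treating verbatim repeats as suspicious are assumptions; tune both after reading a sample of the flagged rows.

```python
import pandas as pd

def flag_low_effort(texts, min_words=3):
    """Flag responses that are very short or repeated verbatim elsewhere."""
    docs = pd.Series(texts).fillna("").astype(str).str.strip()
    n_words = docs.str.split().str.len()
    return pd.DataFrame({
        "text": docs,
        "n_words": n_words,
        "too_short": n_words < min_words,
        # Identical non-empty responses appearing more than once
        "repeated": docs.duplicated(keep=False) & (docs != ""),
    })

# Toy responses; None stands in for a missing response
toy = ["", "idk", "This is a thoughtful answer about usage", "idk", None]
flags = flag_low_effort(toy)
print(flags)
```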

For every issue found, state:

What the issue is
How many rows/values are affected
Your recommended handling approach and why

Phase 4: Preprocessing Plan (Report Section 3)

Based on findings from Phases 2-3, outline a concrete preprocessing pipeline:

Text representation (most likely the core preprocessing):

Tokenization approach (word-level, subword)
Lowercasing, punctuation removal, stop word removal
Representation: TF-IDF, Bag-of-Words, or embeddings
Justify the choice based on vocabulary size and document lengths observed
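As one possible instance of the text pipeline above, the sketch below builds a TF-IDF representation with scikit-learn's `TfidfVectorizer`. The parameter values and toy texts are illustrative assumptions. The key point is that the vectorizer is fitted on training text only and then applied to validation text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy responses standing in for the real train/validation text columns
train_texts = [
    "I use the chatbot to outline essays",
    "It helps me debug Python code",
    "I summarize long readings with it",
]
val_texts = ["The chatbot helps me outline lab reports"]

vectorizer = TfidfVectorizer(
    lowercase=True,          # lowercasing
    stop_words="english",    # built-in English stop word list
    ngram_range=(1, 2),      # unigrams and bigrams
    sublinear_tf=True,       # dampen the effect of repeated terms
)
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF on train only
X_val = vectorizer.transform(val_texts)          # reuse them; unseen words are ignored

print(X_train.shape, X_val.shape)
print(len(vectorizer.vocabulary_), "terms in the vocabulary")
```

Fitting on the full dataset would leak validation and test vocabulary into the IDF weights, so keep the `fit_transform` call on the training split.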

Numerical features:

StandardScaler vs MinMaxScaler — choose based on distribution shape
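Decide whether to handle outliers before or after scaling (clipping before StandardScaler keeps extreme values from distorting the mean and standard deviation)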

Categorical features:

One-hot encoding vs label encoding — choose based on cardinality
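The numerical and categorical steps can be combined in a single scikit-learn `ColumnTransformer`, as in this sketch. The column names `hours` and `setting` are hypothetical, and `StandardScaler` plus one-hot encoding is just one of the options listed above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames with hypothetical columns; "library" never appears in training
train = pd.DataFrame({"hours": [1.0, 2.0, 3.0, 4.0],
                      "setting": ["home", "school", "home", "work"]})
val = pd.DataFrame({"hours": [2.5, 10.0],
                    "setting": ["school", "library"]})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["hours"]),
        # handle_unknown="ignore" encodes unseen categories as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["setting"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)
X_train = preprocess.fit_transform(train)  # statistics come from train only
X_val = preprocess.transform(val)

print(X_train.round(2))
print(X_val.round(2))
```

One-hot encoding suits low-cardinality columns like this one. For columns with many distinct values, the resulting width is worth noting in the preprocessing plan.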

Feature engineering ideas (optional but valuable):

Text length as a feature
Sentiment scores
N-gram features
PCA for dimensionality reduction on high-dimensional text features (or TruncatedSVD, which operates directly on sparse TF-IDF matrices)
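A few of these ideas can be sketched together: length features, bigram TF-IDF, and a low-rank projection. `TruncatedSVD` is swapped in for PCA here because it accepts sparse input without densifying it; `n_components=2` and the toy texts are arbitrary assumptions for illustration. Sentiment scores are omitted because they need an extra library.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy responses standing in for the training text column
texts = pd.Series([
    "short answer",
    "I use it to outline essays before writing",
    "It helps me debug Python code quickly",
    "I summarize long readings with it",
    "Mostly for checking math homework",
    "I ask it to explain lecture notes",
])

# Text length as a feature
length_features = pd.DataFrame({
    "n_chars": texts.str.len(),
    "n_words": texts.str.split().str.len(),
})

# N-gram features: unigrams plus bigrams
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_sparse = tfidf.fit_transform(texts)

# Low-rank projection of the sparse matrix
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

print(length_features.head())
print(X_sparse.shape, "->", X_reduced.shape)
```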

Phase 5: Data Splitting Strategy (Report Section 4)

This is critical for correctness. Implement and explain:

from sklearn.model_selection import GroupShuffleSplit

# Identify the student ID column (the grouping key)
student_id_col = "<student_id_column>"  # Determine this from the data

# All 3 data points from the same student MUST stay in the same split
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_val_idx, test_idx = next(gss.split(df, groups=df[student_id_col]))

# Further split train_val into train and validation
df_train_val = df.iloc[train_val_idx]
gss2 = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)  # 0.25 of 0.8 = 0.2
train_idx, val_idx = next(gss2.split(df_train_val, groups=df_train_val[student_id_col]))

You MUST explicitly address:

Why grouped splitting: each student contributed 3 related data points (one per class). If one student's rows land in different splits, the model can score well by recognizing student-specific patterns instead of learning generalizable class patterns. This is data leakage.
The split ratios: e.g., 60/20/20 or 70/15/15 — justify the choice
That the test set is reserved and untouched during all exploration and model development
Verify no leakage: confirm no student ID appears in multiple splits

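# Verification
# train_idx and val_idx are positions within df_train_val (not df), so index that frame
train_students = set(df_train_val.iloc[train_idx][student_id_col])
val_students = set(df_train_val.iloc[val_idx][student_id_col])
test_students = set(df.iloc[test_idx][student_id_col])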
assert len(train_students & val_students) == 0, "Leakage between train and val!"
assert len(train_students & test_students) == 0, "Leakage between train and test!"
assert len(val_students & test_students) == 0, "Leakage between val and test!"
print("No leakage detected.")
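The two-stage split and its verification can be wrapped in one helper, sketched below on toy data. The function name `grouped_three_way_split` and the toy frame are assumptions. Because `GroupShuffleSplit` returns positional indices relative to the frame it was given, the inner split indexes the train+val frame rather than the original one.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def grouped_three_way_split(df, group_col, test_size=0.2, val_size=0.2, seed=42):
    """Split df into train/val/test so that no group spans two splits."""
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_val_pos, test_pos = next(outer.split(df, groups=df[group_col]))
    df_train_val, df_test = df.iloc[train_val_pos], df.iloc[test_pos]

    # Rescale so val_size is a fraction of the whole dataset, not of train+val
    inner_frac = val_size / (1 - test_size)
    inner = GroupShuffleSplit(n_splits=1, test_size=inner_frac, random_state=seed)
    train_pos, val_pos = next(inner.split(df_train_val, groups=df_train_val[group_col]))
    return df_train_val.iloc[train_pos], df_train_val.iloc[val_pos], df_test

# Toy data: 20 hypothetical students, 3 rows each (one per class)
toy = pd.DataFrame({
    "student_id": [s for s in range(20) for _ in range(3)],
    "label": ["A", "B", "C"] * 20,
})
train, val, test = grouped_three_way_split(toy, "student_id")

ids = [set(part["student_id"]) for part in (train, val, test)]
assert not (ids[0] & ids[1]) and not (ids[0] & ids[2]) and not (ids[1] & ids[2])
for name, part in zip(("train", "val", "test"), (train, val, test)):
    print(f"{name}: {len(part)} rows ({len(part) / len(toy):.0%})")
```

Printing the realized row percentages is worthwhile because the split is made over groups, so the row-level ratios only match the requested ones when every student has the same number of rows.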

Phase 6: Connect to Model Choices (Report Section 5)

Based on everything discovered, write a section connecting findings to model selection. Use this decision framework:

| Finding | Suggested Model Family | Reasoning |
| --- | --- | --- |
| High-dimensional sparse text (TF-IDF) | Linear models (Logistic Regression, Linear SVM) | Effective in high-dim sparse spaces |
| Nonlinear patterns in numerical features | Tree-based (Random Forest, Gradient Boosting) | Capture nonlinear decision boundaries |
| Small dataset size | Simpler models, strong regularization | Avoid overfitting |
| Class imbalance | Models with class_weight support | Handle imbalance natively |
| Mixed feature types | Tree-based or ensemble methods | Handle heterogeneous features naturally |
| Complex text semantics | Neural approaches (if dataset is large enough) | Capture semantic meaning |

Be specific — don't just list models generically. Connect each recommendation to a concrete observation from your analysis.
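To back a recommendation with evidence, a quick baseline can be scored with grouped cross-validation on the training split, as in this sketch. It pairs the first table row (sparse TF-IDF with a linear model) with `class_weight="balanced"`. The toy data, column names, and hyperparameters are assumptions, and the scores from toy data mean nothing beyond showing the mechanics.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy training data: 6 hypothetical students, one response per class each
students = [f"s{i}" for i in range(6)]
class_phrases = {
    "A": "drafting essay outlines",
    "B": "debugging python code",
    "C": "summarizing long readings",
}
rows = [
    (sid, label, f"I mostly use it for {phrase}")
    for sid in students
    for label, phrase in class_phrases.items()
]
train_df = pd.DataFrame(rows, columns=["student_id", "label", "response"])

baseline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
# GroupKFold keeps each student's rows inside a single fold
scores = cross_val_score(
    baseline,
    train_df["response"],
    train_df["label"],
    groups=train_df["student_id"],
    cv=GroupKFold(n_splits=3),
    scoring="f1_macro",
)
print(f"macro-F1 per fold: {scores.round(3)}")
```

Because the vectorizer sits inside the pipeline, it is refitted on each fold's training portion, which keeps the cross-validation estimate free of vocabulary leakage.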

Phase 7: Generate the Report

After all analysis is complete, compile everything into a structured markdown report:

# Data Exploration

## 1. Dataset Summary
[Feature types table, distribution descriptions, class balance analysis]
[Reference figures by filename]

## 2. Data Issues
[Missing values, outliers, inconsistencies — with handling strategies]

## 3. Preprocessing Plan
[Transformations with justifications]

## 4. Data Splitting Strategy
[Grouped splitting explanation, leakage prevention, verification]

## 5. Key Insights and Model Choice Connections
[Findings → model family recommendations]

Save this report as outputs/data_exploration_report.md.

Figure Standards

All figures must:

Have descriptive titles
Have labeled axes with units where applicable
Use a consistent color palette (e.g., tab10 or a custom 3-color scheme for the 3 classes)
Be saved at 300 DPI as PNG
Be sized appropriately (typically 8x6 or 10x6 inches)
Include a brief caption/explanation in the report

import os

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8-whitegrid')
COLORS = ['#2ecc71', '#3498db', '#e74c3c']  # Green, Blue, Red for 3 classes
FIG_DIR = 'outputs/figures'
os.makedirs(FIG_DIR, exist_ok=True)

def save_fig(fig, name):
    """Save a figure as a 300 DPI PNG under FIG_DIR, then close it to free memory."""
    fig.savefig(os.path.join(FIG_DIR, f"{name}.png"), dpi=300, bbox_inches='tight')
    plt.close(fig)
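A figure that meets these standards might be produced as in the sketch below, which draws a class distribution bar chart. The class names and counts are made up, and the `Agg` backend is selected only so the example runs without a display.

```python
import os

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

COLORS = ["#2ecc71", "#3498db", "#e74c3c"]  # one color per class
FIG_DIR = "outputs/figures"
os.makedirs(FIG_DIR, exist_ok=True)

# Made-up counts for three hypothetical classes
class_counts = {"Class A": 120, "Class B": 115, "Class C": 125}

fig, ax = plt.subplots(figsize=(8, 6))
ax.bar(list(class_counts), list(class_counts.values()), color=COLORS)
ax.set_title("Class Distribution (toy data)")
ax.set_xlabel("Class")
ax.set_ylabel("Number of responses")

out_path = os.path.join(FIG_DIR, "class_distribution.png")
fig.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close(fig)
print(f"Saved {out_path}")
```

In the report, reference the file by name and follow it with a short caption stating what the chart shows and why it matters.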

Common Pitfalls to Avoid

Do NOT use the test set for any exploration, visualization, or statistics. All analysis uses the full dataset BEFORE splitting, or the training set AFTER splitting. State this explicitly.
Do NOT just paste plots without interpretation. Every figure needs 1-3 sentences explaining what it shows and why it matters.
Do NOT skip the leakage verification step. This is specifically called out in the rubric.
Do NOT present surface-level analysis. "There are no missing values" is fine if true, but also check for implicit missing values (empty strings, "N/A", "none", etc.).
Do NOT forget to connect findings to model choices. This is what separates a good report from a great one.
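The implicit missing value check mentioned above can be sketched as follows. The token list in `IMPLICIT_MISSING` is an assumption; extend it after looking at the actual responses.

```python
import pandas as pd

# Placeholder strings treated as missing; an assumed, non-exhaustive list
IMPLICIT_MISSING = {"", "n/a", "na", "none", "null", "nan", "-", "?"}

def implicit_missing_report(df, tokens=IMPLICIT_MISSING):
    """Count explicit NaN and placeholder-style missing values per non-numeric column."""
    rows = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        # Drop real NaN first so it is not double counted as the string "nan"
        present = df[col].dropna().astype(str).str.strip().str.lower()
        rows.append({
            "column": col,
            "explicit_nan": int(df[col].isna().sum()),
            "implicit_missing": int(present.isin(tokens).sum()),
        })
    return pd.DataFrame(rows)

# Toy data with hypothetical column names
toy = pd.DataFrame({
    "response": ["Great tool", "", "N/A", None, " none "],
    "setting": ["home", "home", "school", "?", "home"],
    "rating": [1, 2, 3, 4, 5],
})
report = implicit_missing_report(toy)
print(report)
```

A column can show zero under `df.isnull().sum()` and still have a meaningful share of placeholder values, so report both counts.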

Additional References

For detailed preprocessing guidance for specific feature types, see:

references/text_preprocessing.md — detailed text feature analysis patterns
references/reporting_checklist.md — final checklist before submission