name: data-analysis-workflow description: Structured data analysis workflow with automatic logging and standardized project organization. Use when starting a new data analysis project or when you need to organize analysis operations, track decisions, and maintain analysis logs in specific directories for easy retrieval and reporting.
Data Analysis Workflow
Overview
This skill provides a standardized structure for organizing data analysis projects with automatic logging of all operations, decisions, and results. It ensures that key operations and analysis logs are systematically recorded in specific directories, making it easy to review, reproduce, and report on analysis work.
Use this skill when:
- Starting a new data analysis project
- Need to track analysis operations systematically
- Want to maintain organized logs of decisions and results
- Preparing analysis for publication or reporting
- Collaborating on data analysis projects
- Need to reproduce or review past analyses
Core Principles
- Standardized Directory Structure: All projects follow consistent organization
- Automatic Logging: Every operation is logged with timestamp and details
- Decision Documentation: Analytical decisions are recorded with reasoning
- Timestamped Analysis: Each analysis run gets a unique timestamped directory
- Progressive Organization: Data flows logically from raw → processed → results
Standard Directory Structure
Every data analysis project follows this structure:
project_root/
├── 00_raw_data/ # Original, unmodified data
│ ├── experiment_1/
│ └── README.md # Data provenance and descriptions
│
├── 01_analysis/ # Analysis scripts and notebooks
│ ├── YYYYMMDD_HHMM_descriptive_name/ # Timestamped analysis directories
│ │ ├── analysis_log.md # Detailed analysis log
│ │ ├── script.py or notebook.ipynb
│ │ └── outputs/
│ └── archive/ # Completed analyses
│
├── 02_processed_data/ # Cleaned, transformed data
│ ├── YYYYMMDD_dataset_name.csv
│ └── processing_notes.md
│
├── 03_results/ # Final analysis outputs
│ ├── figures/
│ ├── tables/
│ └── statistics/
│
├── 04_reports/ # Written reports and summaries
│ ├── YYYYMMDD_report_name.md
│ └── final_manuscript/
│
├── 05_models/ # Trained models and parameters
│ ├── model_v1/
│ └── model_registry.md
│
└── logs/ # Centralized logging
├── MASTER_LOG.md # Complete project operation log
├── decisions.md # Key analytical decisions
└── errors.md # Error tracking and resolutions
Workflow
1. Initialize a New Project
Create the standard directory structure:
python ~/.claude/skills/data-analysis-workflow/scripts/init_project.py project_name
cd project_name
This creates:
- All standard directories
- Initial README files
- Empty MASTER_LOG.md with header
- .gitignore configured for data analysis
2. Log Every Operation
Use the AnalysisLogger class to record operations:
from data_analysis_workflow.log_helper import AnalysisLogger
logger = AnalysisLogger(project_root=".")
# Log a data processing operation
logger.log_operation(
operation="Data cleaning",
location="02_processed_data/20260228_cleaned_data.csv",
params={"method": "remove_outliers", "threshold": 3},
result="Removed 47 outliers (2.3% of data)",
key_findings=["Most outliers in column X", "Normal distribution after cleaning"],
notes="Used IQR method"
)
# Log an analytical decision
logger.log_decision(
question="Which statistical test to use for group comparison?",
decision="Welch's t-test instead of standard t-test",
reasoning="Unequal variances detected (Levene p=0.003)",
references=["Statistics textbook, p.234"]
)
3. Create Analysis-Specific Logs
For each analysis session, create a timestamped directory with detailed log:
logger.create_analysis_log(
analysis_name="differential_expression",
input_data=["00_raw_data/counts_matrix.csv", "00_raw_data/metadata.csv"],
methods=[
"DESeq2 normalization",
"Negative binomial GLM",
"Benjamini-Hochberg FDR correction"
],
results={
"significant_genes": 423,
"upregulated": 256,
"downregulated": 167
},
next_steps=[
"Pathway enrichment analysis",
"Validate top 20 genes with qPCR"
],
notes="Used stricter FDR < 0.01 instead of 0.05 due to large sample size"
)
This creates:
01_analysis/20260228_1430_differential_expression/01_analysis/20260228_1430_differential_expression/analysis_log.md
4. Organize Data Progressively
Follow the data flow through directories:
# Step 1: Load raw data
raw_data = pd.read_csv("00_raw_data/experiment_1/data.csv")
logger.log_operation("Load raw data", "00_raw_data/experiment_1/data.csv",
result=f"{len(raw_data)} rows loaded")
# Step 2: Clean and save to processed
cleaned_data = clean_data(raw_data)
output_path = "02_processed_data/20260228_cleaned_data.csv"
cleaned_data.to_csv(output_path, index=False)
logger.log_operation("Clean data", output_path,
params={"method": "remove_na", "impute": False},
result=f"{len(cleaned_data)} rows after cleaning")
# Step 3: Analyze and save results
results = analyze_data(cleaned_data)
fig_path = "03_results/figures/distribution_plot.png"
save_figure(fig_path)
logger.log_operation("Generate distribution plot", fig_path,
key_findings=["Bimodal distribution detected"])
5. Review and Summarize
Generate project summary from logs:
python ~/.claude/skills/data-analysis-workflow/scripts/summarize_analysis.py
This produces:
- Summary of all operations from MASTER_LOG.md
- Timeline of analysis progression
- Key decisions and their rationales
- Figures and results inventory
- Outstanding next steps
Log Format Specifications
MASTER_LOG.md Format
# Project: [Project Name]
**Created:** YYYY-MM-DD
**Last Updated:** YYYY-MM-DD HH:MM:SS
## Operation Log
### [YYYY-MM-DD HH:MM:SS] Operation Name
**Type:** Data Processing / Analysis / Modeling / Visualization
**Location:** path/to/file
**Parameters:**
- param1: value1
- param2: value2
**Result:** Brief description of outcome
**Key Findings:**
- Finding 1
- Finding 2
**Figures/Tables Generated:**
- path/to/figure1.png
- path/to/table1.csv
**Notes:** Additional context or observations
---
analysis_log.md Format (in timestamped directories)
# Analysis: [Descriptive Name]
**Date:** YYYY-MM-DD HH:MM
**Analyst:** [Name]
**Status:** In Progress / Complete / Archived
## Objective
Brief description of analysis goals
## Input Data
- `path/to/input1.csv` - Description
- `path/to/input2.csv` - Description
## Methods
1. Method 1 with parameters
2. Method 2 with parameters
## Results
### Key Statistics
- Statistic 1: value (interpretation)
- Statistic 2: value (interpretation)
### Figures
- `outputs/figure1.png` - Description
- `outputs/figure2.png` - Description
## Key Findings
1. Finding 1 with evidence
2. Finding 2 with evidence
## Next Steps
- [ ] Action item 1
- [ ] Action item 2
## Notes
Additional observations, caveats, or context
decisions.md Format
# Analytical Decisions Log
## [YYYY-MM-DD] Decision Title
**Question:** What decision needed to be made?
**Decision:** What was decided?
**Reasoning:**
- Reason 1
- Reason 2
- Supporting evidence
**Alternatives Considered:**
- Alternative 1 (why rejected)
- Alternative 2 (why rejected)
**References:**
- Citation 1
- Citation 2
**Impact:** Expected impact on analysis
---
Best Practices
Directory Organization
- Never modify files in 00_raw_data/: Always work on copies
- Use timestamps for files: Format
YYYYMMDD_descriptive_name - Include README files: Document data sources and processing
- Archive completed analyses: Move to
01_analysis/archive/ - Organize by data stage: Follow the numbered directory progression
Logging Practices
- Log immediately: Record operations as they complete
- Be specific: Include parameters, file paths, sample sizes
- Document decisions: Record why, not just what
- Track errors: Log failures and how they were resolved
- Link outputs: Connect figures/tables to the operations that created them
Analysis Organization
- One directory per analysis run: Use timestamped directories
- Self-contained analyses: Each directory should be reproducible independently
- Clear naming: Use descriptive names, not just dates
- Document dependencies: List all input files explicitly
- Track next steps: Note what should happen next
Collaboration
- Commit logs frequently: Logs are as important as code
- Review MASTER_LOG: Check before starting new work
- Update decisions.md: Keep team aligned on analytical choices
- Standardize formats: Follow the templates consistently
- Document everything: Assume you'll forget details
Integration with Scientific Writing
When preparing manuscripts or reports:
- Extract from MASTER_LOG: Use logged operations for Methods section
- Reference decisions.md: Justify analytical choices in manuscript
- Link figures automatically: Logs contain all figure paths
- Generate methods text: Summarize logged operations
- Track reproducibility: Logs provide complete workflow documentation
Example extraction:
from data_analysis_workflow.log_helper import extract_methods
methods_text = extract_methods(
log_path="logs/MASTER_LOG.md",
start_date="2026-02-01",
end_date="2026-02-28",
format="markdown" # or "latex"
)
This generates Methods section text directly from logged operations.
Example Workflows
Workflow 1: Exploratory Data Analysis
from data_analysis_workflow.log_helper import AnalysisLogger
logger = AnalysisLogger()
# Create analysis session
logger.create_analysis_log(
analysis_name="initial_exploration",
input_data=["00_raw_data/dataset.csv"],
methods=["Summary statistics", "Distribution plots", "Correlation analysis"]
)
# Perform analysis and log each step
data = pd.read_csv("00_raw_data/dataset.csv")
logger.log_operation("Load data", "00_raw_data/dataset.csv",
result=f"{data.shape[0]} rows, {data.shape[1]} columns")
# Generate and log figures
fig_path = "03_results/figures/20260228_distributions.png"
plot_distributions(data, save_path=fig_path)
logger.log_operation("Generate distribution plots", fig_path,
key_findings=["Age: right-skewed", "Income: bimodal"])
# Log decision
logger.log_decision(
question="Should we transform skewed variables?",
decision="Apply log transformation to age and income",
reasoning="Both show right skew >2.0, transformation may improve model fit"
)
Workflow 2: Model Training
logger = AnalysisLogger()
# Create model training session
logger.create_analysis_log(
analysis_name="random_forest_v1",
input_data=["02_processed_data/20260228_training_data.csv"],
methods=["Random Forest", "5-fold CV", "Hyperparameter tuning"]
)
# Train and log
model = RandomForestClassifier(**params)
model.fit(X_train, y_train)
# Save model
model_path = "05_models/20260228_rf_v1.pkl"
joblib.dump(model, model_path)
logger.log_operation(
operation="Train Random Forest",
location=model_path,
params={"n_estimators": 100, "max_depth": 10},
result=f"CV accuracy: {cv_score:.3f}",
key_findings=[
"Top feature: age (importance=0.34)",
"Converged after 87 trees"
]
)
Workflow 3: Report Generation
# Summarize project for manuscript
from data_analysis_workflow.log_helper import generate_report
report = generate_report(
log_path="logs/MASTER_LOG.md",
sections=["methods", "results", "figures"],
output_format="markdown"
)
# Save to reports directory
report_path = "04_reports/20260228_analysis_summary.md"
with open(report_path, 'w') as f:
f.write(report)
logger.log_operation("Generate analysis report", report_path,
notes="Ready for manuscript Methods section")
Helper Scripts
scripts/init_project.py
Initialize new project with standard directory structure
python ~/.claude/skills/data-analysis-workflow/scripts/init_project.py my_project
scripts/log_helper.py
Python module with AnalysisLogger class for automatic logging
from data_analysis_workflow.log_helper import AnalysisLogger
logger = AnalysisLogger()
scripts/summarize_analysis.py
Generate project summary from logs
python ~/.claude/skills/data-analysis-workflow/scripts/summarize_analysis.py
scripts/extract_methods.py
Extract Methods section text from operation logs
python ~/.claude/skills/data-analysis-workflow/scripts/extract_methods.py --format markdown
References
For detailed documentation:
references/log_formats.md: Complete log format specifications with examplesreferences/examples.md: Example projects with complete logsreferences/best_practices.md: Extended best practices guide
Integration with Other Skills
This skill works well with:
- scientific-writing: Extract logged operations for Methods sections
- statistical-analysis: Log all statistical tests and decisions
- exploratory-data-analysis: Document EDA findings systematically
- peer-review: Provide complete analysis documentation for reviewers
- research-grants: Demonstrate systematic approach and reproducibility