---
name: data-skill
description: AI-driven hybrid data analysis workflow supporting R (tidyverse and ggplot2) and Python (pandas and seaborn). Provides templates for 70 percent of common tasks including variable exploration, data cleaning, processing, and plotting. Also generates AI code for custom needs. Uses bash testing with iteration and MCP Context7 for function references. Use when analyzing data files, creating visualizations, cleaning and transforming data, running statistical tests, creating publication quality figures, or working with Quarto qmd documents.
---
# Data Analysis Workflow
Hybrid approach combining templates for common patterns with AI generation for custom needs.
## How It Works
```
User Request → Pattern Matching → [Template Available?]
                                            ↓
                        Yes (Quick Start) ←──→ No (AI Generate)
                                ↓                     ↓
                                └────→ Bash Test ←────┘
                                           ↓
                                      [Success?]
                                           ↓
                                  No ←──────────→ Yes
                                  ↓                ↓
                             Iterate/Fix  Return Working Code
```
## Quick Start
### 1. Detect Language
- **R**: when the user mentions tidyverse, ggplot2, dplyr, or `.R` files
- **Python**: when the user mentions pandas, seaborn, or `.py` files

Ask if uncertain.
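
The detection rule above can be sketched as a small keyword check. This is only an illustration: the function name `detect_language` and the regular expressions are assumptions built from the cues listed above, not part of the skill itself.

```python
import re
from typing import Optional

# Cues taken from the list above; the patterns themselves are illustrative.
R_PATTERN = re.compile(r"\b(tidyverse|ggplot2|dplyr)\b|\.r\b", re.IGNORECASE)
PY_PATTERN = re.compile(r"\b(pandas|seaborn)\b|\.py\b", re.IGNORECASE)


def detect_language(request: str) -> Optional[str]:
    """Return "r", "python", or None when the request is ambiguous."""
    has_r = bool(R_PATTERN.search(request))
    has_py = bool(PY_PATTERN.search(request))
    if has_r and not has_py:
        return "r"
    if has_py and not has_r:
        return "python"
    # No cue, or cues for both languages: ask the user.
    return None
```

Returning `None` for mixed or missing cues keeps the "ask if uncertain" step explicit instead of guessing.
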
### 2. Pattern Matching
Check if the request matches a common template:
| Task | Use Template |
|---|---|
| Explore variable types and distributions | `templates/*/explore_variable.*` |
| Clean missing values and outliers | `templates/*/clean_data.*` |
| Recode/transform variables | `templates/*/process_data.*` |
| Create scatter plot | `templates/*/plot_scatter.*` |
| Create bar chart | `templates/*/plot_bar.*` |
| Create box plot | `templates/*/plot_box.*` |
| Run t-test/correlation | `templates/*/statistical_test.*` |
| Plot causal DAG | `templates/*/plot_dag.*` |
| Target Trial Emulation | `templates/*/tte_cloning.*` |
If match found: Provide template, let user modify, run via bash.
If no match: AI generates custom code (use MCP Context7 for syntax lookup).
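
One possible way to express the lookup is a keyword table that maps a request to a template path. The template stems come from the table above; the keyword lists, the function name `match_template`, and the `.py` extension for Python templates are assumptions made for this sketch.

```python
from typing import Optional

# Template stems follow the table above; keywords are illustrative guesses.
TEMPLATE_KEYWORDS = {
    "explore_variable": ("explore", "distribution", "variable type"),
    "clean_data": ("missing", "outlier", "clean"),
    "process_data": ("recode", "transform"),
    "plot_scatter": ("scatter",),
    "plot_bar": ("bar chart", "bar plot"),
    "plot_box": ("box plot", "boxplot"),
    "statistical_test": ("t-test", "correlation"),
    "plot_dag": ("dag",),
    "tte_cloning": ("target trial",),
}
# Assumed file extensions per language directory.
EXTENSIONS = {"r": "R", "python": "py"}


def match_template(request: str, language: str) -> Optional[str]:
    """Return the first matching template path, or None to fall back to AI generation."""
    text = request.lower()
    for stem, keywords in TEMPLATE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return f"templates/{language}/{stem}.{EXTENSIONS[language]}"
    return None
```

A `None` result corresponds to the "no match" branch, where custom code is generated instead.
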
### 3. Test and Iterate
```bash
# Write code to temp file
# Run: Rscript temp.R or python temp.py
# Analyze output/errors
# Fix and retry if needed
```
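
The loop described above might look like the following sketch. The helper names `run_code` and `iterate_until_success` are hypothetical, and the `fix` callback is a stand-in for whatever produces the corrected code. The sketch uses `sys.executable` rather than a bare `python` command so it runs the current interpreter; R code assumes `Rscript` is on the PATH.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Interpreter commands and file suffixes per language (assumed layout).
COMMANDS = {"python": [sys.executable], "r": ["Rscript"]}
SUFFIXES = {"python": ".py", "r": ".R"}


def run_code(code: str, language: str, timeout: int = 60) -> subprocess.CompletedProcess:
    """Write code to a temp file, run it, and capture stdout and stderr."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / f"temp{SUFFIXES[language]}"
        script.write_text(code, encoding="utf-8")
        return subprocess.run(
            COMMANDS[language] + [str(script)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )


def iterate_until_success(
    code: str,
    language: str,
    fix: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Run code; on failure hand (code, stderr) to `fix` and retry."""
    for _ in range(max_attempts):
        result = run_code(code, language)
        if result.returncode == 0:
            return code  # working code
        code = fix(code, result.stderr)
    return None  # still failing after max_attempts
```

Capping the number of attempts avoids looping forever on code that cannot be repaired automatically.
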
### 4. QMD Integration (When Working in Quarto)
When the user is working with qmd documents:
- Bash test the code first to ensure it works
- Edit the qmd file directly with the working code
- Render to verify: `quarto render report.qmd`
- Iterate if the render fails

See QMD Integration below for chunk options and best practices.
## Analysis Phases
Instead of rigid steps, use flexible phases:
### Phase 1: Understand Data
- Load and inspect data structure
- Explore variable types and distributions
- Identify quality issues (missing values, outliers)
### Phase 2: Prepare Data
- Clean missing values and outliers
- Transform and recode variables
- Create derived features
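
As a rough illustration of Phases 1 and 2 in pandas, the sketch below inspects a tiny made-up dataset, fills a missing value, removes an outlier, and derives a feature. It is not the bundled template; the data, the median fill, the 1.5 × IQR rule, and the age bands are all assumptions chosen for the example.

```python
import io

import pandas as pd

# Tiny inline dataset (invented for illustration) with one missing age and one outlier.
raw = io.StringIO("id,age,group\n1,34,a\n2,,b\n3,29,a\n4,31,b\n5,250,a\n")
df = pd.read_csv(raw)

# Phase 1: inspect structure and quality issues.
dtypes = df.dtypes
missing = df.isna().sum()

# Phase 2: fill missing ages with the median, then keep rows within
# 1.5 * IQR of the quartiles (one common outlier rule among several).
df["age"] = df["age"].fillna(df["age"].median())
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

# Derived feature: a coarse age band.
clean["age_band"] = pd.cut(clean["age"], bins=[0, 30, 120], labels=["<=30", ">30"])
```

Whether to fill, drop, or flag missing values and outliers depends on the analysis, so treat these choices as placeholders.
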
### Phase 3: Analyze
- Descriptive statistics
- Hypothesis testing
- Modeling (if needed)
### Phase 4: Visualize
- Create appropriate plots for the data type
- Apply publication-ready themes
- Export high-resolution figures
### Phase 5: Report
- Interpret statistical results
- Write narrative text
- Compile final report
## When to Use Templates vs AI Generation
### Use Templates For
- Standard variable exploration
- Common chart types (scatter, bar, box)
- Routine data cleaning
- Basic statistical tests
### Use AI Generation For
- Custom transformations
- Advanced or unusual visualizations
- Complex statistical models
- Domain-specific analysis patterns
## MCP Integration
When uncertain about function syntax, query Context7 MCP:
- "How to use geom_smooth() in ggplot2?"
- "pandas groupby and aggregate syntax"
- "seaborn boxplot with hue parameter"
## QMD Integration
### QMD Workflow
1. Generate/test code via bash
2. Edit qmd with working code
3. Render: `quarto render report.qmd`
4. Iterate if needed
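
The render step can be wrapped so that a failure returns the error text for the next iteration. This is a minimal sketch: `render_command` and `render_qmd` are hypothetical helper names, and the only external command used is `quarto render <file>`, as shown in the workflow above.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Tuple


def render_command(qmd_path: str) -> List[str]:
    """Build the render command used in step 3."""
    return ["quarto", "render", qmd_path]


def render_qmd(qmd_path: str) -> Tuple[bool, str]:
    """Render a qmd file; return (success, error text to act on)."""
    if not Path(qmd_path).is_file():
        return False, f"file not found: {qmd_path}"
    if shutil.which("quarto") is None:
        return False, "quarto not found on PATH"
    result = subprocess.run(render_command(qmd_path), capture_output=True, text=True)
    return result.returncode == 0, result.stderr
```

If the first element of the result is `False`, the second element is what to analyze before editing the qmd and rendering again.
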
### Recommended Chunk Options
#### R Chunks
````markdown
```{r chunk-name, fig.width=8, fig.height=6, fig.cap="Caption"}
# Code here
```
````
#### Python Chunks
````markdown
```{python chunk-name, fig.width=8, fig.height=6, fig.cap="Caption"}
# Code here
```
````
### Auto-Recommended Options by Chart Type
| Chart Type | fig.width | fig.height | Notes |
|------------|-----------|------------|-------|
| Scatter plot | 8 | 6 | Standard |
| Bar chart | 8 | 6 | Add `fig.cap` |
| Box plot | 8 | 6 | Add `fig.cap` |
| Histogram | 8 | 5 | Shorter; extra height not needed |
| Line plot | 10 | 6 | Wider for time series |
| Multi-panel | 12 | 8 | Larger for facets |
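
The table above can be turned into a small helper that emits a chunk header with the recommended size. The sizes are copied from the table; the function name `chunk_header`, the dictionary keys, and the `engine` parameter are assumptions for this sketch.

```python
from typing import Dict, Tuple

# (fig.width, fig.height) per chart type, copied from the table above.
FIGURE_SIZES: Dict[str, Tuple[int, int]] = {
    "scatter": (8, 6),
    "bar": (8, 6),
    "box": (8, 6),
    "histogram": (8, 5),
    "line": (10, 6),
    "multi-panel": (12, 8),
}


def chunk_header(chart_type: str, name: str, caption: str, engine: str = "r") -> str:
    """Return the opening line of a chunk with the recommended figure size."""
    width, height = FIGURE_SIZES[chart_type]
    fence = "`" * 3  # built here to avoid a literal fence inside this example
    return (
        f"{fence}{{{engine} {name}, fig.width={width}, "
        f'fig.height={height}, fig.cap="{caption}"}}'
    )
```

The caption is inserted verbatim, so a caption containing double quotes would need escaping before use.
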
### QMD Best Practices
1. **Always use chunk names** for cross-referencing
2. **Add fig.cap** for figure captions in reports
3. **Test code** in bash before adding to qmd
4. **Use echo=FALSE** to hide source code if needed
5. **Use warning=FALSE** to suppress warnings in output
### Complete Chunk Options Reference
#### Output Control Options
| Option | Values | Default | Purpose |
|--------|--------|---------|---------|
| `echo` | TRUE/FALSE | TRUE | Show/hide source code |
| `eval` | TRUE/FALSE | TRUE | Run code or skip |
| `include` | TRUE/FALSE | TRUE | Include code and its output in the document (code still runs when FALSE) |
| `warning` | TRUE/FALSE | TRUE | Show warning messages |
| `message` | TRUE/FALSE | TRUE | Show messages (e.g., package load messages) |
| `error` | TRUE/FALSE | FALSE | Continue past errors and show them (TRUE) or stop rendering (FALSE) |
| `collapse` | TRUE/FALSE | FALSE | Merge output with code |
| `results` | markup/asis/hide/hold | markup | How to display output |
#### Figure Options
| Option | Purpose |
|--------|---------|
| `fig.width` | Figure width in inches |
| `fig.height` | Figure height in inches |
| `fig.cap` | Figure caption text |
| `fig.alt` | Alt text for accessibility |
| `fig.align` | Horizontal alignment: left, right, center, or default |
| `out.width` | Output width (e.g., "80%") |
| `out.height` | Output height (e.g., "auto") |
#### Cache Options
| Option | Purpose |
|--------|---------|
| `cache` | TRUE/FALSE - cache results |
| `cache.path` | Custom cache directory |
| `dependson` | Chunk dependencies for cache |
### Common Chunk Patterns
Publication figure (no code, with caption):
````markdown
```{r my-plot, echo=FALSE, fig.width=8, fig.height=6, fig.cap="My figure caption"}
# Your plot code here
```
````
Data loading (suppress messages, with cache):
````markdown
```{r load-data, message=FALSE, cache=TRUE}
# Your data loading code here
```
````
Analysis (hide code, show output):
````markdown
```{r analysis, echo=FALSE, warning=FALSE}
# Your analysis code here
```
````
Draft mode (quick render, no cache):
````markdown
```{r draft, cache=FALSE, eval=TRUE}
# Your code here
```
````
Setup (run but don't show):
````markdown
```{r setup, include=FALSE}
library(tidyverse)
```
````
## Bundled Resources
### Templates (`templates/`)
**R templates** (`templates/r/`):
- `explore_variable.R` - Variable type and distribution analysis
- `clean_data.R` - Handle missing values, duplicates, outliers
- `process_data.R` - Filter, recode, transform
- `plot_scatter.R` - Scatter plots with themes
- `plot_bar.R` - Bar charts for categorical data
- `plot_box.R` - Box plots for distributions
- `statistical_test.R` - t-tests, correlations
- `plot_dag.R` - Causal DAG visualization with dagitty/ggdag
- `tte_cloning.R` - Target Trial Emulation cloning
**Python templates** (`templates/python/`):
- Mirror of R templates using pandas/seaborn/networkx
### References (`references/`)
- `workflow_phases.md` - When to do what, decision trees
- `data_patterns.md` - Common transformation patterns
- `statistical_tests.md` - Test selection guide
### Assets (`assets/`)
- `themes/theme_r.R` - ggplot2 publication theme
- `themes/theme_python.py` - seaborn/matplotlib theme
### Learning (`learning/`)
Domain-specific knowledge accumulated through usage:
- `analysis_patterns.md` - Successful analysis patterns
- `user_preferences.md` - User-specific preferences
- `mcp_queries.md` - Successful MCP queries for reference
**Note**: This directory works with the `reflect` skill for automatic learning from corrections.
## Best Practices
1. **Always test code** via bash before presenting to user
2. **Iterate on errors** - analyze output, fix, retry
3. **Use MCP for syntax** - don't guess function parameters
4. **Prefer templates** for common tasks (faster, tested)
5. **Generate custom code** only when templates don't fit
6. **For qmd**: Test in bash, then edit and render