r-scripting - SKILL.md Agent Skill

name: r-scripting
description: R coding standards for scripts and data analysis. Use for writing R scripts, data analysis, tidyverse workflows, or exploratory analysis. Includes guidance for creating teaching demos with knitr::spin (literate R scripts that render to PDF/HTML). Not for package development.

R Scripting

Core Principles

Clear over clever - readability trumps brevity
Explicit over implicit - all function inputs as arguments (no globals)
Functional over imperative - prefer map() over loops when clearer

Design Principles

Top-Down Design

Decompose problems before coding:

Problem → high-level subtasks
Subtasks → smaller pieces
Code → implement bottom-up

Start with the big picture, then drill down.
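A minimal sketch of this workflow in base R, assuming a hypothetical task (averaging measurements per group). All function names and the data layout are illustrative. The top-level function is designed first as a sequence of subtask calls; the helpers it relies on are then implemented bottom-up.

```r
# Bottom-up: small helpers, each solving one subtask.
# All names here are hypothetical, chosen for illustration.
drop_missing <- function(df, column) {
  df[!is.na(df[[column]]), , drop = FALSE]
}

summarize_by_group <- function(df, value_col, group_col) {
  means <- tapply(df[[value_col]], df[[group_col]], mean)
  data.frame(group = names(means), mean = as.numeric(means))
}

# Top-down: the high-level function reads like the problem statement.
analyze_measurements <- function(df,
                                 value_col = "value",
                                 group_col = "group") {
  cleaned <- drop_missing(df, value_col)
  summarize_by_group(cleaned, value_col, group_col)
}

measurements <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, NA, 10)
)
report <- analyze_measurements(measurements)
```

Reading `analyze_measurements()` alone should be enough to understand the analysis; the helpers only need to be opened when a detail matters.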

DRY (Don't Repeat Yourself)

Extract repeated code into functions, so that a bug is fixed in one place and changes propagate automatically.

# BAD: Copy-paste
result_a <- data_a |> filter(x > 0) |> mutate(y = log(x)) |> summarize(m = mean(y))
result_b <- data_b |> filter(x > 0) |> mutate(y = log(x)) |> summarize(m = mean(y))

# GOOD: Abstraction
compute_log_mean <- function(data) {
  data |> filter(x > 0) |> mutate(y = log(x)) |> summarize(m = mean(y))
}
results <- map(list(data_a, data_b), compute_log_mean)

One Task Per Function

Keep functions short (~20-50 lines). If a function does multiple things, split it.
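As an illustration, the sketch below splits a hypothetical function that both cleans and summarizes into two single-purpose functions. The names `clean_scores` and `summarize_scores` and the valid range of 0 to 100 are assumptions made for this example.

```r
# BAD: one function validates, cleans, and summarizes
clean_and_summarize <- function(scores, lower = 0, upper = 100) {
  stopifnot("scores must be numeric" = is.numeric(scores))
  scores <- scores[!is.na(scores)]
  scores <- scores[scores >= lower & scores <= upper]
  c(n = length(scores), mean = mean(scores))
}

# GOOD: each function does one thing and can be tested alone
clean_scores <- function(scores, lower = 0, upper = 100) {
  stopifnot("scores must be numeric" = is.numeric(scores))
  scores <- scores[!is.na(scores)]
  scores[scores >= lower & scores <= upper]
}

summarize_scores <- function(scores) {
  c(n = length(scores), mean = mean(scores))
}

raw_scores <- c(90, NA, 70, 150)
score_summary <- summarize_scores(clean_scores(raw_scores))
```

Both versions give the same result here, but the split version lets the cleaning rule change without touching the summary.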

Naming

# Functions: snake_case verbs
calculate_mean <- function(x) mean(x, na.rm = TRUE)
filter_outliers <- function(data, threshold = 3) { ... }

# Variables: descriptive, plural for collections
patient_ages <- c(25, 32, 41)
model_predictions <- predict(fit)

# Constants: UPPER_SNAKE_CASE
MAX_ITERATIONS <- 1000
DEFAULT_ALPHA <- 0.05

Formatting

# Assignment: use <-
result <- x + y

# Pipes: one operation per line
result <- raw_data |>
  filter(age > 18) |>
  mutate(log_income = log(income)) |>
  summarize(mean = mean(log_income))

# Lambda (R ≥ 4.1)
map(data, \(x) x^2 + 1)

2 spaces indentation, ~80 chars/line
Use native pipe |> (R ≥ 4.1) or %>%
Run air format . after every major edit (not just before commit). A pre-commit hook enforces this, so formatting early avoids wasted cycles.

Functions

my_function <- function(data, threshold = 0.05, verbose = FALSE) {
  # 1. Validate
  stopifnot(
    "data must be data.frame" = is.data.frame(data),
    "threshold must be positive" = threshold > 0
  )
  # 2. Early return for edge cases
  if (nrow(data) == 0) return(NULL)
  # 3. Main logic
  process_data(data, threshold)
}

Pure Functions (Critical)

# BAD: global dependency
THRESHOLD <- 0.05
filter_data <- function(df) df |> filter(p < THRESHOLD)

# GOOD: explicit argument
filter_data <- function(df, threshold = 0.05) df |> filter(p < threshold)

All inputs must be explicit arguments. Same inputs → same outputs.
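To see why the global dependency matters, here is a small base R sketch. The data frame `p_values` and both function names are made up for illustration. The impure version silently changes its result when the global is reassigned; the pure version does not.

```r
p_values <- data.frame(id = 1:4, p = c(0.001, 0.03, 0.07, 0.2))

# Impure: result depends on whatever THRESHOLD is at call time
THRESHOLD <- 0.05
count_significant_impure <- function(df) sum(df$p < THRESHOLD)

# Pure: result depends only on the arguments
count_significant <- function(df, threshold = 0.05) sum(df$p < threshold)

before <- count_significant_impure(p_values)
THRESHOLD <- 0.1  # an unrelated edit elsewhere in the script
after <- count_significant_impure(p_values)

# The pure version is unaffected by the reassignment
pure_result <- count_significant(p_values)
```

Here `before` and `after` differ even though the call is identical, which is exactly the kind of hidden coupling the rule is meant to prevent.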

Control Flow

Return Early, Keep Happy Path Flat

Handle edge cases and errors first, then let the main logic flow with minimal indentation.

# BAD: Deeply nested
process <- function(x, method) {
  if (!is.null(x)) {
    if (is.numeric(x)) {
      if (method == "mean") {
        return(mean(x))
      } else {
        return(median(x))
      }
    }
  }
  return(NA)
}

# GOOD: Early returns, flat happy path
process <- function(x, method = c("mean", "median")) {
  if (is.null(x)) return(NA)
  if (!is.numeric(x)) return(NA)
  method <- match.arg(method)  # Homogenize input

  switch(method,
    mean = mean(x),
    median = median(x)
  )
}

Homogenize Inputs Early

Normalize variants at the start, then work with a single form:

# Homogenize string inputs
method <- match.arg(method)  # Pick from predefined choices
type <- tolower(type)        # Case-insensitive

# Homogenize NULL to default (%||% is in base R ≥ 4.4.0; otherwise load rlang)
arg <- arg %||% default_value
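
The three idioms above can be combined at the top of a single function. The following is a sketch under stated assumptions: `summarize_values` and its arguments are hypothetical, and the `%||%` fallback is only there so the code also runs on R versions before 4.4.0 without rlang.

```r
# Fallback for R < 4.4.0 when rlang is not attached
if (!exists("%||%")) {
  `%||%` <- function(x, y) if (is.null(x)) y else x
}

# Hypothetical helper: normalize every input once, then use a single form
summarize_values <- function(x, method = c("mean", "median"),
                             units = "CM", label = NULL) {
  method <- match.arg(method)  # "mean" when not supplied
  units <- tolower(units)      # "CM", "Cm", "cm" all become "cm"
  label <- label %||% "value"  # NULL becomes the default label

  estimate <- switch(method, mean = mean(x), median = median(x))
  paste0(label, ": ", estimate, " ", units)
}

default_call <- summarize_values(c(1, 2, 9))
custom_call <- summarize_values(
  c(1, 2, 9),
  method = "median", units = "Cm", label = "height"
)
```

After the first three lines of the body, the rest of the function never has to ask which spelling or which default it is dealing with.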

Vectorized Conditions

# Simple binary
result <- ifelse(x > 0, "pos", "neg")

# Multiple conditions
category <- case_when(
  score >= 90 ~ "A",
  score >= 80 ~ "B",
  TRUE ~ "F"
)

# Discrete options (switch() is scalar: method must have length 1)
result <- switch(method,
  "mean" = mean(x, na.rm = TRUE),
  "median" = median(x, na.rm = TRUE),
  stop("Unknown method: ", method)
)

Functional Programming (purrr)

# Typed output
map_dbl(data, compute_value)
map_lgl(data, is_valid)

# Multiple inputs
map2(x, y, \(a, b) a + b)
pmap(list(x, y, z), \(a, b, c) a * b + c)

# Error handling
safe_mean <- possibly(mean, otherwise = NA)
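
A short sketch of how `possibly()` behaves inside a typed map call, assuming purrr is installed. The `inputs` list and the `safe_log10` name are made up for illustration.

```r
library(purrr)

inputs <- list(10, "not a number", 100)

# log10() errors on character input; possibly() turns that error into NA
safe_log10 <- possibly(log10, otherwise = NA_real_)

# map_dbl() guarantees a double vector, so one bad element cannot
# change the type of the result
log_values <- map_dbl(inputs, safe_log10)
```

Using `NA_real_` rather than plain `NA` keeps the fallback the same type as the successful results.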

Data Manipulation (tidyverse)

result <- raw_data |>
  filter(!is.na(value)) |>
  mutate(
    log_value = log(value),
    scaled = as.numeric(scale(value))  # scale() returns a one-column matrix
  ) |>
  group_by(category) |>
  summarize(n = n(), mean = mean(value)) |>
  arrange(desc(mean))

# Multiple columns (scale() returns a matrix, so convert back to a vector)
df |> mutate(across(where(is.numeric), \(x) as.numeric(scale(x))))
df |> summarize(across(where(is.numeric), list(mean = mean, sd = sd)))

Error Handling

# Validation with messages (R ≥ 4.0)
stopifnot(
  "x must be numeric" = is.numeric(x),
  "denom cannot be zero" = all(denom != 0)
)

# Informative errors
if (x < 0) {
  stop("x must be non-negative, got x = ", x,
       "\nTry using abs(x) or filtering negatives")
}

# Try-catch
result <- tryCatch(
  risky_computation(data),
  error = function(e) { message("Failed: ", e$message); NULL }
)
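
The validation and try-catch patterns above are usually combined: the function validates and fails loudly, and the caller decides what a failure means. A base R sketch, where `safe_ratio` and `try_ratio` are hypothetical names:

```r
# Hypothetical function that validates its input before computing
safe_ratio <- function(num, denom) {
  stopifnot(
    "num must be numeric" = is.numeric(num),
    "denom cannot be zero" = all(denom != 0)
  )
  num / denom
}

# The caller reports the failure and substitutes a typed fallback value
try_ratio <- function(num, denom) {
  tryCatch(
    safe_ratio(num, denom),
    error = function(e) {
      message("Failed: ", conditionMessage(e))
      NA_real_
    }
  )
}

ok <- try_ratio(10, 4)
failed <- try_ratio(10, 0)  # reports the failure, then returns NA
```

`conditionMessage(e)` is the accessor form of `e$message` and works for any condition object.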

Script Structure

# Setup -----------------------------------------------------------------------

library(tidyverse)
DATA_PATH <- "data/input.csv"

# Functions -------------------------------------------------------------------

compute_stats <- function(x) { ... }

# Main ------------------------------------------------------------------------

data <- read_csv(DATA_PATH)
results <- compute_stats(data)
write_csv(results, "output/results.csv")

Section headers: Use # Section Name ---- or # Section Name ------ format (dashes to ~col 80). RStudio recognizes these for code folding. Never use multi-line # ==== block headers.

Anti-Patterns

Code Examples

# BAD: magic numbers
significant <- filter(results, p < 0.05)
# GOOD
ALPHA <- 0.05
significant <- filter(results, p < ALPHA)

# BAD: repetition
summary_a <- data_a |> filter(valid) |> summarize(m = mean(val))
summary_b <- data_b |> filter(valid) |> summarize(m = mean(val))
# GOOD
calc_summary <- function(d) d |> filter(valid) |> summarize(m = mean(val))
summaries <- map(list(data_a, data_b), calc_summary)

# BAD: swallowing errors
try(risky(x), silent = TRUE)
# GOOD
tryCatch(risky(x), error = function(e) { message("Failed: ", e$message); NA })

Code Smells & Fixes

Smell	Problem	Fix
Long functions (>50 lines)	Hard to understand/test	Extract subtasks into helpers
Long parameter lists	Unwieldy API	Group related args into list
Data clumps	Variables that always appear together	Unify into single object/list
Dead/commented-out code	Confusion, maintenance burden	Delete (Git has history)
Deep nesting	Hard to follow	Return early, flatten logic
High cyclomatic complexity	Many paths = many bugs	Target < 10 per function
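
For the "long parameter lists" and "data clumps" rows, one possible fix is a small constructor that bundles the related arguments. This is a sketch only; `text_style` and `plot_title` are hypothetical names, not functions from any package.

```r
# BAD: related settings travel as loose arguments
plot_title_long <- function(title, font_size, font_family, color) {
  sprintf("%s [%dpt %s, %s]", title, font_size, font_family, color)
}

# GOOD: related arguments grouped into one list with defaults
text_style <- function(font_size = 12L,
                       font_family = "sans",
                       color = "black") {
  list(font_size = font_size, font_family = font_family, color = color)
}

plot_title <- function(title, style = text_style()) {
  sprintf(
    "%s [%dpt %s, %s]",
    title, style$font_size, style$font_family, style$color
  )
}

default_title <- plot_title("Results")
custom_title <- plot_title(
  "Results",
  style = text_style(font_size = 14L, color = "grey30")
)
```

The constructor gives the defaults a single home, and callers override only the fields they care about.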

Cyclomatic Complexity

Measure with the cyclocomp package and target a score below 10 per function:

cyclocomp::cyclocomp(my_function)

Comments

# GOOD: explain WHY
# Log transform to normalize right-skewed distribution
data <- mutate(data, log_val = log(value + 1))

# GOOD: document assumptions
# Assumes data sorted by date
rolling <- zoo::rollmean(values, k = 7)

Teaching Demos with knitr::spin

For literate R scripts that render to PDF/HTML, see knitr-spin-reference.md in this skill folder.

Key points: use #' prefix for prose, #+ for chunk options, interleave code and explanation.
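
A minimal sketch of such a script, showing the three conventions together. The topic, chunk labels, and constants are invented for illustration; see the reference file for the full set of options.

```r
#' # Sampling distribution of the mean
#'
#' Lines starting with `#'` become prose in the rendered document.
#' We draw repeated samples and look at how their means vary.

#+ setup, message = FALSE
set.seed(1)
N_SAMPLES <- 200
SAMPLE_SIZE <- 30

#' Each sample mean is computed from `SAMPLE_SIZE` standard normal draws.

#+ simulate
sample_means <- replicate(N_SAMPLES, mean(rnorm(SAMPLE_SIZE)))

#' The histogram below should be centered near zero.

#+ histogram, fig.width = 5, fig.height = 3
hist(sample_means, main = "Sample means", xlab = "Mean")
```

Assuming the script is saved as demo.R (a hypothetical file name), it can be rendered with `knitr::spin("demo.R")` or `rmarkdown::render("demo.R")`. The file also remains a plain R script that runs with `source()`.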

Tools

# R console: check style
lintr::lint("script.R")

# Terminal: auto-format (run after every major edit and before commit)
air format script.R
air format .  # Format all R files

Quick Checklist

Functions pure (all inputs as args)
No magic numbers (use named constants)
Informative error messages
Comments explain why, not what
Works in fresh R session
Functions < 50 lines (cyclomatic complexity < 10)
No dead/commented-out code
Run air format . before commit
For demos: see knitr-spin-reference.md