stata

name: stata
description: >
  Comprehensive Stata reference for writing correct .do files, data management, econometrics, causal inference, graphics, Mata programming, and 17+ community packages (reghdfe, estout, did, rdrobust, etc.). Covers syntax, options, gotchas, and idiomatic patterns. Use this skill whenever the user asks you to write, debug, or explain Stata code.
triggers:
  - stata
  - .do file
  - do-file
  - regress
  - regression in stata
  - panel data
  - fixed effects
  - reghdfe
  - estout
  - esttab
  - outreg2
  - difference-in-differences
  - event study
  - propensity score
  - rdrobust
  - synthetic control
  - xtset
  - merge
  - reshape
  - collapse
  - egen
  - ssc install
  - mata
  - putexcel
  - putdocx
  - graph export
  - survival analysis
  - heckman
  - tobit
  - logit
  - probit
  - arima
  - var model
  - gmm estimation
  - bootstrap stata
  - survey weights
  - multiple imputation
  - lasso stata

Stata Skill

You have access to comprehensive Stata reference files. Do not load all files. Read only the 1-3 files relevant to the user's current task using the routing table below.

Critical Gotchas

These are Stata-specific pitfalls that lead to silent bugs. Internalize these before writing any code.

Missing Values Sort to +Infinity

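Stata's missing values (. and the extended codes .a through .z) compare as larger than every number, so a condition like x > c is true whenever x is missing.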

* WRONG — includes observations where income is missing!
gen high_income = (income > 50000)

* RIGHT
gen high_income = (income > 50000) if !missing(income)

* WRONG — missing ages appear in this list
list if age > 60

* RIGHT
list if age > 60 & !missing(age)
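The size of the problem is easy to check on real data. A minimal sketch, assuming the auto dataset that ships with Stata (where rep78 has some missing values) in place of the hypothetical income and age variables above:

```stata
* Illustration only: uses the bundled auto data, not the variables above
sysuse auto, clear

count if rep78 > 3                      // also counts cars with missing rep78
local n_naive = r(N)

count if rep78 > 3 & !missing(rep78)    // excludes missing values
local n_safe = r(N)

* The gap between the two counts equals the number of missing values
count if missing(rep78)
display "naive - safe = " `n_naive' - `n_safe' "  (missing: " r(N) ")"
```

The same check works for any variable: if the naive and guarded counts differ, the unguarded condition is picking up missing values.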

`=` vs `==`

= is assignment; == is comparison. Mixing them up is a syntax error or silent bug.

* WRONG — syntax error
gen employed = 1 if status = 1

* RIGHT
gen employed = 1 if status == 1

Local Macro Syntax

Locals use `name' (backtick + single-quote). Globals use $name or ${name}. Forgetting the closing quote is the #1 macro bug.

local controls "age education income"
regress wage `controls'        // correct
regress wage `controls         // WRONG — missing closing quote
regress wage 'controls'        // WRONG — wrong quote characters
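The global forms mentioned above work the same way. A short sketch, assuming the bundled auto dataset and a hypothetical global named outdir:

```stata
* Hypothetical macro names (controls, outdir); data is the bundled auto set
sysuse auto, clear

local controls "mpg weight"
global outdir "results"

regress price `controls'
display "`controls'"            // local: backtick ... single quote
display "$outdir"               // global: dollar sign
display "${outdir}_v2"          // braces mark where the global's name ends
```

Without the braces, $outdir_v2 would be read as a different (and here undefined) global named outdir_v2, which expands to an empty string without any error.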

`by` Requires Prior Sort (Use `bysort`)

* WRONG — error if data not sorted by id
by id: gen first = (_n == 1)

* RIGHT — bysort sorts automatically
bysort id: gen first = (_n == 1)

* Also RIGHT — explicit sort
sort id
by id: gen first = (_n == 1)
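When the order within each group matters, bysort accepts extra sort keys in parentheses: they order the rows inside each group without becoming grouping variables. A minimal sketch, assuming the bundled auto dataset (the variable names cheapest and price_rank are made up for illustration):

```stata
* Illustration on the bundled auto data
sysuse auto, clear

* Group by foreign, order by price within each group (price is not a grouping key)
bysort foreign (price): gen cheapest = (_n == 1)
bysort foreign (price): gen price_rank = _n

list make foreign price if cheapest
```

Without the parenthesized key, _n == 1 picks whichever row happens to come first in the group, which is not guaranteed to be stable across runs.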

Factor Variable Notation (`i.` and `c.`)

Use i. for categorical, c. for continuous. Omitting i. treats categories as continuous.

* WRONG — treats race as continuous (e.g., race=3 has 3x effect of race=1)
regress wage race education

* RIGHT — creates dummies automatically
regress wage i.race education

* Interactions
regress wage i.race##c.education    // full interaction
regress wage i.race#c.education     // interaction only (no main effects)

`generate` vs `replace`

generate creates new variables; replace modifies existing ones. Using generate on an existing variable name is an error.

gen x = 1
gen x = 2          // ERROR: x already defined
replace x = 2      // correct

String Comparison Is Case-Sensitive

* May miss "Male", "MALE", etc.
keep if gender == "male"

* Safer
keep if lower(gender) == "male"
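Stray whitespace causes the same kind of silent mismatch as case. A small sketch on made-up data (the four values below are hypothetical), combining lower() with strtrim():

```stata
* Hypothetical data typed in for illustration
clear
input str10 gender
"male"
"Male"
" MALE "
"female"
end

count if gender == "male"                       // exact match only
local n_exact = r(N)

count if lower(strtrim(gender)) == "male"       // ignores case and outer spaces
local n_clean = r(N)

display "exact: `n_exact', cleaned: `n_clean'"
```

In practice it is usually cleaner to normalize the variable once (for example with replace gender = lower(strtrim(gender))) than to repeat the functions in every condition.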

`merge`: Always Check `_merge`

merge 1:1 id using other.dta
tab _merge                      // always inspect
assert _merge == 3              // or handle mismatches
drop _merge
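The check can also be written into the merge call itself with the assert() and keep() options. A sketch, assuming a hypothetical using file built from the bundled auto dataset with a few ids removed on purpose:

```stata
* Hypothetical two-file setup built from the bundled auto data
sysuse auto, clear
gen id = _n
tempfile other
preserve
keep id mpg
drop if id > 70                 // make the using file incomplete on purpose
save `other'
restore
drop mpg

* Master-only rows are allowed; anything else (e.g., using-only) raises an error
merge 1:1 id using `other', assert(master match) keep(master match)
tab _merge
count if _merge == 1            // rows with no match in the using file
drop _merge
```

With assert(match) instead, the merge stops with an error as soon as any row fails to match, which is often the safest default for files that are supposed to line up exactly.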

`preserve` / `restore` for Temporary Changes

preserve
collapse (mean) income, by(state)
* ... do something with collapsed data ...
restore   // original data is back
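A common use is computing group-level statistics and attaching them back to the full data. A minimal sketch, assuming the bundled auto dataset and a temporary file (the names means and mean_price are illustrative):

```stata
* Illustration on the bundled auto data
sysuse auto, clear

tempfile means
preserve
collapse (mean) mean_price = price, by(foreign)
save `means'
restore                          // full dataset is back in memory

* Attach the group means to every original row
merge m:1 foreign using `means', assert(match) nogenerate
```

The tempfile is deleted automatically when the do-file or program ends, so nothing is left on disk.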

Weights Are Not Interchangeable

fweight — frequency weights (replication)
aweight — analytic/regression weights (inverse variance)
pweight — probability/sampling weights (survey data, implies robust SE)
iweight — importance weights (rarely used)
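The difference shows up in the reported sample size and standard errors. A sketch, assuming the bundled auto dataset and an artificial integer weight w created only for illustration:

```stata
* Illustration only: w is a made-up weight, not a real sampling weight
sysuse auto, clear
gen w = 1 + mod(_n, 3)

summarize price [fweight = w]    // each row counts as w identical observations
local n_fw = r(N)

summarize price [aweight = w]    // weighted mean; N stays at the number of rows
local n_aw = r(N)

regress price mpg [pweight = w]  // sampling weights; robust SEs are reported
display "fweight N = `n_fw', aweight N = `n_aw', vce = `e(vce)'"
```

Note that not every command accepts every weight type; summarize, for instance, does not allow pweight.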

`capture` Swallows Errors

capture some_command
if _rc != 0 {
    di as error "Failed with code: " _rc
    exit _rc
}
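Two related patterns are worth keeping in mind: using capture on purpose for commands that are allowed to fail, and capture noisily when the output should still be visible. A sketch, assuming the bundled auto dataset and a deliberately nonexistent variable name:

```stata
* Illustration on the bundled auto data; not_a_real_var is hypothetical
sysuse auto, clear

* Intentional use: drop a variable only if it exists
capture drop scratch

* Branch on _rc right away; the next command overwrites it
capture confirm variable not_a_real_var
if _rc != 0 {
    display as text "variable not found (rc = " _rc ")"
}

* capture noisily still shows output and errors but does not stop the do-file
capture noisily regress price not_a_real_var
local rc = _rc
display "regress returned code `rc'"
```

Copying _rc into a local immediately, as in the last step, avoids losing it when another command runs in between.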

Line Continuation Uses `///`

regress y x1 x2 x3 ///
    x4 x5 x6, ///
    vce(robust)

Stored Results: `r()` vs `e()` vs `s()`

r() — r-class commands (summarize, tabulate, etc.)
e() — e-class commands (estimation: regress, logit, etc.)
s() — s-class commands (parsing)

A new estimation command overwrites previous e() results. Store them first:

regress y x1 x2
estimates store model1
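The same overwrite rule applies to r(). A minimal sketch, assuming the bundled auto dataset, that saves an r() value to a local and switches between two stored models (m1 and m2 are illustrative names):

```stata
* Illustration on the bundled auto data
sysuse auto, clear

summarize price
local mean_price = r(mean)          // copy r() before another r-class command runs

regress price mpg
estimates store m1
regress price mpg weight
estimates store m2                  // e() now describes m2

estimates restore m1                // bring m1 back into e()
display "m1: N = " e(N) ", R2 = " %5.3f e(r2)
display "mean price = " %9.2f `mean_price'
estimates table m1 m2, se
```

Use return list and ereturn list after any command to see exactly which results it left behind.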

Routing Table

Read only the files relevant to the user's task. Paths are relative to this SKILL.md file.

Data Operations

File	Topics & Key Commands
`references/basics-getting-started.md`	`use`, `save`, `describe`, `browse`, `sysuse`, basic workflow
`references/data-import-export.md`	`import delimited`, `import excel`, ODBC, `export`, web data
`references/data-management.md`	`generate`, `replace`, `merge`, `append`, `reshape`, `collapse`, `recode`, `egen`, `encode`/`decode`
`references/variables-operators.md`	Variable types, `byte`/`int`/`long`/`float`/`double`, operators, missing values (`.<.a`), `if`/`in` qualifiers
`references/string-functions.md`	`substr()`, `regexm()`, `strtrim()`, `split`, `ustrlen()`, regex, Unicode
`references/date-time-functions.md`	`date()`, `clock()`, `%td`/`%tc` formats, `mdy()`, `dofm()`, business calendars
`references/mathematical-functions.md`	`round()`, `log()`, `exp()`, `abs()`, `mod()`, `cond()`, distributions, random numbers

Statistics & Econometrics

File	Topics & Key Commands
`references/descriptive-statistics.md`	`summarize`, `tabulate`, `correlate`, `tabstat`, `codebook`, weighted stats
`references/linear-regression.md`	`regress`, `vce(robust)`, `vce(cluster)`, `test`, `lincom`, `margins`, `predict`, `ivregress`
`references/panel-data.md`	`xtset`, `xtreg fe`/`re`, Hausman test, `xtabond`, dynamic panels
`references/time-series.md`	`tsset`, ARIMA, VAR, `dfuller`, `pperron`, `irf`, forecasting
`references/limited-dependent-variables.md`	`logit`, `probit`, `tobit`, `poisson`, `nbreg`, `mlogit`, `ologit`, `margins` for nonlinear
`references/bootstrap-simulation.md`	`bootstrap`, `simulate`, `permute`, Monte Carlo
`references/survey-data-analysis.md`	`svyset`, `svy:`, `subpop()`, complex survey design, replicate weights
`references/missing-data-handling.md`	`mi impute`, `mi estimate`, FIML, `misstable`, diagnostics
`references/maximum-likelihood.md`	`ml model`, custom likelihood functions, `ml init`, gradient-based optimization
`references/gmm-estimation.md`	`gmm`, moment conditions, `estat overid`, J-test

Causal Inference

File	Topics & Key Commands
`references/treatment-effects.md`	`teffects ra/ipw/ipwra/aipw`, `stteffects`, ATE/ATT/ATET
`references/difference-in-differences.md`	DiD, parallel trends, event studies, staggered adoption
`references/regression-discontinuity.md`	Sharp/fuzzy RD, bandwidth selection, `rdplot`
`references/matching-methods.md`	PSM, nearest neighbor, kernel matching, `teffects nnmatch`
`references/sample-selection.md`	`heckman`, `heckprobit`, treatment models, exclusion restrictions

Advanced Methods

File	Topics & Key Commands
`references/survival-analysis.md`	`stset`, `stcox`, `streg`, Kaplan-Meier, parametric models
`references/sem-factor-analysis.md`	`sem`, `gsem`, CFA, path analysis, `alpha`, reliability
`references/nonparametric-methods.md`	`kdensity`, rank tests, `qreg`, `npregress`
`references/spatial-analysis.md`	`spmatrix`, `spregress`, spatial weights, Moran's I
`references/machine-learning.md`	`lasso`, `elasticnet`, `cvlasso`, cross-validation

Graphics

File	Topics & Key Commands
`references/graphics.md`	`twoway`, `scatter`, `line`, `bar`, `histogram`, `graph combine`, `graph export`, schemes

Programming

File	Topics & Key Commands
`references/programming-basics.md`	`local`, `global`, `foreach`, `forvalues`, `program define`, `syntax`, `return`
`references/advanced-programming.md`	`syntax`, `mata`, classes, `_prefix`, dialog boxes, `tempfile`/`tempvar`
`references/mata-introduction.md`	Mata basics, when to use Mata vs ado, data types
`references/mata-programming.md`	Mata functions, flow control, structures, pointers
`references/mata-matrix-operations.md`	Matrix creation, decompositions, solvers, `st_matrix()`
`references/mata-data-access.md`	`st_data()`, `st_view()`, `st_store()`, performance tips

Output & Workflow

File	Topics & Key Commands
`references/tables-reporting.md`	`putexcel`, `putdocx`, `putpdf`, LaTeX integration, `collect`
`references/workflow-best-practices.md`	Project structure, master do-files, version control, debugging, common mistakes
`references/external-tools-integration.md`	Python via `python:`, R via `rsource`, shell commands, Git

Community Packages

File	What It Does
`packages/reghdfe.md`	High-dimensional fixed effects OLS (absorbs multiple FE sets efficiently)
`packages/estout.md`	`esttab`/`estout`: publication-quality regression tables
`packages/outreg2.md`	Alternative regression table exporter (Word, Excel, TeX)
`packages/asdoc.md`	One-command Word document creation for any Stata output
`packages/tabout.md`	Cross-tabulations and summary tables to file
`packages/coefplot.md`	Coefficient plots from stored estimates
`packages/graph-schemes.md`	`grstyle`, `schemepack`, `plotplain` — better graph themes
`packages/did.md`	Modern DiD: `csdid`, `did_multiplegt`, `did_imputation` (Callaway-Sant'Anna, de Chaisemartin-D'Haultfoeuille, Borusyak-Jaravel-Spiess)
`packages/event-study.md`	`eventstudyinteract`, `eventdd` — event study estimators
`packages/rdrobust.md`	Robust RD estimation with optimal bandwidth (`rdrobust`, `rdplot`, `rdbwselect`)
`packages/psmatch2.md`	Propensity score matching (nearest neighbor, kernel, radius)
`packages/synth.md`	Synthetic control method (`synth`, `synth_runner`)
`packages/ivreg2.md`	Enhanced IV/2SLS: `ivreg2`, `xtivreg2` with additional diagnostics
`packages/xtabond2.md`	Dynamic panel GMM (Arellano-Bond/Blundell-Bond)
`packages/binsreg.md`	Binned scatter plots with CI (`binsreg`, `binstest`)
`packages/nprobust.md`	Nonparametric kernel estimation and inference
`packages/diagnostics.md`	`bacondecomp`, `xttest3`, collinearity, heteroskedasticity tests
`packages/winsor.md`	Winsorizing and trimming: `winsor2`, `winsor`
`packages/data-manipulation.md`	`gtools` (fast collapse/egen), `rangestat`, `egenmore`
`packages/package-management.md`	`ssc install`, `net install`, `ado update`, finding packages

Common Patterns

Regression Table Workflow

* Estimate models
eststo clear
eststo: regress y x1 x2, vce(robust)
eststo: regress y x1 x2 x3, vce(robust)
eststo: regress y x1 x2 x3 x4, vce(cluster id)

* Export table
esttab using "results.tex", replace ///
    se star(* 0.10 ** 0.05 *** 0.01) ///
    label booktabs ///
    title("Main Results") ///
    mtitles("(1)" "(2)" "(3)")

Panel Data Setup

xtset panelid timevar          // declare panel structure
xtdescribe                      // check balance
xtsum outcome                   // within/between variation

* Fixed effects
xtreg y x1 x2, fe vce(cluster panelid)
* Or with reghdfe (preferred for multiple FE)
reghdfe y x1 x2, absorb(panelid timevar) vce(cluster panelid)

Difference-in-Differences

* Classic 2x2 DiD
gen post = (year >= treatment_year) if !missing(year)   // guard: a missing year would otherwise count as post
gen treat_post = treated * post
regress y treated post treat_post, vce(cluster id)

* Modern staggered DiD (Callaway & Sant'Anna)
csdid y x1 x2, ivar(id) time(year) gvar(first_treat) agg(event)
csdid_plot

Graph Export

* Publication-quality scatter with fit line
twoway (scatter y x, mcolor(navy%50) msize(small)) ///
       (lfit y x, lcolor(cranberry) lwidth(medthick)), ///
    title("Title Here") ///
    xtitle("X Label") ytitle("Y Label") ///
    legend(off) scheme(s2color)
graph export "figure1.pdf", replace as(pdf)
graph export "figure1.png", replace as(png) width(2400)

Data Cleaning Pipeline

* Load and inspect
import delimited "raw_data.csv", clear varnames(1)
describe
codebook, compact

* Clean
rename *, lower                 // lowercase all varnames
destring income, replace force  // convert string to numeric; force silently turns non-numeric text into missing
replace income = . if income < 0

* Label
label variable income "Annual household income (USD)"
label define yesno 0 "No" 1 "Yes"
label values employed yesno

* Save
compress
save "clean_data.dta", replace

Multiple Imputation

mi set mlong
mi register imputed income education
mi impute chained (regress) income (ologit) education = age i.gender, add(20) rseed(12345)
mi estimate: regress wage income education age i.gender