name: disclosure-check description: Pre-screen analysis outputs (tables, figures, logs) built on restricted or confidential data for statistical-disclosure-limitation problems before any release. Scans for small cell counts, complementary-suppression gaps, dominance (p-percent / (n,k)), re-identifiable exact counts, PII leakage, and unrounded sensitive statistics; classifies each finding CRITICAL / WARNING / OK and gates on any CRITICAL. Use before depositing or sharing restricted-data results, or when the user says "disclosure check", "SDL scan", "is this output safe to release", "check for small cells", "disclosure avoidance", "pre-screen for the RDC", or "can I export this from the enclave". argument-hint: "[outputs-dir] [--provider census|irs|irb|generic] [--threshold N] (outputs-dir defaults to scripts/R/_outputs/)" disable-model-invocation: true allowed-tools: ["Read", "Grep", "Glob", "Write", "Bash"] effort: high
/disclosure-check — Statistical-Disclosure-Limitation pre-screen
Scan analysis outputs built on restricted or confidential data (Census FSRDC, IRS SOI, administrative registers, linked health records, proprietary firm panels) for the disclosure-avoidance problems that get an export request rejected — before it reaches the data provider's official disclosure review. The skill is a pre-screen, not a substitute for that review.
Core principle: A single un-suppressed n=3 cell, an exact count that pins down one firm, or a p-percent dominance failure can re-identify a person or establishment. Catch it on your machine, not in the rejection email from the RDC analyst.
When to use
- Before requesting an export from a Census FSRDC / secure data enclave / RDC.
- Before depositing restricted-data results to openICPSR, a journal, or a co-author outside the enclave.
- Before sharing any figure, table, or log derived from confidential microdata.
- As a release gate. Pair with a pre-commit / pre-deposit invocation so no restricted-data output ships un-screened. This is the foundation of the data-management plan for any restricted-data project.
Inputs
$0— outputs directory to scan. Defaults toscripts/R/_outputs/. Recognised siblings:scripts/stata/_outputs/,scripts/python/_outputs/, or any export-staging directory (e.g., ato_review/folder the analyst stages for the RDC).--provider— selects which disclosure-rule profile to load (Phase 0). One ofcensus/irs/irb/generic. Providers differ — thresholds and rules are not interchangeable; defaultgenericis deliberately conservative.--threshold N— override the minimum cell count (defaultn<10). Census FSRDC commonly uses 10 for establishments; IRS and many IRBs differ. Always reconcile with your provider's written rules.
Workflow
Phase 0: Load the provider's disclosure rules
- Read
.claude/rules/confidential-data.mdfor the project's restricted-data handling contract and the rule-profile placeholder. - Load the
--providerprofile (a placeholder config the forker fills in from their signed agreement — Census, IRS, and IRB rules differ and supersede any default here):- min cell count (default
n<10), - dominance rules:
p-percent (a cell is unsafe if the largest respondents contribute >p% of the total) and(n,k)(topnunits >k% of total), - rounding required for sensitive statistics (counts, totals, ratios),
- top-coding / bottom-coding thresholds for extreme values,
- geographic minimum population for any geocoded statistic.
- min cell count (default
- If no signed-rule values are recorded, fall back to the conservative
genericprofile and flag prominently in the report that real provider thresholds must be substituted.
Phase 1: Scan the outputs directory
Glob the outputs dir for .tex, .csv, .txt, .log, .smcl, .out, .md tables and figure-data files. For each:
- Cell counts — parse table cells / frequency columns; flag any count
0 < n < thresholdthat is not already suppressed. - Complementary-suppression gaps — if one cell in a row/column is suppressed but the margin total and the other cells let a reader back it out by subtraction, the suppression is incomplete.
- Dominance — for any total/mean cell where unit-level contributions are available (or inferable), apply the
p-percent and(n,k)rules. - Exact re-identifying counts — small exact integers (e.g., "4 hospitals", "1 firm", a max/min that is a single observation) that single out a unit.
- PII leakage — regex for names, SSNs (
\d{3}-\d{2}-\d{4}), exact dates of birth, addresses, exact lat/long or fine geocodes, record IDs that survived into an output. - Unrounded sensitive statistics — exact unrounded counts/totals where the provider requires rounding.
Phase 2: Classify each finding — CRITICAL / WARNING / OK
| Disposition | Meaning | Examples |
|---|---|---|
| CRITICAL | Would fail the provider's disclosure review; blocks release. | n=3 cell un-suppressed; complementary-suppression hole; p-percent dominance failure; any PII; an exact count identifying ≤2 units. |
| WARNING | Plausibly safe but needs a human judgment call. | Cell at exactly the threshold; unrounded total just over a rounding base; geographic statistic near the min-population floor. |
| OK | Within the loaded rules, no action needed. | Counts ≥ threshold and rounded; dominance passes; no PII. |
When two findings interact (a suppressed cell + a recoverable margin), report them together — the gate cares about the joint disclosure risk, not each cell in isolation. Be economics-aware: DiD / event-study cell counts per (cohort × period), IV first-stage subsamples, RCT arm × stratum balance tables, and panel firm-counts are the usual offenders.
Phase 3: Suggest remediation
For each CRITICAL / WARNING, propose the standard SDL fix, in order of preference:
- Suppress the offending cell (and its complement, if a margin allows back-out).
- Round counts/totals to the provider's base (e.g., nearest 10 or 15).
- Top-code / bottom-code extreme values.
- Aggregate — collapse thin categories, coarsen geography, widen bins until every cell clears the threshold.
- Drop the statistic if no remediation preserves both safety and meaning.
Each suggestion names the file, the cell/location, the rule it violates, and the concrete edit — never auto-applies it (the analyst owns the disclosure decision).
Phase 4: Gate
Exit non-zero on any CRITICAL. WARNINGs surface but do not block. See Exit behavior.
Output / Report format
Write quality_reports/disclosure_check_[outputs-dir-slug].md:
# Disclosure Check: [outputs dir]
**Date:** [YYYY-MM-DD]
**Provider profile:** census | irs | irb | generic (rules source: confidential-data.md)
**Min cell count:** [N] **Dominance:** p=[p]%, (n,k)=([n],[k]%) **Rounding base:** [b]
## Summary
| Disposition | Count |
|---|---|
| CRITICAL | M |
| WARNING | W |
| OK | P |
| **Verdict** | **PASS / FAIL** (FAIL iff M > 0) |
## CRITICAL (blocks release)
| File | Location | Rule violated | Observed | Suggested remediation |
|---|---|---|---|---|
| tab3_by_cohort.tex | row "2008", col "n" | min cell (n<10) | n=4 | suppress cell + suppress complement in margin |
## WARNING (human judgment)
| File | Location | Concern | Suggested action |
|---|---|---|---|
## OK
[counts only, or a short list]
## Next steps
1. Resolve every CRITICAL — suppress / round / top-code / aggregate, then re-run.
2. Review WARNINGs with the agreement's written rules in hand.
3. Re-run until zero CRITICAL, THEN submit to the provider's OFFICIAL disclosure review.
Exit behavior
- Zero CRITICAL: exit 0; report printed. (WARNINGs allowed — they are surfaced, not blocking.)
- Any CRITICAL: exit 1; summary to stderr. This makes the skill usable as a release / pre-deposit gate. Mirrors
/audit-reproducibility's gate semantics: WARNING ≠ FAIL, only CRITICAL blocks. - No rules loaded (generic fallback): exit 0 with a prominent warning that real provider thresholds were not supplied — the pre-screen ran but at conservative defaults, not the actual agreement.
Flags
--provider<name>— Load that data provider's disclosure rules (e.g.census-fsrdc,irs,irb). Default: the generic small-cell ruleset.--threshold<n>— Override the minimum cell-count threshold (defaultn<10); match your data-use agreement's actual rule.
Cross-references
.claude/rules/confidential-data.md— restricted-data handling contract + the provider-rule profiles this skill loads..claude/rules/replication-protocol.md— for restricted-data papers the replication package ships code + access path, not the microdata; screen every released output first..claude/skills/audit-reproducibility/SKILL.md— numeric paper↔code verification: run it on the retained values, this skill on the released ones..claude/skills/data-analysis/SKILL.md,.claude/skills/stata-replication/SKILL.md— produce the R / Stata / Python outputs this skill screens.- AEA Data Editor checklist and the DCAS standard — disclosure + access expectations for restricted-data deposits (openICPSR restricted-access stub).
What this skill does NOT do
- It does not replace the data provider's official disclosure review. Census/RDC, IRS, and IRB analysts run the authoritative review; this skill pre-screens so the official review is more likely to pass on the first pass. A PASS here is not clearance to release.
- It does not certify your rules are correct. It applies the thresholds you load from your signed agreement; if the loaded
--providerprofile is wrong, the scan is wrong. Reconcile with the written agreement, not a default. - It does not move, encrypt, or transmit data, never exfiltrates microdata from the enclave — it reads only the staged outputs you point it at.
- It does not catch every disclosure risk. Differencing across released tables, longitudinal re-identification, and model-based inferential disclosure can evade a per-file scan. A clean run is necessary, not sufficient.