name: codebook-yaml
description: Format guide for writing and editing YAML codebooks (C1-C4) following the Halterman & Keith (2025) semi-structured codebook format. Apply when creating or modifying any codebook YAML file in prompts/.
user-invocable: false
Codebook YAML Format Guide
This skill defines conventions for all YAML codebook files (prompts/c1_*.yml, prompts/c2_*.yml, etc.). Apply these rules when creating or editing codebook definitions.
Reference Documents
Before writing or modifying a codebook, read:
- docs/strategy.md for codebook-to-R&R-phase mapping and success criteria
- docs/codebook_sources.md for implementation-critical details from R&R and H&K
- docs/methods/Methodology for Quantifying Exogenous Fiscal Shocks.md for R&R operationalizations
- docs/methods/The Halterman & Keith Framework for LLM Content Analysis.md for H&K format specs
File Structure
Each codebook is a single YAML file in prompts/:
prompts/
  c1_measure_id.yml        # C1: Measure Identification
  c2a_extraction.yml       # C2a: Motivation Evidence Extraction (per-chunk)
  c2b_classification.yml   # C2b: Motivation Classification (act-level)
  c3_timing.yml            # C3: Timing Extraction
  c4_magnitude.yml         # C4: Magnitude Extraction
Top-Level Structure
codebook:
  name: "C1: Measure Identification"
  version: "0.1.0"
  # description is optional: a human-facing summary, not sent to the LLM.
  # If present, it is ignored by the prompt assembly pipeline.
  instructions: >
    Overall task instructions provided to the LLM before the class definitions.
    Must describe the input format, expected output format, and any global rules.
  classes:
    - label: "CATEGORY_NAME"
      # ... class definition (see below)
  output_instructions: >
    Output reminder enumerating all valid labels.
    Example: "Classify the passage using exactly one of the following labels:
    FISCAL_MEASURE, NOT_FISCAL_MEASURE"
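The top-level layout can be checked mechanically before any class-level review. The following is a minimal sketch, assuming the YAML file has already been parsed into a Python dict by whatever loader the project uses, and that all keys sit under a single `codebook` mapping. The function name `check_top_level` is hypothetical and not part of the project's pipeline.

```python
import re

# Keys shown in the top-level structure; "description" is optional and not checked.
REQUIRED_TOP_LEVEL_KEYS = ("name", "version", "instructions", "classes", "output_instructions")

# Assumption: versions are plain MAJOR.MINOR.PATCH; pre-release suffixes are not handled.
SEMVER_CORE = re.compile(r"^\d+\.\d+\.\d+$")


def check_top_level(doc: dict) -> list[str]:
    """Return a list of problems found in the top-level codebook structure."""
    codebook = doc.get("codebook")
    if not isinstance(codebook, dict):
        return ["missing top-level 'codebook' mapping"]

    problems = [f"missing key: {key}" for key in REQUIRED_TOP_LEVEL_KEYS if key not in codebook]

    version = codebook.get("version")
    if version is not None and not SEMVER_CORE.match(str(version)):
        problems.append(f"version is not MAJOR.MINOR.PATCH: {version!r}")

    # Only complain about an empty class list if the key exists at all,
    # so a missing key is reported once rather than twice.
    if "classes" in codebook and not codebook["classes"]:
        problems.append("classes must be a non-empty list")
    return problems
```

An empty return value means the skeleton is in place; it says nothing about the quality of the instructions or class definitions.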
Class Definition Structure (Required Fields)
Every class MUST include the fields below, except that positive_examples and negative_examples may be omitted (see Field Requirements). This follows H&K Figure 1.
- label: "DEFICIT_DRIVEN"
  label_definition: >
    A single sentence defining the class. Must be precise enough that a domain
    expert could apply it without additional context.
  clarification:
    - "Inclusion criterion 1: specific enough to ablate individually"
    - "Inclusion criterion 2: each item testable in isolation"
    - "Key evidence phrases that indicate this category"
  negative_clarification:
    - "Exclusion rule 1: addresses the most common confusion case"
    - "Exclusion rule 2: distinguishes from the nearest neighboring class"
    - "Boundary case: how to handle ambiguous situations"
  positive_examples:
    - text: >
        A passage from a real or realistic document that clearly belongs
        to this category.
      reasoning: >
        Explanation of why this passage fits the definition, referencing
        specific phrases and connecting them to the clarification criteria.
  negative_examples:
    - text: >
        A passage that is a plausible near-miss for this category but
        actually belongs to a different class or fails an exclusion criterion.
      reasoning: >
        Explanation of why this passage does NOT fit, referencing the
        specific negative clarification rule that excludes it.
Rules
Field Requirements
- label: UPPER_SNAKE_CASE string. This is the exact string the LLM must return.
- label_definition: Exactly one sentence. No bullet points or sub-items.
- clarification: List of 2-5 items. Each item must be specific enough to remove individually for ablation testing (H&K Table 4).
- negative_clarification: List of 2-5 items. Must address the most common confusion cases between this class and its nearest neighbors.
- positive_examples: Optional (0 or more). When present, each must have both text and reasoning. Ideally 2-3 when included.
- negative_examples: Optional (0 or more). When present, each must have both text and reasoning. Should be near-misses, not obviously wrong examples. Ideally 2-3 when included.
- output_instructions: Must enumerate every valid label explicitly.
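Most of the field requirements above are mechanical and can be linted. The sketch below assumes each class has been parsed into a plain dict with the field names used in this guide; `check_class` and `check_output_instructions` are hypothetical helper names. The "exactly one sentence" rule is deliberately left to human review, since sentence counting is unreliable around abbreviations.

```python
import re

UPPER_SNAKE = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")


def check_class(cls: dict) -> list[str]:
    """Return a list of field-requirement violations for one class definition."""
    problems = []
    label = str(cls.get("label", ""))
    if not UPPER_SNAKE.match(label):
        problems.append(f"label is not UPPER_SNAKE_CASE: {label!r}")
    if not str(cls.get("label_definition", "")).strip():
        problems.append(f"{label}: label_definition is missing")

    # clarification and negative_clarification must each have 2-5 items.
    for field in ("clarification", "negative_clarification"):
        items = cls.get(field) or []
        if not 2 <= len(items) <= 5:
            problems.append(f"{label}: {field} has {len(items)} items, expected 2-5")

    # Examples are optional, but each one present needs both text and reasoning.
    for field in ("positive_examples", "negative_examples"):
        for i, example in enumerate(cls.get(field) or []):
            for key in ("text", "reasoning"):
                if not str(example.get(key, "")).strip():
                    problems.append(f"{label}: {field}[{i}] is missing {key}")
    return problems


def check_output_instructions(output_instructions: str, classes: list[dict]) -> list[str]:
    """Report labels that output_instructions fails to enumerate."""
    problems = []
    for cls in classes:
        label = cls["label"]
        # Match the label as a whole token, so FISCAL_MEASURE is not
        # considered present just because NOT_FISCAL_MEASURE is.
        pattern = rf"(?<![A-Z0-9_]){re.escape(label)}(?![A-Z0-9_])"
        if not re.search(pattern, output_instructions):
            problems.append(f"output_instructions does not mention {label}")
    return problems
```

The whole-token match matters for C1, where one label is a substring of the other.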
Example Quality
- text fields should be realistic passages (from R&R data, government documents, or closely modeled on them)
- reasoning fields must reference specific codebook criteria, not just assert correctness
- Positive examples should demonstrate the prototypical case AND at least one edge case
- Negative examples should be near-misses that test the boundary between classes
- Each example pair (positive + negative) should illuminate one specific distinction
Country-Agnostic Language (CRITICAL)
Codebooks must transfer across countries without modification. Follow these rules:
| Instead of... | Write... |
|---|---|
| "Tax liabilities" (when US-specific) | "Fiscal liabilities or obligations" |
| "Ways and Means Committee" | "Relevant legislative committee" |
| "Economic Report of the President" | "Official economic outlook or assessment documents" |
| "Billions USD" | "Domestic currency, billions" |
| "Congress" | "Legislature" or "legislative body" |
| "President" | "Head of government" or "executive" |
| US-specific act names in definitions | Generic descriptions of fiscal actions |
Exception: US-specific terminology IS allowed in examples (text + reasoning fields), since examples are inherently country-specific. But definitions, clarifications, and negative clarifications must be country-agnostic.
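A simple term scan can catch the most obvious violations of this rule. The sketch below is an assumption-laden starting point: the term list is taken from the left-hand column of the table (omitting "Tax liabilities", which is only a problem when US-specific), the function name `find_us_specific_terms` is hypothetical, and example fields are skipped because of the exception stated above.

```python
import re

# Unconditionally US-specific terms from the table; extend as needed.
US_SPECIFIC_TERMS = (
    "Ways and Means Committee",
    "Economic Report of the President",
    "Billions USD",
    "Congress",
    "President",
)

# Only these fields must be country-agnostic; examples are exempt.
AGNOSTIC_FIELDS = ("label_definition", "clarification", "negative_clarification")


def find_us_specific_terms(cls: dict) -> list[tuple[str, str]]:
    """Return (field, term) pairs for US-specific terms found in agnostic fields."""
    hits = []
    for field in AGNOSTIC_FIELDS:
        value = cls.get(field) or []
        texts = [value] if isinstance(value, str) else value
        for text in texts:
            for term in US_SPECIFIC_TERMS:
                if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
                    hits.append((field, term))
    return hits
```

A clean scan does not prove a codebook is country-agnostic; US-specific act names and institutional assumptions still need a manual read.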
Ablation-Ready Design
Per H&K Table 4, each codebook component may be individually removed during S3 error analysis. Design with this in mind:
- Each clarification item should make an independent, testable contribution
- Removing any single clarification should measurably change model behavior
- Do not combine multiple criteria into a single clarification bullet
- Negative clarifications should each address a distinct confusion pattern
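One way to prepare for ablation is to generate one codebook variant per removable item. The following is a minimal sketch under the assumption that a class is a parsed dict; `ablation_variants` is a hypothetical helper, and how each variant is then run and scored is left to the project's evaluation code.

```python
import copy


def ablation_variants(cls: dict, field: str = "clarification"):
    """Yield (removed_item, variant) pairs, each variant lacking exactly one item of `field`.

    Works for either "clarification" or "negative_clarification".
    The input class is never modified.
    """
    items = cls.get(field) or []
    for index, removed in enumerate(items):
        variant = copy.deepcopy(cls)
        del variant[field][index]
        yield removed, variant
```

If two variants produce indistinguishable model behavior, that suggests the corresponding items overlap or that one of them is not contributing independently.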
Output Instructions Format
The output_instructions field must:
- Remind the LLM of the exact valid labels
- Specify the output format (plain text label, JSON, etc.)
- Include any structured output requirements
Example for C2 (Motivation):
output_instructions: >
  Classify the motivation using exactly one of: SPENDING_DRIVEN,
  COUNTERCYCLICAL, DEFICIT_DRIVEN, LONG_RUN.
  Then determine exogeneity: EXOGENOUS if the motivation is DEFICIT_DRIVEN
  or LONG_RUN; ENDOGENOUS if SPENDING_DRIVEN or COUNTERCYCLICAL.
  Return your answer as:
  Label: [MOTIVATION]
  Exogenous: [EXOGENOUS/ENDOGENOUS]
  Reasoning: [Brief explanation citing specific evidence from the passage]
S0-S3 Workflow
Codebook development follows a sequential stage-gate process. Each stage must pass before proceeding to the next.
- S0 (Codebook Preparation): Draft the YAML codebook following the structure in this SKILL. All required fields must be present. Submit for domain expert approval before proceeding.
- S1 (Behavioral Tests): Run Tests I-IV (see below) to verify the codebook produces sane model behavior. All tests must pass thresholds before proceeding to evaluation.
- S2 (Zero-Shot Evaluation): Run zero-shot classification on chunk test set using R/codebook_stage_2.R. Compare results to success criteria in docs/strategy.md. If targets are not met, return to S0 for codebook revision.
- S3 (Error Analysis): Run Tests V-VII plus ablation studies. Categorize errors using H&K taxonomy (A-F). Use findings to inform S0 revisions or, as a last resort, trigger S4.
The stage-gate rule: do not proceed to S(N+1) until S(N) passes. Iteration loops back to S0 (codebook revision), not forward to the next stage.
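The stage-gate rule amounts to a very small state machine. The sketch below is illustrative only: `next_stage` is a hypothetical helper, and returning `None` after a passed S3 is an assumption, since what follows S3 depends on the error-analysis findings.

```python
from typing import Optional

STAGES = ("S0", "S1", "S2", "S3")


def next_stage(current: str, passed: bool) -> Optional[str]:
    """Return the stage to work on next under the stage-gate rule."""
    if current not in STAGES:
        raise ValueError(f"unknown stage: {current!r}")
    if not passed:
        # Any failure loops back to codebook revision, never forward.
        return "S0"
    index = STAGES.index(current)
    # Assumption: nothing is scheduled automatically after a passed S3.
    return STAGES[index + 1] if index + 1 < len(STAGES) else None
```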
Behavioral Test Design (S1)
H&K define 7 behavioral tests (their Table 3). Tests I-IV run during S1; Tests V-VII run during S3.
S1 Tests (Run Before Evaluation)
| Test | Name | What It Checks | Codebook Fields Used |
|---|---|---|---|
| I | Legal Output | Does the model always return valid output matching the schema? | output_instructions |
| II | Definition Recovery | Given the label definition as input text, does the model return the correct label? | label, label_definition |
| III | Example Recovery | Given positive/negative examples, does the model return the correct labels? | positive_examples, negative_examples |
| IV | Order Invariance | Does shuffling the order of class definitions change predictions? | classes (ordering) |
Pass criteria for all codebooks: Test I: 100% valid outputs. Test II: 100% correct labels. Test III: 100% correct labels. Test IV: <5% label change rate across orderings.
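The pass criteria can be encoded directly as thresholds. In the sketch below, the label change rate is computed as the share of predictions across all shuffled orderings that differ from the baseline ordering; that definition is an assumption, and H&K's exact metric should be checked before relying on it. Both function names are hypothetical.

```python
def label_change_rate(baseline: list[str], shuffled_runs: list[list[str]]) -> float:
    """Fraction of predictions in the shuffled runs that differ from the baseline run."""
    total = 0
    changed = 0
    for run in shuffled_runs:
        if len(run) != len(baseline):
            raise ValueError("every run must cover the same items as the baseline")
        for base_label, run_label in zip(baseline, run):
            total += 1
            changed += base_label != run_label
    return changed / total if total else 0.0


def s1_results(valid_rate: float, definition_acc: float,
               example_acc: float, change_rate: float) -> dict[str, bool]:
    """Apply the S1 thresholds; all rates are fractions in [0, 1]."""
    return {
        "I": valid_rate == 1.0,        # 100% valid outputs
        "II": definition_acc == 1.0,   # 100% correct labels
        "III": example_acc == 1.0,     # 100% correct labels
        "IV": change_rate < 0.05,      # <5% label change rate
    }
```

S1 passes only when every value in the returned dict is true, for example `all(s1_results(...).values())`.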
S3 Tests (Run During Error Analysis)
| Test | Name | What It Checks | Codebook Fields Used |
|---|---|---|---|
| V | Exclusion Criteria | Does removing a negative clarification increase errors for that confusion case? | negative_clarification (individual items) |
| VI | Generic Labels | Does replacing label names with LABEL_1..N change predictions? | label (names vs. definitions) |
| VII | Swapped Labels | Does swapping definitions across label names change predictions? | label, label_definition (cross-assignment) |
Tests VI and VII detect whether the model relies on the semantic content of label names rather than the definitions. This is especially critical for C2 (motivation), where class names like DEFICIT_DRIVEN carry strong semantic priors.
Semantic Label Risk (Tests VI/VII)
H&K find that LLMs can rely on the semantic meaning of label names rather than reading the actual definitions. This is a major risk for this project because C2's labels (SPENDING_DRIVEN, COUNTERCYCLICAL, DEFICIT_DRIVEN, LONG_RUN) are highly semantically loaded.
Mitigation guidance for codebook authors:
- Write definitions that add information beyond what the label name implies
- Include negative clarifications that explicitly contradict the "obvious" reading of the label name (e.g., "An act is NOT DEFICIT_DRIVEN merely because the word 'deficit' appears in the passage")
- Ensure the distinction between classes cannot be resolved by label name alone
- For C2 specifically: the countercyclical/long-run boundary requires the "return to normal" test, which is NOT implied by either label name
When to worry: If Test VI (generic labels) produces significantly different results than the original labels, the model is relying on label semantics. If Test VII (swapped labels) produces results that follow the swapped names rather than the swapped definitions, the model is ignoring definitions entirely.
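Both variants can be derived from the parsed class list without touching the original. The following is a sketch with hypothetical helper names; it assumes classes are dicts, and it uses a one-step rotation for Test VII, which is only one of several possible swap schemes. Clarifications or examples that mention label names in their text would need separate rewriting, which this sketch does not attempt.

```python
import copy


def with_generic_labels(classes: list[dict]) -> tuple[list[dict], dict[str, str]]:
    """Test VI variant: rename labels to LABEL_1..N, leaving definitions untouched.

    Also returns a mapping from generic label back to the original label,
    so predictions can be compared with the original run.
    """
    variant = copy.deepcopy(classes)
    mapping = {}
    for i, cls in enumerate(variant, start=1):
        generic = f"LABEL_{i}"
        mapping[generic] = cls["label"]
        cls["label"] = generic
    return variant, mapping


def with_swapped_definitions(classes: list[dict]) -> list[dict]:
    """Test VII variant: rotate label names by one position.

    Each definition (with its clarifications and examples) stays in place
    but is now attached to the next class's label name.
    """
    variant = copy.deepcopy(classes)
    labels = [cls["label"] for cls in classes]
    rotated = labels[1:] + labels[:1]
    for cls, new_label in zip(variant, rotated):
        cls["label"] = new_label
    return variant
```

With the rotation in place, a prediction that matches the name's original meaning rather than its new definition is evidence that the model is following label semantics.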
Output Instruction Templates
C1: Measure Identification
output_instructions: >
  Classify the passage using exactly one of: FISCAL_MEASURE, NOT_FISCAL_MEASURE.
  Return your answer as JSON:
  {
    "label": "FISCAL_MEASURE" or "NOT_FISCAL_MEASURE",
    "measure_name": "Name of the act if FISCAL_MEASURE, null otherwise",
    "reasoning": "Brief explanation citing specific evidence from the passage"
  }
C2: Motivation Classification
output_instructions: >
  Classify the motivation using exactly one of: SPENDING_DRIVEN,
  COUNTERCYCLICAL, DEFICIT_DRIVEN, LONG_RUN.
  Then determine exogeneity: EXOGENOUS if the motivation is DEFICIT_DRIVEN
  or LONG_RUN; ENDOGENOUS if SPENDING_DRIVEN or COUNTERCYCLICAL.
  Return your answer as JSON:
  {
    "label": "MOTIVATION_LABEL",
    "exogenous": true or false,
    "reasoning": "Brief explanation citing specific evidence from the passage"
  }
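Because the exogeneity flag is derived from the label, a response can be checked for internal consistency as well as for schema validity. The sketch below assumes the model returns the JSON object on its own, with no surrounding prose; `check_c2_output` is a hypothetical helper of the kind Test I would need.

```python
import json

# Derived flag per the template: DEFICIT_DRIVEN and LONG_RUN are exogenous.
EXOGENOUS_BY_LABEL = {
    "SPENDING_DRIVEN": False,
    "COUNTERCYCLICAL": False,
    "DEFICIT_DRIVEN": True,
    "LONG_RUN": True,
}


def check_c2_output(raw: str) -> list[str]:
    """Return a list of problems with one raw C2 model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]

    problems = []
    label = data.get("label")
    if not isinstance(label, str) or label not in EXOGENOUS_BY_LABEL:
        problems.append(f"invalid label: {label!r}")
    elif data.get("exogenous") is not EXOGENOUS_BY_LABEL[label]:
        # The flag must be a real boolean that agrees with the label.
        problems.append(f"exogenous flag inconsistent with {label}")

    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        problems.append("reasoning must be a non-empty string")
    return problems
```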
C3: Timing Extraction
output_instructions: >
  Extract the implementation timing for the fiscal measure.
  Return your answer as JSON:
  {
    "timing": [
      {"quarter": "YYYY-QN", "amount_at_annual_rate": number_or_null}
    ],
    "retroactive": true or false,
    "reasoning": "Brief explanation of how timing was determined"
  }
  Use the midpoint rule for phased changes. Record each phase as a
  separate entry. Use null for amount if not extractable from this passage.
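The C3 schema lends itself to a structural check. The sketch below reads "YYYY-QN" as a four-digit year followed by a quarter from 1 to 4, which is an assumption about the intended format; `check_c3_output` is a hypothetical helper, and it validates structure only, not whether the midpoint rule was applied correctly.

```python
import json
import re

# Assumption: "YYYY-QN" means a four-digit year and a quarter number 1-4.
QUARTER = re.compile(r"^\d{4}-Q[1-4]$")


def check_c3_output(raw: str) -> list[str]:
    """Return a list of structural problems with one raw C3 model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]

    problems = []
    timing = data.get("timing")
    if not isinstance(timing, list):
        problems.append("timing must be a list")
        timing = []
    for i, entry in enumerate(timing):
        if not isinstance(entry, dict):
            problems.append(f"timing[{i}] must be an object")
            continue
        quarter = entry.get("quarter")
        if not isinstance(quarter, str) or not QUARTER.match(quarter):
            problems.append(f"timing[{i}].quarter is not YYYY-QN: {quarter!r}")
        amount = entry.get("amount_at_annual_rate")
        # null is allowed; booleans are rejected even though bool subclasses int.
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if amount is not None and not is_number:
            problems.append(f"timing[{i}].amount_at_annual_rate must be a number or null")
    if not isinstance(data.get("retroactive"), bool):
        problems.append("retroactive must be true or false")
    return problems
```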
C4: Magnitude Extraction
output_instructions: >
  Extract the fiscal impact magnitude of the measure.
  Return your answer as JSON:
  {
    "magnitude_billions": number,
    "currency": "USD",
    "annual_rate": true or false,
    "source_tier": 1-4,
    "sign_convention": "positive = tax increase / revenue gain",
    "reasoning": "Brief explanation citing the source of the estimate"
  }
  Source tier: 1 = ERP/official economic assessment, 2 = calendar year estimate,
  3 = fiscal year estimate, 4 = conference report / legislative estimate.
  Prefer the highest-tier (lowest number) source available.
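When several estimates are extracted for the same measure, the tier preference reduces to picking the lowest tier number. The following is a minimal sketch, assuming each estimate is a dict shaped like the JSON above; `pick_estimate` is a hypothetical helper, and how ties within a tier are broken is not specified in this guide (here the first one encountered wins).

```python
from typing import Optional

VALID_TIERS = (1, 2, 3, 4)


def pick_estimate(estimates: list[dict]) -> Optional[dict]:
    """Return the estimate with the lowest source_tier number, or None if none is usable."""
    usable = [e for e in estimates if e.get("source_tier") in VALID_TIERS]
    if not usable:
        return None
    # min() keeps the first element among equal tiers.
    return min(usable, key=lambda e: e["source_tier"])
```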
Codebook-Specific Notes
C1: Measure Identification
- Binary classification: FISCAL_MEASURE vs NOT_FISCAL_MEASURE
- Must operationalize R&R's "significant mention" rule
- Must exclude extensions and withholding-only changes
- Include extraction instruction for identified measure text
C2: Motivation Classification
- 4-class: SPENDING_DRIVEN, COUNTERCYCLICAL, DEFICIT_DRIVEN, LONG_RUN
- Plus derived exogeneity flag (not a separate classification)
- Must handle mixed motivations (apportionment guidance)
- Boundary between COUNTERCYCLICAL and LONG_RUN is the critical distinction
C3: Timing Extraction
- Structured extraction, not classification
- Must operationalize the midpoint rule
- Must handle phased changes (multiple quarters per act)
- Must handle retroactive components (standard vs adjusted series)
C4: Magnitude Extraction
- Structured extraction of revenue estimates
- Must specify the fallback hierarchy for sources
- Must distinguish policy-driven from growth-driven revenue changes
- Annual rate convention must be explicit
Validation Checklist
Before finalizing any codebook, verify:
- All required fields present for every class
- label_definition is exactly one sentence per class
- Every example has both text and reasoning
- Negative examples are near-misses, not strawmen
- No US-specific terminology in definitions or clarifications
- Each clarification item is independently testable (ablation-ready)
- output_instructions enumerate all valid labels
- Version number follows semver (start at 0.1.0)