name: cds-extractor
description: |
  Extract structured CSV/JSON data from any U.S. university's Common Data Set (CDS) PDFs and assemble a clean open-data repository. The CDS uses a standardized template (sections A through J, fields A0–J3) shared across nearly every American college, so this works for Stanford, Princeton, Harvard, MIT, public flagships, liberal arts colleges, and anyone else who publishes a CDS.

  Trigger this skill when the user wants to:
  - "Parse the CDS for [university]"
  - "Extract admissions / financial aid / tuition data from CDS PDFs"
  - "Build an open data repo from [university]'s Common Data Set"
  - "Convert CDS PDFs to CSV"
  - "Track 5/10/20-year trends in admit rate / yield / test scores at [university]"
  - "Reproduce the Stanford CDS open data repo for another school"

  Also trigger for any request that names the Common Data Set or a specific CDS section by code (e.g. "B1 enrollment table", "C9 test scores").
Common Data Set extractor
Convert university Common Data Set PDFs into a queryable open-data repo.
When to use this skill
Use it when the user has, or wants to acquire, one or more CDS PDFs from a U.S. college or university and wants the data in CSV/JSON form. Don't use it for other types of higher-ed data (IPEDS direct, College Scorecard, internal admissions data); those have their own structures.
Inputs
The user typically supplies one of:
- A list of public PDF URLs. Most institutions publish their CDSes on a
  public IR/registrar page (e.g. irds.stanford.edu/data-findings/cds,
  princeton.edu/.../common-data-set). Some host on Google Drive.
- A folder of PDFs they've already downloaded.
- The URL of a CDS index page, which Claude can scrape for PDF links.
If the user only names the institution without supplying URLs, web-search
<university> common data set and surface the index page. Confirm the URLs
with the user before downloading.
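Scraping an index page for PDF links can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the skill's own code: the names `PdfLinkCollector` and `find_pdf_links` are hypothetical, and it assumes the links end in `.pdf` (Google Drive links will not and need separate handling).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect href values from <a> tags that point at PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.split("?")[0].lower().endswith(".pdf"):
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def find_pdf_links(html, base_url):
    """Return absolute PDF URLs found in html, in page order, without repeats."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The resulting list is what you would show the user for confirmation before anything is downloaded.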
Workflow
1. Build the manifest
Edit scripts/manifest.py to list every (year_label, source) pair you want
to ingest. source can be:
- A direct PDF URL:
  "https://provost.princeton.edu/sites/default/files/2024-04/CDS_2023-2024.pdf"
- A Google Drive file ID:
  "1GIPKgVj1d86dkmLkHI_mZVCk_iY6kiCp"
- An absolute local path:
  "/Users/me/Downloads/cds_2024.pdf"
CDS_MANIFEST = [
("2024-2025", "https://example.edu/cds/2024-2025.pdf"),
("2023-2024", "https://example.edu/cds/2023-2024.pdf"),
]
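Because a manifest entry can be a URL, a Drive file ID, or a local path, a downloader has to tell the three apart. The sketch below shows one plausible way to do it; `classify_source` and the Drive ID pattern are assumptions for illustration, not necessarily what fetch.py does.

```python
import re
from urllib.parse import urlparse

# Assumption: Drive file IDs are long runs of letters, digits, "-" and "_".
DRIVE_ID_RE = re.compile(r"[A-Za-z0-9_-]{20,}")


def classify_source(source):
    """Return "url", "local", or "drive_id" for one manifest source string."""
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    if source.startswith("/"):
        return "local"
    if DRIVE_ID_RE.fullmatch(source):
        return "drive_id"
    raise ValueError(f"Unrecognized manifest source: {source!r}")
```

Failing loudly on an unrecognized entry catches manifest typos before any download starts.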
2. Download the PDFs
python scripts/fetch.py
Saves to raw_pdfs/cds-<year>.pdf. Skips files that already exist with
non-trivial size, so it's safe to re-run after manifest edits.
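The skip-if-present behaviour can be pictured with a small helper like the one below. It is only a sketch: the 1 KiB threshold and the names `target_path` and `needs_download` are assumptions, since the actual cutoff for "non-trivial size" is not stated here.

```python
from pathlib import Path

# Assumption: anything under 1 KiB is treated as a failed or empty download.
MIN_BYTES = 1024


def target_path(year_label, out_dir="raw_pdfs"):
    """Build the output path for one manifest year, e.g. raw_pdfs/cds-2024-2025.pdf."""
    return Path(out_dir) / f"cds-{year_label}.pdf"


def needs_download(path, min_bytes=MIN_BYTES):
    """True when the file is missing or too small to be a real PDF."""
    path = Path(path)
    return not path.exists() or path.stat().st_size < min_bytes
```

Checking size as well as existence means a truncated file left by an interrupted run gets fetched again instead of being silently kept.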
3. Convert PDFs to layout text
mkdir -p raw_text
for f in raw_pdfs/*.pdf; do
  pdftotext -layout "$f" "raw_text/$(basename "${f%.pdf}").txt"
done
pdftotext ships with poppler (brew install poppler on macOS). The
-layout flag preserves column alignment which the parsers rely on.
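To see why column alignment matters, here is a rough sketch of how a parser might read one row of layout text: cells are separated by runs of two or more spaces, while single spaces stay inside a label. The helper names are hypothetical, and the real parsers may use a different rule.

```python
import re


def split_layout_row(line):
    """Split one -layout line into cells on runs of two or more spaces."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def parse_int(cell):
    """Parse "1,234" style integers; return None for non-numeric cells."""
    digits = cell.replace(",", "")
    return int(digits) if digits.isdigit() else None
```

Without `-layout`, pdftotext tends to emit labels and values on separate lines, and a split like this no longer pairs them up.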
4. Chunk into per-field structured JSON
python scripts/parse_blocks.py
Walks each raw_text/ file, splits it into the 10 sections (A–J), and
keys every numbered field block (A0, A1, …, J3) with its raw text.
Output: data/json/<year_label>/blocks.json.
This step works on the standard CDS template and rarely needs adjusting.
If a year fails, the most likely culprit is a section header that uses a
non-standard label — search for SECTION_HEADERS in parse_blocks.py and
add a regex variant.
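The chunking idea can be sketched as below. This is a simplified stand-in for parse_blocks.py, not a copy of it: it keys blocks on field codes at the start of a line and skips the section-header matching entirely. `FIELD_CODE_RE` and `split_field_blocks` are hypothetical names.

```python
import re

# Assumption: a field block starts with its code (A0, B1, C9, H2A, ...) at the
# beginning of a line, optionally indented.
FIELD_CODE_RE = re.compile(r"^[ \t]*([A-J]\d{1,2}[A-Z]?)\b", re.MULTILINE)


def split_field_blocks(text):
    """Return {section: {field: {"raw_text": ...}}} for one layout-text file."""
    matches = list(FIELD_CODE_RE.finditer(text))
    blocks = {}
    for i, match in enumerate(matches):
        code = match.group(1)
        # A block runs until the next field code, or to the end of the file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        raw = text[match.start():end].strip()
        section = blocks.setdefault(code[0], {})
        if code in section:
            # Repeated codes (e.g. after a page break) are appended, not overwritten.
            section[code]["raw_text"] += "\n" + raw
        else:
            section[code] = {"raw_text": raw}
    return blocks
```

The nested shape mirrors the `blocks.json[<section>][<field>]['raw_text']` access pattern used later in this document.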
Step 2.5 (optional): ingesting HTML-published CDSes
Some institutions (MIT after 2017-18, a handful of others) publish their CDS as web pages instead of PDF files. For those years this step takes the place of steps 2 and 3, so run it before step 4. To pull them in, add a parallel manifest:
# scripts/manifest_html.py
HTML_MANIFEST = [
("2024-2025", "https://ir.example.edu/projects/2024-25-common-data-set/"),
("2023-2024", "https://web.archive.org/web/2024/https://ir.example.edu/cds-2024/"),
# Wayback URLs are fine for years that have rolled off the live site.
]
Then:
python scripts/fetch_html.py
This downloads each page, strips nav/footer/script tags, and saves
raw_text/cds-<year>.txt in the same format pdftotext produces. After
that, the rest of the pipeline (parse_blocks.py → extract_metrics.py →
validate.py) runs unchanged.
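The tag-stripping step might look something like this sketch, built on the standard library's `html.parser`. It is an assumption about the approach, not fetch_html.py itself; in particular, joining table cells with two spaces is a guess at how to imitate pdftotext's layout output.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "nav", "footer"}
# Assumption: ending a line after these tags is enough to keep rows apart.
LINE_BREAK_TAGS = {"p", "div", "tr", "li", "h1", "h2", "h3", "h4", "table"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in LINE_BREAK_TAGS:
            self.parts.append("\n")
        elif tag in ("td", "th"):
            # Two spaces between cells, mimicking pdftotext -layout columns.
            self.parts.append("  ")

    def handle_data(self, data):
        # Text inside nav/footer/script/style is dropped.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return visible page text, one table row or paragraph per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).splitlines()
    return "\n".join(line.rstrip() for line in lines if line.strip())
```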
5. Extract clean numeric tables
python scripts/extract_metrics.py
Reads data/json/*/blocks.json and produces long-format CSVs in
data/csv/:
| File | What it covers |
|---|---|
| admissions_summary.csv | applied / admitted / enrolled per year |
| admissions_by_sex.csv | same broken out by male/female/unknown |
| test_scores.csv | SAT & ACT 25th/50th/75th percentiles |
| enrollment_summary.csv | undergrad/grad × FT/PT × M/F |
| tuition_and_fees.csv | tuition, fees, food/housing, books, transport |
| financial_aid_summary.csv | avg need-based grant / loan / aid package |
| faculty_summary.csv | total faculty + student/faculty ratio |
| graduation_rates.csv | 4/5/6-yr grad rates + freshman retention |
| all_fields_long.csv | every field × every year (searchable index) |
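Long-format output makes trend questions such as admit rate and yield easy to answer. The sketch below assumes the CSV has `year`, `metric`, and `value` columns and that the metric names are `applied`, `admitted`, and `enrolled`; check the actual header of admissions_summary.csv before relying on it.

```python
import csv
from collections import defaultdict


def admit_rate_and_yield(csv_file):
    """Pivot long-format rows into per-year admit rate and yield.

    csv_file is any open text file or file-like object with a header row.
    """
    by_year = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        by_year[row["year"]][row["metric"]] = float(row["value"])

    results = {}
    for year, metrics in sorted(by_year.items()):
        needed = ("applied", "admitted", "enrolled")
        if not all(name in metrics for name in needed):
            continue  # skip years with missing fields
        if metrics["applied"] == 0 or metrics["admitted"] == 0:
            continue  # avoid dividing by zero on bad extractions
        results[year] = {
            "admit_rate": metrics["admitted"] / metrics["applied"],
            "yield": metrics["enrolled"] / metrics["admitted"],
        }
    return results
```

Call it with `open("data/csv/admissions_summary.csv", newline="")` to get a dictionary keyed by year label.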
6. Validate
python scripts/validate.py
Picks 19 random extracted values and confirms each appears verbatim in the
source PDF text. Writes docs/VALIDATION.md with results. The Stanford
benchmark hit a 100% pass rate. New universities should hit 90% or higher; a
lower rate points to a parser quirk to investigate.
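The verbatim check can be sketched as follows. The row shape (`year`, `metric`, `raw_value`) and the function name `validate_sample` are assumptions for illustration; validate.py may store and sample values differently.

```python
import random


def validate_sample(rows, source_text_by_year, sample_size=19, seed=None):
    """Check that sampled values appear verbatim in their year's source text.

    rows: list of dicts with "year", "metric", and "raw_value" keys, where
    raw_value is the string exactly as extracted (e.g. "27,381").
    Returns (pass_rate, per_row_results).
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    results = []
    for row in sample:
        text = source_text_by_year.get(row["year"], "")
        results.append({**row, "passed": row["raw_value"] in text})
    passed = sum(r["passed"] for r in results)
    rate = passed / len(results) if results else 0.0
    return rate, results
```

A substring match only shows that the value exists somewhere in the source text, not that it came from the right field, so failing rows are more informative than passing ones.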
7. Wrap as a repo (optional)
git init -b main
git add -A
git commit -m "Initial release: <university> CDS"
gh repo create <university>cds --public --source=. --push
What this skill does NOT extract automatically
The CDS contains some multi-dimensional tables that don't fit a flat
(year, metric, value) shape. They are preserved verbatim in
data/json/<year_label>/blocks.json under the relevant field code, but adding
them to the curated CSVs requires a custom extractor:
- B2 — Enrollment by race/ethnicity (categories changed in 2010 and 2024)
- C7 — Importance of admissions factors (categorical: Very Important / …)
- H1 — Total dollars awarded × aid type × need-based vs. non-need-based
- H2 / H2A — Lettered subrows A–N for aid recipients
- I3 — Class size distribution histogram
- J1 — Most common fields of study (CIP-coded)
When the user asks for one of these, work from blocks.json[<section>][<field>]['raw_text']
and follow the extractor pattern in extract_metrics.py.
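As an illustration of a custom extractor, here is a rough sketch for I3. It assumes the raw text keeps each row on one line with its label followed by eight counts, and that the bins follow the usual template order; `parse_i3_row` is a hypothetical name, and real PDFs may wrap or reorder the row.

```python
import re

# I3 bins in column order, as they usually appear on the CDS template.
I3_BINS = ["2-9", "10-19", "20-29", "30-39", "40-49", "50-99", "100+", "Total"]


def parse_i3_row(raw_text, row_label="CLASS SECTIONS"):
    """Return {bin: count} for one I3 row, or None if the row is not found."""
    label = row_label.lower()
    for line in raw_text.splitlines():
        lowered = line.lower()
        if label not in lowered:
            continue
        # Only look at the text after the label, so digits in the label are ignored.
        tail = line[lowered.index(label) + len(label):]
        counts = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", tail)]
        # Require exactly one count per bin; prose lines that mention the label fail this.
        if len(counts) == len(I3_BINS):
            return dict(zip(I3_BINS, counts))
    return None
```

The input would be `blocks["I"]["I3"]["raw_text"]` from a loaded blocks.json; passing `row_label="CLASS SUB-SECTIONS"` reads the second row.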
Adapting to a new university
The four scripts are mostly institution-agnostic — they target standard CDS field codes. The only files you typically need to change are:
- manifest.py: list the institution's PDFs.
- Section headers in parse_blocks.py: only if the institution uses unusual
  capitalization or wording (e.g. "ENROLLMENT" vs. "Enrollment").
Spot-check the first run with validate.py. If the pass rate falls below the
90% target from step 6, look at the failing rows and tighten the extractor
regexes in extract_metrics.py.
See also
- The reference Stanford repo: https://github.com/mzrascoff/stanfordcds
- Common Data Set initiative: https://www.commondataset.org/
- IPEDS for institution-level data outside the CDS: https://nces.ed.gov/ipeds/