name: wrds version: 1.0 description: Use when "query WRDS", "pull SEC filings", "access Compustat/CRSP/ExecuComp/Capital IQ", "Form 4 insider data", "13F institutional ownership (Thomson)", "13D/13G blockholders", "ISS governance/compensation/voting/directors", "proxy advisor recommendations", "TAQ intraday/NBBO", "SDC M&A or new issues", "DealScan syndicated loans", "PitchBook PE/VC deals", "FISD corporate bonds", "municipal bonds / muni trades / MSRB RTRS / SDC municipals", "Form D/ADV", "fund formation", "FJC court data", "linking datasets / join keys (gvkey-permno via CCM, cik-gvkey via wciklink, DealScan-Compustat)", or any WRDS PostgreSQL query or SAS ETL on the WRDS grid (qsub/qsas/SGE). user-invocable: false
Contents
- WRDS Login Node Enforcement
- Query Enforcement
- SAS ETL Enforcement
- Quick Reference: Table Names
- Connection
- Critical Filters
- Parameterized Queries
- Additional Resources
WRDS Login Node Enforcement
IRON LAW: NEVER RUN COMPUTE ON THE WRDS LOGIN NODE
ALWAYS write an SGE submission script and submit via qsub. No exceptions.
ssh wrds 'cat files.tsv | ./parser > output.tsv'→ WRONG. Use qsub.ssh wrds 'nohup ./process &'→ WRONG. Still the login node. Use qsub.ssh wrds 'python3 bulk_process.py'→ WRONG. Use qsub.qsub -t 1-20 submit.sh→ CORRECT.
The login node is for: qsub, qstat, qdel, scp, ls, head, short psql queries.
See references/constraints/wrds-sge-enforcement.md for the full pattern and existing examples (quorum parser, state-of-incorp parser, SAS pipeline).
Running compute on the login node is NOT HELPFUL — it gets the user's account flagged, the job killed, and the work lost. You run on the login node because qsub feels like overhead. The overhead is 5 minutes of script writing. The downside is account suspension and a rerun from scratch.
Login Node & Infrastructure Facts
- Tests go through the scheduler too:
qsub -t 1-1 submit.sh. The login-node "quick test" is the run that flags the account — one file becomes 100K when the command changes, and 173K filings over NFS is not 30 seconds. - The quorum parser does not run on the login node and never did — it runs via
submit_quorum.sh. Citing it as login-node precedent is an unverified claim presented as fact. - The
wrds_clean_filingspath convention iscik_int.zfill(10)[:6]/{cik_int}/{accession}.txt(seereferences/edgar.md). Hand-rolled path logic gets this wrong. scan_coversprofiles handle header extraction, body parsing, and custom extractors (Customfield type) — "this parser is different enough to need its own binary" has not yet been true once.
Red Flags — STOP Immediately If You're About To:
- Write
ssh wrds '... | ./binary > output'→ STOP. That's login-node compute. Write a submit script. - Write
ssh wrds 'nohup ... &'→ STOP. nohup doesn't change the node. Use qsub. - Write
ssh wrds 'python3 ...'for anything that reads >10 files → STOP. Use qsub. - Skip reading
references/edgar.mdbefore building a new WRDS file parser → STOP. The path conventions, SGE patterns, and existing parsers are already documented. Read them first. - Create a new standalone Go binary for EDGAR extraction → STOP.
scripts/scan_covers/is a generic profile-based framework. Add aprofiles_*.gofile, not a new binary. The framework handles SGE sharding, path construction, concurrency, and form-type filtering. - Build a new Go/Python parser without checking
scripts/scan_covers/→ STOP. This framework exists precisely so you don't reinvent extraction infrastructure. Every standalone parser is technical debt that should have been a profile.
IRON LAW: USE SCAN_COVERS, NOT STANDALONE BINARIES
- Read
scripts/scan_covers/— generic profile-based Go framework with SGE, concurrency, path handling - Add a
profiles_*.gofile — not a standalone binary. The Profile struct supports pattern-based fields AND custom extractors (setFullBody: truefor body-text searches like prospectus 485 filings — seeprofiles_proxy_advisors.go) - Read
references/edgar.md— path conventions, existing profiles, SGE submission patterns
Building a standalone parser when scan_covers exists is NOT HELPFUL — it reinvents infrastructure that already handles SGE sharding, NFS concurrency, path construction, form-type filtering, and error handling. You built a 300-line standalone Go binary, ran it on the login node, got the path convention wrong, and spent 5 iterations fixing it. Adding a 60-line profile to scan_covers would have worked on the first try.
Every standalone EDGAR parser is technical debt. The scan_covers framework exists to eliminate this class of mistake.
WRDS Data Access
WRDS (Wharton Research Data Services) provides academic research data via PostgreSQL at wrds-pgdata.wharton.upenn.edu:9737.
Query Enforcement
IRON LAW: NO QUERY WITHOUT FILTER VALIDATION FIRST
Before executing ANY WRDS query, you MUST:
- IDENTIFY what filters are required for this dataset
- VALIDATE the query includes those filters
- VERIFY parameterized queries (never string formatting)
- EXECUTE the query
- INSPECT a sample of results before claiming success
This is not negotiable. Skipping sample inspection is NOT HELPFUL — the user builds analysis on data with undetected quality problems.
Red Flags
- Running a query without checking the Critical Filters section → standard filters apply even when the user doesn't mention them, and even for test queries.
- Pulling everything to filter in pandas later → filter at the database level first.
- Guessing a table name from the request → check the Quick Reference section for exact names.
- Claiming success before sample inspection → inspect
.head()/.sample()first; query success ≠ data quality.
Query Validation Checklist
Before EVERY query execution:
For Compustat queries (comp.funda, comp.fundq):
- Includes
indfmt = 'INDL' - Includes
datafmt = 'STD' - Includes
popsrc = 'D' - Includes
consol = 'C' - Uses parameterized queries for variables
- Date range is explicitly specified
For CRSP v2 queries (crsp.dsf_v2, crsp.msf_v2):
- Post-query filter:
sharetype == 'NS' - Post-query filter:
securitytype == 'EQTY' - Post-query filter:
securitysubtype == 'COM' - Post-query filter:
usincflg == 'Y' - Post-query filter:
issuertype.isin(['ACOR', 'CORP']) - Uses parameterized queries
For Form 4 queries (tr_insiders.table1):
- Transaction type filter specified (acqdisp)
- Transaction codes specified (trancode)
- Date range is explicitly specified
- Uses parameterized queries
For ALL queries:
- Sample inspection with
.head()or.sample()BEFORE claiming success - Row count verification (is result size reasonable?)
- NULL value check on critical columns
- Date range validation (does min/max match expectations?)
SAS ETL Enforcement
IRON LAW: NO SAS CODE WITHOUT PERFORMANCE VALIDATION FIRST
- MERGE STRATEGY — Is hash or sort-merge appropriate? Justify the choice.
- WHERE CLAUSES — Are all date/string filters index-friendly? No functions on indexed columns.
- PARALLELISM — Can this job run as an SGE array? Year-by-year is always parallelizable.
- SQL OPTIMIZATION — For PROC SQL: pass-through opportunity? Indexed join columns?
Writing SAS code that forces full table scans when indexes exist is NOT HELPFUL — the user's job runs 100x slower than necessary and may timeout.
SAS Code Validation Checklist
Before EVERY SAS program execution:
For merges/joins:
- Small lookup + large fact table → hash object (not
PROC SORT+DATAmerge) - Hash uses
defineKey/defineData/defineDonepattern correctly -
h.output()uses double quotes for macro resolution (not single quotes) -
call missing()initializes hash data variables for non-matches - Both tables >50M rows → sort-merge is justified (document why)
For WHERE clauses (CRITICAL):
- NO
year(date),month(date),datepart(dt)wrapping indexed columns - Date filters use
BETWEEN "01jan&year."d AND "31dec&year."drange pattern - String filters avoid
upcase(),substr()on indexed columns - Compound date filters collapsed to single range (not
year() = X AND quarter() = Y)
For batch processing:
- Multi-year jobs use SGE array (
#$ -t start-end) not sequential loop - Year passed via
-sysparm(not-setor%sysget) - Per-year log files (not single shared log)
- Memory allocation appropriate for workload (
#$ -l m_mem_free=4Gminimum) - Single-year benchmark run completed before full array submission
For PROC SQL:
- Join columns are not wrapped in functions
-
calculatedkeyword used for computed column references in HAVING - Pass-through SQL considered for direct WRDS PostgreSQL queries
- No redundant subqueries that could be hash lookups
For macros:
- Macro variables terminated with period (
&year.not&year) - Double quotes used where macro resolution is needed
-
options mprint mlogic symbolgenused during development
SAS Performance Facts
- Hash lookup joins are ~10x faster than
PROC SORT+MERGEand need no sorting; PROC SQL still sorts for joins. The hash is 5 extra lines — choosing sort-merge for a lookup join makes the user's job slower for your convenience. year(date)(or any function) on an indexed column forces a full table scan over millions of rows;BETWEENwith date literals uses the index.- Sequential multi-year jobs run ~18x slower than the SGE array (18 years × 3 minutes = 54 minutes sequential vs 3 minutes parallel) — "I'll parallelize later" is anti-efficient on its own terms.
- Single quotes in
h.output(dataset: '...')block macro resolution — the output dataset name comes out wrong. Always double quotes. %sysgetis unreliable under SGE — it may return blank silently. Pass the year via-sysparm+&sysparm..
SAS Red Flags - STOP Immediately If You're About To:
- Write
where year(date) =anything → STOP. UseBETWEENwith date literals. - Write
proc sort; data; mergefor a lookup join → STOP. Use hash object. - Write a
%do year = start %to endloop → STOP. Use SGE array job. - Use single quotes in
h.output(dataset: '...')→ STOP. Use double quotes. - Submit a full array job without testing one year first → STOP. Benchmark first.
- Use
-setor%sysgetfor SGE task parameters → STOP. Use-sysparm.
SAS Reference
See references/sas-etl.md for complete patterns:
- Hash object merge (basic, multidata, accumulator)
- Index-friendly WHERE clause quick reference table
- SGE array job templates with memory and logging
- PROC SQL pass-through and optimization
- Macro quoting and debugging
Quick Reference: Table Names
| Dataset | Schema | Key Tables |
|---|---|---|
| Compustat | comp |
company, funda, fundq, secd |
| ExecuComp | comp_execucomp |
anncomp |
| CRSP | crsp |
dsf, msf, stocknames, ccmxpf_linkhist |
| CRSP v2 | crsp |
dsf_v2, msf_v2, stocknames_v2 |
| Form 4 Insiders | tr_insiders |
table1, header, company |
| ISS Incentive Lab | iss_incentive_lab |
comppeer, sumcomp, participantfy |
| Capital IQ | ciq |
wrds_compensation |
| IBES | tr_ibes |
det_epsus, statsum_epsus |
| Form D / Reg D | wrdssec |
wrds_vc_formd (parsed, 2000–2020); index: wrdssec_all.forms (all CIKs) or wrds_forms (filer only) — default to forms, see references/wrds-forms-tables.md |
| SEC EDGAR | wrdssec_all |
forms (raw index, all CIKs per filing — default), wrds_forms (filer-only view), wciklink_cusip |
| SEC Search | wrds_sec_search |
filing_view, registrant |
| EDGAR | edgar |
filings, filing_docs |
| Fama-French | ff |
factors_monthly, factors_daily |
| LSEG/Datastream | tr_ds |
ds2constmth, ds2indexlist |
| FJC (Federal Judicial Center) | fjc |
civil, criminal, bankruptcy, appeals |
| FJC Linking | fjc_linking |
wrds_civil_link, wrds_criminal_link |
| SDC New Issues (IPO/SEO/Debt) | tr_sdc_ni |
wrds_ni_details — equity + debt offerings |
| SDC Mergers & Acquisitions | tr_sdc_ma |
wrds_ma_details — M&A transactions |
| TAQ Legacy | taq |
mast_YYYY, wrds_iid_YYYY — second-level (1993–2006) |
| TAQ Millisecond | taqmsec |
mastm_YYYY, wrds_iid_YYYY, ctm_YYYYMM, complete_nbbo_YYYYMMDD |
| Thomson S12 (Mutual Fund Holdings) | tfn (SAS) / tr_mutualfunds (PG) |
s12 — 13F/N-CSR fund holdings |
| Thomson S34 (13-F Institutional) | tfn (SAS) / tr_13f (PG) |
s34 — 13-F institutional holdings |
| FISD / Mergent (Corporate Bonds) | fisd_fisd |
fisd_mergedissue, fisd_mergedissuer — corporate/agency/Treasury; NOT the muni source (issuer_type='M' munis are incidental) |
| Municipal trades (MSRB RTRS) | msrb |
msrb (trades + inline CUSIP master: coupon, maturity), msrb_lookup; also msrb_all, msrbsamp. Primary muni source. See references/muni-bonds.md |
| Municipal new issues (SDC) | tr_sdc_municipals |
deal-level: ratings, GO/rev, bank-qualified, callable, size, sector — but SELECT is permission-denied on this subscription (not licensed); msrb is the only readable muni schema. See references/muni-bonds.md |
| PitchBook | pitchbk_companies_deals, pitchbk_investors_funds_lps, pitchbk_fund_returns |
deal, company, fund, wrds_fund_returns — dealsize in USD millions |
Connection
Initialize PostgreSQL connection to WRDS:
import psycopg2
conn = psycopg2.connect(
host='wrds-pgdata.wharton.upenn.edu',
port=9737,
database='wrds',
sslmode='require'
# Credentials from ~/.pgpass
)
Configure authentication via ~/.pgpass with chmod 600:
wrds-pgdata.wharton.upenn.edu:9737:wrds:USERNAME:PASSWORD
Connect via SSH tunnel:
ssh wrds
This uses ~/.ssh/wrds_rsa for authentication.
Critical Filters
Compustat Standard Filters
Always include for clean fundamental data:
WHERE indfmt = 'INDL'
AND datafmt = 'STD'
AND popsrc = 'D'
AND consol = 'C'
CRSP v2 Common Stock Filter
Equivalent to legacy shrcd IN (10, 11):
df = df.loc[
(df.sharetype == 'NS') &
(df.securitytype == 'EQTY') &
(df.securitysubtype == 'COM') &
(df.usincflg == 'Y') &
(df.issuertype.isin(['ACOR', 'CORP']))
]
Form 4 Transaction Types
WHERE acqdisp = 'D' -- Dispositions
AND trancode IN ('S', 'D', 'G', 'F') -- Sales, Dispositions, Gifts, Tax
Parameterized Queries
Always use parameterized queries (never string formatting):
Use scalar parameter binding for single values:
cursor.execute("""
SELECT gvkey, conm FROM comp.company WHERE gvkey = %s
""", (gvkey,))
Use ANY() for list parameters:
cursor.execute("""
SELECT * FROM comp.funda WHERE gvkey = ANY(%s)
""", (gvkey_list,))
Additional Resources
Reference Files
Detailed query patterns and table documentation:
references/compustat.md- Compustat tables, ExecuComp, financial variablesreferences/crsp.md- CRSP stock data, CCM linking, v2 formatreferences/insider-form4.md- Thomson Reuters Form 4, rolecodes, insider typesreferences/iss-compensation.md- ISS Incentive Lab, peer companies, compensationreferences/formd.md- Form D / Reg D (canonical): two sources (WRDSwrds_vc_formd+ SEC EDGAR TSV/XML), grain & keys, denormalization gotcha, exemption + industry codes, post-2020 gap, validated benchmarksreferences/edgar.md- SEC EDGAR filings, URL construction, DCN vs accession numbersreferences/connection.md- Connection pooling, caching, error handlingreferences/taq.md- TAQ: master files, IID, raw tick processing (NBBO, VWAP, closing auctions), CRSP–TAQ merge, era transition (legacy vs millisecond)references/sas-etl.md- SAS hash objects, index-friendly WHERE, SGE array jobs, PROC SQL optimizationreferences/postgres-vs-sas.md- Decision guide: when to use PostgreSQL vs SAS for WRDS ETL (benchmarks, constraints, hybrid pattern)references/fjc.md- FJC Integrated Database: civil/criminal case data, NOS codes, securities litigation queries, firm linkingreferences/sdc-issuances.md- SDC New Issues: IPOs, SEOs, 144A equity, debt offerings — schema discovery, cleaning filters, CRSP/Compustat linkingreferences/fisd-bonds.md- FISD/Mergent: corporate bond issuances, IG vs HY, 144A vs registered, rating classification, TRACE linkingreferences/sdc-ma.md- SDC M&A: deal counts, PE/LBO vs strategic buyer, deal status codes, public vs private targetreferences/fund-formation.md- Fund formation: Form D (pooled investment funds), EDGAR N-2 (closed-end fund IPOs), Form ADV (RIA registrations)references/pitchbook.md- PitchBook: schema architecture, dealsize/fundsize in USD millions, dealdate outliers, CIK crosswalk, fund performance (wrds_fund_returns), PE/VC/fund formation patternsreferences/proxy-advisors.md- Proxy-advisor customer identification: 485BPOS/485APOS body scan for ISS/Glass Lewis/Egan-Jones name variants; CRSP MFDB lift to mgmt_cd × year; validates against chongshu published CSV
Example Files
Working code from real projects:
examples/form4_disposals.py- Insider trading analysis (from SVB project)examples/wrds_connector.py- Connection pooling patternexamples/formd_regd.ipynb- Form D / Reg D: dedup validation, SEC TSV download, exemption trend chartsexamples/sdc_issuances_eda.ipynb- SDC New Issues: annual IPO/SEO/debt counts, 144A share, IG vs HY breakdownexamples/sdc_ma_eda.ipynb- SDC M&A: annual deal counts, PE/LBO vs strategic, public vs private target trendsexamples/fund_formation_eda.ipynb- Fund formation: Form D 3C.1/3C.7 counts, EDGAR N-2 closed-end fund IPOs, Form ADV RIA registrationsexamples/pitchbook_eda.ipynb- PitchBook: PE deal activity, VC rounds by stage, fund formation by vintage, IRR/TVPI by strategyexamples/voting_ownership_pipeline/- Self-contained hybrid SAS+Python pipeline: ISS votes, 13-F inst. ownership, MF holdings via MFLINKS, merged panel. Canonical example of PostgreSQL vs SAS decision-making on WRDS. SeeREADME.mdfor architecture and usage.
Scripts
scripts/test_connection.py- Validate WRDS connectivity
Local Sample Notebooks
WRDS-provided samples at ~/resources/wrds-code-samples/:
ResearchApps/CCM2025.ipynb- Modern CRSP-Compustat mergeResearchApps/ff3_crspCIZ.ipynb- Fama-French factor constructioncomp/sas/execcomp_ceo_screen.sas- ExecuComp patterns
Date Awareness
When querying historical data, leverage current date context for dynamic range calculations.
Current date is automatically available via datetime.now(). Apply this to:
- Data range validation (e.g., "get data for last 5 years")
- Fiscal year calculations
- Event study windows
Implement dynamic date ranges in queries:
from datetime import datetime, timedelta
# Query last 5 years of data
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)
query = """
SELECT * FROM comp.funda
WHERE datadate BETWEEN %s AND %s
"""
df = pd.read_sql(query, conn, params=(start_date, end_date))
Always incorporate current date awareness in date-dependent queries to ensure results remain fresh across time.