name: ma-fulltext-management
description: Collect and manage full-text PDFs for included studies, track provenance, and prepare documents for extraction. Use when moving from screening to data extraction.
Ma Fulltext Management
Overview
Gather full texts, validate completeness, and prepare a clean manifest.
Inputs
- `03_screening/round-01/included.bib`
Outputs
- `04_fulltext/manifest.csv`
- `04_fulltext/unpaywall_results.csv` (optional OA lookup)
- `04_fulltext/fulltext_decisions.csv` (Stage 04b: full-text eligibility screening)
- `04_fulltext/ft_agreement.md` (Stage 04b: full-text inter-rater agreement)
- `04_fulltext/README.md`
- `04_fulltext/` PDF files
- `04_fulltext/previews/` (optional PDF image previews)
Workflow (Web-First Hybrid — Default)
⚠️ Default approach: Web-based extraction first, PDF retrieval only for gaps.
Phase 1: Web-Based Data Gathering (Default — No PDFs Needed)
- Create `04_fulltext/` and build `manifest.csv` with `record_id`, DOI, PMID, title, and access notes.
  - Read from `03_screening/round-01/included.bib`
  - Use `references/manifest-template.csv` as template
  - Write to `04_fulltext/manifest.csv` (columns: record_id, DOI, PMID, title, access_method, confidence_score)
- Automatically run web extraction for all included studies using Claude Code's `WebSearch` and `WebFetch` tools:
  - Query PubMed structured abstracts (`https://pubmed.ncbi.nlm.nih.gov/<pmid>/`)
  - Query ClinicalTrials.gov registries (`https://clinicaltrials.gov/study/<nct_id>`)
  - Search Europe PMC, journal supplementary materials
- Record confidence scores per field (see `references/web-extraction.md` for scoring).
  - Update `04_fulltext/manifest.csv` (confidence_score column)
- Flag studies with confidence < 0.7 for primary outcome fields → these need PDFs.
  - Mark in `04_fulltext/manifest.csv` (needs_pdf = TRUE)
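The flagging rule at the end of Phase 1 can be sketched in a few lines of Python. This is an illustration only, not the skill's own implementation. It assumes `confidence_score` holds one number per row (the workflow describes per-field scores, so a real manifest may need the minimum over the primary outcome fields), and it treats a blank score as low confidence. The helper name `flag_low_confidence` and the sample record IDs are hypothetical.

```python
import csv
import io

CONFIDENCE_THRESHOLD = 0.7  # from the workflow: below this, a PDF is needed


def flag_low_confidence(rows, threshold=CONFIDENCE_THRESHOLD):
    """Return copies of manifest rows with a needs_pdf column added.

    Assumption: a blank or unparseable confidence_score counts as 0.0,
    so the study is flagged for PDF retrieval.
    """
    flagged = []
    for row in rows:
        try:
            score = float(row.get("confidence_score", ""))
        except (TypeError, ValueError):
            score = 0.0
        updated = dict(row)
        updated["needs_pdf"] = "TRUE" if score < threshold else "FALSE"
        flagged.append(updated)
    return flagged


# In-memory stand-in for 04_fulltext/manifest.csv (made-up rows).
sample = io.StringIO(
    "record_id,confidence_score\n"
    "rec001,0.92\n"
    "rec002,0.55\n"
    "rec003,\n"
)
result = flag_low_confidence(csv.DictReader(sample))
needs_pdf = [r["record_id"] for r in result if r["needs_pdf"] == "TRUE"]
print(needs_pdf)  # prints ['rec002', 'rec003']
```

A score of exactly 0.7 is not flagged here, because the rule is written as "confidence < 0.7".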
Phase 2: Targeted PDF Retrieval (Only for Low-Confidence Studies)
- For flagged studies only (~20-30%), query Unpaywall for OA links using `scripts/unpaywall_fetch.py` via `uv run`.
  - Use `scripts/unpaywall_fetch.py`
  - Read from `04_fulltext/manifest.csv` (needs_pdf = TRUE rows)
  - Write to `04_fulltext/unpaywall_results.csv`
- Download available PDFs with `scripts/download_oa_pdfs.py`.
  - Use `scripts/download_oa_pdfs.py`
  - Write to `04_fulltext/<record_id>.pdf`
- Optionally render PDF previews with `scripts/render_pdf_previews.py` for visual QA.
  - Use `scripts/render_pdf_previews.py`
  - Write to `04_fulltext/previews/<record_id>_page1.png`
- Request user to manually deposit any remaining PDFs that cannot be auto-retrieved.
  - Update `04_fulltext/manifest.csv` (access_method = "manual")
- Run OCR only when needed and preserve original files.
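For orientation, an Unpaywall lookup amounts to one HTTP GET per DOI against the public v2 REST API. The sketch below is not the contents of `scripts/unpaywall_fetch.py`. It only shows the request URL shape and how a PDF link might be read from a parsed JSON response. The helper names, the DOI, the email address, and the sample responses are all made up for illustration, and no network call is made.

```python
from urllib.parse import quote


def unpaywall_url(doi, email):
    """Build the Unpaywall v2 lookup URL for one DOI.

    The email is required by the API; in this skill it would come
    from UNPAYWALL_EMAIL in .env.
    """
    return f"https://api.unpaywall.org/v2/{quote(doi, safe='/')}?email={quote(email)}"


def best_pdf_url(response):
    """Return the best open-access PDF URL from a parsed response, or None.

    best_oa_location is null in the JSON when no OA copy is known.
    """
    location = response.get("best_oa_location") or {}
    return location.get("url_for_pdf") or None


# Illustrative values only.
print(unpaywall_url("10.1000/xyz123", "me@example.org"))

open_access = {
    "is_oa": True,
    "best_oa_location": {"url_for_pdf": "https://example.org/paper.pdf"},
}
closed = {"is_oa": False, "best_oa_location": None}

print(best_pdf_url(open_access))  # prints https://example.org/paper.pdf
print(best_pdf_url(closed))       # prints None
```

Records for which `best_pdf_url` returns `None` are the ones that would fall through to manual deposit.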
Why Web-First?
- Speed: 50-70% faster than PDF-only (2-3h vs 8-12h)
- No institutional access required for Phase 1
- 90-95% completeness with hybrid approach
- PDFs are only needed for ~20-30% of studies
Resources
- `references/manifest-template.csv` provides a manifest header.
- `scripts/unpaywall_fetch.py` queries Unpaywall for open-access links.
- `scripts/analyze_unpaywall.py` analyzes Unpaywall results and generates summary statistics.
- `scripts/download_oa_pdfs.py` downloads open-access PDFs automatically from Unpaywall URLs.
- `scripts/render_pdf_previews.py` renders PDF pages to PNG previews.

Note: Unpaywall requires `UNPAYWALL_EMAIL` in `.env`.

Note: PDF previews require `pdftoppm` or `mutool` installed.
Stage 04b: Full-Text Eligibility Screening (PRISMA Item 16)
⚠️ MANDATORY — PRISMA 2020 requires reporting the number of full-text articles excluded with reasons.
After completing full-text retrieval (Phases 1-2 above), re-screen all included studies against the full text to confirm eligibility. This step catches issues not visible at the abstract stage (e.g., wrong population subgroup, insufficient sample size, protocol-only publications).
Workflow
1. Run AI full-text screening (Reviewer 1):

       uv run tooling/python/ai_screen.py --project <project-name> --stage fulltext --reviewer 1

2. Run AI full-text screening (Reviewer 2) for dual review:

       uv run tooling/python/ai_screen.py --project <project-name> --stage fulltext --reviewer 2

3. Compute full-text inter-rater agreement (Cohen's kappa):

       uv run ma-screening-quality/scripts/dual_review_agreement.py \
         --file projects/<project-name>/04_fulltext/fulltext_decisions.csv \
         --col-a FT_Reviewer1_Decision --col-b FT_Reviewer2_Decision \
         --out projects/<project-name>/04_fulltext/ft_agreement.md

4. Resolve conflicts (if any): update the `FT_Final_Decision` and `FT_Exclusion_Code` columns.
5. Only studies with `FT_Final_Decision = include` proceed to Stage 05 (data extraction).
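Cohen's kappa, the agreement statistic named in this workflow, corrects raw agreement for chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of matching decisions and p_e is the agreement expected from each reviewer's label frequencies. The sketch below is an independent illustration of that formula, not the code in `dual_review_agreement.py`. The function name and the sample decisions are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length lists of categorical decisions."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)

    # p_o: share of records where both reviewers made the same decision.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # p_e: chance agreement from each reviewer's marginal label counts.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both reviewers used one identical label throughout; kappa is
        # undefined (0/0), and this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Ten made-up records: reviewers agree on 8, each includes 5 overall.
reviewer_1 = ["include"] * 5 + ["exclude"] * 5
reviewer_2 = ["include"] * 4 + ["exclude"] + ["exclude"] * 4 + ["include"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 2))  # prints 0.6
```

Here p_o = 0.8 and p_e = 0.5, so kappa = 0.3 / 0.5 = 0.6, which sits right at a 0.60 acceptance threshold.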
Output Schema (fulltext_decisions.csv)
| Column | Description |
|---|---|
| `record_id` | Matches manifest.csv and screening decisions |
| `title` | Study title |
| `doi` | Digital Object Identifier |
| `pmid` | PubMed ID |
| `FT_Reviewer1_Decision` | include / exclude |
| `FT_Reviewer1_Reason` | Reason with exclusion code reference |
| `FT_Reviewer2_Decision` | include / exclude |
| `FT_Reviewer2_Reason` | Reason with exclusion code reference |
| `FT_Final_Decision` | include / exclude (resolved) |
| `FT_Exclusion_Code` | Exclusion code (P1, S2, etc.) or NONE |
Exclusion Codes
Reuses standard codes from ma-screening-quality/references/screening-labels.md:
P1/P2 (population), I1/I2 (intervention), C1 (comparator), S1-S4 (study design),
O1/O2 (outcomes), T1/T2 (time), L1 (language), D1 (duplicate).
QA Thresholds
- Full-text kappa ≥ 0.60 (same threshold as abstract screening)
- All exclusions must have a documented reason and code
- `FT_Exclusion_Code` feeds directly into PRISMA flow diagram item 16
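Since `FT_Exclusion_Code` drives the "excluded with reasons" counts, a tally per code and a check for uncoded exclusions are both short. A minimal sketch, assuming rows are read from `fulltext_decisions.csv` with the column names in the output schema. The helper names and sample rows are hypothetical.

```python
from collections import Counter


def is_excluded(row):
    """True when the resolved decision for this record is 'exclude'."""
    return (row.get("FT_Final_Decision") or "").strip().lower() == "exclude"


def exclusion_counts(rows):
    """Count excluded records per exclusion code, ignoring included ones."""
    return Counter(row["FT_Exclusion_Code"] for row in rows if is_excluded(row))


def exclusions_missing_code(rows):
    """Return record_ids that are excluded but have no usable code."""
    missing = []
    for row in rows:
        code = (row.get("FT_Exclusion_Code") or "").strip().upper()
        if is_excluded(row) and code in ("", "NONE"):
            missing.append(row["record_id"])
    return missing


# Made-up decisions for illustration.
rows = [
    {"record_id": "rec001", "FT_Final_Decision": "include", "FT_Exclusion_Code": "NONE"},
    {"record_id": "rec002", "FT_Final_Decision": "exclude", "FT_Exclusion_Code": "P1"},
    {"record_id": "rec003", "FT_Final_Decision": "exclude", "FT_Exclusion_Code": "S2"},
    {"record_id": "rec004", "FT_Final_Decision": "exclude", "FT_Exclusion_Code": "P1"},
    {"record_id": "rec005", "FT_Final_Decision": "exclude", "FT_Exclusion_Code": ""},
]

counts = exclusion_counts(rows)
print(counts["P1"], counts["S2"])      # prints 2 1
print(exclusions_missing_code(rows))   # prints ['rec005']
```

Any record returned by `exclusions_missing_code` would need a code assigned before the counts are reported.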
Validation
- Ensure every included record has a matching full-text file or a documented reason for absence.
- Ensure `record_id` continuity with screening decisions.
- Ensure `fulltext_decisions.csv` exists before proceeding to Stage 05.
- Ensure all `FT_Final_Decision` values are resolved (no blanks) before extraction.
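Two of these checks, `record_id` continuity and unresolved final decisions, are easy to automate. The sketch below is one possible shape for such a check, not part of the skill's tooling. It assumes the included record IDs and the decision rows have already been loaded, and the function name and sample data are hypothetical.

```python
def validation_problems(included_ids, decisions):
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    included = set(included_ids)
    decided = {row["record_id"] for row in decisions}

    # Continuity: every included record needs a full-text decision row.
    for record_id in sorted(included - decided):
        problems.append(f"{record_id}: missing from fulltext_decisions.csv")

    # Continuity: no decision row should refer to an unknown record.
    for record_id in sorted(decided - included):
        problems.append(f"{record_id}: not in screening decisions")

    # Resolution: FT_Final_Decision must not be blank.
    for row in decisions:
        if not (row.get("FT_Final_Decision") or "").strip():
            problems.append(f"{row['record_id']}: FT_Final_Decision is blank")

    return problems


# Made-up data for illustration.
included_ids = ["rec001", "rec002", "rec003"]
decisions = [
    {"record_id": "rec001", "FT_Final_Decision": "include"},
    {"record_id": "rec002", "FT_Final_Decision": ""},
]

problems = validation_problems(included_ids, decisions)
for problem in problems:
    print(problem)
# prints:
# rec003: missing from fulltext_decisions.csv
# rec002: FT_Final_Decision is blank
```

Returning a list of messages rather than raising on the first failure lets all gaps be fixed in one pass before Stage 05.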
Pipeline Navigation
| Step | Skill | Stage |
|---|---|---|
| Prev | `/ma-screening-quality` | 03 Screening & Quality |
| 04b | (this skill) | Full-text eligibility screening |
| Next | `/ma-data-extraction` | 05 Data Extraction |
| All | `/ma-end-to-end` | Full pipeline orchestration |