name: literature-review
description: Search and review biological literature using MCP tools (PubMed, arXiv, bioRxiv, Google Scholar) with full-text reading, citation snowballing, and PaperBLAST integration. Use when the user wants to find papers, review existing research on a topic, check what's known about an organism or pathway, or support a hypothesis with citations.
allowed-tools: Bash, Read, Write, WebSearch, Agent, ToolSearch
user-invocable: true
Literature Review Skill
Search, read, and synthesize biological literature relevant to BERDL research. Combines multi-source discovery (PubMed, arXiv, bioRxiv, Google Scholar) with full-text analysis, citation network exploration, and PaperBLAST cross-referencing for literature reviews that go beyond abstract-level summaries.
Prerequisites
Two MCP servers provide literature access. Both are configured in .mcp.json and available to all collaborators.
Primary: PubMed MCP (project-level, HTTP)
The pubmed MCP server (https://pubmed.mcp.claude.com/mcp) provides the richest PubMed access with date filters, pagination, MeSH support, citation networks, and full-text retrieval from PMC. Tools are prefixed mcp__pubmed__.
Search & discovery:
- search_articles: Rich PubMed search with date filters, sorting, MeSH, pagination
- find_related_articles: Citation network (similar papers, PMC links, gene/protein links)
- get_article_metadata: Detailed metadata for specific PMIDs
Full-text & access:
- get_full_text_article: Full text from PMC (~6M open-access articles)
- get_copyright_status: Open access/license info
- convert_article_ids: PMID ↔ PMCID ↔ DOI conversion
Secondary: paper-search-mcp (openags)
The paper-search-mcp from openags/paper-search-mcp provides keyword search across arXiv, bioRxiv, medRxiv, and Google Scholar. It runs via uvx --from paper-search-mcp python -m paper_search_mcp.server — collaborators only need Python 3.10+ and uv.
Search tools:
- search_pubmed: PubMed search (use as fallback; prefer bio-research search_articles)
- search_arxiv: Search arXiv preprints
- search_biorxiv: Search bioRxiv preprints (keyword search; the bio-research bioRxiv plugin only supports category/date browsing)
- search_medrxiv: Search medRxiv preprints
- search_google_scholar: Search Google Scholar
Full-text read tools (extract text from preprint PDFs):
- read_arxiv_paper: Read arXiv paper full text
- read_biorxiv_paper: Read bioRxiv paper full text
- read_medrxiv_paper: Read medRxiv paper full text
Note: download_pubmed and read_pubmed_paper from paper-search-mcp are not supported and return errors. Use the pubmed MCP's get_full_text_article for PubMed/PMC full text instead.
PaperBLAST (BERDL local resource)
The kescience_paperblast database (gene-paper associations and text snippets from PMC full-text mining) links protein sequences to scientific literature. Use berdl_notebook_utils.get_tables/get_table_schema(... return_json=False) for live schemas, and see the kescience_paperblast section of docs/pitfalls.md for non-derivable gotchas. PaperBLAST is always used in deep reviews, and in standard reviews when the research question involves specific genes, proteins, or pathways.
Supported Sources
| Source | Best for | Primary tool |
|---|---|---|
| PubMed | Biomedical, microbiology, genomics — primary for BERDL | mcp__pubmed__search_articles |
| bioRxiv | Recent preprints in biology, genomics, microbiology | paper-search-mcp search_biorxiv |
| arXiv | Computational biology, bioinformatics methods | paper-search-mcp search_arxiv |
| Google Scholar | Broad coverage, catching papers not in other databases | paper-search-mcp search_google_scholar |
| PaperBLAST | Gene/protein-focused literature from PMC text mining | BERDL SQL queries |
Workflow
Step 1: Understand the Research Question
Clarify what the user wants to search for. Ask if needed:
- Specific organism, gene, pathway, or phenotype?
- Time frame (recent papers only, or comprehensive)?
- Does the question involve specific genes or proteins? (triggers PaperBLAST integration)
Select a review depth tier based on user needs:
| Tier | Papers | Full text? | Citation snowball? | PaperBLAST? | When to use |
|---|---|---|---|---|---|
| Quick scan | 5-10 | No | No | No | Ad-hoc questions, quick checks |
| Standard review | 20-30 | Top 10 | Yes | If genes involved | Default for project workflows, /berdl_start hypotheses |
| Deep review | 50+ | Top 20 | Yes | Yes | Systematic/comprehensive needs, grant writing |
Default to quick scan for ad-hoc questions and standard review for project-based work or when invoked via /berdl_start. Use deep review only when explicitly requested or when the question demands comprehensive coverage.
If invoked during the /berdl_start research workflow, the hypothesis provides the search context.
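The tier defaults above can be sketched in Python as a small lookup. This is only an illustration: Tier, TIERS, and choose_tier are hypothetical names that the skill does not define, and the values are copied from the tier table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    """One row of the review-depth table (hypothetical helper type)."""
    name: str
    paper_target: str
    full_text_top_n: int
    snowball: bool


# Values copied from the tier table above.
TIERS = {
    "quick_scan": Tier("quick_scan", "5-10", 0, False),
    "standard": Tier("standard", "20-30", 10, True),
    "deep": Tier("deep", "50+", 20, True),
}


def choose_tier(explicit_request: Optional[str] = None,
                project_context: bool = False) -> Tier:
    """An explicit request wins; otherwise project work gets 'standard'
    and ad-hoc questions get 'quick_scan'."""
    if explicit_request is not None:
        return TIERS[explicit_request]
    return TIERS["standard" if project_context else "quick_scan"]
```

Deep review is reachable here only through explicit_request, which mirrors the rule that it is used when explicitly requested; judging whether a question "demands comprehensive coverage" is left to the caller.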
Step 2: Construct Search Queries
Build search queries using biology-aware strategies:
MeSH Term Expansion
For biological topics, expand to MeSH terms for better PubMed coverage:
| User term | MeSH expansion |
|---|---|
| "pangenome" | "pangenome" OR "pan-genome" OR "core genome" OR "accessory genome" |
| "horizontal gene transfer" | "Gene Transfer, Horizontal"[MeSH] OR "lateral gene transfer" |
| "E. coli" | "Escherichia coli"[MeSH] OR "E. coli" |
| "antibiotic resistance" | "Drug Resistance, Microbial"[MeSH] OR "antimicrobial resistance" |
| "metabolic pathway" | "Metabolic Networks and Pathways"[MeSH] |
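A minimal sketch of how the expansion table above could be applied when assembling a PubMed query. MESH_EXPANSIONS, expand_term, and build_query are hypothetical names introduced for illustration; the dictionary holds only the rows shown in the table, and unknown terms are passed through as quoted phrases.

```python
# Expansion rows copied from the table above; keys are lower-cased user terms.
MESH_EXPANSIONS = {
    "pangenome": ['"pangenome"', '"pan-genome"', '"core genome"', '"accessory genome"'],
    "horizontal gene transfer": ['"Gene Transfer, Horizontal"[MeSH]', '"lateral gene transfer"'],
    "e. coli": ['"Escherichia coli"[MeSH]', '"E. coli"'],
    "antibiotic resistance": ['"Drug Resistance, Microbial"[MeSH]', '"antimicrobial resistance"'],
    "metabolic pathway": ['"Metabolic Networks and Pathways"[MeSH]'],
}


def expand_term(term: str) -> str:
    """Return the OR-group for a known term, or the quoted term itself."""
    cleaned = term.strip()
    variants = MESH_EXPANSIONS.get(cleaned.lower())
    if variants is None:
        return f'"{cleaned}"'
    if len(variants) == 1:
        return variants[0]
    return "(" + " OR ".join(variants) + ")"


def build_query(terms) -> str:
    """AND together the expanded form of each term."""
    return " AND ".join(expand_term(t) for t in terms)
```

For example, build_query(["pangenome", "E. coli"]) joins the two OR-groups from the table with AND.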
Organism Filters (aligned with BERDL's GTDB taxonomy)
When searching for a BERDL species, use both the GTDB name and common variants:
"Escherichia coli" OR "E. coli"
"Staphylococcus aureus" OR "S. aureus" OR "MRSA"
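The organism clauses above can be generated rather than typed by hand. The sketch below assumes the caller supplies the GTDB name and any common variants; organism_clause and with_organism are hypothetical helper names, not part of BERDL or either MCP server.

```python
from typing import Iterable


def organism_clause(gtdb_name: str, variants: Iterable[str] = ()) -> str:
    """Quote the GTDB name plus variants and OR them, dropping
    case-insensitive duplicates while keeping the original order."""
    seen = set()
    unique = []
    for name in [gtdb_name, *variants]:
        key = name.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(name.strip())
    return " OR ".join(f'"{n}"' for n in unique)


def with_organism(topic_query: str, gtdb_name: str,
                  variants: Iterable[str] = ()) -> str:
    """Restrict a topic query to one organism."""
    return f"({topic_query}) AND ({organism_clause(gtdb_name, variants)})"
```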
Functional Annotation Keyword Expansion
When searching for gene functions found in BERDL data, expand:
| BERDL annotation | Search terms |
|---|---|
| COG category J | "translation" AND "ribosomal" |
| COG category V | "defense mechanisms" OR "restriction modification" OR "CRISPR" |
| COG category X | "mobilome" OR "transposon" OR "prophage" OR "mobile genetic element" |
| EC 2.7.1.* | "kinase" AND "phosphorylation" |
| KEGG pathway map00010 | "glycolysis" OR "gluconeogenesis" |
Step 3: Discover and Rank Papers via Subagent
The discovery pipeline (search, deduplicate, rank, citation snowball) runs inside a subagent that returns a compact manifest. This keeps ~15-40K+ tokens of raw search results, abstracts, and citation data out of the main context.
Context budget: The discovery subagent returns ~100-150 tokens per paper in a structured manifest (2-4K total for a standard review) vs. 15-40K+ tokens if search + rank + snowball ran in the main context.
3a: Prepare Inputs
Gather these values from Steps 1-2:
- Research question
- Review tier (quick_scan, standard, deep)
- Search queries (PubMed, preprint, Scholar)
- Organism filters
- Paper count target (5-10 for quick, 20-30 for standard, 50+ for deep)
3b: Spawn Discovery Subagent
Spawn via Agent(subagent_type="general-purpose") with this prompt template (fill [bracketed] values):
You are a literature discovery agent. Search scientific databases, deduplicate, rank by relevance, and return a compact paper manifest.
RESEARCH QUESTION: [research_question]
REVIEW TIER: [quick_scan|standard|deep]
SEARCH QUERIES:
- PubMed: [pubmed_query]
- Preprint: [preprint_query]
- Scholar: [scholar_query]
- Organism filters: [organism_filters]
PAPER COUNT TARGET: [5-10|20-30|50+]
STEP 1 — Load tools via ToolSearch:
- "select:mcp__pubmed__search_articles"
- "select:mcp__paper-search__search_biorxiv"
- "select:mcp__paper-search__search_arxiv"
- "select:mcp__paper-search__search_google_scholar"
- "select:mcp__pubmed__convert_article_ids"
- "select:mcp__pubmed__find_related_articles"
STEP 2 — Search all sources (priority order):
1. PubMed → 2. bioRxiv → 3. arXiv → 4. Google Scholar
Start focused; broaden if <5 results; narrow if >100.
Collect: title, authors, year, DOI, PMID/PMCID, abstract, source.
If a tool fails, note failure and continue with remaining sources.
STEP 3 — Deduplicate by DOI (primary) or PMID.
Use convert_article_ids to resolve PMID↔PMCID↔DOI.
For PubMed papers, check if PMCID exists (record has_pmcid).
Track which sources each paper appeared in.
STEP 4 — Rank by BERDL relevance:
HIGH: BERDL organism overlap, pangenome/comparative genomics, metabolic pathway analyses, environmental genomics with BERDL taxonomic overlap
MEDIUM: Methodology papers, reviews, related organisms/pathways
LOW: Tangential topics, distant organisms
STEP 5 — Citation snowball (SKIP for quick_scan):
For top 10 PMIDs, call find_related_articles (pubmed_pubmed).
Score new papers by same criteria. Deduplicate again.
STEP 6 — Return EXACTLY this format:
DISCOVERY_MANIFEST
TOTAL_FOUND: [N]
TOTAL_AFTER_DEDUP: [N]
TOTAL_RANKED: [N]
SOURCES_SEARCHED: [list]
SOURCES_FAILED: [list or "none"]
SNOWBALL_PERFORMED: [yes|no]
SNOWBALL_NEW_PAPERS: [N]
---PAPERS---
1. TITLE: [title]
AUTHORS: [first author et al.]
YEAR: [year]
PMID: [or "none"]
PMCID: [or "none"]
DOI: [or "none"]
SOURCE: [pubmed|biorxiv|arxiv|scholar]
FOUND_IN: [sources list]
RELEVANCE: [HIGH|MEDIUM|LOW]
HAS_PMCID: [yes|no]
ABSTRACT_SUMMARY: [2-3 sentences focused on the research question]
---END---
Sort: HIGH→MEDIUM→LOW, newest first within each tier.
~100-150 tokens per entry. Do NOT include full abstracts.
3c: Parse the Discovery Manifest
The subagent returns a DISCOVERY_MANIFEST with:
- Summary stats (total found, after dedup, sources searched/failed, snowball stats)
- Per-paper entries: title, authors, year, IDs, relevance tier, abstract summary
- Papers sorted HIGH → MEDIUM → LOW, newest first within each tier
Use this manifest for all downstream steps. If the subagent fails, fall back to running search + rank directly in main context (see Error Handling below).
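One way to parse the manifest is a line-oriented reader keyed on the section markers. The sketch below assumes the subagent followed the format exactly (upper-case KEY: value fields, numbered TITLE lines, ---PAPERS--- and ---END--- markers); parse_manifest is a hypothetical helper, and lines that do not look like a field are treated as continuations of the previous field.

```python
import re
from typing import Dict, List, Optional, Tuple

PAPER_START = re.compile(r"^\d+\.\s*(TITLE:.*)$")   # e.g. "1. TITLE: ..."
FIELD = re.compile(r"^([A-Z_]+):\s*(.*)$")          # e.g. "PMID: none"


def parse_manifest(text: str) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Split a DISCOVERY_MANIFEST into (summary stats, list of paper dicts)."""
    stats: Dict[str, str] = {}
    papers: List[Dict[str, str]] = []
    target: Optional[Dict[str, str]] = stats   # dict currently being filled
    last_key: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "DISCOVERY_MANIFEST":
            continue
        if line == "---PAPERS---":
            target, last_key = None, None      # wait for the first numbered entry
            continue
        if line == "---END---":
            break
        start = PAPER_START.match(line)
        if start:
            target = {}
            papers.append(target)
            line = start.group(1)              # drop the "1. " prefix
        if target is None:
            continue
        field = FIELD.match(line)
        if field:
            last_key = field.group(1)
            target[last_key] = field.group(2)
        elif last_key is not None:
            target[last_key] += " " + line     # wrapped continuation line
    return stats, papers
```

Values are kept as strings (including "none"), so downstream steps can decide how strictly to validate them.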
Step 4: PaperBLAST Cross-Reference via Subagent (Deep tier, or when genes/proteins are involved)
Launch in parallel with Step 3: Steps 3 and 4 are independent — spawn both subagents in a single message (two Agent tool calls) for efficiency.
When the research question involves specific genes, proteins, enzymes, or pathways, a PaperBLAST subagent queries BERDL's kescience_paperblast database and returns a compact gene-literature summary.
Context budget: ~1-3K tokens returned vs. 9.5-40K+ tokens if PaperBLAST SQL ran in the main context.
4a: Extract Gene/Protein Identifiers
From the research question, identify:
- Gene names (e.g., rpoB, dnaA)
- Protein accessions (e.g., NP_*, WP_*)
- Enzyme EC numbers
- Pathway identifiers
4b: Spawn PaperBLAST Subagent
Spawn via Agent(subagent_type="general-purpose") with this prompt template (fill [bracketed] values):
You are a PaperBLAST cross-reference agent. Query BERDL's PaperBLAST database for gene/protein literature and return a compact summary.
RESEARCH QUESTION: [research_question]
GENE/PROTEIN IDENTIFIERS: [list]
ORGANISMS OF INTEREST: [organisms]
SQL:
Use the active BERDL Spark session and run bounded native Spark SQL:
spark.sql("<SQL>")
QUERIES (for each identifier):
1. Gene lookup: SELECT geneId, organism, desc FROM kescience.paperblast.gene WHERE desc LIKE '%[name]%' OR geneId='[accession]' LIMIT 20
2. Gene-paper: SELECT geneId, title, journal, CAST(year AS INT) as year, pmId, doi FROM kescience.paperblast.genepaper WHERE geneId='[geneId]' ORDER BY CAST(year AS INT) DESC LIMIT 20
3. Snippets (top 5 genes): SELECT snippet, pmId, pmcId FROM kescience.paperblast.snippet WHERE geneId='[geneId]' LIMIT 10
4. Curated: SELECT cg.protId, cg.name, cg.desc, cg.organism, cg.comment, cp.pmId FROM kescience.paperblast.curatedgene cg JOIN kescience.paperblast.curatedpaper cp ON cg.db=cp.db AND cg.protId=cp.protId WHERE cg.name LIKE '%[name]%' LIMIT 20
5. GeneRIF: SELECT comment, pmId FROM kescience.paperblast.generif WHERE geneId='[geneId]' LIMIT 10
IMPORTANT: year is STRING — always CAST(year AS INT).
Return EXACTLY:
PAPERBLAST_SUMMARY
GENES_SEARCHED: [N]
GENES_FOUND: [N]
TOTAL_PAPERS: [N unique]
TOTAL_SNIPPETS: [N]
---GENE_RESULTS---
GENE: [name / geneId]
ORGANISM: [organism]
DESCRIPTION: [description]
PAPERS_FOUND: [N]
PAPER: [title] | [journal] | [year] | PMID:[pmid]
CURATED: [db]: [description] | PMID:[pmid]
GENERIF: "[summary]" | PMID:[pmid]
SNIPPET: "[truncated ~100 chars]..." | PMID:[pmid]
---ALL_PMIDS---
[comma-separated unique PMIDs for cross-referencing]
---END---
Truncate snippets to ~100 chars. If auth fails, report error and stop.
4c: Cross-Reference with Discovery Manifest
After both subagents return:
- Extract ---ALL_PMIDS--- from the PaperBLAST summary
- Compare against PMIDs in the discovery manifest (Step 3)
- Categorize:
  - Confirmed: PMIDs found in both (confirms relevance)
  - New from PaperBLAST: PMIDs not in discovery manifest (add to results)
  - Discovery-only: Papers without PaperBLAST gene associations (still relevant for broader context)
If the PaperBLAST subagent fails (auth missing, crash), proceed with discovery results alone (see Error Handling below).
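The three categories above reduce to set operations on PMIDs. The sketch below is an illustration under that assumption; parse_all_pmids and cross_reference are hypothetical helpers, and papers that have no PMID at all (for example preprints) are ignored here and would need separate handling.

```python
from typing import Dict, Iterable, List


def parse_all_pmids(summary: str) -> List[str]:
    """Pull the comma-separated PMIDs between ---ALL_PMIDS--- and ---END---."""
    _, _, tail = summary.partition("---ALL_PMIDS---")
    body, _, _ = tail.partition("---END---")
    return [p.strip() for p in body.replace("\n", ",").split(",") if p.strip()]


def _clean(pmids: Iterable[str]) -> set:
    """Drop blanks and the literal 'none' placeholder used in the manifest."""
    return {p.strip() for p in pmids if p and p.strip().lower() != "none"}


def cross_reference(discovery_pmids: Iterable[str],
                    paperblast_pmids: Iterable[str]) -> Dict[str, List[str]]:
    """Categorize PMIDs as confirmed, new from PaperBLAST, or discovery-only."""
    discovery = _clean(discovery_pmids)
    paperblast = _clean(paperblast_pmids)
    return {
        "confirmed": sorted(discovery & paperblast),
        "new_from_paperblast": sorted(paperblast - discovery),
        "discovery_only": sorted(discovery - paperblast),
    }
```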
Pitfall: PaperBLAST year is stored as a string, so always use CAST(year AS INT) for comparisons or ordering. Gene IDs span multiple namespaces (RefSeq NP_* and WP_*, UniProt, VIMSS); use the seqtoduplicate table for cross-referencing.
Use berdl_notebook_utils.get_table_schema(... return_json=False) for live PaperBLAST table schemas, see the kescience_paperblast section of docs/pitfalls.md for non-derivable gotchas, and .claude/skills/berdl/modules/query-patterns.md for additional query patterns.
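To keep the year cast from being forgotten, the gene-paper query can be assembled in one place. The following is a sketch only: genepaper_query is a hypothetical helper, the table and column names are taken from the prompt template in Step 4b, and the identifier whitelist is an assumption chosen to avoid quoting problems rather than a documented PaperBLAST rule.

```python
import re
from typing import Optional

# Assumed-safe characters for gene identifiers; anything else is rejected
# instead of being escaped, so no quote can reach the SQL string.
SAFE_ID = re.compile(r"^[A-Za-z0-9_.:|\-]+$")


def genepaper_query(gene_id: str, min_year: Optional[int] = None,
                    limit: int = 20) -> str:
    """Build a bounded gene-paper query that always casts year to INT."""
    if not SAFE_ID.match(gene_id):
        raise ValueError(f"unexpected characters in gene id: {gene_id!r}")
    where = [f"geneId = '{gene_id}'"]
    if min_year is not None:
        where.append(f"CAST(year AS INT) >= {int(min_year)}")
    return (
        "SELECT geneId, title, journal, CAST(year AS INT) AS year, pmId, doi "
        "FROM kescience.paperblast.genepaper "
        f"WHERE {' AND '.join(where)} "
        f"ORDER BY CAST(year AS INT) DESC LIMIT {int(limit)}"
    )
```

The returned string would be passed to spark.sql(...) in the active BERDL session; the LIMIT keeps the query bounded as the template requires.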
Step 5: Deep Reading of Key Papers via Subagents (Standard + Deep tiers only)
Full-text papers are 5K-20K+ tokens each. Reading 10 of them directly in the main context would consume 50K-200K tokens, leaving little room for synthesis. Instead, delegate full-text reading to subagents that each return a structured summary (~200-400 tokens).
Context budget: ~200-400 tokens per paper summary vs. 5K-20K+ for raw full text. For 10 papers this reduces context consumption from ~100K tokens to ~3K-5K tokens.
5a: Build the Paper Manifest
From the discovery manifest (Step 3), select the top papers to read in full:
| Field | Description |
|---|---|
| title | Paper title |
| id | PMID, PMCID, arXiv ID, or DOI |
| source | pubmed, arxiv, biorxiv, or medrxiv |
| has_pmcid | Whether a PMCID is available (from the discovery manifest's HAS_PMCID field) |
Standard tier: top 10 papers. Deep tier: top 20 papers.
5b: Spawn Paper-Reader Subagents
For each paper in the manifest, spawn a subagent using the Agent tool with subagent_type: "general-purpose". Launch multiple subagents in parallel — use a single message with multiple Agent tool calls (up to 5 at a time) for efficiency.
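The selection and batching rules can be sketched as follows. FULL_TEXT_LIMIT, select_for_reading, and batches are hypothetical names; the limits come from the tier rule in 5a and the batch size from the "up to 5 at a time" guidance. Actually spawning the subagents is done through the Agent tool and is not shown.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Top-N papers read in full per tier (quick_scan reads none).
FULL_TEXT_LIMIT = {"standard": 10, "deep": 20}


def select_for_reading(ranked_papers: Sequence[T], tier: str) -> List[T]:
    """Take the top papers from an already-ranked manifest."""
    return list(ranked_papers[:FULL_TEXT_LIMIT.get(tier, 0)])


def batches(items: Sequence[T], size: int = 5) -> Iterator[Sequence[T]]:
    """Yield consecutive groups of at most `size` items, one group per message."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With a standard review this gives two messages of five parallel Agent calls each.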
Each subagent receives this prompt template (fill in the bracketed values):
You are a paper-reader agent. Your task is to retrieve and extract structured information from a scientific paper.
RESEARCH QUESTION: [the user's research question]
PAPER: "[title]"
SOURCE: [pubmed|arxiv|biorxiv|medrxiv]
IDENTIFIER: [PMCID, arXiv ID, or DOI]
STEP 1: Load the appropriate MCP tool using ToolSearch:
- For PubMed/PMC papers: ToolSearch query "select:mcp__pubmed__get_full_text_article", then call mcp__pubmed__get_full_text_article with pmcid="[PMCID]"
- For arXiv papers: ToolSearch query "select:mcp__paper-search__read_arxiv_paper", then call mcp__paper-search__read_arxiv_paper with the arXiv ID
- For bioRxiv papers: ToolSearch query "select:mcp__paper-search__read_biorxiv_paper", then call mcp__paper-search__read_biorxiv_paper with the DOI
- For medRxiv papers: ToolSearch query "select:mcp__paper-search__read_medrxiv_paper", then call mcp__paper-search__read_medrxiv_paper with the DOI
STEP 2: Read the full text. Focus on Methods, Results, and Discussion sections.
STEP 3: Return EXACTLY this structured extraction (400 words max):
**[First Author et al. (Year)] — [Title]**
STATUS: FULL_TEXT
### Methods
- [study design, key techniques, sample size, organisms used] (1-3 bullets)
### Key Results
- [findings with specific numbers, effect sizes, p-values] (3-5 bullets, focused on the research question)
### Limitations
- [acknowledged weaknesses] (1-3 bullets)
### BERDL Relevance
- [organisms, genes, pathways, EC numbers, or data types that map to BERDL tables]
### Notable Quotes
- "[verbatim quote]" — [section name]
If the full text is unavailable (tool error, not in PMC, PDF extraction fails), return:
STATUS: ABSTRACT_ONLY
and note the reason. The main agent will use the abstract summary from the discovery manifest instead.
5c: Collect and Tag Results
After all subagents return:
- Collect each structured extraction
- Tag each paper as [FULL TEXT] or [ABSTRACT ONLY] based on the STATUS field
- For ABSTRACT_ONLY papers, use the abstract summary from the discovery manifest for synthesis
- Use the structured extractions (not raw full text) for the summary in Step 6
Error handling:
- Paper not in PMC: Subagent returns STATUS: ABSTRACT_ONLY; use the abstract summary from the discovery manifest
- MCP tool unavailable: Subagent returns ABSTRACT_ONLY with reason; fall back per the Fallback section
- PDF extraction fails: Same ABSTRACT_ONLY fallback
- Subagent timeout/failure: Note as ABSTRACT_ONLY, continue with remaining papers
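The tagging and fallback rules above can be expressed as one small function. This is a sketch under the assumption that a failed or timed-out subagent is represented as None; tag_result is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

STATUS_RE = re.compile(r"^STATUS:\s*(FULL_TEXT|ABSTRACT_ONLY)\s*$", re.MULTILINE)


def tag_result(extraction: Optional[str], abstract_summary: str) -> Tuple[str, str]:
    """Return (tag, text to use for synthesis).

    Anything other than an explicit FULL_TEXT status, including a missing
    reply, degrades to the abstract summary from the discovery manifest.
    """
    if extraction:
        match = STATUS_RE.search(extraction)
        if match and match.group(1) == "FULL_TEXT":
            return "[FULL TEXT]", extraction
    return "[ABSTRACT ONLY]", abstract_summary
```

Treating every non-FULL_TEXT outcome the same way keeps one failed reader from blocking the rest of the review.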
Step 5.5: On-Demand Deep Reading (after presenting the review)
After the review is presented to the user, if they want deeper analysis of a specific paper, spawn a single subagent with an expanded extraction prompt:
- Increase the word limit to 800 words
- Include the user's specific question about the paper
- Add a ### Detailed Analysis section addressing the user's question
- Add a ### Future Directions section (what the authors suggest for follow-up)
- Include more extensive quotes from relevant sections
This keeps the main context clean while allowing drill-down into any paper on demand.
Step 6: Summarize Findings
Group results by theme and present as a structured summary. The depth of summary should match the review tier.
Quick scan — use the basic template below (themes + gaps).
Standard/Deep review — use the extended template with methods comparison, quantitative results, and evidence quality indicators.
## Literature Review: [Topic]
**Review depth**: [Quick scan | Standard review | Deep review]
**Papers found**: [N total] | **Full text read**: [N] | **Abstract only**: [N] | **PaperBLAST additions**: [N]
**Sources searched**: [list sources used]
### Summary
[2-3 sentence overview of what the literature says]
### Key Findings by Theme
#### Theme 1: [e.g., "Pangenome methods"]
- **Author et al. (Year)** — [Key finding]. DOI: [doi] [FULL TEXT | ABSTRACT ONLY]
- **Author et al. (Year)** — [Key finding]. DOI: [doi] [FULL TEXT | ABSTRACT ONLY]
#### Theme 2: [e.g., "Core gene evolution"]
- ...
### Methods Comparison *(Standard + Deep tiers)*
| Study | Organism(s) | Method | Sample size | Key metric |
|---|---|---|---|---|
| Author (Year) | E. coli | Pangenome analysis (Roary) | 500 genomes | Core genes: 2,800 |
| ... | ... | ... | ... | ... |
### Key Quantitative Results *(Standard + Deep tiers)*
| Finding | Value | Study | Evidence quality |
|---|---|---|---|
| Core genome size | 2,800 genes | Author (2024) | Large-scale, peer-reviewed |
| Accessory/total ratio | 0.45 | Author (2023) | Preprint, n=50 |
| ... | ... | ... | ... |
### Evidence Quality Notes *(Standard + Deep tiers)*
- [Note which findings are from peer-reviewed vs. preprint sources]
- [Flag small sample sizes, single-organism studies, or methodological limitations]
- [Note if key claims are supported by multiple independent studies]
### Gaps in Current Knowledge
- [What hasn't been studied yet that BERDL could address]
### PaperBLAST Findings *(if Step 4 was performed)*
- [Gene-specific literature connections found via text mining]
- [PMIDs confirmed by both keyword search and PaperBLAST]
- [New papers found only through PaperBLAST]
### Relevance to BERDL
- [Specific tables/queries that could extend these findings]
- [Which BERDL species overlap with the studies found]
Step 7: Store References
Save structured references to the project directory:
# References
## [Topic or Research Question]
Searched: [date], Sources: [PubMed, bioRxiv, arXiv, Google Scholar, PaperBLAST — list those actually used]
Query: "[search terms used]"
Review depth: [Quick scan | Standard review | Deep review]
### Cited References
1. Author A, Author B. (Year). "Title." *Journal*, Volume(Issue), Pages. DOI: [doi]. PMID: [pmid]
2. ...
### Additional References (not cited but relevant)
1. ...
File location: projects/{project_id}/references.md
If no project context exists, offer to create the file in the current working directory.
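A minimal sketch of formatting one entry and choosing the file location. format_reference and references_path are hypothetical helpers; for brevity the sketch omits the volume, issue, and page fields from the template above, and it only computes the path without writing the file.

```python
from pathlib import Path
from typing import Optional, Sequence


def format_reference(index: int, authors: Sequence[str], year: int, title: str,
                     journal: str, doi: Optional[str] = None,
                     pmid: Optional[str] = None) -> str:
    """Render one numbered entry in the style of the references template."""
    entry = f'{index}. {", ".join(authors)}. ({year}). "{title}." *{journal}*.'
    if doi:
        entry += f" DOI: {doi}."
    if pmid:
        entry += f" PMID: {pmid}"
    return entry


def references_path(project_id: Optional[str] = None, cwd: str = ".") -> Path:
    """projects/{project_id}/references.md, or the working directory
    when there is no project context."""
    if project_id:
        return Path("projects") / project_id / "references.md"
    return Path(cwd) / "references.md"
```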
Step 8: Connect to BERDL (optional)
If the literature review reveals organisms, genes, or pathways present in BERDL:
- Note which BERDL tables contain relevant data
- Suggest specific queries to test literature findings at scale
- Identify discrepancies between published results and BERDL data
- Flag opportunities for novel analysis
Error Handling
Both discovery and PaperBLAST subagents are optimizations, not hard requirements. If a subagent fails, degrade gracefully:
| Failure | Recovery |
|---|---|
| Discovery: single source fails | Subagent proceeds with remaining sources (noted in SOURCES_FAILED) |
| Discovery: all sources fail | Fall back to WebSearch in main context |
| Discovery: subagent crash | Run search + rank + snowball directly in main context (degraded) |
| PaperBLAST: auth missing | Warn user, proceed without PaperBLAST |
| PaperBLAST: gene not found | Normal — note in summary |
| PaperBLAST: subagent crash | Omit PaperBLAST section, proceed with discovery results |
Integration with Other Skills
From hypothesis generation (via /berdl_start)
After generating a hypothesis, /literature-review can be used to:
- Check if the hypothesis has already been tested
- Find supporting or contradicting evidence
- Identify methods used in similar studies
- Discover additional variables to consider
To /berdl
Literature findings can inform BERDL queries:
- Paper mentions specific EC numbers → query eggnog_mapper_annotations
- Paper studies specific species → look up in gtdb_species_clade
- Paper reports gene essentiality → cross-reference with fitness browser
To /submit
The references.md file created by this skill is checked during project submission as an advisory item.
Fallback: WebSearch
If MCP tools are unavailable:
- PubMed MCP unavailable: Fall back to paper-search-mcp's search_pubmed for PubMed search
- paper-search-mcp unavailable: Check that .mcp.json is configured with "command": "uvx", "args": ["--from", "paper-search-mcp", "python", "-m", "paper_search_mcp.server"]. Test: uvx --from paper-search-mcp python -m paper_search_mcp.server
- All MCP unavailable: Use WebSearch to search PubMed: site:pubmed.ncbi.nlm.nih.gov [query]
- Use WebFetch to retrieve paper details from DOIs: https://doi.org/[doi]
- Note in the output that results may be less comprehensive than MCP-based search (no full-text retrieval, no citation snowballing)
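The fallback order and the two URL patterns above can be sketched as plain string helpers. choose_search_backend, pubmed_websearch_query, and doi_url are hypothetical names; how availability is detected is left to the caller.

```python
def choose_search_backend(pubmed_mcp_ok: bool, paper_search_ok: bool) -> str:
    """Pick the PubMed search tool following the fallback order above."""
    if pubmed_mcp_ok:
        return "mcp__pubmed__search_articles"
    if paper_search_ok:
        return "search_pubmed"
    return "WebSearch"


def pubmed_websearch_query(query: str) -> str:
    """Restrict a WebSearch query to PubMed pages."""
    return f"site:pubmed.ncbi.nlm.nih.gov {query.strip()}"


def doi_url(doi: str) -> str:
    """Build the resolver URL for WebFetch, tolerating common DOI prefixes."""
    cleaned = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    return "https://doi.org/" + cleaned
```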
Pitfall Detection
When you encounter errors, unexpected results, retry cycles, performance issues, or data surprises during this task, follow the pitfall-capture protocol. Read .claude/skills/pitfall-capture/SKILL.md and follow its instructions to determine whether the issue should be added to the active project's projects/<id>/memories/pitfalls.md.