ncbi-sequence-fetch - SKILL.md Agent Skill

name: ncbi-sequence-fetch category: science version: "1.0.0" kind: python description: > Retrieve protein and nucleotide sequences from NCBI databases using E-utilities. Supports direct accession lookup, CDS translation, gene+organism search, locus lookup, PubMed-linked sequences, patent protein extraction, and organism+length fallback search. Use when you need to fetch biological sequences by accession, gene name, locus tag, PubMed ID, or patent number.

NCBI Sequence Fetch

Prerequisites

uv: Read the uv skill and follow its Setup instructions to ensure uv is installed and on PATH.
User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://www.ncbi.nlm.nih.gov/ and https://www.ncbi.nlm.nih.gov/home/about/policies/, then (2) create the file recording the notification text and timestamp.
.env file: Make sure the .env file exists in your home directory. Create one if it does not exist.
NCBI_API_KEY (optional): Raises the NCBI rate limit from 3 to 10 requests/second. The skill works without it, but a key is recommended if the user plans many queries or encounters a 429 error. The user can obtain one for free by registering at https://www.ncbi.nlm.nih.gov/account/settings/. If the variable is missing from .env, do NOT ask the user to paste it into the chat (this would leak the key into the agent's context). Instead, give the user this command â€” substituting ENV_FILE with the resolved literal path to the .env file:
```
printf "Enter NCBI API key (typing hidden): " && read -s key && echo && echo "NCBI_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
```
The scripts load credentials automatically via dotenv. NEVER read, print, or inspect the .env file or its variables (e.g. no cat, grep, echo, printenv, or os.environ.get on keys). Credentials must stay out of the agent's context.

Core Rules

Use the Wrapper: ALWAYS execute the provided helper scripts to query the database rather than accessing the database directly. The scripts automatically enforce the required rate limit gracefully.
API Key Support: If the user provides an NCBI_API_KEY in their environment, the query speed limits are automatically increased significantly.
Notification: If this skill is used, ensure this is mentioned in the output.

Overview

Wraps NCBI's Entrez E-utilities (efetch, esearch, elink, esummary) for retrieving protein and nucleotide sequences. Provides 10 subcommands covering the full range of sequence retrieval workflows:

fetch-protein â€” Direct protein accession lookup (GenPept, RefSeq)
fetch-nucleotide â€” Direct nucleotide accession lookup
cds-translate â€” Fetch CDS and translate to protein (3 methods)
search â€” Free-text search of any NCBI database
elink â€” Follow cross-database links (PubMedâ†’Protein, etc.)
gene-protein â€” Search protein by gene name + organism
locus-protein â€” Search protein by locus tag + organism
pubmed-proteins â€” Find proteins linked to a PubMed article
patent-search â€” Extract protein sequences from patents
organism-length â€” Last-resort search by organism + exact AA length

Utility Scripts

scripts/ncbi_fetch.py â€” Single script with subcommands.

All subcommands write structured JSON output. Use --output FILE to save to a file, or omit it to print to stdout. A human-readable summary is always printed to stdout.

1. Fetch Protein by Accession

Fetches protein FASTA from NCBI by accession (XP_, NP_, GenPept, etc.)

uv run scripts/ncbi_fetch.py fetch-protein XP_022033624 -o /tmp/result.json
uv run scripts/ncbi_fetch.py fetch-protein NP_001234567 ABC12345.1

2. Fetch Nucleotide by Accession

Fetches nucleotide FASTA from NCBI by accession.

uv run scripts/ncbi_fetch.py fetch-nucleotide MK034466 -o /tmp/result.json

3. CDS Translate

Fetches a CDS/nucleotide accession and translates to protein sequence. Tries three approaches in order: 1. NCBI's pre-translated CDS protein (fasta_cds_aa) 2. GenBank XML CDS annotation translations 3. Raw nucleotide â†’ 6-frame ORF finding

uv run scripts/ncbi_fetch.py cds-translate MK034466 -o /tmp/result.json
uv run scripts/ncbi_fetch.py cds-translate HQ662330 --target-length 1043

If the accession is a genomic record (not mRNA/CDS), the tool will report is_genomic: true so you can fall back to a homology-based approach instead.

4. Search Any Database

Free-text search using Entrez query syntax. Supports all NCBI databases.

# Search protein database
uv run scripts/ncbi_fetch.py search "WRR4B[Gene Name] AND Arabidopsis[Organism]" \
  --database protein --retmax 5 --fetch-sequences

# Search nucleotide database
uv run scripts/ncbi_fetch.py search "Rz2[Gene Name] AND Beta vulgaris[Organism]" \
  --database nuccore --retmax 10

# Search with patent filter
uv run scripts/ncbi_fetch.py search "disease resistance AND Solanum[Organism] AND patent[Properties]" \
  --database protein --fetch-sequences

# Search by sequence length
uv run scripts/ncbi_fetch.py search '"Oryza sativa"[Organism] AND 1043[SLEN]' \
  --database protein --fetch-sequences --retmax 50

5. Cross-Database Links (elink)

Follow NCBI's cross-database links (e.g., PubMed article â†’ linked proteins).

uv run scripts/ncbi_fetch.py elink 24896089 --dbfrom pubmed --db protein \
  --fetch-sequences -o /tmp/linked.json

6. Gene + Organism Search

Searches for protein sequences by gene name and organism. Searches NCBI Protein with [Gene Name] and [Organism] qualifiers.

uv run scripts/ncbi_fetch.py gene-protein WRR4B --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py gene-protein Pikh-2 --organism "Oryza sativa" \
  --target-length 1043 -o /tmp/result.json

7. Locus Tag Search

Searches by locus tag in both NCBI Protein and Nuccore databases. Extracts CDS translations from GenBank XML when direct protein hits aren't available.

uv run scripts/ncbi_fetch.py locus-protein At1g56540 --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py locus-protein Niben101Scf02422g02015.1 \
  --organism "Nicotiana benthamiana" -o /tmp/result.json

8. PubMed-Linked Proteins

Finds protein sequences linked to a PubMed article. Searches NCBI Protein by PMID, follows elink PubMedâ†’Protein, and extracts CDS translations from linked Nuccore records.

uv run scripts/ncbi_fetch.py pubmed-proteins 30692254 --identifier WRR4B
uv run scripts/ncbi_fetch.py pubmed-proteins 24896089 --identifier "K2" \
  -o /tmp/result.json

9. Patent Sequence Search

Two modes:

By patent number â€” fetches all protein sequences from a specific patent: bash uv run scripts/ncbi_fetch.py patent-search --patent-number US10123456 -o /tmp/patent.json

By keywords â€” searches NCBI Protein with patent[Properties] filter: bash uv run scripts/ncbi_fetch.py patent-search --keywords WRR4B Albugo --organism "Arabidopsis thaliana" -o /tmp/patent.json

[!IMPORTANT] Patent convention: In molecular biology patents, SEQ ID NO: 1 is typically the DNA sequence and SEQ ID NO: 2 is the primary protein. Higher SEQ ID NOs are variants or related sequences. Prefer Sequence 2 when selecting the primary protein of interest.

10. Organism + Length Search

Last-resort search when only organism and expected protein length are known. Uses NCBI's [SLEN] filter for exact length matching.

uv run scripts/ncbi_fetch.py organism-length \
  --organism "Arabidopsis thaliana" --length 1048 --retmax 50 \
  -o /tmp/result.json

[!NOTE] This often returns multiple candidates. Use the JSON output headers to identify the correct protein.

Workflow

Standard Sequence Retrieval Cascade

When trying to find a protein sequence, follow this priority order:

Direct accession â€” fetch-protein with GenPept/RefSeq accession
CDS translation â€” cds-translate with nucleotide/CDS accession
PubMed-linked â€” pubmed-proteins with PMID + gene name
Locus lookup â€” locus-protein with locus tag + organism
Gene + organism â€” gene-protein with gene name + organism
Patent search â€” patent-search with patent number or keywords
Organism + length â€” organism-length as last resort

Interpreting Results

All subcommands return JSON with a results array
Each result has sequence (AA string), length, and header/metadata
When multiple results are returned, select by:
- Closest match to expected length (target_length)
- Header relevance (matching gene name, "disease resistance" keywords)
- Source priority (RefSeq > GenPept > patent)

Reference

NCBI E-utilities docs: https://www.ncbi.nlm.nih.gov/books/NBK25499/
Entrez search syntax: https://www.ncbi.nlm.nih.gov/books/NBK49540/
Database list: protein, nuccore, gene, pubmed, pmc, biosample, etc.
Common accession formats:
- XP_ / NP_ â€” NCBI RefSeq protein
- AAA to AZZ + digits â€” GenPept (translated GenBank)
- MK, MN, HQ, etc. + digits â€” GenBank nucleotide
- ENSG, ENST, ENSP â€” Ensembl (use ensembl-database skill instead)
- Q, P, O + digits â€” UniProt (use uniprot-database skill instead)