name: ncbi_sequence
description: NCBI E-utilities for biological sequences. Fetch protein/nucleotide FASTA by accession, run BLAST, translate CDS to protein, and search NCBI Protein by gene+organism. Use when the user provides an NCBI accession (NP_, XP_, NM_, NR_, etc.), asks for a sequence by gene name + species, or needs to translate a coding sequence. Don't use for ClinVar variants (use ncbi_clinvar) or gene metadata lookup (use ncbi_gene).
license: Unknown
metadata:
  skill-author: VenusFactory2
NCBI Sequence Tools
Overview
Wraps NCBI E-utilities (`efetch`, `esearch`) for sequence-centric workflows. Honors the `NCBI_API_KEY` env var, which raises the rate limit from 3 to 10 queries per second (QPS). Set the `USER_EMAIL` env var to identify yourself to NCBI, as its usage guidelines ask.
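As a rough sketch of how these two variables could map onto E-utilities request parameters: `api_key` and `email` are standard E-utilities query parameters, but the helper names `eutils_params` and `min_interval_seconds` are hypothetical and are not taken from the VenusFactory2 code.

```python
import os


def eutils_params(base, env=None):
    """Return a copy of `base` with api_key/email added when configured.

    Hypothetical helper: `env` defaults to os.environ and is injectable
    so the behaviour can be checked without touching the real environment.
    """
    env = os.environ if env is None else env
    params = dict(base)
    api_key = env.get("NCBI_API_KEY")
    email = env.get("USER_EMAIL")
    if api_key:
        params["api_key"] = api_key
    if email:
        params["email"] = email
    return params


def min_interval_seconds(env=None):
    """Minimum spacing between requests: 10 QPS with a key, 3 QPS without."""
    env = os.environ if env is None else env
    qps = 10 if env.get("NCBI_API_KEY") else 3
    return 1.0 / qps
```

A batch loop could sleep for `min_interval_seconds()` between calls to stay under the applicable limit.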
Project Tools (VenusFactory2)
| Tool | Args | Returns | Description |
|---|---|---|---|
| `download_ncbi_sequence` | `ncbi_id`, `out_dir`, `db` (default `protein`) | rich JSON envelope; FASTA file | Fetch a single sequence by accession from `protein` or `nuccore`. |
| `download_ncbi_metadata` | `ncbi_id`, `out_path`, `db` | rich JSON envelope; metadata JSON | Fetch GenBank-style metadata for an accession. |
| `download_ncbi_blast` | (see existing schema) | rich JSON envelope | NCBI-hosted BLAST. Prefer `download_mmseqs2_homologs_by_sequence` (faster) or `download_blast_homologs_by_sequence` (EBI mirror) for protein-protein searches. |
| `translate_ncbi_cds_to_protein` | `accession` (nuccore, e.g. `NM_000518` for HBB mRNA), `out_dir`, `target_length` (default 0 = longest), `timeout` | rich JSON envelope; FASTA at `<out_dir>/<accession>_protein.fasta`; `biological_metadata.method="fasta_cds_aa"` | Use `efetch(rettype=fasta_cds_aa)` to get the CDS-translated protein, pick the translation closest to `target_length` (or longest). |
| `search_ncbi_protein_by_gene_and_organism` | `gene` (e.g. `TP53`), `organism` (`"Homo sapiens"`), `out_dir`, `target_length` (default 0 = no length filter; non-zero filters to ±25 aa window), `retmax` (default 10), `timeout` | rich JSON envelope; multi-FASTA + `<stem>.json` summary; `biological_metadata.summary_path` for the per-hit JSON | Search NCBI Protein DB with `<gene>[Gene Name] AND <organism>[Organism]`, fetch all hits as multi-FASTA. |
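The query and length filter described for `search_ncbi_protein_by_gene_and_organism` can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the function names are hypothetical, the ±25 aa window is assumed to be inclusive, and the request uses the public `esearch.fcgi` endpoint with its standard `db`, `term`, `retmax` and `retmode` parameters.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(gene, organism):
    """Fielded Entrez query: <gene>[Gene Name] AND <organism>[Organism]."""
    return f"{gene}[Gene Name] AND {organism}[Organism]"


def build_esearch_url(gene, organism, retmax=10):
    """Build (but do not send) an esearch request against the protein DB."""
    query = urlencode({
        "db": "protein",
        "term": build_term(gene, organism),
        "retmax": retmax,
        "retmode": "json",
    })
    return f"{ESEARCH_URL}?{query}"


def filter_by_length(records, target_length=0, window=25):
    """Keep (header, sequence) pairs within +/- window aa of target_length.

    target_length == 0 means "no length filter", mirroring the table above.
    The inclusive window is an assumption of this sketch.
    """
    if target_length == 0:
        return list(records)
    return [(h, s) for h, s in records if abs(len(s) - target_length) <= window]
```

For example, `build_term("TP53", "Homo sapiens")` yields `TP53[Gene Name] AND Homo sapiens[Organism]`.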
Workflow: "Get the protein sequence for gene X in species Y"
- Try `search_ncbi_protein_by_gene_and_organism` first; if it finds 1-5 hits, pick the canonical one.
- If you know the mRNA accession (e.g. from a Gene record), use `translate_ncbi_cds_to_protein` for the canonical CDS-derived protein (no ambiguity from isoforms).
- Fallback: keyword search via `download_ncbi_metadata` to find an accession, then `download_ncbi_sequence`.
Workflow: "Translate this mRNA to protein"
- Direct: `translate_ncbi_cds_to_protein(accession=..., target_length=expected_aa_count)`. The tool uses NCBI's pre-translated CDS protein when available, so no client-side translation is needed and there are no codon-table issues.
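The selection step (closest to `target_length`, or longest when it is 0) might look like the sketch below. It assumes the `fasta_cds_aa` response has already been downloaded as multi-FASTA text. `parse_fasta` and `pick_translation` are hypothetical names, and ties are broken by taking the first record, which is an assumption rather than documented tool behaviour.

```python
def parse_fasta(text):
    """Parse multi-FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def pick_translation(records, target_length=0):
    """Pick the longest translation, or the one closest to target_length."""
    if not records:
        raise ValueError("no CDS translations found")
    if target_length == 0:
        return max(records, key=lambda r: len(r[1]))
    return min(records, key=lambda r: abs(len(r[1]) - target_length))
```

With two toy records of 8 and 2 residues, `pick_translation(records)` returns the 8-residue one, while `pick_translation(records, 3)` returns the 2-residue one.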
Common Mistakes
- Confusing `protein` and `nuccore` DBs: protein accessions (NP_, XP_, AAA-style) go to `db=protein`; mRNA/genomic (NM_, NC_, etc.) go to `db=nuccore`. `translate_ncbi_cds_to_protein` always uses `nuccore` internally.
- Mismatched gene+organism: NCBI is strict; `TP53 AND Homo sapiens` works, `tp53 AND human` may not.
- Forgetting `NCBI_API_KEY`: at 3 QPS you'll hit rate limits with batch operations. Set the env var to bump to 10 QPS.
- target_length too narrow: `search_ncbi_protein_by_gene_and_organism` applies a ±25 aa filter; if your target_length is uncertain, pass 0 and pick from results.
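To avoid the first mistake, the `db` value can be guessed from the accession prefix before calling a download tool. The helper below is a hypothetical sketch, not part of VenusFactory2: it covers only a subset of RefSeq prefixes, treats the three-letter, five-digit GenBank pattern as protein, and returns `None` for anything it does not recognise.

```python
import re

# Subset of RefSeq prefixes; extend as needed.
PROTEIN_PREFIXES = ("NP_", "XP_", "YP_", "WP_")
NUCLEOTIDE_PREFIXES = ("NM_", "NR_", "XM_", "XR_", "NC_", "NG_", "NT_", "NW_")


def guess_db(accession):
    """Return "protein", "nuccore", or None when the accession is not recognised."""
    acc = accession.strip().upper()
    if acc.startswith(PROTEIN_PREFIXES):
        return "protein"
    if acc.startswith(NUCLEOTIDE_PREFIXES):
        return "nuccore"
    # GenBank-style protein accession: 3 letters + 5 digits, optional version.
    if re.fullmatch(r"[A-Z]{3}\d{5}(\.\d+)?", acc):
        return "protein"
    return None
```

When `guess_db` returns `None`, it is safer to ask the user which database the accession belongs to than to default silently.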