name: protein-sequence-similarity-search
description: >
  Searches for homologous protein sequences using MMseqs2 (fast, default) or BLAST (comprehensive, fallback). Trigger this whenever the user provides a protein sequence or FASTA file and asks to find homologues, sequence matches, or wants to infer protein function based on sequence similarity, but not when the user wants to infer protein function based on structural similarity.
Prerequisites
- uv: Read the uv skill and follow its Setup instructions to ensure uv is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast and https://colabfold.com, and (2) create the file, recording the notification text and a timestamp.
- .env file: Make sure the .env file exists in your home directory. Create one if it does not exist.
  - USER_EMAIL (optional but recommended): Recommended by the EBI for BLAST job tracking, but the skill works without it. If the variable is missing from .env, do NOT ask the user to paste it into the chat (this would leak the value into the agent's context). Instead, give the user this command, substituting ENV_FILE with the resolved literal path to the .env file:
    printf "Enter contact email: " && read email && echo "USER_EMAIL=$email" >> "ENV_FILE" && echo "Saved."
  - The scripts load credentials automatically via dotenv. NEVER read, print, or inspect the .env file or its variables (e.g. no cat, grep, echo, printenv, or os.environ.get on keys). Credentials must stay out of the agent's context.
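The User Notification prerequisite amounts to a run-once check. The following is a minimal sketch of that logic, not part of the skill's scripts: the function name `ensure_license_notification`, the `NOTICE` wording, and the file layout (timestamp on the first line) are assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical wording; the skill only requires that the notification text
# and a timestamp are recorded.
NOTICE = (
    "Please check the terms of use at "
    "https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast and https://colabfold.com"
)


def ensure_license_notification(skill_dir: Path) -> bool:
    """Return True if the user still has to be shown NOTICE (first use).

    On first use, a UTC timestamp and the notice text are written to
    LICENSE_NOTIFICATION.txt. On later calls the file is left untouched.
    """
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    marker.write_text(f"{stamp}\n{NOTICE}\n", encoding="utf-8")
    return True
```

A caller would display `NOTICE` prominently whenever the function returns True and stay silent otherwise.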
Goal
Take a user-provided amino acid sequence (or a path to a .fasta file), search
for sequence homologues using the fastest available method, generate a
Markdown-formatted table of the top hits, interpret key alignment metrics,
summarize the inferred protein functions, and save results locally for future
programmatic analysis.
Core Rules
- Strict Validation: For BLAST, only use database codes listed in the table below.
- No Hallucinations: If a script throws an error or returns no hits, inform the user clearly. Do NOT invent sequence homologues.
- Do Not Parse Output Files: Do not parse the JSON, a3m, or any other raw output files. Rely on the generated .md file for your summary. The JSON and other outputs are for subsequent tool use only.
- Always State the Method: Every report must clearly state whether the search used the quick MMseqs2 (ColabFold API) or the slower EBI BLAST method.
- Notification: Whenever this skill is used, say so in the output, and explicitly state which program (MMseqs2 or EBI BLAST) and which sequence databases were used.
Search Method Selection
Choose the search method based on the user's request:
- If the user asks for a "quick search" or "fast search", asks for a general homologue search without naming a method, or if you are unsure: Run MMseqs2 (fast, default) using mmseqs2_search.py.
- If MMseqs2 fails (exit code 2: RATELIMIT or API error), or the user explicitly requests "BLAST" or a specific BLAST database (e.g. uniprotkb_swissprot, pdb, uniprotkb_human): Run BLAST using uniprot_blast.py.
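The selection rules above can be read as a small decision function. This is an illustrative sketch only: `choose_method` and the keyword matching are assumptions, and in practice the agent makes this judgement from the conversation, not from string matching.

```python
import re
from typing import Optional

# "blast" plus the database codes given as examples in the rules above.
# A fuller version would match every code in Available BLAST Databases.
BLAST_HINTS = frozenset({"blast", "uniprotkb_swissprot", "pdb", "uniprotkb_human"})


def choose_method(request: str, mmseqs2_exit_code: Optional[int] = None) -> str:
    """Return "mmseqs2" or "blast" for a user request.

    mmseqs2_exit_code is the exit status of a previous MMseqs2 attempt,
    or None if MMseqs2 has not been tried yet.
    """
    if mmseqs2_exit_code == 2:  # rate limit or API error: fall back
        return "blast"
    words = set(re.findall(r"[a-z0-9_]+", request.lower()))
    if words & BLAST_HINTS:  # explicit BLAST or BLAST database request
        return "blast"
    return "mmseqs2"  # quick search, general search, or unsure
```

For example, `choose_method("Run BLAST against uniprotkb_swissprot")` selects BLAST, while a plain "find homologues" request selects MMseqs2.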
Instructions
Identify the query from the user. It can be a raw sequence string (e.g., "MKVLY...") or a path to a local file (e.g., "./data/sequence.fasta").
Determine the search method using the list above.
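Telling a raw sequence apart from a file path can be sketched as follows. This is a hedged illustration rather than what the skill's scripts do: `classify_query` and the accepted residue alphabet are assumptions, and the scripts themselves accept either form through the same <SEQUENCE_OR_FILE> argument.

```python
from pathlib import Path

# Assumed alphabet: the 20 standard one-letter residue codes plus the
# ambiguity and rare codes B, J, O, U, X, Z.
RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWYBJOUXZ")


def _is_file(text: str) -> bool:
    """True if text names an existing file; False on any path error."""
    try:
        return Path(text).is_file()
    except (OSError, ValueError):  # e.g. a long sequence is not a valid file name
        return False


def classify_query(query: str) -> str:
    """Return "file" for an existing path or "sequence" for a residue string."""
    text = query.strip()
    if not text:
        raise ValueError("empty query")
    if _is_file(text):  # checked first: a file called "seq" is also all residue letters
        return "file"
    residues = "".join(text.split()).upper()  # drop whitespace and line breaks
    if set(residues) <= RESIDUES:
        return "sequence"
    raise ValueError("query is neither an existing file nor an amino acid sequence")
```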
Path A: MMseqs2 Search (Default)
Generate File Names: Generate descriptive output file names based on the input (e.g., proteinA_mmseqs2.json and proteinA_mmseqs2.md).
Execute the MMseqs2 script:
- Default: uv run scripts/mmseqs2_search.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json>
- With MGnify: uv run scripts/mmseqs2_search.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json> --include-mgnify
The script will query the ColabFold MMseqs2 API and poll for completion. This is typically fast (under 2 minutes).
If the script exits with code 2 (API failure, rate limit), automatically fall back to BLAST (Path B below). Inform the user: "MMseqs2 search failed, falling back to BLAST."
Read the Results: Open and read the generated .md file.
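Path A and its fallback can be sketched as a thin wrapper around the two commands. The command lines and flags below are the ones given in this skill; the helper names (`mmseqs2_command`, `blast_command`, `search_with_fallback`), the `stem` parameter, and the injectable `run` argument are assumptions made so the sketch can be exercised without network access.

```python
import subprocess
from typing import Callable, List, Tuple


def mmseqs2_command(query: str, stem: str, include_mgnify: bool = False) -> List[str]:
    """Build the MMseqs2 command; stem is the descriptive file-name prefix."""
    cmd = ["uv", "run", "scripts/mmseqs2_search.py", query,
           "-o", f"{stem}_mmseqs2.md", "-j", f"{stem}_mmseqs2.json"]
    if include_mgnify:
        cmd.append("--include-mgnify")
    return cmd


def blast_command(query: str, stem: str) -> List[str]:
    """Build the default (uniprotkb) BLAST command."""
    return ["uv", "run", "scripts/uniprot_blast.py", query,
            "-o", f"{stem}_ebi_blast.md", "-j", f"{stem}_ebi_blast.json"]


def search_with_fallback(query: str, stem: str,
                         run: Callable = subprocess.run) -> Tuple[str, str]:
    """Run MMseqs2 and fall back to BLAST on exit code 2.

    Returns (method name, path of the Markdown report to read).
    """
    result = run(mmseqs2_command(query, stem))
    if result.returncode == 0:
        return "MMseqs2", f"{stem}_mmseqs2.md"
    if result.returncode == 2:
        print("MMseqs2 search failed, falling back to BLAST.")
        result = run(blast_command(query, stem))
        if result.returncode == 0:
            return "EBI BLAST", f"{stem}_ebi_blast.md"
    # Any other failure is reported, never papered over with invented hits.
    raise RuntimeError(f"search failed with exit code {result.returncode}")
```

Returning the method name alongside the report path makes it easy to satisfy the "Always State the Method" rule in the final summary.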
Path B: BLAST Search (Explicit or Fallback)
Database Selection & Validation: Determine the most appropriate database(s) based on the user's prompt.
- Consult the Available BLAST Databases table below.
- If the user specifies a taxonomic group (e.g., "Find homologues in microbes"), select the corresponding Database Code (e.g., uniprotkb_bacteria).
- If the user explicitly requests curated hits, use uniprotkb_swissprot.
- If no specific database is requested, do not specify --databases.
- Validation: Ensure the database code exactly matches an entry in the table. If the user requests a database not on the list, do not proceed and provide the allowed list.
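Generate File Names: Generate descriptive output file names based on the input (e.g., proteinA_ebi_blast.json and proteinA_ebi_blast.md).
The EBI API takes a contact email address in the request header; the script reads it from the USER_EMAIL environment variable (optional but recommended, see Prerequisites).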
Execute the BLAST script:
- Default (uniprotkb): uv run scripts/uniprot_blast.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json>
- Custom database: uv run scripts/uniprot_blast.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json> --databases <db1,db2>
The script will query the EBI BLAST API and poll the server. Note: This can take up to 15 minutes; wait patiently.
Read the Results: Open and read the generated .md file.
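The Strict Validation rule for Path B can be sketched as a check against the allowed codes before the command is built. The codes are those listed under Available BLAST Databases and the flags are the ones shown above; the names `ALLOWED_DATABASES` and `blast_command` are assumptions for illustration.

```python
from typing import List, Sequence

# Database codes copied from the Available BLAST Databases section.
ALLOWED_DATABASES = frozenset({
    "uniprotkb", "uniprotkb_swissprot", "uniprotkb_swissprotsv",
    "uniprotkb_reference_proteomes", "uniprotkb_trembl",
    "uniprotkb_refprotswissprot", "uniprotkb_archaea", "uniprotkb_arthropoda",
    "uniprotkb_bacteria", "uniprotkb_complete_microbial_proteomes",
    "uniprotkb_eukaryota", "uniprotkb_fungi", "uniprotkb_human",
    "uniprotkb_mammals", "uniprotkb_nematoda", "uniprotkb_rodents",
    "uniprotkb_vertebrates", "uniprotkb_viridiplantae", "uniprotkb_viruses",
    "uniprotkb_enzyme", "uniprotkb_covid19",
    "uniref100", "uniref90", "uniref50", "pdb",
})


def blast_command(query: str, stem: str, databases: Sequence[str] = ()) -> List[str]:
    """Build the BLAST command, refusing any database code not on the list."""
    unknown = sorted(set(databases) - ALLOWED_DATABASES)
    if unknown:
        raise ValueError(
            f"unsupported database code(s): {', '.join(unknown)}; "
            f"allowed: {', '.join(sorted(ALLOWED_DATABASES))}"
        )
    cmd = ["uv", "run", "scripts/uniprot_blast.py", query,
           "-o", f"{stem}_ebi_blast.md", "-j", f"{stem}_ebi_blast.json"]
    if databases:  # omit the flag entirely when no database was requested
        cmd += ["--databases", ",".join(databases)]
    return cmd
```

The error message carries the allowed list, which matches the instruction to stop and show the user the valid codes.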
Common Steps (Both Methods)
- Interpret the Metrics: Summarize the top 3 to 5 sequence homologues.
Assess match quality using:
- Q-Cov (Query Coverage): High percentages mean the match covers most of the query sequence.
- E-value: Lower E-values (e.g., 1e-50) indicate stronger statistical significance, meaning the match is less likely to have arisen by chance.
- Seq Identity: Provides evolutionary context (highly conserved vs. distant homologue).
- Perform Functional Analysis:
- If the results table includes protein descriptions, analyze them directly: report specific protein names/functions of the top homologues and summarize the variety of functions, domains, or protein families found.
- If the results contain only UniProt accession IDs without descriptions (common with MMseqs2), look up the protein names and functions for the top 3–5 hits using the uniprot-database skill or other appropriate methods before summarizing.
- Inform the user of both newly created files (.json and .md) and their locations.
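One way to turn the three metrics into a short verbal assessment is sketched below. The cut-offs (E-value 1e-5, 70% coverage, 70% and 30% identity) are common rules of thumb chosen for illustration; they are assumptions, not values defined by this skill or its scripts, and the `Hit` fields are hypothetical names for values read from the .md report.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str        # hit identifier as shown in the report
    e_value: float        # lower is more significant
    identity_pct: float   # Seq Identity, 0-100
    query_cov_pct: float  # Q-Cov, 0-100


def describe_hit(hit: Hit, max_e: float = 1e-5, min_cov: float = 70.0,
                 close_identity: float = 70.0, distant_identity: float = 30.0) -> str:
    """Summarize a hit in words using illustrative, adjustable thresholds."""
    if hit.e_value > max_e:
        return "not significant"
    coverage = "full-length" if hit.query_cov_pct >= min_cov else "partial"
    if hit.identity_pct >= close_identity:
        relation = "close homologue"
    elif hit.identity_pct >= distant_identity:
        relation = "homologue"
    else:
        relation = "distant homologue"
    return f"{relation}, {coverage} match"
```

A partial match with a strong E-value often points to a shared domain rather than a full-length homologue, which is worth saying explicitly in the functional summary.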
Available BLAST Databases
| Database Code | Name | Description |
| --- | --- | --- |
| uniprotkb | UniProt Knowledgebase (The UniProt Knowledgebase includes UniProtKB/Swiss-Prot and UniProtKB/TrEMBL) | The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. Search UniProtKB to retrieve "everything that is known" about a particular sequence |
| uniprotkb_swissprot | UniProtKB/Swiss-Prot (The manually annotated section of UniProtKB) | The manually curated subsection of the UniProt Knowledgebase |
| uniprotkb_swissprotsv | UniProtKB/Swiss-Prot isoforms (The manually annotated isoforms of UniProtKB/Swiss-Prot) | The isoform sequences for the manually curated subsection of the UniProt Knowledgebase |
| uniprotkb_reference_proteomes | UniProtKB Reference Proteomes | Taxonomic subset of the UniProtKB Reference Proteomes |
| uniprotkb_trembl | UniProtKB/TrEMBL (The automatically annotated section of UniProtKB) | Subsection of the UniProt Knowledgebase derived from ENA Sequence (formerly EMBL-Bank) coding sequence translations with annotation produced by an automated process |
| uniprotkb_refprotswissprot | UniProtKB Reference Proteomes plus Swiss-Prot | UniProtKB Reference Proteomes plus Swiss-Prot |
| uniprotkb_archaea | UniProtKB Archaea | Taxonomic subset of the UniProt Knowledgebase for archaea |
| uniprotkb_arthropoda | UniProtKB Arthropoda | Taxonomic subset of the UniProt Knowledgebase for arthropoda |
| uniprotkb_bacteria | UniProtKB Bacteria | Taxonomic subset of the UniProt Knowledgebase for bacteria |
| uniprotkb_complete_microbial_proteomes | UniProtKB Complete Microbial Proteomes | Taxonomic subset of the UniProt Knowledgebase for complete microbial proteomes |
| uniprotkb_eukaryota | UniProtKB Eukaryota | Taxonomic subset of the UniProt Knowledgebase for eukaryota |
| uniprotkb_fungi | UniProtKB Fungi | Taxonomic subset of the UniProt Knowledgebase for fungi |
| uniprotkb_human | UniProtKB Human | Taxonomic subset of the UniProt Knowledgebase for human |
| uniprotkb_mammals | UniProtKB Mammals | Taxonomic subset of the UniProt Knowledgebase for mammals |
| uniprotkb_nematoda | UniProtKB Nematoda | Taxonomic subset of the UniProt Knowledgebase for nematoda |
| uniprotkb_rodents | UniProtKB Rodents | Taxonomic subset of the UniProt Knowledgebase for rodents |
| uniprotkb_vertebrates | UniProtKB Vertebrates | Taxonomic subset of the UniProt Knowledgebase for vertebrates |
| uniprotkb_viridiplantae | UniProtKB Viridiplantae | Taxonomic subset of the UniProt Knowledgebase for viridiplantae |
| uniprotkb_viruses | UniProtKB Viruses | Taxonomic subset of the UniProt Knowledgebase for viruses |
| uniprotkb_enzyme | UniProtKB Enzyme | Taxonomic subset of the UniProt Knowledgebase for enzymes |
| uniprotkb_covid19 | UniProtKB COVID-19 | Taxonomic subset of the UniProt Knowledgebase for COVID-19 |
| uniref100 | UniProt Clusters 100% (UniRef100) | The UniProt Reference Clusters (UniRef) containing sequences which are 100% identical. |
| uniref90 | UniProt Clusters 90% (UniRef90) | The UniProt Reference Clusters (UniRef) containing sequences which are 90% identical. |
| uniref50 | UniProt Clusters 50% (UniRef50) | The UniProt Reference Clusters (UniRef) containing sequences which are 50% identical. |
| pdb | Protein Structure Sequences (PDBe protein structure sequences) | Protein sequences from structures described in the Brookhaven Protein Data Bank (PDB) |