name: alphagenome
description: RCC Midway HPC only. Query the AlphaGenome API (Google DeepMind v0.5.1) for splice site probability, splice site usage, gene expression, and other genomic predictions from raw DNA sequence. Invoke when running AlphaGenome predictions, using the AlphaGenome API, or using alphagenome_utils from my_utils.
AlphaGenome Skill
AlphaGenome (Google DeepMind v0.5.1) is a genomic foundation model that predicts multi-tissue functional genomics signals, including splice site probability and splice site usage, from raw DNA sequence.
Installation
AlphaGenome v0.5.1 is already installed in the py_general conda env. To install it in another environment:
pip install alphagenome
API Key
The key is stored in ~/.secrets as ALPHAGENOME_API_KEY and sourced via ~/.zshrc_local, so it is available as an environment variable:
import os
api_key = os.environ["ALPHAGENOME_API_KEY"]
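A bare os.environ lookup raises a KeyError with no context when the secrets file has not been sourced (for example in a batch job with a non-interactive shell). The sketch below wraps the lookup with a clearer error message; get_api_key is a hypothetical helper name, not part of the alphagenome package or my_utils.

```python
import os


def get_api_key(var: str = "ALPHAGENOME_API_KEY") -> str:
    """Return the API key from the environment, or fail with a readable hint.

    Hypothetical helper: assumes the key is exported as an environment variable.
    """
    key = os.environ.get(var, "").strip()
    if not key:
        raise RuntimeError(
            f"{var} is not set; source ~/.secrets (e.g. via ~/.zshrc_local) first"
        )
    return key
```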
Supported Sequence Lengths
The input sequence length must match exactly one of:
- 16,384 bp (fastest; use for local splice site analysis)
- 131,072 bp
- 524,288 bp
- 1,048,576 bp
Shorter sequences should be padded with N's to the next supported length.
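The padding rule can be sketched as a small helper. This is a minimal sketch, not the library's own method: pad_to_supported_length is a hypothetical name, the length tuple is copied from the list above, and center padding is an assumption (padding on one side only would also satisfy the rule). The returned offset lets you map output positions back to the original sequence.

```python
# Supported lengths as listed in this document.
SUPPORTED_LENGTHS = (16_384, 131_072, 524_288, 1_048_576)


def pad_to_supported_length(seq: str, pad_char: str = "N") -> tuple[str, int]:
    """Center-pad seq to the next supported length.

    Returns (padded_sequence, offset), where offset is the index of the
    first original base inside the padded sequence.
    """
    for target in SUPPORTED_LENGTHS:
        if len(seq) <= target:
            left = (target - len(seq)) // 2
            right = target - len(seq) - left
            return pad_char * left + seq + pad_char * right, left
    raise ValueError(
        f"sequence of {len(seq)} bp exceeds the largest supported length "
        f"({SUPPORTED_LENGTHS[-1]} bp)"
    )
```

Predictions for the original bases then sit in rows offset to offset + len(seq) of the (L, n_tracks) output arrays.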
Basic Usage
import os
from alphagenome.models.dna_client import create, OutputType, Organism
client = create(api_key=os.environ["ALPHAGENOME_API_KEY"])
# Sequence must be exactly a supported length (pad with N's if shorter)
sequence = "A" * 16_384  # placeholder
output = client.predict_sequence(
    sequence,
    organism=Organism.HOMO_SAPIENS,
    requested_outputs=[OutputType.SPLICE_SITES, OutputType.SPLICE_SITE_USAGE],
    ontology_terms=None,  # None = all tissues; or pass a list of OntologyTerm
)
# Per-position splice site probability
splice_probs = output.splice_sites # TrackData, shape (L, n_tracks)
# Per-position splice site usage (tissue-specific)
splice_usage = output.splice_site_usage # TrackData, shape (L, n_tracks)
# Per-junction scores (if requested)
# splice_juncs = output.splice_junctions # JunctionData
Output Types
| OutputType | Description |
|---|---|
| SPLICE_SITES | Per-position splice site probability |
| SPLICE_SITE_USAGE | Per-position tissue-specific usage |
| SPLICE_JUNCTIONS | Per-junction (not per-position) scores |
| RNA_SEQ | Tissue-specific gene expression (RNA-seq coverage) |
| ATAC | Chromatin accessibility (ATAC-seq signal) |
| CHIP_HISTONE | Histone modifications (H3K27ac, H3K4me3, etc.) |
Tissue Specification (Ontology CURIEs)
Pass ontology_terms=None to get all tissues (100+), or specify a list of ontology terms:
from alphagenome.data import ontology
# Common tissue CURIEs (UBERON ontology):
TISSUE_CURIES = {
    "brain": "UBERON:0000955",
    "liver": "UBERON:0002107",
    "heart": "UBERON:0000948",
    "skeletal_muscle": "UBERON:0001134",
    "testis": "UBERON:0000473",
    "kidney": "UBERON:0002113",
    "lung": "UBERON:0002048",
    "blood": "UBERON:0000178",
    "ovary": "UBERON:0000992",
    "thyroid": "UBERON:0002046",
}
# Request a specific tissue (CURIE strings can be passed directly):
output = client.predict_sequence(
    sequence,
    organism=Organism.HOMO_SAPIENS,
    requested_outputs=[OutputType.SPLICE_SITE_USAGE],
    ontology_terms=[TISSUE_CURIES["brain"]],  # "UBERON:0000955"
)
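When tissues come from user input or a config file, it can help to resolve names to CURIEs and reject typos before spending an API call. The following is a minimal sketch: resolve_tissues and CURIE_PATTERN are hypothetical names, and the PREFIX:digits pattern is only a loose format check, not a lookup against the ontology itself.

```python
import re

# Loose format check for CURIEs such as "UBERON:0000955" (assumption: PREFIX:digits).
CURIE_PATTERN = re.compile(r"^[A-Za-z]+:\d+$")


def resolve_tissues(names_or_curies, table):
    """Map tissue names to CURIEs via table; pass through well-formed CURIEs."""
    resolved = []
    for item in names_or_curies:
        curie = table.get(item, item)  # fall back to the item itself
        if not CURIE_PATTERN.match(curie):
            raise ValueError(f"unknown tissue or malformed CURIE: {item!r}")
        resolved.append(curie)
    return resolved
```

For example, resolve_tissues(["brain", "UBERON:0002107"], TISSUE_CURIES) yields a list of two CURIE strings suitable for ontology_terms.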
Accessing Track Values
vals = output.splice_site_usage.values  # numpy array (L, n_tracks)
# Track metadata (tissue names, ontology terms) is a pandas DataFrame, one row per track:
meta = output.splice_site_usage.metadata
for i, row in enumerate(meta.itertuples(index=False)):
    print(i, row.name, getattr(row, "ontology_curie", None))
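Once the (L, n_tracks) array is in hand, a common next step is to pull out the highest-scoring positions for one track. The sketch below uses only the standard library and works on any row-iterable 2D structure, including a numpy array; top_positions is a hypothetical helper name.

```python
import heapq


def top_positions(values, track, k=5):
    """Return the k highest-scoring (position, score) pairs for one track.

    values: 2D array-like of shape (L, n_tracks); track: column index.
    """
    scored = ((float(row[track]), pos) for pos, row in enumerate(values))
    return [(pos, score) for score, pos in heapq.nlargest(k, scored)]
```

If the sequence was padded before prediction, subtract the padding offset from each position to get coordinates in the original sequence.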
my_utils Integration
Use alphagenome_utils in my_utils for a drop-in predictor:
import os
import sys
sys.path.insert(0, "/project/yangili1/bjf79/repos_not_projects/my_utils/src")
from my_utils import alphagenome_utils, spliceai_utils
client = alphagenome_utils.create_alphagenome_client(
    os.environ["ALPHAGENOME_API_KEY"]
)
# seq and FASTA below are assumed to be defined by the caller.
# Get DataFrame(donor_prob, acceptor_prob, seq) for any sequence:
df = alphagenome_utils.predict_splice_sites_alphagenome(seq, client)
# Or as a predictor_fn for spliceO_predictions:
predictor = alphagenome_utils.make_alphagenome_predictor(client)
results = spliceai_utils.spliceO_predictions(
    FASTA, "chr2:1,840,760-1,887,609", (1840760, 1887609), 20, 20,
    {"donor": 1865000}, predictor_fn=predictor
)
# List available tissues:
tissue_df = alphagenome_utils.list_alphagenome_tissues(client)
Latency & Caching
- API calls are remote (Google Cloud); expect ~1–5 seconds per call
- Cache results to parquet/pickle when running large scans
- For 20N walks (~5,000 calls/exon), budget ~2–4 hours per exon
- Use target_length=16384 for the smallest footprint when the sequence allows
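The caching advice above can be sketched as an on-disk cache keyed by a hash of the input sequence. This is a minimal sketch under stated assumptions: cached_predict and the alphagenome_cache directory are hypothetical names, predict_fn stands for any callable that takes a sequence and returns a picklable result (for example a DataFrame or numpy array extracted from the API output), and the key covers the sequence only, so use a separate cache directory per combination of output types and tissues.

```python
import hashlib
import pickle
from pathlib import Path


def cached_predict(seq, predict_fn, cache_dir="alphagenome_cache"):
    """Return predict_fn(seq), reusing a pickled result from disk if present."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(seq.encode("ascii")).hexdigest()
    result_file = cache_path / f"{key}.pkl"
    if result_file.exists():
        with result_file.open("rb") as fh:
            return pickle.load(fh)
    result = predict_fn(seq)  # the only place a remote call would happen
    # Write to a temp file first so an interrupted job never leaves a
    # truncated cache entry behind.
    tmp_file = result_file.with_suffix(".tmp")
    with tmp_file.open("wb") as fh:
        pickle.dump(result, fh)
    tmp_file.replace(result_file)
    return result
```

With this in place, a scan that is interrupted partway through can be restarted and will only pay for the calls that have not yet completed.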
Notes
- Sequence must contain only ACGTN (no IUPAC ambiguity codes)
- The model pads internally, but the sequence passed must match a supported length exactly
- SPLICE_SITE_USAGE returns values in [0, 1] representing fractional usage
- SPLICE_SITES returns probability of being a splice site position
- Track layout: typically interleaved (even=donor, odd=acceptor per tissue); verify with metadata
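The alphabet and length constraints above can be checked locally before a request is sent, which avoids spending a remote call on input that would be rejected. This is a minimal sketch: validate_sequence is a hypothetical helper, the length set is copied from the Supported Sequence Lengths section, and uppercasing the input first is an assumption about what the caller wants.

```python
# Lengths and alphabet as stated in this document.
SUPPORTED_LENGTHS = frozenset({16_384, 131_072, 524_288, 1_048_576})
ALLOWED_BASES = frozenset("ACGTN")


def validate_sequence(seq: str) -> str:
    """Uppercase seq and check alphabet and length; return the normalized string."""
    seq = seq.upper()
    bad = sorted(set(seq) - ALLOWED_BASES)
    if bad:
        raise ValueError(f"unsupported characters in sequence: {''.join(bad)}")
    if len(seq) not in SUPPORTED_LENGTHS:
        raise ValueError(
            f"length {len(seq)} is not supported; pad with N to one of "
            f"{sorted(SUPPORTED_LENGTHS)}"
        )
    return seq
```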