name: jgi-lakehouse description: Queries JGI Lakehouse (Dremio) for genomics metadata from GOLD, IMG, Mycocosm, Phytozome. Downloads genome files from JGI filesystem using IMG taxon OIDs and links JGI taxon OIDs to read files through PMO/GOLD identifiers and JAMO. Use when working with JGI data, GOLD projects, IMG annotations, or downloading genomes.
JGI Lakehouse Skill
Quick Start
What is it? JGI's unified data warehouse (651 tables) + filesystem access to genome files.
Two data access methods:
- Lakehouse (Dremio) → Metadata, annotations, taxonomy (no sequences)
- JGI Filesystem → Actual genome files (FNA, FAA, GFF) via taxon OID
SQL Dialect: ANSI SQL (not PostgreSQL)
- Use
CAST(x AS type)not:: - Use
REGEXP_LIKE()not~ - Identifiers with dashes need double quotes:
"gold-db-2 postgresql"
-- Quick test
SELECT gold_id, project_name FROM "gold-db-2 postgresql".gold.project
WHERE is_public = 'Yes' LIMIT 5;
When to Use
- Query JGI genomics metadata (GOLD, IMG, Mycocosm, Phytozome)
- Find genomes and/or metagenomes by taxonomy, ecosystem, or phenotype.
- Download microbial genomes with IMG taxon OIDs
- Cross-reference GOLD projects with IMG annotations
Instructions
- Decide whether the task needs metadata, files, or read recovery.
- Use Lakehouse SQL for metadata/annotations and the JGI filesystem or JAMO for sequence files.
- Inspect schemas with a small
LIMIT; removeLIMITfor comprehensive results. - Record source, table, fields, filters, and access route in every result summary.
Quick Reference
| Task | Action |
|---|---|
| Test Lakehouse | Query "gold-db-2 postgresql".gold.project |
| Query IMG metadata | Use "img-db-2 postgresql".img_core_v400.* tables |
| Query NUMG proteins | Join faa and gene2pfam on both oid and gene_oid |
| Download IMG genome | Copy {taxon_oid}.tar.gz from /clusterfs/jgi/img_merfs-ro/img_web/img_web_data/download/ |
| Link reads | Start with jamo info all pmoid <img_jgi_project_id> |
Input Requirements
- JGI/GOLD/IMG identifier, taxonomy, project filter, term, or file target.
DREMIO_PATfor SQL; filesystem/JAMO access for files or reads.- Whether the result is exploratory, comprehensive, or a download.
- Optional runtime settings:
DREMIO_VERIFY_TLS(default1; setfalseonly for internal endpoints with unavailable CA validation),DREMIO_REQUEST_TIMEOUT(default60seconds),IMG_DOWNLOAD_DIR, andIMG_DATA_DIR.
Output
- SQL/result summaries with exact filtering criteria.
- Tables or exports for reusable metadata.
- Downloaded genome/protein/annotation files when requested.
- Provenance covering database, tables, filters, access route, and file paths.
Quality Gates
- Query source and access route match the task.
- Schema was inspected before writing joins against unfamiliar tables.
- Comprehensive answers do not use development
LIMITclauses. - Joins use correct keys, especially
oid+gene_oidfor NUMG. - Result summaries state all filters and whether they are exact matches, regexes, ranges, or aggregations.
Examples
See examples/04-download-img-genomes.md and examples/05-query-numg-metagenome-proteins.md.
Troubleshooting
Issue: Metadata query returns no sequence files. Solution: Use filesystem, portal downloadRequestFiles, or JAMO.
Issue: Final answer is based on 50 or 100 rows. Solution: Remove exploratory limits or use aggregation.
Issue: Dremio HTTPS fails with certificate verification errors on the internal endpoint. Solution: Prefer configuring the local CA bundle; for an explicit internal-network fallback, set DREMIO_VERIFY_TLS=false and record that opt-out in the run notes.
Data Access: Lakehouse vs Filesystem
| Need | Source | Access Method |
|---|---|---|
| Metadata (taxonomy, projects) | Lakehouse | SQL via REST API |
| Gene annotations (COG, Pfam, KO) | Lakehouse | SQL via REST API |
| Genome sequences (FNA) | JGI Filesystem | Copy from /clusterfs/jgi/img_merfs-ro/ |
| Protein sequences (FAA) | JGI Filesystem | Copy from /clusterfs/jgi/img_merfs-ro/ |
| Metagenome proteins only | Lakehouse | numg-iceberg.faa table |
Critical insight: The Lakehouse is a METADATA warehouse. Genome sequences must be accessed from the JGI filesystem.
Key Data Sources
| Source | Path | Contents |
|---|---|---|
| GOLD | "gold-db-2 postgresql".gold.* |
Projects, studies, samples, taxonomy |
| IMG | "img-db-2 postgresql".img_core_v400.* |
Taxons, genes, annotations (244 tables) |
| Portal | "portal-db-1".portal.* |
Download tracking, file paths |
| Mycocosm | "myco-db-1 mysql".<organism>.* |
Fungal genomes (2,711 schemas) |
| Phytozome | "plant-db-7 postgresql".* |
Plant genomics — see docs/phytozome.md |
| NUMG | "numg-iceberg"."numg-iceberg".* |
Metagenome proteins, Pfam hits |
Full table catalog: See docs/data-catalog.md
Phytozome (plant-db-7 / plant-db-4): Read docs/phytozome.md before writing any queries against these sources.
NUMG (Metagenome Proteins) Agent Workflow
Use NUMG when the task is metagenome protein sequence/domain analysis.
Scope rules:
numg-icebergis metagenome-focused.- Do not use NUMG for isolate genome protein retrieval; use IMG filesystem packages.
Core tables:
"numg-iceberg"."numg-iceberg".faaoid,gene_oid,faa(protein sequence)
"numg-iceberg"."numg-iceberg".gene2pfamoid,gene_oid,pfam,evalue, alignment coordinate fields
Recommended query flow:
-- 1) Confirm available NUMG tables
SHOW TABLES IN "numg-iceberg"."numg-iceberg";
-- 2) Inspect schema before writing joins/filters
DESCRIBE "numg-iceberg"."numg-iceberg".faa;
DESCRIBE "numg-iceberg"."numg-iceberg".gene2pfam;
-- 3) Domain filter (use exact lowercase pfam IDs)
SELECT oid, gene_oid, pfam, evalue
FROM "numg-iceberg"."numg-iceberg".gene2pfam
WHERE pfam IN ('pfam00001', 'pfam00004')
LIMIT 100;
-- 4) Join domains to protein sequences
SELECT
p.oid,
p.gene_oid,
p.pfam,
p.evalue,
f.faa
FROM "numg-iceberg"."numg-iceberg".gene2pfam p
JOIN "numg-iceberg"."numg-iceberg".faa f
ON p.oid = f.oid
AND p.gene_oid = f.gene_oid
WHERE p.pfam = 'pfam00001'
LIMIT 100;
Important NUMG rules:
- Join on both
oidandgene_oid(notgene_oidalone). - Keep Pfam filters exact (
pfam00001, not case-transformed). - Always start with
LIMITand expand only after verifying row shape.
See also: examples/05-query-numg-metagenome-proteins.md
Downloading Genomes with IMG Taxon OIDs
Option 1: JGI Filesystem (Fastest)
# Genome packages are at:
/clusterfs/jgi/img_merfs-ro/img_web/img_web_data/download/{taxon_oid}.tar.gz
# Example: Copy and extract
cp /clusterfs/jgi/img_merfs-ro/img_web/img_web_data/download/8136918376.tar.gz .
tar -xzf 8136918376.tar.gz
Package contents:
{taxon_oid}.fna- Genome assembly{taxon_oid}.genes.faa- Protein sequences{taxon_oid}.genes.fna- Gene nucleotide sequences{taxon_oid}.gff- GFF annotations{taxon_oid}.cog.tab.txt- COG annotations{taxon_oid}.pfam.tab.txt- Pfam annotations{taxon_oid}.ko.tab.txt- KEGG KO annotations
Linking Assemblies To Reads
Use this workflow when you need to go from an IMG metagenome assembly to the underlying JGI reads.
1. Start from the assembly taxon OID
Assemblies live under:
/clusterfs/jgi/img_merfs-ro/img_web_data_merfs/{taxon_oid}/assembled/
2. Pull the JGI/GOLD linkage fields from metadata
For a metagenome taxon OID, the most useful linkage fields are:
img_jgi_project_idsequencing_gold_idsample_gold_idstudy_gold_idgold_project_idgold_pmo_project_idgold_its_spid
In practice, img_jgi_project_id is often the strongest key for JAMO because it behaves like the PMO project identifier used by jamo info ... pmoid.
3. Prefer JAMO pmoid for JGI read lookup
Native JAMO lookup types are listed by:
apptainer run docker://doejgi/jamo-dori:latest jamo info help
For legacy JGI metagenomes, this usually works better than raw_normal spid:
apptainer run docker://doejgi/jamo-dori:latest \
jamo info all pmoid <img_jgi_project_id>
If you only want FASTQ rows, filter the output:
apptainer run docker://doejgi/jamo-dori:latest \
jamo info all pmoid <img_jgi_project_id> | rg 'fastq(\\.gz)?'
4. Direct taxon-OID lookup is still useful
This queries JAMO by the IMG taxon OID embedded in metadata:
apptainer run docker://doejgi/jamo-dori:latest \
jamo info all custom '{"metadata.gold_data.img_oid": 3300000030, "file_name": {"$regex": ".*fastq(\\\\.gz)?$"}}'
This can recover reads even when the older spid route is blank, but in recent re-audits pmoid recovered many more JGI rows.
5. spid is valid, but not sufficient
If you already have a verified sequencing project ID, this is still worth trying:
apptainer run docker://doejgi/jamo-dori:latest \
jamo info raw_normal spid <gold_its_spid>
But do not stop there. In several JGI cases:
raw_normal spidreturned nothingall pmoid <img_jgi_project_id>returned usable FASTQ records
6. Inspect and fetch the actual file
Inspect one metadata record:
apptainer run docker://doejgi/jamo-dori:latest jamo show <metadata_id>
Fetch a file by filename:
apptainer run docker://doejgi/jamo-dori:latest \
jamo fetch -s dori all filename <file_name>
That prints the staged scratch path, typically under:
/clusterfs/jgi/scratch/dsi/...
Important:
- if the file is already
RESTORED, you can use the staged path immediately - if the file is
PURGED,jamo fetchonly starts the restore; you must wait until the staged path exists and has non-zero size before using it
Simple wait pattern:
while [[ ! -s /clusterfs/jgi/scratch/dsi/.../file.fastq.gz ]]; do sleep 10; done
Practical rule
For JGI metagenome read recovery, use this priority:
jamo info all pmoid <img_jgi_project_id>jamo info all custom '{"metadata.gold_data.img_oid": ...}'jamo info raw_normal spid <gold_its_spid>
Do not assume "no reads" until all three have been checked. Do not assume a fetched file is ready until the staged path is actually restored.
Portal Downloads (Mycocosm / Phytozome)
The portal tracks downloadable files for Mycocosm and Phytozome in
"portal-db-1".portal.downloadRequestFiles. Use filePath to copy data
from the JGI filesystem (/global/dna/dm_archive/...).
Mycocosm (fungal genomes/proteins):
SELECT filePath, fileType
FROM "portal-db-1".portal.downloadRequestFiles
WHERE LOWER(filePath) LIKE '%mycocosm%'
AND (filePath LIKE '%.fasta%' OR filePath LIKE '%.fa%' OR filePath LIKE '%.faa%')
LIMIT 20;
Phytozome (plant genomes/proteins):
SELECT filePath, fileType
FROM "portal-db-1".portal.downloadRequestFiles
WHERE LOWER(filePath) LIKE '%phytozome%'
AND (filePath LIKE '%.fa%' OR filePath LIKE '%.fna%' OR filePath LIKE '%.faa%')
LIMIT 20;
Download from filesystem:
cp /global/dna/dm_archive/<path/from-filePath> .
Notes:
fileTypetypically includesAssembly,Annotation, orSequence.virtualPathcan provide a user-facing download label butfilePathis the real location.
Query Best Practices
⚠️ CRITICAL: When building queries, distinguish between exploration and comprehensive analysis:
Exploration Queries
Use LIMIT for quick validation during development:
-- For testing query structure and results
SELECT gold_id, project_name
FROM "gold-db-2 postgresql".gold.project
WHERE is_public = 'Yes'
LIMIT 10; -- ✓ OK for testing
Comprehensive Queries
Remove LIMIT and other result-limiting clauses when answering actual questions:
-- For getting actual dataset counts/results
SELECT COUNT(DISTINCT taxon_oid)
FROM "img-db-2 postgresql".img_core_v400.taxon
WHERE genome_type = 'metagenome'
AND is_public = 'Yes';
-- ✓ No LIMIT - gets true total
Common pitfalls:
- ❌
LIMIT 100on initial exploration → assumes only 100 results exist - ❌
LIMIT 50on a "find all" query → misses 99% of data - ❌ Using
FETCH FIRST N ROWS→ same issue as LIMIT
Best practice:
- Use
LIMITwith COUNT(*) or smallLIMITduring development - Once query logic is correct, remove LIMIT to get true results
- For very large result sets, use aggregation (COUNT, GROUP BY) to summarize instead
Common Queries
Find Bacterial Isolate Genomes
-- Get count of all finished bacterial isolates
SELECT COUNT(DISTINCT taxon_oid) as total_isolates
FROM "img-db-2 postgresql".img_core_v400.taxon
WHERE domain = 'Bacteria'
AND genome_type = 'isolate'
AND is_public = 'Yes'
AND seq_status = 'Finished';
-- Get sample of isolates (if you need details)
SELECT taxon_oid, taxon_display_name, phylum, genus, species
FROM "img-db-2 postgresql".img_core_v400.taxon
WHERE domain = 'Bacteria'
AND genome_type = 'isolate'
AND is_public = 'Yes'
AND seq_status = 'Finished'
LIMIT 100;
Link GOLD Project to IMG Taxon
SELECT COUNT(DISTINCT t.taxon_oid) as total_linked
FROM "img-db-2 postgresql".img_core_v400.taxon t
WHERE t.sequencing_gold_id IS NOT NULL;
Find Genomes with File Paths (Portal)
SELECT COUNT(DISTINCT taxonOid) as total_tar_gz
FROM "portal-db-1".portal.downloadRequestFiles
WHERE taxonOid IS NOT NULL
AND filePath LIKE '%.tar.gz';
Critical Pitfalls
| Wrong | Correct |
|---|---|
Using LIMIT in comprehensive queries |
Remove LIMIT when answering actual questions; use COUNT() for aggregation |
Join ncbi_assembly on project_id |
ncbi_assembly has no project_id; use bioproject or biosample |
project.ecosystem |
Join study via master_study_id |
SHOW SCHEMAS IN "source" |
Works, but some syntax errors in older Dremio |
| Get sequences from Lakehouse | Download from JGI filesystem |
sra_experiment_v2.platform |
Use library_instrument |
gene_ko_terms = 'K00025' |
Use gene_ko_terms = 'KO:K00025' |
Join NUMG on gene_oid only |
Join on both oid and gene_oid |
| Case-normalizing large function tables | Use exact normalized values (pfam00001, COG1389, etc.) |
| Isolate benchmark counts vary | Add obsolete_flag = 'No' and is_public = 'Yes' |
IMG.gene_feature fails expansion |
Fallback to "img-db-2 postgresql".img_core_v400.* tables |
show_schemas() misses sources |
Use higher limit (e.g. show_schemas(limit=2000)) |
Authentication
export DREMIO_PAT=$(cat ~/.secrets/dremio_pat)
Token setup: See docs/authentication.md
API Access
REST API Base: http://lakehouse-1.jgi.lbl.gov:9047/api/v3
# Use scripts/rest_client.py
from rest_client import query
results = query("SELECT * FROM ... LIMIT 10")
Arrow Flight (Python)
For higher-performance programmatic access, use Arrow Flight with Python.
python3 -m venv venv
. venv/bin/activate
pip install \
https://github.com/dremio-hub/arrow-flight-client-examples/releases/download/dremio-flight-python-v1.1.0/dremio_flight-1.1.0-py3-none-any.whl
Full guide: docs/arrow-flight-python.md
Documentation
IMG Reference
- docs/img_and_gold_terms.md - IMG/GOLD glossary: all terms, project types, quality flags, IDs, sequencing status, taxonomy systems
- docs/IMG_data_types.md - Guide to analysis project types (isolates, MAGs, SAGs, metagenomes) with query patterns and data type counts
- docs/IMG-tables-reference.md - Complete IMG table catalog (244 tables in img_core_v400)
- docs/explore_IMG_genomes.md - Genome metadata queries: NCBI/GTDB taxonomy, genome size, GC content, quality filters
Lakehouse Catalog & SQL
- docs/data-catalog.md - All data sources and key tables (GOLD, IMG, Portal, Mycocosm, Phytozome, NUMG)
- docs/phytozome.md - Phytozome plant genomics: proteomes, genes, families, PFAM/PANTHER/GO, expression, synteny, homologs
- docs/sql-quick-reference.md - Dremio SQL syntax
Access & Downloads
- docs/arrow-flight-python.md - Arrow Flight Python setup and test
- examples/05-query-numg-metagenome-proteins.md - NUMG workflow for protein+Pfam queries
- examples/04-download-img-genomes.md - Download with taxon OIDs
- scripts/download_img_genomes.py - Automated download script