kbase-query - SKILL.md Agent Skill

name: kbase-query description: Skills for querying the KBase/BERDL Datalake via the MCP REST API. Use this when users want to explore KBase databases, list tables, get schemas, sample data, or run SQL queries against the KBase data lake. Triggers on mentions of KBase, BERDL, or requests to query biological/microbiome data stored in KBase. category: domain

KBase Query

Query the KBase/BERDL Datalake MCP Server via REST API using Python.

Setup

The KBASE_TOKEN environment variable must be set. Tokens expire after ~1 week.

Python API Functions

Use execute_code with these helper functions to query the lakehouse:

import os
import requests
import pandas as pd

KBASE_TOKEN = os.environ.get('KBASE_TOKEN')
BASE_URL = 'https://hub.berdl.kbase.us/apis/mcp'

def get_headers():
    return {
        'Authorization': f'Bearer {KBASE_TOKEN}',
        'Content-Type': 'application/json',
        'accept': 'application/json'
    }

def list_databases():
    '''List all available databases'''
    resp = requests.post(
        f'{BASE_URL}/delta/databases/list',
        headers=get_headers(),
        json={'use_hms': True, 'filter_by_namespace': True}
    )
    resp.raise_for_status()
    return resp.json()['databases']

def list_tables(database):
    '''List tables in a database'''
    resp = requests.post(
        f'{BASE_URL}/delta/databases/tables/list',
        headers=get_headers(),
        json={'database': database, 'use_hms': True}
    )
    resp.raise_for_status()
    return resp.json()['tables']

def query(sql, limit=100):
    '''Execute SQL query and return DataFrame'''
    resp = requests.post(
        f'{BASE_URL}/delta/tables/query',
        headers=get_headers(),
        json={'query': sql, 'limit': limit}
    )
    resp.raise_for_status()
    data = resp.json()
    rows = data.get('result', [])
    return pd.DataFrame(rows)

Example Workflow

1. Explore available databases

dbs = list_databases()
print(f"Available databases: {dbs}")
# → ['enigma_coral', 'nmdc_core', 'globalusers_kepangenome_parquet_1', ...]

2. List tables in a database

tables = list_tables('nmdc_core')
print(f"Found {len(tables)} tables: {tables[:10]}")
# → ['annotation_terms_unified', 'cog_categories', 'kegg_ko_module', ...]

3. Query data with SQL

# Simple query
df = query("SELECT * FROM nmdc_core.kegg_ko_module LIMIT 10")
print(df)

# Aggregation query
df = query("""
    SELECT module_id, COUNT(*) as ko_count
    FROM nmdc_core.kegg_ko_module
    GROUP BY module_id
    ORDER BY ko_count DESC
    LIMIT 20
""")
print(df)

# Join query (when needed)
df = query("""
    SELECT a.*, b.description
    FROM nmdc_core.kegg_ko_module a
    JOIN nmdc_core.kegg_modules b ON a.module_id = b.module_id
    LIMIT 10
""")

Key Databases

Database	Description
`nmdc_core`	NMDC microbiome data (63 tables)
`globalusers_kepangenome_parquet_1`	Pangenomic data with GTDB taxonomy
`enigma_coral`	ENIGMA coral microbiome data

Key Tables in nmdc_core

Table	Description
`kegg_ko_module`	KEGG ortholog to module mappings
`cog_categories`	COG functional categories
`go_terms`	Gene Ontology terms
`ec_terms`	Enzyme Commission terms
`annotation_terms_unified`	Unified annotation data

API Reference

For complete endpoint documentation, see references/api_reference.md.

Available Endpoints

Endpoint	Method	Description
`/delta/databases/list`	POST	List databases
`/delta/databases/tables/list`	POST	List tables in database
`/delta/databases/tables/schema`	POST	Get table schema
`/delta/tables/query`	POST	Execute SQL query
`/delta/tables/sample`	POST	Sample rows (may timeout on large tables)
`/delta/tables/count`	POST	Get row count

Tips

Use query() with SQL for most reliable results
Large tables may timeout with /delta/tables/sample - use SQL with LIMIT instead
SQL queries support standard operations: SELECT, JOIN, WHERE, GROUP BY, ORDER BY
Maximum query limit is 1000 rows per request