name: berdl_start description: Get started with the BERIL Research Observatory. Use when a user is new, wants orientation, or asks what they can do. allowed-tools: Read, Bash user-invocable: true
BERIL Research Observatory - Onboarding
Welcome the user and orient them to the system, then route them to the right context based on their goal.
Phase 1: System Overview
Present this information directly (no file reads needed):
What is BERDL?
The KBase BER Data Lakehouse (BERDL) is an on-prem Delta Lakehouse (Spark SQL) hosting databases across multiple tenants. The exact database inventory and access depend on the authenticated user — discover it live in Phase 1.6 below.
Collections currently span: pangenomes (GTDB-derived species pangenomes with functional annotations and pathway predictions), genome structural data, biochemistry (ModelSEED reactions and compounds), genome-wide fitness (RB-TnSeq), environmental microbiology, multi-omics (NMDC), phage-host data, and marine microbial samples. Always discover the exact set live — do not rely on this prose for current inventory.
Repo Structure
projects/ # Science projects (each has README.md + notebooks/ + data/)
docs/ # Shared knowledge base
schemas/ # Per-collection table/column docs
pitfalls.md # SQL gotchas, data sparsity, common errors
performance.md # Query strategies for large tables
research_ideas.md # Future research directions
overview.md # Scientific context and data generation workflow
discoveries.md # Running log of insights
.claude/skills/ # Agent skills
data/ # Shared data extracts reusable across projects
Available Skills
| Skill | What it does |
|---|---|
/berdl |
Discover BERDL data with access-aware helpers and query with Spark SQL |
/berdl-query |
Run SQL queries locally with remote Spark compute (CLI tools + notebook support) |
/berdl-minio |
Transfer files between BERDL MinIO and local machine |
/berdl-discover |
Explore and document a new BERDL database |
/literature-review |
Search PubMed, bioRxiv, arXiv, Semantic Scholar, and Google Scholar for relevant biological literature |
/synthesize |
Read analysis outputs, compare against literature, and draft findings |
/submit |
Approve a project and upload it to the lakehouse |
/cts |
Run batch compute jobs on the CTS cluster |
Note: Hypothesis generation, research planning, and notebook creation are handled automatically as part of the Unified Project Workflow below. You don't need to invoke them separately.
Existing Projects
Discover projects dynamically — run ls projects/ to list them. Read the first line of each projects/*/README.md to get titles. Present the list to the user so they can see what's been done.
How Projects Work
Each project lives in projects/<name>/ with a three-file structure plus supporting directories:
projects/<name>/
├── README.md — Project overview, reproduction, authors
├── RESEARCH_PLAN.md — Hypothesis, approach, query strategy, revision history
├── REPORT.md — Findings, interpretation, supporting evidence
├── REVIEW.md — Canonical copy of the approved review (set by /submit)
├── REVIEW_N.md — Numbered reviews from /berdl-review (history; latest is canonical at submit time)
├── SUBMITTED.md — Marker file written when lakehouse upload succeeds (visible from `ls`)
├── notebooks/ — Analysis notebooks with saved outputs
├── data/ — Agent-derived data from queries and analysis
├── user_data/ — User-provided input data (gene lists, phenotypes, etc.)
├── figures/ — Key visualizations
└── requirements.txt — Python dependencies
Reproducibility is required: notebooks must be committed with outputs, figures must be saved to figures/, and README must include a ## Reproduction section. See PROJECT.md for full standards.
Phase 1.5: Environment Detection
Run the canonical environment check:
python scripts/berdl_env.py --check
This wraps scripts/detect_berdl_environment.py with auto-recovery. It:
- Detects on-cluster vs off-cluster by testing connectivity to
spark.berdl.kbase.us:443. - On-cluster: confirms
KBASE_AUTH_TOKEN, writes to.envif missing, reports ready. - Off-cluster: checks
.env,.venv-berdl, SSH tunnels (1337, 1338), pproxy (8123). Auto-starts pproxy if tunnels are up. Prints exactssh -f -N -D ...commands if tunnels are missing.
Present the output to the user. If exit code is non-zero, follow the printed next steps. Common cases:
- "Start SSH tunnels" → ask the user to run the printed command in a terminal, then re-run
python scripts/berdl_env.py --check. - "pproxy not running" → the helper auto-starts it; this should self-resolve.
- "Missing .venv-berdl" →
bash scripts/bootstrap_client.sh. - "Missing KBASE_AUTH_TOKEN" → get token from https://narrative.kbase.us/#auth2/account and add to
.env.
Route follow-up BERDL queries from the detected location:
on-cluster: use the active Spark session andspark.sql(query)directly. Do not use--berdl-proxy.off-cluster: use/berdl-queryorpython scripts/run_sql.py --berdl-proxyafter the helper reports ready.
Do not skip this step. The location reported is what every subsequent skill in this session uses to choose its execution path.
Phase 1.6: Live Inventory
After Phase 1.5 reports ready, print the live database inventory:
python scripts/berdl_inventory.py
This is plain python in both environments. It auto-detects on-cluster vs off-cluster and picks the right discovery path.
- On-cluster (JupyterHub): the JH kernel already has every import. Just run the command above.
- Off-cluster (local machine): activate
.venv-berdlfirst (source .venv-berdl/bin/activate), then run the command above. If the venv isn't bootstrapped yet, runbash scripts/bootstrap_client.sh. Ad-hoc alternative without venv:uv run --with pyspark --with "spark_connect_remote @ git+https://github.com/BERDataLakehouse/spark_connect_remote.git" --with "berdl_remote @ git+https://github.com/BERDataLakehouse/berdl_remote.git" scripts/berdl_inventory.py.
Do not use a bare uv run scripts/berdl_inventory.py (without --with) — uv run --script would create an isolated venv that excludes both the JH kernel's berdl_notebook_utils (breaks on-cluster) and the off-cluster Spark deps. The script detects misuse and exits with an actionable message, but pick the right invocation upfront.
Output contract: the script prints a compact tenant-level summary (header + one row per tenant + "other tenants" footer) to stdout, and writes the full per-database markdown report to data/berdl_inventory.md. This split exists because the Claude Code UI auto-collapses long bash output to "+N lines (ctrl+o to expand)" — keeping stdout short means the summary actually surfaces, and the file gives the user a stable artifact they can open in their editor regardless of how the chat displays bash output.
What you need to do in your reply:
- Paste the stdout summary verbatim (it's short — the AGENT comment at the end of the summary reinforces this).
- Tell the user the full per-database report is at
data/berdl_inventory.md. Do NOT try to relay that file's contents into chat unless the user explicitly asks — they can open it directly. - If the user asks for a specific tenant's databases, use
Readondata/berdl_inventory.mdand excerpt just that section, OR re-run withpython scripts/berdl_inventory.py --fulland quote the relevant block.
Same command works on-cluster and off-cluster — on-cluster it uses access-aware berdl_notebook_utils.get_db_structure() plus list_tenants() / get_tenant_detail() for tenant metadata. Off-cluster (with .venv-berdl active), the script falls back to SHOW DATABASES + SHOW TABLES via the local Spark Connect drop-in (which auto-spawns the JH server on cold start). Tenant metadata is on-cluster only; off-cluster groups by the database's underscore prefix without descriptions.
Useful flags:
--full— print the full report to stdout instead of the compact summary (also still writes the file unless--no-file).--no-file— skip writingdata/berdl_inventory.md.--output PATH— override the file destination.--sample 5— show up to 5 table names per database in the full report (default: 3).--with-members— list each tenant's read-write and read-only members in the full report.--no-emoji— plain-text output.--off-cluster— force the off-cluster path.
For deeper inspection, suggest the user run:
DESCRIBE DATABASE EXTENDED <db>— database-level description / location / properties.DESCRIBE EXTENDED <db>.<table>— table-level description, columns, partitioning, storage.berdl_notebook_utils.get_table_schema(db, table, detailed=True, return_json=False)— column-level info (name, type, nullable, description).
Do not run COUNT(*) per database in this phase — it's expensive and not needed for orientation.
Phase 1.7: Session Naming Reminder
If the session does not already have a name, remind the user:
Tip: Name this session for easy identification — especially useful for long-running or remote sessions where the connection may drop. A good convention is to match the project name and git branch (e.g.,
essential_metabolome).
This is a non-blocking reminder. Move on to Phase 2 regardless.
Phase 2: Establish Project Context
Every BERIL session works inside a project — including ad-hoc exploration. This gives every artifact (queries, user data, notes, figures) a home from minute one, makes work resumable across sessions, and avoids loose exploratory/ clutter.
Ask the user which of these applies:
- Continue an existing project — pick from
projects/*(recommended if any work is in flight) - Start a new project — name it now (e.g.,
metal_cofitness,lanthanide_genomes) - I'm just exploring — agent uses
scratch_<YYYYMMDD>(or<topic>_scratchif the user gives a keyword)
There is no "no-project" door. If the user truly wants only to learn about the system without committing to a session, point them at the Side Path at the end of this skill — that's reading, not a session.
Once a project context is established, the workflow is the same regardless of entry door:
| Phase | Status (beril.yaml) |
What happens |
|---|---|---|
| Phase 0 — Scaffold | (unset → exploration) |
Create dirs, beril.yaml, stub README; pick branch. Runs only for entry doors 2 and 3. |
| Phase A — Orientation & Exploration | exploration |
Read docs, explore data, accept user_data, develop hypotheses. |
| Phase B — Research Plan | exploration → proposed |
Write RESEARCH_PLAN.md. STOP. |
| Checkpoint — Plan Review | proposed |
Mandatory pause: approve, review, or iterate. Do not skip. |
| Phase C — Analysis | proposed → active |
Write & execute notebooks, save figures, capture pitfalls. |
| Phase D — Synthesis | active → analysis |
/synthesize → REPORT.md. |
| Phase E1 — Review | analysis → reviewed |
/berdl-review produces a numbered REVIEW_N.md with hash footer. Iterate freely. |
| Phase E2 — Approve & Submit | reviewed → complete (+ SUBMITTED.md on upload success) |
/submit verifies the latest review is current, asks for explicit approval, uploads to lakehouse, writes marker file. |
For entry door 1 (Continue existing): start at the phase matching the project's current status. For doors 2 and 3: start at Phase 0.
Unified Project Workflow
The flow below applies regardless of entry door. Doors 2 and 3 (new project, just exploring) start at Phase 0. Door 1 (continue existing) skips Phase 0 and resumes at the phase matching the project's current status in beril.yaml.
Phase 0: Project Scaffold (entry doors 2 and 3 only)
Establish the project directory and manifest before any querying, planning, or user-data handoff. This guarantees every artifact has a home from the first command.
- Resolve project name:
- Door 2 (new project): ask for a snake_case identifier (e.g.,
metal_cofitness,lanthanide_genomes). - Door 3 (just exploring): default to
scratch_<YYYYMMDD>. If the user mentioned a topic keyword (e.g., "metal cofitness"), use<topic>_scratch.
- Door 2 (new project): ask for a snake_case identifier (e.g.,
- Verify uniqueness:
projects/<id>/must not already exist. If it does, ask whether to switch to door 1 (continue existing) or pick a new name. - Author identity: run
beril user --jsonvia Bash. The JSON hasname,affiliation,orcid. Exit 0 = all three present, use them. Exit 1 = at least one missing; parse the JSON anyway, prompt only for missing fields, then suggestberil setupto persist. - Create directories:
projects/<id>/{notebooks,data,user_data,figures}. Add a.gitkeepfile in each (touch projects/<id>/notebooks/.gitkeepetc.) so git tracks the directories through the scaffold commit. Without.gitkeep, the empty dirs are dropped from the commit and the "every artifact has a home" promise breaks on checkout elsewhere. - Write
beril.yaml(template at the bottom of this file):status: exploration,created_atandlast_session_atset to current ISO timestamp,engine.nameset to the current agent (e.g.,claude), authors from beril user, allartifactsflags false initially.branchis set in step 6. - Branch (recommended): offer to create branch
projects/<id>and switch to it. Long-running projects onmaincreate merge pain. If the user declines, leave on the current branch. Record the actual branch inberil.yaml. - Write a stub
README.md(template at the bottom): title (humanized from project_id), Status block reading "Exploration — research plan not yet written", authors from beril user, Quick Links pointing atRESEARCH_PLAN.md(TBD) andREPORT.md(TBD), Reproduction placeholder. - Update
beril.yaml:artifacts.readme: true. - Commit:
feat(project): scaffold {id} (exploration phase). - Suggest naming this session to match the project: "Consider naming this session
{project_id}to match the branch — useful for long-running or remote sessions where the connection may drop."
After Phase 0, every artifact has a home. User data → projects/<id>/user_data/. Exploration queries → projects/<id>/notebooks/00_*.ipynb. References → projects/<id>/references.md. Move on to Phase A.
Phase A: Orientation & Exploration
Status: exploration. Read context, explore data, accept user-supplied input, and develop hypotheses — all inside the project directory.
Required reading before designing any queries:
PROJECT.md— dual goals (science + knowledge capture), project structure, reproducibility standards, JupyterHub workflow, Spark notebook patterns.docs/overview.md— data architecture, key tables, generation workflow, known limitations.docs/pitfalls.mdanddocs/performance.md— critical: read before any query design. These are the frozen historical archives; per-project pitfalls hit by recent projects also live inprojects/*/memories/pitfalls.md(worth a scan, especially for projects on the same database family).docs/research_ideas.md— check for related ideas; avoid duplicating work.- Use
berdl_notebook_utils.get_databases(return_json=False),get_tables(... return_json=False), andget_table_schema(... detailed=True, return_json=False)for live access-aware discovery. For database-specific gotchas, grepdocs/pitfalls.mdfor the database name (e.g.,grep -A 20 "^## kbase\.ke_pangenome$" docs/pitfalls.md); also checkprojects/*/memories/pitfalls.mdfor any project-tagged entries on the same database.
Setup check (Phase 1.5 already verified KBASE_AUTH_TOKEN and proxy):
6. gh auth status — needed for branches/PRs. Prompt gh auth login if missing.
Engagement (status stays exploration):
- Discuss the user's research interest and goals.
- If the user has input data (gene lists, phenotype tables, FASTAs, SQLite, etc.): drop it in
projects/<id>/user_data/. Never leave user-supplied data in~/or the repo root. - Run exploratory queries with
/berdl. For any query worth keeping, save it as a numbered exploration notebook (projects/<id>/notebooks/00_exploration.ipynb, then00b_*.ipynbif you need more). Even rough exploration gets a home. - Search literature with
/literature-reviewif relevant. References go toprojects/<id>/references.md. - Check related existing projects in
projects/— read their READMEs to understand prior work. - Develop 2-3 testable hypotheses with H0/H1.
When the user has a clear hypothesis and is ready to commit to a plan, transition to Phase B.
Phase B: Research Plan
Status transition: exploration → proposed.
- Write
projects/<id>/RESEARCH_PLAN.md(template at the bottom of this file): Research Question, Hypothesis (H0/H1), Literature Context, Approach, Data Sources, Query Strategy (tables, filter strategy, performance tier), Analysis Plan (numbered notebooks with goals + expected outputs), Expected Outcomes, Revision History (v1 with today's date), Authors. - Update
projects/<id>/README.mdStatus block to "Proposed — research plan written, awaiting analysis." Fill in any other sections that became clearer (Overview, Research Question). - Update
beril.yaml:status: proposed,last_session_atto now,artifacts.research_plan: true. - Commit:
feat(project): research plan for {id}.
STOP HERE. The plan is the contract for what comes next. Do NOT write or execute notebooks yet. Proceed to the Checkpoint.
Checkpoint: Plan Review (mandatory pause)
This pause is the key guard against the agent rushing from plan to compute without human or independent review. Do not skip it.
Present the plan to the user concisely:
- Title and refined research question
- Hypothesis (H0/H1)
- Query strategy summary (tables touched, filter strategy, estimated complexity)
- Expected outcomes (what supports H1, what supports H0, potential confounders)
Then ask explicitly:
"Plan ready to start analysis? (a) Approve and continue to Phase C (Analysis) (b) Run an independent review first —
bash tools/review.sh {project_id} --type planwritesPLAN_REVIEW_<n>.md. Use--reviewer codexfor a second opinion. (c) Iterate on the plan"
Do not proceed to Phase C until the user picks (a).
- If (b): invoke the plan reviewer, present
PLAN_REVIEW_<n>.mdto the user, then re-ask. Reviewer output is advisory — the user has final say. - If (c): iterate on
RESEARCH_PLAN.md. Record changes in Revision History as- **v2** ({date}): {change description}, then re-ask.
Phase C: Analysis (Notebooks)
Status transition: proposed → active.
- Update
beril.yaml:status: active,last_session_atto now. - Write numbered analysis notebooks (
01_data_exploration.ipynb,02_analysis.ipynb, ...) following the analysis plan inRESEARCH_PLAN.md. - Notebooks are the primary audit trail — do as much work as possible in notebooks so humans can inspect intermediate results.
- When parallel execution or complex pipelines are needed, write scripts in
projects/<id>/src/but call them from notebooks. - Run notebooks — execute cells, inspect outputs, iterate.
- If new information emerges that changes the approach, update
RESEARCH_PLAN.mdRevision History as- **v{n}** ({date}): {change}before continuing. - Commit frequently — after each major milestone (notebook complete, data extracted, key result reproduced).
- Re-read
docs/pitfalls.mdwhen something doesn't behave as expected.
Checkpoint: Results Review
After notebooks are executed and committed, pause and present key results before synthesis.
- Summarize the key results: main statistics, notable patterns, anything unexpected.
- Ask: "Look at the notebooks/figures before I write up findings, or proceed with
/synthesize?" - If the user wants to explore first, wait. If they want changes, iterate on the notebooks before proceeding.
Phase D: Synthesis & Writeup
Status transition: active → analysis (handled by /synthesize itself).
- Run
/synthesizeto createREPORT.md. The skill updatesberil.yamlautomatically (status: analysis,artifacts.report: true,last_session_at) and updatesREADME.mdStatus to "Analysis — report drafted, awaiting/berdl-reviewand/submit." - Commit the report and updated
beril.yaml. - Discuss the report with the user — revise if needed.
Phase E1: Review (analysis → reviewed)
Status transition handled by /berdl-review. The user can iterate freely — different models, multiple opinions — each run produces a numbered REVIEW_N.md. Each review file embeds a <!-- report_hash: sha256:... --> footer (written by tools/review.sh) so /submit can later confirm the review covers the current REPORT.md.
- Run
/berdl-review {project_id}(or with--reviewer codexfor a different model). After the first successful review, status flips fromanalysistoreviewed. - If issues are raised that need the report or notebooks updated, address them. If
REPORT.mdchanges (re-running/synthesize), the project silently demotes back toanalysisand existing reviews go stale via hash mismatch — run/berdl-reviewagain to produce a current review. - When the review looks good (or has only issues the user is comfortable approving despite), proceed to Phase E2.
Phase E2: Approve & Submit (reviewed → complete)
Status transition handled by /submit. The lifecycle treats complete as a human act: the user, as the responsible author, explicitly says "yes, this is done." Lakehouse upload follows; on success a visible SUBMITTED.md marker lands in the project so anyone can see at a glance that the project was submitted.
- Run
/submit {project_id}. The skill:- Runs the pre-submission checklist.
- Verifies the latest
REVIEW_N.md's footer matches the currentREPORT.mdhash (rejects stale reviews). - Verifies a configured ORCID (
beril user --jsonmust return one). - Asks for explicit approval (y/n).
- Writes the approval to
beril.yamland copies the approved review toREVIEW.md. - Uploads to the lakehouse.
- Writes
SUBMITTED.md(success) orSUBMISSION_FAILED.md(failure). Status flips tocompleteregardless of upload outcome — the upload is tracked separately via the marker files andberil.yaml.submissions[].
- If upload fails, just re-run
/submitafter fixing the issue (e.g., MinIO config). It recognizes the existing approval and only retries the upload. - Discuss next steps with the user once
SUBMITTED.mdis in place.
Re-opening a complete project
If the user discovers an error after submission and wants to revise: run /synthesize (or edit notebooks then /synthesize). Because status is complete, /synthesize will explicitly ask for confirmation before overwriting REPORT.md and demoting to analysis. The previous approval gets archived under previous_approvals in beril.yaml and both marker files are deleted. Then iterate /berdl-review and /submit again — the second /submit's approval prompt will warn that this replaces the existing lakehouse archive.
Throughout the Workflow
- Commit code often — don't let work accumulate uncommitted.
- Capture pitfalls as you go: when something goes wrong, follow
.claude/skills/pitfall-capture/SKILL.md— it appends toprojects/<id>/memories/pitfalls.md(per-project, append-only with corrections). Don't write to the centraldocs/pitfalls.md; it's a frozen historical archive. - Capture discoveries and performance notes in REPORT.md: when you find something worth surfacing across projects, draft it in the optional
## Discoveriessection ofREPORT.md(added by/synthesize). Project-specific tuning observations go in## Performance Notes. These flow through/berdl-reviewand get extracted intoprojects/<id>/memories/{discoveries,performance}.mdby/submitat approval — only after review and approval, so the memories layer reflects vetted content. - When debugging or designing queries, check both layers: grep
docs/pitfalls.md(historical archive — still has most of the canonical gotchas) AND scanprojects/*/memories/pitfalls.mdfor recent gotchas hit by related projects (especially the same database family). - Re-read
docs/performance.mdwhen queries are slow — it's still the canonical reference for table-size strategies and anti-patterns. Project-specific tuning hits captured at past approvals live inprojects/*/memories/performance.md. - Follow
PROJECT.mdstandards — notebooks with saved outputs, figures as standalone PNGs,requirements.txt, Reproduction section in README.
Entry Door 1: Continue an Existing Project (resumption)
Phase 0 is skipped. Resume at the phase matching the project's current status.
Run
ls projects/and list all projects.For each project, read
beril.yaml(if present) and display:status,last_session_at,branch, and whichartifactsare present/missing. Ifberil.yamlis missing (pre-manifest project), note "no manifest" — still works, just legacy.The user picks one. Read its
README.md,RESEARCH_PLAN.md, andREVIEW.md(if present). ReadREPORT.mdifartifacts.reportis true.Update
beril.yaml:last_session_atto now.Resume detection on
reviewed/complete: before suggesting a phase, validate currency. Thecompletechecks must match what/submitPhase 1a does — otherwise the user could see a project listed ascompletehere and then watch/submitreject it on the same conditions.Hash comparison rule: stored hashes in
beril.yaml.approval.*and the REVIEW footer carry asha256:prefix; computed hashes fromsha256sum(orhashlib) are raw hex. Usetools.notebook_hash.unprefixed()to strip the prefix from stored values before comparing to computed hex. Never compare the prefixed and raw forms directly.- For
complete: validate all approved content (REPORT.md, the review files, AND notebooks), not just REPORT.md.projects/{id}/REPORT.mdmust exist andsha256sum REPORT.md == unprefixed(approval.report_hash).projects/{id}/REVIEW.mdexists →sha256sum REVIEW.md == unprefixed(approval.review_hash). Missing alone is fine —/submitPhase 3a recreates it from the numbered review.projects/{id}/{approval.review}(numbered file, e.g.REVIEW_3.md) exists → its hash matchesunprefixed(approval.review_hash). Missing alone is fine. Missing together with REVIEW.md is approved-content loss.- Notebook hashes (v5): if
approval.notebook_hashesis present and non-empty, invokepython {repo_root}/tools/notebook_hash.py compute-hashes {project_path}(use absolute paths;/berdl_startmay run from anywhere). Parse the JSON output and compare each entry againstapproval.notebook_hashes. Any mismatch, missing notebook from the approval set, or new notebook in the current set → drift. (Empty / omittednotebook_hashesfield is treated as legacy and skipped.) - Any hash mismatch, OR REPORT.md missing, OR both review files missing, OR a notebook drift → offer demote-to-
analysis. Phrase the prompt to match the cause ("REPORT.md changed since approval", "REVIEW.md changed since approval", "REVIEW_N.md changed since approval", "REPORT.md is missing — restore it from version control", "Both REVIEW.md and REVIEW_N.md missing on a complete project", "notebook(s) changed since approval:- ", etc.). On Yes: move
approvaltoprevious_approvals(append) with an addedarchived_at: "<now>"field, setstatus: analysis, update README## Statusto "Analysis — report drafted, awaiting/berdl-reviewand/submit.", deleteREVIEW.mdif present, and delete both marker files. If user declines, warn but leave alone.
- For
reviewed: parse the latestREVIEW_N.mdfooter hash (extract the hex aftersha256:) and compare tosha256sum REPORT.md. If mismatch → offer demote-to-analysis(existing reviews stale). On Yes: setstatus: analysis, update README## Statusto "Analysis — report drafted, awaiting/berdl-reviewand/submit." If user declines, warn but leave alone.
- For
Resume at the phase matching the (possibly demoted) status:
exploration→ Phase A (Orientation & Exploration).proposed→ Checkpoint (re-present plan, ask the (a)/(b)/(c) question).active→ Phase C (Analysis).analysis→ Phase E1 (Review) — suggest/berdl-review. IfREPORT.mddoesn't exist yet (rare), drop back to Phase D.reviewed→ Phase E2 (Approve & Submit) — suggest/submit.complete:SUBMITTED.mdpresent → "Project complete and submitted on {SUBMITTED.md submitted_at}; archive: {archive_key}. Inspect, reopen for revisions (run/synthesize), or move on?"SUBMISSION_FAILED.mdpresent (or both markers absent) → "Project approved locally but lakehouse upload pending. Re-run/submitto retry."
- missing/legacy → infer from artifacts present (REPORT exists → Phase E1; RESEARCH_PLAN exists but no REPORT → Phase C).
Suggest the next concrete action based on the resumed phase.
Side Path: Understanding the System (no project)
If the user truly wants only to understand BERIL without committing to a session, point them at:
PROJECT.md— dual goals (science + knowledge capture), structuredocs/overview.md— scientific context and data workflowdocs/pitfalls.md— per-database non-derivable gotchasdocs/research_ideas.md— backlog and future directions
Also explain:
- Current inventory and access should be discovered live with
berdl_notebook_utils.get_databases(return_json=False). - The documentation workflow (tag discoveries, update pitfalls) — captured in
PROJECT.md. - The BERDL JupyterHub UI is browsable at https://hub.berdl.kbase.us.
- The available skills and what each does (table at the top of this skill).
This is reading, not a session. After reading, the user re-invokes /berdl_start to begin actual work — at which point Phase 2's three-door menu applies.
Companion skills used during the workflow
These are project-agnostic helpers — invoke them from inside any project at the right phase:
/berdl— query BERDL with Spark SQL. Use during Phase A and Phase C./berdl-query— local-machine variant for off-cluster work./berdl-discover— explore and document a new database that isn't yet indocs/pitfalls.md./berdl-minio— file transfer between BERDL MinIO and local./literature-review— search PubMed, bioRxiv, arXiv, Semantic Scholar, Google Scholar. Writes toprojects/<id>/references.md.
MCP setup check for /literature-review: the paper-search-mcp (openags/paper-search-mcp) is configured in .mcp.json and runs via uvx paper-search-mcp. If it fails:
- Ensure Python 3.10+ and uv are installed.
- Test:
uvx --from paper-search-mcp python -m paper_search_mcp.server. - Optionally set
SEMANTIC_SCHOLAR_API_KEYfor enhanced Semantic Scholar features. - The skill falls back to WebSearch if the MCP server is unavailable.
Key Principles (for the agent)
- Always start in a project — every session establishes a project context (new, existing, or
scratch_<date>) before doing any work. User data, queries, exploration notebooks, references — all live inprojects/<id>/. Never leave artifacts loose in the repo root,~/, orexploratory/. - Read the docs and discover live data first —
PROJECT.md,docs/overview.md,docs/pitfalls.md,docs/performance.mdbefore designing any queries. Useberdl_notebook_utils.get_databases(return_json=False),get_tables(... return_json=False), andget_table_schema(... detailed=True, return_json=False)for live inventory. - Notebooks are the audit trail — numbered sequentially (00 for exploration, 01+ for analysis), each self-contained with a clear purpose. Commit with saved outputs per
PROJECT.mdreproducibility standards. - Commit early and often — after scaffold, after plan, after each notebook, after data extraction, after analysis, after synthesis.
- Branch by default — create a
projects/{project_id}branch in Phase 0. Extended work onmaincauses merge pain and risks conflicting with other contributors. If the user explicitly prefersmain, respect that. - Never skip the plan-review checkpoint — after writing
RESEARCH_PLAN.md, STOP. Do not write or execute analysis notebooks until the user explicitly chooses (a) Approve. This is the most important rule and the easiest to violate. - Update the plan when reality changes — if analysis reveals something that changes the approach, update
RESEARCH_PLAN.mdRevision History before continuing. The plan is a contract; revisions are explicit, not silent. - Drive the process forward between checkpoints — checkpoints are explicit pause-points (Plan Review, Results Review). Outside of those, keep moving — don't stop after every individual step asking permission.
- Document as you go — pitfalls live-captured to
projects/<id>/memories/pitfalls.mdvia/pitfall-capture; discoveries and performance notes drafted in REPORT.md## Discoveries/## Performance Notessections (extracted by/submitto per-project memories at approval). The centraldocs/{pitfalls,discoveries,performance}.mdfiles are frozen historical archives — don't write to them. - Use Spark patterns from PROJECT.md —
get_spark_session(), PySpark-first,.toPandas()only for final small results.
Critical Pitfalls (always mention)
Surface these early. Database-specific gotchas live in docs/pitfalls.md per-database H2 sections (frozen historical archive; grep -A 20 "^## <database_name>$" docs/pitfalls.md). Per-project pitfalls hit by recent projects on the same database family also live in projects/*/memories/pitfalls.md and are worth a scan. Universal pitfalls:
- Species IDs contain
--— This is fine inside quoted strings in SQL. Use exact equality (WHERE id = 's__Escherichia_coli--RS_GCF_000005845.2'), not LIKE patterns. - Large tables need filters — Some tables are very large (e.g.,
gene,genome_ani,reaction_similarity). Always filter by a partitioned/indexed column. Checkdocs/pitfalls.mdper-database section for the current list. - Annotation coverage varies — Some annotation tables (e.g., AlphaEarth embeddings) cover only a fraction of genomes. Check coverage with
COUNT(DISTINCT id)before relying on them. - Match Spark import to environment — On JupyterHub notebooks:
spark = get_spark_session()(no import). On JupyterHub CLI/scripts:from berdl_notebook_utils.setup_spark_session import get_spark_session. Locally:from get_spark_session import get_spark_session(requires.venv-berdl+ proxy chain). See.claude/skills/berdl-query/references/off-cluster-mechanics.mdfor off-cluster setup. - Auth token — stored in
.envasKBASE_AUTH_TOKEN(notKB_AUTH_TOKEN). - String-typed numeric columns — Some databases store numbers as strings. Always CAST before comparisons; check
docs/pitfalls.mdper-database section. - Gene clusters are species-specific — Cannot compare cluster IDs across species. Use COG/KEGG/PFAM for cross-species comparisons.
- Avoid unnecessary
.toPandas()—.toPandas()pulls all data to the driver node and can be very slow or cause OOM errors. Use PySpark DataFrame operations for filtering, joins, and aggregations. Only convert to pandas for final small results (plotting, CSV export).
Templates
RESEARCH_PLAN.md
# Research Plan: {Title}
## Research Question
{Refined question after literature review}
## Hypothesis
- **H0**: {Null hypothesis}
- **H1**: {Alternative hypothesis}
## Literature Context
{Summary of what's known, key references, identified gaps}
## Query Strategy
### Tables Required
| Table | Purpose | Estimated Rows | Filter Strategy |
|---|---|---|---|
| {table} | {why needed} | {count} | {how to filter} |
### Key Queries
1. **{Description}**:
\```sql
{query}
\```
### Performance Plan
- **Tier**: {local bounded Spark SQL / JupyterHub Spark SQL}
- **Estimated complexity**: {simple / moderate / complex}
- **Known pitfalls**: {list from pitfalls.md}
## Analysis Plan
### Notebook 1: Data Exploration
- **Goal**: {what to verify/explore}
- **Expected output**: {CSV/figures}
### Notebook 2: Main Analysis
- **Goal**: {core analysis}
- **Expected output**: {CSV/figures}
### Notebook 3: Visualization (if needed)
- **Goal**: {figures for findings}
## Expected Outcomes
- **If H1 supported**: {interpretation}
- **If H0 not rejected**: {interpretation}
- **Potential confounders**: {list}
## Revision History
- **v1** ({date}): Initial plan
## Authors
{ORCID, affiliation}
README.md
# {Title}
## Research Question
{Refined question}
## Status
In Progress — research plan created, awaiting analysis.
## Overview
{One-paragraph summary of the hypothesis and approach}
## Quick Links
- [Research Plan](RESEARCH_PLAN.md) — hypothesis, approach, query strategy
- [Report](REPORT.md) — findings, interpretation, supporting evidence
## Reproduction
*TBD — add prerequisites and step-by-step instructions after analysis is complete.*
## Authors
{Authors}
beril.yaml (project manifest)
project_id: {project_id}
status: exploration # exploration | proposed | active | analysis | reviewed | complete
created_at: "{ISO 8601 timestamp}"
last_session_at: "{ISO 8601 timestamp}"
branch: projects/{project_id}
engine:
name: {agent name, e.g. claude}
authors:
- name: "{author name}"
affiliation: "{affiliation}"
orcid: "{ORCID}"
artifacts:
readme: false
research_plan: false
report: false
review: false
Status transitions: exploration (project scaffolded, no plan yet) → proposed (RESEARCH_PLAN.md written) → active (notebooks started) → analysis (REPORT.md written; set by /synthesize) → reviewed (a REVIEW_N.md covers the current REPORT.md; set by /berdl-review) → complete (user approved and /submit recorded the approval). Upload outcome is tracked separately via the SUBMITTED.md / SUBMISSION_FAILED.md marker files in the project directory. Downstream skills gate on status: /submit requires reviewed (or already-complete for retries); /synthesize requires active for the forward path (and demotes reviewed silently or complete with confirmation when re-run). Update artifacts flags as each file is created. Update last_session_at whenever resuming work. The full audit trail of approvals and uploads lives in beril.yaml.approval, beril.yaml.previous_approvals, and beril.yaml.submissions[].