name: download-bag
description: "ALWAYS use this skill when getting data OUT of a Deriva catalog as a BDBag — exporting a slice of rows + their FK-reachable relations + the bulk objects they reference into a portable, self-describing, checksummed archive. Covers what a BDBag is, the two export paths (server-side export service via deriva-export / DerivaExport, or client-side orchestration via deriva-download-cli / DerivaDownload), authoring the export spec (the JSON config that defines what to include), the bdbag CLI for validating and materializing bags, asset materialization and caching strategy. Standalone — works on any Deriva catalog. Triggers on: 'download a bag', 'export a bag', 'BDBag', 'export catalog data', 'pull data out', 'download dataset' (when the user means the bag-export mechanism, not the DerivaML Dataset entity), 'deriva-download-cli', 'deriva-export', 'export spec', 'snapshot the catalog', 'bag manifest', 'materialize assets', 'self-describing archive', 'portable export', 'reproducible data drop', 'data package', 'how do I get this data offline'."
user-invocable: true
disable-model-invocation: true
Downloading a BDBag from a Deriva Catalog
This skill covers getting catalog data out as a BDBag (Big Data Bag) — a self-describing, portable, checksummed archive that packages a slice of rows + their FK-reachable relations + the bulk objects they reference, suitable for sharing, archiving, or offline use.
For loading data into a catalog, see /deriva:load-data. For querying / browsing without exporting, see /deriva:query-catalog-data.
If you have the deriva-ml plugin loaded and the data lives inside a DerivaML Dataset entity (with a version, members, and a release lifecycle), use /deriva-ml:dataset-lifecycle (Phase 5) instead; its dataset.download_dataset_bag(version) API generates the export spec for you and handles version pinning. This skill is for the catalog-primitive path: exporting an arbitrary slice from any Deriva catalog, with or without DerivaML installed.
What is a BDBag?
A BDBag is an extension of the BagIt packaging format with two additions that matter for Deriva:
- A fetch manifest (fetch.txt): lets a bag reference remote files (Hatrac assets) without bundling the bytes. The bag can be small (manifest only) or fully self-contained (with assets fetched and embedded); bdbag --materialize turns the former into the latter.
- Metadata files: a manifest-{algorithm}.txt per checksum algorithm (md5, sha256) lists every file and its expected hash; bag-info.txt carries human-readable metadata about the bag itself.
The shape on disk after extraction:
my-bag/
├── bag-info.txt # bag metadata (creator, date, ...)
├── bagit.txt # BagIt version
├── manifest-md5.txt # checksums of every file in data/
├── tagmanifest-md5.txt # checksums of the metadata files themselves
├── fetch.txt # (optional) remote files to materialize
└── data/
    ├── records/ # CSV per exported table
    │   ├── MyProject.Subject.csv
    │   └── MyProject.Image.csv
    ├── assets/ # (after materialize) the actual Hatrac files
    │   └── Image/image-001.png
    └── schema.json # catalog schema as of the export
The "self-describing" part is load-bearing. Hand a bag to a colleague six months from now and they can verify integrity (bdbag --validate), see what's inside (manifest-md5.txt), check the schema it was exported under (schema.json), and either use the assets directly (if materialized) or fetch them on demand (bdbag --resolve-fetch).
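To make "verify integrity" concrete, here is a minimal standard-library sketch of the check that a manifest enables. It is not how bdbag is implemented, and the function name verify_manifest is hypothetical; it only assumes the BagIt convention that each manifest line holds a hex digest followed by a path relative to the bag root.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir: str, algorithm: str = "md5") -> list[str]:
    """Return the payload paths that are missing or fail their manifest checksum."""
    bag = Path(bag_dir)
    problems = []
    manifest = bag / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<hex digest> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            # Listed in the manifest but absent (for example, not yet fetched).
            problems.append(rel_path)
            continue
        digest = hashlib.new(algorithm)
        with target.open("rb") as fh:
            # Hash in 1 MiB chunks so large assets do not need to fit in memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            problems.append(rel_path)
    return problems
```

An empty result means every listed file is present and matches. For real bags, prefer bdbag's own validation, which also covers the tag manifests and the bag structure.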
The two export paths
| Path | When to use | Lives where |
|---|---|---|
| Server-side export service (DerivaExport / export annotation) | Production exports tied to a catalog's published export profiles; you want the server to do the work; you want the result available by URL | deriva.transfer.download.DerivaExport; backed by the catalog's /deriva/export/bdbag/... endpoint |
| Client-side orchestrated (deriva-download-cli / DerivaDownload) | Custom one-off exports; you control the spec; you want it run locally; you need custom processors | deriva.transfer.download.DerivaDownload; deriva-download-cli |
Both paths take the same export spec format. What differs is who executes it — the server (export service) or the client (download class). The server path is preferred when an export is part of a published workflow (e.g., "export this view" buttons in Chaise call the server endpoint); the client path is preferred when you're authoring a one-off spec and iterating, or when no server-side export profile fits.
There is no MCP tool for bag download. The MCP server (deriva-mcp-core) does not expose a download_bag or export_bag tool. The skill below uses the deriva-py Python API and the deriva-download-cli directly. (Companion /deriva-ml:dataset-lifecycle documents a deriva_ml_bag_info MCP tool; that's for previewing a dataset bag before download, not for triggering the download itself.)
Path 1 — Client-side: deriva-download-cli + export spec
deriva-download-cli is the production path for client-driven bag export. You point it at a host, catalog, and an export spec; it executes the spec's queries, fetches assets, assembles a bag, and writes it to a local directory.
# Install (provides the CLI; same package as deriva-upload-cli)
uv pip install deriva
# Run the export
deriva-download-cli \
    --host data.example.org \
    --catalog 1 \
    --config-file export-spec.json \
    --output-dir ./output \
    --envar DATASET_RID=2-XXXX # template substitution into the spec
What you get: a bag directory under ./output, ready to validate (bdbag --validate) or compress (bdbag --archiver zip).
Python equivalent
from deriva.core import get_credential
from deriva.transfer.download.deriva_download import DerivaDownload
import json
with open("export-spec.json") as f:
    config = json.load(f)
downloader = DerivaDownload(
    server={"host": "data.example.org", "catalog_id": "1", "protocol": "https"},
    output_dir="./output",
    config=config,
    credentials=get_credential("data.example.org"),
    timeout=(10, 1800), # (connect, read) in seconds
    envars={"DATASET_RID": "2-XXXX"}, # same template substitution
)
result = downloader.download()
# result is a dict with output paths / URLs
Use the Python class when you need to embed export in a larger pipeline, handle errors structurally (see "Exceptions" below), or subclass to add custom query / transform / post processors.
Path 2 — Server-side: DerivaExport
from deriva.transfer.download.deriva_export import DerivaExport
exporter = DerivaExport(
    host="data.example.org",
    config_file="export-spec.json",
    output_dir="./output",
    envars={"DATASET_RID": "2-XXXX"},
    export_type="bdbag",
    defer_download=False, # True → return URLs, don't download
)
result = exporter.export()
This submits the spec to the catalog's /deriva/export/bdbag/ endpoint. The server runs the export and either streams the bag back (defer_download=False) or returns a URL list (defer_download=True). The server path requires the host to have an export service running and the calling user to be authorized.
Use DerivaExport when an export is meant to be a server-side operation — bookmarked, shareable, runnable by other clients — rather than a one-off you run locally.
The export spec
Both paths consume the same JSON spec. The spec has four top-level sections:
- env: variables for template substitution (CLI --envar K=V, Python envars=); reference them as {K} placeholders elsewhere in the spec.
- bag: BagIt-level options such as name, checksum algorithm(s), and archiver (zip / tgz / none).
- catalog: the export work. query_processors is a list of operations executed in order; each one runs an ERMrest path query and either writes the rows to a CSV / JSON file under data/records/, or treats them as asset URLs and adds them to fetch.txt for materialization.
- post_processors (optional): things to do after the bag is built (e.g., upload to S3, mint a MINID).
The four query-processor types are csv (tabular rows → CSV), json (same, JSON), fetch (asset URLs → fetch.txt, materialize on demand), and download (asset bytes inline). The query_path field uses ERMrest path-expression syntax — the same syntax query_attribute uses.
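As a rough illustration of how the four sections and the {K} placeholders fit together, the sketch below builds a skeleton spec as a Python dict and fills its placeholders. Only the four top-level section names, query_processors, query_path, and the processor type names come from the text above. Every other key (bag_name, bag_algorithms, bag_archiver, processor, processor_params, output_path), the MyProject tables and columns, and the substitute helper are assumptions for illustration and should be checked against references/export-spec.md. The str.format call stands in for the tools' own substitution, which may behave differently.

```python
import json

# Hypothetical skeleton. Keys inside each section are assumptions, not a
# confirmed schema; verify them against references/export-spec.md.
spec = {
    "env": {"DATASET_RID": "2-XXXX"},
    "bag": {
        "bag_name": "dataset_{DATASET_RID}",
        "bag_algorithms": ["md5"],
        "bag_archiver": "zip",
    },
    "catalog": {
        "query_processors": [
            {   # tabular rows -> CSV
                "processor": "csv",
                "processor_params": {
                    "query_path": "/entity/MyProject:Subject/Dataset={DATASET_RID}",
                    "output_path": "MyProject.Subject",
                },
            },
            {   # asset URLs -> fetch.txt (column names are hypothetical)
                "processor": "fetch",
                "processor_params": {
                    "query_path": (
                        "/attribute/MyProject:Image/Dataset={DATASET_RID}"
                        "/url:=URL,length:=Length,filename:=Filename,md5:=MD5"
                    ),
                    "output_path": "assets/Image",
                },
            },
        ]
    },
    "post_processors": [],
}


def substitute(value, envars):
    """Recursively fill {NAME} placeholders in every string of a spec fragment."""
    if isinstance(value, str):
        return value.format(**envars)  # raises KeyError for an unknown placeholder
    if isinstance(value, list):
        return [substitute(item, envars) for item in value]
    if isinstance(value, dict):
        return {key: substitute(item, envars) for key, item in value.items()}
    return value


resolved = substitute(spec, spec["env"])
print(json.dumps(resolved["catalog"]["query_processors"][0], indent=2))
```

Resolving the spec locally like this is a cheap way to eyeball the final query paths before running an export; a placeholder with no matching envar fails immediately with a KeyError.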
For the full spec shape, query-processor table, and the strategies for authoring a spec from scratch (cribbing from Chaise-generated specs or deriva-ml's DatasetBagBuilder, building incrementally, the export annotation shortcut), see references/export-spec.md. Spec authoring is the deepest documentation gap in the Deriva ecosystem, so the reference is worth reading end-to-end the first time you author one; once you have a working spec you rarely need it again.
Materialization
A bag can be manifest-only (small; fetch.txt references remote assets) or fully materialized (asset bytes embedded under data/assets/).
# Validate a downloaded bag
bdbag --validate fast my-bag/
# Materialize fetch.txt entries (downloads the assets)
bdbag --resolve-fetch all my-bag/
# Re-validate including checksums of materialized assets
bdbag --validate full my-bag/
When to materialize, and when to defer:
- Manifest-only first, materialize selectively: you only need a few assets out of many, or you want to inspect the manifest before committing the bandwidth.
- Materialize at download time: set materialize=true in the bag section of the spec, or pass --materialize to deriva-download-cli, so the assets land alongside the manifest in one pass.
- Manifest-only forever: sharing a bag whose recipient already has the assets locally (e.g., mounted from S3) and doesn't need them re-fetched.
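One way to inspect the manifest before committing the bandwidth is to total up what fetch.txt would download. The sketch below uses only the standard library and assumes the BagIt fetch.txt layout of one entry per line as URL, length in bytes (or "-" when unknown), and path. The function name summarize_fetch is hypothetical.

```python
from pathlib import Path


def summarize_fetch(bag_dir: str) -> tuple[int, int, int]:
    """Return (entry count, total known bytes, entries with unknown length)."""
    fetch_file = Path(bag_dir) / "fetch.txt"
    if not fetch_file.is_file():
        return (0, 0, 0)  # no fetch.txt: nothing remote to materialize
    count = total = unknown = 0
    for line in fetch_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # BagIt fetch lines are "<url> <length> <path>"; length may be "-".
        _url, length, _path = line.split(maxsplit=2)
        count += 1
        if length == "-":
            unknown += 1
        else:
            total += int(length)
    return (count, total, unknown)
```

A large entry count with a small byte total is a hint that per-request overhead, not bandwidth, will dominate the materialization time.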
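Validate before consuming the bag. --validate fast checks the bag structure and uses the Payload-Oxum recorded in bag-info.txt (payload file count and total bytes), when present, to confirm the payload files are accounted for (cheap); --validate full rehashes every file in data/ and compares it against the manifest (expensive on big bags, but catches silent corruption).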
Caching
Bags are content-addressed by checksum. The deriva-py download orchestration uses a three-tier cache:
| Tier | Where | When it's checked |
|---|---|---|
| 1. Local | {cache_dir}/bags/{checksum}/ (default: ~/.deriva/cache/) | Always checked first; same export → cached bag returned without re-running |
| 2. MINID / S3 | A persistent identifier that resolves to a bag URL | Checked if the local tier misses and the spec is configured for MINID lookup |
| 3. Generation | Run the spec against the live catalog | Fallback; the result is then cached at tier 1 (and optionally minted as a MINID for tier 2) |
The cache key is {spec_hash[:16]}_{snapshot} — both the export plan and the catalog snapshot must match for a cache hit. This is why two exports against the same dataset at the same version produce the same bag (cache hit) but exports against current (no pinned snapshot) re-generate every time.
For a one-off export, the cache is a "free" speedup. For reproducible exports — same input → same bytes — pin the catalog snapshot in the spec (or via the dataset version, if going through the dataset-lifecycle path).
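The {spec_hash[:16]}_{snapshot} key format can be sketched in a few lines. This is an illustration rather than deriva-py's implementation: hashing the spec as sorted-key JSON with SHA-256 is an assumption, the function name cache_key is hypothetical, and the snapshot strings in the usage lines are placeholders.

```python
import hashlib
import json


def cache_key(spec: dict, snapshot: str) -> str:
    """Build a '{spec_hash[:16]}_{snapshot}' key from a spec and a snapshot id."""
    # Assumption: canonical JSON (sorted keys, compact separators) hashed with
    # SHA-256. The real orchestration may canonicalize or hash differently.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    spec_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{spec_hash[:16]}_{snapshot}"


# Same plan and same snapshot give the same key, whatever the dict key order.
same = cache_key({"a": 1, "b": 2}, "SNAP-1") == cache_key({"b": 2, "a": 1}, "SNAP-1")
# A different snapshot changes the key, so the export is regenerated.
differs = cache_key({"a": 1, "b": 2}, "SNAP-1") != cache_key({"a": 1, "b": 2}, "SNAP-2")
print(same, differs)
```

The point of the two-part key is that editing the spec or moving to a newer snapshot each invalidates the cached bag on its own.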
Exceptions
DerivaDownload raises a small typed hierarchy. Catch the parent for "any download error"; catch a specific subclass to handle one kind specially.
from deriva.transfer.download import (
    DerivaDownloadError,
    DerivaDownloadConfigurationError,
    DerivaDownloadAuthenticationError,
    DerivaDownloadAuthorizationError,
    DerivaDownloadTimeoutError,
    DerivaDownloadBaggingError,
)
| Exception | Meaning | Typical fix |
|---|---|---|
| DerivaDownloadConfigurationError | The spec is malformed or references a missing envar | Re-read the spec; check {NAME} placeholders against envars |
| DerivaDownloadAuthenticationError | No / expired credentials | deriva-auth data.example.org to refresh |
| DerivaDownloadAuthorizationError | Authenticated, but not allowed to read some queried table | Check ACLs on the table; see /deriva:troubleshoot-deriva-errors |
| DerivaDownloadTimeoutError | A query (often a deep FK join) exceeded timeout | Increase the read timeout (the second element of the timeout tuple), or prune the query in the spec |
| DerivaDownloadBaggingError | BagIt-level packaging problem (disk full, write permission) | Check disk space and output dir permissions |
| DerivaDownloadError | Catch-all parent | Re-raise after logging; this is the umbrella for anything deriva.transfer.download knows about |
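The "increase the read timeout" fix can be automated with a small retry helper. The sketch below is deliberately library-agnostic so it runs without deriva-py: download_with_retry, make_downloader, and the default timeout ladder are all hypothetical names and values. In real use, the factory would build a DerivaDownload with the given timeout, and timeout_error would be DerivaDownloadTimeoutError.

```python
def download_with_retry(make_downloader, timeout_error, read_timeouts=(600, 1800, 3600)):
    """Run a download, retrying with a longer read timeout after each timeout.

    make_downloader(timeout) must return an object with a .download() method;
    timeout_error is the exception class that signals a timeout.
    """
    if not read_timeouts:
        raise ValueError("read_timeouts must contain at least one value")
    last_error = None
    for read_timeout in read_timeouts:
        # (connect, read) in seconds, matching the timeout tuple used above.
        downloader = make_downloader((10, read_timeout))
        try:
            return downloader.download()
        except timeout_error as exc:
            last_error = exc  # fall through and retry with the next value
    # Every attempt timed out: surface the last timeout to the caller.
    raise last_error
```

Only the timeout class is caught, so configuration, authentication, authorization, and bagging errors propagate immediately; retrying those would not help.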
Performance and ergonomics
- Snapshot before exporting: for reproducible exports, capture the catalog snaptime first (see "Snapshot before any bulk mutation" in /deriva:load-data; same primitive, opposite direction) and pin it in the spec. Otherwise the export drifts with the live catalog.
- One spec, many runs: template substitution plus envars lets the same spec produce different bags (per-subject, per-study, per-date-range). Author the spec once; vary inputs at run time.
- Validate before sharing: run bdbag --validate full once before handing a bag off. Cheap insurance against silent corruption during transfer.
- bdbag --archiver zip for sharing: a zipped bag is one file (vs. a directory tree). Run bdbag --extract on the receiving side; the round trip preserves checksums.
- Asset count matters more than asset size: 10,000 small assets are slower to fetch than 100 large ones, because each Hatrac fetch has overhead. Group small assets into tar-style aggregates upstream when possible (an asset-table column convention, not a bag-time concern).
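The "one spec, many runs" pattern can be sketched as a loop that varies only the envar. The helper below builds one argument list per RID using the CLI flags shown earlier in this document; the function name build_export_commands, the per-RID output subdirectory, and the example RIDs are assumptions for illustration.

```python
def build_export_commands(host, catalog, spec_path, output_root, rids):
    """Return one deriva-download-cli argument list per RID."""
    commands = []
    for rid in rids:
        commands.append([
            "deriva-download-cli",
            "--host", host,
            "--catalog", str(catalog),
            "--config-file", spec_path,
            # Assumption: one output subdirectory per RID keeps bags separate.
            "--output-dir", f"{output_root}/{rid}",
            "--envar", f"DATASET_RID={rid}",
        ])
    return commands


for cmd in build_export_commands(
    "data.example.org", 1, "export-spec.json", "./output", ["2-AAAA", "2-BBBB"]
):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True). Building argument lists rather than shell strings avoids quoting problems if a value ever contains spaces.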
Reference Tools
Bundled references
- references/export-spec.md: full export-spec shape, query-processor reference, authoring strategies, and the tag:isrd.isi.edu,2016:export annotation path. Read when authoring or modifying a spec.
CLI
- deriva-download-cli --host HOST --catalog ID --config-file SPEC --output-dir DIR [--envar K=V ...] runs an export spec locally and writes the bag to DIR.
- bdbag <command> <bag-path> performs BagIt-level operations: --validate fast|full, --resolve-fetch all|missing, --materialize, --archiver zip|tgz, --extract. From the bdbag Python package (a deriva-py dependency).
Python
- deriva.transfer.download.DerivaDownload(server={...}, output_dir=..., config=..., credentials=..., timeout=..., envars={...}).download(): Client-side orchestration. Returns a dict with output paths.
- deriva.transfer.download.DerivaExport(host=..., config_file=..., output_dir=..., envars={...}, export_type="bdbag", defer_download=False).export(): Server-side submission. Returns paths or URLs.
- bdbag.bdbag_api: Programmatic BagIt operations (validate, materialize, archive) for when you need to manipulate a bag in code.
Catalog annotations
- tag:isrd.isi.edu,2016:export is a per-table annotation that defines a reusable export spec; Chaise surfaces it as an "Export" button. See the ISRD export annotation spec and /deriva:customize-display for how to install / edit it.
Related Skills
- /deriva:load-data: The inverse operation. Loading data into a catalog (row inserts, asset uploads) is what produces the rows a bag later exports. The snapshot-before-mutating discipline applies in both directions.
- /deriva:query-catalog-data: Use to verify what rows a query path matches before baking that path into an export spec. The path-expression syntax is the same in both surfaces.
- /deriva:customize-display: When the export should be a Chaise "Export" button rather than a standalone spec file, the path is through the tag:isrd.isi.edu,2016:export annotation.
- /deriva:troubleshoot-deriva-errors: For auth / permission failures on export queries; for missing-record errors when an envar resolves to a RID that no longer exists.
- /deriva:evolve-schema: Schema changes invalidate cached bags whose schema.json no longer matches the live catalog. The migration runbook covers what to do for downstream consumers (re-export, or pin to a pre-migration snaptime).
If deriva-ml is loaded and you're exporting a DerivaML Dataset, the right surface is /deriva-ml:dataset-lifecycle (Phase 5: Use); it wraps this skill's mechanics with version-pinning, member-driven spec generation, and a {rid}@{version} cache key. Reach for that skill instead when the source is a Dataset entity; reach for this skill when you're exporting an arbitrary catalog slice that isn't part of a Dataset.