name: tcia-query-skill description: Find, verify, cite, visualize, and route TCIA-published datasets and verified manuscripts about TCIA data across TCIA WordPress Collection and Analysis Result metadata, TCIA Publications EndNote XML, IDC/idc-index, Cancer Data Aggregator, General Commons, CTDC, PathDB, DataCite, and Aspera. Use when users ask to discover TCIA datasets or publications by cancer type, modality, body site, species, data type, access/license, DOI, program, clinical/supporting data, demographic/diagnosis enrichment, segmentation/annotation availability, browser preview/viewer URL, or download path, including public DICOM, controlled-access datasets, non-DICOM pathology, supporting files, and derived results.
TCIA Query Skill
Core Rule
Use the TCIA WordPress Collection Manager as the authority for whether a dataset is TCIA-published. A dataset is in scope only if it appears as a WordPress Collection or Analysis Result. For normal agent work, query the local SQLite snapshot as the fast discovery surface over WordPress, PathDB, and DataCite metadata. Downstream systems such as IDC, CDA, General Commons, CTDC, PathDB, Zenodo, and DataCite can enrich or route access, but they do not decide TCIA provenance.
Use TCIA's Publications page and EndNote export as the authority for peer-reviewed manuscripts written about TCIA datasets: https://www.cancerimagingarchive.net/publications/ and https://cancerimagingarchive.net/endnote/Pubs_basedon_TCIA.xml. DataCite records describe TCIA dataset DOI metadata; they are not the verified bibliography of papers that used TCIA data. For questions about papers, manuscripts, publication impact, hypotheses studied, methods, or citation lists, load references/publications.md and use scripts/tcia_publications.py.
Exception for DOI-centered questions: start with DataCite for DOI metadata, citation metadata, versions, and DOI relationships, then use WordPress to confirm TCIA publication status, visibility, access/license, and user-facing dataset pages.
Ignore WordPress records where hide_from_browse_table = "1" by default. TCIA uses this flag for pre-release staging/review and for retired/outdated datasets that should not be casually rediscovered. Include hidden records only when the user explicitly says they are a TCIA staff member and explicitly asks to include hidden, staged, retired, or internal-review datasets.
When a downstream record is derived from a TCIA DOI but is not itself listed in WordPress, describe it as an external derived or related dataset, not as a TCIA-published dataset.
Quick Workflow
- Use the local SQLite metadata snapshot for routine discovery and access/license metadata. Prefer the agent-facing views in
references/schema.md:agent_datasets,agent_current_downloads,agent_dataset_access_summary,agent_pathdb_slides, andagent_datacite_dois. Loadreferences/snapshots.mdfor refresh behavior andreferences/schema.mdfor SQL details. If the environment cannot execute scripts or query SQLite, use the web-friendly release exports described inreferences/mcp-and-web-llms.md; do not switch to live WordPress API discovery. - Choose the starting source.
- For peer-reviewed publications or manuscripts about TCIA data, use TCIA's EndNote XML export, not DataCite. Prefer
scripts/tcia_publications.py, and loadreferences/publications.md. - For DOI, citation, or version questions, start with DataCite metadata from the snapshot. Prefer
scripts/datacite_tcia_dois.pyfor TCIA DOI prefix records. - For subject-level clinical, demographic, diagnosis, treatment, or cross-commons data-availability enrichment, confirm the TCIA dataset in WordPress first, then use CDA from validated TCIA/IDC subject identifiers. Load
references/cda.md. - For file-level public NIfTI questions, confirm TCIA provenance/access through the normal snapshot first, then load
references/nifti.mdand use the optional NIfTI SQLite release only if file-grain metadata are needed. Preferagent_nifti_dataset_summary,agent_nifti_downloads,agent_nifti_files, andagent_nifti_derived_objects. - For controlled-access file, manifest,
drs_uri, modality, or series-level metadata, confirm controlled status through the normal snapshot first, then loadreferences/controlled-access.mdand use the optional controlled-access SQLite release when file-grain public metadata are needed. Preferagent_controlled_dataset_summary,agent_controlled_downloads, andagent_controlled_files. - For all other discovery and access questions, search WordPress snapshot records first for Collections and Analysis Results.
- Prefer
scripts/tcia_wordpress_search.pyfor lightweight snapshot searches. - If the snapshot is missing, run or tell the user to run
python scripts/tcia_snapshot.py ensurefrom the skill root. - If a specific dataset appears absent after refreshing, say the snapshot may not include the newest TCIA metadata yet. The GitHub Action builds release snapshots at 7:17 AM and 7:17 PM America/New_York; ask the user to try again after the next scheduled run has had time to finish, then rerun
python scripts/tcia_snapshot.py ensure. - Use
--include-hiddenonly for explicit TCIA staff requests that ask for hidden/staged/retired records.
- For peer-reviewed publications or manuscripts about TCIA data, use TCIA's EndNote XML export, not DataCite. Prefer
- Filter out
hide_from_browse_table = "1"records unless the explicit TCIA staff exception applies. - Filter candidates by the user's criteria: cancer type, body site, modality, species, data type, access/license, DOI, program, supporting data, segmentations/annotations, or download need. For modality, file format, download-route, and access/license questions, prefer download-level metadata over top-level Collection or Analysis Result
data_types; mixed datasets can have modality labels such as MR only on individual download records. For public DICOM series/file details after TCIA provenance and access are established, use IDC/idc-index rather than querying live WordPress. - Use
collection_short_titleorresult_short_titleas the cross-system key whenever possible. - For download questions, inspect
agent_current_downloadsplus WordPressdownload_type,data_type, andfile_typetogether. These are multi-select labels, not a strict one-parent tree. - Route access with the matrix below.
- Decide open versus controlled access from WordPress license metadata, not from collection/page accessibility fields. Creative Commons licenses mean open access; Creative Commons NonCommercial licenses are open access with a noncommercial-use restriction. If license text indicates NIH Controlled Data Access, TCIA Restricted, or another controlled/restricted license, alert the user that the dataset is not open access and link to the TCIA NIH Controlled Data Access Policy before giving download/API-key/Data Retriever guidance.
- If
agent_dataset_access_summary.resolved_access_level = 'mixed', split open/noncontrolled downloads from controlled downloads. Do not imply that all files in a mixed dataset are open. - For visualization questions, decide open versus controlled access first. Controlled-access data cannot be previewed in a browser before download; open-access data should be returned as clickable viewer links, not opened by the agent.
- For open-access download requests, ask whether the user wants the agent to download files directly in the current environment or create a portable TCIA Data Retriever CSV manifest when the requested data can be represented by DICOM Series Instance UIDs, direct
imageUrlvalues, ordrs_urivalues. Do not directly download controlled data; for controlled data, provide the TCIA policy link and portable manifest guidance only. - Include citations and access caveats before recommending downloads or downstream analysis.
Access Routing
| Data need | Route |
|---|---|
| Public DICOM radiology, DICOM pathology, or DICOM annotations/results | Use IDC and idc-index first. If an IDC skill is available, use it for IDC-specific querying, visualization, and downloading. Keep the TCIA WordPress short title, DOI, or Series Instance UIDs as the allowlist/provenance anchor. Use NBIA only as a fallback when the requested DICOM data cannot be found in IDC/idc-index; if fallback is needed, tell users to use the NBIA v4 API documented by https://cbiit.github.io/NBIA-TCIA/nbia-api.yaml. |
| Public non-DICOM NIfTI file-grain metadata | Use WordPress snapshot metadata first to confirm the dataset/download is visible and non-controlled, then use references/nifti.md and scripts/tcia_nifti_metadata.py. The NIfTI SQLite is optional and downloaded on demand with python scripts/tcia_nifti_metadata.py ensure; it is not installed or refreshed automatically with the base skill snapshot. Prefer the optional SQLite agent_nifti_* views for direct SQL. |
| Browser visualization before download | Controlled-access data cannot be previewed before download. For open/public DICOM in IDC, use OHIF v3 for radiology, VolView when a public S3 series folder/CRDC UUID is available, and SliM for SM slide microscopy. For open/public non-DICOM PathDB slides, use caMicroscope with the PathDB CSV camic_id. Load references/visualization.md. |
| Controlled-access face datasets | If license metadata indicates controlled/restricted access, alert that the dataset is controlled access and point to https://www.cancerimagingarchive.net/nih-controlled-data-access-policy/. For Biobank controlled-access face data, use the current WordPress dataset pages/download metadata and optional controlled-access SQLite for CTDC manifest, drs_uri, metadata, download, and viewer links; users must request access to dbGaP study phs002192. For non-Biobank face datasets, use General Commons only when WordPress or GC metadata indicate that route; scope GC queries to phs004225 and match study_acronym to the WordPress short title. Do not directly download controlled data; provide policy and TCIA Data Retriever manifest guidance for later authorized use. |
| Controlled-access NCTN trials or Biobank data | If license metadata indicates controlled/restricted access, alert that the dataset is controlled access and point to the TCIA NIH Controlled Data Access Policy. For Biobank controlled-access face data, route through CTDC using the manifests and download/view links now exposed on the relevant WordPress pages, and use the optional controlled-access SQLite when file-grain public metadata are needed; tell users they must request dbGaP access to phs002192. For NCTN trials or other controlled datasets, use WordPress for current metadata and access statements unless WordPress identifies a downstream route. |
| Subject-level clinical/demographic/diagnosis enrichment or cross-commons availability | Use CDA after WordPress provenance is established. Prefer cdapython for harmonized subject and file summaries across IDC, GDC, PDC, GC, ICDC, and related upstream identifiers. Use CDA to enrich TCIA/IDC cohorts, not to decide TCIA publication, replace official WordPress downloads, or authorize controlled data access. Load references/cda.md. |
| Non-DICOM pathology | Use PathDB. Prefer the stable cohort-builder CSV for rich slide-level metadata, and match its collection field to the WordPress short title. The PathDB API collection list may use collectionName. |
| Spreadsheets, ZIP files, supporting files, manifests, and ancillary downloads | Use WordPress download metadata. If a download is an IBM Aspera Faspex package, see references/aspera.md. |
| Peer-reviewed manuscripts about TCIA datasets | Use TCIA Publications and the EndNote XML export. Search title, abstract, keywords, journal, PMID, manuscript DOI, and linked TCIA dataset DOIs. Load references/publications.md. |
| DOI, citation, version, or derived-result relationships | Start with DataCite metadata and relationships, then use WordPress for TCIA publication/visibility, access/license, and user-facing pages. See references/datacite-relationships.md. |
Read references/routing.md for detailed routing and answer-format guidance.
Tool Setup
Do not assume optional Python packages are installed. Before writing custom code for TCIA, IDC, DICOM, or CDA operations, check whether the appropriate package is available in the same Python environment that will run the task:
python -c "import importlib.util as u; print({p: u.find_spec(p) is not None for p in ['tcia_utils','idc_index','pydicom','cdapython']})"
Ask before installing packages. If the user allows package installation, install them in the active local agent environment:
python -m pip install --upgrade tcia_utils idc-index pydicom cdapython
Prefer these package APIs over custom implementations where possible:
tcia_utils: TCIA-specific helper APIs when maintaining the snapshot or doing explicit source-system checks.idc-index: IDC metadata lookup, public DICOM download, viewer URLs, cloud-storage URLs, and Series Instance UID workflows.pydicom: local DICOM header/metadata inspection. Do not hand-parse DICOM files whenpydicomcan be installed or is already available.cdapython: CDA subject/file summaries and harmonized cross-CRDC enrichment. Prefer it over directcda-clientor handwritten CDA REST calls.
The bundled standard-library scripts support snapshot refresh/querying, Data Retriever CSV creation, legacy manifest parsing, and viewer URL construction. They do not replace idc-index for IDC workflows, pydicom for local DICOM parsing, or cdapython for CDA workflows. Live source API details belong to scripts/tcia_snapshot.py build and the maintainer guidance in references/snapshots.md, not normal end-user discovery.
For controlled-access evidence, inspect WordPress download/license metadata through agent_datasets, agent_current_downloads, and agent_dataset_access_summary. When file-grain controlled metadata are needed, use the optional controlled-access SQLite release from scripts/tcia_controlled_access_metadata.py ensure; it is derived from public WordPress manifest and metadata spreadsheet URLs and does not require a user account. Do not use collection_page_accessibility or result_page_accessibility to decide controlled access; those fields are being phased out. Creative Commons licenses mean open access. Creative Commons NonCommercial licenses are open access with a noncommercial-use restriction. When controlled_access is true or resolved_access_level is controlled or mixed, link users to https://www.cancerimagingarchive.net/nih-controlled-data-access-policy/, which contains current access-request, JSON API key, and TCIA Data Retriever configuration guidance. Do not directly download controlled data.
Use idc-index for public DICOM only after confirming that the dataset is TCIA-published through WordPress or is clearly an external derived dataset through DataCite relationships.
For open-access/public DICOM downloads, prefer IDC/idc-index over NBIA. TCIA is phasing out NBIA, so do not use NBIA as the first route for public DICOM files. Existing WordPress .tcia manifest files are still useful as legacy inputs because they usually contain a few configuration lines followed by one DICOM Series Instance UID per row; extract those Series Instance UIDs and use them with idc-index. For new portable manifests, write CSV files for TCIA Data Retriever instead of .tcia files. Use NBIA only as a fallback when IDC/idc-index cannot find the requested public DICOM series. If NBIA fallback is needed, tell users to use the NBIA v4 API documented by https://cbiit.github.io/NBIA-TCIA/nbia-api.yaml. For controlled-access DICOM, use WordPress license metadata plus the downstream route identified by current WordPress metadata: CTDC for Biobank controlled-access face data, or General Commons under phs004225 only when WordPress/GC metadata indicate that route. Do not imply IDC or NBIA public download and do not directly download controlled data.
Do not query live WordPress to find DICOM series, modality, annotation, or file details during end-user tasks. WordPress snapshot metadata identifies TCIA publication status, user-facing downloads, and access/license terms; IDC/idc-index is the preferred source for public DICOM series/file metadata after those checks.
For visualization before download, load references/visualization.md. Controlled-access data cannot be visualized in a browser before download regardless of file format. For open-access/public DICOM in IDC, use IDC/idc-index viewer support: OHIF v3 for radiology, SliM for DICOM slide microscopy (SM), and VolView only after mapping the series to its public S3 folder or crdc_series_uuid. For open/public non-DICOM PathDB slides, use caMicroscope viewer URLs built from the PathDB CSV camic_id; the viewer URL parameter is named slideId, but it expects the numeric camic_id, not the CSV slide_id or patient_id. Return these as links for the user to open in their regular browser; do not install Playwright or browser automation just to preview example data.
Before performing any data download, ask how the user wants to proceed:
- Direct agent download: use Python tooling such as
idc-indexin the current environment when the user wants the agent to download files locally. - Portable manifest: create or return a CSV manifest for TCIA Data Retriever when the data can be represented by DICOM Series Instance UIDs, direct
imageUrlvalues, ordrs_urivalues. This lets the user save the manifest, use it on another computer, or share it with a collaborator.
If the user chooses a manifest and WordPress already provides a current CSV/TSV/XLSX manifest link, prefer the official WordPress manifest. Treat .tcia as a legacy NBIA-era format: read existing .tcia files when needed, but do not write new .tcia files unless the user explicitly asks for the legacy format. If the agent has validated Series Instance UIDs, image URLs, or DRS URIs and needs to create a manifest locally, use scripts/tcia_create_data_retriever_csv.py.
TCIA Data Retriever routes spreadsheet manifests by column headers:
SeriesInstanceUIDis the preferred header for public DICOM Series Instance UID manifests;Series UIDis also recognized. Data Retriever treats this as DICOM and attempts IDC/S3 lookup first, with TCIA/NBIA v4 as backup for public data that is not found in IDC. Do not use this route for controlled-access DICOM; use General Commons guidance instead.imageUrlis the preferred header for direct public file URLs;wsiimage_urlis also recognized. In TCIA workflows, use this for public PathDB/non-DICOM pathology files.drs_uriis the preferred header for controlled-access files when official WordPress, CTDC, or General Commons manifests provide DRS URIs;File ID/file_idare also recognized and bare IDs are interpreted asdrs://nci-crdc.datacommons.io/<file-id>. Always warn that authorization and TCIA Data Retriever API-key configuration may be required.
For new CSV manifests, create a single-route file with exactly one of the preferred route headers: SeriesInstanceUID, imageUrl, or drs_uri. Do not mix route headers in one manifest. Data Retriever checks for Series UID columns first, and in direct-file spreadsheets drs_uri/File ID takes precedence over imageUrl, so mixed-route CSVs can be ambiguous. For DICOM UID manifests, prefer a simple one-column CSV headed SeriesInstanceUID.
WordPress Download Labels
WordPress download metadata uses three related multi-select fields:
download_type: broad category, such as Radiology Images, Image Annotations, Clinical Data, Pathology Images, or Other.data_type: modality, annotation, clinical, pathology, or content label, such as CT, MR, RTSTRUCT, SEG, Segmentation, Demographic, Protocol, or Whole Slide Image.file_type: file format, such as DICOM, CSV, TSV, XLSX, ZIP, NIfTI, SVS, JSON, or PDF.
Treat download_type as the broad parent category, but do not require each download to have exactly one parent. TCIA often publishes one download record for a Data Retriever manifest that contains mixed content, such as CT or MR images plus RTSTRUCT or SEG annotations. TCIA may also publish one ZIP or supporting package that legitimately combines categories, such as image annotations plus protocol/acquisition details labeled as Other.
When routing downloads, preserve all labels and explain mixed labels plainly. Do not infer the route from a single data_type; use the combined download_type, data_type, file_type, title, description, license, requirements, and WordPress dataset context. Ignore blank or orphaned download rows in normal public discovery unless they are attached to a visible Collection or Analysis Result and contain a real title, URL, or file.
For Collections, collection_downloads are the dataset's download records. For Analysis Results, result_downloads are the Analysis Result's actual downloadable files. Do not treat source collection download records as the Analysis Result files; source collections only explain what data were used to create the result. If result_downloads are missing or empty for an Analysis Result, say that the result file metadata is unavailable in the current snapshot and, if the user expects a recent change, ask them to try again after the next scheduled snapshot run has completed.
In local SQLite snapshots, current-version nested collection_downloads and result_downloads are normalized into wordpress_downloads and wordpress_download_labels. Use the boolean is_current_version column for user-facing dataset downloads, for example is_current_version IS TRUE. The global WordPress downloads endpoint is also stored there with is_current_version = FALSE; use those rows for troubleshooting, not routine discovery, because they can include historical or orphaned endpoint records.
Bundled Scripts
Run scripts from the skill root.
| Script | Purpose |
|---|---|
scripts/tcia_wordpress_search.py |
Search local SQLite TCIA WordPress Collection and Analysis Result snapshot metadata, with text or JSON output. |
scripts/tcia_snapshot.py |
Build, inspect, validate, or download the local SQLite metadata snapshot, and export web-friendly release files. |
scripts/tcia_nifti_metadata.py |
Download, validate, summarize, and query the optional release SQLite for visible non-controlled TCIA NIfTI file-grain metadata. |
scripts/tcia_controlled_access_metadata.py |
Build/download, validate, summarize, and query the optional release SQLite for public controlled-access WordPress manifests, metadata spreadsheets, drs_uri rows, and IDC-shaped radiology indexes. |
scripts/tcia_manifest_series_uids.py |
Extract DICOM Series Instance UIDs from a legacy TCIA .tcia manifest path or URL for IDC/idc-index lookup. |
scripts/tcia_create_data_retriever_csv.py |
Create a TCIA Data Retriever CSV manifest from validated DICOM Series Instance UIDs, direct imageUrl values, or drs_uri values. |
scripts/idc_viewer_urls.py |
Construct OHIF v3, SliM, or VolView URLs after TCIA provenance, license, and IDC presence are already verified. |
scripts/general_commons_studies.py |
Query General Commons GraphQL for TCIA face dataset study acronyms under phs004225 and optional node counts. |
scripts/datacite_tcia_dois.py |
List TCIA DOI metadata from DataCite using prefix 10.7937, or fetch one exact DOI. |
scripts/datacite_related.py |
Developer/explicit-current-check helper for external DataCite records that declare DOI relationships, such as Zenodo records derived from TCIA DOIs. |
scripts/tcia_publications.py |
Fetch, parse, and search TCIA's verified EndNote publication library for papers written about TCIA datasets. |
scripts/pathdb_metadata.py |
Search or summarize PathDB non-DICOM histopathology slide metadata and derived caMicroscope viewer URLs from the stable cohort-builder CSV. |
Examples:
python scripts/tcia_snapshot.py ensure
python scripts/tcia_snapshot.py info
python scripts/tcia_snapshot.py build --out cache/tcia_snapshot.sqlite --gzip-out dist/tcia_snapshot.sqlite.gz --manifest-out dist/tcia_snapshot_manifest.json --exports-dir dist
python scripts/tcia_snapshot.py validate --db cache/tcia_snapshot.sqlite
python scripts/tcia_nifti_metadata.py ensure
python scripts/tcia_nifti_metadata.py datasets --limit 20
python scripts/tcia_nifti_metadata.py derived --collection BCBM-RadioGenomics --with-sources
python scripts/tcia_controlled_access_metadata.py ensure
python scripts/tcia_controlled_access_metadata.py datasets --limit 20
python scripts/tcia_controlled_access_metadata.py files --collection CMB-MEL --limit 10
python scripts/tcia_wordpress_search.py --query breast --limit 10
python scripts/tcia_wordpress_search.py --short-title TCGA-BRCA --json
python scripts/tcia_wordpress_search.py --query retired --include-hidden
python scripts/tcia_manifest_series_uids.py ./legacy_manifest.tcia --out series_uids.txt
python scripts/tcia_create_data_retriever_csv.py --uids-file series_uids.txt --out manifest.csv
python scripts/idc_viewer_urls.py ohif-v3 --study-uid <StudyInstanceUID> --series-uid <SeriesInstanceUID>
python scripts/idc_viewer_urls.py slim --study-uid <StudyInstanceUID> --series-uid <SeriesInstanceUID>
python scripts/idc_viewer_urls.py volview --crdc-series-uuid <crdc_series_uuid>
python scripts/general_commons_studies.py --study-acronym TCGA-GBM --counts
python scripts/datacite_tcia_dois.py --query breast --limit 10
python scripts/datacite_tcia_dois.py --doi 10.7937/4qad-4280 --json
python scripts/tcia_publications.py --query radiogenomics --limit 10
python scripts/tcia_publications.py --dataset-doi 10.7937/K9/TCIA.2016.RNYFUYE9 --json
python scripts/pathdb_metadata.py --collection CPTAC-STAD --summary
General Commons
Use direct General Commons GraphQL only when WordPress or GC metadata route a controlled-access TCIA face dataset to General Commons, unless the user explicitly asks for broader GC context. All TCIA data in General Commons are under phs004225; child study_acronym values should match WordPress collection_short_title or result_short_title. Biobank controlled-access face data now route through CTDC from the current WordPress manifests/links and require dbGaP access to phs002192, not GC phs004225.
Load references/general-commons-graphql.md when querying General Commons.
Cancer Data Aggregator
Use CDA when a user asks whether TCIA/IDC subjects have additional harmonized clinical, demographic, diagnosis, treatment, genomic, proteomic, General Commons, or cross-CRDC file metadata. Load references/cda.md.
Use CDA after WordPress confirms TCIA publication and after subject identifiers are validated through TCIA/IDC metadata. CDA can enrich a cohort, summarize which subjects have IDC/GDC/PDC/GC/ICDC data, and expose upstream identifiers, but WordPress remains the TCIA publication/download authority and source systems remain the access authorities.
Controlled Access
Load references/controlled-access.md when a user asks about controlled-access datasets, face data with controlled/restricted licenses, NCTN trials, Biobank data, API-key access, or configuring TCIA Data Retriever for restricted downloads. Always point users to the TCIA NIH Controlled Data Access Policy page for current instructions: https://www.cancerimagingarchive.net/nih-controlled-data-access-policy/. For Biobank controlled-access face data, tell users to request dbGaP access to phs002192 and use the current WordPress CTDC manifests/download/view links after authorization.
Use the optional controlled-access SQLite only after the normal snapshot confirms TCIA provenance and controlled/restricted access. It is distributed as release assets controlled_access_metadata.sqlite.gz and controlled_access_metadata_manifest.json, and fetched on demand by python scripts/tcia_controlled_access_metadata.py ensure.
IDC DICOM Downloads
Load references/idc-dicom-downloads.md when a user asks to download public TCIA DICOM data, including radiology images, DICOM pathology, RTSTRUCT, SEG, SR, radiotherapy objects, or other DICOM annotation/result files. Use IDC/idc-index first. Use NBIA only as a fallback when the requested DICOM series cannot be found in IDC/idc-index, or when the user explicitly asks for NBIA after being told IDC is preferred. If NBIA fallback is needed, use the NBIA v4 API documented by https://cbiit.github.io/NBIA-TCIA/nbia-api.yaml.
Visualization
Load references/visualization.md when a user asks to preview, visualize, open, inspect, or get a browser viewer link for TCIA data. Never generate public viewer links for controlled-access datasets. For open-access/public DICOM, use IDC/idc-index and the IDC skill when available; prefer OHIF v3 for radiology, SliM for SM, and VolView only when a public S3 series URL or CRDC series UUID has been identified. Return viewer URLs as links for the user to open; do not install Playwright or browser automation just to display examples.
DataCite Relationships
Use snapshot DataCite records first for DOI metadata, citations, and versions. For explicit current external DOI relationship checks, an external Zenodo dataset may declare IsDerivedFrom a TCIA DOI. Such a record is relevant to the TCIA collection, but it remains an external derived record unless WordPress also lists it as a Collection or Analysis Result.
Load references/datacite-relationships.md when answering DOI, citation, version, or derived-result questions.
Publications About TCIA Data
Load references/publications.md when a user asks for peer-reviewed manuscripts, papers, publication lists, citation counts, research hypotheses, methods, or downstream scientific uses of TCIA data. Use TCIA's verified Publications EndNote XML export at https://cancerimagingarchive.net/endnote/Pubs_basedon_TCIA.xml as the authority. Do not substitute DataCite dataset DOI records or WordPress dataset pages for the manuscript bibliography.
Web LLMs And MCP
Load references/mcp-and-web-llms.md when a user asks how to use this skill from web-based LLMs, remote MCP connectors, custom agents that cannot install skills, or environments that cannot run Python/SQLite locally. Prefer the published release snapshot and static JSON/JSONL exports over live public APIs. A remote MCP server should expose typed read-only TCIA search tools backed by the same snapshot, not arbitrary live WordPress scraping.
Aspera Packages
Some non-DICOM data are distributed through IBM Aspera Faspex package links in WordPress. Do not try to reconstruct package URLs. Use the URL exposed by the TCIA dataset page or WordPress download metadata, then follow references/aspera.md. For non-DICOM pathology that is available both through Aspera and PathDB, explain that Aspera packages are the original submitter-provided data, while PathDB copies may be converted or reformatted so they work properly in TCIA's browser-based pathology viewer. Recommend the Aspera copy for real analyses when the user needs the exact files submitted to TCIA.
NIfTI Metadata
Load references/nifti.md when a user asks for TCIA NIfTI file counts, NIfTI filenames/package paths, NIfTI modality/acquisition metadata, or NIfTI segmentation/source-image relationships. Use the normal snapshot first for TCIA provenance, visibility, access/license, and user-facing download URLs. Use the optional NIfTI SQLite only for file-grain public non-controlled NIfTI metadata.
Do not download the optional NIfTI SQLite unless the user needs NIfTI file-grain metadata or explicitly asks to refresh it. It is distributed as release assets nifti_metadata.sqlite.gz and nifti_metadata_manifest.json, and fetched on demand by python scripts/tcia_nifti_metadata.py ensure.
PathDB Metadata
For non-DICOM histopathology, load references/pathdb.md. Use WordPress first to confirm the dataset is TCIA-published, then use PathDB metadata to answer slide-level questions, including patient counts, slide counts, image URLs, caMicroscope viewer URLs, cancer type/location, data formats, and companion radiology/genomics/proteomics flags. For public pathology Aspera package scope, Aspera-derived package file inventory, or PathDB/package reconciliation, use the optional pathology SQLite and prefer agent_pathology_dataset_summary, agent_pathology_downloads, agent_pathology_file_objects, and agent_pathology_package_files. For caMicroscope URLs, use CSV camic_id, not slide_id. If the same pathology data are also distributed as an Aspera package, distinguish the routes: PathDB is optimized for metadata and browser viewing and may use converted or reformatted files; Aspera is the original submitter-provided copy and is preferred for analyses requiring exact source files.
Answer Format
For discovery requests, prefer a compact ranked table:
| Dataset | Type | Why it matched | Access route | Access/license | DOI/citation | Notes |
|---|
Include the TCIA page link, WordPress short title, and any caveats about controlled access, license, or external derived records. For download requests, estimate size/counts when available and recommend a small test download before bulk transfer.
Guardrails
- Never present a dataset as TCIA-published unless it appears in WordPress Collections or Analysis Results.
- Ignore WordPress
hide_from_browse_table = "1"records unless the user explicitly identifies as TCIA staff and asks to include hidden/staged/retired/internal-review datasets. - Do not use live WordPress API calls for normal end-user discovery. Use the SQLite snapshot, its release exports, or a snapshot-backed MCP server. Live source APIs are for snapshot maintainers and explicit current-source troubleshooting.
- Do not use DataCite DOI records as the authority for peer-reviewed manuscripts written about TCIA data. Use TCIA's Publications EndNote XML export for manuscript bibliography questions.
- Use license metadata, not WordPress collection/page accessibility fields, to decide controlled access. Creative Commons means open access; Creative Commons NonCommercial is open with a noncommercial-use restriction; controlled/restricted license text requires the controlled-access alert and policy link.
- Do not use NBIA as the first route for open-access/public DICOM downloads. Prefer IDC/idc-index, using existing WordPress
.tciamanifests as Series Instance UID allowlists when helpful. If NBIA fallback is needed, use the NBIA v4 API documented byhttps://cbiit.github.io/NBIA-TCIA/nbia-api.yaml. For controlled-access DICOM, use WordPress license metadata plus the downstream route identified by current WordPress metadata, such as CTDC for Biobank controlled-access face data or General Commons for non-Biobank face data routed there, instead of public download routes. - Do not construct browser viewer URLs for controlled-access data. There is no pre-download browser visualization route for controlled-access TCIA data regardless of file format.
- Do not install Playwright or other browser automation just to show visualization examples. Provide viewer links for users to open in their own browser unless they explicitly ask for browser automation.
- Before downloading data, ask whether the user wants direct agent download in the active environment or a portable manifest/file list when the requested data support that workflow. For new TCIA Data Retriever manifests, write CSV/TSV/XLSX-compatible files rather than legacy
.tciafiles. - For new TCIA Data Retriever CSV manifests, use one route header only:
SeriesInstanceUIDfor public DICOM through IDC first/NBIA fallback,imageUrlfor PathDB/direct public files, ordrs_urifor controlled-access files when official WordPress, CTDC, or General Commons manifests provide DRS URIs. - For non-DICOM pathology available in both PathDB and Aspera, do not imply the files are necessarily byte-identical. PathDB may convert or reformat slides for browser-based viewing; Aspera packages should be described as the original submitter-provided data and preferred for analyses that require exact source files.
- For Biobank controlled-access face data, route through CTDC using the current WordPress manifests/download/view links and tell users to request dbGaP study
phs002192; treat this as current routing, not a future placeholder. Use the optional controlled-access SQLite for public CTDC manifest/spreadsheet metadata when file-grain details are needed. - Do not present VolView as UID-based. Map Study/Series UIDs through IDC/idc-index to a public S3 series folder or
crdc_series_uuidbefore constructing a VolView URL. - Do not broaden IDC, CDA, GC, or PathDB searches beyond WordPress short titles, TCIA DOIs, subject identifiers from validated TCIA/IDC cohorts, or explicit user-approved exploratory scope.
- Do not use CDA to claim TCIA publication, official TCIA clinical spreadsheet completeness, or controlled-data access rights. Use CDA as harmonized discovery/enrichment metadata and route users back to TCIA, IDC, GDC, PDC, GC, or other source systems for authoritative files and access controls.
- Distinguish open Creative Commons, open Creative Commons NonCommercial, mixed open/controlled, and controlled/restricted license statuses clearly.
- Do not provide medical, regulatory, or legal conclusions about data suitability. Report metadata, access terms, and citations.
- Verify current package/API behavior when the user asks for latest status, current availability, or exact download commands.