name: model-development-workflow
description: "Use when STANDING UP a new ML pipeline or onboarding to one for the first time — the cycle-zero work that has to happen before any hypothesis-driven experiment can start. Covers the end-to-end bootstrap (schema design → small representative dataset → validate features → dry run → small-data run → production run) and the three-tier pattern (dry_run → small → full) that catches bugs before full-scale compute. The skill's job ends when the pipeline produces real results; from there, hypothesis-driven iteration belongs in /deriva-ml:experiment-lifecycle. Triggers on: 'new ML project', 'set up a new pipeline', 'first model', 'onboard to existing project', 'standing up training', 'how should I get started', 'start small', 'representative dataset', 'development subset', 'what order should I do things', 'best practices for training', 'debug my training' (when the pipeline itself is new/unproven)."
user-invocable: true
Model Development Workflow (cycle zero)
This skill teaches the bootstrap workflow for a new DerivaML pipeline. It's cycle-zero work: schema design through the first production-scale training run. The core principle: start small, validate early, scale up only after everything works.
Most wasted compute comes from running full-scale training on broken configurations. This workflow catches problems at each tier before they become expensive.
When this skill stops, experiment-lifecycle starts. Once you have a working pipeline that's produced real (not validation) results, the hypothesis-driven iteration loop ("what's the next experiment that would teach us something we don't already know?") belongs in /deriva-ml:experiment-lifecycle. That skill assumes the pipeline exists and is sound; this skill exists to get you to that point.
Every MCP tool below takes hostname= and catalog_id= arguments explicitly. Substitute your catalog's hostname (e.g., "data.example.org") and catalog ID (e.g., "1") wherever the examples show them.
The Three-Tier Development Pattern
Every DerivaML workflow follows this progression:
| Tier | Data | Execution | Purpose | Cost |
|---|---|---|---|---|
| 1 | None | dry_run=True | Validate config, test data loading paths | Free |
| 2 | Small subset (50–200 records) | Real execution | End-to-end pipeline validation | Minutes |
| 3 | Full dataset | Production execution | Real results | Hours/days |
Never skip tiers. Tier 1 catches config errors. Tier 2 catches data pipeline bugs. Tier 3 is only for generating real results.
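One way to make the "never skip tiers" rule mechanical is a small gate that refuses to start a tier until every earlier tier has been marked as passed. The sketch below is plain Python and does not call DerivaML; the TierGate class, its method names, and the tier labels are hypothetical and only mirror the table above.

```python
from dataclasses import dataclass, field

# Tier order from the table above: dry run -> small-data run -> production run.
TIERS = ("dry_run", "small", "full")


@dataclass
class TierGate:
    """Hypothetical helper: tracks which tiers have passed for one config."""

    passed: set = field(default_factory=set)

    def require(self, tier: str) -> None:
        """Raise if any tier that comes before `tier` has not passed yet."""
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier!r}")
        missing = [t for t in TIERS[: TIERS.index(tier)] if t not in self.passed]
        if missing:
            raise RuntimeError(f"cannot run {tier!r}: {missing} not passed yet")

    def mark_passed(self, tier: str) -> None:
        """Record a successful tier; only allowed if the tier could legally run."""
        self.require(tier)
        self.passed.add(tier)


gate = TierGate()
gate.mark_passed("dry_run")
gate.mark_passed("small")
gate.require("full")  # no exception: tiers 1 and 2 have passed
```

Calling gate.require("full") on a fresh gate raises instead, which is the behavior you want from a launcher script before it spends production compute.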
Phase 1: Specify
Before designing the schema or writing any model code, capture what the model is
for. Hand off to /deriva-ml:design-experiment to author docs/design/model/<slug>.md
— Goal (the prediction task), Requirements (architecture, hyperparameters, input
features, input assets), Validation (the target metric + success threshold), and
Upstream designs (the feature-designs the model consumes). The Requirements
section is the source the model-layer configuration is derived from.
This is the Specify phase of the universal Specify → Build → Validate arc. The
three-tier development pattern (Phases 2–7) is Build; Phase 8 plus the
validate-against-the-design check is Validate. Configuration here is the model
layer (hyperparameters, architecture); the experiment layer that composes this
model with a dataset lives in /deriva-ml:experiment-lifecycle.
Reasoning about an existing model: read its design first. When the question is about a model that already exists (why is it built this way? what was it meant to predict? is this architecture still the right choice?), open its docs/design/model/<slug>.md, where the goal, requirements, and success criteria are recorded. Pair it with tacit-knowledge.md (what was learned in building/training it). Design = intent; journal = history.
Phase 2: Schema Design
Before any data, design the catalog structure.
Decision sequence:
- What domain tables do I need? (Subject, Image, Observation, etc.)
- What vocabularies provide consistent categorical labels?
- What features attach annotations to records?
- What asset tables store files? (images, models, masks, etc.)
Predictions as features (handoff grammar). If the model emits predictions
that should be stored as feature values (predicted labels, confidence scores),
that feature must be defined first. Hand off to /deriva-ml:create-feature to
author its feature-design and create it — the model-design's "Upstream
designs" names that feature. A model's prediction output is simultaneously an
asset (the prediction file) and feature values (on the records).
Skills to use:
- /deriva:create-table (deriva-skills): domain tables with columns and foreign keys
- /deriva:manage-vocabulary (deriva-skills): controlled vocabularies for categorical data
- create-feature (this plugin): features linking annotations to domain objects
- work-with-assets (this plugin): asset tables for file management
Start simple. You can always add columns, vocabularies, and features later. Don't over-design the schema before you have data.
After creating the schema, run rag_index_schema() so the RAG index includes your new tables.
Phase 3: Create a Development Dataset
The dataset itself is owned by /deriva-ml:dataset-lifecycle. This phase creates a development dataset as a means to the model bootstrap; for the dataset's own design, structure, typing, and versioning decisions, that lifecycle is authoritative. The fast-path recipe in references/dev-dataset-recipe.md covers a throwaway dev subset; route through dataset-lifecycle when the dev dataset becomes something you'll reuse or cite.
Create a small, representative dataset for development. This is the dataset you'll use for tiers 1 and 2. A development dataset should have 50–200 records, include all classes (5–10 per class), cover known edge cases, and be labeled if your workflow needs labels.
Recipe: the fast-path commands for creating the dev subset (register element type → create dataset → add a representative sample → mint a "Development" type → pin to a released version) live in references/dev-dataset-recipe.md. For the dataset's own design, structure, typing, and versioning, and once the dev dataset becomes something you'll reuse or cite, route through /deriva-ml:dataset-lifecycle, the canonical home for dataset creation.
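To illustrate the "all classes, 5–10 per class" guideline, here is a minimal sketch of per-class sampling over a mapping of record ID to label. It is plain Python and does not touch the catalog; the function name, the record IDs, and the class names are placeholders, and the recipe file remains the source for the actual dataset commands.

```python
import random
from collections import defaultdict


def pick_dev_subset(labels, per_class=8, seed=0):
    """Pick up to `per_class` record IDs for every class.

    `labels` maps a record ID to its class label. Both the IDs and the
    labels used below are placeholders, not real catalog values.
    """
    by_class = defaultdict(list)
    for rid, label in labels.items():
        by_class[label].append(rid)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_class):
        rids = sorted(by_class[label])
        # Small classes contribute everything they have.
        subset.extend(rng.sample(rids, min(per_class, len(rids))))
    return subset


# Toy input: 100 records spread over three classes.
toy = {f"R{i:03d}": ("normal", "mild", "severe")[i % 3] for i in range(100)}
dev = pick_dev_subset(toy, per_class=8)
print(len(dev), "records selected for the dev subset")
```

Random sampling will not find your known edge cases for you; those still have to be added to the subset by hand.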
Phase 4: Validate Features and Labels
Before training, confirm the feature schema works with your development data.
Inspection sequence:
- deriva_ml_get_feature(hostname="data.example.org", catalog_id="1", target_table="Image", feature_name="<feature_name>"): confirm column structure (required vs optional)
- list_vocabulary_terms(hostname="data.example.org", catalog_id="1", schema="<schema>", table="<vocab_name>"): confirm valid term values
- deriva_ml_list_feature_values(hostname="data.example.org", catalog_id="1", target_table="Image", feature_name="<feature_name>", selector="newest"): check that labels exist for your dev records
If labels are missing, add them to the development dataset before training. Feature-value writes go through a bundled script template — copy skills/create-feature/scripts/populate_feature_values.py to src/scripts/populate_diagnosis.py, edit the target table and feature name, commit, then run:
uv run python src/scripts/populate_diagnosis.py \
--hostname data.example.org --catalog-id 1 \
--workflow-type Annotation \
--csv ./labels/dev_diagnosis.csv \
--target-table Image --feature-name Diagnosis
The template opens a context manager, with ml.create_execution(config, workflow=workflow) as exe:, stages the records inside it via exe.add_features(records), and calls exe.commit_output_assets() after the with block exits to flush them to the catalog. Pydantic validates each row against the feature's term vocabulary, so mismatched terms fail loudly.
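A cheap pre-check on the CSV itself can surface bad terms before the script ever runs. The sketch below uses only the standard library and is independent of the template; the RID and Diagnosis column names and the term list are assumptions about the CSV layout, so adjust them to whatever your file and vocabulary actually contain.

```python
import csv
import io


def find_invalid_terms(csv_text, column, valid_terms):
    """Return (line_number, value) pairs whose value is not a known term."""
    valid = set(valid_terms)
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data starts on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        value = (row.get(column) or "").strip()
        if value not in valid:
            bad.append((line_number, value))
    return bad


# Made-up sample: the second data row has a case mismatch.
sample = "RID,Diagnosis\n1-ABC,Normal\n1-ABD,normal\n1-ABE,Glaucoma\n"
print(find_invalid_terms(sample, "Diagnosis", ["Normal", "Glaucoma"]))
```

The comparison is deliberately case-sensitive: a vocabulary term and a near-miss spelling are different values as far as validation is concerned.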
Verify the full pipeline by denormalizing:
deriva_ml_denormalize_dataset(
hostname="data.example.org",
catalog_id="1",
dataset_rid="<dev_dataset>",
include_tables=["Image", "Image_Diagnosis"],
limit=20,
)
This shows you exactly what the training pipeline will see.
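With the denormalized rows in hand, two things are worth checking: every record has a label, and every class appears. A minimal sketch over a list of dicts follows. The row shape and the Image_Diagnosis.Diagnosis column name are assumptions made for illustration; the real column names depend on your schema and on what the tool returns.

```python
from collections import Counter


def label_coverage(rows, label_column):
    """Count rows per label and collect the indexes of rows with no label."""
    counts = Counter()
    unlabeled = []
    for index, row in enumerate(rows):
        label = row.get(label_column)
        if label in (None, ""):
            unlabeled.append(index)
        else:
            counts[label] += 1
    return counts, unlabeled


# Made-up rows standing in for denormalized output.
rows = [
    {"Image.RID": "1-A", "Image_Diagnosis.Diagnosis": "Normal"},
    {"Image.RID": "1-B", "Image_Diagnosis.Diagnosis": "Glaucoma"},
    {"Image.RID": "1-C", "Image_Diagnosis.Diagnosis": None},
]
counts, unlabeled = label_coverage(rows, "Image_Diagnosis.Diagnosis")
print(dict(counts), "unlabeled row indexes:", unlabeled)
```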
Phase 5: Tier 1 — Dry Run
A dry run validates configuration without creating execution records or writing to the catalog.
With the CLI
# Resolve and print the config without running
uv run deriva-ml-run +experiment=my_experiment --cfg job
# Dry run — downloads data but doesn't create execution records
uv run deriva-ml-run +experiment=my_experiment dry_run=true
With the bundled script template
skills/execution-lifecycle/scripts/basic_execution.py exposes --dry-run as a top-level argparse flag — it builds the ExecutionConfiguration and opens with ml.create_execution(config, workflow=workflow, dry_run=True) so the runner validates inputs without creating execution records or committing outputs:
uv run python src/scripts/train_resnet50.py \
--hostname data.example.org --catalog-id 1 \
--workflow-type Training \
--dry-run
For ad-hoc validation without writing a script, use deriva_ml_validate_execution_configuration (see Phase 6 pre-flight) — it's the metadata-only equivalent that doesn't pay the bag-download cost.
What dry_run validates
- ✅ Config resolves without errors
- ✅ Dataset RIDs and versions exist (the runner calls get_entities(...) per candidate table internally)
- ✅ Asset RIDs exist and are downloadable
- ✅ Data loading code runs without errors
- ✅ Model initialization works
- ❌ Does NOT write execution records to the catalog
- ❌ Does NOT upload outputs
Fix problems at this tier
Common tier 1 failures:
- Missing or wrong dataset RID/version → fix config
- Missing vocabulary terms → add terms before proceeding
- Import errors → fix code
- Config schema mismatch → fix config structure
Phase 6: Tier 2 — Small-Data Run
Run a real execution against your development dataset. This creates catalog records and tests the full pipeline end-to-end.
Pre-flight checklist
- Validate the full config with deriva_ml_validate_execution_configuration(hostname=..., catalog_id=..., config={...}). One call confirms every dataset RID, every dataset version, every asset RID, and the workflow, and surfaces cross-spec issues (duplicate RIDs, version conflicts, role conflicts). It is a cheap metadata-only pre-flight that doesn't pay the bag-download cost that dry_run=True does.
- Check cache status: deriva_ml_bag_info(hostname=..., catalog_id=..., dataset_rid="...", version="...")
- If anything reads not_cached, pre-fetch via the bundled manage-deriva-storage template: uv run python src/scripts/warm_cache.py --hostname ... --catalog-id ... --dataset-rid ... --version ...
- Code committed and version bumped (bump_version(bump_type="patch"))
Run with small data
# Point at your development dataset
uv run deriva-ml-run +experiment=my_experiment \
datasets.training.rid=<dev_rid> \
datasets.training.version=<dev_version> \
model_config.epochs=3
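If you launch runs from a script rather than by hand, the same overrides can be assembled from a dictionary so that the tier 2 and tier 3 invocations differ only in data. This is a sketch under that assumption: build_run_command is a hypothetical helper, and it uses only the command name and override keys shown above.

```python
import shlex


def build_run_command(experiment, overrides):
    """Assemble the deriva-ml-run argument list with key=value overrides."""
    args = ["uv", "run", "deriva-ml-run", f"+experiment={experiment}"]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


cmd = build_run_command(
    "my_experiment",
    {
        "datasets.training.rid": "<dev_rid>",          # placeholder, as above
        "datasets.training.version": "<dev_version>",  # placeholder, as above
        "model_config.epochs": 3,
    },
)
print(shlex.join(cmd))  # shell-quoted form of the command
# To launch it from Python instead: subprocess.run(cmd, check=True)
```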
Verify outputs
After the run completes:
- Check execution status: deriva_ml_get_execution(hostname=..., catalog_id=..., execution_rid="...")
- Verify outputs were uploaded: call deriva_ml_lookup_asset(hostname=..., catalog_id=..., asset_rid="...") for each output asset (or deriva_ml_find_workflow_executions(hostname=..., catalog_id=..., workflow_rid="...") for the broader query).
- Inspect output files: download and examine predictions, metrics, model weights
- Check provenance chain: deriva_ml_list_execution_children(hostname=..., catalog_id=..., execution_rid="...") for descendants and deriva_ml_list_execution_parents(hostname=..., catalog_id=..., execution_rid="...") for ancestors.
- Compare against prior runs: fetch the same feature across this run and the last few runs in one call, deriva_ml_list_feature_values(hostname=..., catalog_id=..., target_table=..., feature_name=..., execution_rids=["<this_rid>", "<prev_rid_1>", "<prev_rid_2>"]). The execution_rids= filter runs server-side, so it's one round trip instead of N. This detects regressions early. See /deriva-ml:compare-model-runs for the full ranking pattern.
Fix problems at this tier
Common tier 2 failures:
- Data shape mismatches → fix data loading or preprocessing
- NaN/Inf in training → fix normalization or learning rate
- Output upload failures → fix asset_file_path registration
- Wrong number of classes → check vocabulary and feature values
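For the NaN/Inf case in particular, a guard on the logged loss values stops a short run at the first bad step instead of letting it finish with useless weights. The sketch below is framework-agnostic and assumes you can get per-step losses as plain floats; first_non_finite is a hypothetical helper name.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN/Inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Made-up loss history with a NaN at step 2.
history = [0.92, 0.71, float("nan"), 0.65]
step = first_non_finite(history)
if step is not None:
    print(f"non-finite loss at step {step}; check normalization or learning rate")
```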
Phase 7: Tier 3 — Production Run
Only after tiers 1 and 2 succeed, scale to the full dataset.
Create the production dataset
If you don't already have one, see the dataset-lifecycle skill for:
- Creating and populating the full dataset
- Splitting into train/val/test via a script that calls the Python API split_dataset(ml, source_rid, exe, ...)
- Stratifying by label distribution
Pre-production checklist
| Step | Tool | Purpose |
|---|---|---|
| 1 | deriva_ml_validate_execution_configuration(hostname=..., catalog_id=..., config={...}) | Confirms all dataset RIDs + versions exist, all asset RIDs exist, workflow is valid, no cross-spec conflicts; single metadata-only call (cheaper than dry_run, which downloads bags) |
| 2 | deriva_ml_bag_info(hostname=..., catalog_id=..., dataset_rid=...) | Check dataset sizes and cache status |
| 3 | uv run python src/scripts/warm_cache.py --hostname ... --catalog-id ... --dataset-rid ... | Pre-fetch large datasets via the manage-deriva-storage template |
| 4 | uv run bump-version <type> (or bump_version("<type>") MCP) | Tag the code version; see decision matrix below for <type> |
| 5 | git status | Confirm clean working tree |
| 6 | Verify experiment description | Will be recorded in execution |
Choosing the version bump type:
| Component | When to use | Examples |
|---|---|---|
| patch | Bug fixes, small parameter tweaks | Fixed mislabeled records, tightened a loss function, adjusted a hyperparameter default |
| minor | New experiment configurations, new model architectures | Added a new model variant, added a new dataset split, new hydra-zen experiment preset |
| major | Breaking changes to the training pipeline or data format | Restructured the catalog schema, broke backwards compatibility with prior bag exports |
Commit the version bump before running. The git tag created by bump-version becomes the version recorded in the execution metadata.
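The three bump types follow the usual MAJOR.MINOR.PATCH convention. As a reminder of what each one does to a version string, here is a sketch of conventional semantic-version arithmetic. It is not the bump-version implementation, and the exact tag format that tool writes is not shown in this document, so treat the version strings as illustrative.

```python
def bump(version, bump_type):
    """Return the next MAJOR.MINOR.PATCH string for the given bump type."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if bump_type == "major":
        return f"{major + 1}.0.0"  # breaking change: reset minor and patch
    if bump_type == "minor":
        return f"{major}.{minor + 1}.0"  # new capability: reset patch
    if bump_type == "patch":
        return f"{major}.{minor}.{patch + 1}"  # fix or small tweak
    raise ValueError(f"unknown bump type: {bump_type!r}")


for kind in ("patch", "minor", "major"):
    print(kind, "->", bump("1.4.2", kind))
```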
Run production
uv run deriva-ml-run +experiment=my_experiment
Or for hyperparameter sweeps:
uv run deriva-ml-run +multirun=lr_sweep
After production run
- Verify all child executions completed (for multiruns)
- Check output assets exist and have expected sizes
- Record the execution RID and a one-line characterization of what the run was for in tacit-knowledge.md (via the capture-tacit-knowledge skill). The execution RID is the durable anchor: assets, features, status, inputs, and the workflow's git commit hash all hang off it. Do not enumerate asset RIDs in the notes; they go stale and the catalog already has them linked to the execution.
- If the execution attached features to the dataset's members (e.g., ground-truth labels, curated annotations, derived attributes that future consumers should see), record the drift and mint a release. Per ADR-0003, feature drift is not auto-detected by the dataset-mutation tools; call dataset.mark_dev(description) from the Python API to declare a dev period (which flips current_version to a .devN label), then deriva_ml_release_dataset(...) to mint a release that captures the new feature values. See the dataset-lifecycle skill, Phase 6 (Version).
- Do NOT bump for runs whose outputs are assets only (model weights, training logs, prediction CSVs, plots). Execution-output assets are linked to the execution, not to the dataset's members; future consumers reach them through the execution RID, so the dataset doesn't need a new version.
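For the size check, one option is to download the output files and compare them against rough lower bounds, since a truncated upload usually shows up as a file that is far too small. The sketch below works on local paths only and does not query the catalog; check_outputs, the file names, and the byte thresholds are all hypothetical.

```python
from pathlib import Path


def check_outputs(expected):
    """Return a list of problems for files that are missing or too small.

    `expected` maps a local file path to the minimum size in bytes it should have.
    """
    problems = []
    for path, min_bytes in expected.items():
        p = Path(path)
        if not p.is_file():
            problems.append(f"{path}: missing")
            continue
        size = p.stat().st_size
        if size < min_bytes:
            problems.append(f"{path}: {size} bytes, expected at least {min_bytes}")
    return problems


# Hypothetical output files from a downloaded execution.
issues = check_outputs({"outputs/model.pt": 1_000_000, "outputs/metrics.json": 10})
print(issues if issues else "all outputs present and plausibly sized")
```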
Phase 8: Iterate
Validate against the model-design first. Before iterating, check the run's
results against the Validation criteria in the model-design doc — did it hit
the target metric and threshold? This is the Validate phase of the arc: success
is measured against the design's stated criteria, not just "the pipeline ran."
Record the verdict in the design doc (Status → Validated) and tacit-knowledge.md.
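The check itself is a comparison of each reported metric against the threshold the design doc states, with the direction made explicit so that a loss is not judged like an accuracy. A minimal sketch, assuming the criteria have been copied out of the design doc by hand; the Criterion fields, metric names, and numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One Validation entry from the design doc (field names are hypothetical)."""

    metric: str
    threshold: float
    higher_is_better: bool = True


def meets(criterion, results):
    """True if the run's metric reaches the threshold in the right direction."""
    value = results[criterion.metric]
    if criterion.higher_is_better:
        return value >= criterion.threshold
    return value <= criterion.threshold


criteria = [
    Criterion("val_auc", 0.90),
    Criterion("val_loss", 0.35, higher_is_better=False),
]
results = {"val_auc": 0.93, "val_loss": 0.41}  # made-up numbers
for c in criteria:
    print(c.metric, "PASS" if meets(c, results) else "FAIL")
```

A run only counts as validated when every criterion passes; a single FAIL is worth recording in the verdict alongside the execution RID.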
ML development is iterative. After each production run:
- Analyze results: use deriva_ml_denormalize_dataset(hostname=..., catalog_id=..., dataset_rid=...) or download the bag to examine predictions
- Identify improvements: more data? Better labels? Different architecture?
- Go back to the appropriate tier:
  - Config change only → Tier 1 (dry run)
  - New feature or data pipeline change → Tier 2 (small-data run)
  - Ready for next experiment → Tier 3 (production run)
Never jump straight back to Tier 3 after a significant change. Always validate with tiers 1–2 first.
The first hypothesis-driven experiment after cycle zero authors a
docs/design/experiment/<slug>.md via /deriva-ml:design-experiment — the
design-first phase experiment-lifecycle opens with. Cycle zero validated the
plumbing; from here every experiment starts with a design doc.
Git workflow
Cross-cutting across all phases — applies whenever you're committing code that an execution will eventually run.
- Use feature branches for all work: git checkout -b feature/add-segmentation-model. Keep main clean and passing.
- Use pull requests, even solo. PRs create a permanent record of what changed and why. The PR description becomes part of the project's institutional memory alongside tacit-knowledge.md. With the GitHub CLI (gh) installed, Claude can create PRs, review diffs, and merge directly from the terminal.
- Commit before running. DerivaML enforces git-clean for executions (DerivaMLDirtyWorkflowError). Use --allow-dirty only for debugging iterations; the resulting execution has degraded provenance. See /deriva-ml:execution-lifecycle for the canonical commit-before-running discipline.
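To catch a dirty tree before launching anything, a wrapper script can ask git directly. The sketch below shells out to git status --porcelain and treats any output as dirty. It is not how DerivaML performs its own check (that logic is not shown here), and the helper names are hypothetical.

```python
import subprocess


def dirty_paths(porcelain_output):
    """Parse `git status --porcelain` output into a list of changed paths."""
    paths = []
    for line in porcelain_output.splitlines():
        if line.strip():
            # Each line is two status columns, one space, then the path.
            paths.append(line[3:])
    return paths


def working_tree_is_clean(repo_dir="."):
    """True if git reports no modified, staged, or untracked files."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return not dirty_paths(result.stdout)


# Parsing a captured sample, so this runs without a repository.
print(dirty_paths(" M src/train.py\n?? notes.txt\n"))
```

Note that untracked files count as dirty here, which is stricter than only looking at modified tracked files; relax that if your project keeps scratch files outside .gitignore.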
Extending DerivaML
If you need project-specific helpers that wrap DerivaML behavior, prefer inheritance over modifying the library:
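import pandas as pd  # needed for the pd.DataFrame return annotation
from deriva_ml import DerivaML

class MyProjectML(DerivaML):
    """Extended DerivaML with project-specific helpers."""

    def load_training_data(self, dataset_rid: str) -> pd.DataFrame:
        # Project-specific loading logic goes here.
        ...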
This keeps the project-specific logic in your repository (versioned, reviewable) while inheriting all of DerivaML's behavior. Avoid monkey-patching DerivaML methods at runtime — those changes don't show up in git diff and break debuggability.
Quick Reference: Which Skill for What
| Task | Skill | Plugin |
|---|---|---|
| Design tables, columns, FKs | /deriva:create-table | deriva-skills |
| Load row data into tables (CSV/JSON, asset uploads via deriva-upload-cli) | /deriva:load-data | deriva-skills |
| Create vocabularies and terms | /deriva:manage-vocabulary | deriva-skills |
| Create features for annotations | create-feature | this plugin |
| Create/split/version datasets | dataset-lifecycle | this plugin |
| Run experiments with provenance | execution-lifecycle | this plugin |
| Upload/download/track assets | work-with-assets | this plugin |
| Restructure data for PyTorch/TF | ml-data-engineering | this plugin |
| Write hydra-zen configs | configure-experiment; write-hydra-config for syntax | this plugin |
| Run notebooks with tracking | run-notebook | this plugin |
| Document decisions | capture-tacit-knowledge | this plugin |
| Iterate on an existing pipeline (cycle 2 onward; once cycle zero, this skill, is done) | experiment-lifecycle | this plugin |