name: ds-baseline
description: Use when a quest needs to attach, import, reproduce, repair, verify, compare, or publish a baseline and its metrics.
skill_role: stage
license: MIT
metadata:
  author: ResearAI/DeepScientist
  version: "1.0.0"
Baseline
This skill establishes the reference system the quest will compare against. The target is one trustworthy baseline line, not an endless reproduction diary.
Interaction discipline
- Follow the shared interaction contract injected by the system prompt.
- Keep ordinary setup and debugging updates concise.
- Use richer milestone updates only when the baseline becomes trusted, caveated, blocked, waived, or route-changing.
- Hard execution rule: every terminal command in this stage must go through `bash_exec`; do not use any other terminal path for setup, reproduction, monitoring, verification, Git, Python, package-manager, or file-inspection commands.
- Prefer `bash_exec` for setup, reproduction, monitoring, and verification commands so the baseline line stays durable and auditable.
Tool discipline
- Do not use native `shell_command` / `command_execution` in this skill.
- All shell, CLI, Python, bash, node, git, npm, uv, and environment work must go through `bash_exec(...)`.
- For git work inside the current quest repository or worktree, prefer `artifact.git(...)` before raw shell git commands.
- If a generic git smoke test is needed outside the quest repo, use `bash_exec(...)` in an isolated scratch repository.
Non-negotiable rules
- no fabricated metrics, logs, run status, or success claims
- do not skip baseline steps or silently simplify the route when that would change trust or comparability
- do not claim a baseline is ready before verification is complete
- do not infer missing commands, scripts, or parameters when the uncertainty could change the result
- any unavoidable guess must be written down explicitly with expected impact
- use web search for discovering papers or repos, but use `artifact.arxiv(paper_id=..., full_text=False)` for actually reading a source arXiv paper when it exists
- set `full_text=True` only when the short form is insufficient
- for Python baselines, environment setup should be standardized around `uv`
Stage purpose
The baseline stage should produce a usable reference point through one of four routes:
- attach an existing reusable baseline
- import a reusable baseline package
- reproduce a baseline from source
- repair a broken or stale baseline
Keep the classic control flow:
- analysis
- setup
- execution
- verification
These are control gates, not paperwork walls.
Quick workflow
- Read the source paper and source repo first, or record exactly what is missing and why.
- Choose the lightest trustworthy route: attach, import, reproduce, or repair.
- Start with the fast path whenever the current baseline object, command path, and acceptance target are already clear enough to validate cheaply.
- Before substantial baseline setup, code edits, or a real baseline run, create `PLAN.md` and `CHECKLIST.md`; short-form files are enough for simple fast-path work.
- Keep one dominant phase visible: analysis -> setup -> execution -> verification.
- Prefer one clean implementation pass, one smoke test, and then one normal baseline run.
- Retry only when smoke, verification, or runtime evidence shows a concrete failure or incompatibility.
- Close the stage by confirming or waiving the gate, then hand off with a concise 1-2 sentence summary of trust status and next anchor.
Fast-path first
Default to the lightest baseline path that can still establish a trustworthy comparison.
Fast path is the default when any of the following is true:
- `requested_baseline_ref` or `confirmed_baseline_ref` already points to the active baseline object
- the route is clearly `attach` or `import`
- the repo entrypoint, dataset or split, and metric contract are already concrete enough to validate cheaply
- reproduction requires no meaningful code changes and the main uncertainty is only whether the command still runs
Fast path means:
- do not restart broad baseline discovery by default
- do not front-load a full codebase audit when the entrypoint is already concrete
- use a minimal `PLAN.md`, a minimal `CHECKLIST.md`, one bounded smoke test when needed, and then one real validation or run
- default to reuse-and-verify when runtime already attached a concrete baseline
Escalate from fast path to fuller audit only when:
- the paper and repo disagree materially
- the real run or eval entrypoint is unclear
- code changes are likely required
- the contract spans multiple metrics, datasets, subtasks, or splits that still need interpretation
- the same failure class reappears after one documented autonomous fix
- the quest is trying to publish a reusable global baseline rather than only clear the current gate
Use when
- no credible baseline exists yet
- the current baseline is unverified or stale
- the user already has a baseline package that should be attached or imported
- a reproduction failed earlier and now needs repair
- the quest resumed and the baseline trust state is unclear
Do not use when
- the quest already has a verified active baseline and the next move is ideation or execution
- the user explicitly waived the baseline gate and that waiver is durably recorded
Stage gate
Do not proceed to comparison-heavy downstream work unless one of the following is durably true:
- a baseline has been attached and accepted
- a baseline has been imported and accepted
- a baseline reproduction has completed and been verified
- an explicit waiver decision exists with a clear reason
Operationally:
- call `artifact.confirm_baseline(...)` once the accepted baseline root and trusted comparison contract are clear
- call `artifact.waive_baseline(...)` when the quest must continue without a baseline
- attach, import, or publish alone do not open the downstream gate
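The gate conditions above can be read as a small predicate. The following is a minimal sketch, assuming a hypothetical `BaselineGateState` record whose field names are purely illustrative and are not part of the runtime; the real gate is opened by `artifact.confirm_baseline(...)` or `artifact.waive_baseline(...)`.

```python
from dataclasses import dataclass


@dataclass
class BaselineGateState:
    """Hypothetical snapshot of the gate; field names are illustrative only."""
    attached: bool = False
    imported: bool = False
    reproduced: bool = False
    verified: bool = False
    accepted: bool = False
    published: bool = False
    waiver_reason: str = ""


def gate_is_open(state: BaselineGateState) -> bool:
    """Return True only when one of the four durable gate conditions holds."""
    # An explicit waiver needs a non-empty reason.
    if state.waiver_reason.strip():
        return True
    # Attach or import only counts once the baseline is also accepted.
    if (state.attached or state.imported) and state.accepted:
        return True
    # A reproduction must have completed and been verified.
    if state.reproduced and state.verified:
        return True
    # Publishing (or attaching/importing without acceptance) never opens the gate.
    return False
```

Note that `published` is deliberately never consulted: publication alone does not change the outcome.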
Required plan and checklist
Before substantial baseline setup, code edits, or a real baseline run, create a quest-visible PLAN.md and CHECKLIST.md.
- Use `references/baseline-plan-template.md` as the canonical structure for `PLAN.md`.
- Use `references/baseline-checklist-template.md` as the canonical structure for `CHECKLIST.md`.
- `analysis_plan.md` and `REPRO_CHECKLIST.md` remain acceptable compatibility alias files when an older quest already depends on them.
- For fast-path attach/import/prebound validation or a simple reproduce path with no expected code changes, short-form `PLAN.md` and `CHECKLIST.md` are enough.
- The plan should put the user's explicit requirements and non-negotiable constraints first.
- Then record the chosen route, source identity, command path, expected outputs, acceptance condition, safe efficiency levers, main risks, and fallback.
- If the route, commands, source package, fallback path, or trust judgment changes materially, revise `PLAN.md` before continuing.
- Once the route is concrete, stop reshaping code and commands speculatively.
Default retry discipline:
- do not rerun the same unchanged smoke command just to reconfirm the same fact
- treat one autonomous retry for the same failure class as the normal upper bound
- if the same failure class appears again, switch explicitly into `repair`, record `blocked`, or route through `decision`
Required durable outputs
The baseline stage should usually leave behind:
- a baseline directory under `baselines/local/` or `baselines/imported/`
- `PLAN.md` and `CHECKLIST.md`
- a verification note or report
- command, config, environment, and metrics pointers
- a baseline artifact
- a confirmed baseline gate via `artifact.confirm_baseline(...)`, or an explicit waiver via `artifact.waive_baseline(...)`
- an optional registry publication if the baseline is reusable beyond this quest
For simple attach/import flows or a straightforward reproduce flow, do not stall just to precreate every optional note file.
Useful optional notes:
- `setup.md`
- `execution.md`
- `verification.md`
- `STRUCTURE.md` when the layout is non-obvious
File-by-file contract
- `PLAN.md` or compatibility alias `analysis_plan.md` is the required route contract before substantial setup, code edits, or a real run; it should state the route, source identity, command path, expected outputs, acceptance condition, main risks, and fallback.
- `CHECKLIST.md` or compatibility alias `REPRO_CHECKLIST.md` is the required living state tracker; it should show whether the baseline object, smoke decision, real run decision, and final accept / block / waive outcome are explicit.
- `setup.md` is optional unless environment or layout choices are non-trivial; if used, record the working directory, environment route, important config paths, source revision, and notable setup deviations.
- `execution.md` is optional unless the run is long, multi-step, or rerun-heavy; if used, record the launched commands, durable log paths, checkpoints, exit state, and any reruns or repairs.
- `verification.md` is optional as a filename but required in substance before acceptance or blocked closeout; either this file or an equivalent report should record trusted metrics, expected-versus-observed comparison, caveats, canonical output paths, and the next anchor.
- `STRUCTURE.md` becomes required when the workspace layout, mounts, symlinks, or generated outputs are non-obvious or meant for reuse; it should map the important directories and say which paths are canonical.
- `attachment.yaml` is required for attached or imported baselines under `baselines/imported/`; preserve source identity, selected variant when relevant, and attachment provenance there.
- `<baseline_root>/json/metric_contract.json` is the canonical accepted comparison contract; once the baseline is accepted, do not leave the authoritative metric surface only in chat, memory, or prose.
- `Result/metric.md` is scratch-only; it may help during execution, but it is never the final source of truth.
Minimum stability rules:
- before the first real run, leave one durable note with the chosen route, expected command path, target outputs, and main risks
- after each smoke test or real run, record what actually happened and whether the route still looks viable
- before acceptance, leave a clear verification note and baseline gate decision
- every accepted baseline should leave one accepted baseline artifact
- every blocked baseline line should leave one blocked report and one next-step decision
- if one rolling note is enough for a simple baseline line, use it
Durable path contract
Use the real runtime paths consistently.
Quest-local paths:
- reproduced baseline root: `<quest_root>/baselines/local/<baseline_id>/`
- attached or imported baseline root: `<quest_root>/baselines/imported/<baseline_id>/`
- attachment record: `<quest_root>/baselines/imported/<baseline_id>/attachment.yaml`
- canonical baseline metric contract JSON: `<baseline_root>/json/metric_contract.json`
- baseline artifact record: `<quest_root>/artifacts/baselines/<artifact_id>.json`
- baseline reports: `<quest_root>/artifacts/reports/<artifact_id>.json`
- confirmed baseline reference: `quest.yaml -> confirmed_baseline_ref`
Global reusable registry paths:
- baseline registry index: `~/DeepScientist/config/baselines/index.jsonl`
- canonical baseline entry: `~/DeepScientist/config/baselines/entries/<baseline_id>.yaml`
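The path contract above can be encoded once so that later steps do not rebuild paths by hand. This is a minimal sketch using `pathlib`; the helper names (`baseline_root`, `metric_contract_path`, `attachment_record`, `registry_entry`) are hypothetical and only mirror the layout listed above.

```python
from pathlib import Path
from typing import Optional


def baseline_root(quest_root: Path, baseline_id: str, imported: bool) -> Path:
    """Reproduced baselines live under local/, attached or imported ones under imported/."""
    kind = "imported" if imported else "local"
    return quest_root / "baselines" / kind / baseline_id


def metric_contract_path(root: Path) -> Path:
    """Canonical metric contract JSON inside a baseline root."""
    return root / "json" / "metric_contract.json"


def attachment_record(quest_root: Path, baseline_id: str) -> Path:
    """Attachment record for an attached or imported baseline."""
    return baseline_root(quest_root, baseline_id, imported=True) / "attachment.yaml"


def registry_entry(baseline_id: str, home: Optional[Path] = None) -> Path:
    """Global registry entry; `home` defaults to the current user's home directory."""
    base = home if home is not None else Path.home()
    return base / "DeepScientist" / "config" / "baselines" / "entries" / f"{baseline_id}.yaml"
```

Keeping these helpers in one place makes it harder to invent a parallel durable location by accident.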
Baseline id and variant rules
- `baseline_id` should be short, stable, and filesystem-safe
- use letters, digits, `.`, `_`, or `-`
- do not use spaces, `/`, `\`, or `..`
- if one codebase contains multiple comparable baselines, prefer one `baseline_id` with structured variants instead of inventing many near-duplicate entries
- when variants exist, keep `default_variant_id`, `baseline_variants`, and per-variant metric summaries stable enough that later `experiment` and `write` stages can cite them directly
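The character rules for `baseline_id` are easy to check mechanically. The sketch below is one possible validator, not the runtime's own check; `is_valid_baseline_id` is a hypothetical name, it reads the `..` rule strictly (any occurrence is rejected, an assumption), and it does not enforce "short" because no length limit is stated.

```python
import re

# Allowed characters per the rules above: letters, digits, '.', '_', '-'.
_ALLOWED = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_baseline_id(baseline_id: str) -> bool:
    """Return True when the id uses only allowed characters and has no '..'."""
    # fullmatch rejects empty strings, spaces, '/', and '\\' in one step.
    if not _ALLOWED.fullmatch(baseline_id):
        return False
    # Assumption: reject '..' anywhere, not only as the whole id.
    if ".." in baseline_id:
        return False
    return True
```

For example, `is_valid_baseline_id("resnet50-imagenet_v1.0")` passes, while `"my baseline"`, `"a/b"`, and `"a..b"` are rejected.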
Do not invent parallel durable locations when these runtime contracts already exist. Do not leave the authoritative metric contract only in chat, memory, or prose once the baseline is accepted.
If a baseline is reproduced only because an analysis campaign needs an extra comparator:
- still place it under the normal baseline roots
- treat it as a supplementary analysis baseline unless the quest explicitly promotes it into the canonical gate
- do not call `artifact.confirm_baseline(...)` for that supplementary case unless the quest truly intends to replace the canonical baseline
Multi-baseline policy
One quest may legitimately need more than one baseline.
- explicitly mark which baseline is the primary downstream comparator
- distinguish primary comparison baselines from fallback or infrastructure baselines
- if several baselines are credible, record why the chosen primary baseline is the fairest paper-facing comparator
- do not leave later stages guessing which baseline is authoritative
Route order
Prefer this order:
- attach
- import
- reproduce
- repair
Prefer reuse over redundant reproduction.
Workflow
Phase 1. Analysis
Before running anything substantial, determine:
- exact task
- dataset and split contract
- metric contract
- source baseline identity
- source code path
- expected run command or evaluation path
- expected paper or repo numbers when they exist
- local resource constraints
Default analysis discipline:
- read the source paper and source repo first
- if runtime already exposes a matching `requested_baseline_ref` or `confirmed_baseline_ref`, validate that concrete object before restarting broad discovery
- identify the real run or evaluation entrypoint
- identify the dataset or split and metric contract
- identify likely environment blockers
- define the cheapest credible smoke test
Escalate to a fuller audit only when the command path is unclear, the repo is large or confusing, repair mode is active, or custom code changes look likely.
When the fuller audit is necessary, capture only what later stages truly need:
- major entry scripts, configs, and modules
- end-to-end data flow
- evaluation path and metric computation path
- obvious environment assumptions
- obvious bottlenecks or incompatibilities
If the source paper is available, record:
- the core algorithm in compact, implementation-faithful form
- the main reported numbers
- the main weaknesses or bottlenecks likely to matter for this quest
You may inspect local feasibility with shell-based checks for OS, GPU, CPU, RAM, disk, Python version, and whether uv is available.
The analysis phase should leave behind a concrete plan rather than only conversational intent.
Phase 2. Setup
Prepare the selected route:
- attach: validate the selected baseline id and variant
- import: place the imported baseline metadata under the quest and confirm the package is readable
- reproduce: prepare the baseline work directory, commands, config pointers, and environment notes
- repair: identify the precise broken point before rerunning blindly
For Python baselines, standardize environment setup around uv.
Python environment rule: use uv
- if the repo already contains `uv.lock` or a solid `pyproject.toml`, use `uv sync`
- otherwise create a local virtual environment with `uv venv`
- install dependencies with `uv pip install ...`
- run setup, smoke tests, and real commands through `uv run ...`
Practical rules:
- prefer a quest-local or baseline-local `.venv`
- prefer `uv run python ...` or `uv run bash ...` over relying on shell activation state
- if a specific interpreter is required, make it explicit with `uv venv --python 3.11` or `uv run --python 3.11 ...`
- if CUDA, PyTorch, JAX, or custom wheels require a special index URL, keep that install under `uv pip`
- only accept a non-`uv` route when there is a concrete blocker that cannot be resolved locally
Common uv patterns:
- `uv sync`
- `uv venv --python 3.11`
- `uv pip install -r requirements.txt`
- `uv run python scripts/smoke_test.py`
- `uv run python train.py --config ...`
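The choice between `uv sync` and the `uv venv` plus `uv pip install` route can be sketched as a small planner. This is an illustration under stated assumptions: `plan_uv_setup` and `plan_for_repo` are hypothetical helpers, the mere presence of `pyproject.toml` is treated as "solid" (a real check would need judgment), and the function only returns argument lists. It does not run anything; actual execution still goes through `bash_exec`.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def plan_uv_setup(repo_files: Iterable[str],
                  python_version: Optional[str] = None) -> List[List[str]]:
    """Return the uv commands to run, as argv lists, for a repo's top-level files."""
    files = set(repo_files)
    # A lockfile or project file means the repo is already uv-manageable.
    # Assumption: any pyproject.toml counts as "solid" here.
    if "uv.lock" in files or "pyproject.toml" in files:
        return [["uv", "sync"]]
    # Otherwise create a local virtual environment, pinning Python if requested.
    venv = ["uv", "venv"]
    if python_version:
        venv += ["--python", python_version]
    commands = [venv]
    # Install from requirements.txt only when the file exists.
    if "requirements.txt" in files:
        commands.append(["uv", "pip", "install", "-r", "requirements.txt"])
    return commands


def plan_for_repo(repo: Path, python_version: Optional[str] = None) -> List[List[str]]:
    """Convenience wrapper that inspects the top level of a repo directory."""
    return plan_uv_setup((p.name for p in repo.iterdir()), python_version)
```

Recording the returned plan in the setup notes also documents the chosen `uv` route and Python version.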
Setup should record:
- baseline id and source identity
- working directory
- config files
- command template
- expected outputs
- known deviations from paper or source
- the chosen `uv` route and Python version
Fallbacks:
- if Hugging Face access is blocked, record and try an approved local mirror such as ModelScope when that does not change the comparison meaning
- if a quest already depends on `analysis_plan.md` or `REPRO_CHECKLIST.md`, keep the compatibility alias explicit rather than splitting truth across two active plans
Phase 3. Execution
Run only the work required to establish the baseline credibly.
Execution rules:
- keep commands auditable
- keep logs durable
- avoid uncontrolled side experiments during baseline establishment
- checkpoint only explainable, minimal code changes
- prefer equivalence-preserving efficiency gains such as larger safe batch size, cache reuse, checkpoint resume, and parallel downloads or workers
- do not use an efficiency lever if it changes accepted baseline meaning, effective evaluation contract, or trust judgment
Long-running execution discipline:
- run one bounded smoke test before a substantial baseline reproduction
- once the smoke test passes, launch the real baseline reproduction with `bash_exec(mode='detach', ...)`
- monitor by forward progress instead of by short-window completion anxiety
- do not report final success until the command actually finished and the expected result files exist
- if you need to recover ids or inspect session state, use `bash_exec(mode='history')` or `bash_exec(mode='list')`
- `bash_exec(mode='read', id=...)` returns the full saved log when it is 2000 lines or fewer; for longer logs, inspect omitted middle windows with `start` and `tail`
- during monitoring, prefer `bash_exec(mode='read', id=..., tail_limit=..., order='desc')`, and after the first read prefer incremental checks with `after_seq=last_seen_seq`
- use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` as the default staleness clues
- if a run is clearly invalid, wedged, or superseded, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`, document why, and relaunch cleanly
- do not let more than the 30-minute visibility bound pass without a real inspection and a next expected update time
- when the baseline code is under your control, prefer a throttled `tqdm` progress reporter and periodic `__DS_PROGRESS__` markers when feasible
Keep retries bounded:
- one smoke test is the default
- one autonomous fix-and-retry for the same failure class is the normal upper bound
- if the same failure class returns, stop looping
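The bounded-retry rule can be tracked with a per-class counter. The following is a minimal sketch; `RetryBudget` and its return values `"retry"` and `"escalate"` are hypothetical names, where "escalate" stands for switching into repair, recording blocked, or routing through a decision.

```python
from collections import Counter


class RetryBudget:
    """Allow one autonomous fix-and-retry per failure class, then stop looping."""

    def __init__(self, max_retries_per_class: int = 1) -> None:
        self.max_retries_per_class = max_retries_per_class
        self._failures: Counter = Counter()

    def record_failure(self, failure_class: str) -> str:
        """Register a failure and say whether another autonomous retry is allowed."""
        self._failures[failure_class] += 1
        # The first occurrence of a class earns one retry; a repeat does not.
        if self._failures[failure_class] <= self.max_retries_per_class:
            return "retry"
        return "escalate"
```

Different failure classes are counted independently, so a new kind of failure still gets its own single retry.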
Phase 4. Verification
Verification is mandatory before baseline acceptance.
Verify:
- the run actually finished
- the reported metrics came from the intended dataset and split
- the metric definitions match the quest contract
- the result is comparable to the paper, source repo, or selected target
- any deviations are explicitly stated
Classify the outcome as one of:
- `verified_match`
- `verified_close`
- `verified_diverged`
- `broken`
Verification must explicitly separate:
- likely implementation mismatch
- environment mismatch
- data or split mismatch
- expected stochastic variance
- unexplained divergence
Verification should answer:
- whether the baseline is trustworthy enough for downstream comparison
- whether the result is reusable beyond this quest
- whether another repair or rerun is justified
- whether the line should stop here and hand off
A verification report should be self-contained enough that a later stage can answer:
- what was used
- how it was obtained: attach, import, reproduce, or repair
- what commands and configs were used
- what metrics are trusted
- what caveats remain
- whether the result is reusable beyond this quest
Baseline comparability contract
The baseline stage is not complete just because something ran. It is complete when later stages can compare against it fairly.
Before declaring a baseline usable, make the comparability contract explicit:
- task identity
- dataset identity and version
- split contract
- preprocessing boundary
- evaluation script or evaluation path
- required metric keys
- metric directions
- seed policy when relevant
- source commit or source package identity
- known deviations from the source reference
Unless the user explicitly specifies otherwise, treat the original paper's evaluation protocol as the canonical baseline contract.
If any of these fields are still materially unknown, do not pretend the baseline is a clean downstream reference.
For the fuller checklist and verdict meanings, read references/comparability-contract.md.
Feasibility and trust classes
Before acceptance, classify feasibility as one of:
- `full_reproducible`
- `degraded_but_acceptable`
- `blocked`
And classify downstream trust as one of:
- `verified`
- `partially_verified`
- `operational_but_incomparable`
- `failed`
Do not silently upgrade a degraded or merely operational result into a normal trusted baseline.
Minimum baseline artifact content
The accepted baseline artifact should include at least:
- `baseline_id`
- `baseline_kind`
- `path`
- `task`
- `dataset`
- `primary_metric`
- `metrics_summary`
- `environment`
- `source`
- `summary`
If variants exist, also include:
- `default_variant_id`
- `baseline_variants`
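A quick completeness check over the minimum artifact fields can catch omissions before acceptance. This is a sketch only: `missing_artifact_fields` is a hypothetical helper, it checks key presence in a plain dictionary, and it says nothing about the value shapes the runtime expects.

```python
from typing import List, Mapping

REQUIRED_FIELDS = (
    "baseline_id", "baseline_kind", "path", "task", "dataset",
    "primary_metric", "metrics_summary", "environment", "source", "summary",
)
VARIANT_FIELDS = ("default_variant_id", "baseline_variants")


def missing_artifact_fields(artifact: Mapping[str, object],
                            has_variants: bool = False) -> List[str]:
    """List required keys absent from the artifact, in the documented order."""
    required = REQUIRED_FIELDS + (VARIANT_FIELDS if has_variants else ())
    return [key for key in required if key not in artifact]
```

An empty result means the minimum field set is present; it does not mean the baseline is verified.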
Metric-contract rules:
- if the accepted baseline contract includes multiple metrics, datasets, subtasks, or splits, record all of them in `<baseline_root>/json/metric_contract.json`
- keep `primary_metric` as the headline metric only; do not let it erase the rest of the comparison surface
- when confirming a baseline, submit the canonical `metrics_summary` as a flat top-level dictionary keyed by the paper-facing metric ids
- every canonical baseline metric entry should include `description`, either `derivation` or `origin_path`, and `source_ref`
- if the paper reports both aggregate and per-dataset or per-task results, preserve both whenever feasible through `metrics_summary` plus structured rows rather than one cherry-picked scalar
- if the source package already has a richer leaderboard table, structured result file, or `json/metric_contract.json`, reuse that richer contract instead of hand-writing a thinner one that keeps only one averaged scalar
- `Result/metric.md` is optional temporary scratch memory only; reconcile against it before calling `artifact.confirm_baseline(...)`, but do not treat it as a required durable file
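The per-entry rule (a `description`, either `derivation` or `origin_path`, and a `source_ref`) can be checked before confirming a baseline. The sketch below assumes each metric entry is a plain dictionary; the overall shape of `metric_contract.json` is not specified here, so only single entries are validated, and `metric_entry_problems` is a hypothetical name.

```python
from typing import List, Mapping


def metric_entry_problems(entry: Mapping[str, object]) -> List[str]:
    """Return human-readable problems for one canonical metric entry."""
    problems: List[str] = []
    # Empty strings are treated as missing.
    if not entry.get("description"):
        problems.append("missing description")
    # At least one of derivation / origin_path must be present.
    if not (entry.get("derivation") or entry.get("origin_path")):
        problems.append("missing derivation or origin_path")
    if not entry.get("source_ref"):
        problems.append("missing source_ref")
    return problems
```

Running this over every entry, not only the `primary_metric`, helps keep the full comparison surface documented.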
Publication and reuse
Use the registry deliberately, not as an afterthought.
If the result is reusable beyond the current quest:
- publish it through `artifact.publish_baseline(...)`
- ensure the payload includes identity, provenance, trusted metrics, and any variant structure
- set `publish_global: true` only when verification is complete and reuse is justified
If the current quest should reuse an existing baseline:
- attach it through `artifact.attach_baseline(...)`
- preserve the selected `baseline_id`
- preserve the selected `variant_id` when one is used
- keep the attachment durable under `baselines/imported/`
If runtime state already includes requested_baseline_ref or a matching confirmed_baseline_ref:
- default to reuse-and-verify, not rediscovery
- treat a creation-time pre-bound baseline as the active starting point unless you find a concrete incompatibility
- do not rerun broad baseline scouting or full reproduction just because the stage name is `baseline`
For a clearer attach/import/reproduce/repair rubric, read references/route-selection.md.
For reusable-package expectations, read references/publishable-baseline-package.md.
Workspace and branch rules
- treat the baseline workspace as a system-managed reproduction surface, not an unrelated sandbox
- avoid creating a nested authoritative Git lifecycle inside the baseline workspace
- use the quest branch unless isolation is genuinely needed
- if baseline setup is risky or intrusive, prepare an isolated branch or worktree first and record why
- do not proliferate branches without a reason
Memory rules
Stage-start requirement:
- by default, begin every baseline pass with `memory.list_recent(scope='quest', limit=5)`
- then run at least one baseline-relevant `memory.search(...)` before new baseline analysis, repair, or rerun work
- fast-path exception: if the quest already exposes a clear `requested_baseline_ref` or `confirmed_baseline_ref` and the immediate task is only to validate or reattach that concrete baseline, you may skip broad retrieval
Write memory only for reusable lessons such as:
- paper-to-code mismatch notes
- environment incidents
- dataset quirks
- verification caveats
- attach vs import vs reproduce vs repair rationale
When calling memory.write(...), pass tags as an array like ["stage:baseline", "baseline:<baseline_id>", "type:repro-lesson"], not as one comma-joined string.
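To make the array-versus-string distinction concrete, here is a small sketch that builds the tag list and repairs a comma-joined string. Both helpers (`baseline_lesson_tags`, `normalize_tags`) are hypothetical; they only prepare the value that would be passed as `tags` and do not call `memory.write(...)` themselves.

```python
from typing import Iterable, List, Union


def baseline_lesson_tags(baseline_id: str) -> List[str]:
    """Build the tag array in the shape shown above."""
    return ["stage:baseline", f"baseline:{baseline_id}", "type:repro-lesson"]


def normalize_tags(tags: Union[str, Iterable[str]]) -> List[str]:
    """Turn a comma-joined string (the wrong shape) into a clean list of tags."""
    if isinstance(tags, str):
        tags = tags.split(",")
    # Strip whitespace and drop empty fragments such as trailing commas.
    return [tag.strip() for tag in tags if tag.strip()]
```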
Stage-end requirement:
- if baseline work produced a durable reproduction lesson, verification caveat, environment incident, or route rationale, write at least one `memory.write(...)` before leaving the stage
Artifact rules
Typical artifact sequence:
- `progress` for long-running setup or execution checkpoints
- `report` for analysis notes or verification notes
- `decision` for route choice, blocked routing, or accept/reject/rerun/repair calls
- `baseline` only for an accepted baseline record
For stable field shapes, read references/artifact-payload-examples.md.
The baseline handoff should make these items obvious:
- `baseline_id`
- `baseline_variant_id` when relevant
- route used: attach, import, reproduce, or repair
- trusted metrics
- canonical metric contract JSON path
- verification outcome
- reusable or quest-local only
- canonical output paths
- main caveats
- recommended next anchor
If this packet is not obvious from the accepted artifact plus verification note, the baseline line is not stable enough yet.
Failure and blocked handling
Do not hide failures.
If blocked, record the class explicitly:
- `missing_source`
- `missing_code`
- `missing_metric_contract`
- `environment_infeasible`
- `command_unknown`
- `run_failed`
- `verification_failed`
A blocked result must state:
- what failed
- what was tried
- which paths or logs show the issue
- whether the next best move is attach, import, retry, repair, reset, or ask the user
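The blocked-report requirements can be captured in a small record with a self-check. This is a minimal sketch: the `BlockedReport` dataclass and its field names are hypothetical, and the `ask_user` spelling for "ask the user" is an assumption made for illustration. Only the blocked classes are taken verbatim from the list above.

```python
from dataclasses import dataclass, field
from typing import List

BLOCKED_CLASSES = frozenset({
    "missing_source", "missing_code", "missing_metric_contract",
    "environment_infeasible", "command_unknown", "run_failed",
    "verification_failed",
})
# Assumption: 'ask_user' is an illustrative spelling of "ask the user".
NEXT_MOVES = frozenset({"attach", "import", "retry", "repair", "reset", "ask_user"})


@dataclass
class BlockedReport:
    blocked_class: str
    what_failed: str
    what_was_tried: List[str] = field(default_factory=list)
    evidence_paths: List[str] = field(default_factory=list)
    next_move: str = ""

    def problems(self) -> List[str]:
        """Return what is still missing before the report is complete."""
        issues: List[str] = []
        if self.blocked_class not in BLOCKED_CLASSES:
            issues.append("unknown blocked class")
        if not self.what_failed.strip():
            issues.append("what failed is empty")
        if not self.what_was_tried:
            issues.append("nothing recorded as tried")
        if not self.evidence_paths:
            issues.append("no paths or logs cited")
        if self.next_move not in NEXT_MOVES:
            issues.append("next move not chosen")
        return issues
```

A report with an empty `problems()` list states the failure, the attempts, the evidence, and the next move explicitly.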
Reasonable autonomous fixes before escalation:
- missing module or dependency
- wrong dataset path
- permission errors on scripts
- reasonable batch-size reductions for OOM
- obvious environment activation mistakes
If a fix would change confirmed scope, metrics, permissions, or resource assumptions, stop and return to analysis rather than applying it silently.
Exit criteria
Exit the baseline stage once one of the following is durably true:
- a baseline is attached and accepted
- an imported baseline is accepted
- a reproduced baseline is verified and accepted
- a broken route has been declared blocked and a next decision is recorded
Typical next anchors:
- `idea`
- `experiment` in tightly scoped follow-on cases
- `decision` if the baseline line remains contested