name: run-research description: "Multi-session research workflow: compose logbooks, experiment issues, documentation updates, and snapshot discipline for long-running investigations."
Skill: Agent-Directed Research
Use this for long-running research where an agent iterates on benchmarks,
experiments, and hypotheses over multiple sessions. Domain skills such as
add-pallas-kernel may add constraints.
General Principle: Open Development
Long-lived work should leave a durable record: a logbook, coordinating issue updates, and enough commands/config to reproduce results. Do not publish secrets or private work the user asks to hold back.
Subskills
.agents/skills/background-research/SKILL.mdfor prior-work foraging, source ledgers, contradiction passes, and ranked hypothesis candidates..agents/skills/task-logbook/SKILL.mdfor logbooks and issue updates..agents/skills/wandb-reporting/SKILL.mdfor W&B project selection, run naming, report links, and artifact hygiene..agents/skills/task-snapshot/SKILL.mdfor commit/tag snapshots and stable artifact links..agents/skills/update-docs/SKILL.mdfor updating durable docs, runbooks, reports, or skill guidance when research changes behavior or practice.
Layer domain skills on top for task-specific constraints.
Core Artifacts
- A GitHub experiment issue. Agent-created experiment issues use the
experimentandagent-generatedlabels. - A logbook at
.agents/logbooks/<topic>.md. - A living hypothesis queue in the logbook, derived from append-only entries and updated as hypotheses are proposed, blocked, falsified, or promoted.
- A long-lived branch, for example
research/<topic>orresearch/<user>/<issue>-<topic>, with the logbook, research code, configs, small artifacts, and test harnesses needed to reproduce results. - One or more commit or tag snapshots for meaningful milestones.
- Often a "production" branch that gets PR'd and merged.
Research Logbook
Use the term logbook consistently. Follow .agents/skills/task-logbook for
formatting and issue-update rules.
Branches
Use a research branch for the logbook, one-off scripts, harnesses, and small artifacts. Extract a clean production branch only when the final code/docs shape is clear.
Standard Workflow
1. Prologue
- Create or switch to a long-lived research branch. You may already be on one, or the user may have requested a specific branch name. Otherwise, pick a descriptive name like
research/<topic>orresearch/<user>/<issue>-<topic>. - Create an experiment issue with
file-issueunless scope or visibility needs human confirmation. If the user provides one, use it. - Start the logbook and link both ways: logbook to issue URL, issue body to logbook path. See the skill.
- Pick a short experiment ID prefix for the series, for example
MOE-HC-001, and use IDs likeMOE-HC-001in logbook entries, run names, and issue comments. - Pick a set of tags to use for all experiments, to be used with W&B, etc. Typically this is the ID prefix (without the number), the issue number, and anything else reasonable. Try to keep to 2-4 shared tags. You may use more tags to distinguish runs within a project as useful.
- Record, as applicable: motivation, problem statement, context, success metrics, the initial hypothesis queue, first experiment matrix, relevant code paths, references, stop criteria, and the fixed baseline case for repeated comparison.
2. Research Loop
- Forage: gather prior work and local context.
- Propose: update the living hypothesis queue and pick the next test.
- Run: implement the smallest useful experiment and collect evidence.
- Interpret: compare against baseline, decide confidence, and update the logbook.
- Promote: move only interesting, decision-relevant claims up the issue funnel.
- Seal: snapshot durable results or extract production work.
Every cycle should leave the durable record better than it found it.
2.1 Forage: Background Research
Use background-research at the beginning, after a significant change of
direction, or whenever you hit a wall. Assume medium or low effort unless
the decision is expensive.
Use the background-research output to update the logbook's hypothesis queue:
add new candidates, revise weak ones, mark known dead ends, and promote
well-supported ideas into the next experiment matrix. Let background-research
and task-logbook decide what belongs in the issue versus the logbook.
2.2 Propose / Run / Interpret: Dev Work and Experiments
For research-branch dev work, optimize for learning speed while preserving
operational security and cost controls. Ad-hoc scripts, temporary config knobs,
and copy/paste are acceptable there. Production-facing code keeps the usual
AGENTS.md quality bar.
For each non-trivial experiment:
- Do the dev work needed for the experiment.
- Run the benchmark or experiment. Use
babysit-jobfor long-lived runs. - Append exact commands, config, key outputs, interpretation, and next decision
to the logbook. Follow
task-logbookfor issue updates. - Push dense scalar series, plots, or raw artifacts to W&B or another store when they are too large for issue comments.
3. Epilogue: Seal
Sealing should ordinarily only happen if the user requests it or the research has reached the defined goal.
- Update the issue body with the final TL;DR, conclusion, decision log, and negative-results index. Again, follow the
task-logbookskill. - Add a final issue comment covering what worked, what did not, confidence level, limitations, and ordered next steps.
- Use
update-docswhen behavior, operational practice, reusable guidance, or durable research findings changed. - Ensure the final logbook entry and snapshot links are present.
- Close the issue when the research thread is complete.
If the research produced useful production changes, extract them into a clean branch that can link to the logbook but does not include it. Follow standard Marin development practices on that branch.
Before closing the issue:
- Final TL;DR is current.
- Issue body includes a clear
Conclusion. - Next steps are listed.
- Final snapshot is linked.
- If there's a final production PR, link to it from the issue summary.
Practical Rules
- Prefer short-lived code changes unless a persistent harness is clearly useful.
- Keep benchmark harnesses configurable and minimal.
- Record exact command lines for every headline number.
- Treat failures and negative results as first-class data. Record dead ends and excessive hyperparameter sensitivity; skip routine bugs or undertuning.
See Also
.agents/skills/organize-experiments/.agents/skills/add-pallas-kernel/.agents/skills/task-logbook/.agents/skills/update-docs/