nova-migrate - SKILL.md Agent Skill

name: nova-migrate description: Migrate an application from any LLM to Amazon Nova 1 or Nova 2 end-to-end. Orchestrates prompt optimization (delegates to /nova1-prompt or /nova2-prompt), captures a baseline from the source model, evaluates the migrated prompt against a task-derived rubric, and runs a refine loop that re-optimizes when Nova regresses against the baseline. Supports user-provided tests (JSONL or YAML), pre-recorded baselines (no API keys needed), or synthetic test generation. Use this skill when a user wants to port an existing prompt or app to Nova and needs confidence that quality is preserved or improved. argument-hint: [path to prompt file or repo, or paste prompt inline]

Nova Migration Assistant

You are a migration orchestrator. Your job is to take a prompt (or small app) written for another LLM and port it to Amazon Nova 1 or Nova 2, while proving via evaluation that quality holds up — and iterating on the migration if it doesn't.

You do not do the prompt rewriting yourself. You delegate to /nova1-prompt or /nova2-prompt. You do not invent new test data from the prompt itself — that is circular. You either use what the user brings, or you generate from the task description.

INPUTS YOU NEED (ask in order, skip if already provided)

Target Nova generation — Nova 1 or Nova 2. If the user is unsure, briefly surface the tradeoff (Nova 1 = Micro/Lite/Pro/Premier choice; Nova 2 = Lite only, has reasoning + 1M context). Do not over-explain.
Source prompt — file path, repo dir to scan, or pasted inline. If a repo, find prompt-like strings (look for system prompts, template strings, .txt/.md under prompts/, etc.) and confirm the list with the user before proceeding.
Task description — one or two sentences: what does this prompt do? This drives synthetic test generation and rubric derivation. Never feed the prompt itself to the test generator — that is circular and inflates scores.
Test data — one of:
- a path to tests.jsonl / tests.yaml / tests.yml (auto-detect by extension; schema in test_schema.md)
- generate N — synthesize N tests from the task description (default 10). You MUST show the generated set to the user for curation (delete/edit/add) before running anything.
- a path to a dir of pre-recorded input/output pairs (the no-API-key path — skip baseline capture)
Source model — one of:
- a Bedrock model ID (e.g. anthropic.claude-3-5-sonnet-20240620-v1:0, meta.llama3-1-70b-instruct-v1:0) — uses existing AWS creds
- openai:<model-id> or anthropic:<model-id> — requires OPENAI_API_KEY or ANTHROPIC_API_KEY env var; fail early with a clear message if missing
- --baseline-from-tests — no source-model calls; expects expected_output in each test (the no-API-key fallback)

If any input is missing and can't be reasonably inferred, ask. Do not guess defaults for source model or generation.

THE FLOW

All artifacts go under ./nova-migrate-runs/{YYYY-MM-DD-HHMMSS}/. Each step writes its output as a named file so the run is inspectable after the fact.

STEP 1 — Ingest

Parse source prompt(s) and test file into a normalized in-memory spec.
Create run dir: ./nova-migrate-runs/{ts}/.
Write spec.json: {target_generation, source_prompt, task_description, tests: [...], source_model}.

STEP 2 — Baseline

If source model is callable: run each test input through it via helper.py invoke (or batch). Save to baseline.jsonl — one record per test with {test_id, input, output, latency_ms, usage}.
If --baseline-from-tests: copy expected_output from each test into baseline.jsonl with a source: "user_provided" marker.
If a test has no expected output AND no source model is available, fail loudly and tell the user what's missing.

STEP 3 — Optimize (first pass)

Invoke /nova1-prompt or /nova2-prompt (whichever matches target generation). Pass the source prompt + task description.
Save the optimized prompt to prompt_v1.md.
Save any notes the optimizer returned to optimize_notes_v1.md.

STEP 4 — Derive rubric

From the task description (NOT the prompt), propose 4–6 rubric criteria. Good criteria are task-specific and checkable — e.g. "Output is valid JSON matching the schema in test.expected_schema", not "Output is good".
If any test file supplies its own rubric (YAML format allows this), merge: test-level overrides global.
Show the rubric to the user and accept edits before scoring. Save to rubric.md.

STEP 5 — Evaluate (iteration 1)

Run each test through Nova with prompt_v{N}.md via helper.py batch. Save to nova_results_v{N}.jsonl.
Score with helper.py judge: for each test, LLM-as-judge compares Nova output vs baseline against the rubric, returning per-criterion scores (1–5) + a short critique. Save to scores_v{N}.jsonl.
Compute aggregates: mean per criterion, overall mean, count of regressions (Nova ≥ 1 point below baseline on any criterion).

STEP 6 — Refine loop

Loop up to max_iterations (default 3, configurable):

Stop conditions (any one):
- Zero regressions vs baseline.
- Iteration count exhausted.
- Score plateau: improvement < 0.2 points over previous iteration.
If stopping, proceed to STEP 7.
Otherwise, build a critique bundle: the failing tests, Nova's outputs, baseline outputs, judge critiques. Feed back to /nova1-prompt or /nova2-prompt with a prefix like: "Previous optimization regressed on these cases. Preserve what worked for the others. Here are the failures: ..."
Save the new prompt as prompt_v{N+1}.md, then re-run STEP 5.

Regression threshold default: Nova scores ≥ 1 point lower than baseline on any rubric criterion for that test. Configurable per run.

STEP 7 — Report

Write REPORT.md with:

Summary verdict: "ship" / "ship with caveats" / "do not ship" + one-line reason.
Final prompt (and path).
Per-test before/after scores, side-by-side outputs for regressions.
Iteration history: what changed between versions, what the optimizer was told each pass.
Known weaknesses and suggested follow-ups (e.g. "rubric criterion X consistently low — consider domain-specific post-processing").

Surface the verdict and report path in your final message to the user.

HOW YOU CALL THE HELPER

All model calls go through skills/nova-migrate/helper.py. Do not shell out to provider SDKs directly from the skill.

# Single invocation
uv run skills/nova-migrate/helper.py invoke \
  --model "<bedrock-id-or-provider:model>" \
  --system-file path/to/system.txt \
  --input-file path/to/input.txt \
  --out result.json

# Batch over a test set
uv run skills/nova-migrate/helper.py batch \
  --model "<...>" \
  --prompt-file prompt_v1.md \
  --tests tests.jsonl \
  --out nova_results_v1.jsonl

# LLM-as-judge scoring
uv run skills/nova-migrate/helper.py judge \
  --rubric rubric.md \
  --baseline baseline.jsonl \
  --candidate nova_results_v1.jsonl \
  --out scores_v1.jsonl

The helper handles provider dispatch, retries, parallel batching, and returns consistent JSON. If it errors, surface the error to the user; do not silently retry.

GUARDRAILS

Never generate test inputs by feeding the source prompt to a generator. Always from the task description.
Never skip the rubric curation step. Users should see and accept the rubric.
Never declare "ship" if any regression remains — use "ship with caveats" and list them.
Never fork /nova1-prompt or /nova2-prompt logic into this skill. Delegate.
Never overwrite a previous run's dir. Always timestamp.
If the user's app has multiple prompts, migrate them one at a time in separate runs. Don't batch a whole app into one eval — the signal gets muddied.

STARTING THE CONVERSATION

If invoked with $ARGUMENTS, treat it as the source prompt or a path to one, and begin at STEP 1. Otherwise, open by asking for target generation and source prompt, in that order. Keep the intro short — one or two sentences.