name: enhancement-evaluator
description: Evaluate Gemini-enhanced recipes against quality bar, decide action, track failure modes
Skill: Enhancement Evaluator
Systematic evaluation of Gemini-enhanced recipes. Used during benchmark review of all enhanced recipes in the database.
Activation Context
- Running `recipe_manager.py` commands: `next`, `enhanced`, `reenhance`, `done`, `skip`
- Reviewing Gemini output quality after reenhancement
- Deciding whether to apply, keep, or skip a reenhanced recipe
Tool
tools/recipe_manager.py --project hikes-482104 <command>
| Command | Purpose |
|---|---|
| `next` | Get next unreviewed enhanced recipe |
| `enhanced <id>` | Show current enhanced version |
| `reenhance <id> --household <id>` | Dry-run reenhancement |
| `reenhance <id> --household <id> --apply` | Apply saved reenhanced JSON to Firestore |
| `done <id>` | Mark as reviewed (good) |
| `skip <id>` | Skip (come back later) |
| `status` | Show review progress |
Default household: IXdzHJ91NZeutylohx1t
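A minimal sketch of how the invocations in the table could be assembled from a script. The tool path, project, and default household come from this document; the `build_command` helper and the `abc123` recipe ID are hypothetical and exist only for illustration.

```python
import shlex

# Values taken from this document; the helper itself is hypothetical.
TOOL = "tools/recipe_manager.py"
PROJECT = "hikes-482104"
DEFAULT_HOUSEHOLD = "IXdzHJ91NZeutylohx1t"


def build_command(command, recipe_id=None, household=None, apply=False):
    """Return the argv list for one recipe_manager.py invocation."""
    argv = [TOOL, "--project", PROJECT, command]
    if recipe_id is not None:
        argv.append(recipe_id)
    if command == "reenhance":
        # Fall back to the default household when none is given.
        argv += ["--household", household or DEFAULT_HOUSEHOLD]
        if apply:
            argv.append("--apply")
    return argv


# Dry run first, then the apply form of the same command.
dry_run = build_command("reenhance", "abc123")
print(shlex.join(dry_run))
print(shlex.join(build_command("reenhance", "abc123", apply=True)))
```

Keeping `--apply` behind an explicit keyword argument makes the dry run the default, which matches the rule that nothing is applied without user approval.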
Core Principle: Enhancement Is the Goal
The purpose of Gemini enhancement is to make the meal better. This is the number one priority. A recipe that tastes better, teaches more technique, or has better flavor balance is a successful enhancement — even if it changes things from the original.
Any ingredient change — adding, reducing, or removing — is acceptable when it genuinely improves the dish. Adding acid for brightness, reducing excess salt, swapping a weaker spice for a better one, introducing umami — these are all valid enhancements. Judge every change by one criterion: does it make the meal better?
The real failure mode is silent, unjustified loss. When Gemini quietly drops an ingredient, shrinks a quantity for no reason, or reassigns an ingredient to a different role without explanation — that's a prompt failure. The difference between "I removed X because Y" (intentional improvement) and "X just disappeared" (bug) is what matters.
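The distinction between a justified removal and a silent loss can be sketched as a set comparison. This is an illustration only: the flat ingredient-name lists and the `explained` list (ingredients whose removal Gemini explicitly justified) are assumed shapes, not the actual recipe schema.

```python
def find_silent_losses(original, enhanced, explained):
    """Return ingredients that vanished without a stated reason.

    original / enhanced: ingredient names (hypothetical flat lists).
    explained: names whose removal was explicitly justified (hypothetical).
    """
    def norm(name):
        return name.strip().lower()

    kept = {norm(n) for n in enhanced}
    justified = {norm(n) for n in explained}
    return [n for n in original if norm(n) not in kept and norm(n) not in justified]


# Made-up example data.
original = ["Onion", "Garlic", "Cumin", "Salt"]
enhanced = ["onion", "garlic", "lemon juice"]
explained = ["salt"]  # e.g. "removed salt because the stock is already salty"

print(find_silent_losses(original, enhanced, explained))  # ['Cumin']
```

Salt is missing but justified, so it counts as an intentional change; cumin simply disappeared, so it is the one to flag.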
Quality Bar
Every enhancement MUST have:
- Technique: heat levels, blooming spices, toasting, resting times with explanation
- Sensory cues: "until golden and bubbles appear" — not "cook 5 min"
- Flavor: brightness (acid), finishing touches, umami compensation for substitutions
- Tips: teach technique, not the obvious (e.g., "use non-stick pan")
- Correct timings: prep_time, cook_time, total_time must be accurate (no-cook recipes = null cook_time, not 5)
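The timing rule lends itself to a quick automated check. The sketch below assumes times are stored as integer minutes and that `total_time` should equal `prep_time` plus `cook_time`; both are assumptions rather than rules stated here. Only the null `cook_time` rule for no-cook recipes comes from the quality bar.

```python
def timing_issues(recipe, no_cook):
    """Return a list of timing problems for one enhanced recipe dict.

    `no_cook` is supplied by the reviewer; how to detect it is out of scope.
    """
    issues = []
    prep = recipe.get("prep_time")
    cook = recipe.get("cook_time")
    total = recipe.get("total_time")

    # Stated in the quality bar: no-cook recipes carry a null cook_time.
    if no_cook and cook is not None:
        issues.append(f"no-cook recipe has cook_time={cook}, expected null")

    # Assumption (not from the source): total is the sum of prep and cook.
    if total is not None and prep is not None:
        expected = prep + (cook or 0)
        if total != expected:
            issues.append(f"total_time={total}, but prep + cook = {expected}")
    return issues


# Made-up example: a salad that was given a 5-minute cook time.
salad = {"prep_time": 15, "cook_time": 5, "total_time": 20}
print(timing_issues(salad, no_cook=True))
```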
Decision Tree
1. Always reenhance first
Every recipe gets reenhanced with current prompts. This is a benchmark — we need to know if the prompt system produces good results, not just whether the existing enhancement is acceptable.
2. Compare: reenhanced vs current
Evaluate the reenhanced output against the current version and present a summary to the user.
3. Present summary and await decision
The user always makes the final call. Never apply, skip, or mark done without explicit user approval. Present the evaluation report (see format below) and wait for go/no-go.
Recommendations to include in the summary:
| Reenhanced is... | Recommendation |
|---|---|
| Better than current | Recommend: apply |
| Not good enough (equal, worse, or breaks something) | Recommend: fix prompt, then reenhance again |
There is no "keep current and move on" option. If the reenhancement isn't good enough, the prompt gets fixed and the recipe gets reenhanced again — repeat until the result is good enough to apply. Every recipe must end with a successful --apply.
Sub-par results must be fixed immediately
If Gemini produces a sub-par result, fix the underlying prompt issue immediately, as soon as the user gives the go-ahead. Then reenhance the same recipe again to verify the fix. Do not move to the next recipe until this one produces a good result and gets applied.
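The recommendation table reduces to a two-way mapping, sketched below. The `recommend` function and the verdict strings are illustrative names, not part of `recipe_manager.py`.

```python
def recommend(verdict):
    """Map a comparison verdict to the recommendation shown to the user."""
    if verdict not in {"better", "equal", "worse"}:
        raise ValueError(f"unknown verdict: {verdict!r}")
    if verdict == "better":
        return "apply"
    # Equal, worse, or broken output all lead to the same place:
    # there is no "keep current and move on" branch.
    return "fix prompt, then reenhance again"


for verdict in ("better", "equal", "worse"):
    print(verdict, "->", recommend(verdict))
```

The function only produces a recommendation; acting on it still waits for the user's decision.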
4. Never modify tags
Tags are never touched — not in the JSON, not manually, not flagged as issues. Whatever Gemini produces stays.
5. Never modify reenhanced JSON
Do not manually edit the reenhanced JSON output. If Gemini's output has problems, that IS the benchmark data — it tells us the prompt needs fixing. Manually patching hides prompt failures.
Exception: timings (prep_time, cook_time, total_time) may be corrected if obviously wrong, since timing estimation is a known weak spot being actively tuned.
Evaluation Report Format
After comparing, report concisely and wait for the user's decision:
**[Recipe title]** (ID)
Reenhanced: [better / equal / worse] than current
Key differences: [1-3 bullet points]
Recommendation: [apply / fix prompt, then reenhance again]
[If failure mode: what went wrong]
Then wait. Do not act until the user says go.
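As a sketch, the report layout above could be rendered by a small helper. The function name, its parameters, and the sample recipe are hypothetical; only the line layout follows the format in this section.

```python
def render_report(title, recipe_id, verdict, differences, recommendation, failure=None):
    """Build the evaluation report text in the layout shown above."""
    if not 1 <= len(differences) <= 3:
        raise ValueError("expected 1-3 key differences")
    lines = [
        f"**{title}** ({recipe_id})",
        f"Reenhanced: {verdict} than current",
        "Key differences:",
    ]
    lines += [f"- {d}" for d in differences]
    lines.append(f"Recommendation: {recommendation}")
    if failure:
        # Only present when a failure mode was observed.
        lines.append(f"Failure mode: {failure}")
    return "\n".join(lines)


# Made-up example data.
report = render_report(
    title="Lentil soup",
    recipe_id="abc123",
    verdict="better",
    differences=["Blooms cumin in oil before adding stock", "Adds lemon juice at the end"],
    recommendation="apply",
)
print(report)
```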
Failure Mode Tracking
When Gemini produces bad output, log it in the failure modes table in `.copilot-tasks.md`:
| Field | What to log |
|---|---|
| Pattern | What went wrong (e.g., "ingredient dropped") |
| Count | Increment if pattern already exists |
| Severity | Low / Medium / High |
| Prompt fix? | Whether a prompt change could prevent it |
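One way to model the "increment if pattern already exists" rule is a dictionary keyed by pattern. The sketch below is hypothetical: the actual column layout of the table in `.copilot-tasks.md` is not shown here, so the rendered row simply follows the field order listed above.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    pattern: str
    severity: str      # "Low", "Medium", or "High"
    prompt_fix: bool   # could a prompt change prevent it?
    count: int = 1


def log_failure(table, pattern, severity, prompt_fix):
    """Increment an existing pattern or add a new row; returns the row."""
    row = table.get(pattern)
    if row is None:
        row = table[pattern] = FailureMode(pattern, severity, prompt_fix)
    else:
        row.count += 1
    return row


def to_markdown_row(row):
    """Render one row in the field order Pattern, Count, Severity, Prompt fix?"""
    fix = "Yes" if row.prompt_fix else "No"
    return f"| {row.pattern} | {row.count} | {row.severity} | {fix} |"


table = {}
log_failure(table, "ingredient dropped", "High", True)
row = log_failure(table, "ingredient dropped", "High", True)
print(to_markdown_row(row))  # | ingredient dropped | 2 | High | Yes |
```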
Fixing prompts
Sub-par results are never deferred. After presenting the evaluation and getting user go, fix the prompt immediately before moving to the next recipe.
Prompt files (all in config/prompts/):
| File | Contains |
|---|---|
| `core/base.md` | System prompt: task definition, quality bar, output schema, timing rules |
| `core/rules.md` | Hard constraints: forbidden patterns, quantity/ingredient preservation, equipment rules |
| `core/formatting.md` | Instruction structure, measurement format, tips quality |
| `core/tagging.md` | Tag vocabulary and rules |
| `locales/sv.md` | Swedish-specific: spice blend replacements, dairy subs, fat rules |
| `user/dietary.md` | Household dietary preferences (auto-loaded) |
How to fix:
- Identify which file the rule belongs in (constraint → `rules.md`, technique → `base.md`, format → `formatting.md`)
- Read the file to understand current structure
- Add the rule with a self-check pattern where possible ("Self-check: ...")
- Update the failure modes table in `.copilot-tasks.md` to mark "✅ Fixed"
- Add entry to the prompt tuning log in `.copilot-tasks.md`
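The routing in the first step can be written down as a lookup. The kind names and file paths come from this section; the `prompt_file_for` helper is a hypothetical name used only for illustration.

```python
from pathlib import PurePosixPath

PROMPT_DIR = PurePosixPath("config/prompts")

# Routing taken from the first step above; the helper name is hypothetical.
RULE_FILES = {
    "constraint": "core/rules.md",
    "technique": "core/base.md",
    "format": "core/formatting.md",
}


def prompt_file_for(kind):
    """Return the prompt file a new rule of the given kind belongs in."""
    try:
        return PROMPT_DIR / RULE_FILES[kind]
    except KeyError:
        raise ValueError(f"no prompt file mapped for rule kind {kind!r}") from None


print(prompt_file_for("constraint"))  # config/prompts/core/rules.md
```

Rule kinds outside the three listed (for example locale or dietary rules) raise an error here on purpose, since this section does not say where they should go.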
Recipes Without Original Snapshots
Some early enhancements predate the snapshot system. The enhancer needs the original to produce good output.
Workaround:
- Find source URL from the recipe document
- Fetch original recipe from source
- Inject original snapshot via script (`tmp/inject_original.py` pattern)
- Then reenhance normally
Prompt Tuning Log
Track all prompt changes in .copilot-tasks.md under the benchmark section. Each entry:
- `<file>` — what changed and why (which failure mode triggered it)
This creates a record of which prompt changes fixed which problems.