name: add-eval
description: Design, implement, validate, and calibrate a new eval for the convex-evals suite. Use when the user wants to add a new eval, create an eval, test a new Convex concept, or expand eval coverage.
Add a New Eval
Follow these steps whenever the user asks to create a new eval. Read .cursor/skills/add-eval/reference.md for grader helpers, test patterns, and conventions.
Switch to Plan mode immediately. Steps 0-2 (gather info, research, design) are collaborative and read-only. The user should see the research findings and approve the eval design before any files are created. Switch back to Agent mode at Step 3.
Step 0: Gather Information
Determine the following (ask the user if not provided):
- Concept to test - what Convex feature or pattern should this eval exercise? (e.g. "vector search", "pagination with joins", "cascade deletes")
- Specific focus - any edge cases, constraints, or behaviors the user wants to emphasize?
Category Selection
List the existing categories by scanning evals/ top-level directories, then propose the best-fit category. The current categories are:
| Category | Scope |
|---|---|
| 000-fundamentals | Basic Convex concepts (empty functions, schema definition, crons, scheduling) |
| 001-data_modeling | Schema design, indexes, relationships, unions, optional fields |
| 002-queries | Reading data, joins, pagination, aggregation, filtering |
| 003-mutations | Writing data, inserts, patches, deletes, cascades |
| 004-actions | HTTP fetch, file storage, node runtime, HTTP action routing |
| 005-idioms | File organization, internal functions, batch patterns |
| 006-clients | useQuery, useMutation, usePaginatedQuery |
- If the concept clearly fits one category, propose it with a brief justification.
- If it's ambiguous (e.g. "scheduled mutations" could be fundamentals or mutations), stop and ask the user. Present the candidate categories with reasoning for each.
- If it doesn't fit any existing category, propose creating a new category and ask the user to confirm.
Determine the eval number by listing existing evals in the chosen category and picking the next sequential number.
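A minimal sketch of the numbering step, assuming eval directories follow the three-digit prefix pattern seen in names such as 012-index_and_filter. The helper name nextEvalNumber is hypothetical, and starting an empty category at 000 is an assumption based on the zero-based category prefixes.

```typescript
// Hypothetical helper: pick the next eval number from existing directory names.
// Assumes names look like "012-index_and_filter" (numeric prefix, dash, slug).
function nextEvalNumber(existingDirs: string[]): string {
  const numbers = existingDirs
    .map((name) => /^(\d+)-/.exec(name))
    .filter((match): match is RegExpExecArray => match !== null)
    .map((match) => Number(match[1]));

  // Assumption: an empty category starts at 000.
  const next = numbers.length === 0 ? 0 : Math.max(...numbers) + 1;

  // Left-pad to three digits; longer numbers are returned unchanged.
  const digits = String(next);
  return digits.length >= 3 ? digits : ("000" + digits).slice(-3);
}

// Entries without a numeric prefix (for example a README) are ignored.
const nextNumber = nextEvalNumber(["000-first", "001-second", "012-index_and_filter"]);
// nextNumber is "013"
```

The function takes the highest existing number rather than the directory count, so a gap in the sequence does not cause a collision.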
Step 1: Research
Run these four research tracks. Use sub-agents or parallel tool calls where possible.
A. Convex Docs (source of truth)
- Fetch https://docs.convex.dev/llms.txt to get the docs table of contents.
- Identify the 1-3 most relevant doc pages for the concept being tested.
- Use WebFetch to retrieve those specific pages.
- Extract the correct API patterns, constraints, and best practices. These are ground truth for designing the eval and answer.
B. Existing Guidelines
- Read the relevant sections of runner/models/guidelines.ts (or the generated guidelines) for the concept being tested.
- Record:
  - Which existing guidelines are relevant
  - What behavior those guidelines would lead a model to produce
  - Whether the current guidelines would already be expected to make strong models pass
  - Whether any guideline seems to conflict with what the proposed eval is trying to reward
- Do not treat this as a reason to block the eval automatically. This is context for design and later calibration.
- If an existing guideline appears to directly contradict the proposed eval, STOP and discuss with the user before proceeding.
C. Existing Eval Patterns
- Read 2-3 evals in the same or a similar category.
- Study: TASK.txt style, answer structure, test approach, schema design.
- Note which grader helpers and test patterns they use. See reference.md for the full catalog.
D. Overlap Check
- List ALL eval directory names under evals/ to get a high-level view of coverage.
- For any eval whose name suggests overlap with the proposed concept, read its TASK.txt.
- If an existing eval tests the same or very similar concept, STOP and warn the user:
  - Name the overlapping eval(s) and explain what they already cover.
  - Ask whether to: differentiate the new eval (narrow its scope), adjust the existing eval, or abandon.
- Flag even partial overlap, e.g. "002-queries/012-index_and_filter already tests index usage but doesn't cover compound indexes."
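The first pass of the overlap check (scanning directory names for likely matches) could be sketched as below. The helper findOverlapCandidates is hypothetical and not part of the repo. It only narrows down which TASK.txt files to read, and it matches whole words, so "indexes" would not match "index".

```typescript
// Hypothetical helper: flag eval paths whose names share a keyword with the concept.
// It produces a reading list; deciding real overlap still means reading each TASK.txt.
function findOverlapCandidates(concept: string, evalPaths: string[]): string[] {
  const keywords = concept
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 2); // skip short filler such as "of" or "a"

  return evalPaths.filter((evalPath) => {
    const nameWords = evalPath.toLowerCase().split(/[^a-z0-9]+/);
    return keywords.some((keyword) => nameWords.indexOf(keyword) !== -1);
  });
}

const candidates = findOverlapCandidates("compound index usage", [
  "002-queries/012-index_and_filter",
  "003-mutations/001-insert", // hypothetical eval name, for illustration only
]);
// candidates contains only "002-queries/012-index_and_filter"
```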
Step 2: Design the Eval
You should already be in Plan mode. Present the full eval design to the user for review.
TASK.txt Draft
Write the complete TASK.txt content. Follow these principles:
- Laser-focused on the concept being tested. Control other variables (keep schema simple, minimize unrelated code).
- Explicit about schema, function names, argument types, return shapes, and which files to create.
- Don't over-specify Convex implementation details that are covered by the guidelines. If a model needs the task to spell out how to use internalMutation or how pagination works, that's a meaningful signal, not a task problem.
- Don't under-specify the problem domain. The model should not need to guess what the feature does, only how to implement it in Convex.
- Include schema as a TypeScript code block when applicable.
- Specify edge cases (empty results, error messages, missing data).
Answer Outline
Describe the files that will be created and the key implementation approach. Don't write the full code yet, just the structure and important decisions.
Test Approach
Describe how the eval will be graded:
- Pick the primary grading primitive first: behavior tests, schema inspection, function-spec comparison, HTTP testing, AI grading, AST analysis, or some combination.
- Which grader helpers to use (see reference.md for the catalog and decision tree).
- What behaviors to assert on.
- Whether standard unit tests are sufficient, or if you need schema inspection, HTTP testing, AI grading, or something else.
If unit tests cannot fully verify the concept (e.g. testing that a model uses an index rather than a filter, or testing code organization patterns), STOP and discuss with the user. Present the options:
- Schema/index inspection (using getSchema, hasIndexForFields)
- AI grading (createAIGraderTest, currently disabled and requires a repo change to re-enable)
- AST analysis (parse the generated TypeScript files)
- Restructure the eval so the concept can be tested via behavior
- Accept the limitation and test what we can
Let the user decide before proceeding.
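To make the schema/index inspection option concrete, here is a standalone sketch of the kind of check it implies. It does not reproduce the repo's getSchema or hasIndexForFields helpers, whose signatures are documented in reference.md. The IndexDefinition shape and the function someIndexStartsWith are assumptions made for illustration.

```typescript
// Minimal stand-in for a schema index. This shape is an assumption for the sketch,
// not the structure returned by the repo's getSchema helper.
interface IndexDefinition {
  name: string;
  fields: string[];
}

// True when some index begins with exactly the requested fields, in order.
// Field order matters: an index on [authorId, status] serves lookups by authorId,
// but not lookups by status alone.
function someIndexStartsWith(indexes: IndexDefinition[], fields: string[]): boolean {
  return indexes.some(
    (index) =>
      index.fields.length >= fields.length &&
      fields.every((field, position) => index.fields[position] === field),
  );
}

const exampleIndexes: IndexDefinition[] = [
  { name: "by_author_and_status", fields: ["authorId", "status"] }, // hypothetical index
];
const coversAuthor = someIndexStartsWith(exampleIndexes, ["authorId"]); // true
const coversStatus = someIndexStartsWith(exampleIndexes, ["status"]); // false
```

A check like this verifies that a suitable index is declared. It cannot verify that the model's query actually uses it, which is why the remaining options stay on the table.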
Guidelines Hypothesis
Summarize the guideline context before implementation:
- Which existing guidelines are relevant to this eval
- What you would expect guideline-following models to do
- Whether failures on this eval would likely indicate a model gap, an eval/task problem, or a missing/weak guideline
- Any existing guideline that might need to be revised if calibration shows an unexpected result
If a new guideline is likely needed, design it minimal-first. Every token in the guidelines is sent with every prompt, so bloat costs real money. Start with the smallest guideline that teaches the critical pattern (usually one code example), test it, and only expand if models still fail. Avoid pinning specific dependency versions in guidelines as they age quickly. Prefer "always install the latest version" instead.
Push Back
Before presenting the design, critically evaluate it. Warn the user if:
- The eval overlaps heavily with an existing eval (should have been caught in Step 1D, but re-check).
- The task tests too many concepts at once. Each eval should focus on one thing.
- The task is so explicit that it tells the model exactly how to solve it (e.g. specifying .withIndex() calls). We're testing knowledge, not instruction-following.
- The current guidelines suggest a different approach than the eval is rewarding, or make the expected signal unclear.
- The eval seems too easy (every model will pass) or too hard (no model will pass).
Step 3: Implement the Eval
After the user approves the design, switch back to Agent mode and implement:
- Create directory: evals/<category>/<eval_slug>/
- Write TASK.txt with the approved content.
- Create answer directory:
  - answer/package.json: { "name": "convexbot", "version": "1.0.0", "dependencies": { "convex": "^1.31.2" } }
  - answer/convex/schema.ts (if applicable)
  - Implementation files (e.g. answer/convex/index.ts)
- Run codegen: cd evals/<category>/<eval_slug>/answer && bunx convex codegen
- Write grader.test.ts using the approved test approach. Import paths are relative:
  import { responseClient, responseAdminClient, addDocuments } from "../../../grader";
  import { api } from "./answer/convex/_generated/api";
  Adjust the depth of ../ based on the eval's nesting level.
- Typecheck: bun run typecheck
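The advice about adjusting the depth of ../ can be expressed as a small calculation. This sketch assumes the eval directory is given relative to the repo root and that the grader module sits at the repo root, which is what the "../../../grader" import implies for evals/<category>/<eval_slug>/. The helper name graderImportPath is hypothetical.

```typescript
// Hypothetical helper: compute the relative import path from an eval's grader.test.ts
// to a grader module at the repo root, climbing one ".." per directory segment.
function graderImportPath(evalDir: string): string {
  const segments = evalDir.split("/").filter((segment) => segment.length > 0);
  if (segments.length === 0) {
    return "./grader"; // the test file already sits at the repo root
  }
  return segments.map(() => "..").join("/") + "/grader";
}

const standardDepth = graderImportPath("evals/002-queries/012-index_and_filter");
// standardDepth is "../../../grader"
```

An eval nested one level deeper would need four ".." segments instead of three.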
Step 4: Validate the Answer
First run canonical answer validation for the new eval:
TEST_FILTER=<category>/<eval_slug> bun run validate:answers
Then run the eval for one model as a smoke test. This validates model generation against the new eval:
MODELS=anthropic/claude-sonnet-4.6 TEST_FILTER=<category>/<eval_slug> bun run local:run
Do NOT set CONVEX_EVAL_URL or CONVEX_AUTH_TOKEN, so results stay local-only.
If the smoke test fails:
- Read the output directory and run.log to understand what happened.
- Determine whether it's a test problem (fix the grader) or a model problem (expected, move on).
- If the test itself is broken, fix it and re-run before proceeding.
Step 5: Run Against Multiple Models
Start with a smaller representative set of models to calibrate difficulty. If the result is unclear, expand to a broader sweep. Launch separate background processes, one per model:
# Suggested first-pass set
MODELS=anthropic/claude-sonnet-4.6 TEST_FILTER=<category>/<eval_slug> bun run local:run &
MODELS=openai/gpt-5.4 TEST_FILTER=<category>/<eval_slug> bun run local:run &
MODELS=google/gemini-3.1-pro-preview TEST_FILTER=<category>/<eval_slug> bun run local:run &
MODELS=anthropic/claude-haiku-4.5 TEST_FILTER=<category>/<eval_slug> bun run local:run &
wait
If those results are too noisy or too uniform, expand to a broader sweep across providers and tiers. The user can override the list. Do NOT set CONVEX_EVAL_URL or CONVEX_AUTH_TOKEN.
Monitor the background processes by reading their terminal output files. Each process runs one eval so they should complete in a few minutes.
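If the runs are launched from a script rather than a shell, the environment for each process can be built so that the two reporting variables are always absent. This is a sketch under that assumption. The helper localRunEnv and the Env type are hypothetical; only the variable names MODELS, TEST_FILTER, CONVEX_EVAL_URL, and CONVEX_AUTH_TOKEN come from this document.

```typescript
type Env = { [name: string]: string | undefined };

// Variables that must stay unset so results remain local-only.
const REMOTE_REPORTING_VARS = ["CONVEX_EVAL_URL", "CONVEX_AUTH_TOKEN"];

// Hypothetical helper: copy the base environment, drop the reporting variables,
// and set the model and filter for a single `bun run local:run` invocation.
function localRunEnv(baseEnv: Env, model: string, testFilter: string): Env {
  const env: Env = {};
  Object.keys(baseEnv).forEach((name) => {
    if (REMOTE_REPORTING_VARS.indexOf(name) === -1) {
      env[name] = baseEnv[name];
    }
  });
  env["MODELS"] = model;
  env["TEST_FILTER"] = testFilter;
  return env;
}

const baseEnv: Env = {
  PATH: "/usr/bin",
  CONVEX_EVAL_URL: "https://example.invalid", // placeholder value
  CONVEX_AUTH_TOKEN: "placeholder-token",
};
const runEnv = localRunEnv(baseEnv, "anthropic/claude-sonnet-4.6", "002-queries/012-index_and_filter");
// runEnv keeps PATH, sets MODELS and TEST_FILTER, and has no CONVEX_* reporting keys
```

The returned object can be passed as the env option of whichever process spawner the script uses. The base environment is left unmodified.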
Run-to-run variance
Run each model at least twice (ideally three times) to distinguish systematic failures from flaky ones. Model output is non-deterministic, so a single pass or fail doesn't tell the whole story. Common variance sources:
- Hallucinated dependency versions that sometimes resolve and sometimes don't
- Different code styles across runs (e.g. one integration test vs many unit tests)
- Different library version choices from training data, where some versions have bugs
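One way to turn repeated runs into a verdict per model is sketched below. The labels and the helper classifyRuns are hypothetical. The two-run minimum mirrors the advice above.

```typescript
type RunOutcome = "PASS" | "FAIL";
type Stability = "consistent pass" | "consistent fail" | "flaky" | "needs more runs";

// Hypothetical helper: summarize repeated runs of one model on one eval.
function classifyRuns(outcomes: RunOutcome[]): Stability {
  // A single run cannot separate a systematic failure from a flaky one.
  if (outcomes.length < 2) {
    return "needs more runs";
  }
  const passes = outcomes.filter((outcome) => outcome === "PASS").length;
  if (passes === outcomes.length) {
    return "consistent pass";
  }
  if (passes === 0) {
    return "consistent fail";
  }
  return "flaky";
}

const verdict = classifyRuns(["PASS", "FAIL", "PASS"]);
// verdict is "flaky"
```

A "flaky" verdict is a prompt to read the run logs for one of the variance sources listed above, not a result to average away.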
Step 6: Review Results and Calibrate
Collect pass/fail from all model runs and present a summary table:
Model Result
----------------------------- ------
anthropic/claude-sonnet-4.6 PASS
anthropic/claude-haiku-4.5 FAIL
openai/gpt-5.4 PASS
...
Then assess the results:
- All pass - The eval is likely too easy or the task is too explicit. Recommend tightening the task (remove implementation hints) or adding harder edge cases.
- All fail - The eval might be too hard, poorly specified, or testing something not covered by the guidelines. Investigate the failures. Common causes: ambiguous requirements, missing context, concept beyond current model capabilities.
- Mixed results (ideal) - The eval discriminates between model capabilities. Note whether the pass/fail pattern makes sense given model tiers.
- Unexpected pattern (e.g. only one provider's models fail) - Might indicate a provider-specific quirk rather than a meaningful eval. Investigate before keeping.
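The four outcomes above can be sketched as a classification over the summary table. The helper assessCalibration and its labels are hypothetical. Treating "every model from exactly one provider failed while all others passed" as the provider-specific pattern is one possible reading of "unexpected pattern", not a rule from the repo.

```typescript
type Result = "PASS" | "FAIL";
interface ModelResult {
  model: string; // provider-prefixed id such as "openai/gpt-5.4"
  result: Result;
}
type Calibration = "all pass" | "all fail" | "single-provider failures" | "mixed";

function providerOf(model: string): string {
  const slash = model.indexOf("/");
  return slash === -1 ? model : model.slice(0, slash);
}

function unique(values: string[]): string[] {
  return values.filter((value, position) => values.indexOf(value) === position);
}

// Hypothetical helper: map a set of pass/fail results to one of the outcomes above.
function assessCalibration(results: ModelResult[]): Calibration {
  if (results.length === 0) {
    throw new Error("no results to assess");
  }
  const failures = results.filter((entry) => entry.result === "FAIL");
  if (failures.length === 0) return "all pass";
  if (failures.length === results.length) return "all fail";

  const failingProviders = unique(failures.map((entry) => providerOf(entry.model)));
  const allProviders = unique(results.map((entry) => providerOf(entry.model)));
  if (failingProviders.length === 1 && allProviders.length > 1) {
    const onlyFailing = failingProviders[0];
    const providerFullyFailed = results
      .filter((entry) => providerOf(entry.model) === onlyFailing)
      .every((entry) => entry.result === "FAIL");
    if (providerFullyFailed) return "single-provider failures";
  }
  return "mixed";
}

const assessment = assessCalibration([
  { model: "anthropic/claude-sonnet-4.6", result: "PASS" },
  { model: "anthropic/claude-haiku-4.5", result: "FAIL" },
  { model: "openai/gpt-5.4", result: "PASS" },
]);
// assessment is "mixed": one Anthropic model passed, so the failure is not provider-wide
```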
Then explicitly ask: is this primarily an eval/task gap, a model gap, or a guideline gap?
- Eval/task gap - The task is ambiguous, over-specified, under-specified, or the grader is not testing the right thing. Fix the eval first.
- Model gap - The task is sound, the grading is sound, and failures are what we would expect. Keep the eval.
- Guideline gap - The failures suggest there should be a guideline that helps here, or an existing guideline is weak/confusing/contradictory. Recommend following up with the validate-guidelines skill after the eval is settled.
Debugging failures
Model output is preserved in the temp directory printed at the start of each run (Using tempdir: ...). These directories are NOT cleaned up automatically. For each model, look at:
- <tempdir>/output/<provider>/<model>/<category>/<eval>/ for the generated source files
- run.log in that directory for step-by-step output (install, deploy, tsc, eslint, vitest)
- node_modules/ for the actual resolved dependency versions
- package.json for what the model requested vs what was installed
This is essential for distinguishing "model wrote bad code" from "model's code is fine but the grader is too strict" from "dependency version bug".
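The requested-versus-installed comparison can be sketched as a pure function over two maps. The caller is assumed to have already read the dependencies field from the model's package.json and the version field from each package's own package.json under node_modules/. The names compareDependencies, VersionMap, and DependencyRow are hypothetical.

```typescript
type VersionMap = { [packageName: string]: string };

interface DependencyRow {
  name: string;
  requested: string; // range the model wrote, e.g. "^1.31.2"
  installed: string | null; // resolved version, or null if the package is missing
}

// Hypothetical helper: pair each requested dependency with what was actually installed.
function compareDependencies(requested: VersionMap, installed: VersionMap): DependencyRow[] {
  return Object.keys(requested)
    .sort()
    .map((name) => ({
      name,
      requested: requested[name],
      installed: name in installed ? installed[name] : null,
    }));
}

const rows = compareDependencies(
  { convex: "^1.31.2", "made-up-package": "^9.9.9" }, // second entry is illustrative
  { convex: "1.31.2" },
);
// rows[0] pairs convex with "1.31.2"; rows[1] has installed === null
```

A null in the installed column points at a hallucinated or unresolvable dependency rather than a problem with the model's Convex code.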
Push back with specific recommendations if calibration looks off. Suggest concrete changes to the task, answer, or tests.
Summary Checklist
- Concept and category confirmed with user
- Convex docs consulted for the feature being tested
- Relevant existing guidelines checked, with expected implications noted
- No significant overlap with existing evals (or overlap discussed with user)
- TASK.txt reviewed and approved by user (Plan mode)
- Test approach discussed, especially if non-standard grading is needed
- Answer implemented and codegen run
- grader.test.ts written
- bun run typecheck passes
- bun run validate:answers passes for the new eval
- Smoke test passes for at least one model
- Calibrated on a representative set of models, expanded if needed
- Results reviewed, including eval gap vs model gap vs guideline gap
- Difficulty is appropriate