name: project-evals
description: Best practices for creating expectations and grader files to evaluate guidance quality. Use this skill any time you're writing or reviewing an expectations.md or grader.ts file.
Stage 3: Evaluating guidance for a use case (Needs evals)
This is the third of three stages in creating guidance:
- Stage 1: Identifying use cases for a feature
- Stage 2: Authoring guidance for a use case
- Stage 3: Evaluating guidance for a use case (you are here)
What the eval agent sees vs real-world agents
Real-world coding agents see only guide.md — retrieved automatically via the RAG skills system when a developer asks for help. Every other file in a use case directory is eval infrastructure.
The eval harness runs a separate coding agent in a controlled environment to test whether the guidance works. This eval agent receives the first prompt from tasks/task.md and has access to guide.md via the same RAG system. The harness then runs grader.ts against the eval agent's output.
None of the following are ever seen by real-world coding agents:
| File | Role in eval pipeline |
|---|---|
| tasks/task.md | Simulated developer prompts and base application name fed to the eval agent by the harness |
| demo.html | Reference implementation; the grader runs against it to confirm tests pass on correct code |
| negative-demo.html | Anti-example; the grader runs against it to confirm tests fail on incorrect code |
| expectations.md | Spec used to generate grader.ts |
| grader.ts | Playwright tests run against the eval agent's output |
How the eval files work together
tasks/task.md, expectations.md, and grader.ts form a tightly coupled pipeline:
- tasks/task.md: Simulated developer prompts used only by the eval harness. It must start with YAML frontmatter specifying the base_app, followed by a list of prompts. Each prompt should sound like a real developer request, without naming specific APIs or best practices; the eval agent is expected to discover those by reading guide.md via RAG. The first prompt is the most important: it is used as the default task.
- expectations.md: The ground truth for what a correct implementation looks like. Each bullet becomes exactly one test in grader.ts. Write expectations assuming the eval agent read guide.md and implemented it faithfully; they describe the observable output, not the implementation approach.
- grader.ts: A Playwright test file generated 1:1 from expectations.md. Every bullet maps to one test() block. If an expectation cannot be translated into a Playwright assertion (static file check or browser automation), it does not belong in expectations.md.
Writing expectations.md
Write a natural-language, bulleted list of assertions that must be true if an agent implements guide.md correctly (e.g., "The input element is styled with a red border only AFTER a blur event").
- 1:1 with grader tests — Each bullet becomes exactly one test. Write one bullet per assertion. Do not combine multiple checks into a single bullet.
- Concrete, Testable Criteria (No API Facts) — Expectations must be verifiable browser behaviors we can check with Playwright (e.g., computed styles, DOM layout), not just factual statements about an API or code structure.
- Exercised in Demo: Ensure that every expectation written here is actively exercised in the accompanying demo.html. Expectations that aren't covered by the demo lead to unreliable grader calibration.
- Scoped to this use case: Only include expectations that apply to the specific use case being graded. Do not copy generic expectations from other guides if they describe behavior that won't appear in an implementation of this guide (e.g., don't include URL input expectations in a sign-in form grader).
- No external links — The grader generator cannot resolve them.
- Avoid over-constraining — Don't assert implementation details that don't affect correctness (e.g., don't require a direct child relationship if a descendant also works).
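The 1:1 rule between bullets and tests can be checked mechanically before calibration. The sketch below is a rough heuristic, not part of the gd tool: the function names are hypothetical, it assumes expectations are written as top-level `-` or `*` bullets, and it only counts `test(` and `test.only(` calls that start a line.

```typescript
/** Count top-level markdown bullets in the text of an expectations.md file. */
function countExpectationBullets(markdown: string): number {
  return markdown
    .split(/\r?\n/)
    .filter((line) => /^[-*]\s+\S/.test(line)).length;
}

/** Count test(...) / test.only(...) calls that begin a line in grader source. */
function countTestBlocks(source: string): number {
  const matches = source.match(/^[ \t]*test(?:\.only)?[ \t]*\(/gm);
  return matches ? matches.length : 0;
}

/** Return null when counts agree, otherwise a human-readable mismatch message. */
function checkOneToOne(expectationsMd: string, graderTs: string): string | null {
  const bullets = countExpectationBullets(expectationsMd);
  const tests = countTestBlocks(graderTs);
  return bullets === tests
    ? null
    : `expectations.md has ${bullets} bullet(s) but grader.ts has ${tests} test() block(s)`;
}
```

A mismatch usually means a bullet combines several checks or describes something that cannot be asserted in Playwright.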
Grading Note
- Graders (grader.ts) live within their respective guide folders. These are Playwright test files.
- AVOID using static assertions (like regex or str.includes() on fs.readFileSync) to test CSS or HTML syntax whenever possible. These are extremely brittle and will fail if the agent uses a different class name, semantic element, or formatting.
- Instead, PREFER using Playwright's browser APIs to test computed styles and actual DOM layout. Use element.evaluate((el) => window.getComputedStyle(el).propertyName) to robustly verify that the browser is rendering the feature correctly, regardless of how the agent authored the code.
- A human may manually edit the .ts file if the generator struggles to get it perfectly tailored.
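As an illustration of the computed-style approach, here is a minimal sketch of a helper a grader might use. The helper name `computedStyleOf` and the `StyleTarget` interface are hypothetical; the interface is a structural stand-in so the sketch needs no imports, on the assumption that a Playwright Locator (whose `evaluate(pageFunction, arg)` method it mirrors) would be passed in a real grader. The selector and colour in the usage comment are illustrative only.

```typescript
// Structural stand-in for the one Locator method used here (assumption).
interface StyleTarget {
  evaluate<R, A>(fn: (el: Element, arg: A) => R, arg: A): Promise<R>;
}

/** Read one computed CSS property as the browser resolved it, not as it was authored. */
async function computedStyleOf(target: StyleTarget, property: string): Promise<string> {
  return target.evaluate(
    // Runs in the page: the property name is passed as an argument, not closed over.
    (el, prop) => window.getComputedStyle(el).getPropertyValue(prop),
    property,
  );
}

// Usage inside a grader test (selector and expected value are illustrative):
//
//   test('input shows a red border after blur', async ({ page }) => {
//     const input = page.locator('input').first();
//     await input.focus();
//     await input.blur();
//     expect(await computedStyleOf(input, 'border-top-color')).toBe('rgb(255, 0, 0)');
//   });
```

Because the check reads the resolved style, it holds whether the agent used a class, an inline style, or a pseudo-class selector to produce the border.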
Once a guide has its guide.md, demo.html, and expectations.md completely written, it is ready for the evaluation pipeline.
Generating the Eval Graders
To generate the eval graders, use the gd dev tool.
Run the following command:
node ./bin/gd.ts dev <path-to-guide-directory>
This command will automatically:
- Generate a negative-demo.html based on the guidance.
- Generate a grader.ts Playwright test that asserts your expectations.md against both demo.html (should pass) and negative-demo.html (should fail).
- Test and calibrate the grader by running the test suite.
- Eval Performance Thresholds: A guide is not considered ready if evaluation pass rates are low. A 0% unguided pass rate is a critical blocker, indicating the guide may lack sufficient scaffolding for the model to discover the solution.
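The calibration idea (pass on demo.html, fail on negative-demo.html) can be sketched as a small check over test outcomes. This is not how gd dev is implemented; the types and function name are hypothetical, and the sketch assumes "calibrated" means every test passes on the demo and at least one test fails on the negative demo.

```typescript
type TestOutcome = { name: string; passed: boolean };

interface CalibrationInput {
  demo: TestOutcome[];         // results of grader.ts against demo.html
  negativeDemo: TestOutcome[]; // results of grader.ts against negative-demo.html
}

/** Return a list of calibration problems; an empty list means the grader looks calibrated. */
function calibrationProblems(input: CalibrationInput): string[] {
  const problems: string[] = [];
  for (const outcome of input.demo) {
    if (!outcome.passed) {
      problems.push(`demo.html fails "${outcome.name}"`);
    }
  }
  // Assumption: a single failing test is enough to reject the negative demo.
  if (input.negativeDemo.every((outcome) => outcome.passed)) {
    problems.push("negative-demo.html passes every test, so the grader cannot tell correct from incorrect code");
  }
  return problems;
}
```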
Writing tasks/task.md
tasks/task.md contains realistic developer prompts used to run AI agents end-to-end against the guide's grader, preceded by YAML frontmatter specifying the base application.
Format:
---
base_app: daily-grind
---
- make my images load faster on the page
- Optimize the priority of my LCP image 'hero.jpg' and deprioritize the gallery images below the fold.
Critical: The first prompt is the most important. It is used as the default task for the harness, and it must be specific enough to produce a grader-testable result.
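To make the format concrete, the sketch below parses a task file of the shape shown above and picks the default task. It is a hypothetical helper, not the harness's actual parser: it assumes the frontmatter is delimited by `---` lines, that `base_app` is a simple `key: value` line, and that each prompt is a single-line `-` bullet.

```typescript
interface TaskFile {
  baseApp: string;
  prompts: string[];
}

/** Parse frontmatter plus a bullet list of prompts from the text of tasks/task.md. */
function parseTaskFile(text: string): TaskFile {
  const match = text.match(/^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/);
  if (!match) {
    throw new Error("task.md must start with YAML frontmatter delimited by ---");
  }
  const [, frontmatter, body] = match;

  const baseAppLine = frontmatter.split(/\r?\n/).find((line) => /^base_app\s*:/.test(line));
  if (!baseAppLine) {
    throw new Error("frontmatter must specify base_app");
  }
  const baseApp = baseAppLine.slice(baseAppLine.indexOf(":") + 1).trim();

  const prompts = body
    .split(/\r?\n/)
    .filter((line) => /^-\s+\S/.test(line))
    .map((line) => line.replace(/^-\s+/, "").trim());
  if (prompts.length === 0) {
    throw new Error("task.md must list at least one prompt");
  }
  return { baseApp, prompts };
}

/** The first prompt is the default task. */
function defaultTask(task: TaskFile): string {
  return task.prompts[0];
}
```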
Rules:
- DO write prompts as a developer talking to an AI coding assistant — casual, lowercase, sometimes vague.
- DO phrase prompts as action requests or directives (e.g. "add X", "can you build Y", "implement Z").
- DO NOT phrase prompts as advisory questions (e.g. "how can I?", "what's the best way to?", "can you explain?"). The agent must implement, not just explain.
- DO vary specificity: include at least one vague/intent-based prompt and one specific/technical ask.
- DO assume the developer is working on an existing app (the base app). Reference its real assets and endpoints if needed (e.g., hero.jpg, /api/analytics).
- DO NOT mention the guide, the feature name, or hint that guidance exists.
- DO NOT name the base app (e.g., "daily-grind") — a real developer wouldn't refer to it that way.
- DO NOT tell the agent which web API or CSS property to use unless a real developer would naturally do so. The point is to test whether the agent discovers the right solution via the guide.
> [!IMPORTANT]
> Functional Locators vs. Technical Solutions: It is completely acceptable (and sometimes necessary) to mandate specific DOM IDs or CSS classes (e.g., "add a .fan-card class") if the grader requires them to locate elements. What is strictly banned is mandating the underlying implementation technology (e.g., commanding the model to "use sibling-index()" or "use the Temporal API").
Quantity: 1–4 prompts is typical. A single highly specific prompt is fine for technical use cases. Multiple prompts are useful for use cases with multiple valid entry points (e.g., "accordion", "tabs", "drawer" all exercising the same feature).
Test your prompts: Before finalizing, ask yourself: would an agent reading this prompt understand what they need to build? Vague phrases like "I should be able to search" may not convey browser-native "Find in page" behavior to a model. If the prompt is ambiguous, rewrite it to make the intent explicit.
Consistency: If writing multiple prompts, consider starting them with the same verb or structure (e.g., all starting with "Create a...") to make the list scannable and consistent.
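Two of the rules in this section lend themselves to a mechanical check: prompts must not be advisory questions, and they must not name the base app. The following is a minimal sketch with hypothetical names; the opener patterns are taken from the examples in the rules and are not exhaustive, so a clean result does not replace a human read-through.

```typescript
interface PromptIssue {
  prompt: string;
  problem: string;
}

// Advisory openers drawn from the examples in the rules; extend as needed.
const ADVISORY_OPENERS: RegExp[] = [
  /^how (can|do|should) i\b/i,
  /^what('s| is) the best way\b/i,
  /^can you explain\b/i,
];

/** Flag prompts that read as advisory questions or that mention the base app by name. */
function lintPrompts(prompts: string[], baseApp: string): PromptIssue[] {
  const issues: PromptIssue[] = [];
  for (const prompt of prompts) {
    const trimmed = prompt.trim();
    if (ADVISORY_OPENERS.some((pattern) => pattern.test(trimmed))) {
      issues.push({ prompt, problem: "phrased as an advisory question, not an action request" });
    }
    if (baseApp.length > 0 && trimmed.toLowerCase().includes(baseApp.toLowerCase())) {
      issues.push({ prompt, problem: "names the base app" });
    }
  }
  return issues;
}
```

Directive phrasings such as "can you build Y" are deliberately not flagged, since the rules list them as acceptable.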
Troubleshooting
If gd dev fails to calibrate the grader:
- Read the command output to see which assertions failed.
- If the grader logic generated by the pipeline is wrong, you may need to tweak the language in expectations.md so the generated grader is more accurate, or simply run gd dev again (it attempts to fix itself using failure context).