name: dev-render-htmlcss-feature
description: >
  Manual-invocation only. Five-phase feature loop (audit → ground → fixture → implement → verify) for driving a single CSS feature to Chromium parity in the grida htmlcss renderer.
grida-htmlcss — feature loop
What this is. A heavy, manually-invoked loop for driving a single
CSS feature forward in the grida htmlcss renderer. Do not auto-trigger;
load only when the user explicitly runs it. The loop is a conductor
over /research, /fixtures, and /render-reftest — those auto-trigger
on their own for narrower work.
Sibling skill. For features in the SVG path
(crates/grida/src/htmlcss/svg/, resvg-test-suite corpus,
multi-oracle scoring against expected.png + baked Chrome PNG), use
dev-render-htmlcss-svg-feature
instead. Same five-phase shape, different tooling.
Lifecycle. Expect this skill to grow as new divergence patterns surface. It will likely go stale in parts once htmlcss hits Chromium-parity on L0/L1; treat the phase structure as durable and the property-specific callouts as advisory.
The five phases
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ 1. AUDIT │→ │2. GROUND │→ │3. FIXTURE│→ │ 4. IMPL │→ │5. VERIFY │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
│ │
└───── ← ─── loop ← ─── score < floor ← ─── diff ← ───────┘
Each phase has a question it answers, a deliverable, and an exit criterion. Don't skip forward; don't linger past the exit criterion. The loop closes at verify — if the score is below the gate, return to phase 3 or 4 with a specific hypothesis, not a vibe.
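The transition rule above can be sketched as a small state machine. This is an illustration only: the `Phase` and `Hypothesis` types and `next_phase` are hypothetical names, not part of the grida codebase.

```rust
/// The five phases of the loop, in order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Audit,
    Ground,
    Fixture,
    Implement,
    Verify,
}

/// The specific hypothesis a failed verify must carry (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Hypothesis {
    /// The fixture is too noisy or tests the wrong subject.
    FixtureAtFault,
    /// The renderer itself diverges.
    RendererBug,
}

/// Returns the next phase, or `None` once verify meets the gate.
/// `score`, `floor`, and `hypothesis` are only consulted in `Phase::Verify`.
fn next_phase(current: Phase, score: f64, floor: f64, hypothesis: Hypothesis) -> Option<Phase> {
    match current {
        Phase::Audit => Some(Phase::Ground),
        Phase::Ground => Some(Phase::Fixture),
        Phase::Fixture => Some(Phase::Implement),
        Phase::Implement => Some(Phase::Verify),
        // Gate met: the loop closes.
        Phase::Verify if score >= floor => None,
        // Gate missed: go back to phase 3 or 4, chosen by the hypothesis.
        Phase::Verify => match hypothesis {
            Hypothesis::FixtureAtFault => Some(Phase::Fixture),
            Hypothesis::RendererBug => Some(Phase::Implement),
        },
    }
}
```

There is no transition that skips a phase, and a failed verify cannot be re-entered without naming a hypothesis.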
1. Audit — "what's the actual state of this feature?"
Question. Where is the feature on the grida side today? What renders wrong, what doesn't render at all, what renders coincidentally-correctly but by the wrong path?
Actions.
- Scan crates/grida/src/htmlcss/ for the property name in stylo enum mapping, paint emit, layout feed. A property can be parsed-but-dropped, emitted-but-wrong, or unhandled; each has a different fix shape.
- Enumerate existing fixtures that touch the feature (fixtures/test-html/L0/). Run them under L0.coverage and record current similarity per fixture. This is the before-number.
- Check docs/wg/feat-2d/htmlcss.md and any related design notes for a prior decision or deliberate gap.
- List sibling properties likely to break the same way (e.g. a gap in border-radius %-values suggests the same gap in border-image-slice %-values).
Deliverable. A short audit note inside the task prompt or the PR draft:
- Current support level: not-parsed / parsed-but-dropped / partial / Chromium-parity-except-X.
- Fixtures touching it: list with current similarity scores.
- Priority bucket: easy-and-important / easy-low-value / hard-important / hard-low-value. Default to easy-and-important (the top-left of this 2×2); only go hard-important when called out.
Exit when. You can state the feature's current renderer state in one paragraph with file references. If you can't, you don't know enough yet — read more, don't guess.
2. Ground — "how do real engines solve this?"
Question. What's the canonical implementation strategy for this feature in a mature engine? We are not inventing; we are adapting.
Actions. Invoke /research. Three engines are the usual
references:
- Servo + stylo — Rust, most readable. Especially useful for parsing, cascade, inheritance, computed-value rules.
- Chromium / Blink — C++. Authoritative for layout and paint divergence calls. The renderer we diff against.
- WebKit — C++. Third voice; useful when Blink has controversial behavior (Safari-only bugs / features).
For a new property, read the spec first (CSS Backgrounds, CSS Display, CSS Values 4, etc.). Then look up:
- How stylo represents the property's computed value.
- How Blink paints or lays out against that representation.
- What WPT section exercises it (for free fixtures later).
Deliverable. A research note — either inline in the PR
description or under docs/wg/feat-2d/ if substantial — with:
- The spec section(s) that govern behavior.
- The 3–6 line summary of how stylo/Blink structure the solution.
- The explicit deviation, if any, and why.
Exit when. You can defend the implementation shape by pointing at prior art, not just "it compiles and the fixture passes." If the only justification is the fixture, you've over-fit.
3. Fixture — "what's the smallest test that proves it?"
Question. What HTML/CSS input demonstrates the feature unambiguously, and what does the ideal rendered output look like?
Actions. Invoke /fixtures for authoring rules; /render-reftest
for the suite manifest. In short:
- One concept per file. paint-<property>-<variant>.html naming.
- Probe-friendly palette (≤3 colors, round coordinates) when the feature is pixel-precision rather than paint-rich.
- Paint vs. layout decision. Paint fixtures fix body size to the preset (via min-height); layout fixtures let content size itself and carry an explicit viewport in the suite entry. See fixtures/test-html/README.md.
- Inject hide-text.css via extra_css when text is incidental (labels for humans, not the subject under test). This is the single biggest lever against noise.
- WPT fixtures are fair game: prefer pulling an established WPT test into the suite over authoring one from scratch when the section is mature.
Deliverable.
- One or more fixtures under fixtures/test-html/L0/.
- Entries in fixtures/test-html/suites/L0.coverage.json. Only put in L0.exact.json after verify phase confirms 100.00%.
- For layout fixtures: the measured viewport.height from the grida natural cull.
Exit when. The fixture runs through both producers and produces PNGs of identical dimensions. Dimension mismatch → stop; the suite config is wrong and the score will be zero.
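The dimension check can be done on the PNG headers alone, before any diffing. A minimal sketch using only the standard library: it relies on the PNG layout (8-byte signature, then an IHDR chunk whose data starts with big-endian width and height). The function names are hypothetical and this is not how @grida/reftest is known to do it.

```rust
/// Reads (width, height) from the IHDR chunk at the start of a PNG byte stream.
/// Returns `None` if the bytes are not a PNG signature followed by IHDR.
fn png_dimensions(bytes: &[u8]) -> Option<(u32, u32)> {
    const SIGNATURE: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    // Layout: signature (8) + chunk length (4) + "IHDR" (4) + width (4) + height (4).
    if bytes.len() < 24 || !bytes.starts_with(&SIGNATURE) || &bytes[12..16] != b"IHDR" {
        return None;
    }
    let width = u32::from_be_bytes([bytes[16], bytes[17], bytes[18], bytes[19]]);
    let height = u32::from_be_bytes([bytes[20], bytes[21], bytes[22], bytes[23]]);
    Some((width, height))
}

/// True only when both producers emitted valid PNGs of identical dimensions.
fn same_dimensions(expected: &[u8], actual: &[u8]) -> bool {
    match (png_dimensions(expected), png_dimensions(actual)) {
        (Some(e), Some(a)) => e == a,
        _ => false,
    }
}
```

Feed it the bytes of the expected and actual PNGs (for example via `std::fs::read`); a `false` here means the suite entry needs fixing, not the renderer.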
4. Implement — "what code change realizes the behavior?"
Question. What is the minimum set of edits in
crates/grida/src/htmlcss/ to make the fixture match?
Actions.
- Touch the smallest surface that can possibly work. Avoid "refactor + feature" in one commit; the reftest cannot tell you which change caused which delta.
- Trace the pipeline end-to-end for the property: parse → compute → layout feed → paint. A feature can fail at any stage; diagnose before editing.
- Add unit tests where behavior is data-assertable (computed value, resolved length, layout position). Data tests are free and catch regressions the reftest can't (e.g. "this resolves to 12px in both Chromium and us, for the right reason").
- When in doubt, mirror the Blink / stylo structure. Deviations cost reviewer attention; prior-art parity is free.
Deliverable.
- Code change scoped to the feature.
- Any new data tests for the computed-value surface.
- A one-line entry in the PR description for each user-facing behavior change, written in spec terms, not implementation terms.
Exit when. cargo check -p grida is clean, existing tests pass,
and the fixture renders through grida_wpt render --suite without
error. Similarity score is measured in phase 5 — do not gate on
it here.
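A data test of the kind named in the actions above might look like the following. It is a sketch under stated assumptions: `LengthPercentage` and `resolve` are hypothetical stand-ins, not the stylo or grida types, and the example relies on the CSS Backgrounds rule that a horizontal border-radius percentage resolves against the border-box width.

```rust
/// A computed length-percentage, reduced to the two cases this example needs.
/// Hypothetical type; the real computed-value representation lives in stylo.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LengthPercentage {
    Px(f32),
    Percent(f32),
}

impl LengthPercentage {
    /// Resolves to used pixels. `basis` is the reference length in px,
    /// e.g. the border-box width for a horizontal border-radius.
    fn resolve(self, basis: f32) -> f32 {
        match self {
            LengthPercentage::Px(px) => px,
            LengthPercentage::Percent(pct) => basis * pct / 100.0,
        }
    }
}

/// "This resolves to 12px, for the right reason": 50% of a 24px-wide box.
#[test]
fn border_radius_percent_resolves_against_box_width() {
    assert_eq!(LengthPercentage::Percent(50.0).resolve(24.0), 12.0);
    // An absolute length ignores the basis entirely.
    assert_eq!(LengthPercentage::Px(12.0).resolve(24.0), 12.0);
}
```

The test asserts on a number, not on pixels, so it stays meaningful even when the reftest score moves for unrelated reasons.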
5. Verify — "does it actually match Chromium?"
Question. Is the rendered output Chromium-parity at the fixture's tolerance gate?
Actions. This is /render-reftest's core loop. For each fixture
in the change:
- Render expecteds (Playwright Chromium) into target/refbrowser/<suite>/expected.
- Render actuals (cargo run -p grida_wpt -- render --suite … --out-dir target/refbrowser/<suite>/actual).
- Diff with @grida/reftest, threshold 0 (the strict default).
- Read similarity against the suite's gate.floor.
Don't trust the score naively — see "Reading the score" in the render-reftest skill. A 96% score on a sparse fixture can mask a completely broken subject. Eyeball the diff PNG every time. A single round of verification without visual inspection is not verification.
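The 96% warning is plain arithmetic. The sketch below assumes similarity is the fraction of matching pixels (the actual @grida/reftest metric may differ) and uses a hypothetical 200×200 canvas whose subject covers 40×40 pixels. Both function names are illustrative.

```rust
/// Fraction of matching pixels, assuming similarity = matching / total.
/// Precondition: `total_pixels > 0`.
fn similarity(total_pixels: u64, differing_pixels: u64) -> f64 {
    total_pixels.saturating_sub(differing_pixels) as f64 / total_pixels as f64
}

/// Fraction of the subject's own pixels that differ, assuming every
/// differing pixel falls inside the subject. Precondition: `subject_pixels > 0`.
fn subject_error(subject_pixels: u64, differing_pixels: u64) -> f64 {
    differing_pixels.min(subject_pixels) as f64 / subject_pixels as f64
}
```

With 200 × 200 = 40,000 pixels in total and all 40 × 40 = 1,600 subject pixels wrong, `similarity(40_000, 1_600)` is 38,400 / 40,000 = 0.96 while `subject_error(1_600, 1_600)` is 1.0: the headline score looks healthy and the subject is entirely broken.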
Close the loop:
- Score ≥ gate.floor? Promote the fixture to L0.exact.json if it reached 100.00%; otherwise leave in coverage and document the residual delta in the PR description.
- Score < floor? Return to phase 3 (fixture too noisy / wrong subject) or phase 4 (renderer bug) with a specific hypothesis. Do not lower the gate to fit the result; the gate exists so regressions are loud.
Deliverable. The PR description, written honestly:
- Before/after similarity numbers for every affected fixture.
- Diff PNGs attached or linked for any score < 1.0.
- The specific divergence surface (rounding, AA, layout math,
etc.) if below 100.00%. "Renderer choice differs from Blink at
" beats "close enough."
Exit when. The PR description can be read by someone who has never seen the code and they know exactly what's now supported, what's still broken, and what the score proves.
Handoffs and artifacts
The phases are designed so an agent can stop, a second agent can pick up, and no context is lost. The durable artifacts:
| Phase | Artifact | Location |
|---|---|---|
| Audit | Current-state note, priority bucket | PR description / task prompt |
| Ground | Research note (spec + engine cross-ref) | PR description or docs/wg/feat-2d/ |
| Fixture | .html fixture(s), suite entries, viewport measurement | fixtures/test-html/L0/, fixtures/test-html/suites/ |
| Implement | Code change, data tests, behavior summary | crates/grida/src/htmlcss/ |
| Verify | Before/after scores, diff PNG review, divergence surface | PR description |
If a phase's artifact is missing, the phase isn't done — even if the code "works."
Gate policy — the part that makes automation safe
The only reason this loop can be automated is that phase 5 has a numeric, unambiguous, byte-exact pass condition. Everything upstream is advisory; verify is the truth.
- L0.exact.json: gate.floor = 1.0, threshold = 0, aa = off. Any regression is a real renderer change we made differently from Blink. No tolerance inflation, ever.
- L0.coverage.json: informational scores, no gate. Landing a fixture here is "we know about this case and intend to fix it." Promoting to exact is "we now match Blink."
Automation rules downstream of this skill (CI gating, auto-merge,
etc.) must assert on the report.json emitted by @grida/reftest
and not on free-text agent assertions. The agent's job is to
drive the loop; the report is the contract.
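An automation step that asserts on the report rather than on prose could reduce to a check like this. The report.json schema is not specified here, so `FixtureResult` and its fields are hypothetical; deserializing the real report is left out.

```rust
/// One fixture's result as read from the report.
/// Hypothetical shape: the actual report.json fields may be named differently.
struct FixtureResult {
    name: String,
    similarity: f64,
}

/// Names of fixtures scoring below `floor`. An empty result means the gate passes.
fn failing_fixtures(results: &[FixtureResult], floor: f64) -> Vec<&str> {
    results
        .iter()
        .filter(|r| r.similarity < floor)
        .map(|r| r.name.as_str())
        .collect()
}
```

For the exact suite, call it with `floor` set to 1.0 and fail the job on any non-empty result. Nothing an agent writes in free text enters the decision.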
What "destructive" means here
A change is destructive if it:
- Lowers gate.floor in L0.exact.json.
- Removes an entry from L0.exact.json without a corresponding coverage entry (or documented reason).
- Increases --threshold or enables --aa to absorb real divergence.
- Suppresses a fixture to dodge a failing score.
None of these are acceptable without explicit human approval. The loop fails loudly instead.
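Three of the four rules above are mechanically checkable by comparing the exact suite before and after a change. A sketch, assuming a hypothetical in-memory `ExactSuite` shape (the real suite JSON layout may differ); fixture suppression and "documented reason" exemptions still need a human reviewer.

```rust
/// Gate settings and fixture list of the exact suite (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct ExactSuite {
    floor: f64,
    threshold: f64,
    aa: bool,
    fixtures: Vec<String>,
}

/// A destructive change detected between two versions of the suite.
#[derive(Debug, PartialEq)]
enum Destructive {
    FloorLowered,
    ThresholdRaised,
    AaEnabled,
    /// Removed from exact with no matching coverage entry.
    FixtureDropped(String),
}

/// Lists every destructive change going from `before` to `after`.
/// `coverage` holds the fixture names present in the coverage suite.
fn destructive_changes(before: &ExactSuite, after: &ExactSuite, coverage: &[String]) -> Vec<Destructive> {
    let mut found = Vec::new();
    if after.floor < before.floor {
        found.push(Destructive::FloorLowered);
    }
    if after.threshold > before.threshold {
        found.push(Destructive::ThresholdRaised);
    }
    if after.aa && !before.aa {
        found.push(Destructive::AaEnabled);
    }
    for name in &before.fixtures {
        if !after.fixtures.contains(name) && !coverage.contains(name) {
            found.push(Destructive::FixtureDropped(name.clone()));
        }
    }
    found
}
```

A non-empty result should stop the loop and ask for explicit human approval rather than proceed.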
Anti-patterns
| Anti-pattern | Why it fails | Instead |
|---|---|---|
| Skipping audit, starting with "fix this bug" | The bug is a symptom; the broken pipeline stage may be a different property. | Trace parse→compute→layout→paint first. Name the stage. |
| Skipping ground, implementing from intuition | CSS is full of non-obvious spec requirements. "Looks right" to a human ≠ spec-correct. | Read the spec. Cross-ref one real engine. |
| Combining refactor + feature in one PR | Reftest deltas can't be attributed. | Land the refactor alone first (score must not drop). |
| Raising threshold to "just pass" | Hides real bugs. Turns the harness into a rubber stamp. | Fix the divergence. If out of scope, document + leave in coverage. |
| Using text-heavy fixtures to test non-text features | Font shaping noise dominates the score; you're measuring the wrong thing. | Inject hide-text.css. Or use probe-friendly fixtures. |
| Promoting to exact at 99.xx% | The exact suite is a byte-exact contract. Near-passes belong in coverage with a delta note. | Wait for 100.00%. Or fix the residual. |
| Claiming "verified" without reading the diff | A similarity score is a coarse index; the diff image is the truth. | Eyeball every sub-100 diff. Record the specific divergence. |
| Inventing new fixtures when WPT covers it | Duplicates work; WPT has reviewed spec-intent pass criteria. | Import the WPT fixture; cite it in the suite entry. |
The template — paste this to kick off a cycle
Fill in the brackets. The agent you hand it to should produce all five artifacts before declaring done. Expect to run the loop in passes (audit+ground+fixture → implement → verify), with a checkpoint at each pass that future-you or a reviewer can read without the conversation.
Drive the htmlcss feature loop for: <property or behavior>.
Follow the dev-render-htmlcss-feature skill (.agents/skills/dev-render-htmlcss-feature/SKILL.md).
Scope:
- Feature: <e.g. `border-radius` percentage values>
- Hypothesis: <e.g. grida parses but drops %-values in the paint stage>
- Expected: <e.g. promote paint-border-radius.html to L0.exact>
Produce, in order:
1. Audit note: current support level, file references, before-scores.
2. Ground note: spec section(s), stylo/Blink strategy summary.
3. Fixture(s): `.html` + suite entries. Paint or layout? Declare it.
4. Implementation: minimal diff. Data tests where assertable.
5. Verify report: before/after similarity per fixture, diff PNG
review for any sub-1.0 score, promoted fixtures listed.
Gate: L0.exact must stay at floor 1.0, threshold 0, aa off. Do not
relax the gate. If the feature doesn't reach 100.00%, leave it in
coverage with a specific divergence-surface note.
Use /research for phase 2, /fixtures for phase 3, /render-reftest for
phases 3 and 5.