default-browser-devtools - SKILL.md Agent Skill

name: default-browser-devtools description: 'Cross-engine automation with self-adaptive learning. Validates UI, captures evidence, learns from failures. WebKit/Blink testing, deterministic workflows, agentic-eval integration.' license: MIT

Default Browser Engine DevTools Agent (Self-Adaptive Testing)

Overview

This skill provides cross-engine testing with continuous learning for self-contained Svelte apps. It combines deterministic test workflows with an agentic-eval framework that learns from test execution results, accumulating knowledge over time to improve test quality and reduce false positives.

Key capabilities:

Engine vs Browser: Targets real compatibility boundaries: Blink vs WebKit
Self-adaptive: Learns patterns, optimizes selectors, calibrates thresholds
Evidence-based: Captures screenshots, console logs, network failures, accessibility snapshots
Evaluation-driven: Scores tests on completeness, precision, recall, efficiency
Knowledge persistence: Accumulates learning across CI runs via knowledge base

Why this approach:

Testing rendering engines (Blink/WebKit) covers 99% of browser compatibility issues
Agentic-eval integration enables continuous quality improvement without manual tuning
Knowledge base reduces false positives and noise over time
Multi-dimensional scoring provides actionable feedback

Architecture

Self-Adaptive Learning Loop

Test Execution → Evidence Capture → Evaluation → Pattern Detection → Knowledge Update → Next Run
                                        ↓
                                  Optimization
                                  Suggestions

Components:

Runner (runner.js): Executes tests via Playwright, captures artifacts
Evaluator (evaluator.js): Scores test quality, detects patterns, suggests optimizations
Knowledge Base (test-knowledge.json): Persists learned patterns, optimized selectors, thresholds

Learning mechanisms:

Pattern occurrence tracking (ResizeObserver loops, hydration errors, network timeouts, infinite loops, Svelte reactivity warnings)
Selector confidence calibration based on interaction success rates
Failure clustering to identify common issue signatures
Threshold adjustment based on historical baseline scores
Infinite loop detection via repeated console message analysis
Reactive dependency cycle detection for Svelte/React frameworks

Enhanced Pattern Detection (v1.1)

The evaluator has been enhanced to detect and provide actionable feedback for additional patterns learned from real-world debugging:

Critical Patterns:

Infinite Loops: Detects repeated console messages in rapid succession (>20 occurrences, <100ms intervals)
- Example: [SLICE FETCH EFFECT] Triggered repeating infinitely
- Suggests checking for missing guards in reactive effects, circular dependencies
Reactive Dependency Cycles: Identifies circular reactive chains
- Example: State A updates → triggers effect → updates State B → triggers effect → updates State A
- Provides actionable items for breaking the cycle with guards or untrack()

Code Quality Patterns:

Svelte 5 Reactivity Warnings: Captures state_referenced_locally and similar compiler warnings
- Suggests using getter/setter pairs for proper reactivity
- Links to Svelte documentation for best practices
Hydration Mismatches: Critical for SSR/Svelte apps
- Indicates SSR/CSR content differs
- High severity as it can cause UI inconsistencies

Performance Patterns:

ResizeObserver Loops: Low severity, common false positive
Network Timeouts: Suggests retry logic or timeout adjustments
Timing Issues: Flags potential race conditions in async code

Engine Targets

Platform target	Practical engine family to test	Playwright engine
iOS / macOS	WebKit (Safari engine; iOS requires WebKit)	`webkit`
Windows / ChromeOS	Blink (Chromium/Edge family)	`chromium`
Linux	varies by distro/toolkit; often WebKitGTK; sometimes Chromium	`webkit` (default), optional `chromium`

Linux note: there is no single "default engine" across distros. This skill treats Linux as WebKit-first (WebKitGTK-like environments), but supports Chromium if that matches your target.

Prerequisites

Node.js 18+

Automatic Setup (New!)

The browser-dev-tools runner now automatically handles all setup on first run:

✅ Auto-detects missing dependencies and installs them
✅ Auto-installs Playwright browsers if needed
✅ Provides helpful error messages if server isn't running
✅ Optionally auto-starts dev server with --start-server flag

No manual setup required! Just run any command and the tool will ensure everything is ready.

Manual Setup (Optional)

If you prefer to install dependencies manually:

npm install
npx playwright install

Quick Start

Option 1: Automatic (Recommended)

# Tool automatically installs dependencies on first run
node .github/skills/default-browser-devtools/runner.js smoke --url http://localhost:5173

Option 2: With Auto-Start Server

# Tool starts dev server for you
node .github/skills/default-browser-devtools/runner.js smoke \
  --url http://localhost:5173 \
  --start-server

Option 3: Skip Auto-Install

# Skip dependency checks (if you've already installed manually)
node .github/skills/default-browser-devtools/runner.js smoke \
  --url http://localhost:5173 \
  --skip-install

Runner CLI Reference

Commands

`smoke`

Load + conservative click sweep + artifacts + evaluation

Purpose: Verify basic load, hydration, and interaction correctness across engines.

Artifacts:

baseline.png, post_interaction.png
snapshot_baseline.json, snapshot_post.json
console.json, network_failures.json
summary.json, evaluation.json

Example:

node .github/skills/default-browser-devtools/runner.js smoke \
  --url http://localhost:5173 \
  --engines webkit,chromium \
  --learn true

`triage`

Console + network failures + screenshot + evaluation

Purpose: Fastest signal for "what broke?" debugging.

Artifacts:

baseline.png
console.json, network_failures.json
summary.json, evaluation.json

Example:

node .github/skills/default-browser-devtools/runner.js triage \
  --url http://localhost:5173 \
  --engine webkit

`golden`

Deterministic feature sequence + evaluation

Purpose: "Don't lose features" regression detection.

Edit steps array in runner.js to match your app's invariants:

const steps = [
  { type: "clickText", value: "Surface" },
  { type: "waitText", value: "Surface Toolbox" },
  { type: "clickSelector", value: "#export-btn" },
  { type: "fillSelector", selector: "#filename", value: "test.csv" },
  { type: "press", value: "Enter" },
];

Artifacts:

golden_end.png
snapshot_golden_end.json
console.json, network_failures.json
summary.json, evaluation.json

Example:

node .github/skills/default-browser-devtools/runner.js golden \
  --url http://localhost:5173 \
  --engines webkit,chromium

`perf`

Performance trace + analysis

Purpose: Capture trace.zip for performance profiling.

Artifacts:

trace.zip (open in Playwright trace viewer or Chrome DevTools)

Example:

node .github/skills/default-browser-devtools/runner.js perf \
  --url http://localhost:5173 \
  --engine chromium

Note: Performance mode skips evaluation phase (not applicable to perf traces).

`eval`

Evaluate existing test artifacts (no test execution)

Purpose: Re-evaluate past test runs with updated knowledge base or evaluator logic.

Example:

node .github/skills/default-browser-devtools/runner.js eval \
  --artifacts ./artifacts/default-browser-devtools/20260215_103000 \
  --learn true

CLI Arguments

Test Configuration

Argument	Type	Default	Description
`--url`	string	`http://localhost:5173`	Target URL to test
`--engine`	string	-	Single engine: `webkit` or `chromium`
`--engines`	string	`webkit,chromium`	Comma-separated list of engines
`--headless`	boolean	`true`	Run browser in headless mode
`--readySelector`	string	-	Selector to wait for before proceeding
`--readyText`	string	-	Text to wait for before proceeding
`--timeoutMs`	number	`30000`	Timeout for ready checks (ms)
`--viewportWidth`	number	`1280`	Viewport width (px)
`--viewportHeight`	number	`720`	Viewport height (px)

Setup & Server Options (New!)

Argument	Type	Default	Description
`--skip-install`	boolean	`false`	Skip automatic npm and Playwright installation
`--start-server`	boolean	`false`	Automatically start dev server if not running
`--server-command`	string	`npm run dev`	Command to start server
`--server-wait-ms`	number	`10000`	Time to wait for server startup (ms)

Evaluation & Learning

Argument	Type	Default	Description
`--learn`	boolean	`true`	Enable evaluation & learning phase
`--knowledge-base`	string	`./test-knowledge.json`	Path to knowledge base JSON
`--eval-only`	boolean	`false`	Only evaluate, don't run tests
`--eval-consensus`	number	`1`	Run evaluation N times for consensus scoring
`--artifacts`	string	-	Path to artifacts directory (for eval-only mode)

Smoke Test Configuration

Argument	Type	Default	Description
`--sweepSelector`	string	`button, a, [role='button'], summary`	Selector for click sweep
`--maxClicks`	number	`30`	Max interactions in click sweep
`--postClickWaitMs`	number	`120`	Wait time after each click (ms)

Performance & Debugging

Argument	Type	Default	Description
`--perfWindowMs`	number	`1500`	Performance capture window (ms)
`--eval`	string	-	JavaScript expression to evaluate on page

Output Structure

artifacts/default-browser-devtools/<timestamp>/
├── webkit/
│   ├── baseline.png
│   ├── post_interaction.png
│   ├── snapshot_baseline.json
│   ├── snapshot_post.json
│   ├── console.json
│   ├── network_failures.json
│   ├── summary.json
│   └── evaluation.json          ← Agentic-eval output
├── chromium/
│   └── ... (same structure)
├── combined_summary.json
└── combined_evaluation.json      ← When using eval command

Evaluation & Learning System

Evaluation Rubric

The evaluator scores each test run across four dimensions:

Dimension	Weight	Description
Completeness	30%	All expected artifacts captured?
Precision	25%	Low false positive rate?
Recall	25%	Critical failures detected?
Efficiency	20%	Good cost vs value ratio?

Overall Score = Σ(dimension_score × weight)

Scoring Dimensions

Completeness (30%)

Measures whether all expected evidence was captured.

Checks:

Baseline screenshot present? (+15%)
Console logs captured? (+15%)
Network failures logged? (+15%)
Summary file written? (+10%)
Command-specific artifacts? (+15-25%)
- Smoke: post-interaction screenshot, post-snapshot
- Golden: golden-end screenshot, golden-end snapshot
- Perf: trace.zip
Interaction data (smoke)? (+10%)
Ready signal captured? (+10%)

Score interpretation:

0.90-1.0: Excellent, all evidence captured
0.70-0.89: Good, minor gaps
<0.70: Incomplete, missing critical artifacts

Precision (25%)

Measures false positive rate (noise vs signal).

Factors:

ResizeObserver loop count (common false positive, -2% per occurrence)
Excessive warnings (>20 warnings, -10%)
Dev asset failures (localhost/127.0.0.1, -5% per failure)

Score interpretation:

0.90-1.0: Excellent, very low noise
0.70-0.89: Acceptable, some false positives
<0.70: High noise, needs filtering

Recall (25%)

Measures whether critical failures are detected.

Checks:

Page errors detected? (if test passed despite errors, -30%)
Console errors caught? (if test passed despite errors, -30%)
Hydration issues detected? (critical for Svelte, -20% if missed)

Score interpretation:

0.90-1.0: Excellent, all critical issues caught
0.70-0.89: Good, minor blind spots
<0.70: Poor recall, missing failures

Efficiency (20%)

Measures cost vs value (are we wasting effort?).

Factors:

Interaction count (smoke test):
- <5 interactions: -20% (too few, low coverage)
- 10-30 interactions: optimal
- 50 interactions: -30% (diminishing returns)
Error rate:
- 10 errors: -20% (noisy test, wasted effort)

Score interpretation:

0.90-1.0: Excellent efficiency
0.60-0.89: Acceptable
<0.60: Inefficient, needs optimization

Pattern Detection

The evaluator recognizes common failure patterns:

Pattern	Severity	Description	Evaluator Uses
`resizeObserverLoop`	Low	ResizeObserver loop detected	Precision adjustment, filtering suggestion
`hydrationError`	High	Svelte hydration mismatch	Critical issue flag, code quality optimization
`networkTimeout`	Medium	Network request timeout	Reliability optimization, timeout tuning
`timingIssue`	Medium	Timing/async/race condition	Code quality flag, timing optimization

Optimization Suggestions

Based on detected patterns and scores, the evaluator suggests improvements:

Example optimizations:

{
  "optimizations": [
    {
      "category": "filtering",
      "priority": "low",
      "suggestion": "Add ResizeObserver loop filtering to reduce noise",
      "rationale": "5 occurrences detected"
    },
    {
      "category": "code-quality",
      "priority": "high",
      "suggestion": "Fix Svelte hydration mismatch (SSR/CSR content differs)",
      "rationale": "Hydration errors can cause UI inconsistencies"
    },
    {
      "category": "efficiency",
      "priority": "medium",
      "suggestion": "Reduce maxClicks or refine sweepSelector to focus on critical interactions",
      "rationale": "42 interactions is high; diminishing returns"
    }
  ]
}

Knowledge Base Schema

Structure (test-knowledge.json):

{
  "version": "1.0",
  "learned_patterns": [
    {
      "type": "resizeObserverLoop",
      "severity": "low",
      "description": "ResizeObserver loop detected (common false positive)",
      "occurrences": 12,
      "first_seen": "2026-02-15T10:30:00.000Z",
      "last_seen": "2026-02-15T14:20:00.000Z"
    }
  ],
  "optimized_selectors": {
    "interactive_elements": "button, a, [role='button'], summary, [onclick]",
    "confidence": 0.87
  },
  "threshold_calibration": {
    "smoke_test": {
      "baseline_score": 0.85,
      "confidence_threshold": 0.80,
      "adjusted_from": [
        { "date": "2026-02-10", "old": 0.80, "new": 0.85, "reason": "reduced false positives" }
      ]
    }
  },
  "failure_clusters": [
    {
      "pattern_signature": "hydrationError+networkTimeout",
      "patterns": ["hydrationError", "networkTimeout"],
      "occurrences": 3,
      "first_seen": "2026-02-14T09:00:00.000Z",
      "last_seen": "2026-02-15T12:00:00.000Z"
    }
  ]
}

Fields:

learned_patterns: Tracked pattern occurrences with frequency and timestamps
optimized_selectors: Selector strings refined based on interaction success rates
- confidence: 0-1 score; used when ≥0.75
threshold_calibration: Baseline scores and confidence thresholds for each workflow
failure_clusters: Common failure combinations (helps identify systemic issues)

Evaluation Output Schema

File (evaluation.json):

{
  "test_run_id": "webkit",
  "overall_score": 0.87,
  "dimensions": {
    "completeness": 0.95,
    "precision": 0.82,
    "recall": 0.90,
    "efficiency": 0.78
  },
  "confidence": 0.89,
  "feedback": [
    {
      "dimension": "precision",
      "severity": "low",
      "message": "High false positive rate detected",
      "action": "Consider filtering known benign issues (e.g., ResizeObserver loops)"
    }
  ],
  "optimizations": [
    {
      "category": "filtering",
      "priority": "low",
      "suggestion": "Add ResizeObserver loop filtering to reduce noise",
      "rationale": "5 occurrences detected"
    }
  ],
  "patterns_detected": 2,
  "patterns_learned": 1,
  "knowledge_base_updated": true,
  "consensus": {
    "runs": 3,
    "average_score": 0.86,
    "average_confidence": 0.88,
    "variance": 0.0002
  }
}

Confidence score:

High variance in dimension scores → lower confidence
Missing data (e.g., null artifacts) → reduced confidence
Typical range: 0.75-0.95

Consensus mode (--eval-consensus N):

Runs evaluation N times
Computes average scores and variance
Low variance → stable evaluation
High variance → consider adversarial review or refinement

Workflows

Workflow 1 — Smoke Test with Learning (Simplified!)

Goal: Verify load + hydration + basic interactions, learn from results.

New simplified usage (auto-installs everything on first run):

# That's it! No setup required - tool handles everything
node .github/skills/default-browser-devtools/runner.js smoke \
  --url http://localhost:5173 \
  --engines webkit,chromium

With auto-start server:

node .github/skills/default-browser-devtools/runner.js smoke \
  --url http://localhost:5173 \
  --engines webkit,chromium \
  --start-server

Process:

✨ Auto-check dependencies (installs if missing)
✨ Auto-install Playwright browsers (if needed)
✨ Check server availability (or start with --start-server)
Launch webkit and chromium browsers
Navigate to URL, wait for networkidle
Capture baseline screenshot + accessibility snapshot
Perform conservative click sweep (up to 30 interactions)
Capture post-interaction screenshot + snapshot
Log console messages + network failures
Evaluate test quality (completeness, precision, recall, efficiency)
Detect patterns (ResizeObserver, hydration, timing issues)
Suggest optimizations
Update knowledge base with learned patterns

Output:

{
  "ok": true,
  "outRoot": "./artifacts/default-browser-devtools/20260215_143000",
  "engines": ["webkit", "chromium"],
  "results": [...]
}

--- Evaluation & Learning Phase ---

Evaluating webkit test run...
  Score: 0.87
  Confidence: 0.89
  Patterns detected: 2
  Patterns learned: 1
  Optimizations suggested: 1
    - [low] Add ResizeObserver loop filtering to reduce noise

Evaluating chromium test run...
  Score: 0.92
  Confidence: 0.91
  Patterns detected: 1
  Patterns learned: 0

✓ Knowledge base updated

Workflow 2 — Triage (Fast Debugging)

Goal: Fastest path to "what broke?"

node .github/skills/default-browser-devtools/runner.js triage \
  --url http://localhost:5173 \
  --engine webkit \
  --learn true

Process:

Launch browser
Navigate to URL
Capture console logs + network failures + screenshot
Evaluate for critical issues
Update knowledge base

Use case: CI failure, need quick signal.

Workflow 3 — Golden Path (Regression Prevention)

Goal: Ensure core features work across engines.

Setup: Edit steps array in runner.js:

const steps = [
  { type: "clickText", value: "Bushing" },
  { type: "waitText", value: "Bushing Toolbox" },
  { type: "fillSelector", selector: "#bore-diameter", value: "0.5" },
  { type: "clickText", value: "Compute" },
  { type: "waitText", value: "Stress Analysis" },
];

Run:

node .github/skills/default-browser-devtools/runner.js golden \
  --url http://localhost:5173 \
  --engines webkit,chromium \
  --learn true

Use case: Nightly regression suite, feature acceptance testing.

Workflow 4 — Eval-Only Mode

Goal: Re-evaluate past test runs without re-running tests.

node .github/skills/default-browser-devtools/runner.js eval \
  --artifacts ./artifacts/default-browser-devtools/20260215_103000 \
  --learn true

Use case:

Updated evaluator logic
Knowledge base refinement
Historical trend analysis

Workflow 5 — Consensus Evaluation

Goal: Increase evaluation reliability via ensemble scoring.

node .github/skills/default-browser-devtools/runner.js smoke \
  --url http://localhost:5173 \
  --engine webkit \
  --eval-consensus 3

Output:

Evaluating webkit test run...
  Score: 0.86 (consensus of 3)
  Confidence: 0.88

When to use:

High variance in scores
Critical release validation
Evaluator calibration

Workflow 6 — Backward Compatible (Learning Disabled)

Goal: Run tests without evaluation phase.

node .github/skills/default-browser-devtools/runner.js smoke \
  --url http://localhost:5173 \
  --engines webkit,chromium \
  --learn false

Use case:

CI environments without persistent storage
Quick local checks
Debugging evaluator issues

Maturity Levels

This skill implements a Level 4 agentic-eval architecture:

Level	Capability	Status
Level 1	Basic reflection	✅ Implemented
Level 2	Evaluator separation	✅ Implemented (separate evaluator.js)
Level 3	Adversarial & ensemble	✅ Implemented (consensus mode)
Level 4	Benchmark-driven	✅ Implemented (knowledge base baselines)
Level 5	Confidence-calibrated & cost-aware	🔄 Partial (confidence scores, no cost routing yet)

Roadmap to Level 5:

Add cost-aware routing (skip evaluation for trivial tests)
Confidence-based early stopping
Evaluator model selection (small vs large)

Cross-Platform Portability Checklist

When testing Svelte apps for cross-platform compatibility:

Case-sensitive assets/imports (Linux/CI)
Path separator assumptions (Windows vs Unix)
Keyboard modifiers (Cmd vs Ctrl)
Font/layout dependencies (WebKit text rendering differs)
ResizeObserver / measurement loops (engine-specific)
Scroll container behavior (iOS WebKit scroll anchoring)
File input parsing + error UI (WebKit file dialog differences)
CSS containment (Blink vs WebKit implementation gaps)
Intersection observer thresholds (rounding differences)

Troubleshooting

Issue: First time setup or missing dependencies

New in v2.0: The runner now automatically handles setup! No manual intervention needed.

What happens automatically:

✅ Checks for node_modules and runs npm install if missing
✅ Checks for Playwright browsers and installs them if missing
✅ Provides clear progress messages during installation
✅ Verifies server is running before tests start

If you want to skip automatic installation:

node runner.js smoke --url http://localhost:5173 --skip-install

If server isn't running:

# Option 1: Auto-start server
node runner.js smoke --url http://localhost:5173 --start-server

# Option 2: Custom server command
node runner.js smoke --url http://localhost:5173 \
  --start-server \
  --server-command "npm run dev" \
  --server-wait-ms 15000

Manual setup (if preferred):

npm install
npx playwright install
npm run dev &
node runner.js smoke --url http://localhost:5173

Issue: "Cannot find package 'playwright'" error

This should no longer happen! The runner now automatically installs Playwright before importing it.

If you still see this error:

Make sure you're using the updated runner (check git log)
Try running with verbose output: node runner.js smoke --url http://localhost:5173 2>&1 | tee output.log
Check if node_modules directory exists
Try manual install: npm install

Issue: Server not responding

Symptom:

✗ Server is not running at http://localhost:5173
   Options:
   1. Start server manually: npm run dev
   2. Use --start-server flag to auto-start
   3. Specify different URL with --url

Fix:

# Let the tool start it for you
node runner.js smoke --url http://localhost:5173 --start-server

# Or start manually in another terminal
npm run dev
# Then in original terminal:
node runner.js smoke --url http://localhost:5173

Issue: Evaluation score is low despite test passing

Symptom:

Score: 0.62
Confidence: 0.75
Feedback:
  - [medium] Test efficiency could be improved

Diagnosis:

Check dimensions in evaluation.json
Low efficiency? → Too many interactions or high error rate
Low precision? → High false positive rate
Low recall? → Critical errors not detected

Fix:

Review optimizations array for specific suggestions
Adjust CLI arguments (e.g., --maxClicks 20 to reduce interaction count)
Update knowledge base to filter known false positives

Issue: Knowledge base not updating

Symptom:

Patterns learned: 0
Knowledge base updated: false

Diagnosis:

Check file permissions on test-knowledge.json
Verify --learn true is set
Check console for error messages

Fix:

# Verify file exists and is writable
ls -l .github/skills/default-browser-devtools/test-knowledge.json
chmod 644 .github/skills/default-browser-devtools/test-knowledge.json

# Run with explicit path
node runner.js smoke --url http://localhost:5173 \
  --knowledge-base /absolute/path/to/test-knowledge.json

Issue: High variance in consensus evaluation

Symptom:

Score: 0.75 (consensus of 3)
Variance: 0.08

Diagnosis:

Variance >0.05 indicates unstable evaluation
Possible causes:
- Non-deterministic test behavior
- Timing-dependent failures
- Evaluator logic ambiguity

Fix:

Add --readySelector or --readyText for more stable ready detection
Increase --postClickWaitMs for slower interactions
Review console logs for timing issues
Consider adversarial review (manual inspection)

Issue: Selector confidence degrading

Symptom:

{
  "optimized_selectors": {
    "confidence": 0.52
  }
}

Diagnosis:

Many failed interactions
Selector is too broad or includes non-interactive elements

Fix:

Review summary.json → interactions array for failed items

Refine --sweepSelector:

--sweepSelector "button:not([disabled]), a[href], [role='button'][tabindex]"

Confidence will automatically increase as success rate improves

Issue: Eval-only mode fails with missing artifacts

Symptom:

Error: No summary.json found at <path>

Diagnosis:

Artifacts directory structure doesn't match expected format
Test run failed before writing artifacts

Fix:

Verify directory structure:

artifacts/
  default-browser-devtools/
    <timestamp>/
      webkit/
        summary.json
      chromium/
        summary.json

Point to timestamp directory, not engine directory:

--artifacts ./artifacts/default-browser-devtools/20260215_103000

FAQ

How does learning persist across CI runs?

Knowledge base file must be committed:

# After local testing with learning
git add .github/skills/default-browser-dev-tools/test-knowledge.json
git commit -m "Update test knowledge base"
git push

In CI, mount knowledge base as artifact or use persistent volume:

# GitHub Actions example
- name: Run smoke test
  run: |
    node .github/skills/default-browser-devtools/runner.js smoke \
      --url http://localhost:5173 \
      --learn true

- name: Upload knowledge base
  uses: actions/upload-artifact@v3
  with:
    name: test-knowledge
    path: .github/skills/default-browser-dev-tools/test-knowledge.json

Alternative: Use external storage (S3, database) with custom --knowledge-base path.

When should I use consensus mode?

Use --eval-consensus N when:

Evaluating critical release candidates
High variance in historical scores
Calibrating evaluator after logic changes
Detecting evaluator bias or drift

Typical N values:

N=1: Default, fast
N=3: Standard consensus, good balance
N=5: High reliability, slower

Cost: N× evaluation time (but no additional test execution).

Can I customize the evaluation rubric?

Yes. Edit evaluator.js → WEIGHTS constant:

const WEIGHTS = {
  completeness: 0.30,  // Adjust as needed
  precision: 0.25,
  recall: 0.25,
  efficiency: 0.20,
};

Or add custom dimensions:

const customScore = this.scoreCustomDimension(summary);
const overallScore =
  completeness * 0.25 +
  precision * 0.20 +
  recall * 0.20 +
  efficiency * 0.15 +
  customScore * 0.20;

Recommendation: Keep weights documented in knowledge base:

{
  "rubric_config": {
    "completeness": 0.30,
    "precision": 0.25,
    "recall": 0.25,
    "efficiency": 0.20
  }
}

How do I disable learning for specific test runs?

Temporary disable:

--learn false

Permanent disable (CI):

export DISABLE_TEST_LEARNING=true

Then in runner.js:

if (process.env.DISABLE_TEST_LEARNING === "true") {
  args.learn = "false";
}

What's the difference between smoke, triage, and golden?

Command	Purpose	Speed	Coverage	When to use
smoke	Load + interaction sweep	Moderate	High	PR checks, nightly regression
triage	Console + network only	Fast	Low	CI failure debugging
golden	Deterministic feature path	Moderate	Focused	Feature acceptance, critical path validation
perf	Performance profiling	Slow	N/A	Performance investigation

Can I run evaluation without tests?

Yes. Use eval command or --eval-only true:

# Re-evaluate past run
node runner.js eval --artifacts ./artifacts/default-browser-devtools/20260215_103000

# Or with existing command
node runner.js smoke --url http://localhost:5173 --eval-only true --artifacts <path>

Use case:

Updated evaluator logic
Knowledge base refinement
Historical trend analysis
Debugging evaluation scores

How does the evaluator handle flaky tests?

Pattern detection:

Timing issues → Medium severity pattern
Network timeouts → Suggests retry logic
Hydration errors → High severity, indicates code issue

Confidence adjustment:

High variance across runs → Lower confidence
Inconsistent patterns → Suggests flakiness

Recommendation:

Use --eval-consensus 3 to detect flakiness
Review evaluation.json → consensus.variance
High variance? Add stabilization (timeouts, ready checks)

What happens if knowledge base is corrupted?

Auto-recovery:

Evaluator detects parse errors
Falls back to default knowledge base
Logs warning: Failed to parse knowledge base: <error>, using defaults

Manual fix:

# Backup corrupted file
mv test-knowledge.json test-knowledge.json.bak

# Regenerate from template
cp test-knowledge.json.template test-knowledge.json

# Or let evaluator create default
rm test-knowledge.json
node runner.js smoke --url http://localhost:5173 --learn true

Issue: Infinite loop detected in console

Symptom:

{
  "type": "infiniteLoop",
  "count": 247,
  "severity": "critical",
  "description": "Infinite loop detected: '[SLICE FETCH EFFECT] Triggered...' repeated 247 times",
  "avgIntervalMs": 15
}

Diagnosis:

Message repeating >20 times with <100ms intervals = infinite loop
Typically caused by reactive dependency cycles or missing guards in effects

Fix:

Check for missing guards in effects:

$effect(() => {
  if (!someCondition) return;  // Add guard
  // ... rest of effect
});

Verify state synchronization:
- Svelte 5: Use getter/setter pairs instead of capturing initial values
- React: Check useEffect dependencies

Look for circular dependencies:

// BAD: Circular update
$effect(() => {
  stateA = stateB;  // Updates stateA
});
$effect(() => {
  stateB = stateA;  // Updates stateB → triggers first effect → loop
});

// GOOD: Add guard or use derived
let stateB = $derived(stateA);  // One-way dependency

Add early return based on previous value:

let lastValue = '';
$effect(() => {
  if (value === lastValue) return;  // Skip if unchanged
  lastValue = value;
  // ... process value
});

Real-world example: Inspector toolbox fix

Problem: loadState.isMergedView captured initial value, stayed false
Solution: Used getters/setters for proper reactivity
Result: Guard check worked correctly, no infinite loop

Issue: Svelte reactivity warnings in dev console

Symptom:

{
  "type": "svelteReactivityWarning",
  "count": 17,
  "severity": "medium",
  "description": "Svelte 5 reactivity warning detected (state_referenced_locally)",
  "examples": [
    "This reference only captures the initial value of `isLoading`...",
    "This reference only captures the initial value of `isMergedView`..."
  ]
}

Diagnosis:

Using shorthand initialization captures initial values, not reactive references
Example: let obj = $state({ isLoading, headers }) ← BAD

Fix: Convert to getter/setter pairs:

// BAD: Captures initial values
let state = $state({
  isLoading,
  headers,
  isMergedView
});

// GOOD: Uses reactive getters/setters
let state = $state({
  get isLoading() { return isLoading; },
  set isLoading(v) { isLoading = v; },
  get headers() { return headers; },
  set headers(v) { headers = v; },
  get isMergedView() { return isMergedView; },
  set isMergedView(v) { isMergedView = v; }
});

Action items provided by evaluator:

Convert $state objects to use getter/setter pairs
Use $derived for computed values
Avoid capturing initial values in closures
Review https://svelte.dev/docs/svelte/$state

Issue: Reactive loop causing performance problems

Symptom:

{
  "type": "reactiveLoop",
  "count": 3,
  "severity": "critical",
  "description": "Reactive dependency loop or circular dependency detected"
}

Diagnosis:

Console shows messages about circular dependencies, infinite effects, or reactive loops
App becomes unresponsive due to constant re-rendering

Fix strategies provided by evaluator:

Identify the reactive chain:
- Use browser DevTools Performance tab
- Look for stack traces in console errors
- Add debug logging to track update sequence

Add conditional guards:

$effect(() => {
  if (updating) return;  // Guard against re-entry
  updating = true;
  // ... perform updates
  updating = false;
});

Use untrack() for non-reactive reads:

import { untrack } from 'svelte';

$effect(() => {
  const currentValue = untrack(() => someState);  // Read without tracking
  // ... use currentValue
});

Extract shared state:

// Instead of bidirectional updates between A ↔ B
// Use shared state: A → C ← B
let sharedState = $state({});
let derivedA = $derived(computeA(sharedState));
let derivedB = $derived(computeB(sharedState));

Integration with Agentic-Eval Framework

This skill implements core patterns from the Agentic-Eval Framework (.github/skills/ agentic-eval/SKILL.md):

Pattern: Evaluator Separation (Level 2)

Implementation:

runner.js = Generator (produces test results)
evaluator.js = Evaluator (scores + critiques)
Separate concerns ensure evaluator can evolve independently

Pattern: Multi-Judge Consensus (Level 3)

Implementation:

--eval-consensus 3

Logic:

const evaluations = [];
for (let i = 0; i < consensus; i++) {
  evaluations.push(evaluator.evaluate(engineOut));
}
const avgScore = evaluations.reduce((sum, e) => sum + e.overall_score, 0) / evaluations.length;

Variance interpretation:

Low variance (<0.01) → stable evaluation
High variance (>0.05) → unreliable, needs refinement

Pattern: Rubric-Based Evaluation

Implementation:

Four dimensions: completeness, precision, recall, efficiency
Weighted scoring: overall_score = Σ(dim × weight)
Documented rubric in evaluation output

Pattern: Knowledge Persistence (Level 4)

Implementation:

learned_patterns: Track occurrences over time
optimized_selectors: Refine based on success rates
threshold_calibration: Adjust baselines from historical data
failure_clusters: Identify systemic issues

Pattern: Confidence-Based Routing (Level 5)

Implementation:

calculateConfidence(): Score variance + data completeness
Confidence output in evaluation JSON
Future: Skip evaluation if confidence >0.95 and score >0.90

Pattern: Cost-Aware Learning

Current:

Learning can be disabled: --learn false
Evaluation is fast (no LLM calls, pure logic)

Future:

Skip evaluation for trivial tests (<50 lines of console output)
Cache repeated evaluations (hash of artifacts)

Advanced Examples

Example 1: CI Integration with Learning

GitHub Actions workflow:

name: Cross-Engine Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Node
        uses: actions/setup-node@v3
        with:
          node-version: 18
      
      - name: Install dependencies
        run: npm ci
      
      - name: Install Playwright
        run: npx playwright install --with-deps
      
      - name: Start dev server
        run: npm run dev &
        env:
          PORT: 5173
      
      - name: Wait for server
        run: npx wait-on http://localhost:5173
      
      - name: Run smoke tests
        run: |
          node .github/skills/default-browser-devtools/runner.js smoke \
            --url http://localhost:5173 \
            --engines webkit,chromium \
            --learn true \
            --knowledge-base .github/skills/default-browser-dev-tools/test-knowledge.json
      
      - name: Upload artifacts
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: test-results
          path: artifacts/
      
      - name: Commit knowledge base updates
        if: github.ref == 'refs/heads/main'
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add .github/skills/default-browser-dev-tools/test-knowledge.json
          git diff --quiet || git commit -m "Update test knowledge base [skip ci]"
          git push

Example 2: Local Development Loop

#!/bin/bash
# dev-test-loop.sh

# Start dev server in background
npm run dev &
DEV_PID=$!

# Wait for server
sleep 3

# Run smoke test with learning
node .github/skills/default-browser-devtools/runner.js smoke \
  --url http://localhost:5173 \
  --engines webkit,chromium \
  --learn true \
  --headless false

# Kill dev server
kill $DEV_PID

# Show evaluation summary
cat artifacts/default-browser-devtools/*/webkit/evaluation.json | jq '.overall_score, .feedback'

Example 3: Historical Trend Analysis

# Evaluate multiple past runs
for dir in artifacts/default-browser-devtools/*/; do
  echo "Evaluating $dir"
  node .github/skills/default-browser-devtools/runner.js eval \
    --artifacts "$dir" \
    --learn false
done | jq -s '.[] | {timestamp: .test_run_id, score: .overall_score}'

Output:

[
  {"timestamp": "20260210_100000", "score": 0.78},
  {"timestamp": "20260211_100000", "score": 0.82},
  {"timestamp": "20260212_100000", "score": 0.85},
  {"timestamp": "20260215_100000", "score": 0.87}
]

Analysis: Score improving over time → learning is effective.

Example 4: Selective Learning (Per-Branch)

# main branch: aggressive learning
if [ "$BRANCH" = "main" ]; then
  LEARN=true
  CONSENSUS=3
else
  # feature branches: basic learning
  LEARN=true
  CONSENSUS=1
fi

node runner.js smoke --url http://localhost:5173 \
  --learn $LEARN \
  --eval-consensus $CONSENSUS

References

Agentic-Eval Framework: .github/skills/ agentic-eval/SKILL.md
Playwright Docs: https://playwright.dev/docs/intro
WebKit vs Blink: https://en.wikipedia.org/wiki/Comparison_of_browser_engines

License

MIT