name: default-browser-devtools
description: 'Cross-engine automation with self-adaptive learning. Validates UI, captures evidence, learns from failures. WebKit/Blink testing, deterministic workflows, agentic-eval integration.'
license: MIT
Default Browser Engine DevTools Agent (Self-Adaptive Testing)
Overview
This skill provides cross-engine testing with continuous learning for self-contained Svelte apps. It combines deterministic test workflows with an agentic-eval framework that learns from test execution results, accumulating knowledge over time to improve test quality and reduce false positives.
Key capabilities:
- Engine vs Browser: Targets real compatibility boundaries: Blink vs WebKit
- Self-adaptive: Learns patterns, optimizes selectors, calibrates thresholds
- Evidence-based: Captures screenshots, console logs, network failures, accessibility snapshots
- Evaluation-driven: Scores tests on completeness, precision, recall, efficiency
- Knowledge persistence: Accumulates learning across CI runs via knowledge base
Why this approach:
- Testing both rendering engines (Blink and WebKit) catches most engine-level compatibility issues on the platforms listed under Engine Targets
- Agentic-eval integration enables continuous quality improvement without manual tuning
- Knowledge base reduces false positives and noise over time
- Multi-dimensional scoring provides actionable feedback
Architecture
Self-Adaptive Learning Loop
Test Execution → Evidence Capture → Evaluation → Pattern Detection → Knowledge Update → Next Run
↓
Optimization Suggestions
Components:
- Runner (runner.js): Executes tests via Playwright, captures artifacts
- Evaluator (evaluator.js): Scores test quality, detects patterns, suggests optimizations
- Knowledge Base (test-knowledge.json): Persists learned patterns, optimized selectors, thresholds
Learning mechanisms:
- Pattern occurrence tracking (ResizeObserver loops, hydration errors, network timeouts, infinite loops, Svelte reactivity warnings)
- Selector confidence calibration based on interaction success rates
- Failure clustering to identify common issue signatures
- Threshold adjustment based on historical baseline scores
- Infinite loop detection via repeated console message analysis
- Reactive dependency cycle detection for Svelte/React frameworks
Enhanced Pattern Detection (v1.1)
The evaluator has been enhanced to detect and provide actionable feedback for additional patterns learned from real-world debugging:
Critical Patterns:
- Infinite Loops: Detects repeated console messages in rapid succession (>20 occurrences, <100ms intervals)
  - Example: [SLICE FETCH EFFECT] Triggered repeating infinitely
  - Suggests checking for missing guards in reactive effects, circular dependencies
- Reactive Dependency Cycles: Identifies circular reactive chains
  - Example: State A updates → triggers effect → updates State B → triggers effect → updates State A
  - Provides actionable items for breaking the cycle with guards or untrack()
Code Quality Patterns:
- Svelte 5 Reactivity Warnings: Captures state_referenced_locally and similar compiler warnings
  - Suggests using getter/setter pairs for proper reactivity
  - Links to Svelte documentation for best practices
- Hydration Mismatches: Critical for SSR/Svelte apps
  - Indicates SSR/CSR content differs
  - High severity as it can cause UI inconsistencies
Performance Patterns:
- ResizeObserver Loops: Low severity, common false positive
- Network Timeouts: Suggests retry logic or timeout adjustments
- Timing Issues: Flags potential race conditions in async code
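The infinite-loop heuristic above (more than 20 occurrences of the same message, less than 100 ms apart) can be sketched as a small function. This is a minimal illustration, not the evaluator's actual code: the function name `detectInfiniteLoops` and the entry shape `{ text, timestampMs }` are assumptions, and the interval check uses the average gap between repeats.

```javascript
// Sketch only: entry shape ({ text, timestampMs }) and function name are hypothetical.
// Flags a message as a likely infinite loop when it repeats more than `minCount`
// times and the average gap between repeats is below `maxAvgIntervalMs`.
function detectInfiniteLoops(entries, { minCount = 20, maxAvgIntervalMs = 100 } = {}) {
  // Group timestamps by message text.
  const byText = new Map();
  for (const { text, timestampMs } of entries) {
    if (!byText.has(text)) byText.set(text, []);
    byText.get(text).push(timestampMs);
  }

  const findings = [];
  for (const [text, times] of byText) {
    if (times.length <= minCount) continue;
    times.sort((a, b) => a - b);
    // Average interval = total elapsed time / number of gaps.
    const avgIntervalMs = (times[times.length - 1] - times[0]) / (times.length - 1);
    if (avgIntervalMs < maxAvgIntervalMs) {
      findings.push({
        type: "infiniteLoop",
        severity: "critical",
        count: times.length,
        avgIntervalMs,
        text,
      });
    }
  }
  return findings;
}
```

Grouping by exact message text keeps the sketch simple; messages that embed counters or timestamps would need normalizing before grouping.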
Engine Targets
| Platform target | Practical engine family to test | Playwright engine |
|---|---|---|
| iOS / macOS | WebKit (Safari engine; iOS requires WebKit) | webkit |
| Windows / ChromeOS | Blink (Chromium/Edge family) | chromium |
| Linux | varies by distro/toolkit; often WebKitGTK; sometimes Chromium | webkit (default), optional chromium |
Linux note: there is no single "default engine" across distros. This skill treats Linux as WebKit-first (WebKitGTK-like environments), but supports Chromium if that matches your target.
Prerequisites
- Node.js 18+
Automatic Setup (New!)
The browser-dev-tools runner now automatically handles all setup on first run:
✅ Auto-detects missing dependencies and installs them
✅ Auto-installs Playwright browsers if needed
✅ Provides helpful error messages if server isn't running
✅ Optionally auto-starts dev server with --start-server flag
No manual setup required! Just run any command and the tool will ensure everything is ready.
Manual Setup (Optional)
If you prefer to install dependencies manually:
npm install
npx playwright install
Quick Start
Option 1: Automatic (Recommended)
# Tool automatically installs dependencies on first run
node .github/skills/default-browser-devtools/runner.js smoke --url http://localhost:5173
Option 2: With Auto-Start Server
# Tool starts dev server for you
node .github/skills/default-browser-devtools/runner.js smoke \
--url http://localhost:5173 \
--start-server
Option 3: Skip Auto-Install
# Skip dependency checks (if you've already installed manually)
node .github/skills/default-browser-devtools/runner.js smoke \
--url http://localhost:5173 \
--skip-install
Runner CLI Reference
Commands
smoke
Load + conservative click sweep + artifacts + evaluation
Purpose: Verify basic load, hydration, and interaction correctness across engines.
Artifacts:
- baseline.png, post_interaction.png
- snapshot_baseline.json, snapshot_post.json
- console.json, network_failures.json
- summary.json, evaluation.json
Example:
node .github/skills/default-browser-devtools/runner.js smoke \
--url http://localhost:5173 \
--engines webkit,chromium \
--learn true
triage
Console + network failures + screenshot + evaluation
Purpose: Fastest signal for "what broke?" debugging.
Artifacts:
- baseline.png
- console.json, network_failures.json
- summary.json, evaluation.json
Example:
node .github/skills/default-browser-devtools/runner.js triage \
--url http://localhost:5173 \
--engine webkit
golden
Deterministic feature sequence + evaluation
Purpose: "Don't lose features" regression detection.
Edit steps array in runner.js to match your app's invariants:
const steps = [
{ type: "clickText", value: "Surface" },
{ type: "waitText", value: "Surface Toolbox" },
{ type: "clickSelector", value: "#export-btn" },
{ type: "fillSelector", selector: "#filename", value: "test.csv" },
{ type: "press", value: "Enter" },
];
Artifacts:
- golden_end.png
- snapshot_golden_end.json
- console.json, network_failures.json
- summary.json, evaluation.json
Example:
node .github/skills/default-browser-devtools/runner.js golden \
--url http://localhost:5173 \
--engines webkit,chromium
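The step types used in the golden steps array could be dispatched to Playwright roughly as follows. This is a sketch under assumptions, not the code in runner.js: the function name `runSteps` is hypothetical, and mapping each step type to these particular Playwright calls (`page.getByText`, `page.click`, `page.fill`, `page.keyboard.press`) is one plausible choice.

```javascript
// Sketch only: `runSteps` is a hypothetical helper. `page` is expected to be a
// Playwright Page (or any object exposing the same methods).
async function runSteps(page, steps) {
  for (const step of steps) {
    switch (step.type) {
      case "clickText":
        // Click the first element containing the given text.
        await page.getByText(step.value).first().click();
        break;
      case "waitText":
        // Wait until the first element containing the text is visible.
        await page.getByText(step.value).first().waitFor();
        break;
      case "clickSelector":
        await page.click(step.value);
        break;
      case "fillSelector":
        await page.fill(step.selector, step.value);
        break;
      case "press":
        await page.keyboard.press(step.value);
        break;
      default:
        // Fail loudly so a typo in a step type does not silently skip a check.
        throw new Error(`Unknown step type: ${step.type}`);
    }
  }
}
```

Throwing on an unknown step type keeps the sequence deterministic: a misspelled step fails the run instead of producing a passing but incomplete golden path.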
perf
Performance trace + analysis
Purpose: Capture trace.zip for performance profiling.
Artifacts:
- trace.zip (open in Playwright trace viewer or Chrome DevTools)
Example:
node .github/skills/default-browser-devtools/runner.js perf \
--url http://localhost:5173 \
--engine chromium
Note: Performance mode skips evaluation phase (not applicable to perf traces).
eval
Evaluate existing test artifacts (no test execution)
Purpose: Re-evaluate past test runs with updated knowledge base or evaluator logic.
Example:
node .github/skills/default-browser-devtools/runner.js eval \
--artifacts ./artifacts/default-browser-devtools/20260215_103000 \
--learn true
CLI Arguments
Test Configuration
| Argument | Type | Default | Description |
|---|---|---|---|
| --url | string | http://localhost:5173 | Target URL to test |
| --engine | string | - | Single engine: webkit or chromium |
| --engines | string | webkit,chromium | Comma-separated list of engines |
| --headless | boolean | true | Run browser in headless mode |
| --readySelector | string | - | Selector to wait for before proceeding |
| --readyText | string | - | Text to wait for before proceeding |
| --timeoutMs | number | 30000 | Timeout for ready checks (ms) |
| --viewportWidth | number | 1280 | Viewport width (px) |
| --viewportHeight | number | 720 | Viewport height (px) |
Setup & Server Options (New!)
| Argument | Type | Default | Description |
|---|---|---|---|
| --skip-install | boolean | false | Skip automatic npm and Playwright installation |
| --start-server | boolean | false | Automatically start dev server if not running |
| --server-command | string | npm run dev | Command to start server |
| --server-wait-ms | number | 10000 | Time to wait for server startup (ms) |
Evaluation & Learning
| Argument | Type | Default | Description |
|---|---|---|---|
| --learn | boolean | true | Enable evaluation & learning phase |
| --knowledge-base | string | ./test-knowledge.json | Path to knowledge base JSON |
| --eval-only | boolean | false | Only evaluate, don't run tests |
| --eval-consensus | number | 1 | Run evaluation N times for consensus scoring |
| --artifacts | string | - | Path to artifacts directory (for eval-only mode) |
Smoke Test Configuration
| Argument | Type | Default | Description |
|---|---|---|---|
| --sweepSelector | string | button, a, [role='button'], summary | Selector for click sweep |
| --maxClicks | number | 30 | Max interactions in click sweep |
| --postClickWaitMs | number | 120 | Wait time after each click (ms) |
Performance & Debugging
| Argument | Type | Default | Description |
|---|---|---|---|
| --perfWindowMs | number | 1500 | Performance capture window (ms) |
| --eval | string | - | JavaScript expression to evaluate on page |
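To make the flag conventions in these tables concrete, here is a minimal sketch of how `--key value` pairs and bare boolean flags such as `--start-server` might be parsed. Both helpers (`parseArgs`, `resolveEngines`) are hypothetical, values are kept as strings, and the rule that `--engine` takes precedence over `--engines` is an assumption rather than documented behavior.

```javascript
// Sketch only: hypothetical helpers, not the parser in runner.js.
// Parses `--key value` pairs; a flag with no following value is stored as "true".
// Positional tokens such as the command name ("smoke") are skipped.
function parseArgs(argv) {
  const args = {};
  for (let i = 0; i < argv.length; i++) {
    const token = argv[i];
    if (!token.startsWith("--")) continue;
    const key = token.slice(2);
    const next = argv[i + 1];
    if (next === undefined || next.startsWith("--")) {
      args[key] = "true"; // bare flag, e.g. --start-server
    } else {
      args[key] = next;
      i++; // consume the value
    }
  }
  return args;
}

// Assumption: a single --engine wins over --engines; otherwise use the
// documented default list "webkit,chromium".
function resolveEngines(args) {
  if (args.engine) return [args.engine];
  return (args.engines ?? "webkit,chromium")
    .split(",")
    .map((name) => name.trim())
    .filter(Boolean);
}
```

For example, `parseArgs(process.argv.slice(2))` followed by `resolveEngines(args)` would yield the list of engines to launch.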
Output Structure
artifacts/default-browser-devtools/<timestamp>/
├── webkit/
│ ├── baseline.png
│ ├── post_interaction.png
│ ├── snapshot_baseline.json
│ ├── snapshot_post.json
│ ├── console.json
│ ├── network_failures.json
│ ├── summary.json
│ └── evaluation.json ← Agentic-eval output
├── chromium/
│ └── ... (same structure)
├── combined_summary.json
└── combined_evaluation.json ← When using eval command
Evaluation & Learning System
Evaluation Rubric
The evaluator scores each test run across four dimensions:
| Dimension | Weight | Description |
|---|---|---|
| Completeness | 30% | All expected artifacts captured? |
| Precision | 25% | Low false positive rate? |
| Recall | 25% | Critical failures detected? |
| Efficiency | 20% | Good cost vs value ratio? |
Overall Score = Σ(dimension_score × weight)
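As a worked example of the weighted sum, the sketch below applies the rubric weights from the table to a set of sample dimension scores. The helper name `overallScore` is hypothetical.

```javascript
// Sketch only: `overallScore` is a hypothetical helper applying the rubric weights.
const WEIGHTS = { completeness: 0.30, precision: 0.25, recall: 0.25, efficiency: 0.20 };

function overallScore(dimensions, weights = WEIGHTS) {
  // Sum of dimension_score × weight over all rubric dimensions.
  return Object.entries(weights).reduce(
    (sum, [name, weight]) => sum + dimensions[name] * weight,
    0
  );
}

// 0.95×0.30 + 0.82×0.25 + 0.90×0.25 + 0.78×0.20 = 0.871
const score = overallScore({ completeness: 0.95, precision: 0.82, recall: 0.90, efficiency: 0.78 });
console.log(score.toFixed(2)); // prints "0.87"
```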
Scoring Dimensions
Completeness (30%)
Measures whether all expected evidence was captured.
Checks:
- Baseline screenshot present? (+15%)
- Console logs captured? (+15%)
- Network failures logged? (+15%)
- Summary file written? (+10%)
- Command-specific artifacts? (+15-25%)
- Smoke: post-interaction screenshot, post-snapshot
- Golden: golden-end screenshot, golden-end snapshot
- Perf: trace.zip
- Interaction data (smoke)? (+10%)
- Ready signal captured? (+10%)
Score interpretation:
- 0.90-1.0: Excellent, all evidence captured
- 0.70-0.89: Good, minor gaps
- <0.70: Incomplete, missing critical artifacts
Precision (25%)
Measures false positive rate (noise vs signal).
Factors:
- ResizeObserver loop count (common false positive, -2% per occurrence)
- Excessive warnings (>20 warnings, -10%)
- Dev asset failures (localhost/127.0.0.1, -5% per failure)
Score interpretation:
- 0.90-1.0: Excellent, very low noise
- 0.70-0.89: Acceptable, some false positives
- <0.70: High noise, needs filtering
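The precision penalties listed under Factors could be combined as in the sketch below. It assumes the score starts at 1.0 and is clamped to the 0-1 range; the function name `scorePrecision` and the input field names are hypothetical.

```javascript
// Sketch only: hypothetical helper; assumes a starting score of 1.0 and
// clamping to [0, 1], neither of which is stated in the rubric.
function scorePrecision({ resizeObserverLoops = 0, warnings = 0, devAssetFailures = 0 } = {}) {
  let score = 1.0;
  score -= 0.02 * resizeObserverLoops; // -2% per ResizeObserver loop occurrence
  if (warnings > 20) score -= 0.10;    // -10% for excessive warnings
  score -= 0.05 * devAssetFailures;    // -5% per localhost/127.0.0.1 asset failure
  return Math.max(0, Math.min(1, score));
}
```

With 5 ResizeObserver loops, 25 warnings, and 1 dev asset failure, this gives 1.0 - 0.10 - 0.10 - 0.05 = 0.75, which falls in the "Acceptable" band.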
Recall (25%)
Measures whether critical failures are detected.
Checks:
- Page errors detected? (if test passed despite errors, -30%)
- Console errors caught? (if test passed despite errors, -30%)
- Hydration issues detected? (critical for Svelte, -20% if missed)
Score interpretation:
- 0.90-1.0: Excellent, all critical issues caught
- 0.70-0.89: Good, minor blind spots
- <0.70: Poor recall, missing failures
Efficiency (20%)
Measures cost vs value (are we wasting effort?).
Factors:
- Interaction count (smoke test):
  - <5 interactions: -20% (too few, low coverage)
  - 10-30 interactions: optimal
  - >50 interactions: -30% (diminishing returns)
- Error rate:
  - >10 errors: -20% (noisy test, wasted effort)
Score interpretation:
- 0.90-1.0: Excellent efficiency
- 0.60-0.89: Acceptable
- <0.60: Inefficient, needs optimization
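The efficiency factors can be sketched in the same style. This version reads the upper thresholds as "more than 50 interactions" and "more than 10 errors", starts from 1.0, and applies no penalty in the ranges the rubric does not mention; all of these are assumptions, as is the function name `scoreEfficiency`.

```javascript
// Sketch only: hypothetical helper. Thresholds are interpreted as
// "fewer than 5", "more than 50", and "more than 10"; unlisted ranges
// (for example 5-9 interactions) are left unpenalized.
function scoreEfficiency({ interactions = null, errors = 0 } = {}) {
  let score = 1.0;
  if (interactions !== null) {
    if (interactions < 5) score -= 0.20;       // too few, low coverage
    else if (interactions > 50) score -= 0.30; // diminishing returns
  }
  if (errors > 10) score -= 0.20;              // noisy test, wasted effort
  return Math.max(0, score);
}
```

Passing `interactions: null` (the default) skips the interaction check, which suits commands other than smoke that do not perform a click sweep.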
Pattern Detection
The evaluator recognizes common failure patterns:
| Pattern | Severity | Description | Evaluator Uses |
|---|---|---|---|
| resizeObserverLoop | Low | ResizeObserver loop detected | Precision adjustment, filtering suggestion |
| hydrationError | High | Svelte hydration mismatch | Critical issue flag, code quality optimization |
| networkTimeout | Medium | Network request timeout | Reliability optimization, timeout tuning |
| timingIssue | Medium | Timing/async/race condition | Code quality flag, timing optimization |
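One way to map console messages onto the pattern types in this table is a list of matchers checked in order. The sketch below is illustrative only: the regular expressions are guesses at typical message wording, not the evaluator's actual rules, and `detectPatterns` is a hypothetical name.

```javascript
// Sketch only: the regexes are illustrative assumptions about message wording.
const PATTERNS = [
  { type: "resizeObserverLoop", severity: "low", test: /ResizeObserver loop/i },
  { type: "hydrationError", severity: "high", test: /hydrat/i },
  { type: "networkTimeout", severity: "medium", test: /timed? ?out|ETIMEDOUT/i },
  { type: "timingIssue", severity: "medium", test: /race condition|async/i },
];

// Counts how many messages match each pattern; the first matching pattern wins.
function detectPatterns(messages) {
  const counts = new Map();
  for (const message of messages) {
    const match = PATTERNS.find((pattern) => pattern.test.test(message));
    if (!match) continue;
    const entry = counts.get(match.type) ?? { type: match.type, severity: match.severity, count: 0 };
    entry.count++;
    counts.set(match.type, entry);
  }
  return [...counts.values()];
}
```

Because the first match wins, the order of `PATTERNS` decides how a message that fits several categories is classified.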
Optimization Suggestions
Based on detected patterns and scores, the evaluator suggests improvements:
Example optimizations:
{
"optimizations": [
{
"category": "filtering",
"priority": "low",
"suggestion": "Add ResizeObserver loop filtering to reduce noise",
"rationale": "5 occurrences detected"
},
{
"category": "code-quality",
"priority": "high",
"suggestion": "Fix Svelte hydration mismatch (SSR/CSR content differs)",
"rationale": "Hydration errors can cause UI inconsistencies"
},
{
"category": "efficiency",
"priority": "medium",
"suggestion": "Reduce maxClicks or refine sweepSelector to focus on critical interactions",
"rationale": "42 interactions is high; diminishing returns"
}
]
}
Knowledge Base Schema
Structure (test-knowledge.json):
{
"version": "1.0",
"learned_patterns": [
{
"type": "resizeObserverLoop",
"severity": "low",
"description": "ResizeObserver loop detected (common false positive)",
"occurrences": 12,
"first_seen": "2026-02-15T10:30:00.000Z",
"last_seen": "2026-02-15T14:20:00.000Z"
}
],
"optimized_selectors": {
"interactive_elements": "button, a, [role='button'], summary, [onclick]",
"confidence": 0.87
},
"threshold_calibration": {
"smoke_test": {
"baseline_score": 0.85,
"confidence_threshold": 0.80,
"adjusted_from": [
{ "date": "2026-02-10", "old": 0.80, "new": 0.85, "reason": "reduced false positives" }
]
}
},
"failure_clusters": [
{
"pattern_signature": "hydrationError+networkTimeout",
"patterns": ["hydrationError", "networkTimeout"],
"occurrences": 3,
"first_seen": "2026-02-14T09:00:00.000Z",
"last_seen": "2026-02-15T12:00:00.000Z"
}
]
}
Fields:
- learned_patterns: Tracked pattern occurrences with frequency and timestamps
- optimized_selectors: Selector strings refined based on interaction success rates
  - confidence: 0-1 score; used when ≥0.75
- threshold_calibration: Baseline scores and confidence thresholds for each workflow
- failure_clusters: Common failure combinations (helps identify systemic issues)
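The sketch below shows how these fields might be maintained and consumed, using the field names from the schema above. The helper names (`recordPattern`, `pickSweepSelector`, `clusterSignature`) are hypothetical, and building a cluster signature by sorting and joining pattern types with "+" is an assumption inferred from the sample value "hydrationError+networkTimeout".

```javascript
// Sketch only: hypothetical helpers operating on the knowledge base schema.

// Increment an existing pattern or add a new one. Returns true when the
// pattern type was not seen before (i.e. something new was learned).
function recordPattern(kb, pattern, now = new Date().toISOString()) {
  const existing = kb.learned_patterns.find((p) => p.type === pattern.type);
  if (existing) {
    existing.occurrences += 1;
    existing.last_seen = now;
    return false;
  }
  kb.learned_patterns.push({ ...pattern, occurrences: 1, first_seen: now, last_seen: now });
  return true;
}

// Use the learned selector only when its confidence is at least 0.75.
function pickSweepSelector(kb, fallback, minConfidence = 0.75) {
  const learned = kb.optimized_selectors;
  return learned && learned.confidence >= minConfidence ? learned.interactive_elements : fallback;
}

// Assumption: signatures are the de-duplicated, sorted pattern types joined by "+".
function clusterSignature(patternTypes) {
  return [...new Set(patternTypes)].sort().join("+");
}
```

Sorting before joining means the same combination of patterns always produces the same signature, regardless of detection order.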
Evaluation Output Schema
File (evaluation.json):
{
"test_run_id": "webkit",
"overall_score": 0.87,
"dimensions": {
"completeness": 0.95,
"precision": 0.82,
"recall": 0.90,
"efficiency": 0.78
},
"confidence": 0.89,
"feedback": [
{
"dimension": "precision",
"severity": "low",
"message": "High false positive rate detected",
"action": "Consider filtering known benign issues (e.g., ResizeObserver loops)"
}
],
"optimizations": [
{
"category": "filtering",
"priority": "low",
"suggestion": "Add ResizeObserver loop filtering to reduce noise",
"rationale": "5 occurrences detected"
}
],
"patterns_detected": 2,
"patterns_learned": 1,
"knowledge_base_updated": true,
"consensus": {
"runs": 3,
"average_score": 0.86,
"average_confidence": 0.88,
"variance": 0.0002
}
}
Confidence score:
- High variance in dimension scores → lower confidence
- Missing data (e.g., null artifacts) → reduced confidence
- Typical range: 0.75-0.95
Consensus mode (--eval-consensus N):
- Runs evaluation N times
- Computes average scores and variance
- Low variance → stable evaluation
- High variance → consider adversarial review or refinement
Workflows
Workflow 1 — Smoke Test with Learning (Simplified!)
Goal: Verify load + hydration + basic interactions, learn from results.
New simplified usage (auto-installs everything on first run):
# That's it! No setup required - tool handles everything
node .github/skills/default-browser-devtools/runner.js smoke \
--url http://localhost:5173 \
--engines webkit,chromium
With auto-start server:
node .github/skills/default-browser-devtools/runner.js smoke \
--url http://localhost:5173 \
--engines webkit,chromium \
--start-server
Process:
- ✨ Auto-check dependencies (installs if missing)
- ✨ Auto-install Playwright browsers (if needed)
- ✨ Check server availability (or start with --start-server)
- Launch webkit and chromium browsers
- Navigate to URL, wait for networkidle
- Capture baseline screenshot + accessibility snapshot
- Perform conservative click sweep (up to 30 interactions)
- Capture post-interaction screenshot + snapshot
- Log console messages + network failures
- Evaluate test quality (completeness, precision, recall, efficiency)
- Detect patterns (ResizeObserver, hydration, timing issues)
- Suggest optimizations
- Update knowledge base with learned patterns
Output:
{
"ok": true,
"outRoot": "./artifacts/default-browser-devtools/20260215_143000",
"engines": ["webkit", "chromium"],
"results": [...]
}
--- Evaluation & Learning Phase ---
Evaluating webkit test run...
Score: 0.87
Confidence: 0.89
Patterns detected: 2
Patterns learned: 1
Optimizations suggested: 1
- [low] Add ResizeObserver loop filtering to reduce noise
Evaluating chromium test run...
Score: 0.92
Confidence: 0.91
Patterns detected: 1
Patterns learned: 0
✓ Knowledge base updated
Workflow 2 — Triage (Fast Debugging)
Goal: Fastest path to "what broke?"
node .github/skills/default-browser-devtools/runner.js triage \
--url http://localhost:5173 \
--engine webkit \
--learn true
Process:
- Launch browser
- Navigate to URL
- Capture console logs + network failures + screenshot
- Evaluate for critical issues
- Update knowledge base
Use case: CI failure, need quick signal.
Workflow 3 — Golden Path (Regression Prevention)
Goal: Ensure core features work across engines.
Setup:
Edit steps array in runner.js:
const steps = [
{ type: "clickText", value: "Bushing" },
{ type: "waitText", value: "Bushing Toolbox" },
{ type: "fillSelector", selector: "#bore-diameter", value: "0.5" },
{ type: "clickText", value: "Compute" },
{ type: "waitText", value: "Stress Analysis" },
];
Run:
node .github/skills/default-browser-devtools/runner.js golden \
--url http://localhost:5173 \
--engines webkit,chromium \
--learn true
Use case: Nightly regression suite, feature acceptance testing.
Workflow 4 — Eval-Only Mode
Goal: Re-evaluate past test runs without re-running tests.
node .github/skills/default-browser-devtools/runner.js eval \
--artifacts ./artifacts/default-browser-devtools/20260215_103000 \
--learn true
Use case:
- Updated evaluator logic
- Knowledge base refinement
- Historical trend analysis
Workflow 5 — Consensus Evaluation
Goal: Increase evaluation reliability via ensemble scoring.
node .github/skills/default-browser-devtools/runner.js smoke \
--url http://localhost:5173 \
--engine webkit \
--eval-consensus 3
Output:
Evaluating webkit test run...
Score: 0.86 (consensus of 3)
Confidence: 0.88
When to use:
- High variance in scores
- Critical release validation
- Evaluator calibration
Workflow 6 — Backward Compatible (Learning Disabled)
Goal: Run tests without evaluation phase.
node .github/skills/default-browser-devtools/runner.js smoke \
--url http://localhost:5173 \
--engines webkit,chromium \
--learn false
Use case:
- CI environments without persistent storage
- Quick local checks
- Debugging evaluator issues
Maturity Levels
This skill implements a Level 4 agentic-eval architecture:
| Level | Capability | Status |
|---|---|---|
| Level 1 | Basic reflection | ✅ Implemented |
| Level 2 | Evaluator separation | ✅ Implemented (separate evaluator.js) |
| Level 3 | Adversarial & ensemble | ✅ Implemented (consensus mode) |
| Level 4 | Benchmark-driven | ✅ Implemented (knowledge base baselines) |
| Level 5 | Confidence-calibrated & cost-aware | 🔄 Partial (confidence scores, no cost routing yet) |
Roadmap to Level 5:
- Add cost-aware routing (skip evaluation for trivial tests)
- Confidence-based early stopping
- Evaluator model selection (small vs large)
Cross-Platform Portability Checklist
When testing Svelte apps for cross-platform compatibility:
- Case-sensitive assets/imports (Linux/CI)
- Path separator assumptions (Windows vs Unix)
- Keyboard modifiers (Cmd vs Ctrl)
- Font/layout dependencies (WebKit text rendering differs)
- ResizeObserver / measurement loops (engine-specific)
- Scroll container behavior (iOS WebKit scroll anchoring)
- File input parsing + error UI (WebKit file dialog differences)
- CSS containment (Blink vs WebKit implementation gaps)
- Intersection observer thresholds (rounding differences)
Troubleshooting
Issue: First time setup or missing dependencies
New in v2.0: The runner now automatically handles setup! No manual intervention needed.
What happens automatically:
- ✅ Checks for node_modules and runs npm install if missing
- ✅ Checks for Playwright browsers and installs them if missing
- ✅ Provides clear progress messages during installation
- ✅ Verifies server is running before tests start
If you want to skip automatic installation:
node runner.js smoke --url http://localhost:5173 --skip-install
If server isn't running:
# Option 1: Auto-start server
node runner.js smoke --url http://localhost:5173 --start-server
# Option 2: Custom server command
node runner.js smoke --url http://localhost:5173 \
--start-server \
--server-command "npm run dev" \
--server-wait-ms 15000
Manual setup (if preferred):
npm install
npx playwright install
npm run dev &
node runner.js smoke --url http://localhost:5173
Issue: "Cannot find package 'playwright'" error
This should no longer happen! The runner now automatically installs Playwright before importing it.
If you still see this error:
- Make sure you're using the updated runner (check git log)
- Try running with verbose output: node runner.js smoke --url http://localhost:5173 2>&1 | tee output.log
- Check if node_modules directory exists
- Try manual install: npm install
Issue: Server not responding
Symptom:
✗ Server is not running at http://localhost:5173
Options:
1. Start server manually: npm run dev
2. Use --start-server flag to auto-start
3. Specify different URL with --url
Fix:
# Let the tool start it for you
node runner.js smoke --url http://localhost:5173 --start-server
# Or start manually in another terminal
npm run dev
# Then in original terminal:
node runner.js smoke --url http://localhost:5173
Issue: Evaluation score is low despite test passing
Symptom:
Score: 0.62
Confidence: 0.75
Feedback:
- [medium] Test efficiency could be improved
Diagnosis:
- Check dimensions in evaluation.json
- Low efficiency? → Too many interactions or high error rate
- Low precision? → High false positive rate
- Low recall? → Critical errors not detected
Fix:
- Review optimizations array for specific suggestions
- Adjust CLI arguments (e.g., --maxClicks 20 to reduce interaction count)
- Update knowledge base to filter known false positives
Issue: Knowledge base not updating
Symptom:
Patterns learned: 0
Knowledge base updated: false
Diagnosis:
- Check file permissions on test-knowledge.json
- Verify --learn true is set
- Check console for error messages
Fix:
# Verify file exists and is writable
ls -l .github/skills/default-browser-devtools/test-knowledge.json
chmod 644 .github/skills/default-browser-devtools/test-knowledge.json
# Run with explicit path
node runner.js smoke --url http://localhost:5173 \
--knowledge-base /absolute/path/to/test-knowledge.json
Issue: High variance in consensus evaluation
Symptom:
Score: 0.75 (consensus of 3)
Variance: 0.08
Diagnosis:
- Variance >0.05 indicates unstable evaluation
- Possible causes:
- Non-deterministic test behavior
- Timing-dependent failures
- Evaluator logic ambiguity
Fix:
- Add --readySelector or --readyText for more stable ready detection
- Increase --postClickWaitMs for slower interactions
- Review console logs for timing issues
- Consider adversarial review (manual inspection)
Issue: Selector confidence degrading
Symptom:
{
"optimized_selectors": {
"confidence": 0.52
}
}
Diagnosis:
- Many failed interactions
- Selector is too broad or includes non-interactive elements
Fix:
- Review summary.json → interactions array for failed items
- Refine --sweepSelector: --sweepSelector "button:not([disabled]), a[href], [role='button'][tabindex]"
- Confidence will automatically increase as success rate improves
Issue: Eval-only mode fails with missing artifacts
Symptom:
Error: No summary.json found at <path>
Diagnosis:
- Artifacts directory structure doesn't match expected format
- Test run failed before writing artifacts
Fix:
- Verify directory structure:
artifacts/
  default-browser-devtools/
    <timestamp>/
      webkit/
        summary.json
      chromium/
        summary.json
- Point to timestamp directory, not engine directory: --artifacts ./artifacts/default-browser-devtools/20260215_103000
FAQ
How does learning persist across CI runs?
Knowledge base file must be committed:
# After local testing with learning
git add .github/skills/default-browser-devtools/test-knowledge.json
git commit -m "Update test knowledge base"
git push
In CI, mount knowledge base as artifact or use persistent volume:
# GitHub Actions example
- name: Run smoke test
  run: |
    node .github/skills/default-browser-devtools/runner.js smoke \
      --url http://localhost:5173 \
      --learn true
- name: Upload knowledge base
  uses: actions/upload-artifact@v4
  with:
    name: test-knowledge
    path: .github/skills/default-browser-devtools/test-knowledge.json
Alternative: Use external storage (S3, database) with custom --knowledge-base path.
When should I use consensus mode?
Use --eval-consensus N when:
- Evaluating critical release candidates
- High variance in historical scores
- Calibrating evaluator after logic changes
- Detecting evaluator bias or drift
Typical N values:
- N=1: Default, fast
- N=3: Standard consensus, good balance
- N=5: High reliability, slower
Cost: N× evaluation time (but no additional test execution).
Can I customize the evaluation rubric?
Yes. Edit evaluator.js → WEIGHTS constant:
const WEIGHTS = {
completeness: 0.30, // Adjust as needed
precision: 0.25,
recall: 0.25,
efficiency: 0.20,
};
Or add custom dimensions:
const customScore = this.scoreCustomDimension(summary);
const overallScore =
completeness * 0.25 +
precision * 0.20 +
recall * 0.20 +
efficiency * 0.15 +
customScore * 0.20;
Recommendation: Keep weights documented in knowledge base:
{
"rubric_config": {
"completeness": 0.30,
"precision": 0.25,
"recall": 0.25,
"efficiency": 0.20
}
}
How do I disable learning for specific test runs?
Temporary disable:
--learn false
Permanent disable (CI):
export DISABLE_TEST_LEARNING=true
Then in runner.js:
if (process.env.DISABLE_TEST_LEARNING === "true") {
args.learn = "false";
}
What's the difference between smoke, triage, and golden?
| Command | Purpose | Speed | Coverage | When to use |
|---|---|---|---|---|
| smoke | Load + interaction sweep | Moderate | High | PR checks, nightly regression |
| triage | Console + network only | Fast | Low | CI failure debugging |
| golden | Deterministic feature path | Moderate | Focused | Feature acceptance, critical path validation |
| perf | Performance profiling | Slow | N/A | Performance investigation |
Can I run evaluation without tests?
Yes. Use eval command or --eval-only true:
# Re-evaluate past run
node runner.js eval --artifacts ./artifacts/default-browser-devtools/20260215_103000
# Or with existing command
node runner.js smoke --url http://localhost:5173 --eval-only true --artifacts <path>
Use case:
- Updated evaluator logic
- Knowledge base refinement
- Historical trend analysis
- Debugging evaluation scores
How does the evaluator handle flaky tests?
Pattern detection:
- Timing issues → Medium severity pattern
- Network timeouts → Suggests retry logic
- Hydration errors → High severity, indicates code issue
Confidence adjustment:
- High variance across runs → Lower confidence
- Inconsistent patterns → Suggests flakiness
Recommendation:
- Use --eval-consensus 3 to detect flakiness
- Review evaluation.json → consensus.variance
- High variance? Add stabilization (timeouts, ready checks)
What happens if knowledge base is corrupted?
Auto-recovery:
- Evaluator detects parse errors
- Falls back to default knowledge base
- Logs warning:
Failed to parse knowledge base: <error>, using defaults
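The auto-recovery behavior described above can be sketched as a parse step with a fallback. This is a minimal illustration assuming the default knowledge base is an empty version of the documented schema; `defaultKnowledgeBase` and `parseKnowledgeBase` are hypothetical names, and the function takes the file contents as a string so that file reading stays with the caller.

```javascript
// Sketch only: hypothetical helpers; the default shape mirrors the documented
// schema with empty collections, which is an assumption.
function defaultKnowledgeBase() {
  return {
    version: "1.0",
    learned_patterns: [],
    optimized_selectors: {},
    threshold_calibration: {},
    failure_clusters: [],
  };
}

// Parse knowledge base text; on any parse problem, warn and return defaults.
function parseKnowledgeBase(text) {
  try {
    const parsed = JSON.parse(text);
    if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
      throw new Error("top-level value is not an object");
    }
    // Fill in any missing top-level fields from the defaults.
    return { ...defaultKnowledgeBase(), ...parsed };
  } catch (error) {
    console.warn(`Failed to parse knowledge base: ${error.message}, using defaults`);
    return defaultKnowledgeBase();
  }
}
```

Merging the parsed object over the defaults also covers older files that lack newer fields such as failure_clusters.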
Manual fix:
# Backup corrupted file
mv test-knowledge.json test-knowledge.json.bak
# Regenerate from template
cp test-knowledge.json.template test-knowledge.json
# Or let evaluator create default
rm test-knowledge.json
node runner.js smoke --url http://localhost:5173 --learn true
Issue: Infinite loop detected in console
Symptom:
{
"type": "infiniteLoop",
"count": 247,
"severity": "critical",
"description": "Infinite loop detected: '[SLICE FETCH EFFECT] Triggered...' repeated 247 times",
"avgIntervalMs": 15
}
Diagnosis:
- Message repeating >20 times with <100ms intervals = infinite loop
- Typically caused by reactive dependency cycles or missing guards in effects
Fix:
Check for missing guards in effects:
$effect(() => {
  if (!someCondition) return; // Add guard
  // ... rest of effect
});
Verify state synchronization:
- Svelte 5: Use getter/setter pairs instead of capturing initial values
- React: Check useEffect dependencies
Look for circular dependencies:
// BAD: Circular update
$effect(() => {
  stateA = stateB; // Updates stateA
});
$effect(() => {
  stateB = stateA; // Updates stateB → triggers first effect → loop
});
// GOOD: Add guard or use derived
let stateB = $derived(stateA); // One-way dependency
Add early return based on previous value:
let lastValue = '';
$effect(() => {
  if (value === lastValue) return; // Skip if unchanged
  lastValue = value;
  // ... process value
});
Real-world example: Inspector toolbox fix
- Problem: loadState.isMergedView captured initial value, stayed false
- Solution: Used getters/setters for proper reactivity
- Result: Guard check worked correctly, no infinite loop
Issue: Svelte reactivity warnings in dev console
Symptom:
{
"type": "svelteReactivityWarning",
"count": 17,
"severity": "medium",
"description": "Svelte 5 reactivity warning detected (state_referenced_locally)",
"examples": [
"This reference only captures the initial value of `isLoading`...",
"This reference only captures the initial value of `isMergedView`..."
]
}
Diagnosis:
- Using shorthand initialization captures initial values, not reactive references
- Example: let obj = $state({ isLoading, headers }) ← BAD
Fix: Convert to getter/setter pairs:
// BAD: Captures initial values
let state = $state({
isLoading,
headers,
isMergedView
});
// GOOD: Uses reactive getters/setters
let state = $state({
get isLoading() { return isLoading; },
set isLoading(v) { isLoading = v; },
get headers() { return headers; },
set headers(v) { headers = v; },
get isMergedView() { return isMergedView; },
set isMergedView(v) { isMergedView = v; }
});
Action items provided by evaluator:
- Convert $state objects to use getter/setter pairs
- Use $derived for computed values
- Avoid capturing initial values in closures
- Review https://svelte.dev/docs/svelte/$state
Issue: Reactive loop causing performance problems
Symptom:
{
"type": "reactiveLoop",
"count": 3,
"severity": "critical",
"description": "Reactive dependency loop or circular dependency detected"
}
Diagnosis:
- Console shows messages about circular dependencies, infinite effects, or reactive loops
- App becomes unresponsive due to constant re-rendering
Fix strategies provided by evaluator:
Identify the reactive chain:
- Use browser DevTools Performance tab
- Look for stack traces in console errors
- Add debug logging to track update sequence
Add conditional guards:
$effect(() => {
  if (updating) return; // Guard against re-entry
  updating = true;
  // ... perform updates
  updating = false;
});
Use untrack() for non-reactive reads:
import { untrack } from 'svelte';
$effect(() => {
  const currentValue = untrack(() => someState); // Read without tracking
  // ... use currentValue
});
Extract shared state:
// Instead of bidirectional updates between A ↔ B
// Use shared state: A → C ← B
let sharedState = $state({});
let derivedA = $derived(computeA(sharedState));
let derivedB = $derived(computeB(sharedState));
Integration with Agentic-Eval Framework
This skill implements core patterns from the Agentic-Eval Framework (.github/skills/agentic-eval/SKILL.md):
Pattern: Evaluator Separation (Level 2)
Implementation:
- runner.js = Generator (produces test results)
- evaluator.js = Evaluator (scores + critiques)
- Separate concerns ensure the evaluator can evolve independently
Pattern: Multi-Judge Consensus (Level 3)
Implementation:
--eval-consensus 3
Logic:
const evaluations = [];
for (let i = 0; i < consensus; i++) {
  evaluations.push(evaluator.evaluate(engineOut));
}
const avgScore = evaluations.reduce((sum, e) => sum + e.overall_score, 0) / evaluations.length;
Variance interpretation:
- Low variance (<0.01) → stable evaluation
- High variance (>0.05) → unreliable, needs refinement
Pattern: Rubric-Based Evaluation
Implementation:
- Four dimensions: completeness, precision, recall, efficiency
- Weighted scoring: overall_score = Σ(dim × weight)
- Documented rubric in evaluation output
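The weighted sum can be sketched as below. The weights are placeholders chosen only so that they add up to 1; the values actually used by evaluator.js are not stated in this document.

```javascript
// Hypothetical weights for illustration; the real rubric may use different values.
const WEIGHTS = { completeness: 0.3, precision: 0.3, recall: 0.25, efficiency: 0.15 };

// overall_score = sum over dimensions of (dimension score × weight)
function overallScore(dimensions, weights = WEIGHTS) {
  let total = 0;
  for (const [name, weight] of Object.entries(weights)) {
    const value = dimensions[name];
    if (typeof value !== 'number') {
      throw new Error(`missing dimension score: ${name}`);
    }
    total += value * weight;
  }
  return total;
}
```

Failing loudly on a missing dimension keeps a partially filled evaluation from silently scoring lower than it should.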
Pattern: Knowledge Persistence (Level 4)
Implementation:
- learned_patterns: Track occurrences over time
- optimized_selectors: Refine based on success rates
- threshold_calibration: Adjust baselines from historical data
- failure_clusters: Identify systemic issues
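To make the first item concrete, here is a minimal sketch of occurrence tracking. Only the top-level key `learned_patterns` comes from this document; the per-pattern fields (`occurrences`, `last_seen`) and the function name are assumptions about a plausible shape, not the real schema of test-knowledge.json.

```javascript
// Hypothetical update step: count how often a pattern type has been seen.
function recordPattern(knowledge, type, runId) {
  // Create the container and the per-pattern entry on first use.
  const patterns = knowledge.learned_patterns ?? (knowledge.learned_patterns = {});
  const entry = patterns[type] ?? (patterns[type] = { occurrences: 0, last_seen: null });
  entry.occurrences += 1;
  entry.last_seen = runId;
  return knowledge;
}
```

The updated object would then be written back to the knowledge base file so the counts survive across CI runs.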
Pattern: Confidence-Based Routing (Level 5)
Implementation:
- calculateConfidence(): Score variance + data completeness
- Confidence output in evaluation JSON
- Future: Skip evaluation if confidence >0.95 and score >0.90
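The planned skip rule could look roughly like the following. This is a sketch of a feature the document lists as future work, so nothing here exists in the skill yet; the `confidence` field name and the function name are assumptions.

```javascript
// Hypothetical routing check for the future "skip evaluation" rule.
// `previous` is assumed to be the last evaluation JSON for the same test.
function shouldSkipEvaluation(previous, { minConfidence = 0.95, minScore = 0.9 } = {}) {
  if (!previous) return false; // no history: always evaluate
  return previous.confidence > minConfidence && previous.overall_score > minScore;
}
```

Both comparisons are strict, matching the ">0.95" and ">0.90" wording, so a result sitting exactly on a threshold is still evaluated.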
Pattern: Cost-Aware Learning
Current:
- Learning can be disabled: --learn false
- Evaluation is fast (no LLM calls, pure logic)
Future:
- Skip evaluation for trivial tests (<50 lines of console output)
- Cache repeated evaluations (hash of artifacts)
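One way the artifact-hash cache could work is sketched below using Node's built-in crypto module. This is an illustration of a planned feature, not existing code: the function names are hypothetical, the cache is in-memory only, and the directory scan is deliberately non-recursive.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

// Hash every file directly inside an artifact directory.
// File names are sorted so the key does not depend on directory listing order.
function hashArtifacts(dir) {
  const hash = crypto.createHash('sha256');
  const files = fs.readdirSync(dir, { withFileTypes: true })
    .filter((entry) => entry.isFile())
    .map((entry) => entry.name)
    .sort();
  for (const name of files) {
    hash.update(name);
    hash.update('\0'); // separator between name and content
    hash.update(fs.readFileSync(path.join(dir, name)));
    hash.update('\0'); // separator between files
  }
  return hash.digest('hex');
}

const evaluationCache = new Map();

// Run `evaluate(dir)` only when the artifacts have not been seen before.
function evaluateWithCache(dir, evaluate) {
  const key = hashArtifacts(dir);
  if (!evaluationCache.has(key)) {
    evaluationCache.set(key, evaluate(dir));
  }
  return evaluationCache.get(key);
}
```

Because the key depends on file contents rather than paths or timestamps, two runs that produce byte-identical artifacts share one evaluation.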
Advanced Examples
Example 1: CI Integration with Learning
GitHub Actions workflow:
name: Cross-Engine Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # required for the knowledge-base push step
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 18
      - name: Install dependencies
        run: npm ci
      - name: Install Playwright
        run: npx playwright install --with-deps
      - name: Start dev server
        run: npm run dev &
        env:
          PORT: 5173
      - name: Wait for server
        run: npx wait-on http://localhost:5173
      - name: Run smoke tests
        run: |
          node .github/skills/default-browser-devtools/runner.js smoke \
            --url http://localhost:5173 \
            --engines webkit,chromium \
            --learn true \
            --knowledge-base .github/skills/default-browser-devtools/test-knowledge.json
      - name: Upload artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: artifacts/
      - name: Commit knowledge base updates
        if: github.ref == 'refs/heads/main'
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add .github/skills/default-browser-devtools/test-knowledge.json
          # --cached compares the staged file against HEAD; a plain `git diff` sees nothing after `git add`
          git diff --cached --quiet || git commit -m "Update test knowledge base [skip ci]"
          git push
Example 2: Local Development Loop
#!/bin/bash
# dev-test-loop.sh
# Start dev server in background
npm run dev &
DEV_PID=$!
# Wait for server
sleep 3
# Run smoke test with learning
node .github/skills/default-browser-devtools/runner.js smoke \
  --url http://localhost:5173 \
  --engines webkit,chromium \
  --learn true \
  --headless false
# Kill dev server
kill $DEV_PID
# Show evaluation summary
cat artifacts/default-browser-devtools/*/webkit/evaluation.json | jq '.overall_score, .feedback'
Example 3: Historical Trend Analysis
# Evaluate multiple past runs
for dir in artifacts/default-browser-devtools/*/; do
  echo "Evaluating $dir" >&2  # progress goes to stderr so jq only receives JSON
  node .github/skills/default-browser-devtools/runner.js eval \
    --artifacts "$dir" \
    --learn false
done | jq -s 'map({timestamp: .test_run_id, score: .overall_score})'
Output:
[
{"timestamp": "20260210_100000", "score": 0.78},
{"timestamp": "20260211_100000", "score": 0.82},
{"timestamp": "20260212_100000", "score": 0.85},
{"timestamp": "20260215_100000", "score": 0.87}
]
Analysis: The score rises on every run shown (0.78 → 0.87), which suggests that learning is effective.
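To quantify the trend rather than eyeball it, a least-squares slope over the run index is one simple option. The sketch below assumes the input has the `{timestamp, score}` shape shown in the output above and treats runs as evenly spaced, which ignores the real gaps between timestamps; the function name is illustrative.

```javascript
// Sketch: least-squares slope of score versus run index (0, 1, 2, ...).
// A positive slope means scores are trending upward across runs.
function scoreTrend(runs) {
  const n = runs.length;
  if (n < 2) throw new Error('need at least two runs to compute a trend');
  const meanX = (n - 1) / 2;
  const meanY = runs.reduce((sum, r) => sum + r.score, 0) / n;
  let covariance = 0;
  let varianceX = 0;
  runs.forEach((run, i) => {
    covariance += (i - meanX) * (run.score - meanY);
    varianceX += (i - meanX) ** 2;
  });
  return covariance / varianceX;
}
```

For the four runs listed above this gives a slope of about 0.03 per run.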
Example 4: Selective Learning (Per-Branch)
# main branch: learning with 3-judge consensus
if [ "$BRANCH" = "main" ]; then
  LEARN=true
  CONSENSUS=3
else
  # feature branches: learning with a single evaluation
  LEARN=true
  CONSENSUS=1
fi
node runner.js smoke --url http://localhost:5173 \
  --learn "$LEARN" \
  --eval-consensus "$CONSENSUS"
References
- Agentic-Eval Framework: .github/skills/agentic-eval/SKILL.md
- Playwright Docs: https://playwright.dev/docs/intro
- WebKit vs Blink: https://en.wikipedia.org/wiki/Comparison_of_browser_engines
License
MIT