diagnose-extraction - SKILL.md Agent Skill

name: diagnose-extraction
description: Deep-dive diagnostics on a low-quality or failed extraction. Analyzes field traces, content provenance, fallback usage, and suggests mapping improvements.
argument-hint: [scraper-name-or-url]

Deep Extraction Diagnostics

Perform a thorough analysis of an extraction result to understand why quality is low or fields are missing. This skill goes beyond /extract by analyzing each field's extraction strategy, identifying structural issues, and suggesting fixes.

Inputs

$ARGUMENTS can be:

A scraper name (runs against existing fixture)
A URL (fetches and analyzes)
A file path to HTML

Workflow

Step 1: Run extraction with full diagnostics

Write and execute an inline script to get the complete diagnostic output:

cd astro-app && npx tsx -e "
import { readFileSync } from 'fs';
import { extractFromHtml } from './src/lib/extractor/html-extractor.js';
const html = readFileSync('<fixture_path>', 'utf-8');
const result = extractFromHtml({
  html,
  sourceUrl: '<url>',
  scraperMappingName: '<name>',
});
const d = result.diagnostics;
console.log(JSON.stringify({
  grade: d?.qualityGrade,
  label: d?.qualityLabel,
  extractionRate: d?.extractionRate,
  weightedRate: d?.weightedExtractionRate,
  totalFields: d?.totalFields,
  populated: d?.populatedFields,
  extractable: d?.extractableFields,
  populatedExtractable: d?.populatedExtractableFields,
  criticalMissing: d?.criticalFieldsMissing,
  emptyFields: d?.emptyFields,
  contentAnalysis: d?.contentAnalysis,
  fieldTraces: d?.fieldTraces,
  splitSchema: result.splitSchema,
}, null, 2));
"

Step 2: Analyze content provenance

Check the contentAnalysis section:

appearsBlocked: true — The page was likely bot-blocked (captcha/verify page). The user needs to provide HTML from a real browser session.
appearsJsOnly: true — The page is a JS-only shell. The user needs to capture the rendered HTML (browser "Save As" after rendering).
jsonLdCount > 0 — JSON-LD structured data is available. Consider adding jsonLdPath strategies.
scriptJsonVarsFound — Known script variables detected (PAGE_MODEL, __NEXT_DATA__, etc.). Consider adding scriptJsonPath strategies.
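
The checks above can be sketched as a small helper that turns a contentAnalysis object into advice lines. This is only an illustration: the four field names come from the list above, but the TypeScript types (for example, scriptJsonVarsFound being a string array) and the helper name adviseOnProvenance are assumptions, not the extractor's actual API.

```typescript
// Assumed shape: only the four field names come from the skill text;
// the types and optionality are guesses made for illustration.
interface ContentAnalysis {
  appearsBlocked?: boolean;
  appearsJsOnly?: boolean;
  jsonLdCount?: number;
  scriptJsonVarsFound?: string[];
}

// Hypothetical helper: maps each provenance signal to one advice line.
function adviseOnProvenance(ca: ContentAnalysis): string[] {
  const advice: string[] = [];
  if (ca.appearsBlocked) {
    advice.push("Page looks bot-blocked: ask for HTML saved from a real browser session.");
  }
  if (ca.appearsJsOnly) {
    advice.push("Page looks like a JS-only shell: ask for the rendered HTML.");
  }
  if ((ca.jsonLdCount ?? 0) > 0) {
    advice.push("JSON-LD is available: consider jsonLdPath strategies.");
  }
  const vars = ca.scriptJsonVarsFound ?? [];
  if (vars.length > 0) {
    advice.push(`Script variables found (${vars.join(", ")}): consider scriptJsonPath strategies.`);
  }
  return advice;
}
```

Blocked and JS-only pages are checked first because no mapping change can help until usable HTML is available.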

Step 3: Analyze field traces

For each empty or problematic field:

Read the field trace — what strategy was attempted?
Read the mapping — is the CSS selector still valid?
Search the HTML fixture — where does the data actually live?
Check for fallbacks — does the field have fallback strategies?
Check the field importance — is it critical (title, price), important (coords, address), or optional?
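
One way to order this work is to sort the empty fields by importance before investigating them. The sketch below assumes a hand-written importance lookup: the tiers follow the examples in the list above, while the exact field names (latitude, longitude) and the helper name triageEmptyFields are hypothetical.

```typescript
type Importance = "critical" | "important" | "optional";

// Hypothetical lookup: tiers mirror the examples in the text;
// the concrete field names are assumptions.
const IMPORTANCE: Record<string, Importance> = {
  title: "critical",
  price: "critical",
  latitude: "important",
  longitude: "important",
  address: "important",
};

const ORDER: Record<Importance, number> = { critical: 0, important: 1, optional: 2 };

// Returns the empty fields with critical ones first; unknown fields count as optional.
function triageEmptyFields(emptyFields: string[]): { field: string; importance: Importance }[] {
  return emptyFields
    .map((field) => ({ field, importance: IMPORTANCE[field] ?? "optional" }))
    .sort((a, b) => ORDER[a.importance] - ORDER[b.importance]);
}
```

Feeding it the emptyFields array from Step 1 gives a worklist in which missing critical fields are handled before optional ones.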

Step 4: Analyze the HTML structure

Look at the fixture HTML for:

JSON-LD blocks (<script type="application/ld+json">) — often contain title, price, address, coordinates
Open Graph meta tags (og:title, og:image, og:description) — good fallback sources
Script variables (__NEXT_DATA__, PAGE_MODEL, __INITIAL_STATE__, dataLayer) — rich structured data
Microdata attributes (itemprop, itemtype) — semantic HTML markers
Twitter card meta tags (twitter:title, twitter:image) — another fallback source
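
A quick way to check a fixture for these sources is a regex scan over the raw HTML. The following is a rough sketch, not part of the extractor: scanHtmlSignals and the HtmlSignals shape are hypothetical names, and regex matching is a heuristic that can miss unusual attribute ordering or quoting.

```typescript
interface HtmlSignals {
  jsonLdBlocks: number;
  openGraphTags: string[];
  twitterTags: string[];
  scriptVars: string[];
  hasMicrodata: boolean;
}

// Script variable names taken from the list above.
const SCRIPT_VARS = ["__NEXT_DATA__", "PAGE_MODEL", "__INITIAL_STATE__", "dataLayer"];

// Hypothetical helper: reports which structured-data sources appear in the HTML.
function scanHtmlSignals(html: string): HtmlSignals {
  // Count opening tags of JSON-LD script blocks.
  const jsonLdBlocks = (
    html.match(/<script[^>]*type=["']application\/ld\+json["'][^>]*>/gi) ?? []
  ).length;

  // Collect meta tag names such as og:title or twitter:image.
  const metaNames = (prefix: string): string[] => {
    const re = new RegExp(`<meta[^>]*(?:property|name)=["'](${prefix}:[^"']+)["']`, "gi");
    return [...html.matchAll(re)].map((m) => m[1]);
  };

  return {
    jsonLdBlocks,
    openGraphTags: metaNames("og"),
    twitterTags: metaNames("twitter"),
    // Plain substring check: a hit means the name occurs somewhere in the page.
    scriptVars: SCRIPT_VARS.filter((v) => html.includes(v)),
    hasMicrodata: /\sitemprop=["']/i.test(html),
  };
}
```

A non-zero jsonLdBlocks or a non-empty scriptVars list suggests a structured source worth inspecting by hand before any CSS selectors are touched.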

Step 5: Generate recommendations

Based on the analysis, provide specific recommendations:

Selector updates — new CSS selectors for fields with broken selectors
Fallback chains — add fallbacks arrays using alternative strategies
Strategy switches — switch from fragile cssLocator to robust scriptJsonPath/jsonLdPath
New fields — data available in HTML that isn't being extracted
Mapping structural issues — fields in wrong sections, missing cssCountId, etc.
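
To make the fallback-chain recommendation concrete, here is a minimal sketch of how a primary strategy followed by ordered fallbacks might be resolved. The key names (cssLocator, cssCountId, jsonLdPath, scriptJsonPath, scriptJsonVar, fallbacks) are taken from this document, but the object structure, the example paths and selector, and the resolveWithFallbacks helper are assumptions for illustration. The real mapping schema should be checked in the mapping files themselves.

```typescript
// Hypothetical strategy shape: key names echo the text, structure is assumed.
interface Strategy {
  cssLocator?: string;
  cssCountId?: string;
  jsonLdPath?: string;
  scriptJsonPath?: string;
  scriptJsonVar?: string;
}

interface FieldMapping extends Strategy {
  fallbacks?: Strategy[];
}

// Illustrative mapping: structured sources first, CSS as the last resort.
// The paths and the selector are made up for this example.
const priceMapping: FieldMapping = {
  jsonLdPath: "offers.price",
  fallbacks: [
    { scriptJsonVar: "__NEXT_DATA__", scriptJsonPath: "props.pageProps.listing.price" },
    { cssLocator: ".price", cssCountId: "0" },
  ],
};

// Tries the primary strategy, then each fallback in order, and reports
// whether a fallback supplied the value.
function resolveWithFallbacks(
  mapping: FieldMapping,
  run: (strategy: Strategy) => string | undefined,
): { value: string | undefined; usedFallback: boolean } {
  const { fallbacks = [], ...primary } = mapping;
  const chain: Strategy[] = [primary, ...fallbacks];
  for (let i = 0; i < chain.length; i++) {
    const value = run(chain[i]);
    if (value !== undefined && value !== "") {
      return { value, usedFallback: i > 0 };
    }
  }
  return { value: undefined, usedFallback: false };
}
```

The usedFallback flag matters for diagnosis: a field that only resolves through a fallback still has a broken primary strategy that should be updated.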

Step 6: Offer to apply fixes

Present the specific JSON changes needed and offer to:

Edit the mapping file
Update manifest expected values if needed
Run validation tests
Commit the changes

Key analysis patterns

| Content Signal | Recommendation |
| --- | --- |
| JSON-LD present, not used | Add `jsonLdPath` strategies (most robust) |
| `__NEXT_DATA__` present | Add `scriptJsonPath` with `scriptJsonVar: "__NEXT_DATA__"` |
| `PAGE_MODEL` present | Add `scriptJsonPath` with `scriptJsonVar: "PAGE_MODEL"` |
| Multiple CSS matches | Add `cssCountId: "0"` to pick first element |
| CSS selector fails | Check if classes changed, try ID-based or microdata selectors |
| Critical fields missing | Priority fix — grade capped at C until resolved |
| Fallback used | Primary strategy is broken, should be updated |

MCP tools

When the property-scraper MCP server is running, these tools can assist with diagnosis:

get_scraper_mapping — inspect the full mapping definition (selectors, regex, fallbacks)
list_supported_portals — check portal metadata and expected extraction rates
extract_property — re-run extraction with full diagnostics on modified HTML