name: diagnose-extraction description: Deep-dive diagnostics on a low-quality or failed extraction. Analyzes field traces, content provenance, fallback usage, and suggests mapping improvements. argument-hint: [scraper-name-or-url]
Deep Extraction Diagnostics
Perform a thorough analysis of an extraction result to understand why quality is low or fields are missing. Goes beyond /extract by analyzing each field's extraction strategy, suggesting fixes, and identifying structural issues.
Inputs
$ARGUMENTS can be:
- A scraper name (runs against existing fixture)
- A URL (fetches and analyzes)
- A file path to HTML
Workflow
Step 1: Run extraction with full diagnostics
Write and execute an inline script to get the complete diagnostic output:
cd astro-app && npx tsx -e "
import { readFileSync } from 'fs';
import { extractFromHtml } from './src/lib/extractor/html-extractor.js';
const html = readFileSync('<fixture_path>', 'utf-8');
const result = extractFromHtml({
html,
sourceUrl: '<url>',
scraperMappingName: '<name>',
});
const d = result.diagnostics;
console.log(JSON.stringify({
grade: d?.qualityGrade,
label: d?.qualityLabel,
extractionRate: d?.extractionRate,
weightedRate: d?.weightedExtractionRate,
totalFields: d?.totalFields,
populated: d?.populatedFields,
extractable: d?.extractableFields,
populatedExtractable: d?.populatedExtractableFields,
criticalMissing: d?.criticalFieldsMissing,
emptyFields: d?.emptyFields,
contentAnalysis: d?.contentAnalysis,
fieldTraces: d?.fieldTraces,
splitSchema: result.splitSchema,
}, null, 2));
"
Step 2: Analyze content provenance
Check the contentAnalysis section:
appearsBlocked: true— The page was likely bot-blocked (captcha/verify page). The user needs to provide HTML from a real browser session.appearsJsOnly: true— The page is a JS-only shell. The user needs to capture the rendered HTML (browser "Save As" after rendering).jsonLdCount > 0— JSON-LD structured data is available. Consider addingjsonLdPathstrategies.scriptJsonVarsFound— Known script variables detected (PAGE_MODEL, NEXT_DATA, etc). Consider addingscriptJsonPathstrategies.
Step 3: Analyze field traces
For each empty or problematic field:
- Read the field trace — what strategy was attempted?
- Read the mapping — is the CSS selector still valid?
- Search the HTML fixture — where does the data actually live?
- Check for fallbacks — does the field have fallback strategies?
- Check the field importance — is it critical (title, price), important (coords, address), or optional?
Step 4: Analyze the HTML structure
Look at the fixture HTML for:
- JSON-LD blocks (
<script type="application/ld+json">) — often contain title, price, address, coordinates - Open Graph meta tags (
og:title,og:image,og:description) — good fallback sources - Script variables (
__NEXT_DATA__,PAGE_MODEL,__INITIAL_STATE__,dataLayer) — rich structured data - Microdata attributes (
itemprop,itemtype) — semantic HTML markers - Twitter card meta tags (
twitter:title,twitter:image) — another fallback source
Step 5: Generate recommendations
Based on the analysis, provide specific recommendations:
- Selector updates — new CSS selectors for fields with broken selectors
- Fallback chains — add
fallbacksarrays using alternative strategies - Strategy switches — switch from fragile cssLocator to robust scriptJsonPath/jsonLdPath
- New fields — data available in HTML that isn't being extracted
- Mapping structural issues — fields in wrong sections, missing cssCountId, etc.
Step 6: Offer to apply fixes
Present the specific JSON changes needed and offer to:
- Edit the mapping file
- Update manifest expected values if needed
- Run validation tests
- Commit the changes
Key analysis patterns
| Content Signal | Recommendation |
|---|---|
| JSON-LD present, not used | Add jsonLdPath strategies (most robust) |
__NEXT_DATA__ present |
Add scriptJsonPath with scriptJsonVar: "__NEXT_DATA__" |
PAGE_MODEL present |
Add scriptJsonPath with scriptJsonVar: "PAGE_MODEL" |
| Multiple CSS matches | Add cssCountId: "0" to pick first element |
| CSS selector fails | Check if classes changed, try ID-based or microdata selectors |
| Critical fields missing | Priority fix — grade capped at C until resolved |
| Fallback used | Primary strategy is broken, should be updated |
MCP tools
When the property-scraper MCP server is running, these tools can assist with diagnosis:
get_scraper_mapping— inspect the full mapping definition (selectors, regex, fallbacks)list_supported_portals— check portal metadata and expected extraction ratesextract_property— re-run extraction with full diagnostics on modified HTML