name: scrape-quality
description: Verify and improve the quality of scraped dance studio data before publication. Use when validating scraper output, catching normalization/schema issues, measuring field completeness, and triaging freshness or deduplication defects.
Scrape Quality
Goal
Ensure scraped studio records are accurate, complete, and trustworthy so customers can confidently compare studios.
Workflow
- Start with a recent scrape artifact (JSON/JSONL) and define audit scope:
  - Full dataset or targeted source/city.
- Validate schema and required fields:
  - Confirm each record has required keys and valid types.
  - Fail fast on malformed URLs, impossible coordinates, or invalid enums.
- Measure completeness for customer-critical fields:
  - Studio name, location, dance styles, level signal, schedule signal, pricing signal, contact method.
  - Report completion rates and flag fields below threshold.
- Check normalization quality:
  - Style and level vocabularies are standardized.
  - Address/city/state formatting is consistent.
- Check deduplication quality:
  - Detect near-duplicates by name + address + URL patterns.
  - Confirm obvious duplicates are merged or marked.
- Check freshness and source reliability:
  - Compare `updated_at` recency and stale-page rates by source.
  - Flag sources with repeated fetch/parse failures.
- Spot-check accuracy against source pages:
  - Manually verify a small sample of high-traffic or recently changed studios.
- Prioritize remediation:
  - P0: data corruption or misleading core fields.
  - P1: high-impact completeness/normalization gaps.
  - P2: long-tail consistency improvements.
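
The schema, completeness, and deduplication steps above can be sketched as deterministic checks. This is a minimal illustration, not the project's actual validator: the field names (`name`, `url`, `lat`, `lon`, `address`, `level`, and so on), the `VALID_LEVELS` enum, and the helper names are all assumptions to adapt to the real scraper schema.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical schema: required keys and accepted types (assumed, not from the scraper).
REQUIRED = {"name": str, "url": str, "lat": (int, float), "lon": (int, float)}
# Hypothetical customer-critical fields used for completeness reporting.
CRITICAL = ["name", "address", "styles", "level", "schedule", "pricing", "contact"]
VALID_LEVELS = {"beginner", "intermediate", "advanced", "all"}  # assumed enum


def schema_errors(rec):
    """Return a list of schema problems for one record (empty list means valid)."""
    errors = []
    for key, typ in REQUIRED.items():
        if key not in rec:
            errors.append(f"missing:{key}")
        elif not isinstance(rec[key], typ):
            errors.append(f"type:{key}")
    if errors:
        return errors  # fail fast: value checks need well-typed fields
    parsed = urlparse(rec["url"])
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("bad_url")
    if not (-90 <= rec["lat"] <= 90 and -180 <= rec["lon"] <= 180):
        errors.append("bad_coords")
    level = rec.get("level")
    if level is not None and (not isinstance(level, str) or level not in VALID_LEVELS):
        errors.append("bad_enum:level")
    return errors


def completeness(records, fields=CRITICAL):
    """Fraction of records with a non-empty value for each field."""
    total = len(records)
    if total == 0:
        return {f: 0.0 for f in fields}
    empty = (None, "", [], {})
    return {
        f: sum(1 for r in records if r.get(f) not in empty) / total
        for f in fields
    }


def dedup_key(rec):
    """Crude near-duplicate key: normalized name + address + URL host."""
    def norm(value):
        return "".join(ch for ch in str(value or "").lower() if ch.isalnum())

    host = urlparse(str(rec.get("url") or "")).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return (norm(rec.get("name")), norm(rec.get("address")), host)


def duplicate_groups(records):
    """Map each dedup key shared by more than one record to its record count."""
    counts = Counter(dedup_key(r) for r in records)
    return {key: n for key, n in counts.items() if n > 1}
```

The dedup key here only catches duplicates that differ in case, punctuation, or a `www.` prefix; fuzzier matches (abbreviated street names, renamed studios) would need a similarity measure on top of it.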
Output format
Return findings in this structure:
- Dataset and scope audited
- Quality metrics summary (schema validity, completeness rates, duplicate rate, freshness)
- Issues found (P0/P1/P2)
- Recommended fix (smallest reversible change)
- Verification step (tests/checks to confirm fix)
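
One way to keep findings in this structure is a small typed container that serializes to JSON. The sketch below is illustrative only: the class names `Issue` and `AuditReport` and their field names are assumptions chosen to mirror the bullets above, not an existing schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Issue:
    priority: str          # "P0", "P1", or "P2"
    summary: str
    recommended_fix: str   # smallest reversible change
    verification: str      # test or check that confirms the fix


@dataclass
class AuditReport:
    dataset: str           # which scrape artifact was audited
    scope: str             # full dataset or targeted source/city
    metrics: dict = field(default_factory=dict)
    issues: list = field(default_factory=list)

    def to_json(self):
        # Sort so P0 issues are listed before P1 and P2.
        ordered = sorted(self.issues, key=lambda issue: issue.priority)
        payload = {**asdict(self), "issues": [asdict(issue) for issue in ordered]}
        return json.dumps(payload, indent=2)
```

Sorting on the priority string works because "P0" < "P1" < "P2" lexicographically; a different labeling scheme would need an explicit ordering.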
Guardrails
- Prefer deterministic checks and reproducible metrics over subjective judgments.
- Keep quality gates explicit and versioned where possible.
- Do not overwrite production data during audits; write findings to a separate report.
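
As a rough sketch of these guardrails, quality gates can live in one versioned structure, be evaluated deterministically, and have their results written to a separate report file. The threshold values, the `"v1"` version label, and the function names below are placeholders invented for illustration, not recommended settings.

```python
import json
from pathlib import Path

# Hypothetical gate definition; bump "version" whenever a threshold changes.
GATES = {
    "version": "v1",
    "min_completeness": {"name": 1.0, "address": 0.95, "pricing": 0.60},
    "max_duplicate_rate": 0.02,
}


def evaluate_gates(metrics, gates=GATES):
    """Compare measured metrics to thresholds; return sorted failure strings."""
    failures = []
    measured = metrics.get("completeness", {})
    for fld, minimum in gates["min_completeness"].items():
        actual = measured.get(fld, 0.0)
        if actual < minimum:
            failures.append(f"completeness:{fld} {actual:.2f} < {minimum:.2f}")
    if metrics.get("duplicate_rate", 0.0) > gates["max_duplicate_rate"]:
        failures.append("duplicate_rate above threshold")
    return sorted(failures)


def write_report(metrics, out_dir, gates=GATES):
    """Write findings to a separate report file; the audited data is never modified."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"audit-{gates['version']}.json"
    payload = {
        "gates_version": gates["version"],
        "metrics": metrics,
        "failures": evaluate_gates(metrics, gates),
    }
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path
```

Recording the gate version inside each report makes it possible to tell whether two audits were judged against the same thresholds.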