scrape-quality

Verify and improve the quality of scraped dance studio data before publication. Use when validating scraper output, catching normalization/schema issues, measuring field completeness, and triaging freshness or deduplication defects.

By mzgleason. Updated 2/25/2026.

name: scrape-quality
description: Verify and improve the quality of scraped dance studio data before publication. Use when validating scraper output, catching normalization/schema issues, measuring field completeness, and triaging freshness or deduplication defects.

Scrape Quality

Goal

Ensure scraped studio records are accurate, complete, and trustworthy so customers can confidently compare studios.

Workflow

  1. Start with a recent scrape artifact (JSON/JSONL) and define audit scope:
    • Full dataset or targeted source/city.
  2. Validate schema and required fields:
    • Confirm each record has required keys and valid types.
    • Fail fast on malformed URLs, impossible coordinates, or invalid enums.
  3. Measure completeness for customer-critical fields:
    • Studio name, location, dance styles, level signal, schedule signal, pricing signal, contact method.
    • Report the completion rate for each field and flag any field whose rate falls below its defined threshold.
  4. Check normalization quality:
    • Style and level vocabularies are standardized.
    • Address/city/state formatting is consistent.
  5. Check deduplication quality:
    • Detect near-duplicates by name + address + URL patterns.
    • Confirm obvious duplicates are merged or marked.
  6. Check freshness and source reliability:
    • Compare updated_at recency and stale-page rates by source.
    • Flag sources with repeated fetch/parse failures.
  7. Spot-check accuracy against source pages:
    • Manually verify a small sample of high-traffic or recently changed studios.
  8. Prioritize remediation:
    • P0: data corruption or misleading core fields.
    • P1: high-impact completeness/normalization gaps.
    • P2: long-tail consistency improvements.
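
The deterministic checks in steps 2, 3, and 5 can be sketched roughly as follows. This is a minimal illustration, not the project's audit code: the field names (`name`, `url`, `lat`, `lng`, `address`, and so on), the required-key set, and the dedup key are all assumptions, since the skill does not define the record schema. The duplicate check only catches records that differ in case, punctuation, or a `www.` prefix; fuzzy matching is not attempted.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

NUMBER = (int, float)
# Hypothetical schema: the skill does not specify field names or types.
REQUIRED = {"name": str, "url": str, "lat": NUMBER, "lng": NUMBER}
CRITICAL = ["name", "address", "styles", "level", "schedule", "pricing", "contact"]


def schema_errors(rec):
    """Return a list of schema problems for one record (empty if valid)."""
    errors = []
    for key, typ in REQUIRED.items():
        if key not in rec:
            errors.append(f"missing:{key}")
        elif not isinstance(rec[key], typ):
            errors.append(f"type:{key}")
    url = rec.get("url")
    if isinstance(url, str):
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            errors.append("malformed_url")
    lat, lng = rec.get("lat"), rec.get("lng")
    if isinstance(lat, NUMBER) and isinstance(lng, NUMBER):
        if not (-90 <= lat <= 90 and -180 <= lng <= 180):
            errors.append("impossible_coordinates")
    return errors


def completeness(records, fields=CRITICAL):
    """Fraction of records with a non-empty value for each field."""
    total = len(records)
    filled = Counter()
    for rec in records:
        for f in fields:
            if rec.get(f) not in (None, "", [], {}):
                filled[f] += 1
    return {f: (filled[f] / total if total else 0.0) for f in fields}


def dedup_key(rec):
    """Normalize name + address + URL host into a comparison key."""
    def norm(value):
        return re.sub(r"[^a-z0-9]+", " ", str(value or "").lower()).strip()
    host = urlparse(str(rec.get("url") or "")).netloc.lower()
    host = host[4:] if host.startswith("www.") else host
    return (norm(rec.get("name")), norm(rec.get("address")), host)


def near_duplicates(records):
    """Group record indexes that share the same normalized key."""
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[dedup_key(rec)].append(i)
    return [idxs for idxs in groups.values() if len(idxs) > 1]


# Made-up sample records, for illustration only.
records = [
    {"name": "Salsa Loft", "url": "https://www.salsaloft.example/",
     "lat": 40.7, "lng": -74.0, "address": "12 Main St."},
    {"name": "SALSA LOFT", "url": "https://salsaloft.example/classes",
     "lat": 40.7, "lng": -74.0, "address": "12 Main St"},
    {"name": "Tap House", "url": "not-a-url", "lat": 123.0, "lng": 0.0},
]
print(schema_errors(records[2]))
print(completeness(records)["address"])
print(near_duplicates(records))
```

Returning error codes as plain strings keeps the results easy to count per source and to compare between runs.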

Output format

Return findings in this structure:

  • Dataset and scope audited
  • Quality metrics summary (schema validity, completeness rates, duplicate rate, freshness)
  • Issues found (P0/P1/P2)
  • Recommended fix (smallest reversible change)
  • Verification step (tests/checks to confirm fix)
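
One way to keep findings in this structure machine-readable is a small report object. The sketch below is illustrative only: the class names, field names, and JSON layout are assumptions, and the sample values in the usage are made up, not real audit results.

```python
import json
from dataclasses import asdict, dataclass, field

PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2}


@dataclass
class Issue:
    priority: str         # "P0", "P1", or "P2"
    summary: str
    recommended_fix: str  # smallest reversible change
    verification: str     # test or check that confirms the fix


@dataclass
class AuditReport:
    dataset: str
    scope: str
    metrics: dict = field(default_factory=dict)
    issues: list = field(default_factory=list)

    def to_json(self):
        """Serialize the report with issues ordered P0, then P1, then P2."""
        data = asdict(self)
        data["issues"].sort(key=lambda issue: PRIORITY_ORDER[issue["priority"]])
        return json.dumps(data, indent=2)


# Hypothetical usage with placeholder values.
report = AuditReport(
    dataset="scrape.jsonl",
    scope="full dataset",
    metrics={"schema_validity": 0.98, "duplicate_rate": 0.03},
    issues=[
        Issue("P2", "Inconsistent state abbreviations",
              "Map state names to two-letter codes", "Re-run normalization check"),
        Issue("P0", "Coordinates swapped for one source",
              "Swap lat/lng in that source's parser", "Re-run schema validation"),
    ],
)
print(report.to_json())
```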

Guardrails

  • Prefer deterministic checks and reproducible metrics over subjective judgments.
  • Keep quality gates explicit and versioned where possible.
  • Do not overwrite production data during audits; write findings to a separate report.
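
As a rough illustration of explicit, versioned quality gates whose results go to a separate report file, consider the sketch below. The gate layout, the version label, the threshold values, and the output file name are all placeholders chosen for the example, not values defined by this skill.

```python
import json
from pathlib import Path

# Hypothetical gate definition; thresholds are placeholders, not project values.
GATES = {
    "version": "example-1",
    "min_completeness": {"name": 0.99, "address": 0.90},
}


def evaluate_gates(rates, gates=GATES):
    """Compare per-field completion rates against the versioned minimums."""
    failures = sorted(
        name for name, minimum in gates["min_completeness"].items()
        if rates.get(name, 0.0) < minimum
    )
    return {
        "gates_version": gates["version"],
        "passed": not failures,
        "failed_fields": failures,
    }


def write_report(result, out_dir):
    """Write findings to their own file; the audited dataset is never touched."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"audit-{result['gates_version']}.json"
    path.write_text(json.dumps(result, indent=2, sort_keys=True))
    return path
```

Recording the gate version in the report makes it possible to tell later which thresholds a given audit was judged against.
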
Install via CLI
npx skills add https://github.com/mzgleason/openfloor_dance_studio_directory --skill scrape-quality
Repository Details
Branch: main
Path: SKILL.md