name: oem-crawl
description: Website crawling with a cheap-check and full-render pipeline. Monitors OEM pages for changes using hash comparison, then triggers full browser rendering via CDP when changes are detected.
OEM Crawl
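Monitors OEM websites for changes using a two-stage pipeline: a cheap hash check, then a full browser render when a change is detected. Gatsby-based sites can skip the browser entirely (Stage 0).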
Triggers
- Cron schedule: each trigger targets specific page types via the crawl_type filter:
  - Every 2h: homepage pages (banners)
  - Every 4h: offers pages (offers + offer-page banners)
  - Every 12h: vehicle, category, build_price pages (variants/models)
  - Daily 6am: news pages
  - Daily 7am: sitemap pages
- Manual trigger via POST /api/v1/oem-agent/admin/crawl/:oemId (crawls all page types)
- Force trigger via POST /api/v1/oem-agent/admin/force-crawl/:oemId (resets scheduler, bypasses backoff)
- Slack command ("check Ford's offers")
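The cron schedule above could be expressed as a lookup from cron expression to page types. This is a minimal sketch, not the scheduler's actual configuration: the cron strings are assumptions derived from the listed intervals (standard five-field syntax), and the names CRAWL_SCHEDULE, ALL_PAGE_TYPES, and page_types_for, along with the "manual" trigger value, are hypothetical.

```python
from typing import List, Optional

# Hypothetical mapping from cron expression to the page types it crawls.
# The cron strings are assumptions inferred from the intervals listed above.
CRAWL_SCHEDULE = {
    "0 */2 * * *": ["homepage"],
    "0 */4 * * *": ["offers"],
    "0 */12 * * *": ["vehicle", "category", "build_price"],
    "0 6 * * *": ["news"],
    "0 7 * * *": ["sitemap"],
}

# Every page type that appears anywhere in the schedule.
ALL_PAGE_TYPES = sorted({t for types in CRAWL_SCHEDULE.values() for t in types})


def page_types_for(trigger: str, cron: Optional[str] = None) -> List[str]:
    """Return the page types a trigger should crawl."""
    if trigger == "cron":
        # Unknown cron expressions crawl nothing rather than everything.
        return list(CRAWL_SCHEDULE.get(cron or "", []))
    if trigger == "manual":
        # The manual endpoint crawls all page types.
        return list(ALL_PAGE_TYPES)
    raise ValueError(f"unsupported trigger: {trigger!r}")
```

Keeping the schedule as data rather than branching logic means a new page type only needs a new entry in the table.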
Prerequisites
- SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY for the source_pages, import_runs, and change_events tables
- WORKER_URL for CDP proxy access
- CDP_SECRET for browser rendering authentication
Pipeline
Stage 0: Gatsby Page-Data (Skip Browser)
For Gatsby-based OEMs (e.g. LDV AU), structured data is available at page-data.json endpoints. These return pre-rendered JSON with full vehicle data (specs, variants, colors, pricing) — no browser rendering needed. Detection: check if /{route}/page-data.json returns valid JSON.
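The detection step can be sketched as two small helpers: one builds the /{route}/page-data.json URL, the other decides whether a response body is usable JSON. This is an illustrative sketch only. The function names are hypothetical, the HTTP fetch is left to the caller, and treating only a JSON object (not an array or scalar) as valid is an assumption.

```python
import json
from typing import Any, Dict, Optional


def page_data_url(base_url: str, route: str) -> str:
    """Build the /{route}/page-data.json URL for a site route."""
    path = route.strip("/")
    suffix = f"{path}/page-data.json" if path else "page-data.json"
    return f"{base_url.rstrip('/')}/{suffix}"


def parse_page_data(body: str) -> Optional[Dict[str, Any]]:
    """Return the parsed JSON object, or None if the body is not one.

    An HTML error page or a non-Gatsby site typically fails here, which
    signals that the normal cheap-check path should be used instead.
    """
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        return None
    return data if isinstance(data, dict) else None
```

A caller would fetch page_data_url(...) with its HTTP client of choice and fall through to Stage 1 whenever parse_page_data returns None.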
Stage 1: Cheap Check
- Fetch HTML for each active source page
- Normalise HTML (strip scripts, styles, data attributes, etc.)
- Hash normalised content
- Compare against stored hash
- If unchanged: update last_checked_at, increment consecutive_no_change
- If changed: queue for full render
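The cheap check above can be sketched with the standard library alone. This is a sketch under stated assumptions rather than the skill's actual normaliser: SHA-256 is an assumed hash function, the content_hash field name is hypothetical, and the exact set of stripped elements (scripts, styles, comments, data-* attributes, whitespace runs) is a guess at what "etc." covers.

```python
import hashlib
from datetime import datetime, timezone
from html.parser import HTMLParser


class _Normaliser(HTMLParser):
    """Rebuilds HTML without scripts, styles, comments, or data-* attributes."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skipping = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skipping = True
            return
        if self._skipping:
            return
        # Drop data-* attributes and sort the rest so attribute order is stable.
        kept = sorted((k, v or "") for k, v in attrs if not k.startswith("data-"))
        self.parts.append("<" + tag + "".join(f' {k}="{v}"' for k, v in kept) + ">")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skipping = False
        elif not self._skipping:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skipping:
            text = " ".join(data.split())  # collapse whitespace runs
            if text:
                self.parts.append(text)


def normalise_html(html: str) -> str:
    parser = _Normaliser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def content_hash(html: str) -> str:
    return hashlib.sha256(normalise_html(html).encode("utf-8")).hexdigest()


def cheap_check(page: dict, html: str) -> bool:
    """Return True if the page changed and should be queued for full render.

    On no change, updates last_checked_at and consecutive_no_change in place.
    Persisting the new hash after a change is left to the caller.
    """
    if content_hash(html) == page.get("content_hash"):
        page["last_checked_at"] = datetime.now(timezone.utc).isoformat()
        page["consecutive_no_change"] = page.get("consecutive_no_change", 0) + 1
        return False
    return True
```

Normalising before hashing is what keeps the check cheap in practice: rotating tracking IDs or inline script nonces would otherwise trigger a full render on every pass.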
Stage 2: Full Render (if changed)
- Connect to headless browser: Lightpanda (primary, via LIGHTPANDA_URL) or Cloudflare Browser Rendering (fallback, via CDP proxy)
- Navigate page, wait for JS render
- Capture rendered DOM
- Hand off to oem-extract skill for data extraction
- Optionally capture network requests for API discovery
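The primary/fallback choice between browsers can be sketched independently of any particular CDP client. In the sketch below, each renderer is a plain callable that takes a page URL and returns the rendered DOM as HTML. The name render_with_fallback is hypothetical, and falling back on any exception is an assumption; the actual skill may instead choose a browser based on configuration.

```python
from typing import Callable, List, Sequence, Tuple

# A renderer takes a page URL and returns the rendered DOM as an HTML string.
Renderer = Callable[[str], str]


def render_with_fallback(
    url: str, renderers: Sequence[Tuple[str, Renderer]]
) -> Tuple[str, str]:
    """Try each renderer in order; return (renderer_name, rendered_html).

    Raises RuntimeError listing every failure if no renderer succeeds.
    """
    errors: List[str] = []
    for name, render in renderers:
        try:
            return name, render(url)
        except Exception as exc:  # a real implementation would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all renderers failed: " + "; ".join(errors))
```

In use, the first entry would wrap a client connected to LIGHTPANDA_URL and the second a client connected through the CDP proxy, so callers never need to know which browser produced the DOM.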
Auto-sync on upsert: Every product upsert automatically runs syncVariantColors() (for all OEMs) and buildSpecsJson() to keep variant_colors and specs_json current without separate enrichment passes.
Input
{
  "oem_id": "ford-au",
  "page_type": "offers",
  "trigger": "cron",
  "cron": "0 5 * * *"
}
Output
{
  "pages_checked": 15,
  "changes_detected": 3
}