name: oem-extract
description: Product, offer, and banner extraction from rendered HTML. Uses deterministic extraction first (JSON-LD, OpenGraph, CSS selectors) with LLM fallback for complex cases.
OEM Extract
Extracts structured data from rendered OEM web pages.
Extraction Priority (deterministic first, LLM fallback)
- Gatsby page-data.json - Static JSON endpoints at /{route}/page-data.json (LDV AU: Gatsby 5.14.6 + i-Motor CMS). Returns full vehicle data (specs, variants, colors with price deltas, pricing). No browser rendering needed.
- JSON-LD structured data
- OpenGraph meta tags
- CSS selector extraction (OEM-specific via lib/extractors/)
- LLM normalisation fallback (Groq) - only if <80% field coverage
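The priority order above can be sketched as a merge over deterministic extractors followed by a coverage check. This is a minimal sketch, not the skill's actual code: the `Fields` and `Extractor` types, the function names, and the `REQUIRED_FIELDS` list are all assumptions made for illustration. Only the ordering and the 80% threshold come from the list above.

```typescript
// Hypothetical types: a flat bag of extracted fields and an extractor signature.
type Fields = Record<string, string | number | null | undefined>;
type Extractor = (html: string) => Fields;

// Assumed set of fields used to measure coverage (not taken from the source).
const REQUIRED_FIELDS = ["title", "body_type", "fuel_type", "price_amount"];

function isFilled(value: unknown): boolean {
  return value !== null && value !== undefined && value !== "";
}

// Fraction of required fields that carry a non-empty value.
function fieldCoverage(fields: Fields, required: string[] = REQUIRED_FIELDS): number {
  if (required.length === 0) return 1;
  return required.filter((key) => isFilled(fields[key])).length / required.length;
}

// Run extractors in priority order; a value from an earlier source is never overwritten.
function extractDeterministic(html: string, extractors: Extractor[]): Fields {
  const merged: Fields = {};
  for (const extract of extractors) {
    for (const [key, value] of Object.entries(extract(html))) {
      if (!isFilled(merged[key]) && isFilled(value)) merged[key] = value;
    }
  }
  return merged;
}

// The LLM fallback is only warranted when coverage is below the threshold (80%).
function needsLlmFallback(fields: Fields, threshold = 0.8): boolean {
  return fieldCoverage(fields) < threshold;
}
```

Passing the extractors as an ordered array keeps the priority explicit: reordering the array changes which source wins on a conflict.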
Prerequisites
- SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY for data storage
- GROQ_API_KEY for LLM fallback extraction
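A start-up check for these variables might look like the following. The `missingEnv` helper is hypothetical and only illustrates failing fast when a prerequisite is absent.

```typescript
// The three variable names come from the prerequisites above.
const REQUIRED_ENV = ["SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY", "GROQ_API_KEY"] as const;

// Returns the names of required variables that are unset or empty.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]);
}

// Typical use in Node (assumption): const missing = missingEnv(process.env);
```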
Triggered By
- oem-crawl skill after full render detects a change
Page Types
- vehicle - Model overview pages with basic specs and pricing
- build_price - Build-and-price/configurator pages with:
  - Variant selection (trims, drivetrains)
  - Color options with pricing
  - Detailed disclaimers
  - Example: kia.com/au/shopping-tools/build-and-price.trim.k4-hatch.html
- offers - Special offers and promotions
- homepage - Hero banners and featured content
Data Types
Products
Vehicle listings with specifications, pricing, and features.
Core Fields:
- title, subtitle - Vehicle name and series
- body_type - SUV, Sedan, Hatchback, etc.
- fuel_type - Petrol, Diesel, Hybrid, Electric
- availability - For Sale, Coming Soon, Sold Out
- price_amount, price_currency, price_type, price_qualifier
Technical Specifications (specs_json JSONB — auto-built on every product upsert):
- engine - type, displacement_cc, cylinders, power_kw, torque_nm
- transmission - type, gears, drive
- dimensions - length_mm, width_mm, height_mm, wheelbase_mm, kerb_weight_kg
- performance - fuel_combined_l100km (ICE) or range_km, battery_kwh (EV)
- towing - braked_kg, unbraked_kg
- capacity - doors, seats, boot_litres, fuel_tank_litres
- safety - ancap_stars, airbags
- wheels - size, type
All 8 categories at 100% coverage across all 19 OEMs. GMSV AU fully populated (6 vehicle_models, 7 products, 55 colors, full specs).
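The eight specs_json categories can be written down as a TypeScript shape, with a helper that reports which categories are populated. The category and field names come from the list above. The value types (number versus string), the optionality of every field, and the `populatedCategories` helper are assumptions made for illustration.

```typescript
// Assumed value types; the source names the fields but not their types.
interface SpecsJson {
  engine?: { type?: string; displacement_cc?: number; cylinders?: number; power_kw?: number; torque_nm?: number };
  transmission?: { type?: string; gears?: number; drive?: string };
  dimensions?: { length_mm?: number; width_mm?: number; height_mm?: number; wheelbase_mm?: number; kerb_weight_kg?: number };
  performance?: { fuel_combined_l100km?: number; range_km?: number; battery_kwh?: number };
  towing?: { braked_kg?: number; unbraked_kg?: number };
  capacity?: { doors?: number; seats?: number; boot_litres?: number; fuel_tank_litres?: number };
  safety?: { ancap_stars?: number; airbags?: number };
  wheels?: { size?: string; type?: string };
}

const SPEC_CATEGORIES = [
  "engine", "transmission", "dimensions", "performance",
  "towing", "capacity", "safety", "wheels",
] as const;

// A category counts as populated when at least one of its fields has a value.
function populatedCategories(specs: SpecsJson): string[] {
  return SPEC_CATEGORIES.filter((category) => {
    const group = specs[category];
    return group !== undefined && Object.values(group).some((v) => v !== undefined && v !== null);
  });
}
```

A product reaches full category coverage when `populatedCategories(specs).length` equals `SPEC_CATEGORIES.length`.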
Vehicle Specifications (scalar columns — auto-populated from meta_json on every upsert):
engine_size,cylinders,transmission,gears,drive,doors,seats
OEM Marketing Features (key_features array):
- Features displayed on OEM vehicle pages (e.g., "Apple CarPlay", "Blind Spot Monitor", "Adaptive Cruise Control")
- These are promotional/marketing features, NOT vehicle specifications
Variants (variants array):
- Each variant includes: name, price_amount, price_type, drivetrain, engine
- colors - Available color options for the variant:
  - name - Color name (e.g., "Aurora Black Pearl")
  - code - OEM color code (e.g., "WK")
  - hex - Hex color value for UI display
  - swatch_url - URL to color swatch image
  - price_delta - Additional cost (0 for standard, e.g., 695 for metallic)
  - is_standard - Whether included in base price
- disclaimer_text - Variant-specific disclaimer
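To illustrate how price_delta relates to the variant price, here is a minimal sketch. The interfaces mirror the field names above, but their types, the `priceWithColor` helper, and the example values are assumptions made for illustration.

```typescript
interface VariantColor {
  name: string;
  code: string;
  hex?: string;
  swatch_url?: string;
  price_delta: number; // 0 for standard colors
  is_standard: boolean;
}

interface Variant {
  name: string;
  price_amount: number;
  price_type: string;
  drivetrain?: string;
  engine?: string;
  colors: VariantColor[];
  disclaimer_text?: string;
}

// Variant price plus the delta of the chosen color (hypothetical helper).
function priceWithColor(variant: Variant, colorCode: string): number {
  const color = variant.colors.find((c) => c.code === colorCode);
  if (!color) throw new Error(`Unknown color code: ${colorCode}`);
  return variant.price_amount + color.price_delta;
}

// Illustrative values only; the prices and the "XX" color are made up.
const exampleVariant: Variant = {
  name: "Example Trim",
  price_amount: 30000,
  price_type: "driveaway",
  colors: [
    { name: "Example White", code: "XX", price_delta: 0, is_standard: true },
    { name: "Aurora Black Pearl", code: "WK", price_delta: 695, is_standard: false },
  ],
};
```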
Product Disclaimer (disclaimer_text):
- Legal disclaimers attached to the product pricing
- Extracted from build-and-price pages
Metadata (meta_json):
- VIN, registration, stock number
- Odometer, build year, exterior colour
- Location (suburb, state, postcode, lat/lng)
- Source system references (network_id, specification_code)
Offers
Promotional offers including factory bonuses, run-out sales, value-add deals, and financing specials.
Core Fields:
- title - Offer headline (e.g., "Musso Factory Bonus - Save $2,000")
- description - Offer details and conditions
- offer_type - Classification: factory_bonus, run_out, value_add, finance, lease
- price_amount - Private/RRP starting price for the applicable model
- abn_price_amount - ABN holder price (when different from private price)
- saving_amount - Dollar amount saved (e.g., 2000, 5010)
- price_currency - Always AUD
- price_type - rrp, driveaway, etc.
Validity:
- validity_start - Offer start date (ISO 8601)
- validity_end - Offer end date (ISO 8601)
- validity_raw - Raw validity text from source (e.g., "Limited time, while stocks last")
Display:
- hero_image_r2_key - URL to offer hero image (OEM CDN or CMS media)
- cta_text - Call-to-action button text
- cta_url - Call-to-action link URL
- disclaimer_text - Legal disclaimer text
Linking:
- model_id - FK to vehicle_models (when offer applies to a single model)
- applicable_models - Comma-separated model slugs (when offer applies to multiple models)
- external_key - Unique identifier for deduplication (e.g., offer-musso-factory-bonus)
- source_url - URL where the offer was found
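Two of the offer fields above lend themselves to small helpers: splitting applicable_models into slugs and checking whether an offer is inside its validity window. Both functions are hypothetical sketches. The rule that a missing start or end date leaves that side of the window open is an assumption, not something the source states.

```typescript
interface OfferValidity {
  validity_start?: string | null; // ISO 8601
  validity_end?: string | null;   // ISO 8601
}

// Split the comma-separated slug list, trimming whitespace and dropping empty entries.
function parseApplicableModels(raw: string | null | undefined): string[] {
  if (!raw) return [];
  return raw.split(",").map((slug) => slug.trim()).filter((slug) => slug.length > 0);
}

// True when `now` falls inside the validity window (both bounds inclusive).
// Assumption: a missing bound means the window is open on that side.
function isOfferActive(offer: OfferValidity, now: Date): boolean {
  const start = offer.validity_start ? Date.parse(offer.validity_start) : -Infinity;
  const end = offer.validity_end ? Date.parse(offer.validity_end) : Infinity;
  if (Number.isNaN(start) || Number.isNaN(end)) return false; // unparseable dates
  const t = now.getTime();
  return t >= start && t <= end;
}
```

Note that a date-only value such as "2025-01-31" parses as midnight UTC at the start of that day, so an end date stored without a time component would exclude most of its final day under this check.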
Banners
Homepage and offers page hero banners across all 19 OEMs (176 banners, 100% with desktop images).
Core Fields:
- page_url - Which page the banner appears on (homepage, offers page)
- position - Slide/carousel order (0-indexed)
- headline - Primary heading text
- sub_headline - Secondary heading text
- cta_text - Call-to-action button label
- cta_url - Call-to-action link URL
- image_url_desktop - Desktop hero image URL
- image_url_mobile - Mobile hero image URL
- image_r2_key - R2 proxy key (after rewrite)
- image_sha256 - Content hash for change detection
- video_url_desktop - Desktop video URL (mp4 or Brightcove)
- video_url_mobile - Mobile video URL
- disclaimer_text - Legal disclaimer text
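The image_sha256 field suggests hashing the image bytes and comparing against the stored value. A minimal sketch using Node's built-in crypto module follows. The helper names are hypothetical, and whether the real pipeline hashes the raw bytes or something else is not stated in the source.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of the image bytes.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// A banner image is treated as changed when its hash differs from the stored one.
function imageChanged(previousSha256: string | null, bytes: Uint8Array): boolean {
  return previousSha256 !== sha256Hex(bytes);
}
```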
Extraction Methods:
- Server-rendered HTML (cheerio): Ford, GWM, KGM, Isuzu, Mazda, VW, Kia, Nissan — extract from slick/swiper carousels, background-image CSS, img tags
- Browser-rendered (Chrome MCP): Hyundai, Toyota, Suzuki — client-side JS carousels with lazy-loaded images
- Video banners: Ford (Brightcove Playback API — account 4082198814001, policy key needed), Suzuki (direct mp4 URLs)
- Offers page banners: GWM (/special-offers/), Suzuki (/latest-offers/)
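For the server-rendered case, banner images sometimes live in an inline background-image style rather than an img tag. The sketch below pulls the URL out of a style string with a plain regular expression. It is a stand-in for illustration, not the cheerio-based extractor the source refers to, and `backgroundImageUrl` is a hypothetical name.

```typescript
// Extract the first url(...) from a background or background-image declaration.
// Handles single-quoted, double-quoted, and unquoted URLs.
function backgroundImageUrl(style: string): string | null {
  const match = /background(?:-image)?\s*:[^;]*url\(\s*(['"]?)(.*?)\1\s*\)/i.exec(style);
  return match?.[2] ?? null;
}
```

In a cheerio pipeline the input would typically be the `style` attribute of each carousel slide.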
Output
- Extracted data mapped to canonical schemas (product.v1, offer.v1, banner.v1)
- Content hash for change detection
- Upserted to Supabase tables via orchestrator methods:
  - products - upsertProduct(): match by oem_id + title, track via last_seen_at. Auto-populates specs_json (via buildSpecsJson()) and variant_colors (via syncVariantColors()) on every upsert for all OEMs. Individual spec columns (engine_size, cylinders, transmission, gears, drive, doors, seats) are also populated from meta_json on every upsert.
  - offers - upsertOffer(): match by oem_id + title, detect changes (price, validity, disclaimer), update last_seen_at on every crawl pass, create change events + Slack alerts for new/changed offers
  - banners - upsertBanner(): match by oem_id + page_url + position, detect headline changes
  - variant_colors - Colour options per product (auto-synced for all OEMs via syncVariantColors())
  - variant_pricing - Per-state driveaway pricing (NSW/VIC/QLD/WA/SA/TAS/ACT/NT)
  - accessories - Accessory catalog per OEM (via accessory_models join to vehicle_models)
  - pdf_embeddings - Vectorized brochure/guidelines chunks for semantic search
- Change events fired to change_events table if data differs from current record
- last_seen_at updated on every crawl pass for products, offers, and banners, even if content is unchanged. This timestamp indicates when the crawl last verified the entity still exists on the OEM website.
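The offer upsert behaviour described above (match by oem_id + title, always refresh last_seen_at, emit a change event only when tracked fields differ) can be sketched against an in-memory map. This is an illustration under assumptions: the real upsertOffer() writes to Supabase, and the record shape, the `ChangeEvent` type, and the `upsertOfferInMemory` name are hypothetical.

```typescript
interface OfferRecord {
  oem_id: string;
  title: string;
  price_amount: number | null;
  validity_start: string | null;
  validity_end: string | null;
  disclaimer_text: string | null;
  last_seen_at: string;
}

type IncomingOffer = Omit<OfferRecord, "last_seen_at">;

interface ChangeEvent {
  key: string;
  kind: "created" | "updated";
  changed_fields: string[];
}

// Fields compared for change detection: price, validity, disclaimer.
const TRACKED_FIELDS = ["price_amount", "validity_start", "validity_end", "disclaimer_text"] as const;

function upsertOfferInMemory(
  store: Map<string, OfferRecord>,
  incoming: IncomingOffer,
  now: string,
): ChangeEvent | null {
  const key = `${incoming.oem_id}::${incoming.title}`; // match by oem_id + title
  const existing = store.get(key);
  // last_seen_at is refreshed on every pass, even when nothing changed.
  store.set(key, { ...incoming, last_seen_at: now });
  if (!existing) return { key, kind: "created", changed_fields: [] };
  const changed = TRACKED_FIELDS.filter((field) => existing[field] !== incoming[field]);
  return changed.length > 0 ? { key, kind: "updated", changed_fields: [...changed] } : null;
}
```

Returning `null` for an unchanged offer keeps the "verified still present" signal (the timestamp) separate from the "something changed" signal (the event).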
Input
{
"oem_id": "ford-au",
"url": "https://ford.com/suvs/explorer",
"page_type": "vehicle",
"rendered_html": "<html>...</html>",
"screenshot_r2_key": "screenshots/ford/explorer-2024.png"
}
Output
{
"products": 1,
"offers": 0,
"banners": 0,
"changes": 1
}
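The input and output payloads above could be typed as follows. The field names are taken from the two JSON examples and the page types from the Page Types list. Treating screenshot_r2_key as optional is an assumption, and `isPageType` is a hypothetical guard.

```typescript
type PageType = "vehicle" | "build_price" | "offers" | "homepage";

interface ExtractInput {
  oem_id: string;
  url: string;
  page_type: PageType;
  rendered_html: string;
  screenshot_r2_key?: string; // assumed optional
}

// Counts of entities processed and change events raised.
interface ExtractOutput {
  products: number;
  offers: number;
  banners: number;
  changes: number;
}

const PAGE_TYPES: readonly string[] = ["vehicle", "build_price", "offers", "homepage"];

// Narrow an arbitrary string to a known page type before dispatching.
function isPageType(value: string): value is PageType {
  return PAGE_TYPES.includes(value);
}
```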