jd-extraction-orchestrator - SKILL.md Agent Skill

name: jd-extraction-orchestrator description: Extract job listings and full JDs from any career site (Workday, Greenhouse, Lever, custom). Use when user provides a career URL, says "extract jobs", "extract job listings", "extract JD", "scrape career site", or any Workday URL (wd3.myworkdayjobs.com). Handles site detection, pagination, full JD extraction, self-check, retry, validation, and JSON/CSV/Excel output. Supports 20+ pre-configured Workday company templates.

JD Extraction Orchestrator

URL → detect site type → extract all jobs with full JDs → validate → output files.

Startup

ALWAYS load lessons learned first:

Read: .claude/skills/jd-extraction-orchestrator/references/lessons-learned.md

Site Detection

URL Pattern	Type	Method
`.wd3.myworkdayjobs.com`, `.wd5.myworkdayjobs.com`	Workday	`eval` with DOM selector
`jobs.nokia.com`, `jobs.rbc.com`, `jobs.td.com`, etc.	Workday	`eval` with DOM selector
`boards.greenhouse.io`, `jobs.lever.co`	ATS	snapshot + get text
Other career sites	Generic	snapshot + get text

Pre-configured Workday company templates (CIBC, RBC, TD, BMO, Nokia, Bell, etc.):

Read: .claude/skills/jd-extraction-orchestrator/references/site-configs.md

Model Selection

Job Count	Model
< 30	Sonnet OK
30-70	Opus recommended
70+	Opus required

Workflow

Phase 1: Setup

Open URL, snapshot, estimate job count
Report: "Found ~N jobs. Time: ~X min. Proceed?"

Phase 2: Extract Job Links

agent-browser --session extract --headed open "{url}"
agent-browser --session extract snapshot -i

Loop all pages: snapshot → collect job titles/locations/URLs → click pagination next → wait 2s → repeat. Deduplicate by jobId.

Phase 3: Extract Full JDs

Workday sites (CRITICAL - use eval):

agent-browser --session extract open "{job_url}"
# Wait 2s for dynamic content
agent-browser --session extract eval "document.querySelector('[data-automation-id=\"jobPostingDescription\"]')?.innerText || ''"

Non-Workday sites (use snapshot + get text):

agent-browser --session extract open "{job_url}"
agent-browser --session extract snapshot -i
agent-browser --session extract get text @{jd_element_ref}

For >3 jobs on generic sites, use Task tool subagents for parallel extraction.

Save progress every 10 jobs. Report every 50 jobs.

Phase 4: Self-Check & Retry

Count failed or short JDs (< 500 chars). Retry up to 3 times with 2s delay.

Phase 5: Validate & Report

========================================
    {Company} Extraction Report
========================================
Count: Expected N, Extracted M (X%)
Quality: Complete JDs N, Short N, Failed N
JD Length: Min X, Max X, Avg X chars
Output Files: .json, .csv, _report.txt
Overall: PASS/FAIL
========================================

If < 95% success → retry failed → re-validate.

Phase 6: Output

{company}_jobs_with_jd_final.json - Complete JSON array
{company}_jobs_final.csv - Excel-friendly CSV (BOM-encoded)
{company}_extraction_report.txt - Validation report

Phase 7: Update lessons-learned.md (only if new actionable insights)

Workday Selectors (Standard)

jobTitle:        [data-automation-id="jobTitle"]
locations:       [data-automation-id="locations"]
postedOn:        [data-automation-id="postedOn"]
paginationNext:  [data-uxi-widget-type="paginationNext"]
jobDescription:  [data-automation-id="jobPostingDescription"]
timeType:        [data-automation-id="timeType"]

Extracted Fields

jobId, title, location, postedDate, url, jobDescription, timeType, status, extractedAt

Generic Site Navigation Patterns

Pattern: Job Listing → Detail → Back

snapshot → find job links → click each → get text for JD → back to results

Pattern: Pagination

Click "next" / "page N" or use URL offset (?offset=0, 20, 40...)

Pattern: Category Filters ("继续")

When user says "继续": extract new category → merge into SAME file → add Category column.

Excel Output (Generic sites)

headers = ['Job ID', 'Job Title', 'Location', 'Posted', 'Job URL', 'Full JD']
# Full JD with wrap_text=True, column width=120

Critical Rules

ALWAYS

Use agent-browser eval with data-automation-id="jobPostingDescription" for Workday
Add 2s delay after page load
Delete progress files before fresh extraction
Use --session flag with agent-browser
ONE combined output file (don't create multiple)

NEVER

Don't use get text body + regex for Workday - breaks after ~20 jobs
Don't use Sonnet for 70+ job Workday extractions

Error Handling

Browser fails: agent-browser --version, reinstall if needed
Site blocks: Increase all waits by 2x
Timeout: Save progress, report partial, offer resume
Unknown structure: Take screenshots, try generic extraction

References

site-configs.md - 20+ company templates
lessons-learned.md - Failure patterns and solutions