salesforce-developer-site-scraper - SKILL.md Agent Skill

name: salesforce-developer-site-scraper description: 'Scrape Salesforce Developer documentation into clean Markdown using headless Chromium, Readability, and the docs content API fallback. Use when content is async or blocked by OneTrust cookie banners.' license: Forward Proprietary compatibility: VS Code 1.x+, Node.js 18+

Salesforce Developer Site Scraper

Use this skill to capture Salesforce Developer documentation pages as clean Markdown even when content loads asynchronously or is blocked by OneTrust cookie banners.

When to Use This Skill

A Salesforce Developer doc page renders key content after async requests.
A OneTrust cookie banner hides content until consent is accepted.
You need a readable Markdown snapshot for Apex, LWC, or platform docs.
NOT for: high-volume crawling or scraping behind access restrictions.

Prerequisites

Node.js 18+
npm dependencies for the script (see below)

How to Use

1) Install script dependencies (if not done already)

npm install playwright @mozilla/readability jsdom turndown

2) Run the script

node skills/salesforce-developer-site-scraper/scripts/scrape-to-markdown.js \
  --url "https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_intro.htm" \
  --out "./artifacts/online-research/apex_intro.md" \
  --consent-selector "#onetrust-accept-btn-handler" \
  --wait 2000

Script Options

Option	Required	Description
`--url`	Yes	Target URL to fetch and extract.
`--out`	Yes	Output Markdown file path.
`--consent-selector`	No	CSS or text selector for a cookie banner accept button.
`--wait`	No	Milliseconds to wait after navigation or consent click.
`--content-selector`	No	Extract only this element instead of Readability parsing.
`--remove-selectors`	No	Comma-separated selectors to remove before extraction.
`--cookie`	No	Consent/session cookie, e.g. `name=value;domain=example.com;path=/`.
`--storage-state`	No	Playwright storage state JSON file to reuse consent/session.
`--timeout`	No	Navigation timeout in ms (default 45000).
`--no-default-removals`	No	Disable default cookie/consent element removals.

Compliance Notes

Respect robots.txt and site terms before scraping.
Use consent cookies or storage state only when you have permission.
Avoid collecting personal data unless you have a legal basis.

Examples

Example: Reuse a consent state

node skills/salesforce-developer-site-scraper/scripts/scrape-to-markdown.js \
  --url "https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_intro.htm" \
  --out "./artifacts/online-research/apex_intro.md" \
  --storage-state "./artifacts/online-research/consent-state.json"

Example: Extract a specific content container

node skills/salesforce-developer-site-scraper/scripts/scrape-to-markdown.js \
  --url "https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_intro.htm" \
  --out "./artifacts/online-research/apex_intro.md" \
  --content-selector "main article"

Troubleshooting

Issue: Output is empty or too short

Solution: Add --wait or provide a --content-selector for the primary content node.

Issue: Cookie banner blocks content

Solution: Provide --consent-selector (OneTrust) or reuse a --storage-state with consent already saved.

References

Playwright Docs: https://playwright.dev/docs/intro
Playwright Cookies: https://playwright.dev/docs/api/class-browsercontext#browser-context-add-cookies
Playwright Storage State: https://playwright.dev/docs/auth#reuse-authentication-state
Readability (Mozilla): https://github.com/mozilla/readability
DOMParser (MDN): https://developer.mozilla.org/en-US/docs/Web/API/DOMParser
Robots Exclusion Protocol (RFC 9309): https://www.rfc-editor.org/rfc/rfc9309
GDPR: https://eur-lex.europa.eu/eli/reg/2016/679/oj
ePrivacy Directive: https://eur-lex.europa.eu/eli/dir/2002/58/oj