---
name: salesforce-developer-site-scraper
description: 'Scrape Salesforce Developer documentation into clean Markdown using headless Chromium, Readability, and the docs content API fallback. Use when content is async or blocked by OneTrust cookie banners.'
license: Forward Proprietary
compatibility: VS Code 1.x+, Node.js 18+
---
Salesforce Developer Site Scraper
Use this skill to capture Salesforce Developer documentation pages as clean Markdown even when content loads asynchronously or is blocked by OneTrust cookie banners.
When to Use This Skill
- A Salesforce Developer doc page renders key content after async requests.
- A OneTrust cookie banner hides content until consent is accepted.
- You need a readable Markdown snapshot for Apex, LWC, or platform docs.
- NOT for: high-volume crawling or scraping behind access restrictions.
Prerequisites
- Node.js 18+
- npm dependencies for the script (see below)
How to Use
1) Install script dependencies (if not done already)
npm install playwright @mozilla/readability jsdom turndown
2) Run the script
node skills/salesforce-developer-site-scraper/scripts/scrape-to-markdown.js \
--url "https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_intro.htm" \
--out "./artifacts/online-research/apex_intro.md" \
--consent-selector "#onetrust-accept-btn-handler" \
--wait 2000
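For orientation, the render, extract, and convert flow that the installed dependencies suggest can be sketched as follows. This is a minimal sketch, not the contents of scrape-to-markdown.js: the function name `scrapeToMarkdown`, its option names, and the 5-second consent-click timeout are assumptions for illustration, and the docs content API fallback is not covered.

```javascript
// Hypothetical sketch of the render -> Readability -> Markdown flow.
// The real script may be structured differently.
async function scrapeToMarkdown(url, { consentSelector, waitMs = 0, timeoutMs = 45000 } = {}) {
  // Required lazily so this file can be loaded before dependencies are installed.
  const { chromium } = require("playwright");
  const { JSDOM } = require("jsdom");
  const { Readability } = require("@mozilla/readability");
  const TurndownService = require("turndown");

  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded", timeout: timeoutMs });

    if (consentSelector) {
      // Click the accept button if it appears; carry on if it never does.
      await page.click(consentSelector, { timeout: 5000 }).catch(() => {});
    }
    // Give async content time to render (the role of --wait).
    if (waitMs > 0) await page.waitForTimeout(waitMs);

    // Parse the rendered HTML with Readability, then convert it to Markdown.
    const html = await page.content();
    const dom = new JSDOM(html, { url });
    const article = new Readability(dom.window.document).parse();
    if (!article) throw new Error("Readability found no article content");

    const turndown = new TurndownService({ headingStyle: "atx", codeBlockStyle: "fenced" });
    return `# ${article.title}\n\n${turndown.turndown(article.content)}\n`;
  } finally {
    await browser.close();
  }
}
```

Calling `scrapeToMarkdown(url, { consentSelector: "#onetrust-accept-btn-handler", waitMs: 2000 })` would correspond roughly to the command above, with the caller writing the returned string to the --out path.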
Script Options
| Option | Required | Description |
|---|---|---|
| --url | Yes | Target URL to fetch and extract. |
| --out | Yes | Output Markdown file path. |
| --consent-selector | No | CSS or text selector for a cookie banner accept button. |
| --wait | No | Milliseconds to wait after navigation or consent click. |
| --content-selector | No | Extract only this element instead of Readability parsing. |
| --remove-selectors | No | Comma-separated selectors to remove before extraction. |
| --cookie | No | Consent/session cookie, e.g. name=value;domain=example.com;path=/. |
| --storage-state | No | Playwright storage state JSON file to reuse consent/session. |
| --timeout | No | Navigation timeout in ms (default 45000). |
| --no-default-removals | No | Disable default cookie/consent element removals. |
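To make the option semantics concrete, here is one way these flags could be parsed with Node's built-in `util.parseArgs` (available in Node.js 18.3 and later). This is a sketch under assumptions, not the script's actual parser: `OPTION_SPEC`, `parseCookie`, `parseOptions`, the returned property names, and the default of 0 for --wait are all hypothetical. Only the 45000 ms timeout default comes from the table.

```javascript
const { parseArgs } = require("node:util");

// Hypothetical option spec mirroring the documented flags.
const OPTION_SPEC = {
  url: { type: "string" },
  out: { type: "string" },
  "consent-selector": { type: "string" },
  wait: { type: "string" },
  "content-selector": { type: "string" },
  "remove-selectors": { type: "string" },
  cookie: { type: "string" },
  "storage-state": { type: "string" },
  timeout: { type: "string" },
  "no-default-removals": { type: "boolean" },
};

// Turn "name=value;domain=example.com;path=/" into a cookie object with the
// name/value/domain/path fields that Playwright's addCookies() accepts.
function parseCookie(spec) {
  const parts = spec.split(";").map((s) => s.trim()).filter(Boolean);
  const pair = parts[0] ?? "";
  const eq = pair.indexOf("=");
  if (eq < 1) throw new Error(`Invalid --cookie value: ${spec}`);
  const cookie = { name: pair.slice(0, eq), value: pair.slice(eq + 1), path: "/" };
  for (const attr of parts.slice(1)) {
    const i = attr.indexOf("=");
    if (i < 1) continue;
    const key = attr.slice(0, i).toLowerCase();
    if (key === "domain" || key === "path") cookie[key] = attr.slice(i + 1);
  }
  return cookie;
}

function parseOptions(argv) {
  const { values } = parseArgs({ args: argv, options: OPTION_SPEC, strict: true });
  if (!values.url || !values.out) throw new Error("--url and --out are required");
  return {
    url: values.url,
    out: values.out,
    consentSelector: values["consent-selector"],
    waitMs: Number(values.wait ?? 0), // assumed default
    contentSelector: values["content-selector"],
    removeSelectors: (values["remove-selectors"] ?? "")
      .split(",").map((s) => s.trim()).filter(Boolean),
    cookie: values.cookie ? parseCookie(values.cookie) : undefined,
    storageState: values["storage-state"],
    timeoutMs: Number(values.timeout ?? 45000), // documented default
    defaultRemovals: !values["no-default-removals"],
  };
}
```

In a script, `parseOptions(process.argv.slice(2))` would yield the normalized options. With `strict: true`, an unknown flag raises an error instead of being silently ignored.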
Compliance Notes
- Respect robots.txt and site terms before scraping.
- Use consent cookies or storage state only when you have permission.
- Avoid collecting personal data unless you have a legal basis.
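As an aid for the robots.txt check, the following sketch evaluates a path against robots.txt text using the longest-match rule from RFC 9309, including the `*` and `$` wildcards. It is a simplified illustration, not part of the scraper: `parseRobots`, `ruleToRegExp`, and `isPathAllowed` are hypothetical names, user agents are compared as exact lowercase tokens, and percent-encoding normalization is omitted. It does not replace reading the site terms.

```javascript
// Split robots.txt into groups of { agents, rules }.
function parseRobots(robotsTxt) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // Consecutive user-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === "allow" || field === "disallow") && current) {
      // An empty value (e.g. "Disallow:") adds no rule.
      if (value) current.rules.push({ allow: field === "allow", path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// Convert a rule path with "*" and a trailing "$" into an anchored RegExp.
function ruleToRegExp(pattern) {
  const anchored = pattern.endsWith("$");
  const body = (anchored ? pattern.slice(0, -1) : pattern)
    .split("*")
    .map((s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + body + (anchored ? "$" : ""));
}

// Longest matching rule wins; Allow wins ties; no match means allowed.
function isPathAllowed(robotsTxt, path, agent = "*") {
  const groups = parseRobots(robotsTxt);
  const wanted = agent.toLowerCase();
  let matching = groups.filter((g) => g.agents.includes(wanted));
  if (matching.length === 0) matching = groups.filter((g) => g.agents.includes("*"));
  let best = { allow: true, path: "" };
  for (const rule of matching.flatMap((g) => g.rules)) {
    if (!ruleToRegExp(rule.path).test(path)) continue;
    const longer = rule.path.length > best.path.length;
    const tieAllow = rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best.allow;
}
```

A caller would fetch the site's /robots.txt, pass its text together with the pathname of the target URL, and skip the scrape when the function returns false.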
Examples
Example: Reuse a consent state
node skills/salesforce-developer-site-scraper/scripts/scrape-to-markdown.js \
--url "https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_intro.htm" \
--out "./artifacts/online-research/apex_intro.md" \
--storage-state "./artifacts/online-research/consent-state.json"
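Before reusing a storage state file, it can help to confirm that it actually holds a consent cookie for the target host. The sketch below assumes the Playwright storage state layout (a top-level `cookies` array with `name`, `domain`, and `expires` in Unix seconds, where -1 marks a session cookie) and assumes the banner sets one of the cookie names commonly used by OneTrust. The helper names `hasConsentCookie` and `loadStorageState` are hypothetical.

```javascript
const fs = require("node:fs");

// Cookie names commonly set by OneTrust; treat this list as an assumption.
const ONETRUST_COOKIES = ["OptanonAlertBoxClosed", "OptanonConsent"];

// Return true if the state holds an unexpired consent cookie valid for `host`.
function hasConsentCookie(state, host) {
  const cookies = Array.isArray(state?.cookies) ? state.cookies : [];
  const nowSec = Date.now() / 1000;
  return cookies.some((c) => {
    // A leading dot means the cookie also applies to subdomains.
    const domain = String(c.domain ?? "").replace(/^\./, "");
    const domainMatches = host === domain || host.endsWith(`.${domain}`);
    const notExpired = c.expires === undefined || c.expires === -1 || c.expires > nowSec;
    return ONETRUST_COOKIES.includes(c.name) && domainMatches && notExpired;
  });
}

// Read and parse a storage state JSON file from disk.
function loadStorageState(filePath) {
  return JSON.parse(fs.readFileSync(filePath, "utf8"));
}
```

For example, `hasConsentCookie(loadStorageState("./artifacts/online-research/consent-state.json"), "developer.salesforce.com")` returning false suggests the file should be regenerated, for instance by accepting the banner once in a Playwright session and saving the context with `context.storageState({ path })`.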
Example: Extract a specific content container
node skills/salesforce-developer-site-scraper/scripts/scrape-to-markdown.js \
--url "https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_intro.htm" \
--out "./artifacts/online-research/apex_intro.md" \
--content-selector "main article"
Troubleshooting
Issue: Output is empty or too short
Solution: Add --wait or provide a --content-selector for the primary content node.
Issue: Cookie banner blocks content
Solution: Provide --consent-selector (OneTrust) or reuse a --storage-state with consent already saved.
References
- Playwright Docs: https://playwright.dev/docs/intro
- Playwright Cookies: https://playwright.dev/docs/api/class-browsercontext#browser-context-add-cookies
- Playwright Storage State: https://playwright.dev/docs/auth#reuse-authentication-state
- Readability (Mozilla): https://github.com/mozilla/readability
- DOMParser (MDN): https://developer.mozilla.org/en-US/docs/Web/API/DOMParser
- Robots Exclusion Protocol (RFC 9309): https://www.rfc-editor.org/rfc/rfc9309
- GDPR: https://eur-lex.europa.eu/eli/reg/2016/679/oj
- ePrivacy Directive: https://eur-lex.europa.eu/eli/dir/2002/58/oj