name: d-research
description: >-
  Browser-first deep research and lawful public-data collection for AI agents. Triggers: web research, source discovery, scraping public data, literature reviews, market or technical research, due diligence, policy/standards analysis, creative/cultural research, research intake, evidence ledgers, execution gates, blocker reports. Read-only; never bypasses logins, paywalls, captchas, or rate limits.
D Research
Mission
Use this skill to maximize reachable evidence under available tools, permissions, and open-web constraints.
Default browser automation tool: Playwright.
Use this skill for:
- deep web research
- public web data collection
- source discovery
- academic and literature review
- market, competitor, product, and technical research
- due diligence, public investigation, claim verification, and red-flag review
- policy, standards, RFC, governance, and compliance analysis
- creative, cultural, media, trend, reception, and archive research
- collecting evidence for reports, essays, theses, and projects
- researching dynamic websites that require browser interaction
- reporting blocked sources so the user can retrieve data manually
Do not use this skill to bypass access controls, login walls, paywalls, captchas, rate limits, or explicit access restrictions.
Core model
Run every research task as a layered investigation:
- classify the research shape, depth, safety posture, output artifact, and route
- define the question
- decompose the topic
- map likely sources
- generate query fanout
- discover candidate URLs
- probe sources with browser-first access
- extract accessible data
- expand through links, sitemaps, files, and public APIs
- verify claims with an evidence ledger
- search for contradictions
- report unreachable or blocked data with manual instructions
- synthesize the final answer
For non-trivial work, run the portable quality gates in
references/execution-gates.md before final synthesis. The gates are
domain-neutral: use them to prevent thin, single-basin, under-verified, or
overclaimed answers without forcing every task into a person/social workflow.
Tool priority
Use the best available tool stack in this order:
- user-provided context, URLs, files, and constraints
- web search (programmatic via scripts/web_search.mjs or browser-based)
- URL fetch or HTTP read, when available
- Playwright browser automation
- public files: PDF, CSV, JSON, XML, XLSX, DOCX, TXT
- public API or public network responses exposed by the page
- local files or repo search, when the user provides a workspace
- configured browser adapter, if not Playwright
- web-search-only fallback via scripts/web_search.mjs (DDG → SearXNG → Brave → Google CSE), if no browser or fetch tool exists
If a tool is unavailable, continue with the next-best method and record the limitation.
If a relevant public tier-1 source is blocked by an anti-bot layer, JavaScript challenge, captcha, 403, 429, or repeated browser/fetch failure, run the bounded fallback chain in references/anti-bot-fallback.md before writing the final blocker report. The chain is API/static form -> public web archive -> cache/snippet if available -> fetch-only/no-JS retrieval -> blocker report. Never use it to bypass access controls.
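The bounded chain above could be sketched roughly as follows. This is an illustration under assumptions, not the contents of references/anti-bot-fallback.md: the `Attempt` record, the step labels, and `run_fallback_chain` are hypothetical names, and each retriever is assumed to be a callable that returns text on success or `None` on failure.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

# Order taken from the chain described above; the labels themselves are illustrative.
FALLBACK_ORDER = [
    "api_or_static_form",
    "public_web_archive",
    "cache_or_snippet",
    "fetch_only_no_js",
]


@dataclass
class Attempt:
    step: str
    succeeded: bool


def run_fallback_chain(
    url: str,
    retrievers: dict[str, Callable[[str], Optional[str]]],
) -> tuple[Optional[str], list[Attempt]]:
    """Try each lawful fallback once, in order, and stop at the first success.

    Returns (content, attempts). content is None when every available step
    failed, which is the signal to write the blocker report.
    """
    attempts: list[Attempt] = []
    for step in FALLBACK_ORDER:
        retriever = retrievers.get(step)
        if retriever is None:
            continue  # step not available in this environment
        content = retriever(url)
        attempts.append(Attempt(step, content is not None))
        if content is not None:
            return content, attempts
    return None, attempts
```

The returned attempts list is what would feed the low-confidence process rows in the evidence ledger. Each step runs at most once, so the chain stays bounded.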
For the full decision rules on choosing between adapters (e.g. when to demote Playwright to fetch-only, when web search alone is acceptable), see references/tool-adapter-policy.md.
Data access layers
Access data in order of preference:
- Web layer: public pages, dynamic content, file downloads (existing browser/fetch workflow)
- File layer: CSV, JSON, XML, PDF, XLSX, DOCX from public URLs
- API layer: REST, GraphQL, SPARQL endpoints with proper authentication. See references/api-access-workflow.md
- Wikidata layer: structured entity lookups, disambiguation, and SPARQL queries. See adapters/wikidata.md
- Database layer: read-only SQL/NoSQL access when user provides credentials. See adapters/database-readonly.md
- Academic database layer: OpenAlex, CrossRef, PubMed, Semantic Scholar, arXiv. See references/academic-databases.md. Fast path: if the input is already a DOI, PMID, arXiv ID, or ISBN, resolve it via scripts/citation_resolver.py first (see adapters/citation-resolver.md) before searching.
- Specialized domain layer: financial APIs, patent databases, government portals. See references/specialized-domains.md
For each layer, follow the safety boundary: read-only, respect rate limits, log all access.
Safety boundary
Allowed:
- open public pages
- render dynamic pages
- click normal user-visible navigation
- use site search boxes and filters
- paginate through public results
- expand tabs, accordions, and lazy-loaded sections
- download public files
- inspect public network requests initiated by a page
- extract visible text, tables, links, metadata, and files
- produce blocker reports when access fails
Not allowed:
- bypass login or authentication
- bypass paywalls or subscription checks
- solve or evade captchas
- evade rate limits or anti-bot systems
- use stealth plugins by default
- use stolen cookies, leaked tokens, or credentials not explicitly provided by the user
- access private, personal, or sensitive data without authorization
- ignore robots or explicit site restrictions when acting as a crawler
When blocked, do not force access. Produce a blocker report.
The full safety and access policy (legal/ethical framing, what counts as a public source, escalation steps) is in references/safety-and-access-policy.md. Read it before doing anything that touches authenticated or rate-limited surfaces.
Workflow decision tree
Step 0: Research intake. Before choosing any branch or opening sources,
classify the request with references/research-intake.md. Assign one or more
shape labels (atomic fact, URL, person/public role, academic review, systematic
review, dataset/extraction, API/database, technical/market, due diligence,
policy/standards, creative/cultural, high-stakes, multilingual/local,
long-horizon, etc.), set research depth (fast, standard, or
completeness-first), set the safety posture, choose the expected output
artifact, and list the references/gates that apply. Use multi-label routing
when tasks overlap. If the classification changes safety,
legality, scope, or deliverable and cannot be resolved conservatively, ask the
user before proceeding; otherwise state the assumption and continue.
Before picking a branch: if the task is long-horizon (more than 5 sub-questions, more than 50 sources, multi-context-window runtime, or audit-grade output), apply the research plan protocol from references/research-plan-protocol.md as an outer loop around whichever branch fits the topic. The agent creates one workspace directory with scripts/research_plan.py init --slug <topic-slug>, writes research-plan.json (from templates/research-plan.json), renders PLAN.md, gets approval with scripts/research_plan.py approve, dispatches parallel-safe tasks (optionally to sub-agents if config allows), gates the synthesize step, and only then composes the final report. See examples/long-horizon-research-plan.md. The branches below describe the content of the work; the protocol describes the flow control that keeps the work surviving across context resets.
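The long-horizon triggers listed above can be expressed as a small predicate. This is a minimal sketch: `TaskEstimate` and `is_long_horizon` are hypothetical names, and the estimates are assumed to come from the intake step rather than from any bundled script.

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    sub_questions: int
    expected_sources: int
    fits_one_context_window: bool
    audit_grade_output: bool = False


def is_long_horizon(task: TaskEstimate) -> bool:
    """True when any long-horizon trigger from the text above applies."""
    return (
        task.sub_questions > 5
        or task.expected_sources > 50
        or not task.fits_one_context_window
        or task.audit_grade_output
    )
```

Any single trigger is enough to wrap the chosen branch in the research plan protocol. The thresholds are strict ("more than"), so exactly 5 sub-questions or 50 sources does not qualify on its own.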
Before final synthesis: for any non-trivial branch, apply
references/execution-gates.md unless a narrower fast path explicitly says to
skip it. If subagents exist, use the gate roles as independent reviewers; if
they do not, perform the same checklists manually. Do not present a result as
complete until source mapping, recall/coverage, evidence verification, blockers,
and confidence have been handled or explicitly marked out of scope.
If the user asks to verify or look up one specific atomic fact
Use references/fact-verification.md. This branch applies when the question targets one named entity and one named attribute, has a deterministic primary source (API, registry, canonical text), and can be answered in one sentence or one quote. Skip decompose, source map, query fanout, and crawl. Hit the primary source once, quote the value verbatim, file one ledger row with a one-shot independent re-check, and report. If anything looks off (a non-2xx status, contradicting mirrors, or a "why" follow-up from the user), escalate to the broad research workflow below. Never reach for references/frontier-search.md from this branch: atomic facts either fetch cleanly or fail loudly.
If the user asks to capture or analyze a public social-media post
Use references/social-media-archival.md. Capture public posts from 12 supported platforms (Reddit, Hacker News, Mastodon, Bluesky, Lemmy, X, Facebook, Instagram, TikTok, YouTube, Threads, LinkedIn) plus a generic fallback. The script scripts/social_snapshot.py handles snapshot capture, hash-based verification, and evidence-ledger row generation. Read the privacy boundary section first — it refuses minors, private individuals, harassment/stalking/doxxing framings, and login-bypass attempts before making any HTTP call. Tier A platforms (Reddit, HN, Mastodon, Bluesky, Lemmy) use direct public API fetch with high verifiability; Tier B platforms (X, Facebook, Instagram, TikTok, YouTube, Threads, LinkedIn) use archive-only via scripts/wayback.py with low verifiability.
If the user asks for public-role information about a specific named person
Use references/person-aggregation.md. Applies when the user wants scattered public-role information about one named person (maintainer, author, speaker, journalist, public figure) and there is a canonical anchor (GitHub profile, ORCID, package author field, faculty page, verified byline). The value is in cross-source aggregation and homonym disambiguation, not in any one source. Apply the privacy boundary in that file before doing anything else — it is a hard stop, not abstract guidance; home address, family, private accounts, personal contact, photos, medical/financial/legal/orientation/whereabouts, pseudonym-to-real-name re-identification, and explicitly-private items are out of scope regardless of whether they appear on the open web. Refuse on minors, private individuals, and harassment/stalking/doxxing framings. Saturate at 25 ledger rows or three sources adding no new verified claims, and never escalate to references/frontier-search.md to chase one more piece of personal info.
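The saturation rule above can be checked mechanically. The sketch below assumes "three sources adding no new verified claims" means three consecutive sources; that reading, and the name `should_stop`, are assumptions rather than something references/person-aggregation.md is known to specify.

```python
MAX_LEDGER_ROWS = 25
MAX_UNPRODUCTIVE_SOURCES = 3


def should_stop(ledger_rows: int, new_claims_per_source: list) -> bool:
    """Stop at the ledger-row cap, or when the latest sources added nothing.

    new_claims_per_source holds, in visit order, how many new verified
    claims each source contributed.
    """
    if ledger_rows >= MAX_LEDGER_ROWS:
        return True
    tail = new_claims_per_source[-MAX_UNPRODUCTIVE_SOURCES:]
    return len(tail) == MAX_UNPRODUCTIVE_SOURCES and all(n == 0 for n in tail)
```

For example, `should_stop(10, [2, 0, 0, 0])` is true because the last three sources were unproductive, while `should_stop(10, [0, 0, 1])` is false.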
If the user has a large corpus or many ledger rows and asks a semantic question
Use references/semantic-retrieval.md when a corpus is large enough that keyword search is brittle (roughly >30 documents or many evidence-ledger rows) and the task asks for conceptually related material, near-duplicates, or "find claims like X". Build or query an index with scripts/embed_corpus.py; prefer local backends for private data, and require explicit remote opt-in for Cohere.
If the user asks for a broad research answer
Use the full deep research workflow. Produce a source-backed synthesis with evidence, confidence, caveats, and next steps.
If the user asks for due diligence, public investigation, risk review, or red flags
Use references/research-intake.md with due_diligence_or_investigation.
Default to completeness-first unless the user explicitly asks for a quick scan.
Build a source map, keep a search log, maintain an evidence ledger for verified
claims and red flags, run a contradiction pass, and apply execution gates before
synthesis. Separate verified facts, red flags, unresolved risks, benign
unknowns, confidence, and recommended manual checks. Do not gather private
personal data or phrase allegations beyond what the evidence supports.
If the user asks for policy, standards, RFC, governance, or compliance analysis
Use references/research-intake.md with policy_or_standards_analysis.
Prioritize canonical text, version/status, effective dates, errata, issuing-body
guidance, and exact clause evidence. Distinguish normative from informative
language, draft from final or superseded text, and obligations from permissions
or implementation notes. Add references/specialized-domains.md only when the
question is legal/government/financial or jurisdiction-specific.
If the user asks for creative, cultural, media, trend, reception, or archive research
Use references/research-intake.md with creative_or_cultural_research.
Anchor on primary works, official releases, creator/publisher/studio/label
records, archives, criticism, cultural scholarship, trade press, and public
reception metrics when available. Treat fan/community/social sources as
reception evidence, not as verified factual authority about private people.
If the user asks to collect a dataset
Use the crawl and extraction workflow. Produce structured data, a data dictionary, extraction method, coverage notes, and blocked-source report.
If the user asks for academic research, literature review, thesis, or project work
Use the academic workflow. Define research questions, search strings, inclusion and exclusion criteria, screening log, evidence table, synthesis, and citations.
If the user asks for a systematic review, scoping review, rapid review, or PRISMA-grade output
Use references/systematic-review-protocol.md (PRISMA 2020). Pick the right review type with references/synthesis-patterns.md. Populate templates/prisma-flow.json for the flow diagram and templates/screening-log.csv for screening decisions. For citation chasing / snowball sampling, use scripts/citation_graph.py expand --seed seeds.csv --direction both to traverse forward and backward citations from included papers. See examples/systematic-review-prisma.md for an end-to-end walkthrough.
If the user gives a specific URL
Probe the URL first with the browser. Classify access status, extract available data, discover linked files/endpoints/pages, and report blockers.
If the URL appears relevant but is blocked by Cloudflare, a bot challenge, a captcha, a 403, a 429, or a JavaScript challenge, follow references/anti-bot-fallback.md once before declaring it blocked. Record failed fallback attempts as low-confidence process rows in the evidence ledger, then produce a blocker report per references/blocker-report.md if no lawful public fallback works.
If only web search exists
Run search-based research. Prefer official and primary sources. Mark sources that were found but not directly opened.
If the user asks to collect data from an API
Use references/api-access-workflow.md. Discover endpoints, authenticate only if the user provides keys, paginate, handle rate limits, and export structured data.
If the user asks for large-scale collection (100+ pages/records)
Use references/large-scale-collection.md. Enable checkpointing, adaptive rate limiting, batch processing.
If the user asks for financial, patent, legal, or government data
Use references/specialized-domains.md. Route to appropriate free APIs and data portals.
If the user asks for a literature review with citations
Combine the academic workflow with references/academic-databases.md and references/citation-management.md. Export citations in BibTeX or RIS with scripts/citation_export.py, then render APA / MLA / IEEE / Chicago / Vancouver / Harvard / Nature with scripts/citation_render.py (pandoc + CSL).
If the user wants data cleaned or analyzed
Use references/data-processing-pipeline.md after extraction. Run cleaning, validation, and analysis stages.
If the user asks to extract structured data from web pages (tables, JSON-LD, embedded JSON, sitemaps, RSS, OAI-PMH)
Use references/data-extraction-toolbox.md for recipe-style playbooks. Use scripts/extract_tables.py for HTML <table> → CSV, scripts/api_fetch.mjs for REST/GraphQL, and templates/data-package.json to publish the result as a Frictionless Data Package.
If the user needs tamper-evident research output, a signed evidence ledger, or a reproducibility audit
Sign the ledger with scripts/evidence_ledger.py sign --file evidence-ledger.csv --key-env D_RESEARCH_LEDGER_KEY; the verifier is evidence_ledger.py verify. Then walk through references/reproducibility-checklist.md before declaring done.
If the task is long-horizon, multi-source, or risks blowing context
Use the research plan protocol in references/research-plan-protocol.md. The agent MUST start by creating a single workspace directory with scripts/research_plan.py init --slug <topic-slug>, writing research-plan.json (from templates/research-plan.json), validating it with scripts/research_plan.py check, refreshing execution annotations with scripts/research_plan.py configure-execution, rendering PLAN.md, running gate --gate plan_ready, and getting approval with scripts/research_plan.py approve before dispatch.
init reads research.config.json when present; by default it creates a fresh run folder in the current working directory, or under researchPlan.workspace.baseDir if configured. If that configured folder is inaccessible, it falls back to the current directory and warns. Unattended runs fail by default unless the agent explicitly records --allow-unattended.
After approval, dispatch parallel-safe tasks via scripts/research_plan.py parallelizable only when researchPlan.subagents.slots[] contains configured slots (agent, contextLength, and maxParallel are not null). If the user changes task assignment, slot, or thread count during review, use scripts/research_plan.py set-execution --id <task> --agent <main|subagent> [--slot <slot>] [--parallel-threads <n>], then render and approve again. If no sub-agent is configured, run tasks with the main agent and split the plan according to the main agent's context length.
Context overflow is a hard failure: each task must fit its execution.context_budget; if not, split the task, write partial findings to files immediately, re-run configure-execution, render, and re-approve.
Mark task status as work progresses, gate the synthesize step with scripts/research_plan.py gate --gate synthesize_ready, and only then compose the final report. Always include the final workspace path in the user-facing answer. See examples/long-horizon-research-plan.md for the end-to-end walkthrough.
This is the right default for any task with >5 sub-questions, >50 sources, or estimated runtime that does not fit in one context window.
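The pre-dispatch part of the protocol can be laid out as an ordered command list. The subcommand names and flags below are the ones named in this section. Whether each command needs further arguments (for example a workspace path) is not stated here, so treat this as an outline to adapt rather than a verified invocation. `plan_commands` is a hypothetical helper that only builds argument lists and runs nothing.

```python
import shlex

SCRIPT = "scripts/research_plan.py"


def plan_commands(topic_slug: str) -> list:
    """Pre-dispatch command sequence, in the order this section gives it.

    Writing research-plan.json from templates/research-plan.json is a manual
    step that happens between `init` and `check`.
    """
    return [
        ["python3", SCRIPT, "init", "--slug", topic_slug],
        ["python3", SCRIPT, "check"],
        ["python3", SCRIPT, "configure-execution"],
        ["python3", SCRIPT, "render"],
        ["python3", SCRIPT, "gate", "--gate", "plan_ready"],
        ["python3", SCRIPT, "approve"],
    ]


if __name__ == "__main__":
    # Print the sequence for review instead of executing it.
    for argv in plan_commands("example-topic"):
        print(shlex.join(argv))
```

Printing the commands first gives the user a chance to review the sequence, which matches the approval-before-dispatch intent of the protocol.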
If the user wants visualizations or charts
Use references/data-visualization.md. Generate matplotlib/plotly charts as part of the report.
If the user wants to monitor changes over time
Use references/monitoring-change-detection.md. Take baseline snapshots, detect changes, report diffs.
If the user needs research across multiple languages
Use references/multilingual-research.md. Translate queries per language, search local-language sources, extract in original language, and cross-validate findings across languages.
If Vietnamese sources, Vietnam-local institutions, Vietnamese news, or
Vietnamese public/community sources are materially relevant, use
references/vietnamese-source-discovery.md as a companion. It adds
diacritic/no-diacritic alias handling, local source basins, and date/identity
discipline without making Vietnamese discovery a global default.
If recall is thin or the evidence lives in community, vernacular, or jargon-heavy registers
Use references/register-and-jargon-expansion.md. Applies when a clinical,
legal, standards, or academic query under-recalls because the people who hold
the evidence use lay terms, community jargon, or emergent slang. Walk the
register ladder in both directions: formal -> vernacular to open recall, and
vernacular -> formal to anchor every community term back to a primary source.
Harvest vocabulary from fresh results only (never from model memory), keep only
terms that recur across two or more independent community sources, and treat the
harvested vocabulary as a discovery layer — never as evidence. Every claim still
passes references/source-quality-rubric.md and the contradiction pass. This is
an additive layer on top of references/multilingual-research.md, not a
replacement for native-language search.
If the first pass leaves evidence gaps, obscure / long-tail facts, or contested claims
Escalate to references/frontier-search.md. Build a small best-first frontier over candidate queries, URLs, files, APIs, citations, repos, aliases, and archives; score each node against the unresolved sub-question; expand the highest-priority node; and stop on evidence saturation rather than node count. Maintain templates/frontier-ledger.csv and templates/coverage-map.json alongside templates/evidence-ledger.csv. Never use this as a way to bypass access controls — blocked nodes still go to references/blocker-report.md.
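A best-first frontier of this kind is commonly built on a priority queue. The following is a minimal sketch, not the algorithm in references/frontier-search.md: `Node`, `explore`, and the `expand` callback are hypothetical, scores are assumed to be supplied by the caller, and saturation is approximated as a run of consecutive expansions that yield no new evidence.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Node:
    neg_score: float  # negated so the highest score pops first from the min-heap
    target: str = field(compare=False)
    kind: str = field(compare=False)  # e.g. "query", "url", "file", "api"


def explore(seeds, expand, max_unproductive=3):
    """Expand the highest-priority node until the frontier is empty or saturated.

    seeds: iterable of (score, target, kind) tuples.
    expand(target) -> (new_evidence_count, [(score, target, kind), ...]).
    Returns the targets in the order they were expanded.
    """
    heap = [Node(-score, target, kind) for score, target, kind in seeds]
    heapq.heapify(heap)
    seen = {node.target for node in heap}
    visited = []
    unproductive = 0
    while heap and unproductive < max_unproductive:
        node = heapq.heappop(heap)
        visited.append(node.target)
        new_evidence, children = expand(node.target)
        unproductive = 0 if new_evidence else unproductive + 1
        for score, target, kind in children:
            if target not in seen:  # deduplicate frontier nodes
                seen.add(target)
                heapq.heappush(heap, Node(-score, target, kind))
    return visited
```

The stop condition depends on evidence yield, not on how many nodes were expanded. A blocked node would simply report zero new evidence here and still be written up in the blocker report.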
Standard deep research workflow
0. Classify and route
Use references/research-intake.md before source access. Produce or internally
maintain a short intake card covering the user goal, primary object, shape
labels, research depth, safety posture, freshness requirement,
geography/language scope, authority model/source basins, source expectations,
output artifact, required references, required ledgers/templates, execution
gates, red-flag or contradiction focus, ambiguities, and route.
Hard-stop safety/privacy/access checks happen here. Do not continue to broad research if the intake indicates a refusal, access-control bypass attempt, private-person profiling request, or high-stakes advice request that must be reframed as evidence synthesis.
1. Restate the task
Capture:
- research goal
- entities, products, technologies, organizations, places, or people
- timeframe and freshness requirement
- geography and language constraints
- desired output format
- acceptable source types
- forbidden source types
- whether the task is research, dataset collection, academic review, or mixed
When the request is broad, proceed with reasonable assumptions and state them.
2. Decompose the topic
Use references/topic-decomposition.md.
Create:
- root question
- sub-questions
- facets
- entities
- synonyms and aliases
- likely source classes
- unknowns
- research risks
- stopping criteria
3. Build a source map
Use references/source-discovery.md.
Look for:
- official sites and docs
- source repositories
- issue trackers and discussions
- changelogs and releases
- public datasets
- public APIs
- standards, RFCs, and specifications
- academic papers
- government filings
- PDFs and reports
- tables, dashboards, and data portals
- news, blogs, forums, and community sources
- archives and caches, when allowed
4. Generate query fanout
Use references/query-patterns.md.
For every important sub-question, generate:
- broad query
- exact phrase query
- official source query
- primary source query
- filetype query
- site-specific query
- dataset/API query
- recent query
- contradiction query
- alternate-language query when useful
- register/jargon variant query when the evidence basin uses lay, community, or vernacular terms (see
references/register-and-jargon-expansion.md)
Do not conclude "not found" until broad, exact, official, primary, filetype, and contradiction searches have been attempted or are clearly irrelevant.
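The "not found" rule above amounts to a checklist. A small sketch follows, with hypothetical label strings for the six required query types and hypothetical function names.

```python
REQUIRED_QUERY_TYPES = (
    "broad",
    "exact_phrase",
    "official_source",
    "primary_source",
    "filetype",
    "contradiction",
)


def outstanding_queries(attempted, irrelevant=()):
    """Required query types that were neither attempted nor ruled irrelevant."""
    done = set(attempted) | set(irrelevant)
    return [q for q in REQUIRED_QUERY_TYPES if q not in done]


def may_conclude_not_found(attempted, irrelevant=()):
    """True only when every required query type has been handled."""
    return not outstanding_queries(attempted, irrelevant)
```

Marking a query type as irrelevant should be a deliberate, logged decision, since it counts toward the checklist in the same way as an attempted search.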
5. Probe sources with browser-first access
Use Playwright by default. See adapters/playwright.md and references/browser-first-crawl.md.
For each promising URL:
- open the page
- wait for page stability
- record final URL, title, status if available, and access state
- extract visible text, headings, links, files, tables, metadata, dates, and page controls
- capture screenshots when evidence or blockers need visual proof
- classify the page as accessible, partial, dynamic, blocked, login-required, paywalled, captcha, rate-limited, robots-restricted, broken, or manual-needed
6. Extract data
Use the least invasive reliable method:
- public downloadable files
- public API or structured endpoint linked/exposed by the page
- static HTML tables and semantic markup
- visible page text
- browser-rendered content after normal interaction
- screenshots only when text extraction is unreliable
Always record the extraction method.
For the detailed playbooks per content type (HTML tables, JSON-LD, PDFs, embedded JSON in <script> tags, datalayer objects, GraphQL responses, etc.), see references/extraction-methods.md.
7. Crawl and expand
For accessible sources:
- use sitemaps and robots discovery when acting as a crawler
- follow relevant internal links
- follow citation and reference links
- follow pagination and filters that expose public data
- deduplicate URLs
- limit crawl depth and page count
- stop when evidence saturation is reached
Default crawl limits unless overridden:
- max depth: 2
- max pages per domain: 30
- max total pages: 100
- delay between page loads: 1000 ms
- respect robots: true
- follow external links: false unless needed for source discovery
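The default limits above can be applied in a bounded breadth-first traversal. The sketch below plans a visit order over a caller-supplied link function and performs no network access. `CrawlLimits`, `plan_crawl`, and `get_links` are hypothetical names, and the delay and robots settings are carried as data only, since enforcing them belongs to the actual fetcher.

```python
from collections import deque
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class CrawlLimits:
    max_depth: int = 2
    max_pages_per_domain: int = 30
    max_total_pages: int = 100
    delay_ms: int = 1000            # not enforced in this sketch
    respect_robots: bool = True     # not enforced in this sketch
    follow_external_links: bool = False


def plan_crawl(start_url, get_links, limits=CrawlLimits()):
    """Return the breadth-first visit order allowed by the limits.

    get_links(url) is assumed to return absolute URLs found on that page.
    """
    start_host = urlsplit(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}  # URL deduplication
    per_domain = {}
    order = []
    while queue and len(order) < limits.max_total_pages:
        url, depth = queue.popleft()
        host = urlsplit(url).netloc
        if per_domain.get(host, 0) >= limits.max_pages_per_domain:
            continue
        per_domain[host] = per_domain.get(host, 0) + 1
        order.append(url)
        if depth >= limits.max_depth:
            continue  # visit this page but do not expand its links
        for link in get_links(url):
            if link in seen:
                continue
            if not limits.follow_external_links and urlsplit(link).netloc != start_host:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return order
```

Breadth-first order means pages closest to the starting URL are collected first when the page budget runs out.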
8. Maintain evidence ledger
Use references/evidence-ledger.md.
For tamper-evidence, sign the ledger with scripts/evidence_ledger.py sign (HMAC-SHA256 over the canonicalised CSV bytes). The verifier (scripts/evidence_ledger.py verify) detects any edits made after signing.
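The idea behind HMAC-SHA256 signing can be shown with the standard library alone. This sketch is not the code in scripts/evidence_ledger.py: the canonical form used here (LF line endings, one trailing newline, UTF-8) is an assumption, and `canonical_bytes`, `sign_ledger`, and `verify_ledger` are hypothetical names.

```python
import hashlib
import hmac


def canonical_bytes(csv_text: str) -> bytes:
    """Assumed canonical form: LF line endings, a single trailing newline, UTF-8."""
    lines = csv_text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    while lines and lines[-1] == "":
        lines.pop()
    return ("\n".join(lines) + "\n").encode("utf-8")


def sign_ledger(csv_text: str, key: bytes) -> str:
    """Hex HMAC-SHA256 digest over the canonical ledger bytes."""
    return hmac.new(key, canonical_bytes(csv_text), hashlib.sha256).hexdigest()


def verify_ledger(csv_text: str, key: bytes, signature: str) -> bool:
    """Constant-time comparison of the recomputed and stored signatures."""
    return hmac.compare_digest(sign_ledger(csv_text, key), signature)
```

Canonicalising before signing means a ledger re-saved with different line endings still verifies, while any change to a cell value does not.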
Every important claim must have:
- claim
- source
- source type
- date
- access method
- extracted evidence
- contradiction status
- confidence
Separate facts, inferences, speculation, and unknowns.
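The required fields listed above map naturally onto a record type. A minimal sketch follows, assuming string-valued columns. The field names mirror the list, but the actual column names in templates/evidence-ledger.csv may differ, and `LedgerRow` and `missing_fields` are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class LedgerRow:
    claim: str
    source: str
    source_type: str
    date: str
    access_method: str
    extracted_evidence: str
    contradiction_status: str
    confidence: str


def missing_fields(row: LedgerRow) -> list:
    """Names of required fields that are empty or whitespace-only."""
    return [f.name for f in fields(row) if not str(getattr(row, f.name)).strip()]
```

A row with any missing field should not back an important claim in the final synthesis until the gap is filled or the claim is downgraded.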
9. Run contradiction pass
Before final output:
- search for contrary evidence
- compare source dates and versions
- identify stale or deprecated pages
- check whether secondary sources cite primary sources
- downgrade confidence when evidence is weak or conflicting
- state unresolved contradictions clearly
Score every source on the rubric in references/source-quality-rubric.md (primary vs. secondary, authority, recency, methodology, independence). Use the rubric scores to set the confidence column in the evidence ledger and to break ties between contradicting sources.
For non-trivial tasks, also run the execution gates in
references/execution-gates.md: source map gate, coverage/recall gate,
identity/date/inference gate, evidence verification gate, and synthesis
readiness gate. If a gate fails, continue research, downgrade confidence, or
mark the output as partial instead of overclaiming completeness.
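The gate outcome logic above can be sketched as a small function. The gate identifiers are hypothetical labels for the five gates named in this section, and the checks themselves live in references/execution-gates.md, not here.

```python
GATES = (
    "source_map",
    "coverage_recall",
    "identity_date_inference",
    "evidence_verification",
    "synthesis_readiness",
)


def readiness(results: dict) -> tuple:
    """Return ("complete", []) or ("partial", [failed gates...]).

    A gate missing from results counts as failed, so an unrun gate can
    never produce a "complete" verdict.
    """
    failed = [gate for gate in GATES if not results.get(gate, False)]
    return ("complete" if not failed else "partial", failed)
```

A "partial" verdict leaves the three options from the text open: continue research, downgrade confidence, or label the output as partial.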
10. Report blockers
Use references/blocker-report.md.
If a likely useful source cannot be extracted, report:
- URL
- why it matters
- access status
- what was attempted
- what blocked access
- visible evidence of blocker
- likely manual path
- exact data the user should export, copy, screenshot, or download
- alternative sources found
11. Synthesize
Before composing the final answer, scan the evidence ledger and (if maintained) templates/coverage-map.json for unresolved gaps. If a key sub-question still has missing entries or only low-confidence non-primary sources, escalate one pass with references/frontier-search.md instead of synthesising over thin evidence.
If the task is non-trivial, do not synthesize until
references/execution-gates.md has been satisfied or any unmet gate is clearly
reported as a limitation.
Use references/final-report-template.md.
Default final answer:
- direct answer
- key findings
- evidence summary
- data collected
- sources reached
- sources blocked
- contradictions and caveats
- confidence
- next research steps
For academic outputs, use the academic report format in references/academic-research-protocol.md.
Optional bundled scripts
The scripts/ directory contains helper scripts for agents running in a local Node or Python environment.
Use them when Playwright is installed and the task benefits from repeatable extraction:
- scripts/playwright_probe.mjs: classify a page, detect blockers, list links/files/tables, optionally screenshot
- scripts/playwright_extract.mjs: extract visible text, tables, links, metadata, and files into JSON or Markdown
- scripts/playwright_crawl.mjs: bounded same-domain crawl with basic robots awareness and page manifests
- scripts/evidence_ledger.py: initialize, validate, and HMAC-sign / verify CSV evidence ledgers
- scripts/api_fetch.mjs: paginated API fetch with rate limiting, retry, and multiple output formats
- scripts/data_clean.py: data cleaning, deduplication, validation, statistics, and merging
- scripts/citation_export.py: BibTeX/RIS citation export and DOI enrichment via CrossRef
- scripts/citation_render.py: render BibTeX into APA / MLA / IEEE / Chicago / Vancouver / Harvard / Nature / Science / ACM / AMA styles via pandoc + CSL
- scripts/extract_tables.py: extract HTML <table> elements into CSV (handles colspan/rowspan, stdlib only)
- scripts/score_source.py: apply the references/source-quality-rubric.md rubric to an evidence ledger and emit per-row scores + bands
- scripts/research_plan.py (init / configure-execution / set-execution / render / approve / revoke / check / status / parallelizable / mark / block / add-task / gate): drives the long-horizon context-safe protocol in references/research-plan-protocol.md
- scripts/wikidata.py (search / entity / disambiguate / sparql / self-test): Wikidata entity lookup, disambiguation, and SPARQL queries (see adapters/wikidata.md)
- scripts/social_snapshot.py (snapshot / verify / to-ledger / self-test): public social-media post capture with two-tier architecture, content hashing, and evidence-ledger integration (see references/social-media-archival.md)
- scripts/pdf_extract.py (text / meta / tables / to-ledger / self-test): PDF text, metadata, and table extraction via pdftotext / pdfinfo / pdfplumber with soft-fail when binaries are missing (see references/pdf-extraction.md)
- scripts/wayback.py (lookup / nearest / save / diff [--summarize --top-n N] / self-test): Wayback Machine snapshot lookup, archival, and diff summarization (see references/wayback-archive.md and references/monitoring-change-detection.md)
- scripts/citation_resolver.py (doi / pmid / arxiv / isbn / oa / to-ledger / to-bibtex / batch / self-test): academic identifier resolution via free public APIs (CrossRef, Datacite, NCBI, arXiv, Open Library, Unpaywall); see adapters/citation-resolver.md
- scripts/report_render.py (init / render / to-pdf / to-docx / to-html / list-styles / lint / self-test): final report generator from research workspace (plan + ledger + screening log); see references/report-generation.md
- scripts/ocr.py (text / pdf / to-ledger / langs / self-test): OCR via tesseract (optional system binary, soft-fail if missing); see references/ocr.md
- scripts/translate.py (text / detect / instances / self-test): translation adapter with stdlib trigram language detection and LibreTranslate/DeepL/Google/Argos backends; see adapters/translation.md
- scripts/embed_corpus.py (index / query / query-ledger / dedupe / self-test): semantic retrieval over text corpora using cosine similarity with stub/sentence-transformers/cohere/llama-cli backends; see references/semantic-retrieval.md
- scripts/citation_graph.py (cited-by / references / expand / to-frontier / coauthors / self-test): citation graph traversal via OpenAlex for snowball sampling and network analysis; see references/citation-graph.md
- scripts/multi_extract.py (text / meta / tables / structured / mbox-search / to-ledger / self-test): unified extraction from DOCX, EPUB, XLSX, mbox, and HTML structured data; see references/multi-format-extraction.md
- scripts/dedup_near.py (fingerprint / scan / ledger / self-test): near-duplicate detection via SimHash + Hamming distance; see references/deduplication.md
- scripts/http_cache.py (get-key / stats / purge / self-test): shared HTTP cache (opt-in via D_RESEARCH_HTTP_CACHE_PATH); see references/http-cache.md
- scripts/lib/http_cache.mjs: Node ESM helper used by api_fetch.mjs for the same shared cache layout
- scripts/bench_harness_check.py (check / check-all / orphans / self-test): bench/fixture/harness consistency check. NOT an agent benchmark; it only catches bench data regressions
- scripts/web_search.mjs: multi-engine web search with fallback chain (DuckDuckGo → SearXNG → Brave → Google CSE); see adapters/web-search-only.md
- scripts/check_internal_refs.py: validate backticked in-repo path references (CI guard)
The scripts are optional. If dependencies are unavailable, follow the workflow manually using the agent's browser or web tools.
Configuration
If a project has research.config.json, obey it. Otherwise use research.config.example.json defaults.
Important config fields:
- browser.default
- browser.timeoutMs
- crawl.maxDepth
- crawl.maxPagesPerDomain
- crawl.maxTotalPages
- crawl.delayMs
- crawl.respectRobots
- research.intake.enabled
- research.intake.emitClassificationCard
- research.intake.multiLabel
- research.intake.askOnSafetyOrOutputAmbiguity
- research.intake.defaultToConservativeBranch
- research.requireEvidenceLedger
- research.requireContradictionPass
- research.executionGates.enabled
- research.executionGates.lowRecallGuard
- research.executionGates.noSingleBasinStop
- research.executionGates.finalVerificationGate
- research.executionGates.subagentsOptional
- research.executionGates.minIndependentBasinsForCompleteness
- researchPlan.context.mainContextLength
- researchPlan.context.taskBudgetRatio
- researchPlan.context.writeFindingsImmediately
- researchPlan.subagents.slots[].id
- researchPlan.subagents.slots[].agent
- researchPlan.subagents.slots[].contextLength
- researchPlan.subagents.slots[].maxParallel
- researchPlan.workspace.baseDir
- researchPlan.workspace.nameTemplate
- researchPlan.workspace.fallbackToCwdOnError
- researchPlan.finalResponse.reportWorkspacePath
- access.allowLoginWithUserPermission
- access.allowPaywalledSources
- access.allowCaptchaSolving
- access.allowStealthEvasion
- api.defaultDelayMs
- api.maxRetries
- api.respectRateLimitHeaders
- database.queryTimeoutMs
- database.maxResultRows
- database.readOnly
- citation.defaultFormat
- citation.enrichFromCrossRef
- monitoring.enabled
- monitoring.defaultIntervalMinutes
- processing.autoClean
- processing.detectOutliers
- largeScale.checkpointEveryN
- largeScale.adaptiveRateLimit
Default access policy is conservative and read-only.
For long-horizon research, researchPlan.finalResponse.reportWorkspacePath defaults to true; always tell the user the workspace path where the plan, ledger, notes, sections, report, and checklist were written.
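The precedence rule at the top of this section (project config first, defaults otherwise) is typically implemented as a deep merge. A minimal sketch follows, assuming a nested JSON layout that mirrors the dotted field names. The `DEFAULTS` dict is a stand-in holding only the crawl values stated in the crawl step, not the real contents of research.config.example.json, and `deep_merge` and `load_config` are hypothetical helpers.

```python
import json
from pathlib import Path

# Stand-in defaults: only the crawl limits documented in this skill.
DEFAULTS = {
    "crawl": {
        "maxDepth": 2,
        "maxPagesPerDomain": 30,
        "maxTotalPages": 100,
        "delayMs": 1000,
        "respectRobots": True,
    }
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: str = "research.config.json") -> dict:
    """Project config layered over the defaults; defaults alone if the file is absent."""
    config_path = Path(path)
    project = json.loads(config_path.read_text(encoding="utf-8")) if config_path.exists() else {}
    return deep_merge(DEFAULTS, project)
```

Merging key by key lets a project override a single value such as crawl.maxDepth without restating the rest of the crawl block.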
Output standards
Never present scraped data as complete unless coverage was verified.
Always include:
- what was searched
- what was accessed
- what was extracted
- what was blocked
- what remains unknown
- confidence level
Use concise outputs for simple tasks. Use full evidence-ledger reports for high-stakes, academic, or dataset-building tasks.
Further reading
The research methodology and source taxonomy behind this skill are written up in references/research-bibliography.md. Read it when adapting the skill to a new domain or when explaining the methodology to a stakeholder.
For contributors extending the skill (new references, adapters, examples, scripts, or templates), see CONTRIBUTING.md.