name: docs-indexer
description: Use when Codex needs to crawl, inspect, or summarize a documentation website, GitHub docs tree, or local docs folder and produce a relevance-ranked page index, doc-map, source guide, or shortlist of important documentation pages. Trigger for documentation discovery, docs indexing, source-map creation, llms.txt alternatives, docs skill scaffolding, or choosing the most relevant pages from standalone docs sites or repository documentation folders.
Docs Indexer
Create a compact, evidence-backed index of the most important docs pages from a website, GitHub docs tree, or local repository folder.
Workflow
Clarify the indexing target:
- Use website mode for standalone docs sites.
- Use GitHub tree mode for public github.com/<owner>/<repo>/tree/<ref>/<path> docs folders.
- Use local mode for a checked-out repo folder or local docs directory.
Read references/source-strategy.md when source boundaries, sitemap alternatives, GitHub tree handling, or crawl scope are unclear.
Read references/index-output.md when the user needs a durable artifact such as doc-map.md, a source guide for another skill, or a ranked research shortlist.
Run the helper with a bounded crawl first:
uv run --script config/codex/skills/docs-indexer/scripts/build_docs_index.py <source> --max-pages 60 --top 25
Add --focus "<terms>" when the index should prioritize a topic, task, feature area, or planned skill.
Inspect the generated top pages and crawl notes. If the crawl was too shallow, rerun with a tighter --scope-prefix before raising --max-pages.
Fetch and read the highest-ranked pages before writing precise guidance. Treat the generated index as a routing artifact, not a replacement for source reading.
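The mode choice in the first workflow step can be sketched as a small classifier. This is a minimal illustration, not the helper's actual detection logic: the function name `classify_source`, the mode labels, and the URL pattern are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern for /<owner>/<repo>/tree/<ref>/<path>; the real
# helper may recognize GitHub tree URLs differently.
GITHUB_TREE_PATH = re.compile(r"^/[^/]+/[^/]+/tree/[^/]+(/.*)?$")


def classify_source(source: str) -> str:
    """Return 'github-tree', 'website', or 'local' for a source argument."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        if host in ("github.com", "www.github.com") and GITHUB_TREE_PATH.match(parsed.path):
            return "github-tree"
        return "website"
    # Anything without an http(s) scheme is treated as a filesystem path.
    return "local"


for candidate in (
    "https://docs.redpanda.com/streaming/current/home/",
    "https://github.com/openai/codex/tree/main/sdk/python/docs",
    "./docs",
):
    print(candidate, "->", classify_source(candidate))
```

A GitHub URL that is not a tree URL (for example a repository home page) falls through to website mode in this sketch, which may or may not match how the helper treats it.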
Common Commands
Website docs:
uv run --script config/codex/skills/docs-indexer/scripts/build_docs_index.py \
https://docs.redpanda.com/streaming/current/home/ \
--scope-prefix /streaming/current/ \
--max-pages 80 \
--top 30
GitHub docs folder:
uv run --script config/codex/skills/docs-indexer/scripts/build_docs_index.py \
https://github.com/openai/codex/tree/main/sdk/python/docs \
--focus "sdk api quickstart examples" \
--top 20
Local repo docs:
uv run --script config/codex/skills/docs-indexer/scripts/build_docs_index.py \
./docs \
--output /tmp/doc-map.md
Structured output:
uv run --script config/codex/skills/docs-indexer/scripts/build_docs_index.py \
https://debezium.io/documentation/reference/stable/index.html \
--format json \
--output /tmp/debezium-doc-index.json
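The commands above share one shape: the script path, a source, then optional flags. When the helper is driven from another script, that shape could be assembled as an argument list. The sketch below uses only flags that appear in this document; the function name `build_command` and its keyword names are hypothetical, and nothing here validates values the way the helper itself might.

```python
from typing import List, Optional

SCRIPT = "config/codex/skills/docs-indexer/scripts/build_docs_index.py"


def build_command(
    source: str,
    *,
    scope_prefix: Optional[str] = None,
    focus: Optional[str] = None,
    max_pages: Optional[int] = None,
    top: Optional[int] = None,
    fmt: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for the helper; omitted options add no flag."""
    cmd = ["uv", "run", "--script", SCRIPT, source]
    options = [
        ("--scope-prefix", scope_prefix),
        ("--focus", focus),
        ("--max-pages", max_pages),
        ("--top", top),
        ("--format", fmt),
        ("--output", output),
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# The list can be passed to subprocess.run(cmd, check=True) from the repo root.
print(build_command("./docs", output="/tmp/doc-map.md"))
```

Passing a list rather than a shell string avoids quoting problems with multi-word `--focus` values.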
Ranking Rules
- Prefer pages that are close to the seed, heavily linked by nearby docs, and named as overview, introduction, getting started, quickstart, concepts, architecture, configuration, API/reference, operations, migration, best practices, or troubleshooting.
- Use --focus for task-specific relevance. A focused crawl for "Kafka transactions" should rank transaction pages over generic overview pages.
- Penalize changelogs, release notes, blog posts, archives, legal pages, and search pages unless the focus terms explicitly need them.
- Keep reasons short: name the strongest title/path match, focus match, inbound links, or seed distance.
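One way to picture how these rules combine is a small additive score. The sketch below is illustrative only: the weights, the `Page` fields, the term lists, and the whole-word matching are assumptions chosen for the example, not the scoring used by build_docs_index.py.

```python
import re
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical term lists derived from the rules above.
BOOST_TERMS = ("overview", "introduction", "getting started", "quickstart",
               "concepts", "architecture", "configuration", "api", "reference",
               "operations", "migration", "best practices", "troubleshooting")
PENALTY_TERMS = ("changelog", "release notes", "blog", "archive", "legal", "search")


@dataclass
class Page:
    path: str
    title: str
    seed_distance: int   # link hops from the seed page
    inbound_links: int   # links from other crawled pages


def score_page(page: Page, focus_terms: Sequence[str] = ()) -> Tuple[float, str]:
    """Return (score, short reason). All weights are made up for illustration."""
    words = re.findall(r"[a-z0-9]+", f"{page.path} {page.title}".lower())
    text = " " + " ".join(words) + " "
    focus = [term.lower() for term in focus_terms]

    def has(term: str) -> bool:
        return f" {term} " in text  # whole-word (or whole-phrase) match

    parts = []  # (weight, reason) pairs for every signal that fired
    proximity = max(0, 3 - page.seed_distance)
    if proximity:
        parts.append((float(proximity), f"seed distance {page.seed_distance}"))
    if page.inbound_links > 0:
        parts.append((min(page.inbound_links, 10) * 0.3,
                      f"{page.inbound_links} inbound links"))
    boost = next((term for term in BOOST_TERMS if has(term)), None)
    if boost:
        parts.append((2.0, f"title/path match {boost}"))
    hits = [term for term in focus if has(term)]
    if hits:
        parts.append((6.0 * len(hits), "focus match " + ", ".join(hits)))

    score = sum(weight for weight, _ in parts)
    penalty = next((term for term in PENALTY_TERMS if has(term)), None)
    if penalty and not set(penalty.split()) & set(focus):
        score -= 5.0  # skipped when the focus terms ask for this page type
    reason = max(parts)[1] if parts else "no strong signal"
    return score, reason


overview = Page("/docs/overview/", "Overview", seed_distance=1, inbound_links=8)
transactions = Page("/docs/develop/transactions/", "Transactions", 2, 3)
print(score_page(overview), score_page(transactions, ["kafka", "transactions"]))
```

With no focus terms the overview page scores higher; with the focus terms "kafka" and "transactions" the transactions page overtakes it, which is the behavior the second rule describes. The reason string names only the strongest signal, in line with the last rule.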
Quality Rules
- Keep crawls bounded. Start with --max-pages between 40 and 80; avoid unbounded recursive scraping.
- Stay inside the docs scope. Prefer --scope-prefix for websites with broad nav or marketing links.
- Respect source freshness. Re-run the helper when docs may have changed, then fetch the final pages before citing or encoding claims.
- Do not treat the helper as an authority for exact behavior, API shape, or version semantics. It ranks pages; Codex still needs to read the pages it uses.
- For private repos, use a local checkout instead of adding tokens or credentials to the script invocation.
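The first two rules, a page cap and a scope prefix, can be illustrated with a breadth-first walk over an in-memory link graph. This is a sketch under stated assumptions: the function name `bounded_crawl` and the dictionary-based link graph are hypothetical stand-ins, no network requests are made, and the real helper may traverse and prune differently.

```python
from collections import deque
from typing import Dict, List


def bounded_crawl(links: Dict[str, List[str]], seed: str,
                  scope_prefix: str, max_pages: int) -> Dict[str, int]:
    """Breadth-first walk; returns {path: seed_distance} for at most
    max_pages pages, skipping link targets outside scope_prefix."""
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if len(distance) >= max_pages:
                return distance  # page cap reached: stop immediately
            if target in distance or not target.startswith(scope_prefix):
                continue  # already seen, or a nav/marketing link out of scope
            distance[target] = distance[page] + 1
            queue.append(target)
    return distance


# Toy link graph standing in for fetched pages (paths are made up).
site = {
    "/streaming/current/home/": ["/streaming/current/get-started/", "/pricing/",
                                 "/streaming/current/reference/"],
    "/streaming/current/get-started/": ["/streaming/current/deploy/", "/blog/"],
    "/streaming/current/reference/": ["/streaming/current/reference/api/"],
}
print(bounded_crawl(site, "/streaming/current/home/", "/streaming/current/", 4))
```

Breadth-first order means the cap cuts off the deepest pages first, so pages close to the seed are always indexed before distant ones.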
Resources
- scripts/build_docs_index.py: bounded crawler and relevance-ranked Markdown/JSON index generator.
- references/source-strategy.md: source selection, crawl boundaries, website/GitHub/local handling, and failure modes.
- references/index-output.md: recommended index artifact shape, ranking interpretation, and validation checklist.