docs-indexer - SKILL.md Agent Skill

name: docs-indexer
description: Use when Codex needs to crawl, inspect, or summarize a documentation website, GitHub docs tree, or local docs folder and produce a relevance-ranked page index, doc-map, source guide, or shortlist of important documentation pages. Trigger for documentation discovery, docs indexing, source-map creation, llms.txt alternatives, docs skill scaffolding, or choosing the most relevant pages from standalone docs sites or repository documentation folders.

Docs Indexer

Create a compact, evidence-backed index of the most important docs pages from a website, GitHub docs tree, or local repository folder.

Workflow

Clarify the indexing target:
- Use website mode for standalone docs sites.
- Use GitHub tree mode for public github.com/<owner>/<repo>/tree/<ref>/<path> docs folders.
- Use local mode for a checked-out repo folder or local docs directory.
Read references/source-strategy.md when source boundaries, sitemap alternatives, GitHub tree handling, or crawl scope are unclear.
Read references/index-output.md when the user needs a durable artifact such as doc-map.md, a source guide for another skill, or a ranked research shortlist.
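
The mode choice above can be sketched as a small classifier. This is an illustrative sketch only: `classify_source` and the mode labels it returns are hypothetical names, not part of the helper script, and the helper's own detection logic may differ.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Guess the indexing mode for a source string (hypothetical helper)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        parts = [p for p in parsed.path.split("/") if p]
        is_github = parsed.netloc.lower() in ("github.com", "www.github.com")
        # Matches github.com/<owner>/<repo>/tree/<ref>/<path>
        if is_github and len(parts) >= 4 and parts[2] == "tree":
            return "github-tree"
        return "website"
    # Anything that is not an http(s) URL is treated as a local path.
    return "local"


if __name__ == "__main__":
    for src in (
        "https://docs.redpanda.com/streaming/current/home/",
        "https://github.com/openai/codex/tree/main/sdk/python/docs",
        "./docs",
    ):
        print(src, "->", classify_source(src))
```

The sketch treats every non-HTTP string as local, which matches the advice to index private repos from a checkout.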

Run the helper with a bounded crawl first:

uv run --script config/codex/skills/docs-indexer/scripts/build_docs_index.py <source> --max-pages 60 --top 25

Add --focus "<terms>" when the index should prioritize a topic, task, feature area, or planned skill.
Inspect the generated top pages and crawl notes. If the crawl was too shallow to reach the relevant pages, rerun with a tighter --scope-prefix before raising --max-pages, so the page budget is spent inside the docs section that matters.
Fetch and read the highest-ranked pages before writing precise guidance. Treat the generated index as a routing artifact, not a replacement for source reading.
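
When the helper is driven from another script, the bounded first run and a focused rerun can be assembled as argument lists. The sketch below is illustrative: `build_command` is a hypothetical wrapper, and it uses only the flags shown in this document. It prints the command instead of running it.

```python
import shlex

# Path to the helper as given in this document.
HELPER = "config/codex/skills/docs-indexer/scripts/build_docs_index.py"


def build_command(source, max_pages=60, top=25, focus=None, scope_prefix=None):
    """Build the argv list for a bounded helper run (hypothetical wrapper)."""
    cmd = ["uv", "run", "--script", HELPER, source,
           "--max-pages", str(max_pages), "--top", str(top)]
    if focus:
        cmd += ["--focus", focus]
    if scope_prefix:
        cmd += ["--scope-prefix", scope_prefix]
    return cmd


if __name__ == "__main__":
    first_run = build_command("https://docs.redpanda.com/streaming/current/home/")
    rerun = build_command(
        "https://docs.redpanda.com/streaming/current/home/",
        focus="Kafka transactions",
        scope_prefix="/streaming/current/",
    )
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(first_run))
    print(shlex.join(rerun))
```

Passing the list to `subprocess.run` without a shell avoids quoting problems in multi-word focus terms.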

Common Commands

Website docs:

uv run --script config/codex/skills/docs-indexer/scripts/build_docs_index.py \
  https://docs.redpanda.com/streaming/current/home/ \
  --scope-prefix /streaming/current/ \
  --max-pages 80 \
  --top 30

GitHub docs folder:

uv run --script config/codex/skills/docs-indexer/scripts/build_docs_index.py \
  https://github.com/openai/codex/tree/main/sdk/python/docs \
  --focus "sdk api quickstart examples" \
  --top 20

Local repo docs:

uv run --script config/codex/skills/docs-indexer/scripts/build_docs_index.py \
  ./docs \
  --output /tmp/doc-map.md

Structured output:

uv run --script config/codex/skills/docs-indexer/scripts/build_docs_index.py \
  https://debezium.io/documentation/reference/stable/index.html \
  --format json \
  --output /tmp/debezium-doc-index.json

Ranking Rules

Prefer pages that are close to the seed, heavily linked by nearby docs, and named as overview, introduction, getting started, quickstart, concepts, architecture, configuration, API/reference, operations, migration, best practices, or troubleshooting.
Use --focus for task-specific relevance. A focused crawl for "Kafka transactions" should rank transaction pages over generic overview pages.
Penalize changelogs, release notes, blog posts, archives, legal pages, and search pages unless the focus terms explicitly need them.
Keep reasons short: name the strongest title/path match, focus match, inbound links, or seed distance.
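
The ranking rules above can be sketched as a scoring function. Everything here is an assumption for illustration: the `Page` fields, the keyword lists, and the weights are hypothetical and are not taken from build_docs_index.py, whose actual scoring may differ.

```python
from dataclasses import dataclass

# Keyword lists derived from the rules above; substring matching is crude
# but enough for a sketch.
PREFERRED = ("overview", "introduction", "getting-started", "getting started",
             "quickstart", "concepts", "architecture", "configuration", "api",
             "reference", "operations", "migration", "best-practices",
             "best practices", "troubleshooting")
PENALIZED = ("changelog", "release-notes", "release notes", "blog",
             "archive", "legal", "search")


@dataclass
class Page:
    path: str
    title: str
    seed_distance: int   # link hops from the seed page
    inbound_links: int   # links from other crawled pages


def score_page(page, focus_terms=()):
    """Return (score, reasons) using hypothetical weights."""
    text = f"{page.path} {page.title}".lower()
    focus = [t.lower() for t in focus_terms]
    score = float(max(0, 3 - page.seed_distance))   # closer to seed is better
    score += min(page.inbound_links, 10) * 0.3      # capped inbound-link bonus
    reasons = [f"seed distance {page.seed_distance}",
               f"{page.inbound_links} inbound links"]
    if any(k in text for k in PREFERRED):
        score += 2
        reasons.append("title/path match")
    hits = [t for t in focus if t in text]
    if hits:
        score += 5 * len(hits)
        reasons.append("focus match: " + ", ".join(hits))
    # Penalize low-value page types unless a focus term asks for them.
    if any(k in text and not any(k in t for t in focus) for k in PENALIZED):
        score -= 3
        reasons.append("penalized page type")
    return score, reasons
```

With these weights a focus hit outweighs the generic overview bonus, which is the behavior described for a "Kafka transactions" crawl.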

Quality Rules

Keep crawls bounded. Start with a --max-pages value between 40 and 80; avoid unbounded recursive scraping.
Stay inside the docs scope. Prefer --scope-prefix for websites with broad nav or marketing links.
Respect source freshness. Re-run the helper when docs may have changed, then fetch the final pages before citing or encoding claims.
Do not treat the helper as an authority for exact behavior, API shape, or version semantics. It ranks pages; Codex still needs to read the pages it uses.
For private repos, use a local checkout instead of adding tokens or credentials to the script invocation.
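
The first two rules, a page budget and a scope prefix, can be sketched as a breadth-first crawl over a link graph. This is a minimal sketch under stated assumptions: `bounded_crawl` and `in_scope` are hypothetical names, the `links` dictionary stands in for real page fetching, and the helper script may enforce its limits differently.

```python
from collections import deque
from urllib.parse import urlparse


def in_scope(url, seed, scope_prefix):
    """Keep only same-host URLs whose path starts with the scope prefix."""
    s, u = urlparse(seed), urlparse(url)
    return u.netloc == s.netloc and u.path.startswith(scope_prefix)


def bounded_crawl(seed, links, scope_prefix="/", max_pages=60):
    """Breadth-first walk that stops after max_pages pages.

    links maps a URL to its outgoing URLs (a stand-in for fetching).
    Returns (url, seed_distance) pairs in visit order.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        for target in links.get(url, []):
            if target not in seen and in_scope(target, seed, scope_prefix):
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


if __name__ == "__main__":
    graph = {
        "https://example.com/docs/": [
            "https://example.com/docs/quickstart",
            "https://example.com/pricing",          # marketing link, out of scope
            "https://other.example.org/docs/x",     # different host
        ],
        "https://example.com/docs/quickstart": ["https://example.com/docs/config"],
    }
    for url, depth in bounded_crawl("https://example.com/docs/", graph, "/docs/", 10):
        print(depth, url)
```

Breadth-first order visits pages near the seed first, so a small budget still covers the top of the docs tree.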

Resources

scripts/build_docs_index.py: bounded crawler and relevance-ranked Markdown/JSON index generator.
references/source-strategy.md: source selection, crawl boundaries, website/GitHub/local handling, and failure modes.
references/index-output.md: recommended index artifact shape, ranking interpretation, and validation checklist.