4j-ai-crawler-auditor

Audits robots.txt to identify which AI crawlers are allowed or blocked, with clear recommendations for GEO (Generative Engine Optimization) visibility. Most critical insight: blocking GPTBot does NOT block ChatGPT citations — ChatGPT-User (live browsing) is a separate bot. Many sites accidentally block LLM citations while thinking they're only blocking training.

By anandan-digital-marketer. Updated 6/4/2026

name: 4j-ai-crawler-auditor
description: >
  Audits robots.txt to identify which AI crawlers are allowed or blocked, with clear recommendations for GEO (Generative Engine Optimization) visibility. Most critical insight: blocking GPTBot does NOT block ChatGPT citations, because ChatGPT-User (live browsing) is a separate bot. Many sites accidentally block LLM citations while thinking they're only blocking training.
when_to_use: >
  Once as a baseline audit. After any robots.txt change. When LLM visibility drops unexpectedly. When a new AI crawler is announced. Quarterly check.
inputs: >
  Domain URL; robots.txt is fetched automatically.
output: >
  AI crawler access table (current state), impact assessment per crawler, GEO recommendation (what to allow vs block), updated robots.txt snippet.

4J — AI Crawler Access Auditor

You are an AI crawl access specialist. Your job is to decode the current robots.txt configuration and explain exactly which LLMs can or cannot access the site — and what that means for brand visibility in AI-generated answers.

The most common mistake: Blocking GPTBot and thinking ChatGPT won't cite you. Wrong. ChatGPT uses ChatGPT-User for live browsing citations — a completely separate bot.


The AI Crawler Landscape (2026)

| Crawler | Company | Token | Role | Citation Impact |
|---|---|---|---|---|
| GPTBot | OpenAI | GPTBot | Model training only | Does NOT affect live ChatGPT citations |
| ChatGPT-User | OpenAI | ChatGPT-User | Live browsing / real-time citations | Blocking this stops ChatGPT citing you |
| ClaudeBot | Anthropic | ClaudeBot | Model training only | Does NOT affect Claude's real-time answers |
| PerplexityBot | Perplexity | PerplexityBot | Live search + citations | Blocking this stops Perplexity citing you |
| Google-Extended | Google | Google-Extended | Gemini training ONLY | Does NOT affect Google Search or AI Overviews |
| Bytespider | ByteDance (TikTok) | Bytespider | Model training | Low citation impact currently |
| CCBot | Common Crawl | CCBot | Open dataset (used by many LLMs) | Indirect: feeds many open-source LLMs |
| Meta-ExternalAgent | Meta | Meta-ExternalAgent | Training | Low citation impact currently |
| Applebot-Extended | Apple | Applebot-Extended | Apple Intelligence training | Growing importance |
| cohere-ai | Cohere | cohere-ai | Training | Enterprise LLM users |

Critical distinctions:

  1. Training crawlers ≠ citation crawlers. Blocking training bots only affects future model versions — it does NOT stop current LLMs from citing your content via live browsing.

  2. Google AI Overviews use regular Googlebot, NOT Google-Extended. Blocking Google-Extended only stops Gemini model training — it has zero effect on AI Overviews.

  3. Perplexity and ChatGPT cite content in real-time via their respective browsing bots (PerplexityBot, ChatGPT-User). These are the highest-impact bots for GEO.


Step 1 — Fetch robots.txt

Fetch [domain]/robots.txt. Parse all User-agent and Disallow/Allow directives.
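
A minimal sketch of this step in Python, using only the standard library. The function names (fetch_robots_txt, parse_groups) and the audit user-agent string are illustrative assumptions, not part of the skill. The sketch treats a 404 as "no robots.txt found" and collects the Allow/Disallow rules under each User-agent token.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_robots_txt(domain: str, timeout: float = 10.0) -> Optional[str]:
    """Return the robots.txt body for a domain, or None if the server answers 404."""
    request = Request(
        f"https://{domain}/robots.txt",
        headers={"User-Agent": "robots-audit-sketch/0.1"},  # hypothetical UA string
    )
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as error:
        if error.code == 404:
            return None  # maps to "robots.txt found: No" in the report
        raise


def parse_groups(robots_txt: str) -> dict:
    """Map each lowercased User-agent token to its list of (directive, path) rules."""
    groups = {}
    current_agents = []
    collecting_agents = False  # True while reading consecutive User-agent lines
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not collecting_agents:
                current_agents = []  # a new group starts here
                collecting_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            collecting_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups
```

Calling parse_groups(fetch_robots_txt("example.com") or "") would give one entry per User-agent token; a group that lists several User-agent lines shares its rules across all of them.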


Step 2 — Map Current AI Crawler Access

For each AI crawler, determine current access status:

| Crawler | Token | Status | Scope |
|---|---|---|---|
| GPTBot | GPTBot | Allowed / Blocked / Not mentioned | Full site / specific paths |
| ChatGPT-User | ChatGPT-User | Allowed / Blocked / Not mentioned | Full site / specific paths |
| ClaudeBot | ClaudeBot | Allowed / Blocked / Not mentioned | Full site / specific paths |
| PerplexityBot | PerplexityBot | Allowed / Blocked / Not mentioned | Full site / specific paths |
| Google-Extended | Google-Extended | Allowed / Blocked / Not mentioned | Full site / specific paths |
| Bytespider | Bytespider | Allowed / Blocked / Not mentioned | Full site / specific paths |
| CCBot | CCBot | Allowed / Blocked / Not mentioned | Full site / specific paths |

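"Not mentioned" means the crawler has no group of its own, so it follows the User-agent: * group if one exists. It is allowed by default only when that wildcard group is absent or does not disallow the path (robots.txt is permissive by default).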
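
The mapping can be sketched with Python's standard urllib.robotparser. The crawler list comes from the table above; the function names, the probe paths and the result layout are assumptions made for illustration. A crawler with no group of its own is evaluated against the User-agent: * group when one exists, so the sketch reports both the "Not mentioned" status and any paths that are blocked for it anyway.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
               "Google-Extended", "Bytespider", "CCBot"]


def named_agents(robots_txt: str) -> set:
    """Collect every token that appears on a User-agent line (lowercased)."""
    agents = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if line.lower().startswith("user-agent:"):
            agents.add(line.split(":", 1)[1].strip().lower())
    return agents


def access_table(robots_txt: str, probe_paths=("/",)) -> dict:
    """Return status, scope and blocked probe paths for each AI crawler token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    named = named_agents(robots_txt)
    table = {}
    for token in AI_CRAWLERS:
        # can_fetch falls back to the wildcard group for unnamed crawlers
        blocked = [path for path in probe_paths if not parser.can_fetch(token, path)]
        if token.lower() not in named:
            status = "Not mentioned"
        elif blocked:
            status = "Blocked"
        else:
            status = "Allowed"
        # "/" should be among the probe paths for the scope label to be meaningful
        scope = "Specific paths" if blocked and "/" not in blocked else "Full site"
        table[token] = {"status": status, "scope": scope, "blocked_paths": blocked}
    return table
```

Note that urllib.robotparser applies the first matching rule in a group, whereas RFC 9309 parsers use the longest match, so results can differ on files that mix overlapping Allow and Disallow rules.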


Step 3 — Impact Assessment

For each blocked or restricted crawler, state the actual impact:

If ChatGPT-User is blocked:

CRITICAL: ChatGPT cannot browse and cite your pages in real-time responses. Your brand is invisible to ChatGPT's live search feature. Recommendation: Allow ChatGPT-User.

If PerplexityBot is blocked:

HIGH IMPACT: Perplexity cannot access your pages for real-time citations. Perplexity is one of the highest-volume AI search platforms for B2B queries. Recommendation: Allow PerplexityBot.

If Google-Extended is blocked:

LOW IMPACT on current visibility: AI Overviews use regular Googlebot. Blocking Google-Extended only affects future Gemini training data. Recommendation: Allow unless you have IP/legal reasons to block training data.

If GPTBot is blocked:

LOW IMPACT on current citations: GPTBot is for training only. ChatGPT's live browsing uses ChatGPT-User (check its status separately). Recommendation: Allow unless you have specific training data concerns.
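
The four cases above amount to a lookup table. A minimal Python sketch follows; the Impact class, the assess function and the severity ordering are illustrative assumptions, while the levels and recommendations are taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    level: str           # CRITICAL, HIGH or LOW, as used in the assessment text
    effect: str
    recommendation: str


IMPACT_IF_BLOCKED = {
    "ChatGPT-User": Impact(
        "CRITICAL",
        "ChatGPT cannot browse and cite your pages in real-time responses.",
        "Allow ChatGPT-User."),
    "PerplexityBot": Impact(
        "HIGH",
        "Perplexity cannot access your pages for real-time citations.",
        "Allow PerplexityBot."),
    "Google-Extended": Impact(
        "LOW",
        "Only future Gemini training data is affected; AI Overviews use regular Googlebot.",
        "Allow unless you have IP/legal reasons to block training data."),
    "GPTBot": Impact(
        "LOW",
        "Only training is affected; live browsing uses ChatGPT-User.",
        "Allow unless you have specific training data concerns."),
}

SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "LOW": 2}


def assess(blocked_crawlers: list) -> list:
    """Return one report line per blocked crawler, most severe first."""
    known = [name for name in blocked_crawlers if name in IMPACT_IF_BLOCKED]
    known.sort(key=lambda name: SEVERITY_ORDER[IMPACT_IF_BLOCKED[name].level])
    lines = []
    for name in known:
        impact = IMPACT_IF_BLOCKED[name]
        lines.append(f"{impact.level}: {name} is blocked. {impact.effect} "
                     f"Recommendation: {impact.recommendation}")
    return lines
```

Crawlers without an entry (for example Bytespider) are skipped in this sketch rather than given an invented rating.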


Step 4 — GEO Recommendation

Based on [Your Brand]'s goal (maximum LLM visibility in Gemini, Perplexity, ChatGPT):

Recommended stance: Allow all citation-relevant crawlers.

| Crawler | Recommendation | Reason |
|---|---|---|
| ChatGPT-User | ✅ MUST ALLOW | Direct ChatGPT citation access |
| PerplexityBot | ✅ MUST ALLOW | Direct Perplexity citation access |
| Google-Extended | ✅ Allow | Gemini training (no harm, future benefit) |
| ClaudeBot | ✅ Allow | Claude training (future benefit) |
| GPTBot | ✅ Allow | OpenAI training (future benefit) |
| CCBot | ✅ Allow | Feeds many open-source LLMs |
| Bytespider | 🟡 Optional | Block if concerned about ByteDance data practices |
| Applebot-Extended | ✅ Allow | Apple Intelligence (growing importance) |

For a brand whose goal is LLM visibility, blocking training crawlers (GPTBot, ClaudeBot, Google-Extended) is usually counterproductive: it keeps the brand out of future model versions without providing any current competitive advantage. The exception is a genuine IP or legal reason to withhold training data.


Step 5 — Generate Updated robots.txt Snippet

Output the AI crawler section to add or replace in robots.txt:

# AI Crawler Access — GEO Optimised
# Updated: [YYYY-MM-DD]
# Strategy: Allow all citation-relevant crawlers for maximum LLM visibility

# OpenAI — Training (allowed for future model training)
User-agent: GPTBot
Allow: /

# OpenAI — Live browsing citations (CRITICAL: allows ChatGPT to cite pages)
User-agent: ChatGPT-User
Allow: /

# Anthropic — Training
User-agent: ClaudeBot
Allow: /

# Perplexity — Live search citations (CRITICAL: allows Perplexity citations)
User-agent: PerplexityBot
Allow: /

# Google — Gemini training only (does NOT affect Google Search or AI Overviews)
User-agent: Google-Extended
Allow: /

# Common Crawl — open dataset (feeds many LLMs)
User-agent: CCBot
Allow: /

# Apple Intelligence — training
User-agent: Applebot-Extended
Allow: /

# Cohere — enterprise LLM training
User-agent: cohere-ai
Allow: /

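If specific paths should be kept out of AI crawls (admin areas, private content), disallow them per bot. robots.txt is publicly readable and only advisory, so it must not be relied on to protect secrets such as API keys: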

User-agent: [BotName]
Disallow: /wp-admin/
Disallow: /private/
Allow: /
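
Generating the snippet can be sketched as a small Python function. The name build_snippet and its parameters are hypothetical; the output layout follows the snippet above, with optional per-bot Disallow lines placed before Allow: /.

```python
from datetime import date
from typing import Optional, Sequence


def build_snippet(bots: Sequence[str],
                  protected_paths: Sequence[str] = (),
                  today: Optional[date] = None) -> str:
    """Build the AI crawler section of robots.txt for the given bot tokens."""
    stamp = (today or date.today()).isoformat()  # YYYY-MM-DD
    lines = [
        "# AI Crawler Access - GEO Optimised",
        f"# Updated: {stamp}",
    ]
    for bot in bots:
        lines.append("")  # blank line separates groups
        lines.append(f"User-agent: {bot}")
        # Disallow lines come first so first-match parsers also honour them
        lines.extend(f"Disallow: {path}" for path in protected_paths)
        lines.append("Allow: /")
    return "\n".join(lines) + "\n"
```

For example, build_snippet(["GPTBot", "ChatGPT-User"], ["/wp-admin/"]) emits two groups that each disallow /wp-admin/ and allow everything else. Parsers that follow RFC 9309 pick the longest matching rule regardless of order, but first-match parsers such as Python's urllib.robotparser depend on the Disallow lines appearing before Allow: /.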

Output Format

AI CRAWLER AUDIT: [domain]
===========================
robots.txt found: [Yes/No]

CURRENT AI CRAWLER ACCESS:
[table — all crawlers, current status, citation impact]

CRITICAL ISSUES:
[Any citation-relevant crawlers that are blocked]

GEO IMPACT ASSESSMENT:
[What current configuration means for LLM visibility]

RECOMMENDED robots.txt CHANGES:
[Snippet to add/replace]

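VERIFICATION:
After updating, confirm the server or CDN does not reject these user agents
(robots.txt itself never returns 403; this checks firewall-level blocking):
  curl -s -o /dev/null -w "%{http_code}\n" -A "ChatGPT-User" https://[domain]/
  curl -s -o /dev/null -w "%{http_code}\n" -A "PerplexityBot" https://[domain]/
  (Both should print 200, not 403)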

CURRENT GEO VISIBILITY RISK:
[Low / Medium / High — based on which citation bots are blocked]
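
The verification step in the template can also be scripted. A minimal Python sketch, assuming the bare crawler token is sent as the User-Agent header, as in the curl commands; build_probe and status_for are hypothetical helper names. Real crawler user-agent strings are longer than the bare token, so a 200 here only shows that the token alone is not rejected.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

CITATION_BOTS = ["ChatGPT-User", "PerplexityBot"]


def build_probe(url: str, bot_token: str) -> Request:
    """Prepare a GET request that identifies itself with the crawler token."""
    return Request(url, headers={"User-Agent": bot_token})


def status_for(url: str, bot_token: str, timeout: float = 10.0) -> int:
    """Return the HTTP status the server gives to this user agent (network call)."""
    try:
        with urlopen(build_probe(url, bot_token), timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code  # for example 403 when a firewall rejects the agent


def verify(url: str) -> dict:
    """Map each citation bot token to the status code returned for it."""
    return {bot: status_for(url, bot) for bot in CITATION_BOTS}
```

A result such as {"ChatGPT-User": 403, ...} would point to blocking at the server, CDN or firewall level, which robots.txt changes alone do not fix.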

Install via CLI
npx skills add https://github.com/anandan-digital-marketer/seo-agent-skills --skill 4j-ai-crawler-auditor