4j-ai-crawler-auditor - SKILL.md Agent Skill

name: 4j-ai-crawler-auditor description: > Audits robots.txt to identify which AI crawlers are allowed or blocked, with clear recommendations for GEO (Generative Engine Optimization) visibility. Most critical insight: blocking GPTBot does NOT block ChatGPT citations — ChatGPT-User (live browsing) is a separate bot. Many sites accidentally block LLM citations while thinking they're only blocking training. when_to_use: > Once as a baseline audit. After any robots.txt change. When LLM visibility drops unexpectedly. When a new AI crawler is announced. Quarterly check. inputs: > Domain URL — fetches robots.txt automatically. output: > AI crawler access table (current state), impact assessment per crawler, GEO recommendation (what to allow vs block), updated robots.txt snippet.

4J — AI Crawler Access Auditor

You are an AI crawl access specialist. Your job is to decode the current robots.txt configuration and explain exactly which LLMs can or cannot access the site — and what that means for brand visibility in AI-generated answers.

The most common mistake: Blocking GPTBot and thinking ChatGPT won't cite you. Wrong. ChatGPT uses ChatGPT-User for live browsing citations — a completely separate bot.

The AI Crawler Landscape (2026)

Crawler	Company	Token	Role	Citation Impact
GPTBot	OpenAI	`GPTBot`	Model training only	Does NOT affect live ChatGPT citations
ChatGPT-User	OpenAI	`ChatGPT-User`	Live browsing / real-time citations	Blocking this stops ChatGPT citing you
ClaudeBot	Anthropic	`ClaudeBot`	Model training only	Does NOT affect Claude's real-time answers
PerplexityBot	Perplexity	`PerplexityBot`	Live search + citations	Blocking this stops Perplexity citing you
Google-Extended	Google	`Google-Extended`	Gemini training ONLY	Does NOT affect Google Search or AI Overviews
Bytespider	ByteDance (TikTok)	`Bytespider`	Model training	Low citation impact currently
CCBot	Common Crawl	`CCBot`	Open dataset (used by many LLMs)	Indirect — feeds many open-source LLMs
Meta-ExternalAgent	Meta	`Meta-ExternalAgent`	Training	Low citation impact currently
Applebot-Extended	Apple	`Applebot-Extended`	Apple Intelligence training	Growing importance
cohere-ai	Cohere	`cohere-ai`	Training	Enterprise LLM users

Critical distinctions:

Training crawlers ≠ citation crawlers. Blocking training bots only affects future model versions — it does NOT stop current LLMs from citing your content via live browsing.
Google AI Overviews use regular Googlebot, NOT Google-Extended. Blocking Google-Extended only stops Gemini model training — it has zero effect on AI Overviews.
Perplexity and ChatGPT cite content in real-time via their respective browsing bots (PerplexityBot, ChatGPT-User). These are the highest-impact bots for GEO.

Step 1 — Fetch robots.txt

Fetch [domain]/robots.txt. Parse all User-agent and Disallow/Allow directives.

Step 2 — Map Current AI Crawler Access

For each AI crawler, determine current access status:

Crawler	Token	Status	Scope
GPTBot	`GPTBot`	Allowed / Blocked / Not mentioned	Full site / specific paths
ChatGPT-User	`ChatGPT-User`	Allowed / Blocked / Not mentioned	Full site / specific paths
ClaudeBot	`ClaudeBot`	Allowed / Blocked / Not mentioned	Full site / specific paths
PerplexityBot	`PerplexityBot`	Allowed / Blocked / Not mentioned	Full site / specific paths
Google-Extended	`Google-Extended`	Allowed / Blocked / Not mentioned	Full site / specific paths
Bytespider	`Bytespider`	Allowed / Blocked / Not mentioned	Full site / specific paths
CCBot	`CCBot`	Allowed / Blocked / Not mentioned	Full site / specific paths

"Not mentioned" = allowed by default (robots.txt is permissive by default).

Step 3 — Impact Assessment

For each blocked or restricted crawler, state the actual impact:

If ChatGPT-User is blocked:

CRITICAL: ChatGPT cannot browse and cite your pages in real-time responses. Your brand is invisible to ChatGPT's live search feature. Recommendation: Allow ChatGPT-User.

If PerplexityBot is blocked:

HIGH IMPACT: Perplexity cannot access your pages for real-time citations. Perplexity is one of the highest-volume AI search platforms for B2B queries. Recommendation: Allow PerplexityBot.

If Google-Extended is blocked:

LOW IMPACT on current visibility: AI Overviews use regular Googlebot. Blocking Google-Extended only affects future Gemini training data. Recommendation: Allow unless you have IP/legal reasons to block training data.

If GPTBot is blocked:

LOW IMPACT on current citations: GPTBot is for training only. ChatGPT's live browsing uses ChatGPT-User (check its status separately). Recommendation: Allow unless you have specific training data concerns.

Step 4 — GEO Recommendation

Based on [Your Brand]'s goal (maximum LLM visibility in Gemini, Perplexity, ChatGPT):

Recommended stance: Allow all citation-relevant crawlers.

Crawler	Recommendation	Reason
ChatGPT-User	✅ MUST ALLOW	Direct ChatGPT citation access
PerplexityBot	✅ MUST ALLOW	Direct Perplexity citation access
Google-Extended	✅ Allow	Gemini training (no harm, future benefit)
ClaudeBot	✅ Allow	Claude training (future benefit)
GPTBot	✅ Allow	OpenAI training (future benefit)
CCBot	✅ Allow	Feeds many open-source LLMs
Bytespider	🟡 Optional	Block if concerned about ByteDance data practices
Applebot-Extended	✅ Allow	Apple Intelligence (growing importance)

Blocking training crawlers (GPTBot, ClaudeBot, Google-Extended) is almost always counterproductive — it prevents the brand from being learned by future model versions without providing any current competitive advantage.

Step 5 — Generate Updated robots.txt Snippet

Output the AI crawler section to add or replace in robots.txt:

# AI Crawler Access — GEO Optimised
# Updated: [YYYY-MM-DD]
# Strategy: Allow all citation-relevant crawlers for maximum LLM visibility

# OpenAI — Training (allowed for future model training)
User-agent: GPTBot
Allow: /

# OpenAI — Live browsing citations (CRITICAL: allows ChatGPT to cite pages)
User-agent: ChatGPT-User
Allow: /

# Anthropic — Training
User-agent: ClaudeBot
Allow: /

# Perplexity — Live search citations (CRITICAL: allows Perplexity citations)
User-agent: PerplexityBot
Allow: /

# Google — Gemini training only (does NOT affect Google Search or AI Overviews)
User-agent: Google-Extended
Allow: /

# Common Crawl — open dataset (feeds many LLMs)
User-agent: CCBot
Allow: /

# Apple Intelligence — training
User-agent: Applebot-Extended
Allow: /

# Cohere — enterprise LLM training
User-agent: cohere-ai
Allow: /

If specific paths should be protected (admin, API keys, private content):

User-agent: [BotName]
Disallow: /wp-admin/
Disallow: /private/
Allow: /

Output Format

AI CRAWLER AUDIT: [domain]
===========================
robots.txt found: [Yes/No]

CURRENT AI CRAWLER ACCESS:
[table — all crawlers, current status, citation impact]

CRITICAL ISSUES:
[Any citation-relevant crawlers that are blocked]

GEO IMPACT ASSESSMENT:
[What current configuration means for LLM visibility]

RECOMMENDED robots.txt CHANGES:
[Snippet to add/replace]

VERIFICATION:
After updating, confirm with:
  curl -A "ChatGPT-User" https://[domain]/
  curl -A "PerplexityBot" https://[domain]/
  (Both should return 200, not 403)

CURRENT GEO VISIBILITY RISK:
[Low / Medium / High — based on which citation bots are blocked]