Allow AI crawlers in robots.txt

MetricSpot reads robots.txt to see whether GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and other AI crawlers are allowed to crawl your content.

What this check does

Fetches /robots.txt and parses every User-agent block for the major AI training and answer-engine crawlers:

  • GPTBot (OpenAI — ChatGPT training and browsing)
  • ChatGPT-User (OpenAI — when ChatGPT browses on a user’s behalf)
  • ClaudeBot (Anthropic — Claude training)
  • Claude-Web / anthropic-ai (legacy Anthropic crawler names)
  • PerplexityBot (Perplexity answer engine)
  • Google-Extended (Google’s Gemini training opt-in — separate from Googlebot)
  • CCBot (Common Crawl, feeds many AI datasets)
  • cohere-ai (Cohere)
  • Bytespider (ByteDance / TikTok AI)
  • Meta-ExternalAgent (Meta AI training)
  • applebot-extended (Apple Intelligence training opt-in — separate from Applebot)

The check fails when one or more of these are explicitly disallowed (Disallow: /) and your audit profile is “AI discovery: allow.”
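
For reference, you can run the same kind of evaluation yourself with Python's standard urllib.robotparser. A minimal sketch (not MetricSpot's actual implementation; the domain is a placeholder):

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt
rp = RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")
rp.read()

# can_fetch() evaluates the User-agent blocks for a given agent and path
print(rp.can_fetch("GPTBot", "https://yourdomain.com/"))         # False if Disallow: /
print(rp.can_fetch("PerplexityBot", "https://yourdomain.com/"))  # True if allowed or unlisted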

Why it matters

Answer engines and chatbots are becoming a meaningful traffic source — Perplexity, ChatGPT, Claude, and Google AI Overviews all surface citations to source pages, and clicks from those citations now rival some social-platform referrals.

The trade-off is real, and neither choice is unambiguously right:

  • Allowing AI crawlers means your content gets quoted in answers and cited with a link. Discovery improves; brand recognition improves; some clicks come through.
  • Blocking AI crawlers prevents your content from being used as training data (for the ones that respect robots.txt — not all do). You preserve the “scarcity” of your content, but you also opt out of being cited in answers people would otherwise see.

Sites that monetize via ads or have unique, hard-to-replace content (news publishers, paid research) often block. Sites that monetize via product sales or lead gen usually allow — being cited as the authoritative answer is free brand-awareness marketing.

There’s no universal right answer. This check fires when your config doesn’t match the audit profile you selected. The fix is either to allow them (if you want answer-engine traffic) or to acknowledge the deliberate block.

How to fix it

To allow all major AI crawlers, put this at the top of robots.txt:

# Allow AI crawlers explicitly
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: applebot-extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

# Sitemap last
Sitemap: https://yourdomain.com/sitemap.xml

You don’t strictly need to list them — if no rule matches a user-agent, the crawler is allowed by default. But listing them explicitly is a public signal that you welcome them, and it makes your intent unambiguous when a new crawler shows up and you have to decide.
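
You can watch that default-allow behavior in action with urllib.robotparser, feeding it a robots.txt that names only GPTBot (the second agent name below is made up for the demo):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: GPTBot
Disallow: /
""".splitlines())

print(rp.can_fetch("GPTBot", "/"))           # False: explicitly disallowed
print(rp.can_fetch("SomeBrandNewBot", "/"))  # True: no matching rule, allowed by default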

To block them all (opt out), repeat the same pattern for every crawler in the list above. The core set:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

Selective allow. Some sites block training crawlers but allow live-fetch agents:

# Block training-data scrapers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow on-demand fetches (citations land back as links)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-Web
Allow: /

Server-level blocks. robots.txt is a politeness convention; only well-behaved crawlers honor it. For a hard block, match the user agent in nginx:

# Return 403 to any request whose User-Agent matches (case-insensitive)
if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|anthropic-ai)") {
  return 403;
}
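
To confirm the rule actually fires, send one request with a matching User-Agent header and check for the 403. A quick test in Python (the domain is a placeholder):

import urllib.request
import urllib.error

# One test request that impersonates a blocked crawler's User-Agent
req = urllib.request.Request("https://yourdomain.com/",
                             headers={"User-Agent": "GPTBot"})
try:
    urllib.request.urlopen(req)
    print("not blocked: the nginx rule did not match")
except urllib.error.HTTPError as e:
    print(f"blocked with HTTP {e.code}")  # expect 403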

Pair with agents.txt (a newer convention MetricSpot also checks — see the agents.txt doc) for a structured per-bot policy that’s machine-readable beyond robots.txt’s wildcards.

Audit yourself. Run curl https://yourdomain.com/robots.txt and confirm the blocks and allows match your intent. Then check the Allow AI crawlers finding in your next audit.
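
Or script the comparison. This sketch checks every crawler from the list above against your live robots.txt, again with urllib.robotparser (the domain is a placeholder):

from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web", "anthropic-ai",
    "PerplexityBot", "Google-Extended", "CCBot", "cohere-ai",
    "Bytespider", "Meta-ExternalAgent", "applebot-extended",
]

rp = RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")
rp.read()

for bot in AI_BOTS:
    status = "allowed" if rp.can_fetch(bot, "/") else "blocked"
    print(f"{bot:22} {status}")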

Frequently asked questions

Does blocking GPTBot remove my content from ChatGPT?

It stops future crawls of your site, but content already captured in training data stays there. There's no retroactive removal; OpenAI doesn't offer one. Blocking now means future model versions won't see your new content.

What about AI tools that ignore robots.txt?

A growing number of scrapers ignore robots.txt entirely or spoof user agents. For those, robots.txt is useless and you need server-level filtering (nginx user-agent rules, Cloudflare bot management, IP blocks). The robots.txt approach handles the well-behaved majority.
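
To see which crawlers (well-behaved or not) are actually hitting you, scan your access logs for the known names. A rough sketch; the log path and regex are assumptions, and spoofed user agents won't show up under their real names:

import re
from collections import Counter

# Known AI crawler tokens; spoofers are invisible to this check
AI_UA = re.compile(
    r"GPTBot|ClaudeBot|PerplexityBot|Google-Extended|CCBot|Bytespider|Meta-ExternalAgent",
    re.IGNORECASE)

hits = Counter()
with open("/var/log/nginx/access.log") as log:  # assumed default nginx log path
    for line in log:
        match = AI_UA.search(line)
        if match:
            hits[match.group(0)] += 1

for agent, count in hits.most_common():
    print(f"{agent:22} {count}")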

Should I allow Google-Extended specifically?

Google-Extended is Google’s training-data crawler for Gemini, separate from Googlebot (which still indexes you for normal search). Blocking Google-Extended doesn’t affect your search rankings; allowing it lets your content show up in Gemini answers. Most sites with AI-discovery intent allow it.

Last updated 2026-05-11