robots.txt file
MetricSpot checks for /robots.txt at the root of your domain. It's the first file every crawler fetches — its absence isn't fatal, but it's a missed signal.
What this check does
GETs https://yourdomain.com/robots.txt and confirms it returns 200 with a parseable robots file. A missing file (404) or non-200 status fails the check.
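Conceptually, the check is one HTTP request plus a sanity parse. A minimal TypeScript sketch of the idea (the function name and the "at least one User-agent line" heuristic are illustrative assumptions, not MetricSpot's actual implementation):

// Sketch of a robots.txt presence check, using the built-in fetch (Node 18+).
// Assumption: "parseable" means the body declares at least one User-agent
// group; the real check may be stricter.
async function checkRobotsTxt(domain: string): Promise<boolean> {
  const res = await fetch(`https://${domain}/robots.txt`);
  if (res.status !== 200) return false; // a 404 or any other non-200 fails
  const body = await res.text();
  return /^user-agent:/im.test(body); // loosest possible "parseable" test
}

// Usage: checkRobotsTxt('yourdomain.com').then(ok => console.log(ok ? 'pass' : 'fail'));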
Why it matters
robots.txt is the first URL every crawler — Googlebot, GPTBot, ClaudeBot, PerplexityBot, archive.org — fetches before scanning your site. It’s your one chance to:
- Direct crawlers to your sitemap with a Sitemap: line, dramatically improving discovery for pages not linked from the homepage.
- Block crawl traps: infinite calendars, faceted search filters, internal-search result pages.
- Allow or disallow AI crawlers selectively (separate check).
Without a robots.txt, you’re saying “crawl whatever you find, in whatever order” — and crawlers waste budget on pages you don’t care about.
How to fix it
Create /public/robots.txt (or wherever your server serves static files from) with at minimum:
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
That’s the “open site” baseline. To block specific paths:
User-agent: *
Disallow: /admin/
Disallow: /search?
Disallow: /cart/
Sitemap: https://yourdomain.com/sitemap.xml
Common patterns:
- WordPress: WordPress auto-generates a virtual robots.txt unless a physical robots.txt file exists at the site root. Yoast / Rank Math let you edit it in the admin.
- Next.js: create app/robots.ts exporting a MetadataRoute.Robots object (see the sketch below).
- Astro: drop a static public/robots.txt file.
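For the Next.js route, a minimal app/robots.ts could look like this (the domain and disallowed paths are placeholders, not required values):

// app/robots.ts: Next.js serves the returned object as /robots.txt.
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: {
      userAgent: '*',
      allow: '/',
      disallow: ['/admin/', '/cart/'], // example paths only
    },
    sitemap: 'https://yourdomain.com/sitemap.xml',
  };
}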
After publishing, test with the robots.txt report in Google Search Console (it replaced the retired robots.txt Tester).
Frequently asked questions
Can I block crawlers I don’t want?
Yes, with User-agent: GPTBot followed by Disallow: /. But this only works for crawlers that respect robots.txt — and a growing list of AI scrapers ignore it. For hard blocks, use server-level user-agent rules.
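For example, this file shuts out GPTBot while leaving everything else open. Crawlers obey the most specific User-agent group that matches them, so GPTBot follows its own block and ignores the * rules:

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /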
Does Disallow: prevent indexing?
No, Disallow: blocks crawling, not indexing. A page with Disallow: can still appear in search results (with no description) if other sites link to it. To prevent indexing, use a noindex meta tag or X-Robots-Tag: noindex header instead.
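Either form works. The meta tag goes in the page's <head>; the header is set by the server:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex

Remember that a crawler must be able to fetch the page to see either signal, so don't combine noindex with a robots.txt Disallow on the same URL: a blocked crawler never sees the tag.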
What if I want to allow everything?
The simplest valid file is:
User-agent: *
Allow: /
You can omit the file entirely and Google will treat the resulting 404 as "all crawling allowed," but then you lose the Sitemap: reference and the explicit signal.
Last updated 2026-05-11