
robots.txt file

MetricSpot checks for /robots.txt at the root of your domain. It's the first file every crawler fetches — its absence isn't fatal, but it's a missed signal.

What this check does

Sends a GET request to https://yourdomain.com/robots.txt and confirms it returns HTTP 200 with a parseable robots.txt body. A missing file (404) or any other non-200 status fails the check.
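In Python terms, the logic is roughly the following standard-library sketch. This is an illustration, not MetricSpot's actual implementation; the timeout and error handling are assumptions:

```python
import urllib.error
import urllib.request
from urllib import robotparser

def is_valid_robots_response(status: int, body: str) -> bool:
    """True if the response looks like a usable robots.txt: HTTP 200 + parseable body."""
    if status != 200:
        return False  # 404 or any other non-200 status fails the check
    # robots.txt parsing is deliberately lenient (unknown lines are ignored),
    # so parsing here mainly confirms the body is text we can work with.
    parser = robotparser.RobotFileParser()
    parser.parse(body.splitlines())
    return True

def check_robots_txt(domain: str) -> bool:
    """Fetch https://<domain>/robots.txt and run the check."""
    try:
        with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
            return is_valid_robots_response(
                resp.status, resp.read().decode("utf-8", errors="replace")
            )
    except (urllib.error.URLError, TimeoutError):
        return False  # an unreachable host fails too
```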

Why it matters

robots.txt is the first URL every crawler — Googlebot, GPTBot, ClaudeBot, PerplexityBot, archive.org — fetches before scanning your site. It’s your one chance to:

  • Direct crawlers to your sitemap with a Sitemap: line, dramatically improving discovery for pages not linked from the homepage.
  • Block crawl traps: infinite calendars, faceted search filters, internal-search result pages.
  • Allow or disallow AI crawlers selectively (separate check).

Without a robots.txt, you’re saying “crawl whatever you find, in whatever order” — and crawlers waste budget on pages you don’t care about.

How to fix it

Create /public/robots.txt (or wherever your server serves static files from) with at minimum:

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
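You can confirm this baseline parses, and that the Sitemap: line is picked up, with Python's standard-library robotparser (the domain is a placeholder):

```python
from urllib import robotparser

BASELINE = """\
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = robotparser.RobotFileParser()
parser.parse(BASELINE.splitlines())

print(parser.can_fetch("Googlebot", "https://yourdomain.com/any/page"))  # True
print(parser.site_maps())  # ['https://yourdomain.com/sitemap.xml']
```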

That’s the “open site” baseline. To block specific paths:

User-agent: *
Disallow: /admin/
Disallow: /search?
Disallow: /cart/

Sitemap: https://yourdomain.com/sitemap.xml
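Before deploying rules like these, you can replay them with Python's standard-library robotparser to confirm the right paths are blocked. The URLs below are hypothetical, and robotparser is a simplified prefix matcher that doesn't model every nuance of Google's parser:

```python
from urllib import robotparser

RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /search?
Disallow: /cart/
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

for path in ("/products/shoes", "/admin/login", "/cart/checkout"):
    allowed = parser.can_fetch("*", f"https://yourdomain.com{path}")
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```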

Common patterns:

  • WordPress: WordPress serves an auto-generated virtual robots.txt unless a physical robots.txt file exists in the web root. SEO plugins such as Yoast and Rank Math let you edit it from the admin.
  • Next.js: create app/robots.ts exporting a MetadataRoute.Robots object.
  • Astro: drop a static public/robots.txt file.

After publishing, verify the file with Google Search Console's robots.txt report (under Settings), which replaced the standalone robots.txt Tester.

Frequently asked questions

Can I block crawlers I don’t want?

Yes: User-agent: GPTBot followed by Disallow: / tells that crawler to stay away. But this only works for crawlers that respect robots.txt, and a growing number of AI scrapers ignore it. For hard blocks, use server-level user-agent rules.
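For a server-level hard block, one common approach is matching the User-Agent header in your web server config. A minimal nginx sketch (the bot pattern and server details are illustrative; both directives go inside the http block):

```nginx
# Flag matching user agents (case-insensitive regex).
map $http_user_agent $blocked_bot {
    default     0;
    ~*GPTBot    1;
}

server {
    listen 443 ssl;
    server_name yourdomain.com;

    # Refuse flagged bots before any content is served.
    if ($blocked_bot) {
        return 403;
    }
}
```

Because the block happens at the server, it works even for crawlers that never read robots.txt; keep the robots.txt rule too so well-behaved bots still get the polite signal.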

Does Disallow: prevent indexing?

No. Disallow: blocks crawling, not indexing. A disallowed page can still appear in search results (with no description) if other sites link to it. To prevent indexing, use a noindex meta tag or an X-Robots-Tag: noindex header instead, and leave the page crawlable so crawlers can actually see the noindex directive.
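Concretely, the meta-tag form goes in the page itself:

```html
<!-- In the page's <head> -->
<meta name="robots" content="noindex">
```

The header form is set by the server and also works for non-HTML resources such as PDFs:

```http
X-Robots-Tag: noindex
```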

What if I want to allow everything?

The simplest valid file is:

User-agent: *
Allow: /

You can omit the file entirely and Google will treat it as “all crawling allowed,” but you also lose the sitemap reference and the explicit signal.

Last updated 2026-05-11