
Sitemap in robots.txt

MetricSpot checks robots.txt for a Sitemap: line. It's how Google, Bing, and most AI crawlers auto-discover your sitemap without you submitting it manually.

What this check does

Fetches /robots.txt and looks for one or more Sitemap: directives. Verifies that each URL is absolute (relative URLs aren’t allowed per the spec) and reachable.
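In outline, the check boils down to extracting the directives and validating each URL. A minimal TypeScript sketch of that logic (helper names are illustrative, not MetricSpot’s actual code; the reachability check is omitted):

```typescript
// Extract every Sitemap: directive from a robots.txt body.
// The directive name is case-insensitive per the spec.
function extractSitemaps(robotsTxt: string): string[] {
  return robotsTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice("sitemap:".length).trim());
}

// A sitemap URL must be absolute; relative URLs are invalid per the spec.
function isAbsoluteHttpUrl(url: string): boolean {
  try {
    const parsed = new URL(url); // throws on relative URLs
    return parsed.protocol === "https:" || parsed.protocol === "http:";
  } catch {
    return false;
  }
}
```

Anything that fails isAbsoluteHttpUrl is flagged, since some crawlers silently ignore a relative Sitemap: directive.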

Why it matters

Most crawlers discover your sitemap one of three ways:

  1. You submit it in Google Search Console / Bing Webmaster Tools.
  2. You declare it in robots.txt via Sitemap:.
  3. The crawler guesses /sitemap.xml as a last resort.

Method 2 is the one that scales to crawlers you don’t have an account with — Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, Applebot, Yandex, Baidu, and dozens of smaller indexers all read robots.txt before they crawl anything else. Declaring your sitemap there is one line of configuration that broadcasts the location to everyone at once.

If you skip it, a CMS-generated sitemap at a non-standard URL (/sitemap_index.xml, /wp-sitemap.xml, /sitemap-0.xml) may never be discovered by the smaller crawlers that don’t try alternative paths.

How to fix it

Append a Sitemap: line to /robots.txt. Use the absolute, canonical URL: same scheme (https), same hostname as your canonical (with or without www, whichever you use), and a URL that responds directly rather than redirecting.

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Multiple sitemaps are allowed. List each one, or list a sitemap index that references the others:

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml

Rules:

  • Absolute URL only. Sitemap: /sitemap.xml is invalid per the spec. Some crawlers tolerate it, others ignore the directive entirely.
  • Match the canonical scheme/host. If your site is https://www.example.com, don’t declare https://example.com/sitemap.xml. Google treats the sitemap as belonging to the host it’s declared on.
  • The Sitemap: directive is global, not scoped to a User-agent: block. Put it on its own line, at the top or bottom of the file — placement doesn’t matter.

nginx (serving a static robots.txt):

location = /robots.txt {
  alias /var/www/example.com/robots.txt;
}

Next.js (App Router) — dynamic robots.txt:

// app/robots.ts
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [{ userAgent: "*", allow: "/" }],
    sitemap: "https://example.com/sitemap.xml",
  };
}
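If the same app also generates its sitemap through the metadata file convention, a companion app/sitemap.ts keeps both in one place. A minimal sketch (the URLs below are placeholders):

```typescript
// app/sitemap.ts — companion to app/robots.ts above; example.com is a placeholder.
import type { MetadataRoute } from "next";

export default function sitemap(): MetadataRoute.Sitemap {
  return [
    { url: "https://example.com/", lastModified: new Date() },
    { url: "https://example.com/about", lastModified: new Date() },
  ];
}
```

Next.js serves this at /sitemap.xml, which matches the sitemap URL declared in robots.ts.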

Astro: put a literal public/robots.txt in your project, or generate one in src/pages/robots.txt.ts:

// src/pages/robots.txt.ts
import type { APIRoute } from "astro";

export const GET: APIRoute = ({ site }) => {
  const body = `User-agent: *
Allow: /

Sitemap: ${new URL("sitemap-index.xml", site).href}
`;
  return new Response(body, { headers: { "Content-Type": "text/plain" } });
};

If you use @astrojs/sitemap, it emits sitemap-index.xml — point the directive at the index, not at individual sitemaps.
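For reference, a typical configuration for that setup, assuming @astrojs/sitemap is installed (the integration requires a site value to build absolute URLs):

```typescript
// astro.config.ts — assumed setup; example.com is a placeholder.
import { defineConfig } from "astro/config";
import sitemap from "@astrojs/sitemap";

export default defineConfig({
  site: "https://example.com", // required by @astrojs/sitemap
  integrations: [sitemap()],   // emits /sitemap-index.xml at build time
});
```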

WordPress: Yoast SEO and Rank Math both add the Sitemap: line to the virtual robots.txt automatically. If you have a real /robots.txt file on disk, the plugin can’t override it — either delete the file or append the Sitemap: line manually.

Cloudflare Workers / Pages: robots.txt is just static text — drop it in your public/ directory. If you generate it via a Worker, set content-type: text/plain so crawlers parse it correctly.
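If you do generate it in a Worker, a minimal sketch in module syntax (the sitemap URL is a placeholder):

```typescript
// Worker sketch: serve robots.txt with an explicit text/plain content-type
// so crawlers parse it correctly. example.com is a placeholder.
const ROBOTS_BODY = [
  "User-agent: *",
  "Allow: /",
  "",
  "Sitemap: https://example.com/sitemap.xml",
  "",
].join("\n");

const worker = {
  async fetch(request: Request): Promise<Response> {
    const { pathname } = new URL(request.url);
    if (pathname === "/robots.txt") {
      return new Response(ROBOTS_BODY, {
        headers: { "content-type": "text/plain; charset=utf-8" },
      });
    }
    return new Response("Not found", { status: 404 });
  },
};

export default worker;
```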

Combine with Robots.txt file, XML sitemap, and Allow AI crawlers — the three together set up the entire crawl-discovery surface.

Frequently asked questions

Do I still need to submit the sitemap in Search Console if it’s in robots.txt?

Yes, for Google. Search Console submission gives you per-sitemap indexing stats, error reports, and “discovered, not indexed” diagnostics that auto-discovery doesn’t. Use both — robots.txt for the rest of the web, Search Console for visibility into Google specifically.

Can I have more than one Sitemap: line?

Yes. The spec allows multiple sitemap declarations. Either list them all, or list a sitemap index file that references the others (cleaner, easier to maintain).

What if my sitemap URL changes?

Update the Sitemap: line and Googlebot will pick it up the next time it reads robots.txt (usually within 24 hours). If you have an old sitemap submitted in Search Console, remove it manually — robots.txt only tells crawlers where the new one is, it doesn’t expire the old one.

Last updated 2026-05-11