Sitemap in robots.txt
MetricSpot checks robots.txt for a Sitemap: line. It's how Google, Bing, and most AI crawlers auto-discover your sitemap without you submitting it manually.
What this check does
Fetches /robots.txt and looks for one or more Sitemap: directives. Verifies the URL is absolute (relative URLs aren’t allowed per the spec) and reachable.
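Conceptually, the parsing step looks like the sketch below: minimal TypeScript assuming a standard fetch() environment. findSitemapDirectives is an illustrative name, not MetricSpot's actual code, and the reachability check on each URL is left out.

// A minimal sketch of the discovery step, not MetricSpot's implementation.
// Reachability of each returned URL would be verified separately.
async function findSitemapDirectives(origin: string): Promise<string[]> {
  const res = await fetch(new URL("/robots.txt", origin));
  if (!res.ok) return [];
  const text = await res.text();
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice("sitemap:".length).trim())
    // Per the spec, only absolute http(s) URLs count.
    .filter((url) => /^https?:\/\//i.test(url));
}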
Why it matters
Most crawlers discover your sitemap one of three ways:
- You submit it in Google Search Console / Bing Webmaster Tools.
- You declare it in robots.txt via Sitemap:.
- The crawler guesses /sitemap.xml as a last resort.
Method 2 is the one that scales to crawlers you don’t have an account with — Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, Applebot, Yandex, Baidu, and dozens of smaller indexers all read robots.txt before they crawl anything else. Declaring your sitemap there is one line of configuration that broadcasts the location to everyone at once.
If you skip it, a CMS-generated sitemap at a non-standard URL (/sitemap_index.xml, /wp-sitemap.xml, /sitemap-0.xml) may never be discovered by the smaller crawlers that don’t try alternative paths.
How to fix it
Append a Sitemap: line to /robots.txt. Use the absolute, canonical URL: same scheme (https), same hostname (with or without www, matching your canonical), and a URL that responds directly rather than through a redirect chain.
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Multiple sitemaps are allowed. List each one, or list a sitemap index that references the others:
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
Rules:
- Absolute URL only. Sitemap: /sitemap.xml is invalid per the spec. Some crawlers tolerate it, others ignore the directive entirely (see the validation sketch after this list).
- Match the canonical scheme/host. If your site is https://www.example.com, don’t declare https://example.com/sitemap.xml. Google treats the sitemap as belonging to the host it’s declared on.
- The Sitemap: directive is global, not scoped to a User-agent: block. Put it on its own line, at the top or bottom of the file; placement doesn’t matter.
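The first two rules are mechanical enough to check in code. A hypothetical helper illustrating the spec's rules, not any particular crawler's behavior:

// Hypothetical validator: absolute URL, and scheme/host matching the
// canonical origin. Illustration only.
function isValidSitemapDirective(value: string, canonicalOrigin: string): boolean {
  try {
    // new URL() throws on relative values, which the spec disallows.
    const url = new URL(value);
    // Scheme and host must both match the canonical origin exactly.
    return url.origin === canonicalOrigin;
  } catch {
    return false;
  }
}

isValidSitemapDirective("/sitemap.xml", "https://www.example.com");                        // false: relative URL
isValidSitemapDirective("https://example.com/sitemap.xml", "https://www.example.com");     // false: host mismatch
isValidSitemapDirective("https://www.example.com/sitemap.xml", "https://www.example.com"); // true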
nginx (serving a static robots.txt):
location = /robots.txt {
    alias /var/www/example.com/robots.txt;
}
Next.js (App Router) — dynamic robots.txt:
// app/robots.ts
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [{ userAgent: "*", allow: "/" }],
    sitemap: "https://example.com/sitemap.xml",
  };
}
Astro: put a literal public/robots.txt in your project, or generate one in src/pages/robots.txt.ts:
// src/pages/robots.txt.ts
import type { APIRoute } from "astro";

// `site` is populated from the `site` option in astro.config.
export const GET: APIRoute = ({ site }) => {
  const body = `User-agent: *
Allow: /
Sitemap: ${new URL("sitemap-index.xml", site).href}
`;
  return new Response(body, { headers: { "Content-Type": "text/plain" } });
};
If you use @astrojs/sitemap, it emits sitemap-index.xml — point the directive at the index, not at individual sitemaps.
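For reference, enabling that integration is a small change to the Astro config. A sketch assuming https://example.com as the canonical site:

// astro.config.mjs
import { defineConfig } from "astro/config";
import sitemap from "@astrojs/sitemap";

export default defineConfig({
  site: "https://example.com", // required: the integration needs it to emit absolute URLs
  integrations: [sitemap()],
});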
WordPress: Yoast SEO and Rank Math both add the Sitemap: line to the virtual robots.txt automatically. If you have a real /robots.txt file on disk, the plugin can’t override it — either delete the file or append the Sitemap: line manually.
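If you go the manual route, the appended line is just the plugin's index URL. Yoast and Rank Math both serve their index at /sitemap_index.xml by default (example.com stands in for your domain):

Sitemap: https://example.com/sitemap_index.xml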
Cloudflare Workers / Pages: robots.txt is just static text — drop it in your public/ directory. If you generate it via a Worker, set content-type: text/plain so crawlers parse it correctly.
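A minimal Worker sketch, assuming the module syntax and https://example.com as the canonical host; an illustration, not a drop-in config:

// worker.ts — serve robots.txt from a Worker with an explicit text/plain header.
export default {
  async fetch(request: Request): Promise<Response> {
    const { pathname } = new URL(request.url);
    if (pathname === "/robots.txt") {
      const body = "User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n";
      return new Response(body, { headers: { "content-type": "text/plain" } });
    }
    return new Response("Not Found", { status: 404 });
  },
};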
Combine with Robots.txt file, XML sitemap, and Allow AI crawlers — the three together set up the entire crawl-discovery surface.
Frequently asked questions
Do I still need to submit the sitemap in Search Console if it’s in robots.txt?
Yes, for Google. Search Console submission gives you per-sitemap indexing stats, error reports, and “discovered, not indexed” diagnostics that auto-discovery doesn’t. Use both — robots.txt for the rest of the web, Search Console for visibility into Google specifically.
Can I have more than one Sitemap: line?
Yes. The spec allows multiple sitemap declarations. Either list them all, or list a sitemap index file that references the others (cleaner, easier to maintain).
What if my sitemap URL changes?
Update the Sitemap: line and Googlebot will pick it up the next time it reads robots.txt (usually within 24 hours). If you have an old sitemap submitted in Search Console, remove it manually — robots.txt only tells crawlers where the new one is, it doesn’t expire the old one.
Last updated 2026-05-11