Parse and validate robots.txt for any domain. See every user-agent group, allow/disallow paths, sitemap directives, and crawl-delay. Detects common syntax mistakes per RFC 9309.
robots.txt is the single most consequential text file on your domain. A misplaced Disallow: / under User-agent: * tells Googlebot to stop crawling your entire site — pages drop out of the index within days, organic traffic collapses, and recovery takes weeks even after the fix. We see this happen several times a month in our scans, usually after a staging environment gets deployed to production with its "block everything" robots.txt still in place.
The flip side matters too: a robots.txt that allows everything but forgets the Sitemap: directive forces crawlers to discover your URLs the slow way, through internal links. Adding that one line to robots.txt (its position in the file does not matter) can speed up indexation of new pages from weeks to hours.
Crawlers fetch https://example.com/robots.txt before any other URL on a domain. The file must live at the root path; /blog/robots.txt or /en/robots.txt are ignored. Each subdomain has its own robots.txt — blog.example.com/robots.txt is separate from example.com/robots.txt.
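As a rough illustration of the location rule, the sketch below derives the robots.txt URL that governs any given page URL. The function name robots_txt_url is made up for this example; it assumes the input is an absolute URL with a scheme.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Scheme and host (including any port) are kept; path, query and fragment are dropped,
    # because only the file at the root path of that exact host counts.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://example.com/en/pricing?ref=nav"))  # https://example.com/robots.txt
print(robots_txt_url("https://blog.example.com/2024/post"))      # https://blog.example.com/robots.txt
```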
A User-agent: line starts a group of rules. Consecutive User-agent: lines with no rules between them share a single group, which is the spec rule most parsers get wrong. User-agent: * is the catch-all, applied when a crawler does not find its own name. A crawler that does find its own name, as in User-agent: Googlebot, follows that group and ignores the wildcard rules.
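The grouping rule can be sketched in a few lines of Python. This is a simplified illustration, not a full RFC 9309 parser: parse_groups and rules_for are hypothetical names, user-agent matching is a plain case-insensitive comparison, and directives other than Allow and Disallow are skipped.

```python
SAMPLE = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /search

User-agent: *
Disallow: /tmp/
"""

def parse_groups(text):
    """Split robots.txt text into (user_agents, rules) groups."""
    groups, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()          # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:                                # rules already seen: a new group starts
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())             # otherwise the line joins the open group
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups

def rules_for(groups, bot):
    """Rules from groups naming `bot`; falls back to the * group only if none do."""
    for wanted in (bot.lower(), "*"):
        if any(wanted in agents for agents, _ in groups):
            return [rule for agents, rules in groups if wanted in agents for rule in rules]
    return []

groups = parse_groups(SAMPLE)
print(rules_for(groups, "Bingbot"))      # [('disallow', '/search')]
print(rules_for(groups, "DuckDuckBot"))  # [('disallow', '/tmp/')]
```

Googlebot and Bingbot end up in one group because no rule separates their User-agent: lines.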
Paths are interpreted as URL prefixes. Disallow: /admin blocks /admin, /admin/, and /admin/users, and also /administrator, because matching is by character prefix rather than by directory. Disallow: / blocks the entire site. An empty Disallow: with no value allows everything. Allow: creates an exception inside a broader Disallow: when both match a URL, the longest matching path wins.
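A minimal sketch of prefix matching with the longest-match rule follows. It deliberately ignores wildcards, is_allowed is a hypothetical name, and the tie-break in favour of Allow follows the recommendation in RFC 9309 rather than anything stated on this page.

```python
def is_allowed(path, rules):
    """rules: ("allow" | "disallow", path_prefix) pairs from the group that applies."""
    best_len, allowed = -1, True               # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix:                         # an empty Disallow: matches nothing
            continue
        if path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_goes_to_allow = len(prefix) == best_len and kind == "allow"
            if longer or tie_goes_to_allow:
                best_len, allowed = len(prefix), kind == "allow"
    return allowed

rules = [("disallow", "/admin"), ("allow", "/admin/help")]
print(is_allowed("/admin/users", rules))     # False
print(is_allowed("/admin/help/faq", rules))  # True: the longer Allow prefix wins
print(is_allowed("/administrator", rules))   # False: plain character-prefix match
print(is_allowed("/pricing", rules))         # True
```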
* matches any sequence of characters; $ anchors the end of the URL. Disallow: /*? blocks every URL with a query string. Disallow: /*.pdf$ blocks PDFs but allows /page.pdf.html. Googlebot and Bingbot support wildcards; some legacy bots ignore them.
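One way to reason about the two wildcard characters is to translate a robots.txt pattern into a regular expression. The sketch below is an illustration under that assumption; pattern_to_regex and matches are hypothetical helpers, and only * and a trailing $ are treated as special.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Literal pieces are escaped; each * becomes "any run of characters".
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    # Without $, the pattern is a prefix match, so nothing is appended.
    return re.compile(regex + (r"\Z" if anchored else ""))

def matches(pattern, url_path):
    # re.match anchors at the start of the string, like a robots.txt path.
    return pattern_to_regex(pattern).match(url_path) is not None

print(matches("/*?", "/search?q=shoes"))          # True
print(matches("/*?", "/about"))                   # False
print(matches("/*.pdf$", "/files/report.pdf"))    # True
print(matches("/*.pdf$", "/page.pdf.html"))       # False
```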
Sitemap: https://example.com/sitemap.xml can appear anywhere in the file and is independent of user-agent groups. You can list multiple sitemaps — one per language, one for images, one for news. Always use the full absolute URL with scheme.
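Because Sitemap: lines are independent of groups, extracting them is a simple line scan. The sketch below shows one possible approach; sitemap_urls and is_absolute are hypothetical names, and the file contents are invented sample data.

```python
from urllib.parse import urlsplit

def sitemap_urls(text):
    """Collect every Sitemap: value, wherever it appears in the file."""
    urls = []
    for raw in text.splitlines():
        field, sep, value = raw.partition(":")       # split at the first colon only
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

def is_absolute(url):
    """True when the URL carries an http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

sample = """\
Sitemap: https://example.com/sitemap.xml
User-agent: *
Disallow: /tmp/
sitemap: /sitemap-images.xml
"""
for url in sitemap_urls(sample):
    print(url, "ok" if is_absolute(url) else "not absolute")
```

The second entry is flagged because it lacks a scheme and host.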
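Crawl-delay is the most misunderstood directive. Google ignores it entirely. The Search Console crawl-rate setting that used to be the alternative was retired in January 2024; Googlebot now sets its own rate from how your server responds, and temporarily returning 503 or 429 slows it down. Bing, Yahoo, and Yandex respect Crawl-delay (values are in seconds between requests).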
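For a crawler of your own that wants to honour the directive, Python's standard urllib.robotparser exposes the value per user agent. The robots.txt content below is invented sample data; the sketch only shows reading the delay, not a complete fetch loop.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: bingbot
Crawl-delay: 5
Disallow: /search

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())        # parse in-memory text instead of fetching

print(parser.crawl_delay("bingbot"))     # 5
print(parser.crawl_delay("Googlebot"))   # None: the * group sets no delay

# A polite client would sleep this many seconds between requests (0 if unset).
pause = parser.crawl_delay("bingbot") or 0
```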
The minimum useful robots.txt for a content or SaaS site: allow everything, declare your sitemap, and disallow only the paths crawlers have no reason to fetch.
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /*?utm_

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
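The wildcard group applies to every crawler. /admin/ and /api/ have no search value, so skipping them saves crawl budget. /*?utm_ stops crawlers from fetching URLs whose query string starts with a UTM parameter, which would otherwise be crawled as duplicates of the canonical URL; it does not match a UTM parameter that appears later in the query string, which needs a second rule such as /*&utm_. The two sitemaps cover the site and the blog independently.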
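To see what the three Disallow rules in this file actually cover, a quick check against a handful of paths helps. This is a sketch with hypothetical helper names and made-up test URLs; it handles * only, since none of the rules here use $.

```python
import re

DISALLOW = ["/admin/", "/api/", "/*?utm_"]     # the rules from the file above

def to_regex(pattern):
    # Literal text is escaped; each * becomes "any run of characters".
    return re.compile(".*".join(re.escape(piece) for piece in pattern.split("*")))

def blocked(path_and_query):
    return any(to_regex(p).match(path_and_query) for p in DISALLOW)

for url in ["/admin/login",
            "/pricing",
            "/pricing?utm_source=news",
            "/pricing?page=2&utm_source=news"]:
    print(url, "blocked" if blocked(url) else "allowed")
```

The last URL comes out as allowed: /*?utm_ requires utm_ to follow the question mark directly, so a UTM parameter in second position slips through.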
robots.txt is a plain-text file at the root of a domain (e.g. example.com/robots.txt) that tells web crawlers which paths they may or may not request. It is part of the Robots Exclusion Protocol, standardized as RFC 9309 in 2022.
robots.txt does not enforce anything. It is purely advisory: well-behaved crawlers respect it, but malicious bots and scrapers ignore it. Never use robots.txt to hide sensitive content; use authentication, robots meta noindex, or X-Robots-Tag instead.
Google does not support Crawl-delay, and the Search Console crawl-rate limiter was retired in January 2024, so Googlebot's rate now adapts automatically to how your server responds. Bing, Yahoo, and Yandex do respect Crawl-delay.
Listing your sitemap in robots.txt is worthwhile: a line such as "Sitemap: https://example.com/sitemap.xml" lets crawlers discover your sitemap without manual submission. You can list multiple Sitemap lines for multilingual or media-specific sitemaps.
Try it on popular domains: github.com, google.com, cloudflare.com, openai.com.