
Robots.txt Generator

Create and customize your robots.txt file to control how search engines crawl your website. Add rules, set user-agents, and include your sitemap URL.

Controlling the Crawlers That Index the Web

The Gatekeeper of Your Website

The robots.txt file is the very first document a search engine crawler requests when visiting your domain. Before Googlebot touches a single HTML page, it fetches https://yourdomain.com/robots.txt to learn what it is and is not allowed to crawl. Google alone runs over 130 trillion known pages through its index. At that scale, the crawl budget allocated to your site is finite: a typical small site might get 50–200 pages crawled per day, while a large e-commerce site might get 50,000+. A poorly configured robots.txt that blocks important pages, or wastes budget on irrelevant ones, directly limits how much of your content appears in search results.

Anatomy of the Syntax

User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml

The syntax follows the Robots Exclusion Protocol, originally proposed by Martijn Koster in 1994 and now an IETF standard (RFC 9309, published in 2022). Each block starts with User-agent:, specifying which crawler the rules apply to (* means all). Google supports pattern matching: Disallow: /*.pdf$ blocks all PDF files, and Disallow: /search?*sort= blocks sorted search result pages. The Allow: directive creates exceptions within broader blocks: Google processes the most specific matching path, so Allow: /admin/public/ overrides Disallow: /admin/.
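The longest-match precedence can be sketched in a few lines of Python. This is a simplified model, not a full RFC 9309 parser: most_specific_rule is a hypothetical helper, and it ignores wildcards and the empty-Disallow case.

```python
# Sketch of RFC 9309 rule evaluation: the longest matching path prefix wins.
# Simplified model for illustration; ignores wildcards and empty Disallow values.

def most_specific_rule(path, rules):
    """rules: list of (directive, prefix) tuples, e.g. ("Disallow", "/admin/").
    Returns the directive of the longest matching prefix; default is Allow."""
    best = ("Allow", "")  # the empty prefix matches every path
    for directive, prefix in rules:
        if path.startswith(prefix) and len(prefix) > len(best[1]):
            best = (directive, prefix)
    return best[0]

rules = [("Disallow", "/admin/"), ("Allow", "/admin/public/")]

print(most_specific_rule("/admin/settings", rules))     # -> Disallow
print(most_specific_rule("/admin/public/docs", rules))  # -> Allow
print(most_specific_rule("/blog/post", rules))          # -> Allow
```

Because /admin/public/ (14 characters) is longer than /admin/ (7 characters), the Allow rule wins for anything underneath it, which matches the override behavior described above.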

Crawl Budget Optimization

Crawl budget is the number of pages Googlebot will crawl on your site within a given timeframe. For a 10,000-page e-commerce site whose faceted navigation generates 500,000 URL variations (color, size, and price filters), failing to block those filter pages means Google spends roughly 98% of its crawl budget on duplicate, low-value URLs while your new product pages wait weeks to be discovered. Strategic robots.txt rules such as Disallow: /*?sort= and Disallow: /*?filter= can reclaim that budget. After implementing proper crawl budget optimization, sites commonly see new pages indexed within 24–48 hours instead of 2–3 weeks.

Common Mistakes That Kill SEO

The most catastrophic robots.txt mistake is a single line: Disallow: / under User-agent: *. This blocks all crawlers from your entire site—and it happens more often than you think, especially when a staging site's robots.txt gets deployed to production. Another common error is blocking CSS and JS files. Google needs these to render your pages; blocking them means Googlebot sees a broken page and may rank you lower. Finally, never use robots.txt to hide sensitive data—anyone can read your robots.txt, and listing Disallow: /secret-api/ actually advertises the path's existence.
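Both of the first two mistakes are easy to catch before deploy. Below is a minimal lint sketch for exactly those cases; lint_robots is a hypothetical helper, not a full parser (it ignores wildcards and per-bot groups beyond *).

```python
# Minimal pre-deploy lint for the two mistakes above: a global "Disallow: /"
# and blocked CSS/JS assets. Illustrative only; not a complete RFC 9309 parser.

def lint_robots(text):
    warnings = []
    agent = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agent = value
        elif field == "disallow" and agent == "*":
            if value == "/":
                warnings.append("Blocks ALL crawlers from the ENTIRE site")
            elif value.endswith((".css", ".js")) or value in ("/css/", "/js/"):
                warnings.append("Blocks render-critical assets: " + value)
    return warnings

staging = "User-agent: *\nDisallow: /\n"
print(lint_robots(staging))  # -> ['Blocks ALL crawlers from the ENTIRE site']
```

Wiring a check like this into CI is a cheap guard against the staging-file-in-production scenario.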

Frequently Asked Questions

What is the difference between Disallow and noindex?

Disallow in robots.txt prevents crawling—the bot never visits the page. But if external sites link to that URL, Google can still index it as a “URL-only” result with no snippet. The noindex meta tag or X-Robots-Tag header prevents indexing—the bot visits the page, sees the directive, and removes it from the index. Critically, you cannot combine the two: if you block crawling with robots.txt, Google will never see the noindex tag. To fully suppress a URL, allow crawling but add noindex.
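The interaction reduces to a small decision function. This is purely illustrative—the function name and return strings are shorthand for the outcomes described above, not Google terminology.

```python
# Decision table for the Disallow vs. noindex interaction described above.
# Key point: a noindex directive is invisible on a URL the bot cannot crawl.

def indexing_outcome(robots_blocked, noindex_present, externally_linked):
    """Returns a shorthand label for Google's likely treatment of the URL."""
    if robots_blocked:
        # The bot never fetches the page, so noindex_present is irrelevant here.
        return "URL-only result possible" if externally_linked else "not indexed"
    if noindex_present:
        return "removed from index"      # crawled, directive seen and honored
    return "eligible for indexing"

# Blocking crawl AND adding noindex does not suppress a linked URL:
print(indexing_outcome(True, True, True))    # -> URL-only result possible
# The reliable suppression recipe: allow crawling, serve noindex:
print(indexing_outcome(False, True, True))   # -> removed from index
```

The two print statements capture the counterintuitive result: adding noindex to a robots-blocked page changes nothing, because the directive is never seen.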

How do wildcard patterns work in robots.txt?

Google and Bing support two wildcard characters: * matches any sequence of characters, and $ anchors the match to the end of the URL. For example, Disallow: /*.pdf$ blocks all URLs ending in .pdf but not /pdf-guide/. The pattern Disallow: /catalog/*/reviews blocks review pages across all product categories. These patterns are not regular expressions—they support only * and $, not character classes or quantifiers.

How long does Google take to respect robots.txt changes?

Google caches your robots.txt and re-fetches it roughly every 24 hours, though the interval can vary. After updating the file, you can request a re-fetch via the robots.txt report in Google Search Console. If Google cannot fetch your robots.txt (HTTP 5xx error) for more than 30 days, it treats the file as permissive and crawls everything. A 4xx error (file not found) is also treated as fully open—meaning that if you accidentally delete your robots.txt, all previously blocked paths become crawlable immediately.
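The status-code behavior can be summarized as a lookup function. This is a sketch based on Google's documented handling; the function name and return strings are illustrative, and the short-term 5xx behavior (pausing crawling while the error persists) is an assumption drawn from Google's robots.txt documentation rather than from the text above. Redirect (3xx) handling is omitted for brevity.

```python
# Sketch of how the robots.txt fetch status maps to crawl behavior.
# Based on Google's documented handling; names and strings are illustrative.

def robots_fetch_policy(status, days_failing=0):
    if 200 <= status < 300:
        return "parse and apply rules"
    if 400 <= status < 500:
        return "no robots.txt: crawl everything"       # 4xx == fully open
    if status >= 500:
        if days_failing > 30:
            return "treat as permissive: crawl everything"
        # Assumed short-term behavior: crawling pauses while errors persist.
        return "treat as fully disallowed (pause crawling)"
    return "undefined"

print(robots_fetch_policy(404))      # -> no robots.txt: crawl everything
print(robots_fetch_policy(503, 5))   # -> treat as fully disallowed (pause crawling)
print(robots_fetch_policy(503, 45))  # -> treat as permissive: crawl everything
```

Note the asymmetry: a missing file (4xx) opens the whole site immediately, while a server error (5xx) only becomes permissive after a prolonged outage.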