robots.txt is a small, public text file at the root of your site that tells well-behaved crawlers what to crawl and what to skip. Used well, it keeps bots focused on valuable pages, reduces crawl noise from filters and internal search, and helps new or updated content get discovered faster. Used poorly, a single line can stall discovery or make large parts of your site fade from search over time.
This practical guide is written for beginners and busy marketers. You will learn what robots.txt is and is not, how the core rules work in plain English, which common paths to block, what never to block, and how to validate everything in Google Search Console. We include one safe starter template you can copy, simple patterns for messy parameters, and a short checklist for moving safely from staging to production. Throughout, we emphasise caution, clarity, and easy wins you can apply immediately.
Table of contents
- Robots.txt: what it is and where to find it
- Robots.txt warning: read this before you edit
- Robots.txt vs indexing: what it does and doesn’t
- Do you need a robots.txt file?
- Robots.txt template: one safe example to copy
- Robots.txt rules in plain English (User-agent, Disallow, Allow, Sitemap, Crawl-delay)
- Robots.txt for facets and parameters: simple patterns
- What to block in robots.txt vs what not to
- Robots.txt on multiple domains and subdomains
- Google Search Console: robots.txt checks and where to look
- Post-launch checks: coverage, rendering, “blocked by robots.txt”
- Staging to production: robots.txt checklist
- Robots.txt troubleshooting: does this rule really block?
- Robots.txt mistakes to avoid
- Quick deployment checklist
- Conclusion
Robots.txt: what it is and where to find it
The robots.txt file is always placed at the root of a website and is easy to find. In almost every case you can simply add /robots.txt to the end of your domain to check if it exists:
https://www.example.com/robots.txt
Quick check
- Open the URL in your browser. If the file exists, it loads as plain text.
- No file? Then your site probably does not use one, which is also allowed.
- Blocked (403) or missing (404)? Then the server or CMS needs a fix before crawlers can read it.
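If you prefer the command line, the same check can be done with curl (a minimal sketch using the example domain):
curl -I https://www.example.com/robots.txt
# Look for a 200 status and Content-Type: text/plain in the response headers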
Important to know
- Each host and subdomain needs its own robots.txt file (for example: example.com, www.example.com, blog.example.com).
- The file must always be at /robots.txt, never inside a folder like /seo/robots.txt.
- If you redirect from http to https or from non-www to www, make sure the robots.txt request also redirects correctly.
In practice, you can almost always reach the file via /robots.txt (for example: https://www.yourdomain.com/robots.txt).
If your site runs on WordPress with Elementor Hosting (or a similar managed environment), the process can differ slightly. For a step-by-step guide, see: How to Edit the Robots.txt File in Elementor Hosting.
Robots.txt warning: read this before you edit
One incorrect rule can hide large parts of your site from crawlers. A broad Disallow: / blocks all crawling for that user-agent. Blocking CSS or JS can break rendering and harm visibility. Never copy staging rules to production without review. Always validate changes on a test URL first.
Robots.txt vs indexing: what it does and doesn’t
Robots.txt only manages crawling, not indexing. This is one of the biggest misunderstandings in SEO. A robots.txt rule can stop Google from fetching a page, but it cannot guarantee that the page will stay out of the index.
- Crawling: If you block a page in robots.txt, Google will not read its content. However, if that page is linked elsewhere, the URL can still show up in search, usually with just the URL and no snippet.
- Indexing: To fully keep a page out of search results, you must allow it to be crawled and then use a noindex tag (in HTML) or an X-Robots-Tag header (example below). That way Google sees the instruction and removes the page from the index.
- Noindex in robots.txt does not work: Google ignores any Noindex: directives inside a robots.txt file.
- Avoid conflicts: If you both block a page in robots.txt and add a noindex tag, Google cannot see the tag. The block wins, but the page may still be indexed as a bare URL. If you want noindex to work, do not block the page in robots.txt.
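The X-Robots-Tag mentioned above is an HTTP response header, useful for non-HTML files such as PDFs where a meta tag is not possible. A minimal sketch of the header value (noindex alone is enough, since following links is the default; how you set the header depends on your server or CMS):
X-Robots-Tag: noindex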
Practical tip: For sensitive but crawlable pages like /thank-you or /checkout, do not disallow them in robots.txt. Instead, add:
<meta name="robots" content="noindex,follow">
Noindex,follow is the most common and recommended setup. It tells search engines not to index the page, while still allowing link signals to flow through. This is useful for pages such as thank-you pages, checkout steps, or other thin utility pages where you want to preserve internal linking value but keep the page itself out of search results.
Large e-commerce stores sometimes choose noindex,nofollow for specific sections. With thousands of products, layered navigation, and endless filter combinations, these sites can generate millions of low-value or duplicate URLs. By blocking both indexing and link following, they prevent crawl budget from being wasted on filter pages that add no unique value. This helps Google focus on high-priority product and category pages, keeps the site architecture cleaner, and avoids passing link equity into an infinite web of filter URLs.
Noindex options compared (quick reference)
- noindex,follow
- Indexing: page not indexed
- Links: links on the page are followed; link equity can flow
- Use for: most thank-you pages, checkout steps, thin utility pages
- Why: keeps the page out of search while preserving internal linking signals
- noindex,nofollow
- Indexing: page not indexed
- Links: links on the page are not followed
- Use for: large webshops with massive filter paths or duplicate structures
- Why: prevents crawl budget being spent on near-duplicate filter paths; keeps architecture focused
Do you need a robots.txt file?
Technically, you do not need a robots.txt file at all. If no file is present, search engines will attempt to crawl everything they can discover through links and sitemaps. For very small websites, this can be perfectly fine, because the site is simple, crawl paths are limited, and there is little risk of wasting crawl budget.
However, most websites benefit from having a well-structured robots.txt file. It allows you to reduce unnecessary crawling of low-value pages (such as internal search results, duplicate filter URLs, or temporary folders) and to ensure that crawlers can always access essential assets like CSS and JavaScript. By keeping your file minimal and focused on clear priorities, you can improve crawl efficiency, protect sensitive sections of your site, and make it easier for search engines to discover the content that truly matters. The best practice is to start with a simple template and only add extra rules when you have a proven case, such as parameter explosions in e-commerce filters or staging environments that must stay out of the index.
Robots.txt template: one safe example to copy
A robots.txt file should be clear, minimal, and tailored to your site’s needs. Below is a conservative starting point you can adapt. Do not assume it’s perfect for your setup — always validate it before going live.
# Default rules for all compliant crawlers
User-agent: *
# Block internal search results and temporary folders
Disallow: /search
Disallow: /internal/
Disallow: /tmp/
# Allow essential assets for proper rendering
Allow: /wp-content/uploads/
Allow: /*.css$
Allow: /*.js$
# Reference your sitemap(s)
Sitemap: https://www.example.com/sitemap.xml
# Add additional sitemap lines if needed (e.g., for products, languages, images)
# Sitemap: https://www.example.com/sitemap_products.xml
Why this works: This template blocks obvious low-value paths while ensuring that CSS, JS, and media files remain crawlable. Most importantly, it explicitly lists your sitemap. Adding the sitemap here helps search engines discover and crawl your most important pages more efficiently, especially on large websites with many URLs. Without it, crawlers may take longer to find new or updated content.
Warnings
- Never copy staging rules to production: A broad Disallow: / will block everything.
- Do not block assets: Avoid disallowing CSS, JS, or images that Google needs for rendering.
- Check redirects: If you force HTTPS or www/non-www, make sure /robots.txt also resolves correctly.
- Keep it minimal: Add new rules only when you have confirmed problems such as parameter explosions or crawl loops.
- Always include a sitemap: This ensures search engines can quickly discover the canonical structure of your site.
Robots.txt rules in plain English
User-agent
Declares which crawler a group of rules applies to. Use * for all bots, or a specific agent like Googlebot. You can have multiple groups, each with their own rules.
User-agent: Googlebot
Disallow: /internal/
User-agent: *
Disallow: /tmp/
Disallow
Defines the path prefix that a bot must not crawl. Matching starts immediately after the domain, is case-sensitive, and supports * as a wildcard and $ to mark the end of the URL.
Disallow: /private/
Disallow: /*?*session_id=
Disallow: /*/sort/*$
Allow
Lets you reopen a subpath under a broader Disallow. Google and most modern crawlers support it. Always test how non-Google crawlers behave if they matter to your site.
Disallow: /wp-content/
Allow: /wp-content/uploads/
Sitemap
Lists the absolute URL(s) of your XML sitemap. While optional, including a sitemap in robots.txt is highly recommended. It gives crawlers a direct pointer to your most important URLs and helps speed up discovery.
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://blog.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-products.xml
Do you need multiple sitemaps?
- Yes, if: your site has multiple languages, distinct sections (e.g. blog vs. shop), or a very large catalog that is split across multiple files. Listing them all in robots.txt ensures search engines don’t miss critical areas.
- No, if: you already use a single sitemap index file (sitemap_index.xml) that links to all sub-sitemaps. In that case, reference only the index in robots.txt – crawlers will automatically find the rest.
Best practice: Always include at least one sitemap or a sitemap index in robots.txt. For large or multilingual sites, referencing either the index or all major sitemaps minimizes the risk of coverage gaps.
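If you use an index file, a single reference is enough (a sketch assuming the sitemap_index.xml filename mentioned above):
Sitemap: https://www.example.com/sitemap_index.xml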
Crawl-delay
Asks crawlers to wait a set number of seconds between requests. Google ignores Crawl-delay, and its old Search Console crawl rate setting has been retired, so for Google you manage crawl rate by optimizing server performance and caching. Some other engines may still honor the directive.
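For engines that do honor it, the directive is a per-group line giving the number of seconds to wait between requests (a sketch; the bot name and value are placeholders):
User-agent: ExampleBot
Crawl-delay: 10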
Robots.txt for facets and parameters: simple patterns
Faceted navigation (filters) and URL parameters can quickly create thousands of duplicate or near-duplicate URLs. This is one of the biggest risks for large e-commerce stores: wasted crawl budget and diluted ranking signals. Instead of blocking everything, target only the parameters that add no unique search value.
Sort and order parameters
Disallow: /*?*sort=
Disallow: /*?*order=
Why: Sorting options (e.g., price ascending, newest first) don’t change the underlying products, only their order. Blocking these avoids hundreds of duplicate category URLs.
Filter facets (optional)
Disallow: /*?*color=
Disallow: /*?*size=
Why: Facets like color or size often generate thin or duplicate pages. Unless you’ve created unique, index-worthy landing pages (e.g., “red dresses”), block them to keep crawlers focused on the main categories.
Tracking parameters
Disallow: /*?*utm_
Disallow: /*?*gclid=
Disallow: /*?*fbclid=
Why: Marketing and analytics tags (from Google Ads, Facebook, email campaigns, etc.) never change the content. They only generate endless duplicate URLs. Blocking them prevents crawl waste.
Pagination
# Usually allow – don't block ?page=
# Disallow: /*?*page= not recommended
Why: Pagination lets crawlers discover deeper products within a category. Blocking ?page= may hide products from Google. Only block it if you are 100% sure products are accessible via other crawlable links (e.g., load-more or infinite scroll with proper markup).
Best practice: Keep robots.txt rules as specific as possible and combine them with other tools:
- Use rel="canonical" to consolidate duplicate variations (see the example after this list).
- Note that Google Search Console's URL Parameters tool has been retired, so rely on canonicals, targeted robots.txt rules, and consistent internal linking to manage parameters.
- Test key URLs in URL Inspection to confirm that valuable pages remain crawlable.
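For reference, a filtered URL such as /dresses/?color=red could point back to its unfiltered category with a canonical link element (a sketch; the paths are illustrative):
<link rel="canonical" href="https://www.example.com/dresses/">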
What to block in robots.txt vs what not to
Common candidates to block
- Internal search (/search, ?q=): Creates endless URL variations with little unique content. Blocking prevents crawl budget waste.
- Temporary or test folders (/tmp/, /internal/): Contain staging files or experiments not meant for indexing.
- Sessioned or printer-friendly duplicates (?session_id=, ?print=): These produce duplicate versions of the same page, often with thin or identical content.
- Cart and checkout steps: Not useful for search engines. If these pages already carry noindex tags, keep them crawlable so the tag can be seen (as in the practical tip earlier); block them in robots.txt only when you are not relying on noindex.
Do not block
- Essential static assets (CSS, JS, images): Google needs these files to properly render your site. Blocking them can harm Core Web Vitals and structured data detection.
- Primary category, product, and content pages: These are your most valuable SEO pages. Blocking them will remove them from discovery and ranking.
- Canonical pages: If a page is the canonical version of duplicates, crawlers must be able to access it. Blocking it in robots.txt undermines deduplication signals.
Key takeaway: Robots.txt should reduce crawl waste, not limit visibility of your best content. Block technical or duplicate paths, but always keep essential assets and canonical pages accessible.
Robots.txt on multiple domains and subdomains
Every host and subdomain needs its own robots.txt file, served from its root. A robots.txt on the main domain does not apply to subdomains or CDNs.
- Per host: https://example.com/robots.txt, https://www.example.com/robots.txt, and https://blog.example.com/robots.txt are all separate files.
- Language subdomains: International sites on en.example.com or de.example.com require their own robots.txt and sitemap declarations (see the sketch after this list).
- CDNs and asset domains: Ensure these do not block essential files (CSS, JS, images) needed for rendering. A blocked CDN can break how Google interprets your site.
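As a sketch, a blog subdomain serves its own file at its own root, with its own sitemap reference (the rules shown are only illustrative):
# Served at https://blog.example.com/robots.txt
User-agent: *
Disallow: /tmp/
Sitemap: https://blog.example.com/sitemap.xml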
Google Search Console: robots.txt checks and where to look
Google Search Console (GSC) is your main tool for testing and validating robots.txt rules in practice.
- URL Inspection > Test Live URL: Confirms whether a page is blocked by robots.txt and whether Google can render all required resources.
- Indexing > Pages: Check status groups such as “Blocked by robots.txt” to see which URLs are affected. Drill down into examples to spot patterns.
- Sitemaps: Submit your sitemap index to help crawlers discover valid URLs quickly. Errors here often reveal mismatches between your sitemap and robots.txt.
Note: Google’s legacy robots.txt Tester was retired. Today you should rely on live testing in GSC, combined with server log analysis and controlled rollouts.
Post-launch checks: coverage, rendering, “blocked by robots.txt”
- Inspect a few key templates with URL Inspection to confirm content and structured data are visible.
- Verify that CSS and JS assets are crawlable and that Google renders the site as intended.
- Check the “Blocked by robots.txt” group in GSC and confirm only the paths you intended appear there.
- Monitor crawl stats for anomalies such as spikes from infinite parameter paths or crawl gaps on important sections.
Staging to production: robots.txt checklist
Moving from staging to production is where many mistakes happen. Use this checklist to stay safe:
- On staging: Restrict access with IP allowlisting or password protection. If you use Disallow: /, keep it on staging only (see the sketch after this list).
- Before go-live: Remove staging blocks, update all Sitemap: URLs to point to the live domain, and confirm /robots.txt serves a valid file with 200 OK.
- After go-live: Test representative URLs with GSC URL Inspection and monitor coverage trends closely for the first few weeks.
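For reference, a typical staging-only file looks like this (as noted above, it must never reach production):
# Staging only – never deploy to production
User-agent: *
Disallow: /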
Robots.txt troubleshooting: does this rule really block?
- Check delivery: curl -I https://www.example.com/robots.txt should return 200 OK and Content-Type: text/plain (see the sketch after this list).
- Confirm match: Remember matching is case-sensitive; only * and $ are supported wildcards.
- Test live: Use GSC URL Inspection to verify whether robots.txt is the reason a URL is blocked.
- Log analysis: Server logs reveal whether bots are still attempting to access blocked paths and whether disallows are respected in practice.
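To see at a glance which rules the live file actually serves, a quick command-line sketch (assuming the example domain):
# Fetch the live file and list every Disallow rule it contains
curl -s https://www.example.com/robots.txt | grep -i "disallow"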
Robots.txt mistakes to avoid
- Accidentally using Disallow: / on production and blocking the entire site.
- Blocking CSS, JS, or image folders that Google needs to render and evaluate your site.
- Expecting Noindex: in robots.txt to work (it is ignored).
- Using overly broad rules that block valid canonical pages.
- Relying on Crawl-delay for Google (it is ignored).
- Placing absolute URLs in Disallow rules (only paths are valid; see the example after this list).
- Forgetting that each host and subdomain requires its own robots.txt file.
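For the absolute-URL mistake in particular, the difference looks like this (the path is illustrative):
# Wrong – absolute URLs are not valid in Disallow
Disallow: https://www.example.com/private/
# Right – use the path only
Disallow: /private/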
Quick deployment checklist
- Back up your current robots.txt before making any changes.
- Deploy updates to staging or a test environment first.
- Validate a sample of URLs with GSC URL Inspection to confirm behavior.
- Re-submit your sitemaps and review indexing reports over the following days.
Conclusion
A robots.txt file controls crawling, not indexing. Keep your rules lean: block only what truly wastes crawl budget, avoid blocking essential assets, and use noindex where removal from search is the goal. Start with a safe template, validate in Google Search Console, and monitor coverage after each update.
For managed WordPress environments such as Elementor Hosting, follow platform-specific steps and confirm that your file is actually served at /robots.txt. For a step-by-step guide, see: How to Edit the Robots.txt File in Elementor Hosting.
Next steps: Deepen your technical SEO knowledge with these related guides: