Robots.txt Best Practices for SEO

Use of the robots.txt file typically falls into two opposing schools of thought: either its entries are taken for granted as rote but necessary directives for search engines to follow (and required by CMSs), or there is an omnipresent, and not unjustified, fear of placing any entry in the file at all, lest it block search engine access to something critical on the site.

robots.txt talking robot (image credit: robotstxt.org)

What's missing in this paradoxical way of thinking is a middle way that uses the robots.txt for the good of the SEO campaign.

Many robots.txt best practices are well established, and yet we continue to see incorrect information spread in prominent places, such as this recent article on SEW. Several points in the piece are either fundamentally wrong or things we strongly disagree with.

How To Use The Robots.txt File For SEO

There are several best practices that should first be covered:

  • As a general rule, the robots.txt file should never be used to handle duplicate content. There are better ways.
  • Disallow statements within the robots.txt file are hard directives, not hints, and should be thought of as such. Directives here are akin to using a sledgehammer.
  • No equity will be passed through URLs blocked by robots.txt. Keep this in mind when dealing with duplicate content (see above).
  • Using robots.txt to disallow URLs will not prevent them from being displayed in Google's search results (see below for details).
  • When Googlebot is specified as its own user agent, it ignores the rules in the generic group and follows only the rules in its own group. For example, this Disallow directive applies to all user agents:

User-Agent: *
Disallow: /

  • However, the following directives apply differently: all other user agents are blocked from the entire site, while Googlebot ignores the generic group and is blocked only from /cgi-bin/:

User-Agent: *
Disallow: /

User-Agent: Googlebot
Disallow: /cgi-bin/

  • Use care when disallowing content. Use of the following syntax will block the directory /folder-of-stuff/ and everything located within it (including subsequent folders and assets):

Disallow: /folder-of-stuff/
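To make that concrete, here are a few hypothetical URLs that would all be blocked by that single rule:

# Blocked by "Disallow: /folder-of-stuff/"
#   /folder-of-stuff/
#   /folder-of-stuff/page.html
#   /folder-of-stuff/sub-folder/report.pdf
#   /folder-of-stuff/images/photo.jpg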

  • Limited pattern matching is supported. This means you can use wildcards to block all content with a specific extension; for example, the following directive will block PowerPoint files:

Disallow: /*.ppt$
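A couple of other illustrative wildcard patterns, following Google's documented syntax (the paths and parameter names here are made up):

# Block all PDFs anywhere on the site
Disallow: /*.pdf$
# Block any URL containing a sessionid parameter
Disallow: /*?sessionid=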

robots.txt is a sledgehammer

  • Always remember that robots.txt is a sledgehammer and is not subtle. There are often other tools at your disposal that can do a better job of influencing how search engines crawl, such as the parameter handling tools within Google and Bing Webmaster Tools, the meta robots tag, and the x-robots-tag response header.
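For reference, here is roughly what two of those alternatives look like in practice (the noindex value shown is just one example of a directive you might send). In the HTML head of a page:

<meta name="robots" content="noindex, follow">

The same instruction can be delivered as an HTTP response header, which is handy for non-HTML assets such as PDFs:

X-Robots-Tag: noindex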

Setting A Few Facts Straight

Let's correct a few statements the previously cited SEW article got wrong.

Wrong:

"Stop the search engines from indexing certain directories of your site that might include duplicate content. For example, some websites have "print versions" of web pages and articles that allow visitors to print them easily. You should only allow the search engines to index one version of your content."

Using robots.txt for duplicate content is almost always bad advice. Rel canonical is your best friend here, and there are other methods. The example given is especially important: publishers with print versions should always use rel canonical to pass equity properly, as these often get shared and linked to by savvy users.
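As a quick illustration (the example.com URL is made up), the print version of an article would point back to the primary version from its <head>:

<!-- on the print version of the page -->
<link rel="canonical" href="http://www.example.com/articles/robots-txt-best-practices/" />

That way the print URL can still be crawled and shared, but the equity it accumulates is consolidated onto the primary article.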

Wrong:

"Don't use comments in your robots.txt file."

You should absolutely use comments in your robots.txt file; there is no reason not to. In fact, comments here can be quite useful, much like commenting source code. Do it! Take this excerpt from Adobe's robots.txt, for example:

# The use of robots or other automated means to access the Adobe site
# without the express permission of Adobe is strictly prohibited.
# Details about Googlebot available at: http://www.google.com/bot.html
# The Google search engine can see everything
User-agent: gsa-crawler-www
Disallow: /events/executivecouncil/
Disallow: /devnet-archive/
Disallow: /limited/
Disallow: /special/
# Adobe's SEO team rocks

Wrong:

"There's no "/allow" command in the robots.txt file, so there's no need to add it to the robots.txt file."

There is a well-documented Allow directive for robots.txt. This can be quite useful, for example, if you want to disallow URLs based on a matched pattern but allow a subset of those URLs. The example given by Google is:

User-agent: *
Allow: /*?$
Disallow: /*?

... where any URL that ends with a ? is crawled (Allow), and any URL with a ? somewhere in the path or parameters is not (Disallow). This works because Google honors the most specific (longest) matching rule, so the longer Allow pattern wins for URLs that end in a ?. To be fair, this is an advanced case where something like Webmaster Tools may work better, but having this type of constraint available is helpful when you need it. Allow is most definitely 'allowed' here.
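To illustrate with a couple of hypothetical URLs:

# http://www.example.com/shoes?          matches the longer Allow rule, so it is crawled
# http://www.example.com/shoes?color=red only matches the Disallow rule, so it is blocked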

Robots.txt and Suppressed Organic Results

Blocked content can still appear in search results, leading to a poor user experience in some cases. When Googlebot is blocked from a particular URL, it has no way of accessing the content. When other pages link to that URL, it often appears in the index anyway, but without snippet or title information. It becomes a so-called "suppressed listing" in organic search.

URLs blocked with robots.txt in Google's index

One important note: while robots.txt will create these undesirable suppressed listings, use of meta robots noindex will keep URLs from appearing in the index entirely, even when links point to those URLs (astute readers will note this is because meta noindex URLs are crawled). However, using either method (meta noindex or robots.txt disallow) creates a wall that prevents the passing of link equity and anchor text. It is effectively a PageRank dead end.

Common Gotchas with Robots.txt

  • As described above, if Googlebot is specified as its own user agent, it follows only the rules in that group and ignores all other directives in the file.
  • Limited pattern matching is supported. Wildcards (*) and the end-of-URL marker ($) will work, but full regular expression syntax (such as ^ or character classes) will not.
  • Ensure CSS files are not blocked in robots.txt. For similar reasons, JavaScript assets that assist in rendering rich content should also be kept out of disallow statements, as blocking them can cause problems with snippet previews.
  • It may sound obvious, but exclude content carefully. This directive will block the folder "stuff" and everything beneath it (note trailing slash):

Disallow: /folder/stuff/
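The trailing slash matters because robots.txt matching is prefix-based; compare these two hypothetical rules:

Disallow: /folder/stuff/  # blocks /folder/stuff/ and everything beneath it
Disallow: /folder/stuff   # also blocks /folder/stuffed-animals.html, because it matches the prefix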

  • Verify your syntax with a robots.txt testing tool. Sadly, Google will remove the robots.txt tool from within Webmaster Tools. This is a bit of a loss, as it has been a quick and handy way to double-check syntax before pushing robots.txt changes live.
  • Remember that adding disallow statements to a robots.txt file does not remove content; it simply blocks access for spiders. Oftentimes when there is content that you want removed, it's best to use a meta noindex and wait for the next crawl. (Just make sure the URL is not also disallowed in robots.txt, or the crawler will never see the noindex tag.)
