From the very beginning, search engines – Google and every engine that came before and after it, whether it grew or died – needed some way to find pages to index. The earliest catalogs were fed manually, by site submissions and staffs of human editors. With the size and scale of the Internet, however, this was almost immediately unsustainable. Instead, search engines developed crawlers, web spiders, software scanners; the entities we call robots today.
In order to limit the behavior of these robots – to tell them what to do and what not to do on a given site – the robots exclusion protocol was created. Originally an informal convention from 1994, it has been updated since, and was formally standardized as RFC 9309 in 2022; related conventions, such as the rel=nofollow link attribute, have grown up alongside it.
How do you use the robots.txt file appropriately?
When a search engine crawler visits your website, the very first thing it does is strip the URL down and look for a robots.txt file. So if a robot found a link to your site, www.example.com/blog-post-category/blog-title/, on another site and followed it, its first action would be to strip everything after the hostname, leaving it with www.example.com. Once it has that bare hostname, it appends robots.txt to the end: www.example.com/robots.txt. This is the only valid location for a robots.txt file; if you place it anywhere else on your site, the search engine will not find it. Note that each hostname, including each subdomain, needs its own robots.txt file.
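That stripping step can be sketched in a few lines of Python. The function name is my own and real crawlers are more involved, but the idea is the same:

```python
from urllib.parse import urlparse, urlunparse

def robots_txt_url(page_url: str) -> str:
    """Reduce a page URL to its scheme and host, then point at /robots.txt."""
    parts = urlparse(page_url)
    # Drop the path, query, and fragment; keep only scheme://host
    return urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))

print(robots_txt_url("https://www.example.com/blog-post-category/blog-title/"))
# https://www.example.com/robots.txt
```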
Consider these three scenarios:
1. You have no robots.txt file at all.
2. You have a robots.txt file but it’s blank.
3. You have a robots.txt file with two lines: a wildcard under User-agent and nothing under disallow.
These three scenarios all work exactly the same. When a search robot comes to your site, it will look for the robots.txt file. If it finds no file, finds it empty, or finds it with nothing disallowed, the robot is then free to crawl your entire site. Nothing will be hidden or excluded from indexing. If this is fine for you – or if you’re using on-page noindex directives – you’re free to skip the robots.txt file entirely. It’s good practice to have at least a basic robots.txt file, however, so you may want to include one with nothing disallowed.
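Written out, the third scenario’s file is just two lines:

• User-agent: *
• Disallow:

A Disallow line with no value disallows nothing, so every robot is free to crawl everything.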
As an additional note: robots.txt is case sensitive. Robots.TXT is not a valid filename for your directives.
The first line in your robots.txt file is typically “User-agent: *”. This line specifies which bot the directives that follow apply to. The * is a wildcard meaning they apply to every robot that visits your site. Almost every site will use a single user-agent line with a wildcard rather than specifying directives for individual bots. There are hundreds of bots you could name, and trying to direct their behavior individually is a quick way to bloat your file and waste your time.
You can specify certain bots if you don’t want your site to show up in specific searches, though the cases where you may want to do this are rare.
Any line in your robots.txt file that follows a user-agent line typically begins with Disallow:. Whatever follows the colon is a path prefix that you’re telling the search engines not to crawl. For example:
• Disallow: /etc will tell the search engines to ignore anything in the /etc folder.
• Disallow: /photos will tell the search engines to ignore anything in the /photos folder.
• Disallow: / will tell the search engines to ignore everything on your site.
Most basic robots.txt files will tell the search engine to ignore some directories that are unnecessary to the display or content of your site, but must be there for the back-end systems to work. Folders such as /cgi-bin/ and /tmp/ fall into this category.
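A minimal file along those lines might read as follows (the folder names are common examples; your site’s actual back-end directories may differ):

• User-agent: *
• Disallow: /cgi-bin/
• Disallow: /tmp/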
If, for example, each user on your site has their own subfolder, you may want to disallow those folders by default. An entry for this may be Disallow: /~username/. This tells the search spiders to ignore anything in that user folder. You could likewise add Disallow: /confidential/ to hide any confidential documents you don’t want indexed online.
There’s one huge flaw with this plan: your robots.txt file is publicly accessible. It has to be, for the web bots to find it and use it. That means anyone can visit your site and read the robots.txt file in plaintext. If you disallow confidential document folders or user profiles, those URL paths are visible in your txt file, and anyone can follow them and view your documents.
Never use disallow commands as the sole means of protecting your files. At the very least, you should also put these folders behind a password, so the average unauthorized user can’t access them.
If there’s a disallow, there must be an allow, right? Well, yes and no. There is an Allow directive, but support for it is uneven: major crawlers such as Google’s honor it, while many simpler bots ignore it entirely; they treat the absence of a disallow as permission to index.
When might you want to use the allow command? Say you have a documents folder, /docs/. It’s full of documents you don’t want the Internet at large to see, but there’s one document in that folder you want shared and indexed, /sharedoc.txt. The proper syntax for allowing Google to see that file would be:
• User-agent: *
• Disallow: /docs/
• Allow: /docs/sharedoc.txt
This only works reliably for crawlers that honor Allow, and it should not be relied on throughout your site. In general, you will want to use Allow sparingly or not at all.
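If you want to check how a given parser resolves an Allow against a Disallow, Python’s standard library ships one in urllib.robotparser. One hedge: Google documents longest-match precedence, but some parsers – including Python’s – evaluate rules top-down, so the Allow line is listed first here to get the same result under either rule:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /docs/sharedoc.txt
Disallow: /docs/
""".splitlines()

rp = RobotFileParser()
rp.modified()  # mark rules as loaded; without this, can_fetch() assumes nothing may be fetched
rp.parse(rules)

print(rp.can_fetch("*", "/docs/sharedoc.txt"))  # True  - explicitly allowed
print(rp.can_fetch("*", "/docs/secret.pdf"))    # False - caught by Disallow: /docs/
print(rp.can_fetch("*", "/blog/"))              # True  - no rule matches
```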
Search engines will follow links, and they will record that they have followed those links. The disallow command only tells the search engines not to crawl and index the content of the pages in a given directory. Disallow: /sharedocs/ would tell the search engine to ignore the contents of that folder, but it will still note that the folder exists. A disallowed page can still accumulate PageRank from inbound links, but because it is never crawled, it cannot pass them on.
If you want a page to be functionally invisible to the search engines, use the noindex meta command on the page instead. This keeps the URL out of the search results, as well as the content of the page. One caveat: the crawler has to be able to fetch the page to see the tag, so a page carrying noindex must not also be blocked in robots.txt.
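The noindex directive is an HTML meta tag placed in the page’s head:

• &lt;meta name="robots" content="noindex"&gt;

Google also honors an equivalent X-Robots-Tag HTTP response header, which is useful for non-HTML files such as PDFs.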
Additionally, malicious web crawlers will simply ignore your robots.txt directives. Robots.txt is not a security tool; it is a tool for controlling what Google and other legitimate crawlers see.
• Have a robots.txt. Not having one gives you no control.
• Use a wildcard for the bot directives. There’s no sense in specifying different behaviors.
• Never disallow your whole site. Disallow: / keeps your site out of the rankings and destroys any progress you may have made towards ranking.
• Disallow junk directories. Anything including system files or files you don’t want indexed should be disallowed.
• For individual pages, skip robots.txt and use meta noindex instead.
Creating a robots.txt file by hand is simple, or you can generate one using an online tool.