When a search engine crawler visits a site, it first looks for a special file on that site. That file is called robots.txt, and it tells the search engine spider which web pages of the site it may crawl and which it should ignore. The robots.txt file is a plain text file (.txt) and must be placed in the root directory of the website or blog, because that is the only place crawlers look for it. For example: http://www.yourblog.com/robots.txt
Robots.txt is important from a Search Engine Optimization point of view. By telling search engines which directories or files to skip, you keep their spiders focused on the content you actually want to appear in search results, which may help the way your site is crawled and indexed. Google and the other major search engines honor the file.
How to create a robots.txt file?
Don’t have a robots.txt file? You can create one easily. Remember, the robots.txt file is a simple text file. The entries in a robots.txt file are called “records”. You can create the file in a plain text editor like Notepad. Of course, you can also use one of the online robots.txt generators if you don’t want to write your own!
For example, if you want to give all search engine spiders the same access to your whole site, use the following lines in your robots.txt file:
User-agent: *
Disallow:
A robots.txt record contains the instructions for one or more search engines. Each record consists of two kinds of fields: a User-agent line and one or more Disallow lines. In the above example, by putting an asterisk (*) in the User-agent line, we make the record apply to all search engine spiders. A blank Disallow field means nothing is blocked, so search engines may crawl all pages!
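To see how a crawler might read this record, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name "ExampleBot" and the page URLs are made up for illustration, and real search engine spiders may interpret some rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The allow-everything record from the example above.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

# "ExampleBot" is a hypothetical crawler name; "*" applies to any name.
home_ok = parser.can_fetch("ExampleBot", "http://www.yourblog.com/")
post_ok = parser.can_fetch("ExampleBot", "http://www.yourblog.com/some-post/")
print(home_ok, post_ok)
```

Both checks should come back true, since the blank Disallow line blocks nothing.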
Another example:
User-agent: googlebot
Disallow: /cgi-bin/
This robots.txt file lets “googlebot”, the search engine spider of Google, retrieve every page from your site except for files in the “cgi-bin” directory. Spiders that are not named in any record are not restricted by it.
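The same record can be checked with a short sketch, again using Python's standard urllib.robotparser module. The file names under the site and the crawler name "OtherBot" are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# The googlebot record from the example above.
GOOGLEBOT_ONLY = """\
User-agent: googlebot
Disallow: /cgi-bin/
"""

rules = RobotFileParser()
rules.parse(GOOGLEBOT_ONLY.splitlines())

# Anything under /cgi-bin/ is off limits for googlebot...
blocked = rules.can_fetch("googlebot", "http://www.yourblog.com/cgi-bin/form.pl")
# ...but the rest of the site is still open to it.
allowed = rules.can_fetch("googlebot", "http://www.yourblog.com/about/")
# No record names this hypothetical crawler, so nothing is blocked for it.
other = rules.can_fetch("OtherBot", "http://www.yourblog.com/cgi-bin/form.pl")
print(blocked, allowed, other)
```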
Important:
- Don’t change the order of commands in your records. Never use ‘Disallow’ before ‘User-agent’.
- Don’t use any white space at the beginning of a line.
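- Don’t rely on the ‘Allow’ command alone. It was not part of the original robots.txt standard. The major spiders such as Googlebot understand it, but some other crawlers may ignore it.
- Don’t list every file under a directory one by one. Name the directory instead.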
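The first two rules above are easy to check mechanically. The following is only a rough sketch of such a check, not a full validator; the function name lint_robots and the warning messages are invented for this example.

```python
def lint_robots(text):
    """Return (line_number, message) warnings for a robots.txt body."""
    warnings = []
    seen_user_agent = False  # True once the current record has a User-agent line
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            seen_user_agent = False  # a blank line ends the current record
            continue
        if raw[0] in " \t":
            warnings.append((number, "line starts with white space"))
        line = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            continue  # comment-only line
        field = line.split(":", 1)[0].strip().lower()
        if field == "user-agent":
            seen_user_agent = True
        elif field == "disallow" and not seen_user_agent:
            warnings.append((number, "Disallow appears before User-agent"))
    return warnings


print(lint_robots("Disallow: /cgi-bin/\nUser-agent: *\n"))
```

A record written in the right order, such as "User-agent: *" followed by "Disallow: /cgi-bin/", should produce no warnings.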
A typical robots.txt file for a WordPress blog may contain the following information:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads

# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*

# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*

# digg mirror
User-agent: duggmirror
Disallow: /

Sitemap: http://www.yourblog.com/sitemap.xml
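If you want to test a file like this before uploading it, one option is Python's standard urllib.robotparser module. The sketch below uses a shortened version of the WordPress file; the wildcard lines are left out because this parser matches paths by simple prefix and does not expand ‘*’ the way Google does. The crawler name "SomeBot" and the page paths are assumptions for illustration, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the WordPress example, without wildcard rules.
WORDPRESS = """\
User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Allow: /wp-content/uploads

User-agent: Googlebot-Image
Disallow:

Sitemap: http://www.yourblog.com/sitemap.xml
"""

wp = RobotFileParser()
wp.parse(WORDPRESS.splitlines())

site = "http://www.yourblog.com"
# "SomeBot" is a hypothetical crawler that falls under the "*" record.
admin_ok = wp.can_fetch("SomeBot", site + "/wp-admin/options.php")
upload_ok = wp.can_fetch("SomeBot", site + "/wp-content/uploads/photo.jpg")
# Googlebot-Image has its own record with a blank Disallow line.
image_bot_ok = wp.can_fetch("Googlebot-Image", site + "/wp-admin/options.php")
sitemaps = wp.site_maps()
print(admin_ok, upload_ok, image_bot_ok, sitemaps)
```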
Presently, I am allowing all search engines to visit EarnBlogger. But you don’t have to do the same! For more information, visit the robots.txt site. You can also search the Internet. I’m sure there is plenty of useful information waiting for you! Feel free to give your comments and suggestions.