Discover "Robots.txt" and Why Every Website Should Have One

Do you have pages on your website you do not want to be displayed in search results?  You know, those pages that are never intended to be seen by people unless they have completed some function on your site.

For example, you have a contact form on your website.  When someone completes it and clicks the "Submit" button, the form is submitted and the viewer is taken to a "thank you" page.  That "thank you" page really doesn't help you in the search engine arena.

In fact, your website would be better served if a more important page (like your home page) appeared when people search keywords relevant to your website.  Usually this is the case, but sometimes a useless page from your site can appear higher than your home or products/services page.  Sure, it brings visitors to your site.  But you would see more visitors if a more beneficial page were displayed in the search results.

And it's not just the thank you pages that are indexed by Google and other search engines.  Other pages and files on your site can be picked up as well.  Directories like "_private", "stats", and "images" are regularly visited and indexed by search engine robots, web crawlers and spiders.

So what, you say?  Well, understand that top search engines do not typically list every page of a website when producing results for a particular keyword or phrase.  Usually, they return the pages that have more incoming links than the other pages of the website (e.g. your home page), but this is not always the case.  Some studies have suggested that most search engines index only about 16% of each website.

If your website lets search engines choose for themselves which pages and files to index, the aforementioned useless pages may be competing with your home page (index.htm or default.htm) and other more important pages.

Got Your Robots.txt File?

Fortunately, there is a way to reduce the competition for your home page and other relevant pages when search engines display results for your website.  The answer: a robots text file.

A robots what?  Yeah, a robots text file.  Websites are basically made up of files and pages.  One file is your home page, as mentioned earlier (index.htm or default.htm).  Another may be "aboutus.htm" or "contactus.htm".  Look at your browser's address bar when you visit one of your website's pages on the world wide web.  Look to the far right side of the address until you see xyz.htm (where xyz is your page name).

These files are stored on the server where your site is hosted.  For example, on my website there are subdirectories such as www.webdesignerlive.com/images/, www.webdesignerlive.com/_private/ and www.webdesignerlive.com/stats/.  And when someone signs up for one of my website plans, they complete an online Agreement.  Upon clicking the "Submit Agreement" button, they are automatically sent to a "thank you" page confirming that the agreement was completed and sent.  One example is www.webdesignerlive.com/thanks-agreement-silverwebsitedesigner.htm.

Clearly, I don't need that page or those other files indexed by search engines.  So, I added a special file called "robots.txt" to the root directory of my website.  This file sits behind the scenes of my website and isn't linked from any page on the site.  In that file I identify the pages or files I do not want indexed.  Well-behaved search engine robots and spiders then skip those files when they look through my website.
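
Robots look for this file at one fixed address: the root of the site.  As a rough sketch (the helper name robots_url is made up for illustration), that address can be worked out from any page URL with Python's standard library:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the file always sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.webdesignerlive.com/images/"))
# http://www.webdesignerlive.com/robots.txt
```

A robots.txt file placed inside a subdirectory such as /images/ would not be found there, which is why the file belongs in the root directory.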

What are Robots, Web Crawlers and Spiders?

Robots, Web Crawlers and Spiders are essentially one and the same.  They are programs that automatically peruse, or "crawl", the world wide web, indexing the files they find and returning the information to the search engine's database.

There are thousands of robots operating at any given moment.  Probably the most important robot is the one operated by Google, called "Googlebot".  Other crawler-based search engines include Teoma and AlltheWeb, and each has its own robot searching web pages throughout the world.

Wikipedia describes robots as:

This process is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a website, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

The beauty of robots is that they identify all the links on a site and queue them for later crawling.  Google uses at least a portion of this concept to rank websites and pages.  Part of its algorithm takes into consideration the number and relevance of the sites linking to a page.  If a particular website has 25 links from similar websites pointing to it, then Google may rank that site higher than one with only 5 links coming in.
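
To make the link-gathering step concrete, here is a minimal sketch of how a crawler might pull the links out of one page.  It uses Python's built-in html.parser module; the class name LinkCollector, the helper extract_links and the sample page are all made up for illustration, and a real robot does far more than this.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """Return all link targets found in an HTML string, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

# Hypothetical page content, for illustration only.
sample = '<p><a href="/aboutus.htm">About</a> <a href="/contactus.htm">Contact</a></p>'
print(extract_links(sample))
# ['/aboutus.htm', '/contactus.htm']
```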

To summarize, every website should have a robots.txt file.  In it are instructions for web crawlers, bots, spiders, etc.  These instructions tell the robots what to index and what not to index.  Ultimately, your website's key pages are listed in the search results while insignificant pages are left alone.  With a good robots.txt file, the results for your site within the search engines are optimized!

Setting Up a Robots.txt File on Your Website

If you are having WebDesignerLive.com develop your website, you have nothing to worry about.  We add a carefully constructed robots.txt file to every website we design.  Some of our website plans include a free Google sitemap, which takes advantage of your robots.txt file.

If you are designing your own website, here are the instructions for creating a robots.txt file and configuring it.

The first thing you need to do is create a new blank file in the root directory of your website, and save it as "robots.txt".  Make sure you save it as a text (.txt) file and not a hypertext (.htm or .html) file.

Next, open the robots.txt file you just created and add the following two lines as plain text:

User-agent: *
Disallow:

The user agent "*" represents all robots.  In other words, the first line, "User-agent: *", tells all search engine bots that the rules that follow apply to them.

The "Disallow" line, which in the example above has nothing following it, tells all search engine bots they can index every URL (page, file, etc.) of your website.  Each path you list starts with "/", which stands for the root of your site.  If you want to keep them from indexing your "contact" page, the code would look something like this:

User-agent: *
Disallow: /contact.htm
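
If you would like to check rules like these before uploading them, Python ships with a small robots.txt reader, urllib.robotparser.  The sketch below is only an illustration: the helper name allowed and the robot name "AnyBot" are made up, and each search engine's own parser may differ in detail.

```python
from urllib.robotparser import RobotFileParser

def allowed(rules, agent, url):
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)

allow_all = "User-agent: *\nDisallow:"
block_contact = "User-agent: *\nDisallow: /contact.htm"
no_slash = "User-agent: *\nDisallow: contact.htm"

print(allowed(allow_all, "AnyBot", "/contact.htm"))      # True
print(allowed(block_contact, "AnyBot", "/contact.htm"))  # False
print(allowed(block_contact, "AnyBot", "/index.htm"))    # True
print(allowed(no_slash, "AnyBot", "/contact.htm"))       # True
```

The last line shows why the leading "/" matters: at least in Python's reader, a path written without it never matches, so the page is not blocked.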

If you wanted to disallow other pages, or files, it would look like this (an example from this website):

User-agent: *
Disallow: /stats/
Disallow: /_private/
Disallow: /agreement-basicwebsitedesigner.htm
Disallow: /agreement-bronzewebsitedesigner.htm
Disallow: /agreement-silverwebsitedesigner.htm
Disallow: /agreement-goldwebsitedesigner.htm
Disallow: /agreement-platinumwebsitedesigner.htm
Disallow: /agreement-starterwebsitedesigner.htm
Disallow: /thanks.htm
Disallow: /thanks-agreement-basicwebsitedesigner.htm
Disallow: /thanks-agreement-bronzewebsitedesigner.htm
Disallow: /thanks-agreement-silverwebsitedesigner.htm
Disallow: /thanks-agreement-goldwebsitedesigner.htm
Disallow: /thanks-agreement-platinumwebsitedesigner.htm
Disallow: /thanks-agreement-starterwebsitedesigner.htm
Disallow: /thanks-quote.htm
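
A long list like the one above can be generated rather than typed by hand.  Here is a minimal sketch, assuming a helper named build_robots_txt (hypothetical, not part of any library) and the page names from the example:

```python
def build_robots_txt(paths, agent="*"):
    """Return robots.txt text that disallows each path for one user agent."""
    lines = [f"User-agent: {agent}"]
    for path in paths:
        # Rule paths are matched from the site root, so ensure a leading slash.
        if not path.startswith("/"):
            path = "/" + path
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"

plans = ["basic", "bronze", "silver", "gold", "platinum", "starter"]
paths = ["/stats/", "/_private/"]
paths += [f"agreement-{plan}websitedesigner.htm" for plan in plans]
paths += ["thanks.htm"]
paths += [f"thanks-agreement-{plan}websitedesigner.htm" for plan in plans]
paths += ["thanks-quote.htm"]

print(build_robots_txt(paths))
```

The printed text can be saved as robots.txt in the root directory of the site.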

If you did not want a particular robot to index your site:

User-agent: Googlebot
Disallow: /

Where the "/" indicates your entire website.
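
As a quick check of what this rule does, the sketch below feeds it to Python's built-in urllib.robotparser.  "OtherBot" is a made-up robot name used only for comparison.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is shut out of every page; robots not named are unaffected.
print(parser.can_fetch("Googlebot", "/index.htm"))  # False
print(parser.can_fetch("OtherBot", "/index.htm"))   # True
```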

If you did not want a particular robot to index certain parts of your site:

User-agent: Googlebot
Disallow: /images/
Disallow: /private/
Disallow: /stats/
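
To see which pages remain open to Googlebot under rules like these, you could filter a list of URLs with Python's urllib.robotparser.  This is only a sketch, and the URLs in the list are hypothetical examples.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot",
    "Disallow: /images/",
    "Disallow: /private/",
    "Disallow: /stats/",
]

parser = RobotFileParser()
parser.parse(rules)

# Hypothetical URLs on the site, for illustration only.
urls = ["/index.htm", "/images/logo.gif", "/stats/", "/aboutus.htm"]

# Keep only the URLs Googlebot is still allowed to fetch.
crawlable = [url for url in urls if parser.can_fetch("Googlebot", url)]
print(crawlable)
# ['/index.htm', '/aboutus.htm']
```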

For more information about the robots.txt file and the different rules, please visit the SEO Consultants website.