Protecting Your Website with a Robots.txt File






Do You Need a Robots.txt File?


Do you need a robots.txt file? Is it important? Yes, to both questions. In fact, this file is an important element in the health of your website. This article gives an overview of the purpose of the robots.txt file and examples of how to build this simple file.


Robots.txt file purpose

The main purpose of a robots.txt file is to give spiders/robots instructions on what they can and cannot crawl. Well-behaved robots request this file from the main directory of your website before they crawl anything else. You can use it to:

1. Exclude pages that are still under construction from being indexed
2. Exclude directories or pages that you do not want indexed
3. Exclude search engines that you may not want your website to appear in (more on that later)
4. Tell search engines (bots) where they can and cannot go within your website

However, not all spiders are polite, and some are just downright rude. For the rude bots you need a stronger method of dissuading them from taking what you claim is off limits. This is where the htaccess file comes into play (see the htaccess article on this site).

Bots that do mind their manners look first to the robots.txt file on your server before entering.
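As a rough illustration of where a polite bot looks, the sketch below builds the address of the robots.txt file from any page address on the same site. The function name robots_txt_url and the example.com address are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # robots.txt sits in the main (root) directory of the site, so keep
    # the scheme and host and replace everything that comes after them.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/articles/page.html?id=7"))
```

Whatever page the bot is about to visit, the file it checks first is the one at the top of that site.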



Building a Robots.txt file

Building a robots.txt file is relatively simple. You use a few commands either to tell a robot that it is welcome to visit your site, or to tell a robot/spider to please go away because it is not welcome.


*If you want all robots to index your pages, you can use the following commands:

User-agent: *
Disallow:

*If you want to disallow a robot from your site, you would use the following code:

User-agent: specificbadbot
Disallow: /

(The "/" is needed because it matches every path on your site.)
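One detail worth checking in your own file: keep each User-agent line and its Disallow lines together with no blank line between them, because some parsers treat a blank line as the end of a record. The sketch below uses Python's standard urllib.robotparser module as a stand-in for a polite bot. Real search engine crawlers may be more forgiving, and the example.com address and the SomeOtherBot name are only placeholders.

```python
from urllib.robotparser import RobotFileParser


def can_visit(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask Python's robots.txt parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


together = "User-agent: specificbadbot\nDisallow: /\n"
split_up = "User-agent: specificbadbot\n\nDisallow: /\n"
page = "http://www.example.com/index.html"

# Robot name and rule on adjacent lines: the bad bot is shut out,
# while robots that are not named stay welcome.
print(can_visit(together, "specificbadbot", page))
print(can_visit(together, "SomeOtherBot", page))

# A blank line in between: this parser drops the rule entirely.
print(can_visit(split_up, "specificbadbot", page))
```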



This illustrates what you need to put in your robots.txt file if you want to disallow the following directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
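
To see what those lines do, here is a minimal sketch that runs a few made-up page addresses through Python's standard urllib.robotparser module. The bot name ExampleBot and the example.com addresses are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
"""


def blocked_urls(robots_txt, user_agent, urls):
    """Return the urls that user_agent is asked not to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/images/logo.gif",
    "http://www.example.com/private/notes.html",
    "http://www.example.com/articles/robots.html",
]
for url in blocked_urls(ROBOTS_TXT, "ExampleBot", urls):
    print("off limits:", url)
```

Only the addresses inside the listed directories come back as off limits. Everything else on the site stays open to the bot.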


If you have one bot that is trying to bomb your stats, you can place the following in your robots.txt file:


User-agent: SpammerBot # replace the 'SpammerBot' with the actual user-agent of the bot
Disallow: /images/


*If you don't want a complete directory to be indexed you would put the following code. The "nogoindex" is the directory you don't want indexed.


User-agent: Googlebot
Disallow: /nogoindex/

The forward slash at the beginning starts the path at the top of your site, and the forward slash at the end limits the rule to that directory and everything inside it.
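
The trailing slash matters because robots.txt rules are matched against the beginning of the path. The sketch below, using Python's standard urllib.robotparser module as a stand-in for a polite bot, compares the rule with and without the final slash. The page names page.html and nogoindex-news.html are hypothetical examples.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(disallow_path: str, url: str) -> bool:
    """Check url against a one-rule robots.txt aimed at Googlebot."""
    parser = RobotFileParser()
    parser.parse(["User-agent: Googlebot", f"Disallow: {disallow_path}"])
    return parser.can_fetch("Googlebot", url)


inside = "http://www.example.com/nogoindex/page.html"
lookalike = "http://www.example.com/nogoindex-news.html"

# With the trailing slash, only the directory's contents are blocked.
print(googlebot_allowed("/nogoindex/", inside))
print(googlebot_allowed("/nogoindex/", lookalike))

# Without it, any path that merely starts with "/nogoindex" is blocked too.
print(googlebot_allowed("/nogoindex", lookalike))
```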


*If you don't want Google to index a page you would put the following code.
(The "nogoindex" being the directory and the "donotenterpage" being the page you don't want Google to Index.)

User-agent: Googlebot
Disallow: /nogoindex/donotenterpage.html


After you have created your robots.txt file, be sure to validate it. What can happen if it is not valid? If you forget a line of code and Googlebot hits your file, you may find your site has been de-indexed. To safeguard yourself against such an incident, there are several places where you can have your robots.txt file validated:

Free Validation tools:

Clockwatchers

Search Engine Promotions


Note: There are also many robots.txt generator tools if you don't feel comfortable creating your own file at the very beginning. Just search Google for "robots.txt generator".
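
For a quick check on your own computer before you upload the file, a very small checker can catch the most common slips. The sketch below is only that, a sketch: the function name lint_robots_txt is made up, and the KNOWN_FIELDS list is my own assumption of commonly seen fields, not an official list. It does not replace a proper validation tool.

```python
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots_txt(text):
    """Return a list of (line_number, message) problems found in text."""
    problems = []
    in_record = False  # True once the current record has named a robot
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            in_record = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            continue  # comment-only line
        if ":" not in line:
            problems.append((number, "missing ':' between field and value"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            in_record = True
            if not value:
                problems.append((number, "User-agent has no robot name"))
        elif field in ("disallow", "allow") and not in_record:
            problems.append((number, f"'{field}' has no User-agent line above it"))
    return problems


for number, message in lint_robots_txt("User-agent: *\nDisalow: /private/\n"):
    print(f"line {number}: {message}")
```

An empty list means none of these particular mistakes were found. It does not prove that every bot will read the file the way you intended.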


Google Rules with HTML

I've read other sites that say not to use META tags to tell the search engines that you do not want a page indexed. But when I checked Google Webmaster, they do say to use META tags within the head section for the following:

To prevent robots from indexing your page you would:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">


To allow robots to index the page on your site, but instruct them not to follow the links on that page:


<META NAME="ROBOTS" CONTENT="NOFOLLOW">

To index the page on your site but not the images on that page:


<META NAME="ROBOTS" CONTENT="NOIMAGEINDEX">
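
If you want to confirm which of these instructions a page is actually sending, the following minimal sketch reads the robots META tag out of a page's HTML using Python's standard html.parser module. The class and function names are made up for this example, and the sample page is hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the lowercase robots directives declared in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
print(sorted(robots_directives(page)))
```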


Rules and Search Engines


In a perfect world, all rules would be followed. But alas, not all spiders follow the rules of the robots.txt file. How do you know that a spider is not following your rules?

Check your stats. Are you finding:
  • A recurring IP address hitting your site (the hits are sometimes unrealistically out of proportion to the rest of the hits)
  • A spider hitting your site and pulling data without your permission
  • A robot scraping your content
  • A spider consuming your bandwidth
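
If your stats package does not make this easy to see, you can count hits per IP address straight from the server's access log. The sketch below assumes a log where the visitor's IP address is the first item on each line (as in the common Apache log layout). The function names, the sample lines, and the 50 percent threshold are all assumptions to adjust for your own site.

```python
from collections import Counter


def hits_per_ip(log_lines):
    """Count requests per IP, taking the first field of each line as the IP."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def suspicious_ips(log_lines, share=0.5):
    """Return IPs responsible for more than `share` of all hits."""
    counts = hits_per_ip(log_lines)
    total = sum(counts.values())
    return [ip for ip, n in counts.most_common() if total and n / total > share]


sample_log = (
    ['203.0.113.9 - - [01/Jan/2012:10:00:00 +0000] "GET /images/a.gif HTTP/1.1" 200 512'] * 8
    + ['192.0.2.10 - - [01/Jan/2012:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 2048']
    + ['198.51.100.4 - - [01/Jan/2012:10:00:09 +0000] "GET /index.html HTTP/1.1" 200 2048']
)
print(suspicious_ips(sample_log))
```

An address that accounts for most of your traffic is not automatically a bad bot, but it is the first place to look.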

If you ignore it and the robots are causing a major ruckus on your server, you may get an email from your web hosting provider that says, "Did you know that your content is being stolen?" If you get this message, the bad bots have already taken more than they should have. It's time to swallow hard and create a robots.txt file, or find someone to create it for you.

If you find a visiting robot is not following your robots.txt rules, it is time to use the htaccess file to prevent it from entering your site at all.
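
One way to build the list of robots to block is to compare what each visitor requested against your own rules. The sketch below assumes you have already pulled (user-agent, path) pairs out of your log. It uses Python's standard urllib.robotparser module to apply the rules, and the bot names and paths are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def rule_breakers(robots_txt, requests):
    """Return the user-agents that requested a path the rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent for agent, path in requests if not parser.can_fetch(agent, path)}


robots_txt = "User-agent: *\nDisallow: /private/\n"
requests = [
    ("PoliteBot", "/index.html"),
    ("GrabberBot", "/private/data.html"),
    ("GrabberBot", "/index.html"),
]
print(rule_breakers(robots_txt, requests))
```

Any name this turns up is a candidate for a stronger block in the htaccess file, since that bot has shown it will not respect robots.txt.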


 Relevant Articles

Spamming, Your Stats, Website, Blog and Social Media















Copyright © 2005-2012 All Rights Reserved My Affiliate Place | Website Protection-Robots.txt