Protecting Your Website with a Robots.txt File
Do You Need a Robots.txt File?
Do you need a robots.txt file? Is it important? Yes, to both questions.
In fact, this file is an important part of keeping your website healthy.
This article gives you an overview of the purpose of the robots.txt file
and examples of how to build this simple file.
Robots.txt file purpose
The main purpose of a robots.txt file is to give spiders/robots instructions on what they
can or cannot crawl. Well-behaved robots look for this file in the main directory of your
website before crawling anything else; if they find no file, they generally assume the whole
site is open to them. You can use the file to:
1. Exclude pages that are still under construction
2. Exclude directories or pages that you do not want indexed
3. Exclude search engines that you may not want your website to
appear in (more on that later)
4. Tell search engines (bots) where they can and cannot go within your website
However, not all spiders are polite, and some are just downright
rude. For the rude bots you need a stronger method of dissuading
them from taking what you claim is off limits. This is where the
htaccess file comes into play (link to the article on the htaccess
file is below).
The bots that do mind their manners, however, look first to the
robots.txt file on your server before entering.
Building a Robots.txt file
Building a robots.txt file is relatively simple. You use a handful of
commands either to tell a robot "please visit my site, you are
welcome" or to tell a robot/spider "please go away, you're no
longer welcome."
*If you want all robots to index your pages,
you can use the following two lines.
User-agent: *
Disallow:
*If you want to disallow a robot from your site you would use the following code:
User-agent: specificbadbot
Disallow: /
(The "/" is needed because it means "all directories". If you leave it out, the empty Disallow line allows everything.)
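If you would like to see how a polite robot reads these rules, here is a minimal sketch in Python using the standard library's urllib.robotparser module. The bot name FriendlyBot and the path /page.html are made up for illustration, and a real crawler may handle edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The two rule groups shown above, combined into one file:
# shut out one named bot, leave everyone else unrestricted.
RULES = """\
User-agent: specificbadbot
Disallow: /

User-agent: *
Disallow:
"""


def may_crawl(rules: str, user_agent: str, path: str) -> bool:
    """Return True if the rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "FriendlyBot" and "/page.html" are hypothetical names for illustration.
    for bot in ("specificbadbot", "FriendlyBot"):
        print(bot, may_crawl(RULES, bot, "/page.html"))
```

Remember that this only models a bot that chooses to obey the file. A rude bot can ignore it completely.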
This illustrates what you need to put in your robots.txt file if you want to keep all robots out of the following directories.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
If you have one bot that is trying to bomb your stats, you can place the following in your robots.txt file:
User-agent: SpammerBot # replace the 'SpammerBot' with the actual user-agent of the bot
Disallow: /images/
*If you don't want a complete directory to be indexed you would put the following code.
The "nogoindex" is the directory you don't want indexed.
User-agent: Googlebot
Disallow: /nogoindex/
The forward slashes at the beginning and at the end tell the search engine to skip
the whole directory, including everything inside it.
*If you don't want Google to index a page you would put the following code.
(The "nogoindex" being the directory and the "donotenterpage"
being the page you don't want Google to Index.)
User-agent: Googlebot
Disallow: /nogoindex/donotenterpage.html
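The difference between blocking a directory and blocking a single page is easy to check for yourself. This is a rough sketch using Python's urllib.robotparser; the helper name googlebot_may_crawl and the sibling page otherpage.html are my own inventions for the example, and Google's real crawler may not match this parser in every detail.

```python
from urllib.robotparser import RobotFileParser


def googlebot_may_crawl(disallowed: str, path: str) -> bool:
    """Check path against a single Googlebot Disallow rule."""
    rules = ["User-agent: Googlebot", f"Disallow: {disallowed}"]
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch("Googlebot", path)


if __name__ == "__main__":
    # "otherpage.html" is a hypothetical sibling page for illustration.
    page = "/nogoindex/donotenterpage.html"
    sibling = "/nogoindex/otherpage.html"
    for rule in ("/nogoindex/", page):
        print(rule, googlebot_may_crawl(rule, page), googlebot_may_crawl(rule, sibling))
```

With the directory rule, both pages are off limits. With the page rule, only donotenterpage.html is blocked and the rest of the directory stays open.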
After you have created your robots.txt file, be sure to validate it. What can
happen if it is invalid? If you forget or mistype a line and Googlebot reads your file, you may find
your site has been de-indexed. To safeguard yourself
against such an incident, there are several places where you can have your robots.txt file validated:
Free Validation tools:
Clockwatchers
Search Engine Promotions
Note: There are also a lot of robots.txt generator tools if you
don't feel comfortable creating your own at the very beginning. Just
do a search in Google on Robots.txt generator tools.
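Before you hand the file to an online validator, a quick local check can catch the most common slips, such as a misspelled command or a rule with no User-agent line above it. The sketch below is only an illustration in Python: the function name lint_robots and the set of field names it accepts are my own assumptions about what commonly appears in robots.txt files, not an official list.

```python
# Field names commonly seen in robots.txt files (an assumption, not a standard list).
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots(text: str) -> list[tuple[int, str]]:
    """Return (line number, problem) pairs for suspicious lines."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_agent = True
            if not value:
                problems.append((number, "empty User-agent"))
        elif field in ("allow", "disallow"):
            if not seen_agent:
                problems.append((number, "rule appears before any User-agent line"))
            elif value and not value.startswith(("/", "*")):
                problems.append((number, "path should start with '/'"))
    return problems


if __name__ == "__main__":
    sample = "User-agent: *\nDisalow: /private/\nDisallow: images/\n"
    for number, problem in lint_robots(sample):
        print(f"line {number}: {problem}")
```

A check like this does not replace a real validator. It only flags lines that look wrong on their face.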
Google Rules with HTML
I've read other sites that say not to use META tags to tell the search engines that you do not want
a page indexed. But when I checked Google Webmaster, they do say to use META tags within the head section for the following:
To prevent robots from indexing your page and from following its links, you would use:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
To allow robots to index the page on your site, but
instruct them not to follow outgoing links:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
To index the page on your site but not the images on that page:
<META NAME="ROBOTS" CONTENT="NOIMAGEINDEX">
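To double-check which robots instructions a finished page actually carries, you can read them back out of the HTML. This is a small sketch using Python's built-in html.parser module; the class name RobotsMetaParser and the helper robots_directives are names I made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for token in attributes.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html: str) -> set[str]:
    """Return the lowercase robots directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    print(sorted(robots_directives(page)))
```

An empty result means the page has no robots META tag, so robots will fall back on their default behavior.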
Rules and Search Engines
In a perfect world, all rules would be followed. But alas, not all
spiders follow the rules of the robots.txt file. How do you know that a
spider is not following your rules?
Check your stats. Are you finding:
- A recurring IP address hitting your site (the hits are sometimes unrealistically out of proportion to the rest of the hits)
- A spider hitting your site and pulling data from it without your permission
- A robot scraping your content
- A spider consuming your bandwidth
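If your host gives you raw access logs, a few lines of Python can show which addresses are hitting your site the most. This sketch assumes the visitor's IP address is the first item on each log line, which is true of the common Apache-style log format but may not match your server. The function name top_client_ips and the sample addresses are made up for illustration.

```python
from collections import Counter


def top_client_ips(log_lines, limit=5):
    """Count hits per client, assuming the IP is the first field on each line."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Hypothetical log lines using reserved documentation addresses.
    sample = [
        '203.0.113.5 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
        '203.0.113.5 - - [01/Jan/2024:10:00:01 +0000] "GET /images/a.png HTTP/1.1" 200 2048',
        '198.51.100.7 - - [01/Jan/2024:10:00:02 +0000] "GET / HTTP/1.1" 200 512',
        '203.0.113.5 - - [01/Jan/2024:10:00:03 +0000] "GET /private/ HTTP/1.1" 403 128',
    ]
    for ip, hits in top_client_ips(sample):
        print(ip, hits)
```

An address whose count towers over all the others is worth a closer look.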
If you ignore the problem and the robots are causing a major ruckus on your server, you may get an email
from your web hosting provider that says, "Did you
know that your content is being stolen?" If you get this
message, the bad bots have already taken more than they should have.
It's time to swallow hard and create a robots.txt file, or find
someone to create it for you.
If you find a visiting robot is not following your robots.txt rules, it is time to use the
htaccess file to prevent it from entering your site at all.
Relevant Articles
Spamming, Your Stats, Website, Blog and Social Media