Protecting Your Website: Behind the Scenes
Do you need to protect your website? Yes. Whether you are new to the Internet or just coming on board with your first website, do not neglect protecting it.
Once you have been online for a while, robots will begin spidering your site. You say, "Gee, that's a good thing." It can be. But it can also be a bad thing if their intent is to steal your content, harvest your email addresses, or harvest your images. So how do you protect your online investment? Through the .htaccess file and the robots.txt file.
.htaccess
In most cases you will have to create your own .htaccess file, although some web hosting services are beginning to offer tools to help you create one.
The .htaccess file (hypertext access) lets you customize the server configuration for a directory and everything beneath it. You can use it to set security restrictions for your directory or directories, password protect areas of your site, deny or allow IP addresses, deny or allow search engines, customize your error responses (such as 404 errors), and rewrite or redirect your URLs.
How do you create an .htaccess file?
The .htaccess file is a plain text file. Create it on your computer as htaccess.txt, FTP the file to your server, and then rename htaccess.txt to .htaccess.
Custom error pages
Creating error pages is not difficult if you know which error you want to handle.
For example, a 404 error is the "page not found" error. To create a custom error page for 404 you will need to do two things:
First, you will need a custom file. For example, let's say you call the error document nofile.html. You would put the following in your .htaccess file:
ErrorDocument 404 /nofile.html
(The general form is: ErrorDocument errornumber /file.html)
Second, upload nofile.html, your custom page for "file not found," to your server along with your .htaccess file.
To deny access to an obnoxious bot you can use the following method:
#block bad bots
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
(The ^ character means "anything that begins with," so this rule matches any User-Agent that starts with CherryPicker.)
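If you want to see what that pattern will and will not catch before you upload it, here is a minimal Python sketch of the same test. It only imitates the matching done by the SetEnvIfNoCase line above (case-insensitive, anchored to the start by the ^). The function name is_bad_bot and the sample User-Agent strings are made up for illustration.

```python
import re

# Same pattern as the SetEnvIfNoCase line: "^" anchors the match to the
# start of the User-Agent, and IGNORECASE mirrors the "NoCase" part.
BAD_BOT_PATTERN = re.compile(r"^CherryPicker", re.IGNORECASE)


def is_bad_bot(user_agent: str) -> bool:
    """Return True when the User-Agent begins with the blocked name."""
    return BAD_BOT_PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    # Hypothetical User-Agent strings, for illustration only.
    for agent in ("CherryPicker/1.0",
                  "cherrypickerSE",
                  "Mozilla/5.0 (compatible; CherryPicker)"):
        print(agent, "->", is_bad_bot(agent))
```

Notice that the last sample is not caught, because the name does not appear at the very beginning. If a bot hides its name in the middle of the string, you would need to drop the ^ from the pattern.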
Other Helpful Redirects with htaccess
The best way to create a redirect is through the .htaccess file. If you are contemplating using a META redirect (one that occurs within the <HEAD> of your web page), please reconsider. Why? Many search engines have trouble with this kind of redirect, and spammers use it in bad ways. Since the robots do not know a good guy from a bad guy, you may find yourself inadvertently banned by the search engines. With that being said, this is how you can do a redirect.
Redirect of a single page:
Redirect 301 /oldpage.html http://www.example.com/newpage.html
Redirect of a whole site:
Redirect 301 / http://www.example.com/
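To get a feel for where visitors end up, here is a rough Python sketch of how a Redirect rule maps an old path to a new URL. It is a simplified model, not Apache's actual code: it assumes the rule matches the exact path or anything inside it, and that whatever follows the matched part is added to the end of the target. The function name apply_redirect is made up for illustration.

```python
from typing import Optional


def apply_redirect(prefix: str, target: str, path: str) -> Optional[str]:
    """Return the new URL for path, or None when the rule does not match."""
    # The rule matches the exact path, or anything inside that path.
    inside = prefix.rstrip("/") + "/"
    if path != prefix and not path.startswith(inside):
        return None
    # Whatever follows the matched prefix is carried over to the target.
    return target + path[len(prefix):]


if __name__ == "__main__":
    # Single page rule from the text above.
    print(apply_redirect("/oldpage.html",
                         "http://www.example.com/newpage.html",
                         "/oldpage.html"))
    # Whole site rule: every path is carried over to the new address.
    print(apply_redirect("/", "http://www.example.com/", "/about/team.html"))
```

Under this model the whole site rule keeps the rest of each address intact, so old links to inner pages still land on the matching page at the new location.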
Robots.txt file
The main purpose of a robots.txt file is to give spiders/robots instructions on what they can and cannot crawl. Well-behaved robots request this file before they crawl anything else on your site.
The robots.txt file can help you to:
1. exclude pages from being indexed that are still under construction
2. exclude directories or pages that you do not want indexed
3. exclude search engines that you may not want your website to
appear in. (more on that later)
What is a robots.txt file?
The robots.txt file is a simple text file that sits in the root directory of your site. And yes, the robots.txt file is important.
A robots.txt file can help the spiders index your site faster and more accurately.
Some Robots.txt commands
*If you want all robots to index your pages, you can use the following commands:
User-agent: *
Disallow:
*If you want to disallow a robot from your site you would use the following code:
User-agent: specificbadbot
Disallow: /
(The "/" is needed, because that means "all directories")
*If you don't want a complete directory to be indexed, you would use the following code. Here "nogoindex" is the directory you don't want indexed.
User-agent: Googlebot
Disallow: /nogoindex/
The forward slashes at the beginning and at the end tell the search engine to skip that directory and everything inside it.
*If you don't want Google to index a page, you would use the following code, with "nogoindex" being the directory and "donotenterpage" being the page you don't want Google to index.
User-agent: Googlebot
Disallow: /nogoindex/donotenterpage.html
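One way to check rules like these before you upload them is Python's built-in urllib.robotparser module, which reads robots.txt rules the way a well-behaved robot would. The sketch below is only an illustration: the rules are the examples from above combined into one file, and the helper name allowed and the robot name OtherBot are made up.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this section, combined into one file.
ROBOTS_TXT = """\
User-agent: specificbadbot
Disallow: /

User-agent: Googlebot
Disallow: /nogoindex/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True when the rules let this robot fetch this path."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print(allowed("specificbadbot", "/index.html"))
    print(allowed("Googlebot", "/nogoindex/donotenterpage.html"))
    print(allowed("Googlebot", "/index.html"))
    print(allowed("OtherBot", "/nogoindex/donotenterpage.html"))
```

Keep in mind this only tells you what a robot that obeys the rules would do. It says nothing about robots that ignore the file.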
Rules and Search Engines
In a perfect world, all rules would be followed. But alas, not all robots follow the rules of the robots.txt file. How do you know that a robot is not following your rules? Look at your stats: how often the robot is hitting your site and what it is pulling from your site. Or worse yet, you get an email from your web hosting service saying, "Your content is being stolen."
If you find a robot is ignoring your robots.txt instructions, consuming your bandwidth, and stealing your content, then it is time to use the .htaccess file to prevent it from entering your site at all.
After you have created your robots.txt file, be sure to validate it. What can happen if it is not validated? If you happen to forget a line of code and, say, the Googlebot reads your file, you may find in a few days that your site has been de-indexed. Here are several places where you can have your robots.txt file validated:
Free validation tools:
Clockwatchers
Search Engine Promotions
Note: There are also a lot of robots.txt generator tools if you don't feel comfortable creating your own at the very beginning. Just search Google for "robots.txt generator tools."
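To give you an idea of the kind of mistakes a validator looks for, here is a small Python sketch that checks a robots.txt file for a few common slips: a misspelled field, a rule that appears before any User-agent line, and a rule that shuts every robot out of the whole site. It is a rough illustration, not a replacement for a real validation tool. The function name lint_robots_txt and the list of known fields are assumptions made for this example.

```python
from typing import List

# Field names this sketch accepts; an assumption, not a complete list.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots_txt(text: str) -> List[str]:
    """Return a list of human-readable problems found in the file."""
    problems = []
    current_agents = []  # robots named by the group being read
    in_rules = False     # True once the group has at least one rule
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and spaces
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing ':' separator")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            if in_rules:  # a User-agent after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
        elif field in ("disallow", "allow"):
            if not current_agents:
                problems.append(
                    f"line {number}: rule appears before any User-agent line")
                continue
            in_rules = True
            if field == "disallow" and value == "/" and "*" in current_agents:
                problems.append(
                    f"line {number}: blocks every robot from the whole site")
    return problems


if __name__ == "__main__":
    for problem in lint_robots_txt("User-agent: *\nDisallow: /\n"):
        print(problem)
```

An empty list means the sketch found nothing to complain about. It does not mean the file does what you intended, so still run it through one of the validation tools.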
Google Rules with HTML
I've read other sites that say not to use META tags to tell the search engines that you do not want a page indexed. But when I checked Google Webmaster, they do say to use META tags within the head section for some of the following:
To prevent robots from indexing your page you would use:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
To allow robots to index the page on your site, but instruct them not to follow outgoing links:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
To index the page on your site but not the images on that page:
<META NAME="ROBOTS" CONTENT="NOIMAGEINDEX">
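If you want to double check which of these instructions a page is actually sending, here is a minimal Python sketch that reads the robots META tag out of a page using the built-in html.parser module. The class name RobotsMetaParser, the function name robots_directives, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of every <META NAME="ROBOTS"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # CONTENT is a comma-separated list such as "NOINDEX, NOFOLLOW".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the robots instructions found in the page, in lowercase."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # Hypothetical page, for illustration only.
    page = ('<html><head><META NAME="ROBOTS" '
            'CONTENT="NOINDEX, NOFOLLOW"></head><body></body></html>')
    print(sorted(robots_directives(page)))
```

A page with no robots META tag gives back an empty set, which means robots will fall back on their default behavior for that page.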