Prevent duplicate content with robots.txt
and the robots meta tag
Duplicate content is one of the problems
that we regularly come across as part of the search engine optimization
services we offer. If the search engines determine that your site serves
the same or very similar content at more than one address, this may result
in penalties and even exclusion from the search engines. Fortunately, it's
a problem that is easily rectified.
Your primary weapon of choice against duplicate content can be found
within "The Robots Exclusion Protocol", which has now been adopted by all the
major search engines.
There are two ways to control how the search engine spiders index your site.
1. The Robots Exclusion File or "robots.txt" and
2. The Robots Meta Tag
The Robots Exclusion File (Robots.txt)
This is a simple text file that can be created in Notepad. Once created, you
must upload the file to the root directory of your website, e.g. www.yourwebsite.com/robots.txt. Before a search engine spider
crawls your website, it looks for this file, which tells it which parts
of the site it may access.
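Because spiders only look for the file at the root of the site, it can help to see where that is for any given page. The following is a minimal sketch in Python, assuming the standard-library `urljoin` function; the helper name `robots_txt_url` is our own and purely illustrative.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site hosting page_url."""
    # A leading slash makes urljoin discard the page's own path and
    # resolve against the root of the same host.
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("http://www.yourwebsite.com/faq/index.html"))
# prints http://www.yourwebsite.com/robots.txt
```

A robots.txt file placed in a subdirectory such as /faq/ would not be found by this lookup, which is why the file belongs in the root directory.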
The use of the robots.txt file is most suited to static html sites or
for excluding certain files in dynamic sites. If the majority of your
site is dynamically created then consider using the robots meta tag.
Creating your robots.txt file
Example 1 Scenario
Suppose you wanted to make the .txt file applicable to all search engine spiders
and make the entire site available for indexing. The robots.txt file
would look like this:
User-agent: *
Disallow:
Explanation
The use of the asterisk with the “User-agent” means this robots.txt
file applies to all search engine spiders. By leaving the “Disallow” blank
all parts of the site are suitable for indexing.
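If you would like to check this behaviour before uploading the file, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It assumes this parser is a reasonable stand-in for a well-behaved spider (real search engines may interpret edge cases differently), and the spider name "somebot" is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The allow-everything file described above: any spider, nothing disallowed.
allow_all_rules = """\
User-agent: *
Disallow:
"""

allow_all = RobotFileParser()
allow_all.parse(allow_all_rules.splitlines())

# Both checks report that the address may be fetched.
print(allow_all.can_fetch("googlebot", "http://www.yourwebsite.com/"))
print(allow_all.can_fetch("somebot", "http://www.yourwebsite.com/faq/index.html"))
```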
Example 2 Scenario
If you wanted to make the .txt file applicable to all search engine spiders
and to stop the spiders from indexing the faq, cgi-bin and images directories
and a specific page called faqs.html contained within the root directory,
the robots.txt file would look like this:
User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html
Explanation
The use of the asterisk with the “User-agent” means this robots.txt
file applies to all search engine spiders. Preventing access to the directories
is achieved by naming them, and the specific page is referenced directly.
The named files & directories will now not be indexed by any search
engine spiders.
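The same kind of check can be run against these rules. This is a sketch only, again assuming Python's standard-library `urllib.robotparser`; the page names other than faqs.html (such as form.pl and logo.gif) and the spider name "somebot" are invented for the test.

```python
from urllib.robotparser import RobotFileParser

example2_rules = """\
User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html
"""

example2 = RobotFileParser()
example2.parse(example2_rules.splitlines())

site = "http://www.yourwebsite.com"
test_paths = [
    "/index.html",        # not named in the file
    "/faq/index.html",    # inside a disallowed directory
    "/cgi-bin/form.pl",   # inside a disallowed directory
    "/images/logo.gif",   # inside a disallowed directory
    "/faqs.html",         # the single disallowed page
]

for path in test_paths:
    verdict = "allowed" if example2.can_fetch("somebot", site + path) else "blocked"
    print(f"{path}: {verdict}")
```

Only /index.html is reported as allowed; the other four addresses fall under one of the Disallow lines.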
Example 3 Scenario
If you wanted to make the .txt file applicable only to the Google spider, googlebot,
and stop it from indexing the faq, cgi-bin and images directories and a specific
html page called faqs.html contained within the root directory, the robots.txt
file would look like this:
User-agent: googlebot
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html
Explanation
By naming the particular search spider in the “User-agent” you
prevent it from indexing the content you specify. Preventing access to
the directories is achieved by simply naming them, and the specific page
is referenced directly. The named files & directories will not be indexed
by Google.
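To see that these rules affect only the named spider, here is a short sketch along the same lines, assuming Python's standard-library `urllib.robotparser`. The second spider name, "otherbot", is a made-up example of a spider the file does not mention.

```python
from urllib.robotparser import RobotFileParser

googlebot_rules = """\
User-agent: googlebot
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html
"""

googlebot_only = RobotFileParser()
googlebot_only.parse(googlebot_rules.splitlines())

page = "http://www.yourwebsite.com/faqs.html"

# The named spider is blocked from the page...
print(googlebot_only.can_fetch("Googlebot", page))
# ...while a spider the file does not mention is left unrestricted.
print(googlebot_only.can_fetch("otherbot", page))
```

The first check reports False and the second True. If other spiders should be restricted as well, they need their own "User-agent" section or a section using the asterisk.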
That’s all there is to it!
As mentioned earlier, the robots.txt file can be difficult to implement
in the case of dynamic sites, and in this case it's probably necessary
to use a combination of the robots.txt file and the robots meta tag.
The robots meta tag
This alternative way of telling the search engines what to do with site
content appears in the <head> section of a web page. A simple example
would be as follows:
<meta name="robots" content="noindex, nofollow">
In this example we are telling all search engines not to index the page
or to follow any of the links contained within the page.
In this second example I also don't want the search engines to cache the page,
because the site contains time sensitive information. This can be achieved simply
by adding the "noarchive" directive.
<meta name="robots" content="noindex, nofollow, noarchive">
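On a dynamic site it is easy for a template to leave the tag out or mistype it, so it can be worth checking what a generated page actually contains. The following is a minimal sketch using Python's standard-library `html.parser`; the class name `RobotsMetaParser` and the sample page are our own inventions for illustration. Note that the attribute values must use plain straight quotes, otherwise the tag will not be read as intended.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from a page's robots meta tag, if it has one."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            # Split "noindex, nofollow" into a clean list of directives.
            self.directives = [d.strip().lower() for d in content.split(",") if d.strip()]


sample_page = """<html><head>
<title>Time sensitive offer</title>
<meta name="robots" content="noindex, nofollow, noarchive">
</head><body></body></html>"""

meta_parser = RobotsMetaParser()
meta_parser.feed(sample_page)
print(meta_parser.directives)
# prints ['noindex', 'nofollow', 'noarchive']
```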
What could be simpler!
Although there are other ways of preventing duplicate content from
appearing in the search engines, this is the simplest to implement, and
all websites should operate a robots.txt file, a robots meta tag,
or a combination of the two.
About the Author: Andrew Allfrey
is Director of Search Engine Marketing company www.e-prominence.co.uk