Search Engine Optimization
Search Engine Marketing

Wednesday, September 17, 2008

Using robots.txt or REP

We never used robots.txt on our web design company's websites because we thought it was only for keeping search engines away, and we wanted to welcome search engines.

But you can also use it to tell search engines to crawl your whole website, like this:
User-agent: *
Disallow:
or
User-agent: *
Allow: /
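
If you want to convince yourself that both forms really let everything through, one way is a quick check with Python's standard urllib.robotparser module. This is only a sketch: the helper name allows() and the test URL are ours for illustration, the "Disallow: /" case is added purely as a contrast, and Python's parser is not guaranteed to behave exactly like any particular search engine's crawler.

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The two "welcome everybody" forms from the post.
EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"
ALLOW_ROOT = "User-agent: *\nAllow: /\n"
# Contrast case (not from the post): this one blocks the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical URL, used only for the check.
url = "http://www.mysite.com/any/page.html"

for name, text in [("empty Disallow", EMPTY_DISALLOW),
                   ("Allow: /", ALLOW_ROOT),
                   ("Disallow: /", BLOCK_ALL)]:
    print(name, "->", "allowed" if allows(text, url) else "blocked")
```

The only difference between "welcome" and "keep out" is that single slash after Disallow, so it is worth double-checking before you upload the file.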

But we kept noticing requests for robots.txt from MSN and Googlebot in our server logs, so we decided, what the heck, we might as well create a robots file.

Here we made our first mistake: we named the file "robot.txt" instead of "robots.txt". Though this may seem like a no-brainer, you would be surprised by the number of people who make this mistake.

Our main concern was that we had certain critical pages, scripts and folders which, if mentioned in our robots.txt file, would be like putting up a sign saying "attack here". After a lot of discussion we concluded that we don't need to list these pages, scripts and folders in robots.txt at all: unless there is a link to them from somewhere, they won't be crawled anyway.

We now have some experience writing robots.txt files, which we would like to share here:

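1. Write the commands in their conventional capitalization, e.g. User-agent, Disallow, Allow and so on. The major crawlers accept command names in any case, but the paths you list are case-sensitive, so "/Blog/" and "/blog/" are two different rules.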

2. Further, robots.txt can be used effectively to curb the problem of duplicate content on your blog. For example, if you're using Blogger you will want the permalink of a blog post to be indexed, as opposed to the archive page or the labels page, so you write your robots.txt file in the following manner.

# "*" applies to all bots; put a specific bot's name here to target just that one
User-agent: *
# Blocks all your archive pages for 2008, which are named something like "2008_09_01.html",
# while not blocking your permalinks like "/Blog/2008/09/04/blogpost.html"
Disallow: /Blog/2008_
Disallow: /Blog/label/

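If you're using WordPress, you can block "wp-login.php" and your "wp-admin" folder using "Disallow: /wp-". Keep in mind that this prefix also matches everything else that starts with "/wp-", such as "/wp-content/", so use the longer paths if you only want to block the login page and the admin folder.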
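
Since Disallow rules are plain prefix matches, it is easy to test them against a handful of URLs before going live. Here is a rough sketch using Python's standard urllib.robotparser; the label page and the "wp-content" file are made-up paths for illustration, and real crawlers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The Blogger and WordPress rules from above, combined for the test.
RULES = """\
User-agent: *
Disallow: /Blog/2008_
Disallow: /Blog/label/
Disallow: /wp-
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def blocked(path: str) -> bool:
    """Return True if the rules above stop any bot ("*") from fetching `path`."""
    return not parser.can_fetch("*", path)


paths = [
    "/Blog/2008_09_01.html",           # archive page
    "/Blog/2008/09/04/blogpost.html",  # permalink
    "/Blog/label/seo.html",            # hypothetical label page
    "/wp-login.php",
    "/wp-admin/",
    "/wp-content/uploads/logo.png",    # hypothetical file, caught by the /wp- prefix
]

for path in paths:
    print("blocked" if blocked(path) else "allowed", path)
```

The permalink is the only path in this list that stays open, which is exactly what we want for the blog. The last line shows why a short prefix like "/wp-" deserves a second look.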

3. If your site uses session ids in the URL, you may have a major problem with duplicate content, where the same page gets indexed under several different URLs. A REP (Robots Exclusion Protocol) solution to this is the "*" wildcard, which major search engines such as Google support. Suppose your session ids start with "sess_id=", so you have URLs like "...page1.htm?sess_id=nndndchh3nG" and "...page1.htm?sess_id=mnvmjenfcchh3nG". Both of these URLs are the same page, i.e. "page1.htm".

This is how you can block urls with session ids:
Disallow: /*sess_id
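
To see what the wildcard rule catches, here is a small hand-written matcher in Python. It is only a sketch of the idea ("*" stands for any run of characters, and the rule must match from the start of the path plus query string); the function name and the "/folder/" URL are ours, and an actual crawler's matching may be more involved.

```python
import re


def wildcard_rule_matches(rule_path: str, url_path: str) -> bool:
    """Return True if a robots.txt path containing "*" matches `url_path`.

    Each "*" in the rule stands for any sequence of characters; everything
    else is taken literally, and the rule is matched from the start of the
    path (query string included).
    """
    literal_parts = [re.escape(part) for part in rule_path.split("*")]
    regex = ".*".join(literal_parts)
    return re.match(regex, url_path) is not None


RULE = "/*sess_id"

for path in [
    "/page1.htm?sess_id=nndndchh3nG",
    "/page1.htm?sess_id=mnvmjenfcchh3nG",
    "/page1.htm",
    "/folder/page2.htm?sess_id=abc",   # hypothetical deeper URL
]:
    state = "blocked" if wildcard_rule_matches(RULE, path) else "allowed"
    print(state, path)
```

Only the clean "page1.htm" URL stays crawlable, so the session id copies no longer compete with it.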

4. If you have an XML sitemap file on your site, you can reference it in your robots.txt file for search engines to follow.
Sitemap: http://www.mysite.com/sitemap.xml
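
A quick way to confirm the Sitemap line is being picked up is to parse the file and list the sitemaps found. The sketch below assumes Python 3.8 or later, where urllib.robotparser offers site_maps(); the surrounding allow-all rules are just filler so the sample file is complete.

```python
from urllib.robotparser import RobotFileParser

# Sample file: allow-all rules plus the Sitemap line from above.
ROBOTS_TXT = """\
User-agent: *
Disallow:
Sitemap: http://www.mysite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when no Sitemap line is present,
# so fall back to an empty list.
sitemaps = parser.site_maps() or []

for sitemap_url in sitemaps:
    print("Sitemap found:", sitemap_url)
```

The Sitemap line does not belong to any User-agent group, so it can go anywhere in the file.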

These are just some of the rules we use. Depending on the structure of your site or blog, these same rules can be tweaked to filter out duplicate content and let the link juice flow most effectively.

If you still feel confused about using robots.txt, feel free to contact one of our SEO experts.
