Bad Robots.txt Causes Massive Blocked URLs?


Robots.txt is blocking a huge number of URLs on our site! We dug deeper into the causes of our falling traffic since Panda 25, especially after a reconsideration request revealed no manual penalty. With traffic down over 50% since March 15 and still falling week after week, we looked closely at our Webmaster Tools settings and found that robots.txt was blocking an excessive number of URLs, and the block kept growing every week, to the point that very few pages were left unblocked.

panda penalty

Bad Robots

We were aware that robots.txt was set to block some URLs, but how many of these indexed URLs were actually being blocked by robots.txt? The answer is provided by a key report in Google Webmaster Tools at Index Status > Advanced View. It gives a great view of all the indexed URLs and the URLs blocked by robots.txt. This is what our chart looks like … scary!

urls blocked by robots.txt

As you can see, over the past few weeks robots.txt has continued to block a huge number of URLs, and more URLs are getting blocked every week. In fact, at this rate all URLs will be blocked within the next few weeks. I am not sure when we last tweaked this file, maybe a few months back. Maybe there was a noindex tag somewhere, though we have not noindexed categories or tags for many months now. This explains the falling traffic over the past few weeks, as the number of URLs in the Google index keeps dropping.

Why did robots.txt suddenly start blocking so aggressively? Is it a Googlebot problem?

Old Robots.txt

Over the years we have been tweaking robots.txt in an attempt to reduce the indexing of duplicate content and of the large number of URL parameters, to fix possible Panda issues. This is what our robots.txt looked like for the past few months, as suggested by some popular websites.

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /author/
Disallow: /search/
Disallow: /*/trackback
Disallow: /?q*
Disallow: /*?
Disallow: /*.css$
Disallow: /*/?replytocom

This is shown partly modified so as not to reveal some private folders and the sitemap location. We allow Google Image search full access to the WP uploads folder. We blocked access to the WordPress system folders; disallowed the author, search and trackback pages; and blocked pages with URL parameters like q for search, replytocom for comment-reply pages, CSS files, and other ? parameter pages. We also blocked the replytocom URL parameter in GWT.

Can you identify what is wrong with this robots.txt file? Your feedback would be valuable.
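
If you want to test the patterns yourself, here is a minimal sketch of Google-style wildcard matching, where * matches any run of characters and $ anchors the end of the URL. The rule_to_regex helper and the sample URLs are purely illustrative assumptions, not how Googlebot is actually implemented:

import re

# Convert a robots.txt Disallow pattern into a regex, assuming Google-style
# wildcards: '*' matches any run of characters, '$' anchors the end of the URL.
def rule_to_regex(disallow_pattern):
    pattern = re.escape(disallow_pattern).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile("^" + pattern)

old_rules = ["/*/trackback", "/?q*", "/*?", "/*.css$", "/*/?replytocom"]

# Hypothetical URLs, for illustration only.
test_urls = [
    "/2013/03/some-post/",
    "/2013/03/some-post/trackback",
    "/2013/03/some-post/?replytocom=123",
    "/category/blogging/?utm_source=feed",
]

for url in test_urls:
    hits = [r for r in old_rules if rule_to_regex(r).match(url)]
    print(url, "->", "blocked by " + ", ".join(hits) if hits else "allowed")

Note how any URL carrying a query string trips the /*? rule; whether that explains the scale of the blocking here is another question.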

New Robots.txt

This is our robots.txt now, changed a few days back, as we really do not know what is blocking the URLs. We also removed all the replytocom URL parameter blocks, as Google warns these could also deindex URLs. We still block access to the wp-admin and wp-includes folders and the wp-content plugins folder, though we probably do not need to block them either, because robots should be smart enough not to index them anyway across millions of WordPress blogs.

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /wp-includes/ 
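
As a quick sanity check while waiting for GWT to update, a few lines of Python can confirm that normal posts are crawlable and only the system folders stay blocked. This is just a sketch: the URLs are made up, and Python's standard urllib.robotparser only understands plain prefix rules (no * or $ wildcards), which is all the new file uses:

from urllib.robotparser import RobotFileParser

# Feed the new rules straight to the parser instead of fetching them over HTTP.
new_rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /wp-includes/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(new_rules)

# Hypothetical URLs, for illustration only.
for url in ["https://example.com/2013/03/some-post/",
            "https://example.com/wp-admin/options.php"]:
    print(url, "->", "allowed" if parser.can_fetch("*", url) else "blocked")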

Now we are waiting and watching for the next weekly update to show how the Index Status changes. The Blocked URLs page has already started showing a decline in the number of blocked URLs! Great.

Big lessons learnt: I just wish I had reviewed the Advanced Index Status report earlier. We are still unsure about the relationship between Panda 25, the March 15 drop and the robots.txt blocks. It also reminds webmasters that robots.txt is a powerful tool, so be sure of what you add there. Also take care before you decide to add noindex tags indiscriminately. What are your thoughts on this?

Update: We are wondering if this is the Googlebot access / server connectivity issue that other people are reporting.

Update: Our Knownhost VPS hosting support investigated the issue and suggested that our server firewall, CSF, may have been temporarily blocking Googlebot. They have added rules in csf.rignore to ensure that the firewall does not block Googlebot, Bing or Yahoo. They seemed fairly sure this was the issue and think it should fix the robots blocking problem. We had earlier reported that CSF can block site traffic, and I think everyone really needs to check what their server firewall is blocking or allowing.
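
If you want to confirm whether an IP your firewall flagged really is Googlebot before whitelisting it, Google recommends a reverse-then-forward DNS lookup. A minimal sketch; the IP shown is only an example, so substitute one from your CSF or access logs:

import socket

# Reverse-resolve the IP, confirm the hostname belongs to googlebot.com or
# google.com, then forward-resolve that hostname and check it maps back to
# the same IP.
def is_real_googlebot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

print(is_real_googlebot("66.249.66.1"))  # example IP, replace with one from your logs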

Update: Here are the charts after 2 months. Note that the gap has widened and the number of indexed pages and blocked pages has reduced considerably. The blocked pages still number 2,000+ even after 2 months!

Recovered robots.txt


23 comments on “Bad Robots.txt Causes Massive Blocked URLs?”

  1. Pavan Somu says:

    Today I too got a mail from the Google team about the same issue. I just rebuilt my sitemap and checked the status in Webmaster Tools. Now everything is fine.

  2. Sourish says:

    Fed up with site connection timeout errors, I moved my hosting from Knownhost to another web host.

  3. umair says:

    This one may be the actual reason:

    Disallow: /*/trackback

    The /*/ means that all kinds of URLs ending in /trackback will be blocked.

    Because the trackback redirects the bot to the previous URL, which is your post, and your post path is covered by the *.

    Just a guess, but I think this could be the reason.

    • P. Chandra says:

      That was a good pick, and worth thinking more about. That rule was meant to block and noindex only trackbacks, which are just redirects to the main URL. We should be more careful before copying these rules from popular websites, as they too could be wrong.

  4. Ganoderma says:

    I knew it! For a moment I got paranoid, since I have read several times that Google and other search engines treat your robots.txt file only as a hint, and most do not follow it to the letter. Moreover, I have hundreds of directories blocked that still get indexed as normal despite the rule, but they are nothing I consider important.

    Cheers =)

  5. karan says:

    Nice post. Robots.txt can be harmful if its settings are done wrong!

  6. Sam H. says:

    Thanks for the tip. I checked and I was running the old-style robots.txt, so I redid it. I will track the results to see whether it helps or not.

  7. Hung Webster says:

    You can add the Sitemap directive anywhere in the robots.txt file because it is independent of the user-agent line. All you have to do is specify the full URL of your sitemap. If you have multiple sitemaps, you can also specify the location of your sitemap index file instead. Learn more about sitemaps in our blog on XML Sitemaps.
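
    For example, it is a single line (the domain and filename here are just placeholders):

    Sitemap: https://www.example.com/sitemap.xml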

  8. Eric | Negocios Multinivel says:

    I’m new to these things. What I did was copy the robots.txt file of a blog that I follow, but recently I have received messages from Google about blocked URLs. Now I have just made the same changes that you did, and I’ll let you know if it works.

    Thanks

    • P. Chandra says:

      We should each use a robots.txt that is suitable for our own site. Other websites may have their own reasons for various configurations.

  9. Patrick H. Clark says:

    An important but sometimes overlooked element of onsite optimisation is the robots.txt file. This file alone, usually weighing no more than a few bytes, can be responsible for making or breaking your site’s relationship with the search engines.

  10. Jaimin says:

    Robots.txt is very powerful if used carefully; otherwise it is the one tool that can cut so much traffic without any algorithm update :( :)

  11. Aditya says:

    Thanks for the info about robots.txt. I would like to learn more about robots.txt!!

  12. Alex says:

    Hi Chandra

    I don’t use a robots.txt file, as there are situations where some variables from the web server conflict with the commands in the robots.txt.

    If I want to block a specific page I use a meta tag like this:
    <meta name="robots" content="noindex, nofollow">

  13. techpopular says:

    Hi Chandra,

    Thanks for sharing the info. In my robots.txt the sitemap appears as Sitemap: http://techpopular.com/sitemap.xml.gz.
    Why has that .gz extension come up? Are there any issues with it? Does it need to change? Please guide me.

    • P. Chandra says:

      The sitemap plugins usually generate both sitemap files: sitemap.xml and the smaller zipped sitemap.xml.gz.

  14. Richard says:

    Seems like there was a Panda update recently. Did you see any change in your stats? My blog was hit back in Nov. 2012 and the latest update (7/9/2013) hit me again.
    Just wondering if yours changed.

    Thanks,

  15. ninjaseo says:

    Robots.txt is a sensitive file and should not be touched unless you have experience.

    A tip for people using WordPress: if you include this line:
    Disallow: /page/
    you can avoid problems with duplicate pages due to the WordPress pagination system on blogs.

    But take care when modifying this file and always test with Webmaster Tools!

  16. Mark Saw says:

    Nice post. How can I check which URLs are blocked by robots.txt if we have not blocked anything manually in robots.txt? Please advise me how to check.
