Bad Robots.txt Causing Massive Blocked URLs?

Robots.txt is blocking a huge number of URLs on our site! We dug deeper into the causes of the traffic drop since Panda 25, especially after a reconsideration request revealed no manual penalty. With traffic down over 50% since March 15 and still falling week after week, we looked closely at the Webmaster Tools settings and found that robots.txt was blocking an excessive number of URLs, and that the count of blocked URLs was growing every week, to the point that very few pages were left unblocked.

[Image: Panda penalty]

Bad Robots

We were aware that robots.txt was set to block some URLs, but how many of the indexed URLs were actually being blocked by it? The answer comes from a key report in Google Webmaster Tools at Index Status > Advanced View, which charts all indexed URLs alongside the URLs blocked by robots.txt. This is what our chart looks like … scary!

[Chart: URLs blocked by robots.txt]

As you can see, over the past few weeks robots.txt has continued to block a huge number of URLs, and more URLs are getting blocked every week; at this rate, every URL will be blocked within the next few weeks. I am not sure when we last tweaked the file, maybe a few months back. Perhaps a noindex tag crept in somewhere, though we have not noindexed categories or tags for many months now. This explains the falling traffic over the past few weeks, as the number of URLs in the Google index keeps dropping.

Why did robots.txt suddenly start blocking so aggressively? Is it a Googlebot problem?

Old Robots.txt

Over the years we have tweaked robots.txt to reduce the indexing of duplicate content and of the large number of URL-parameter pages, hoping to fix possible Panda issues. This is what our robots.txt looked like for the past few months, following suggestions from some popular websites.

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /author/
Disallow: /search/
Disallow: /*/trackback
Disallow: /?q*
Disallow: /*?
Disallow: /*.css$
Disallow: /*/?replytocom

This is partly modified so as not to reveal some private folders and the sitemap. We allow Google Image search full access to the WP uploads. We blocked access to the WordPress folders; disallowed author, search and trackback pages; and blocked pages with URL parameters such as q for search and replytocom for comment-reply pages, as well as CSS files and other ? parameter pages. We also blocked the replytocom URL parameter in GWT.
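
To see what these patterns actually match, here is a minimal Python sketch (my own, not part of the original setup) that converts Google-style wildcard rules into regular expressions and tests a few sample paths against them. The sample URLs are hypothetical; the rules are the ones from the file above.

import re

# Disallow rules from the old robots.txt (Google-style wildcards: * and $)
disallow_rules = [
    "/cgi-bin/", "/wp-admin/", "/wp-content/", "/wp-includes/",
    "/author/", "/search/", "/*/trackback", "/?q*", "/*?",
    "/*.css$", "/*/?replytocom",
]

def rule_to_regex(rule):
    # Escape the rule, then restore wildcard semantics:
    # '*' matches any run of characters, a trailing '$' anchors the end of the URL
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile("^" + pattern)

def blocked_by(path):
    # With only Disallow rules present, any match means the path is blocked
    return [r for r in disallow_rules if rule_to_regex(r).match(path)]

# Hypothetical sample paths for illustration only
for path in ["/2013/04/some-post/", "/some-post/?replytocom=123",
             "/author/chandra/", "/style.css"]:
    print(path, "->", blocked_by(path) or "allowed")

Run this way, an innocuous-looking rule such as Disallow: /*? turns out to match every URL that carries a query string, which is far broader than it reads.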

Can you identify what is wrong with this robots.txt file? Your feedback would be valuable.

New Robots.txt

This is our robots.txt now, changed a few days back, as we really do not know which rule is blocking the URLs. We also removed all replytocom URL parameter blocks, as Google warns these could also deindex URLs. We still block access to the wp-admin, wp-includes and wp-content/plugins folders, though we probably do not need to block them either, since Googlebot should be smart enough not to index them across millions of WordPress blogs anyway.

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /wp-includes/ 
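
As a quick sanity check on the trimmed-down file, a sketch along these lines (again my own, with placeholder URLs) feeds the new rules into Python's standard urllib.robotparser and confirms that ordinary post URLs stay crawlable while the admin folders stay disallowed. The standard parser copes here because the new file uses only plain prefix rules; it does not understand the wildcard syntax in the old file.

from urllib.robotparser import RobotFileParser

new_robots = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /wp-includes/
"""

rp = RobotFileParser()
rp.parse(new_robots.splitlines())  # parse() accepts an iterable of lines

# Placeholder URLs for illustration; substitute real paths from the site
for url in ["https://example.com/2013/04/some-post/",
            "https://example.com/wp-admin/options.php",
            "https://example.com/wp-content/uploads/logo.png"]:
    print(url, "->", "allowed" if rp.can_fetch("Googlebot", url) else "blocked")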

Now we are waiting for the next weekly update to see how the index status changes. The Blocked URLs page has already started showing a decline in the number of blocked URLs! Great.

Big lessons learnt – I just wish I had reviewed the advanced Index Status report earlier. I am still confused about the relationship between Panda 25, March 15 and the robots blocks. This is also a reminder to webmasters that robots.txt is a powerful tool, so be sure about what you add there, and take care before adding noindex tags indiscriminately. What are your thoughts on this?
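
On the noindex point: a stray meta robots tag or an X-Robots-Tag header can quietly drop pages from the index just as effectively as an over-broad robots.txt rule. A rough audit sketch like the one below (standard library only, placeholder URLs, and a deliberately simple regex) fetches a few pages and reports any noindex directive it finds.

import re
import urllib.request

def noindex_report(url):
    # Inspect both the X-Robots-Tag header and the meta robots tag in the HTML
    req = urllib.request.Request(url, headers={"User-Agent": "noindex-audit"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        header = resp.headers.get("X-Robots-Tag", "")
        html = resp.read(200000).decode("utf-8", errors="replace")
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    findings = []
    if "noindex" in header.lower():
        findings.append("X-Robots-Tag: " + header)
    if meta and "noindex" in meta.group(1).lower():
        findings.append("meta robots: " + meta.group(1))
    return findings or ["no noindex found"]

# Placeholder URLs; substitute your own category, tag and post URLs
for url in ["https://example.com/", "https://example.com/category/seo/"]:
    print(url, "->", noindex_report(url))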

Update: We are wondering whether this is the Googlebot access / server connectivity issue that other people are reporting.

Update: Our KnownHost VPS hosting support investigated the issue and suggested that our server firewall, CSF, may have been temporarily blocking Googlebot. They added rules in csf.rignore to ensure the firewall does not block Googlebot, Bing or Yahoo, and they seemed fairly confident this was the cause and should fix the blocking issue. We had earlier reported that CSF can block site traffic, and everyone really should check what their server firewall is blocking or allowing.
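
For anyone checking the firewall angle themselves, Google's documented way to confirm that an IP hitting the server is really Googlebot is a reverse DNS lookup followed by a forward confirmation. A rough sketch of that check, using a placeholder IP pulled from an access log, might look like this:

import socket

def is_real_googlebot(ip):
    # Reverse DNS: a genuine Googlebot hostname ends in googlebot.com or google.com
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    # Forward confirmation: the hostname must resolve back to the same IP
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

# Placeholder IP; take real addresses from the server access log
print(is_real_googlebot("66.249.66.1"))

Only addresses that pass both steps are worth exempting in the firewall, as KnownHost did here via csf.rignore.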

Update: Here are the charts after 2 months. Note that the gap has widened and that the numbers of indexed and blocked pages have both come down considerably. Even after 2 months, the blocked pages still number 2000+!

[Chart: Recovered robots.txt]

