Our Dreamhost server crashed and our site has been offline for much of the last 2 days! This again gives us an opportunity to share what happened and some tips we learnt from the experience.
Internal server errors had been occurring on our server for the last few weeks, and we learnt more about memory-intensive plugins and the need to upgrade from PHP4 to PHP5. Though most of the server errors were fixed, the relief was short-lived, as our shared Dreamhost server crashed.
The web server, bulger, is having issues with its RAID and attempts to move the drives over to a new server have also failed. We’re currently bringing up a new server and restoring it from backups. More information to come!
Update: Further hardware disaster with this machine has extended downtime, unfortunately. We are still working on it and we hope to have it back up soon. We are very sorry for the inconvenience.
Update 06:35 PDT: The machine is back up now but is still restoring from backups and some configurations are still running on the server. Websites should be up for the most part but if they’re not, don’t fret as we’re still working on it! We’ll post more updates as they come in!
Update 8:11 PM PDT: The server is still restoring from backups but sites should be online. You will notice a bit of instability while all the backups are restored, but this is only temporary and will subside once everything is restored. We’ll post the all clear once everything is good to go!
There has been no update for over 24 hours, and it seems it's not all clear yet. This time we could not maintain traffic like before: no cached pages were functional and the entire site was down. There was no FTP access, so even a notice on the homepage could not be posted.
Thankfully, maintaining a status blog helped, as an alternative Blogger blog could send out updates. Using Feedburner as the primary feed URL lets us redirect the feed to the status blog if required. Those who follow us on Twitter could stay connected through the status updates from our blog. It also helped that we use Google Apps to run our email, so email communication was not affected. All these tools helped us maintain contact and keep some services running.
I had a few email discussions with Dreamhost support over the last 2 days, and here is an email from yesterday highlighting what the problem was.
Thanks for writing in! I am *VERY* sorry for the inconvenience! The server that you are on, bulger, has been having a few problems as of yesterday. I’ll go ahead and explain what happened. Basically, bulger was having issues with its RAID setup and the server was going up and down intermittently. Our network admins went ahead and decided that we needed to fail the existing hardware over to new hardware in order to restore functionally to the server! The new hardware has been installed and the server is running BUT the I/O from the rsync of backup data is causing problems for people on the server. I don’t have an ETA for when the rsync will be done but we’ll get this squared away as soon as possible! I really do apologize for the inconvenience and I totally understand how frustrating a server being up and down is. That is why our network admin team is working hard to bring it up now.
And another support email from this morning…
We’re very sorry for the problems that you are having with your site. The server that it was hosted on was having some hardware issues, and we tried to do a quick move to get it setup on new hardware with little downtime. Unfortunately, things didn’t go as quickly as planned and it is still having issues. We do have the hardware issue all straightened out, but we are still in the process of copying data over from our backup servers. This is causing the network of the server to become extremely saturated and is causing the server load to wildly fluctuate. At this point, we don’t see anything wrong with the server that is making your sites unavailable other than extreme load that these backups are causing on the server. Unfortunately, there’s not a whole lot we can do with this server until after the data transfer is complete. All we can ask at this time is that you keep an eye on this status post to see when the issue is resolved.
Thankfully, site traffic has been creeping back over the last few hours, and FTP and the WordPress admin have started working.
However, Yang let me know about the Internet Supervision tool, which shows the site is still loading very slowly from locations across the world, taking over 40 seconds.
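For a rough check of your own between looks at a monitoring service, a load time can also be measured with a few lines of Python. This is only a minimal sketch and not how the Internet Supervision tool works: it times a single request from wherever the script runs, so it says nothing about load times elsewhere in the world. The function names and the 5-second "slow" threshold are assumptions made for illustration.

```python
import time
import urllib.request


def time_fetch(url, fetch=None, timeout=60.0):
    """Time one request to `url`; return (seconds_elapsed, ok).

    `fetch` is an optional callable taking the URL, used in place of a
    real HTTP request (handy for trying the function out offline).
    """
    if fetch is None:
        def fetch(u):
            # Download the full response body so the timing covers the whole page.
            with urllib.request.urlopen(u, timeout=timeout) as resp:
                resp.read()

    start = time.perf_counter()
    try:
        fetch(url)
        ok = True
    except Exception:
        # Timeouts, DNS failures and HTTP errors all count as a failed check.
        ok = False
    return time.perf_counter() - start, ok


def classify(seconds, ok, slow_after=5.0):
    """Label a measurement as 'down', 'slow' or 'ok' (threshold is an assumption)."""
    if not ok:
        return "down"
    return "slow" if seconds > slow_after else "ok"
```

Calling time_fetch with your own site's address and passing the result to classify gives a quick "ok", "slow" or "down" reading; a 40-second load like the one reported above would be labelled "slow".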
As I considered upgrading to Dreamhost PS private servers, many helpful Twitter users (like @nirmaltv, @manikarthik, @sumesh, @rishi, @keithdsouza, @shivaranjan, @denharsh) pitched in to suggest alternative VPS hosting solutions like Slicehost, Doreo, Linode and Mediatemple. It was also great to see some hosting companies, like @linode and @spiralhosting, offering their services to us on Twitter.
I hope Dreamhost fixes the server issues soon and posts an update to affected customers … as someone mentions in the comments, “Information aids patience; silent downtime breeds discontent.”