Epic Fail…

My favorite error
Creative Commons License photo credit: xaminmo

Warning: This is not a traditional post on Plagiarism Today. This post recounts the events during an extended outage of the site including why the site went down and how I got it back online. If you aren’t interested in this kind of geeky/nerdy stuff, please feel free to skip. If you are, well, I hope it helps you in some way.

Before I begin, I want to make it clear that Plagiarism Today was NOT hacked. A few people wrote me during the downtime to ask if I had fallen victim to one of the much-talked-about WordPress exploits. The answer is no. Though I am sure it is a possibility if someone were dedicated enough, it is not what happened in this case.

Second, I want to thank all of the readers who reached out to me either via email or IM during this outage. Your support and encouragement were a great help and I greatly appreciate all of your kind offers of assistance. It meant more than I can say.

With that in mind, here is the full story of what happened over the course of the past 24 hours or so and where the site sits right now.

First Signs of Trouble

At about ten o’clock on June 12 (central time), I pulled up the site and discovered that Plagiarism Today was responding with an error 500. Having just recently moved the site to a VPS host named VPSLink, I tried to log in to my site’s control panel and restart the server.

However, when I tried to load the control panel, which was on the same server, I received another error 500. Realizing it was more serious than previously thought, I jumped onto my hosting control panel, where I managed the entire server, and tried to do the reboot there. However, there, the reboot process froze halfway and never completed.

At this point, I assumed that the problem was still software related and put in a tech support ticket to have the techs do a hard reboot of the server. However, over 40 minutes passed without a response and nothing changed on the server.

It was only after nearly an hour that the techs answered my support request and they did so by deleting the one I had filed without offering any comment or explanation.

Worried, and upset, I filed an “Emergency” server outage report and risked paying extra in the event that the fault for the outage turned out to be mine. I waited for a response and, after about thirty minutes, one of the representatives informed me that there had been an error in a “hardware node” of my server and that they were looking into it.

Realizing that the error was not my fault and there was nothing I could do right then, I took a break to watch some television and spend some time with my wife. When I returned an hour or two later, my host had put up a post in the forums, several hours after the incident had happened, offering more details about the problem.

Apparently one of the hard drives had gone bad. The system was being rebooted and they were running fsck on the drive to repair the damage.

Unfortunately, several hours later, they updated the post to inform us that the entire RAID array had gone bad and the entire node was crippled. They had a complete backup from 6/10 and an “incremental” one from the night before. They set about replacing the part and restoring from the backup, a process that would take many hours.

Unable to do anything other than transfer my email to a new host (see below), I went to bed and trusted that restoring from the backup would save the site. Unfortunately, that wasn’t the case.

A Bad Morning

When I woke up at a little bit after eight local time, after getting just three or four hours of sleep, I was stunned to find that the site was still down. I posted a reply to the forums asking for an update and they quickly responded by saying that the backups had been restored and that they were rebooting the servers.

Unfortunately, Plagiarism Today was not restored. The main site was returning a “Forbidden” error and the control panel was not loading anything. I attempted several reboots using the hosting manager but to no avail.

The only service available to PT was SSH, so I could log in and grab files, but nothing else was available.

I wrote tech support to ask what to do and they told me to use SSH to back up important files and then reinitialize the entire server, starting with the operating system install.

This would have meant completely wiping out everything, including the DNS information, the databases and the WordPress installs, and rebuilding it all from the ground up using backups or fresh installations.

As I saw it then, there were three problems with that idea.

  1. The backups I had were questionable at best. If the backups had been perfect, the site would be up, at least theoretically.

  2. The process had many complicated steps that had to be done in order. Easy when you’re working on a second server with a functioning site, much harder when the clock is ticking.

  3. An advisory from the host said the main control panel, the one needed to initialize the server, would not work for a bit and gave no indication of when it would be available.

I decided it was time to do something else…

Saving the Day

The first move I made was actually taken the night before. While I was waiting for word back from tech support, I decided to stop the email outage. The DNS servers were still working so I created a Google Apps account for the site and directed all of my email there.

With that done, Google was effectively hosting my email, not just receiving the forwards from my own server and, by eliminating the middle man, I could send and receive email again.

However, after the server came back but the site did not, I realized I had to do something fast to prevent things from getting worse.

Since the move to VPSLink was recent, I had a near-perfect duplicate copy of the site on Media Temple. I was also fortunate to have an automatic database backup in my email box from just a few hours before the downtime began.

So, using the backups, I updated the Media Temple database on my Mac while using my Windows PC to SSH to the current site and download the plugins, theme changes and images that were less than about two months old.
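For anyone facing a similar salvage job, picking out only the recently changed files can be scripted. What follows is a minimal sketch and not what I actually ran that morning: the function name `files_newer_than` and the 60-day window are my own illustration, and it assumes the files have already been copied somewhere you can read them locally.

```python
import os
import time


def files_newer_than(root, days, now=None):
    """Return a sorted list of files under `root` modified in the last `days` days.

    `root` and `days` are illustrative parameters, not anything from the
    original recovery. `now` can be passed in to make the cutoff repeatable.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds in a day
    recent = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Keep the file only if its modification time is at or after the cutoff.
            if os.path.getmtime(path) >= cutoff:
                recent.append(path)
    return sorted(recent)
```

Calling something like `files_newer_than("wp-content", 60)` (a hypothetical path) would list the plugins, theme files and images changed in roughly the last two months, which is the set that a two-month-old mirror would be missing.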

It was at this point that I noticed how incomplete the backups were. Though they had most of the images, the dates on the most recent images were from the seventh. Even after uploading all of the downloaded files, the front page was still a mess of broken images.

However, I was able to work around that. Since I had used Skitch to upload the screenshots and it keeps a history of the files it puts up, I was able to rebuild the images folder and get everything back online.

At this point, the site was pretty much back together so I changed the DNS servers to point back to Media Temple and started cleaning up a few odds and ends. After uploading a few additional folders, updating a few plugins and activating a few others, PT seems to be back mostly in working order and, bit by bit, the DNS is propagating out and the site is coming back to life.

Aftermath

Looking back at what happened, I am still trying to assess the damage. A few things I know are broken/lost.

  • Lost Comments: Three or four comments, including one of mine, were made between the time the database backed up and the server crashed. Those comments are lost. I may be able to bring them back from the emails I get, but I’ll look into that later.

  • Contact Form: The contact form is currently borked. I’m re-installing the plugin now and should have it working within an hour or two. (Update: Should be working now.)

  • Lost Email: Though I made the switch to Google Apps pretty quickly, there is at least some mail that was lost in the process as it was sent after the outage but before the transfer. Depending on how the DNS servers played out, it could have been many hours after the outage began.

  • Sixteen Hour Downtime: The big one for me is that there was a whopping sixteen hours of downtime, possibly more or less depending on how fast the DNS changes made their way to your area.

All told, the losses were not major but could have been catastrophic.

Lessons Learned

Looking back at it, this was probably the worst of all possible “natural” disasters that can happen to a Web site. It is hard to think of anything that is more catastrophic than a complete storage failure followed by a bad backup.

If there are any Webmasters wanting some advice on these situations, well, here is what I would say.

  1. Backup, Backup, Backup: My backups saved me. If it hadn’t been for my database backups emailed to me every day, I would have been in much bigger trouble. At best, I would have lost several days of posts and waited several more hours to come online. The only things not adequately backed up were the images used in the stories and the front page, something I am fixing now.

  2. Exit Strategy: Having a near-perfect mirror of the site offline saved a lot of time and work. It wasn’t really a planned exit strategy, since the account was due to be shut down in a few weeks, but it worked as one. If nothing else, having a backup account and host set up can save a great deal of time.

  3. Know How Your Software Works: I was also fortunate that I knew how to install and set up WordPress without assistance. Though I didn’t need to set up a whole new installation, I did have to update this one. With so many hosts offering “one click” installs, I wonder how many Webmasters know how to set up their software should something go wrong.
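Since the images were the one thing I had not backed up properly, here is a rough sketch of how a folder like that could be archived on a schedule. This is an assumption-laden example and not the setup I use: the function names `archive_directory` and `list_archive`, the `images` prefix and the paths are all hypothetical, and it only writes a local archive that you would still need to copy off the server.

```python
import tarfile
import time
from pathlib import Path


def archive_directory(source_dir, dest_dir, prefix="images"):
    """Pack `source_dir` into `dest_dir` as <prefix>-YYYYMMDD-HHMMSS.tar.gz.

    Returns the path of the new archive. All names here are illustrative.
    """
    source = Path(source_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = dest / f"{prefix}-{stamp}.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # Store files under the folder's own name rather than its full path.
        tar.add(str(source), arcname=source.name)
    return archive_path


def list_archive(archive_path):
    """Return the regular files stored in the archive, to confirm it can be read back."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return sorted(member.name for member in tar.getmembers() if member.isfile())
```

Reading the archive back with `list_archive` is the important half. As this outage showed, a backup that has never been opened is only a guess.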

Some Personal Thoughts

Right now, after having had some time to ponder what happened, I am very upset with VPSLink and the way the outage was handled. I am going to write a letter to them later that addresses my grievances more clearly (this post is more about explaining the outage to you than venting my frustrations), but I have to say that I feel the situation was handled poorly.

I am not upset about the outage itself. I have been around computers long enough to know that things break. However, the outage could have been handled a great deal better.

First, the company deleted my initial support request without offering any explanation as to why they wouldn’t reboot the server (I know now that they couldn’t). Second, they waited several hours to post anything about the outage on their forums. Third, they only posted two updates on the outage over the course of almost 24 hours. Finally, when the backup failed to restore the site, their best advice was to start from scratch.

I realized when I started using an unmanaged VPS that I was going to be responsible for my own actions and mistakes. There were plenty of ways I could destroy my own site and not get any support. However, I did not realize that support would falter so badly when the error was on their end as well.

In the end, I am going to be much more choosy about the hosts that I use from now on. I am actively seeking hosting recommendations though I may just stick with Media Temple for a while.

Conclusions

In the end, I wrote this because I know there are a lot of Webmasters and bloggers who read this site. Though you come here for information about dealing with content theft and copyright issues, my hope is that maybe this experience can help in other ways.

I feel comfortable saying that Plagiarism Today will survive and overcome this problem. However, that is because of the wonderful readers and community that have developed around this site over the years.

Thank you all for your understanding and support. It has meant more to me than you probably realize.

I hope you have a great weekend.
