How to Have a Web Site Outage

I recently got to live through the painful experience of supporting a customer during a three-day database outage. It happened like this.

Four years ago, a development team that I led created a web site for the customer at AOL. Three-and-a-half years ago, I asked AOL's DBA support group to ensure that the production database supporting the web site was backed up nightly. We didn't have management approval for a fail-safe solution (dual database hosts with hot fail-over), so we knew that if the database ever went down, the recovery plan would mean a short outage lasting as long as it took to restore the backups to an alternate host.

Since it was not a very large database, we expected such an outage to last no more than a couple of hours. Tough luck if it happened, of course, but that's the situation you end up with when management doesn't want to spring for a more robust solution.

Responsibility for backups now belonged to the DBA group. Having made sure a process was in place, I considered my part in that realm done.

This web site was an internal AOL corporate site that provided tools to automate tasks for employees. Over time, it grew to support about 15 hard-core users, another 50-plus regular users, and roughly 600 occasional users.

Last week, the CPU on the Solaris computer where the database resided died "suspiciously" after an operating system patch was installed. The vendor was called in, and I was informed that it was going to be "some time" before the machine was likely to be up again. Not good for the customer.

No problem, I foolishly thought. I'll just have the DBAs restore the database backup somewhere else, we'll change a config file to reference the new database location, and we'll have the database operational again in a couple of hours.
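To make the hoped-for fix concrete: the site read its database location from a config file, so recovery should have amounted to restoring the backup onto a spare host and editing one setting. Here is a minimal sketch of that idea; the file name, section, keys, and host names are hypothetical, not the site's actual configuration.

```python
# Minimal sketch of the intended "couple of hours" recovery.
# The file name, section, keys, and host names below are hypothetical.
from configparser import ConfigParser

def load_db_settings(path: str = "site.conf") -> dict:
    """Read the database location the web site should connect to."""
    cfg = ConfigParser()
    cfg.read(path)
    db = cfg["database"]
    return {
        "host": db.get("host"),   # the one value we planned to change
        "port": db.getint("port"),
        "name": db.get("name"),
    }

# Intended recovery: restore last night's backup onto a spare host,
# change [database] host in site.conf to point at it, restart the site.
```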

Uh. Not exactly. You see, in the years after the process was put in place, AOL phased out its Solaris boxes...except for this one. And those nightly backups... Well, they were nightly backups of the filesystem. In other words, no database backups had ever been done. Instead, the raw data files managed by the RDBMS were backed up to tape.

More precisely, the backup consisted of tapes of the filesystem, including the data files and the RDBMS installation itself. That made the backup OS-dependent: you couldn't restore it onto a different kind of computer, and the files wouldn't work under HP-UX or Linux.
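The distinction matters. A logical database dump is self-describing and can be reloaded onto any host running a compatible version of the same RDBMS; a tape copy of the raw data files is tied to the exact OS and database installation that wrote them. The story never names the RDBMS or the backup tooling, so the following is only an illustrative sketch of the kind of nightly job I had assumed was running, using PostgreSQL's pg_dump as a stand-in.

```python
# Illustrative only: the RDBMS and tooling in the story are not named.
# This sketches a nightly *logical* dump (portable across hosts), as
# opposed to the filesystem-level tape backup that was actually taken.
import datetime
import subprocess
from pathlib import Path

BACKUP_DIR = Path("/backups/toolsdb")   # hypothetical backup area
DB_NAME = "toolsdb"                     # hypothetical database name

def nightly_dump() -> Path:
    """Write a compressed, host-independent dump of the database."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    out = BACKUP_DIR / f"{DB_NAME}-{stamp}.dump"
    # pg_dump's custom format can be restored with pg_restore on any
    # machine running a compatible PostgreSQL server, regardless of OS.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", str(out), DB_NAME],
        check=True,
    )
    return out

if __name__ == "__main__":
    print(f"Wrote {nightly_dump()}")
```

With dumps like that on tape, the restore-anywhere plan would have worked; with raw Solaris data files, it couldn't.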

And there were no other Solaris boxes onto which the backup could be restored. Had there been a spare Solaris box, the original scenario of a short outage would have held true.

Alas, I got to commiserate with my customer through a three-day outage of their production web site. During this period, the system administrators installed two CPUs, discovered that the problem was the OS patch they had installed immediately before the failure (DUH), scraped the box clean, re-imaged it, and then loaded the database backups. Ultimately, no data was lost, but productivity was certainly compromised for three days.

So, this was a situation where a backup plan was in place, though it wasn't quite the backup plan I'd originally been promised. Then, over time, the DBA group, through turnover and a basic lack of attention, lost the institutional knowledge that there was even a production database on that particular Solaris host. They went on to compromise the backup plan by phasing out all of the other compatible database hosts. And thus the recovery, which originally would have meant only a short outage, became a three-day one.


