Backups, Disaster Recovery, and Mitigating Downtime

Previously, I have touched on the subject of backups and their importance to you and your data. Most folks don’t think about backups until after something happens, like a hard drive failure, and then they really need a backup but don’t have one. More than that, a great many people don’t really understand what is involved in properly protecting their data. On top of that, most businesses need to mitigate downtime as well as protect their data. It might not be a hardship for a home user to take a couple of days to get a hard drive, install it, and rebuild their system, but most businesses can’t afford to be down that long. Some companies even have a requirement for zero unplanned downtime. You can’t ever truly accomplish zero downtime, because something will go wrong eventually, but you can do everything in your power to make sure that any unplanned downtime is as quick and painless as possible. Good backups and good planning make all of this possible.

First we need to ask: what does it mean to protect your data? In my opinion, protecting your data comes down to three main elements: Backups, Disaster Recovery, and Mitigating Downtime. Each one of these elements plays its part in protecting your data. In this article I will go over each one, explain what it means, and show how it fits into a good data protection plan. You can use these elements and this article as guidelines for your own data protection needs, and assess what fits best in your environment.

Essential Data Protection Elements:

Backups – When talking about protecting your data, the most obvious thing you think about is backing up your data, which essentially means storing copies of your data on some type of media (disk, tape, CD/DVD, etc.) so that you can put the data back in the event of some type of failure or data corruption. This is where lots of people stop, and that’s if they even get this far. Backup copies of your data are great, but not the only thing to think about when planning for the best protection of your data. As with many things I talk about on this site, it is but one part of a good plan.
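
To make the idea of "storing copies of your data" concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not a full backup tool: the function name make_backup and the timestamped file naming are made up for this example, and it simply writes a compressed archive of one directory into another location.

```python
import tarfile
import time
from pathlib import Path


def make_backup(source_dir: str, backup_dir: str) -> Path:
    """Write a timestamped, compressed copy of source_dir into backup_dir.

    Hypothetical helper for illustration; backup_dir should live on
    different media than source_dir (and must not be inside it).
    """
    source = Path(source_dir)
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # A timestamp in the name keeps older backups instead of overwriting them.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the full path.
        tar.add(source, arcname=source.name)
    return archive
```

Keeping each run in its own timestamped file matters later on: if the newest copy turns out to be bad, you can still go back a day.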

Disaster Recovery – Suppose you have all the best backups in the world. All of your data is on nice, neatly labeled tapes, right there next to your server. You are ready for a restore, you are on top of things. Now, what would you do if the building burns down? Guess what, your server, your data and all of your backup tapes just got destroyed, all at the same time. This is where disaster recovery comes into play, and it is just what it sounds like: preparing for a disaster. That means asking those tough questions and deciding just what you would do if the building burned down, or was devastated by a hurricane or flood. It could even be that someone steals a bunch of equipment from your data center and happens to get your server and tapes too. The point is, just having backup tapes next to your server, or even in the same building, isn’t enough if you are going to get through some truly bad times.

For those of you reading this who think it can’t happen to you, look at what has gone on in the past few years with natural disasters, terrorist attacks and the like. The people affected by 9/11 could not have recovered as well as they did had they not had good disaster recovery plans. Let me mention here that disaster planning for your organization as a whole is a different ball game, and I don’t pretend to address that here. What I am talking about is disaster planning for your data, so that if the building that holds your servers and data becomes a smoking crater one day, you still have an ace up your sleeve and can recover with new hardware in some other location.

Mitigating Downtime – This element is probably the only one that could be considered optional. However, you still have to think about it and address it. You can decide that downtime is not an issue for you and your business, but at least then you have made that decision consciously and won’t be caught by surprise later. Mitigating downtime is not the same as backing up your data, but it ties in with backup-related subjects because the incidents that cause you to need backups often cause downtime as well. So, the two pretty much go hand in hand.

Aside from backing up your data and planning for disaster recovery, you can architect ways to stay up and running when something happens, rather than just being down while it gets fixed. If being down doesn’t adversely affect you, then you don’t have to worry as much about this one, and only you can make that decision. If you do put some thought into mitigating downtime, it can make those unavoidable downtime periods much less stressful.

Real World Examples:

Let me give you some real world examples of ways you can address various issues, and how they relate to the elements above.

Let’s start with something simple, and all too common, a hard drive going bad. This type of failure has different repercussions in your workstation, versus your main server. This means that when you look at the three elements I outlined above, your answers to each element may be different. In these examples, we are going to be looking at it from the perspective of a business running a server that doesn’t want any downtime.

You can recover from a hard drive failure by having good backups. You would replace the drive, reinstall the operating system, restore your data, and be back up and running in a day or two. That covers the backup element, but let’s say you really don’t want to be down for a day or two. In that case you could use something like a mirrored drive array in your machine. A mirrored drive array, or mirror for short, has two drives that act as one, so you have two duplicate copies of your data. This is done through a special controller built in or installed into your machine. If a drive fails, you can operate off of the drive that is still good while you get a replacement drive. When you install the replacement drive, it will sync up with the drive that is still good, and the mirror will be rebuilt. In most cases this is all done while the server is operational. Some configurations require a reboot of the server, but even then your downtime is reduced to mere minutes.

This covers the downtime mitigation element, and to some extent the backup element. You might think that having a disk mirror would completely take care of the backup element as well, and while there is some crossover, it doesn’t replace it. A mirror will help you in the event we mentioned above, a drive failure. But what about data corruption, or if you accidentally delete a file? You’d be out of luck, because the mirror would simply copy the corruption or deletion to the other disk. You would still need the backup files to get your data back in that case. Are you starting to see how these elements work together?
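
If it helps to see the difference spelled out, here is a toy model in Python. It is a sketch only: the MirroredVolume class and its methods are invented for this example and have nothing to do with how a real RAID controller works internally. It just shows that a mirror survives a dead disk but faithfully repeats a deletion, while a point-in-time backup copy does not.

```python
import copy


class MirroredVolume:
    """Toy two-disk mirror: every write or delete is applied to both disks."""

    def __init__(self):
        self.disks = [{}, {}]

    def write(self, name, data):
        for disk in self.disks:
            if disk is not None:
                disk[name] = data

    def delete(self, name):
        # The mirror cannot tell a mistake from a legitimate delete.
        for disk in self.disks:
            if disk is not None:
                disk.pop(name, None)

    def fail_disk(self, index):
        self.disks[index] = None  # simulate a dead drive

    def read(self, name):
        for disk in self.disks:
            if disk is not None and name in disk:
                return disk[name]
        return None

    def snapshot(self):
        """Return a point-in-time copy, standing in for a backup."""
        for disk in self.disks:
            if disk is not None:
                return copy.deepcopy(disk)
        raise RuntimeError("no surviving disk to back up")


volume = MirroredVolume()
volume.write("payroll.db", "important records")
backup = volume.snapshot()          # taken before anything goes wrong

volume.fail_disk(0)                 # drive failure: the mirror still serves reads
survived_failure = volume.read("payroll.db") is not None

volume.delete("payroll.db")         # accidental delete: gone from the mirror
lost_from_mirror = volume.read("payroll.db") is None
still_in_backup = "payroll.db" in backup
```

In this model all three flags at the end come out true, which is the whole point: the mirror handles the hardware failure, and only the backup handles the mistake.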

Now, let’s take it a step further. Like we mentioned above, your office building burns down! Now your computer, mirror, tapes and more are gone, you are not having a good day! You have no data to recover from, right? Wrong! By addressing the disaster recovery element, you put a separate copy of your data in a safe place (safe deposit box at the bank, home safe, etc.) so not all of your data was destroyed in the fire. You can now get more hardware, pull your extra backups out of your special place and recover.

This should illustrate further how all these elements work together. By looking at these three elements of data protection, and assessing your needs as they apply to these three elements, you can prepare for the level of data protection that you need. In the case above, good backup procedures, some simple disk mirroring and copies of your data stored off site in a secure, safe place, meant that you were ready for almost anything life decided to throw at you.

I don’t know of any situation where I would say that you are 100% covered and ready for anything that could possibly happen. You can be very well prepared, but making some kind of guarantee like that just invites Murphy to pile on lots of things at once.

Let me relate to you a real life example I went through earlier this year. It might illustrate the importance of these elements even further, and also show that this really does happen. One of our customers, who has a server at our location, had a really bad hard drive failure. We had mirrored drives in the server, but this sucker lost two drives in the same mirror at the same time! Talk about Murphy coming to visit, I think he has an office down the hall.

We replaced the drives, and had to go back to our backup tape, only to find that we had a bad tape in the mix too, for the most recent backup. We were having a bad night, let me tell you. In the end, we rebuilt the server, including installing the operating system (this was a Solaris box, and we make heavy use of Jumpstart, so this part only took a few minutes). We were then able to pull just the data that we needed off of the tapes by going back a day, and the server was back up and running in a few hours.

With all of the problems we had, we still had more downtime than we wanted, but I think we did a great job given all of the things that went wrong. Had we had more tape problems, we still had tapes in an off site location that we could have gone to. The point here is that we had a disk mirror where both disks failed, and the odds against that are astronomical! We even had a bad tape when we tried to do the restore. But since we had multiple backups, and multiple layers of data protection, we still got the server recovered without losing our customer’s data.
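
The "bad tape, go back a day" step can be sketched in code as well. This is not what we actually ran that night; it is a small Python illustration under assumptions of my own: backups are files ending in .tar.gz whose names sort in date order, and each one has a checksum recorded next to it in a .sha256 file when it is made. All of the function names here are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksum(archive: Path) -> Path:
    """Store the archive's checksum beside it at backup time."""
    sidecar = archive.with_name(archive.name + ".sha256")
    sidecar.write_text(sha256_of(archive))
    return sidecar


def is_intact(archive: Path) -> bool:
    """True only if a recorded checksum exists and still matches the file."""
    sidecar = archive.with_name(archive.name + ".sha256")
    if not sidecar.exists():
        return False
    return sidecar.read_text().strip() == sha256_of(archive)


def newest_intact_backup(backup_dir: Path) -> Optional[Path]:
    """Walk backups from newest to oldest and return the first good one."""
    # Assumes file names sort chronologically, e.g. data-20240102.tar.gz.
    for archive in sorted(backup_dir.glob("*.tar.gz"), reverse=True):
        if is_intact(archive):
            return archive
    return None
```

The design choice worth noticing is that the check happens against something recorded when the backup was made. Finding out a backup is bad in the middle of a restore is exactly the surprise you want to avoid, so a check like this is best run on a schedule, not just on the bad night.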

Protecting your data doesn’t have to be complicated. One of the rules I live by is that simple is good, simpler is better. Hopefully this article will help you in thinking about your data protection needs, and what you need to look at, think about, and plan for in order to keep your systems up and running.
