Solaris Volume Manager (SVM) – Replacing Failed Disks In Mirrors

In a recent post about creating disk mirrors with Solaris Volume Manager (SVM), we stepped through the whole process of building mirrors with SVM, explaining things along the way. Well, I thought it might be a good idea to visit the subject of replacing a failed disk in one of those minty fresh mirrors.

Let’s say you have your little metastat script running on your server to alert you when there is trouble, and you see an email arrive in your inbox letting you know that one of your drives has failed. You can see that something is wrong when the metastat output lists one of the metadevices (drives) as “Needs Maintenance” rather than “Okay”.
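If you don’t have a script like that yet, a minimal sketch of one looks something like this. The “Needs maintenance” string it greps for (and the mailx recipient) are assumptions on my part; double-check what metastat actually prints on your release before trusting it:

```shell
#!/bin/sh
# Minimal metastat watcher (sketch). needs_maintenance reads metastat
# output on stdin and succeeds if any metadevice is flagged. The string
# it matches is an assumption; verify it against your metastat output.
needs_maintenance() {
    grep -i "needs maintenance" >/dev/null
}

# Typical cron usage (commented out here, since it needs a live system):
# if metastat | needs_maintenance; then
#     metastat | mailx -s "SVM: disk needs maintenance on `hostname`" root
# fi
```

Drop something like that in cron every few minutes and you’ll hear about failures pretty quickly.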

There are several things you can do at this point, and I am going to share what I do, based on my experience. I have seen lots of drives go bad, and interestingly, only sometimes are they really bad. I don’t have an exact explanation, but some of the engineers from Sun have told me that the system may error out a drive when it only thinks there is a problem, due to a low error threshold, or maybe it’s a firmware bug because the system isn’t on the latest version. Who knows. The point is, I usually have to make sure the drive is truly bad before I call it in to Sun for replacement.

One of the easiest ways to do this is to use SVM commands to replace the drive with itself. I can see the look on your face, it’s ok! What I mean is using a command to tell the mirror to re-sync with the existing drive in place. Sometimes I do this and the drive status goes back to “Okay” and stays there, sometimes it goes back to “Needs Maintenance”, and sometimes it doesn’t even finish the re-sync at all. This is a good litmus test for seeing if a drive is truly bad.

Now, let me make something very clear. If you are working with production systems, it’s not a good idea to play with potentially bad disks and risk data loss. I don’t normally do this litmus test on production systems. For those, we err on the side of caution, swap the disk with a spare, and then call it in. For dev and test systems, though, this is fine in my book, although there are always exceptions to the rule.

The command I use to do this is called metareplace. If I were working with a mirror named d10, with d11 (c0t0d0s0) and d12 (c0t1d0s0) as the disks in the mirror (as in the creation example), and it was d12 that went bad, the command would look like this:

metareplace -e d10 c0t1d0s0

Once you run this command, you should see the status from your metastat command change to “Re-syncing …”. If this completes, and you don’t see any more errors, rock on, you are golden. But what happens if it goes back to needing maintenance? Let’s find out!
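If you’d rather not sit there re-running metastat by hand while the re-sync chugs along, a little polling loop does the trick. The “Resync in progress” string is my assumption about what metastat prints during a re-sync, so verify it against your own output first:

```shell
#!/bin/sh
# Poll until the d10 re-sync finishes (sketch). resyncing reads metastat
# output on stdin; the matched string is an assumption about metastat's
# wording during a re-sync.
resyncing() {
    grep "Resync in progress" >/dev/null
}

# On a live box (commented out here, since it needs real metadevices):
# while metastat d10 | resyncing; do
#     sleep 60
# done
# metastat d10 | grep "State:"    # confirm it settled on Okay
```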

First, let me address the metareplace command that we used earlier. Oddly enough, even though it appears to be documented as possible, I have never been able to use metareplace with a brand new drive. Not without being a little risky, and here’s what that means.

Normally, I’ll detach the failed meta device from the mirror like this:

metadetach -f d10 d12

See that, we tell it to detach d12 (the failed device) from d10 (the mirror). We use -f to force it, because a failed device won’t normally want to detach. Then, we delete the failed metadevice like this:

metaclear d12

Again, it’s pretty straightforward: we use the metaclear command to clear, or delete, the failed metadevice. At this point we actually replace the physical disk, and once the fresh new disk is in the machine, we use metainit and metattach to create the new d12 metadevice and re-attach it to the d10 mirror. See the previous article for instructions on creating the devices, because you’ll have to go through the steps of partitioning, creating the meta databases, etc. It sounds like a lot, but really, once you have done it a couple of times, it’s a five minute operation.
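Putting those steps together, here’s a hedged sketch of the whole replacement using the d10/d11/d12 names from above. The prtvtoc/fmthard line copies the partition table over from the surviving disk, and the metadb line (commented out, with the slice number as an assumption) only applies if the failed disk held state database replicas. By default the script just prints each command so you can review it; set DOIT=yes to actually run them:

```shell
#!/bin/sh
# Sketch of the full disk replacement for the d10/d11/d12 example.
# Dry-run by default: each command is printed, and only executed when
# DOIT=yes. Review every step before running this on a real system.
run() {
    echo "+ $*"
    if [ "$DOIT" = "yes" ]; then "$@"; fi
}

run metadetach -f d10 d12    # force-detach the failed submirror
run metaclear d12            # delete the failed metadevice
# ...physically swap the c0t1d0 drive here...
# Copy the partition table from the good disk (slice 2 = whole disk):
run sh -c "prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2"
# Only if the failed disk held state database replicas (the slice here
# is an assumption; match whatever you used when building the mirror):
# run metadb -a -c 2 c0t1d0s7
run metainit d12 1 1 c0t1d0s0    # recreate the submirror
run metattach d10 d12            # re-attach it; the re-sync starts on its own
```

When you run it for real, follow up with metastat d10 and watch the state go from re-syncing to Okay.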

That’s the safe way to do it, and sometimes I do that. If it’s a low risk system, like a sandbox test server, I have been known to just do the physical drive swap first, and then run that metareplace -e command from above to re-sync the mirror. It actually seems to work quite well, it’s just that according to some Sun engineers I have spoken to, you really should detach and clear the failed device first, before pulling it from the machine. I haven’t ever had it cause a problem, but who wants to tempt fate on an important box?

Other than that, it’s about the same as above, once the re-sync is done make sure the state doesn’t go back to needing maintenance. If it does, you may have other problems like a controller or backplane or something. If it stays Okay, you should be good to go!
