RAID controller dropping the wrong drive
- by bramp
I've been having an issue with a 3ware 9500S-8 running RAID 10. I have contacted their tech support, but I wanted to hear the serverfault community's recommendations.
Firstly, all my data is backed up and safe, so I don't mind blowing my RAID away if I have to. But let me describe the problem I've been seeing.
A month ago, disk 6 dropped out of the RAID. It is mirrored with disk 7, so I wasn't that bothered. I went to the data centre and replaced it. When I got back to the office, I noticed that disk 6 was still not in the RAID, and in fact the controller was still showing the name of the old drive. A week later I went back and replaced the drive again, thinking I might have swapped in a bad drive. Still the same problem.
I decided to reboot the machine to see if that would "force" the controller into seeing the new drive. It did, and a rebuild started (from disk 7). Eventually both drives were showing as good.
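(In hindsight, I believe the 3ware CLI can pick up a swapped drive without a reboot; something along these lines, assuming the array is unit u0 on controller c0, and going from the CLI guide rather than my own notes:)

    # Ask controller 0 to rescan its ports for newly inserted drives
    tw_cli /c0 rescan

    # Confirm the new disk is now visible on port 6
    tw_cli /c0/p6 show

    # Start rebuilding unit 0 onto the disk on port 6
    # (rebuild syntax is from the 3ware CLI guide, from memory)
    tw_cli /c0/u0 start rebuild disk=6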
A week later, MySQL flagged the database as corrupt and was unable to repair it. I don't know what went wrong, but I suspected this 6-7 pair. At this point I noticed that the RAID had been constantly verifying itself, over and over. Regardless, I began to repair the database, which took about 19 hours. It's a big database. Near the end of the repair, the RAID controller told me it had dropped disk 7, and that some data was most likely corrupted.
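(The repair pass was essentially a full mysqlcheck; roughly the sketch below, assuming MyISAM tables and a root login, not my exact command line:)

    # Check every table in every database and repair anything flagged as corrupt
    mysqlcheck --all-databases --check --auto-repair -u root -p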
I contacted LSI tech support, and they very promptly started to help me. I mentioned that drive 7 had been dropped. They suspect that drive 7 was always at fault, and drive 6 had always been good. I want to know how often a RAID controller would drop the wrong drive (in this case dropping drive 6 a month ago, instead of 7).
I foolishly didn't run smartctl on the drives before I started swapping them out. I just assumed the RAID controller knew what it was talking about.
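For anyone else in the same position: the individual drives behind a 3ware controller can be queried directly. Something like this, assuming the 9000-series driver exposes the card as /dev/twa0 (the older driver uses /dev/twe0) and that the number after "3ware," is the port:

    # Full SMART attribute dump for the drive on port 6, behind the 3ware controller
    smartctl -a -d 3ware,6 /dev/twa0

    # Quick overall health verdict for the same drive
    smartctl -H -d 3ware,6 /dev/twa0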
I think my plan of action is to replace drive 7, rebuild the array from scratch, double-check smartctl on ALL the disks, and then start restoring my data again.
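The "check smartctl on ALL the disks" step would look roughly like this (again assuming /dev/twa0 and ports 0-7 on the 9500S-8):

    # Health-check every port on the controller, then kick off long self-tests
    for port in $(seq 0 7); do
        echo "=== port $port ==="
        smartctl -H -d 3ware,$port /dev/twa0
        smartctl -t long -d 3ware,$port /dev/twa0
    done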
I would appreciate anyone's input on what the correct procedure for swapping drives is, and how often failures like this happen. If anyone would like more information, I'd be happy to provide it.
Thanks in advance.
Oh, some more information: I'm running CentOS 5.3 with two RAID arrays, a simple RAID 1 for the OS and a RAID 10 for the database. The two arrays are on different controllers. The RAID 10 was made up of 10 identical ST3640323AS drives until I swapped in a SAMSUNG HD103SJ last month.
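In case it helps, this is roughly how I plan to list what each controller sees and confirm the drive models (c0/c1 are placeholders for whatever IDs tw_cli reports, and this assumes both controllers are 3ware):

    # List every 3ware controller the CLI can see
    tw_cli show

    # Units, ports and drive models on each controller
    tw_cli /c0 show
    tw_cli /c1 show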