Why are my hard drives failing?

Posted by WishCow on Super User
Published on 2010-01-20T22:45:59Z

I have a small Ubuntu server running at home, with 2 HDDs. There are two software RAID1 arrays on the disks, managed by mdadm, which I believe is irrelevant, but I'm mentioning it anyway.
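
(For completeness, the layout is nothing exotic; this is a quick way to see it, with md0/md1 simply being the array names on my box:)

cat /proc/mdstat
sudo mdadm --detail --scan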

Both of the HDDs are Western Digital, and had been in use for around 2 years when one of them started making clicking noises and died. I figured that maybe it's natural after 2 years, so I bought a new one and resynced the RAID arrays. After about a month, the other drive also died.

I didn't get suspicious: since both drives had been bought at the same time, it's not that surprising to see them fail close to each other. So I bought another one.

So far: 2 old drives failed, and 2 brand new ones in the system. After one month, one of the new drives died. This is when things started getting suspicious. Since the PC was put together from some really old parts (think AthlonXP), I figured that maybe the motherboard's SATA controller was the culprit. Of course you can't switch parts easily in an old PC like this, so I bought a whole new system: new MB, new CPU, new RAM. I took the just-failed drive back, since it was under warranty, and got it replaced.

So the tally was up to 2 failed drives from the old ones, and 1 failed drive from the new ones. No problems for 1 month. After that, errors were creeping up again in /var/log/messages, and mdadm was reporting RAID array failures. I started tearing my hair out. Everything in the system is new; it's up to the third brand new HDD. It's simply not possible that all of the new drives I bought were faulty.
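
(For reference, this is roughly what I run each time to confirm it; a sketch, with /dev/md0 being one of my arrays and sda/sdb the disks:)

grep -iE 'ata|sd[ab]|raid' /var/log/messages | tail -n 50
sudo mdadm --detail /dev/md0     # the State line goes to "clean, degraded" once a disk drops out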

Let's see what is still common... the cables. Okay, long shot, let's replace the SATA cables. Take the HDD back, smile at the guy at the counter and say that I'm really unlucky. He replaces the HDD. I come home, one month passes, and one of the HDDs fails, again. I'm not joking.

Two of the brand new HDDs have failed. Maybe it's a bug in the OS. Let's see what the manufacturer's testing tool says. Download the testing tool, burn it to a CD, reboot, leave the HDD testing overnight. The test says that the drive is faulty, and that I should back up everything, if I still can. I don't know what's happening, but it does not look like a software problem; something is definitely trashing the HDDs.
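
(For anyone wanting to reproduce this without the vendor CD, the same kind of long self-test can be run from within Linux; a sketch, assuming smartmontools is installed and the suspect drive shows up as /dev/sda:)

sudo smartctl -a /dev/sda            # overall health and error counters
sudo smartctl -t long /dev/sda       # start the extended self-test
sudo smartctl -l selftest /dev/sda   # read the result once it has finished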

I should mention now that the whole system is in a shoebox. Since there's a load of "build your own IKEA case" stuff out there, I thought there shouldn't be any problem throwing the thing in a box and stuffing it away somewhere. The box is well ventilated, but I thought that just maybe the drives were overheating. There was no other possible explanation left. So I took the HDD back, got it replaced (for the 3rd time), and bought HDD coolers.

And just now, I have heard the sound of doom. click click whizzzzzzzzz. SSH into the box:

You have new mail!
mail
r 1
DegradedArrayEvent on /dev/md0 ...
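
(For context, these mails come from mdadm running in monitor mode; the relevant bit of my /etc/mdadm/mdadm.conf is just a line along these lines, the address being an example:)

MAILADDR root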

dmesg output:

[47128.000051] ata3: lost interrupt (Status 0x50)
[47128.000097] end_request: I/O error, dev sda, sector 58588863
[47128.000134] md: super_written gets error=-5, uptodate=0
[48043.976054] ata3: lost interrupt (Status 0x50)
[48043.976086] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[48043.976132] ata3.00: cmd c8/00:18:bf:40:52/00:00:00:00:00/e1 tag 0 dma 12288 in
[48043.976135] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[48043.976208] ata3.00: status: { DRDY }
[48043.976241] ata3: soft resetting link
[48044.148446] ata3.00: configured for UDMA/133
[48044.148457] ata3.00: device reported invalid CHS sector 0
[48044.148477] ata3: EH complete

Recap:

  1. No possibility of overheating.
  2. 6 drives have failed, 4 of them brand new. I'm not sure now whether the original two were simply faulty, or suffered from the same thing as the new ones.
  3. There is nothing common left in the system, apart from the OS, which is Ubuntu Karmic now (it started out on Jaunty). New MB, new CPU, new RAM, new SATA cables.
  4. No, the little holes on the HDDs are not covered.

I'm crying. Really. I don't have the face to return to the store now; it's not possible for 4 drives to fail in under 4 months.

A few ideas that I have been mulling over: Is it possible that I fuck something up when I partition and resync the drives? Can it be so bad that it physically wrecks the drive (since the vendor-supplied tool says the drive is damaged)? I do the partitioning with fdisk, and use the same block size for the RAID1 partitions (I check the exact block sizes with fdisk -lu).
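
(For reference, this is roughly what I do when swapping a drive in; a sketch, with /dev/sda as the surviving disk and /dev/sdb as the replacement, which is just how they happen to be named here:)

sudo fdisk -lu /dev/sda              # note the exact start/end sectors of the existing partitions
sudo fdisk /dev/sdb                  # recreate identical partitions on the new disk by hand
sudo mdadm /dev/md0 --add /dev/sdb1  # re-add the new partitions to the arrays
sudo mdadm /dev/md1 --add /dev/sdb2
watch cat /proc/mdstat               # follow the resync progress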

Is it possible that the Linux kernel, or mdadm, or something else, is not compatible with this exact brand of HDD, and trashes them?

Is it possible that it's the shoebox? Should I try placing it somewhere else? It's under a shelf now, so humidity is not a problem either. Is it possible that a normal PC case would solve my problem (I'm going to shoot myself if it does)? I will get a picture tomorrow.

Am I just simply cursed?

Any help or speculation is greatly appreciated.

Edit: The power strip is protected against overvoltage.

Edit 2: I have moved house in the middle of these 4 months, so the possibility of the cause being "dirty" electricity in both places is very low.

Edit 3: I have checked the voltages in the BIOS (I couldn't borrow a multimeter), and they all seem correct; the biggest discrepancy is on the 12V rail, which reads 11.3V. Should I be worried about that? (If I understand the ATX spec correctly, the 12V rail is supposed to stay within ±5%, i.e. above roughly 11.4V, so 11.3V would already be slightly out of spec.)
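
(If it helps, the voltages can also be watched from inside Linux while the box is under load; a sketch, assuming the lm-sensors package and a sensor chip the kernel supports:)

sudo apt-get install lm-sensors
sudo sensors-detect                  # answer the prompts, then load the suggested modules
sensors                              # shows the 12V/5V/3.3V rails if the sensor chip exposes them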

Edit 4: I put my desktop PC's PSU into the server. The BIOS reported much more accurate voltage readings, and the RAID1 array has also rebuilt successfully, which took some 3-4 hours, so I feel a little more positive now. I will get a new PSU tomorrow to test with. Also, attaching the picture of the box (disregard the 3rd drive): picture of box of doom
