Linux RAID 1: right after replacing and syncing one drive, the other disk fails - understanding what is going on with mdstat/mdadm

Posted by devicerandom on Server Fault, 2013-11-05

We have an old Linux server (Ubuntu Lucid 10.04) with a RAID 1 setup spanning four partitions. A few days ago /dev/sdb failed, and today we noticed /dev/sda was showing ominous pre-failure SMART signs (a reallocated sector count of ~4000). We replaced /dev/sdb this morning and rebuilt the RAID onto the new drive, following this guide:

http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array
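
Roughly, that meant following the guide's recipe; reconstructed from memory (the exact invocations may have differed slightly), the steps were something like:

    # Copy the partition table from the surviving disk to the new one
    sfdisk -d /dev/sda | sfdisk /dev/sdb

    # Add each new partition back to its array and let md resync
    mdadm --manage /dev/md0 --add /dev/sdb1
    mdadm --manage /dev/md1 --add /dev/sdb2
    mdadm --manage /dev/md2 --add /dev/sdb3
    mdadm --manage /dev/md3 --add /dev/sdb4

    # Watch the rebuild progress
    cat /proc/mdstat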

Everything went smoothly until the very end. Just as it looked like it was finishing synchronizing the last partition, the other (old) drive failed. At this point I am very unsure of the state of the system. Everything seems to be working and all the files seem to be accessible, just as if everything had synchronized, but I'm new to RAID and I'm worried about what is going on.

The /proc/mdstat output is:

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md3 : active raid1 sdb4[2](S) sda4[0]
      478713792 blocks [2/1] [U_]

md2 : active raid1 sdb3[1] sda3[2](F)
      244140992 blocks [2/1] [_U]

md1 : active raid1 sdb2[1] sda2[2](F)
      244140992 blocks [2/1] [_U]

md0 : active raid1 sdb1[1] sda1[2](F)
      9764800 blocks [2/1] [_U]

unused devices: <none>

The order of [_U] vs. [U_]: why isn't it consistent across all the arrays? And is the first U /dev/sda or /dev/sdb? (I tried looking on the web for this trivial piece of information, but I found no explicit indication.) If I read it correctly, for md0 [_U] should mean /dev/sda1 (down) and /dev/sdb1 (up). But if /dev/sda has failed, how can it be the opposite for md3? I understand /dev/sdb4 is now a spare, probably because it failed to synchronize 100%, but why does it show /dev/sda4 as up? Shouldn't it be [__]? Or [_U] anyway? The /dev/sda drive apparently cannot even be accessed by SMART any more, so I wouldn't expect it to be up. What is wrong with my interpretation of the output?
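
To try to map the bracket positions to actual devices I have been looking at the following (my assumption being that position N in [U_] corresponds to RaidDevice N in mdadm --detail, but please correct me if that's wrong):

    # Per-array view: RaidDevice slots next to the member devices
    mdadm --detail /dev/md3

    # Per-member view: what each component's superblock says about its own slot and state
    mdadm --examine /dev/sda4 /dev/sdb4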

I also attach the output of mdadm --detail for the four arrays:

/dev/md0:
        Version : 00.90
  Creation Time : Fri Jan 21 18:43:07 2011
     Raid Level : raid1
     Array Size : 9764800 (9.31 GiB 10.00 GB)
  Used Dev Size : 9764800 (9.31 GiB 10.00 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Nov  5 17:27:33 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : a3b4dbbd:859bf7f2:bde36644:fcef85e2
         Events : 0.7704

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1

       2       8        1        -      faulty spare   /dev/sda1

/dev/md1:
        Version : 00.90
  Creation Time : Fri Jan 21 18:43:15 2011
     Raid Level : raid1
     Array Size : 244140992 (232.83 GiB 250.00 GB)
  Used Dev Size : 244140992 (232.83 GiB 250.00 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Nov  5 17:39:06 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 8bcd5765:90dc93d5:cc70849c:224ced45
         Events : 0.1508280

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       18        1      active sync   /dev/sdb2

       2       8        2        -      faulty spare   /dev/sda2


/dev/md2:
        Version : 00.90
  Creation Time : Fri Jan 21 18:43:19 2011
     Raid Level : raid1
     Array Size : 244140992 (232.83 GiB 250.00 GB)
  Used Dev Size : 244140992 (232.83 GiB 250.00 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Tue Nov  5 17:46:44 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 2885668b:881cafed:b8275ae8:16bc7171
         Events : 0.2289636

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       19        1      active sync   /dev/sdb3

       2       8        3        -      faulty spare   /dev/sda3

/dev/md3:
        Version : 00.90
  Creation Time : Fri Jan 21 18:43:22 2011
     Raid Level : raid1
     Array Size : 478713792 (456.54 GiB 490.20 GB)
  Used Dev Size : 478713792 (456.54 GiB 490.20 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 3
    Persistence : Superblock is persistent

    Update Time : Tue Nov  5 17:19:20 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       1       0        0        1      removed

       2       8       20        -      spare   /dev/sdb4

The active sync on /dev/sda4 baffles me.

I am worried because, if tomorrow morning I have to replace /dev/sda, I want to be sure about what I should sync with what and what is going on. I am also quite baffled that /dev/sda decided to fail exactly when the RAID finished resyncing. I'd like to understand what is really happening.
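
For what it's worth, this is roughly what I assume the /dev/sda replacement would look like, adapting the same guide (the exact sequence is my guess, so please correct me if it's wrong):

    # Mark the remaining /dev/sda members faulty (where md has not already done so) and remove them
    mdadm --manage /dev/md3 --fail /dev/sda4
    mdadm --manage /dev/md0 --remove /dev/sda1
    mdadm --manage /dev/md1 --remove /dev/sda2
    mdadm --manage /dev/md2 --remove /dev/sda3
    mdadm --manage /dev/md3 --remove /dev/sda4

    # After physically swapping the disk, copy the partition table from /dev/sdb
    sfdisk -d /dev/sdb | sfdisk /dev/sda

    # Add the new partitions and let the arrays resync
    mdadm --manage /dev/md0 --add /dev/sda1
    mdadm --manage /dev/md1 --add /dev/sda2
    mdadm --manage /dev/md2 --add /dev/sda3
    mdadm --manage /dev/md3 --add /dev/sda4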

Thanks a lot for your patience and help.

Massimo

