How can I recover an ext4 filesystem corrupted after a fsck?
- by Regan
I have an ext4 filesystem on luks over software raid5. The filesystem was operating "just fine" for several years when I was beginning to run out of space. I had a 9T volume on 6x2T drives. I began upgrading to 3T drives by doing the mdadm fail, remove, add, rebuild, repeat process until I had a larger array.
I then grew the luks container, and then when I unmounted and tried to resize2fs I was given the message the filesystem was dirty and needed e2fsck.
Without thinking I just did e2fsck -y /dev/mapper/candybox and it began spewing all kinds of inode being removed type messages (can't remember exactly) I killed e2fsck and tried to remount the filesystem to backup data I was concerned about. When trying to mount at this point I get:
# mount /dev/mapper/candybox /candybox
mount: wrong fs type, bad option, bad superblock on /dev/mapper/candybox,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
Looking back at my older logs I noticed the filesystem was giving this error each time the machine booted:
kernel: [79137.275531] EXT4-fs (dm-2): warning: mounting fs with errors, running e2fsck is recommended
So shame on me for not paying attention :(
I then tried to mount using every backup superblock (one after another) and each attempt left this in my log:
EXT4-fs (dm-2): ext4_check_descriptors: Checksum for group 0 failed (26534!=65440)
EXT4-fs (dm-2): ext4_check_descriptors: Checksum for group 1 failed (38021!=36729)
EXT4-fs (dm-2): ext4_check_descriptors: Checksum for group 2 failed (18336!=39845)
...
EXT4-fs (dm-2): ext4_check_descriptors: Checksum for group 11911 failed (28743!=44098)
BUG: soft lockup - CPU#0 stuck for 23s! [mount:2939]
Attempts to restart e2fsck results in:
# e2fsck /dev/mapper/candybox
e2fsck 1.41.14 (22-Dec-2010)
e2fsck: Group descriptors look bad... trying backup blocks...
candy: recovering journal
e2fsck: unable to set superblock flags on candy
At this point, I decided it best to order some more drives and make an image using ddrescue
Now two weeks later I have an image of the luks partition in a .img file.
# ls -lh
total 14T
-rw-r--r-- 1 root root 14T Oct 25 01:57 candybox.img
-rw-r--r-- 1 root root 271 Oct 20 14:32 candybox.logfile
After numerous attempts using everything I could find online I could not coerce e2fsck to do anything on the image, so I used mkfs.ext4 -L candy candybox.img -m 0 -S and I was able to mount the dirty filesystem readonly without the journal and recover 960G of data. It gave all kinds of errors of various directories not existing and so forth but I was able to get some stuff. Which gave me some hope!
I then ran e2fsck again and it had to recreate the root inode and gave a massive list of correcting group counts, I accepted the root inode creation and said no to everything else, leaving a completely empty filesystem. Re-ran again and said yes to all questions with the same result but now a "clean" but empty filesystem.
extundelete gives me 0 recoverable inodes found.
And now I'm stuck again, I can't come up with any other methods other than dropping to something like photorec which will give me an absolute mess with how large the filesystem was.
I'm willing to re-copy the image from the original array and start over, if I can get any suggestions or ideas on a way to get more of my files back.
I wish I could give more detailed logs of the commands that have run, but the output is long scrolled passed except for what gets logged to syslog and my memory is not as detailed due to the timeframe this has occurred over.
Any help is greatly appreciated!