Apologies in advance for the lengthy question.
We have a Dell PowerEdge R720 server with:
2 x 136GB SAS drives in RAID 1 for the OS (Ubuntu Server 12.04)
6 x 3TB SATA drives in RAID 5 for data
A few days ago we started getting errors when trying to access files on the large RAID 5 partition. We rebooted the server and got a message saying the RAID controller had found a foreign config. We've had this before, and just needed to use Dell's RAID configuration utility to import the foreign config. Last time this worked, but this time it started a disk check and then we got this:
FSCK has returned the following:
"/dev/sdb1 inode 364738 has a bad extended attribute block 7
/dev/sdb1 unexpected inconsistency run fsck manually (i.e without -a or -p options)
MOUNTALL fsck /ourdatapartition [1019] terminated with status 4
MOUNTALL filesystem has errors /ourdatapartition
errors where found while checking the disk drive for /ourdatapartition
Press F to fix errors, I to Ignore or M for Manual Recovery"
We pressed F to try and fix the errors, but it eventually errored with:
Inode 275841084, i_blocks is 167080, should be 0. Fix? yes
Inode 275841141 has an invalid extend node (blk 2206761006, lblk 0)
Clear? yes
Inode 275841141, i_blocks is 227872, should be 0. Fix? yes
Inode 275842303 has an invalid extend node (blk 2206760975, lblk 0)
Clear? yes
....
Error storing directory block information (inode=275906766, block=0, num=2699516178): Memory allocation failed
/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****
e2fsck: aborted
/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****
mountall: fsck /ourdatapartition [1286] terminated with status 9
mountall: Unrecoverable fsck error: /ourdatapartition
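For reference, the two exit statuses in the output above (4 and 9) can be decoded, since fsck reports its exit status as a sum of bit flags documented in the fsck(8) man page. The sketch below is only an illustration of that decoding; the function name `decode_fsck_status` is made up for this example.

```python
# Exit-status bits as documented in the fsck(8) man page.
FSCK_STATUS_BITS = {
    1: "file system errors corrected",
    2: "system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Return the meaning of every bit set in an fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [
        meaning
        for bit, meaning in sorted(FSCK_STATUS_BITS.items())
        if status & bit
    ]


# The two statuses reported by mountall in the output above.
for status in (4, 9):
    print(status, "->", "; ".join(decode_fsck_status(status)))
```

Read this way, status 4 from the first run means errors were left uncorrected, and status 9 (8 + 1) from the second run means some errors were corrected before an operational error stopped the check, which is consistent with the "Memory allocation failed" message.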
We noticed one of the drive lights was not lit at all, and thought that drive may have failed and be the problem. We replaced it with a spare and pressed "F" to repair again, but we keep getting the same error as above.
In the RAID configuration utility, all drives show as "online" and "optimal".
We do have this data on another replicated server, so we're not worried about "recovering" anything, we just want to get the system back online asap.
The server has either 32GB or 64GB of memory (I can't remember which off the top of my head), but either way, with a 14TB RAID, I think it may still not be enough.
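As a sanity check on the 14TB figure, here is a rough sketch of the RAID 5 arithmetic. It assumes each drive holds exactly 3 x 10^12 bytes (the vendor's decimal "3TB") and ignores any controller or filesystem overhead, so the real numbers will differ slightly; the function name is just for illustration.

```python
def raid5_usable_bytes(drive_count, drive_bytes):
    """RAID 5 spends one drive's worth of space on parity, leaving (n - 1) drives usable."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_bytes


TB = 10**12   # decimal terabyte, as drive vendors count
TIB = 2**40   # binary tebibyte, as many OS tools report

# Assumption: six drives of exactly 3 decimal terabytes each.
usable = raid5_usable_bytes(6, 3 * TB)
print(f"{usable / TB:.1f} TB = {usable / TIB:.2f} TiB")  # 15.0 TB = 13.64 TiB
```

So six 3TB drives give about 15 TB decimal, or roughly 13.6 TiB, which rounds to the 14TB mentioned above.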
Thanks
EDIT - As suggested, I checked the memory usage while fsck was running. After 2 or 3 minutes it looked like this, using nearly all of our server's memory:
When it failed after 5 minutes or so with the error in my post, the memory immediately freed up again:
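In case it helps anyone reproduce the memory observation, below is a minimal sketch of one way to sample a process's resident memory on Linux by reading `/proc/<pid>/status`. This is not necessarily how the numbers above were gathered; the helper names are hypothetical, and the PID of the running e2fsck would have to be looked up separately.

```python
import re


def parse_vmrss_kib(status_text):
    """Extract the VmRSS value (in kiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Return the resident set size of a running process, or None if it has exited."""
    try:
        with open(f"/proc/{pid}/status") as handle:
            return parse_vmrss_kib(handle.read())
    except (FileNotFoundError, ProcessLookupError):
        return None
```

Calling `read_rss_kib(pid)` every few seconds while the check runs would show whether memory grows steadily until the "Memory allocation failed" abort.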