Server hang - data loss on reboot, post mortem analysis
- by rovangju
A development server I'm responsible for (ext3 on RAID 5, Debian Squeeze) froze over the weekend and I was forced to reset it. It was completely unresponsive: nothing from KVM or a physical keyboard, no eth devices responding, etc. Not even the backup process ran (figures, the one time I don't check for confirmation).
After the reset, it turns out that every trace of disk I/O that should have happened over a period of ~24 hours is completely gone. The log files have a big gap in their timestamps. It is as if the writes were never committed to disk; no processes seem to have run.
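To pin down exactly where the gap starts and ends, something like the rough sketch below could be run over a syslog-style file. It is only an illustration: the function name `find_gaps`, the one-hour threshold, and the year 2011 are my own assumptions (classic syslog timestamps such as `Mar  5 14:02:11` carry no year), and it assumes English month abbreviations.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(hours=1), year=2011):
    """Yield (previous, current) timestamp pairs more than min_gap apart.

    Assumes classic syslog lines starting with e.g. "Mar  5 14:02:11".
    The year is an assumption, since that format does not record one.
    """
    previous = None
    for line in lines:
        fields = line.split(None, 3)  # month, day, time, rest of line
        if len(fields) < 3:
            continue
        stamp = "{} {} {} {}".format(year, fields[0], fields[1], fields[2])
        try:
            current = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # line does not start with a timestamp, skip it
        if previous is not None and current - previous > min_gap:
            yield previous, current
        previous = current


def report(path):
    """Print every gap found in the log file at the given path."""
    with open(path, errors="replace") as log:
        for before, after in find_gaps(log):
            print("gap: {} -> {} ({})".format(before, after, after - before))
```

Calling `report()` on each log file of interest and comparing the start of the gap across files should show whether everything stopped being written at the same moment or whether some services kept logging longer than others.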
Luckily it was a weekend, so nothing of value was lost, and I don't suspect a hack.
What can I do post mortem to prevent this from ever happening again? I've seen this happen before on a completely different machine running FreeBSD.
I am rounding up the disk checking tools right now, but there must be more going on!
Mount options: /dev/sda1 on / type ext3 (rw,errors=remount-ro)
Kernel: Linux dev 2.6.32-5-686-bigmem
Disk/Inodes: 13%/3%