Server hang - data loss on reboot, post mortem analysis
Posted by rovangju on Server Fault
Published on 2011-06-20T19:44:08Z
A development server I'm responsible for (ext3 on RAID 5, running Debian Squeeze) froze up over the weekend and I was forced to reset it. It was completely unresponsive: nothing from the KVM or a physical keyboard, no eth devices responding, etc. Not even the backup process ran (figures, the one time I don't check for confirmation).
After the reset, it turns out that every trace of disk I/O activity that should have happened over a period of ~24 hours is completely gone. The log files have a big gap in their dates and times, as if the writes were never committed to disk. It looks as though no processes ran at all.
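To pin down exactly where the gap in the logs starts and ends, something like the following could help. It is only a minimal sketch: it assumes traditional syslog-style lines that begin with a timestamp such as `Jun 18 03:12:45`, and the function name `find_gaps`, the `year` argument (syslog timestamps carry no year) and the one-hour threshold are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year, min_gap=timedelta(hours=1)):
    """Return (previous, current) timestamp pairs that are more than min_gap apart."""
    gaps = []
    prev = None
    for line in lines:
        # Assumed format: the first 15 characters look like "Jun 18 03:12:45".
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev is not None and ts - prev > min_gap:
            gaps.append((prev, ts))
        prev = ts
    return gaps


if __name__ == "__main__":
    # Made-up sample lines, not taken from the affected server.
    sample = [
        "Jun 18 03:12:45 dev kernel: example message",
        "Jun 18 03:20:00 dev cron: example message",
        "Jun 19 04:00:01 dev kernel: example message",
    ]
    for start, end in find_gaps(sample, 2011):
        print(f"gap from {start} to {end} ({end - start})")
```

On a real system the `lines` argument could be an open log file, for example `open("/var/log/syslog")` on a default Debian install. Running it against several logs and comparing where each gap begins should show whether everything stopped writing at the same moment.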
Luckily it was a weekend, so nothing of value should have been lost, and I don't suspect a hack.
What can I do in a post mortem of this event to prevent it from ever happening again? I've seen this happen before on a completely different machine running FreeBSD.
I am rounding up the disk checking tools right now, but there must be more going on!
- Mount options:
/dev/sda1 on / type ext3 (rw,errors=remount-ro)
- Kernel:
Linux dev 2.6.32-5-686-bigmem
- Disk/Inodes:
13%/3%
© Server Fault or respective owner