Server hang - data loss on reboot, post mortem analysis

Posted by rovangju on Server Fault
Published on 2011-06-20T19:44:08Z Indexed on 2011/06/25 0:24 UTC

A development server I'm responsible for (ext3 on RAID 5, running Debian Squeeze) froze over the weekend and I was forced to reset it. It was completely unresponsive: no reaction to KVM or physical keyboard input, no eth devices responding, etc. Not even the backup process ran (figures, the one time I don't check for confirmation).

After the reset, it turns out that every trace of disk I/O that should have happened over a period of ~24 hours is completely gone. The log files have a big gap in their dates and times. It is as if the writes were never committed to disk; no processes seem to have run.
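To pin down exactly where the gap in a log starts and ends, something along these lines could help. It is a minimal sketch, not a definitive tool: it assumes the traditional syslog line format (`Mon DD HH:MM:SS host ...`), which carries no year, so the year has to be supplied by hand. The function names `parse_syslog_time` and `find_gaps` and the one-hour threshold are illustrative choices, not anything from the server itself.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a traditional syslog line.

    Returns None for lines that do not start with such a stamp.
    Assumption: the year is not in the log, so the caller provides it.
    """
    stamp = " ".join(line.split()[:3])
    try:
        return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(hours=1)):
    """Return (last_seen, next_seen) pairs where consecutive entries
    are further apart than the threshold."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_syslog_time(line, year)
        if current is None:
            continue  # skip lines without a recognizable timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Passing an open log file (for example `/var/log/syslog` on a default Debian install, which is an assumption about this machine) as `lines` with `year=2011` would list each silent period. Comparing the start of the gap across several logs shows whether everything stopped being written at the same moment.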

Luckily it was a weekend, so nothing of value was lost, and I don't suspect a hack.

What can I do in a post-mortem of this event to prevent it from ever happening again? I've seen this happen before on a completely different machine running FreeBSD.

I am rounding up the disk-checking tools right now, but there must be more going on!

  • Mount options: /dev/sda1 on / type ext3 (rw,errors=remount-ro)
  • Kernel: Linux dev 2.6.32-5-686-bigmem
  • Disk/inode usage: 13%/3%

© Server Fault or respective owner