We have a VMware vSphere 5 environment running CentOS 5.8 virtual machines. In the past two weeks we have had five incidents in which a virtual machine's filesystem became corrupt and required an fsck to repair.
Here is what we see in the logs:
Nov 14 14:39:28 hostname kernel: EXT3-fs error (device dm-2): htree_dirblock_to_tree: bad entry in directory #2392098: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Nov 14 14:39:28 hostname kernel: Aborting journal on device dm-2.
Nov 14 14:39:28 hostname kernel: __journal_remove_journal_head: freeing b_committed_data
Nov 14 14:39:28 hostname last message repeated 4 times
Nov 14 14:39:28 hostname kernel: ext3_abort called.
Nov 14 14:39:28 hostname kernel: EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Nov 14 14:39:28 hostname kernel: Remounting filesystem read-only
Nov 14 14:39:28 hostname kernel: EXT3-fs error (device dm-2): htree_dirblock_to_tree: bad entry in directory #2392099: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Nov 14 14:31:17 hostname ntpd[3041]: synchronized to 194.238.48.2, stratum 2
Nov 14 15:00:40 hostname kernel: EXT3-fs error (device dm-2): htree_dirblock_to_tree: bad entry in directory #2162743: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Nov 14 15:13:17 hostname kernel: __journal_remove_journal_head: freeing b_committed_data
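To compare incidents across VMs, the EXT3-fs errors can be pulled out of the syslog and grouped by device. The following is only a rough sketch, assuming Python 3 on a workstation with a copy of the log (the stock Python on CentOS 5 is too old for it). The function name `summarize_ext3_errors` and the regular expressions are made up for illustration and are based solely on the line format shown above.

```python
import re
import sys
from collections import defaultdict

# Pattern derived from the log excerpt above, e.g.:
#   Nov 14 14:39:28 hostname kernel: EXT3-fs error (device dm-2): <function>: <detail>
EXT3_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"EXT3-fs error \(device (?P<device>[^)]+)\):\s+"
    r"(?P<func>\w+):\s+(?P<detail>.*)$"
)
# Directory inode number, present only in the "bad entry in directory" errors.
DIR_INODE = re.compile(r"bad entry in directory #(\d+)")


def summarize_ext3_errors(lines):
    """Group EXT3-fs error lines by device (hypothetical helper).

    Returns {device: [(timestamp, kernel_function, directory_inode_or_None), ...]}.
    Lines that are not EXT3-fs errors are ignored.
    """
    summary = defaultdict(list)
    for line in lines:
        match = EXT3_ERROR.match(line.strip())
        if not match:
            continue
        inode = DIR_INODE.search(match.group("detail"))
        summary[match.group("device")].append(
            (
                match.group("stamp"),
                match.group("func"),
                int(inode.group(1)) if inode else None,
            )
        )
    return dict(summary)


if __name__ == "__main__":
    # Usage: python summarize.py <path-to-copied-messages-file>
    with open(sys.argv[1]) as log:
        for device, events in summarize_ext3_errors(log).items():
            print(device, "-", len(events), "EXT3-fs errors")
            for stamp, func, inode in events:
                print("  ", stamp, func, inode if inode is not None else "")
```

Run against the excerpt above, this would list every error under dm-2 together with the affected directory inodes, which makes it easier to line the timestamps up against the rsync runs.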
The problem seems to happen while we are rsync'ing application data from another server. So far we have been unable to reproduce the problem or identify a root cause.
After a few servers had this problem, we assumed the template was at fault, so we scrapped all VMs cloned from it, destroyed the template, and built a new template from scratch, installed from a newly downloaded CentOS ISO.
We use HP EVA SANs for datastores, and moved from a 4400 to a 6300 after the first problem. Since the move and the rebuild of the virtual machines, we have seen the issue twice. On one VM, we shut down the server, removed two virtual CPUs, and booted it back up; the problem presented itself almost immediately. On the other VM, we rebooted it and the problem occurred half an hour later.
Any tips or pointers in the right direction would be appreciated.