I am running VMware Server 2.0.2 (Build 203138) on a dual-core Intel i5 machine with Ubuntu Server 10.04 LTS (kernel 2.6.32-22-server #33-Ubuntu SMP). The disk subsystem is a software RAID5 array.
The system has been set up for a little over a week. For the past 5 days I have been running at least 3 VMs (Linux and a variety of Windows OSes) with no issues whatsoever. But while I was installing Linux onto a new VM, all VMs suddenly became unresponsive, including the one I was installing to. I could not log in to the VMware Management Interface, and the host itself was somewhat unresponsive via SSH. When I looked at top, I saw:
top - 16:14:51 up 6 days, 1:49, 8 users, load average: 24.29, 24.33, 17.54
Tasks: 203 total, 7 running, 195 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.2%us, 25.6%sy, 0.0%ni, 74.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8056656k total, 5927580k used, 2129076k free, 20320k buffers
Swap: 7811064k total, 240216k used, 7570848k free, 5045884k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21549 root 39 19 0 0 0 Z 100 0.0 15:02.44 [vmware-vmx] <defunct>
2115 root 20 0 0 0 0 S 1 0.0 170:32.08 [vmware-rtc]
2231 root 21 1 1494m 126m 100m S 1 1.6 892:58.05 /usr/lib/vmware/bin/vmware-vmx -# product=2;
2280 jnet 20 0 19320 1164 800 R 0 0.0 30:04.55 top
12236 root 20 0 833m 41m 34m S 0 0.5 88:34.24 /usr/lib/vmware/bin/vmware-vmx -# product=2;
1 root 20 0 23704 1476 920 S 0 0.0 0:00.80 /sbin/init
2 root 20 0 0 0 0 S 0 0.0 0:00.01 [kthreadd]
3 root RT 0 0 0 0 S 0 0.0 0:00.00 [migration/0]
4 root 20 0 0 0 0 S 0 0.0 0:00.84 [ksoftirqd/0]
5 root RT 0 0 0 0 S 0 0.0 0:00.00 [watchdog/0]
6 root RT 0 0 0 0 S 0 0.0 0:00.00 [migration/1]
The VMware process for the virtual machine I was installing into became a zombie. Yet it was still consuming 100% of the CPU time on one of the cores, and I couldn't reach it or any of the other virtual machines. (I was logged in to one virtual machine over SSH, another via X11, and a third via VNC. All three connections died.) When I ran ps -ef and similar commands, I found that the defunct vmware-vmx process had its parent PID set to init (1). I also ran lsof -p 21549 and found that the defunct process had no open files. Yet it was using 100% of CPU time...
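For anyone who wants to repeat the ps check above without relying on ps itself, here is a minimal sketch that reads the same information (state and parent PID) straight from /proc. It assumes Linux procfs with the /proc/<pid>/stat field order documented in proc(5). The function names parse_stat and find_stuck are my own and purely illustrative, and I did not run this on the affected box.

```python
import os


def parse_stat(stat_line):
    """Parse one /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    fields = stat_line[right + 1:].split()
    # Per proc(5), the two fields after comm are state and ppid.
    state, ppid = fields[0], int(fields[1])
    return pid, comm, state, ppid


def find_stuck(proc_root="/proc"):
    """Return processes in zombie (Z) or uninterruptible sleep (D) state."""
    stuck = []
    if not os.path.isdir(proc_root):
        return stuck  # no procfs here (not Linux, or not mounted)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as fh:
                info = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were scanning
        if info[2] in ("Z", "D"):
            stuck.append(info)
    return stuck


if __name__ == "__main__":
    for pid, comm, state, ppid in find_stuck():
        print(pid, comm, state, ppid)
```

In my case I would expect this to list PID 21549 with state Z and parent PID 1, matching what ps -ef showed. I include D (uninterruptible sleep) in the filter because processes in that state also ignore kill -9.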
I was unable to kill any of the vmware-vmx processes, including the defunct one, even with kill -9. As a last resort I tried to reboot the box, but shutdown, halt, reboot, and init 6 all failed to restart or shut down the system, even when given the appropriate --force options. Ctrl+Alt+Del produced a message about rebooting on the console, but the system would not reboot. I had to hard power-cycle the box to resolve the situation. (See my other question, Should I worry about the integrity of my linux software RAID5 after a crash or kernel panic?)
What would cause a scenario like this? What else could I have done to resolve it besides a hard reboot? What can I do to prevent such a situation in the future?