Random and Selective ARP blindness in VMware ESXi 4.1
- by Peter Grace
We have multiple VMware ESX servers spread across our company, doing various tasks. One particular ESXi host is exhibiting very peculiar behavior. We detect it when our monitoring system (Orion) notifies us that it can no longer ping one of the guests on that host.
When we log in on the local console of the affected guest, we see that it cannot ping any address that is not already in its ARP table.
At first we thought the problem was tied to a single guest, DevRedis, since it always seemed to happen there. However, this afternoon the problem moved and started happening on ApacheBox instead of DevRedis.
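For anyone who wants to check the same symptom on a Linux guest, here is a minimal Python sketch that lists ARP entries the kernel has not resolved. It assumes the usual /proc/net/arp column layout (IP address, HW type, Flags, HW address, Mask, Device), where a Flags value without the 0x2 bit means the entry is incomplete. The function name is my own, not part of any tool.

```python
# Sketch: find unresolved entries in the text of /proc/net/arp.
# Assumption: standard six-column layout with a single header line.

def incomplete_arp_entries(proc_net_arp_text):
    """Return (ip, device) pairs whose ARP entry has not been resolved."""
    unresolved = []
    for line in proc_net_arp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or malformed lines
        ip, _hw_type, flags, _mac, _mask, device = fields[:6]
        # Bit 0x2 marks a completed entry; without it the lookup is pending.
        if int(flags, 16) & 0x2 == 0:
            unresolved.append((ip, device))
    return unresolved
```

On the guest, passing the contents of /proc/net/arp to this function while the problem is occurring should list the addresses that are stuck waiting for an ARP reply.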
When I have been fortunate enough to catch the problem in progress, I have run tcpdump on both sides of the connection (one side being the VMware guest, the other a physical web server) and have noticed the following sequence of events:
Guest ApacheBox sends an ARP request for the physical address of server WindowsBeast.
WindowsBeast sends an ARP is-at reply back to the network indicating its physical (MAC) address.
ApacheBox never sees the ARP is-at response.
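The comparison of the two captures can be sketched roughly as follows. This is a minimal Python example, assuming tcpdump was run with -n so that its text output contains lines of the form "ARP, Request who-has X tell Y, length N" and "ARP, Reply X is-at MAC, length N". The function name and the addresses in the usage example are hypothetical.

```python
import re

# Assumed tcpdump text format (numeric output, e.g. "tcpdump -n arp").
REQUEST = re.compile(r"ARP, Request who-has (\S+)(?: \(\S+\))? tell (\S+?),")
REPLY = re.compile(r"ARP, Reply (\S+) is-at ([0-9a-fA-F:]+)")

def unanswered_requests(capture_lines):
    """Return (target_ip, asker_ip) pairs that never saw a later is-at reply."""
    pending = []
    for line in capture_lines:
        request = REQUEST.search(line)
        if request:
            pair = (request.group(1), request.group(2))
            if pair not in pending:
                pending.append(pair)
            continue
        reply = REPLY.search(line)
        if reply:
            # A reply for this target clears every request waiting on it.
            pending = [p for p in pending if p[0] != reply.group(1)]
    return pending

# Hypothetical captures: the guest sees only its own request,
# while the physical server also sees the reply it sent.
guest_capture = [
    "12:00:00.000001 ARP, Request who-has 10.0.0.5 tell 10.0.0.9, length 28",
]
server_capture = guest_capture + [
    "12:00:00.000200 ARP, Reply 10.0.0.5 is-at 00:11:22:33:44:55, length 46",
]
print(unanswered_requests(guest_capture))
print(unanswered_requests(server_capture))
```

A request that stays unanswered in the guest's capture but is answered in the physical server's capture matches the behavior described above: the reply is sent on the wire but never reaches the guest.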
The ESXi host in question is running VMware ESXi 4.1.0, build 348481.
Both guests (DevRedis and ApacheBox) are running CentOS 6.3, but on two different kernel versions (2.6.32-279.9.1.el6.x86_64 and 2.6.32-279.el6.x86_64), so I'm not entirely sure it's a CentOS problem.
Does anyone have any thoughts on what might cause this? Has anyone run into it before?