We've got a CentOS server running a cluster of virtuals. Occasionally the cluster's internal network drops out for a minute or so ... and then comes back. The problem is somehow related to the actual network traffic, but it is not a simple load issue. (The system is generally lightly loaded, and the problem occurs irrespective of actual load.)
The setup:
CentOS 5.6 on Dom0, various CentOS on the DomU's
Hardware - a Dell R710 with a Broadcom NetXtreme II NIC (sigh)
using the latest drivers for the NIC from Broadcom
Xen configured to use network-bridge and vif-bridge
Some iptables tweaks to route an unrelated port to one of the virtuals.
The system has one externally visible IP address, and Dom0 runs an Apache httpd configured with a number of virtual hosts each of which reverse proxies to web servers running on the virtuals. (The virtuals have to be NAT'ed, primarily because we don't have enough allocated public IP addresses.)
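To make the iptables port forward in the setup list concrete, a rule pair of roughly this shape would do that kind of forward. This is only a sketch: the interface `eth0`, the port `2222` and the guest address `192.168.122.10` are placeholders invented for illustration, not the values on the actual box.

```shell
# Hypothetical values for illustration only; none of these come from the real config.
EXT_IF=eth0               # externally facing interface on Dom0 (assumed)
EXT_PORT=2222             # the "unrelated port" (placeholder)
GUEST_IP=192.168.122.10   # internal address of the target DomU (placeholder)
GUEST_PORT=22             # port the DomU listens on (placeholder)

# Print the rules instead of applying them, so they can be reviewed first.
forward_rules() {
    # Rewrite the destination of incoming connections to point at the DomU.
    echo "iptables -t nat -A PREROUTING -i $EXT_IF -p tcp --dport $EXT_PORT -j DNAT --to-destination $GUEST_IP:$GUEST_PORT"
    # Let the rewritten traffic through the FORWARD chain.
    echo "iptables -A FORWARD -p tcp -d $GUEST_IP --dport $GUEST_PORT -j ACCEPT"
}

forward_rules
```

Piping the output to `sh` as root would apply the rules; printing them first keeps the sketch harmless to run.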
The symptoms:
Works fine most of the time.
When someone tries to UPLOAD a large file to one of the virtuals, the internal network drops out ... for all virtuals:
The Dom0 httpd sees a network timeout talking to the backend server on the virtual and reports a 502.
A previously established ssh connection from Dom0 to any of the DomU's freezes.
Our monitoring shows ping failures for traffic between virtuals.
The Xen consoles to the DomU's do not freeze.
No log messages in any log files that I can see, on either Dom0 or the DomU's ... apart from the Dom0 httpd logs.
After a minute or so, the problem clears by itself.
This is 100% reproducible.
What we've tried:
Downloading, building and installing the latest bnx2 driver on Dom0
Turning off MSI on the NIC - adding "options bnx2 disable_msi=1" to /etc/modprobe.conf
Turning off TCP segmentation offload - "ethtool -K eth0 tso off".
Sacrificing a black rooster at midnight.
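One way to confirm that the TSO and MSI changes above actually took effect is sketched below. The helper name `tso_state` is made up for this example, and `eth0` is assumed to be the physical interface, as in the commands above.

```shell
# Extract the TSO state ("on" or "off") from `ethtool -k` output on stdin.
# Older ethtool prints "tcp segmentation offload: off", newer versions print
# "tcp-segmentation-offload: off"; the pattern accepts both spellings.
tso_state() {
    grep -i 'tcp.segmentation.offload' \
        | head -n 1 \
        | awk -F': *' '{ split($2, a, " "); print a[1] }'
}

# Usage on the real box (run as root):
#   ethtool -k eth0 | tso_state     # should print "off" after the change
#   grep -i msi /proc/interrupts    # PCI-MSI lines for eth0 mean MSI is still in use
```

The `split` keeps only the first word after the colon, so a trailing annotation such as `[fixed]` in newer ethtool output does not end up in the result.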
I've exhausted all my options apart from switching to KVM ... or slaughtering more roosters.
Any suggestions?