Hi,
I have a KVM system on which I'm running a network bridge directly between all the VMs and a bond0 (eth0, eth1) on the host OS. As such, all machines sit on the same subnet and are reachable from outside the box. The bond is running mode 1 (active/passive) with an arp_ip_target set to the default gateway, which has caused some issues in itself, but I can't see the bond config mattering here myself.
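For reference, the setup boils down to roughly the following (file locations, the arp_interval value and the netmask are illustrative rather than copied from my configs, and the host address is a placeholder):

# bonding: mode 1 (active-backup), ARP-monitoring the default gateway
# e.g. /etc/modprobe.d/bonding.conf
options bonding mode=active-backup arp_interval=1000 arp_ip_target=10.20.11.254

# enslave the NICs and hang the bridge off bond0; the host's IP lives on brdev
ifconfig bond0 up
ifenslave bond0 eth0 eth1
brctl addbr brdev
brctl addif brdev bond0
ifconfig brdev 10.20.11.x netmask 255.255.255.0 up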
Most times I stop and start a guest on the platform I see something odd: the host loses network connectivity (ICMP, SSH) for about 30 seconds. The already-running VMs don't lose connectivity, though; they can ping the default GW throughout, while the host can't. I say "about 30 seconds", but from some tests it's usually 28 seconds (or at least, I lose 28 pings...), and I'm wondering if that somehow relates to the bridge config.
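The test itself is nothing fancier than pinging the gateway from the host while starting the guest from another terminal, roughly as below (the -D flag just adds timestamps so the gap is easy to line up against the logs; the guest name is a placeholder):

ping -D 10.20.11.254 | tee /tmp/ping-gw.log
# ...and in a second terminal:
virsh start guestname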
I'm not running STP on the bridge at all, the forward delay is set to 1 second, the path cost on bond0 is lowered to 10, and the port priority of bond0 is also lowered to 1. As such I don't see how the bridge could ever decide that bond0 isn't connected just fine (as the continued guest connectivity implies it is), yet the IP of the host, which is on the bridge device (...could that matter??), becomes unreachable.
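Concretely, that bridge tuning is the equivalent of the following brctl calls (the exact invocation in my init scripts may differ slightly):

brctl stp brdev off                 # no STP
brctl setfd brdev 1                 # forward delay of 1 second
brctl setpathcost brdev bond0 10    # lower path cost on the bond port
brctl setportprio brdev bond0 1     # lower port priority on the bond port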
I'm fairly sure it's down to the bridged networking, but since this happens while a VM is being started there are clearly loads of other things going on at the same time, so maybe I'm way off the mark.
Lack of connectivity (the kernel messages in the middle are from the guest being started):
# ping 10.20.11.254
PING 10.20.11.254 (10.20.11.254) 56(84) bytes of data.
64 bytes from 10.20.11.254: icmp_seq=1 ttl=255 time=0.921 ms
64 bytes from 10.20.11.254: icmp_seq=2 ttl=255 time=0.541 ms
type=1700 audit(1293462808.589:325): dev=vnet6 prom=256 old_prom=0 auid=4294967295 ses=4294967295
type=1700 audit(1293462808.604:326): dev=vnet7 prom=256 old_prom=0 auid=4294967295 ses=4294967295
type=1700 audit(1293462808.618:327): dev=vnet8 prom=256 old_prom=0 auid=4294967295 ses=4294967295
kvm: 14116: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x130079
kvm: 14116: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xffdd694a
kvm: 14116: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x530079
64 bytes from 10.20.11.254: icmp_seq=30 ttl=255 time=0.514 ms
64 bytes from 10.20.11.254: icmp_seq=31 ttl=255 time=0.551 ms
64 bytes from 10.20.11.254: icmp_seq=32 ttl=255 time=0.437 ms
64 bytes from 10.20.11.254: icmp_seq=33 ttl=255 time=0.392 ms
brctl output for the relevant bridge:
# brctl showstp brdev
brdev
 bridge id              8000.b2e1378d1396
 designated root        8000.b2e1378d1396
 root port                 0                  path cost                  0
 max age                19.99                 bridge max age         19.99
 hello time              1.99                 bridge hello time       1.99
 forward delay           0.99                 bridge forward delay    0.99
 ageing time           299.95
 hello timer             0.50                 tcn timer               0.00
 topology change timer   0.00                 gc timer                0.04
 flags

vnet5 (3)
 port id                8003                  state                   forwarding
 designated root        8000.b2e1378d1396     path cost                100
 designated bridge      8000.b2e1378d1396     message age timer       0.00
 designated port        8003                  forward delay timer     0.00
 designated cost        0                     hold timer              0.00
 flags

vnet0 (2)
 port id                8002                  state                   forwarding
 designated root        8000.b2e1378d1396     path cost                100
 designated bridge      8000.b2e1378d1396     message age timer       0.00
 designated port        8002                  forward delay timer     0.00
 designated cost        0                     hold timer              0.00
 flags

bond0 (1)
 port id                0001                  state                   forwarding
 designated root        8000.b2e1378d1396     path cost                 10
 designated bridge      8000.b2e1378d1396     message age timer       0.00
 designated port        0001                  forward delay timer     0.00
 designated cost        0                     hold timer              0.00
 flags
When polling the brctl output in a loop I do see the new port listed as learning, but only for a second or two, in line with the forward delay.
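(For completeness, the polling was just a shell loop along these lines, once a second:)

while true; do
    date +%T
    brctl showstp brdev | grep -A1 '^vnet'
    sleep 1
done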
All pointers, tips or stabs in the dark appreciated.