Neighbour table overflow on Linux hosts related to bridging and ipv6
- by tim
Note: I already have a workaround for this problem (as described below) so this is only a "want-to-know" question.
I have a productive setup with around 50 hosts including blades running xen 4 and equallogics providing iscsi. All xen dom0s are almost plain Debian 5. The setup includes several bridges on every dom0 to support xen bridged networking. In total there are between 5 and 12 bridges on each dom0 servicing one vlan each. None of the hosts has routing enabled.
At one point in time we moved one of the machines to a new hardware including a raid controller and so we installed an upstream 3.0.22/x86_64 kernel with xen patches. All other machines run debian xen-dom0-kernel.
Since then we noticed on all hosts in the setup the following errors every ~2 minutes:
[55888.881994] __ratelimit: 908 callbacks suppressed
[55888.882221] Neighbour table overflow.
[55888.882476] Neighbour table overflow.
[55888.882732] Neighbour table overflow.
[55888.883050] Neighbour table overflow.
[55888.883307] Neighbour table overflow.
[55888.883562] Neighbour table overflow.
[55888.883859] Neighbour table overflow.
[55888.884118] Neighbour table overflow.
[55888.884373] Neighbour table overflow.
[55888.884666] Neighbour table overflow.
The arp table (arp -n) never showed more than around 20 entries on every machine. We tried the obvious tweaks and raised the
/proc/sys/net/ipv4/neigh/default/gc_thresh*
values. FInally to 16384 entries but no effect. Not even the interval of ~2 minutes changed which lead me to the conclusion that this is totally unrelated. tcpdump showed no uncommon ipv4 traffic on any interface. The only interesting finding from tcpdump were ipv6 packets bursting in like:
14:33:13.137668 IP6 fe80::216:3eff:fe1d:9d01 > ff02::1:ff1d:9d01: HBH ICMP6, multicast listener reportmax resp delay: 0 addr: ff02::1:ff1d:9d01, length 24
14:33:13.138061 IP6 fe80::216:3eff:fe1d:a8c1 > ff02::1:ff1d:a8c1: HBH ICMP6, multicast listener reportmax resp delay: 0 addr: ff02::1:ff1d:a8c1, length 24
14:33:13.138619 IP6 fe80::216:3eff:fe1d:bf81 > ff02::1:ff1d:bf81: HBH ICMP6, multicast listener reportmax resp delay: 0 addr: ff02::1:ff1d:bf81, length 24
14:33:13.138974 IP6 fe80::216:3eff:fe1d:eb41 > ff02::1:ff1d:eb41: HBH ICMP6, multicast listener reportmax resp delay: 0 addr: ff02::1:ff1d:eb41, length 24
which placed the idea in my mind that the problem maybe related to ipv6, since we have no ipv6 services in this setup.
The only other hint was the coincidence of the host upgrade with the beginning of the problems. I powered down the host in question and the errors were gone. Then I subsequently took down the bridges on the host and when i took down (ifconfig down) one particularly bridge:
br-vlan2159 Link encap:Ethernet HWaddr 00:26:b9:fb:16:2c
inet6 addr: fe80::226:b9ff:fefb:162c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:120 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:5286 (5.1 KiB) TX bytes:726 (726.0 B)
eth0.2159 Link encap:Ethernet HWaddr 00:26:b9:fb:16:2c
inet6 addr: fe80::226:b9ff:fefb:162c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1801 errors:0 dropped:0 overruns:0 frame:0
TX packets:20 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:126228 (123.2 KiB) TX bytes:1464 (1.4 KiB)
bridge name bridge id STP enabled interfaces
...
br-vlan2158 8000.0026b9fb162c no eth0.2158
br-vlan2159 8000.0026b9fb162c no eth0.2159
The errors went away again. As you can see the bridge holds no ipv4 address and it's only member is eth0.2159 so no traffic should cross it. Bridge and interface .2159 / .2157 / .2158 which are in all aspects identical apart from the vlan they are connected to had no effect when taken down.
Now I disabled ipv6 on the entire host via sysctl net.ipv6.conf.all.disable_ipv6 and rebooted. After this even with bridge br-vlan2159 enabled no errors occur.
Any ideas are welcome.