Ubuntu 12 crashed and took down network
- by Leopd
We recently set up a new Ubuntu 12.04LTS server on our network. It's not fully configured so it's not doing much beyond sshd and a default apache2 install. But this evening it appears to have crashed. It wasn't responding to the network or the keyboard. But the worst part is, it took down the entire network.
My knowledge of the network stack below OSI layer 3 is very limited, so the rest confuses me. When this machine was physically connected to the network, no other machine could connect to the outside internet. When things were broken, running arp showed that our gateway's IP address (10.0.1.1) was listed as "invalid." Unplugging the server from the network fixed the problem, and plugging it back in broke it again. So the crashed server was advertising itself as owning the gateway's IP address?
There's nothing at all in syslog during the time when it was causing problems. Any ideas about how to figure out what went wrong or what we can do to prevent it from happening again? I'm hesitant to even put the machine back on the network right now.
Update **
It crashed again, and I ran tcpdump -penn arp (thanks bahamat!) for several minutes and got this... (timestamps and duplicate lines removed)
00:1e:65:f8:dc:24 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.1.1 tell 10.0.2.191, length 46
00:1e:65:f8:dc:24 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.1.44 tell 10.0.2.191, length 46
60:d8:19:d4:71:d6 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.1.1 tell 10.0.2.125, length 46
d4:9a:20:04:e9:78 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.1.1 tell 192.168.1.100, length 28
Update 2 **
When the network is functioning properly, arping -c4 10.0.1.1 returns this:
ARPING 10.0.1.1
60 bytes from c0:c1:c0:77:25:8e (10.0.1.1): index=0 time=267.982 usec
60 bytes from c0:c1:c0:77:25:8e (10.0.1.1): index=1 time=422.955 usec
60 bytes from c0:c1:c0:77:25:8e (10.0.1.1): index=2 time=299.215 usec
60 bytes from c0:c1:c0:77:25:8e (10.0.1.1): index=3 time=366.926 usec
--- 10.0.1.1 statistics ---
4 packets transmitted, 4 packets received, 0% unanswered (0 extra)
When the bad server is plugged in, arping -c4 10.0.1.1 returns:
ARPING 10.0.1.1
--- 10.0.1.1 statistics ---
4 packets transmitted, 0 packets received, 100% unanswered (0 extra)
Context **
10.0.x.x is the main subnet.
10.0.1.1 is the main internet gateway
10.0.1.44 is a printer
10.0.2.* devices are all laptops / workstations
I have no idea what's using the 192.168.x.x subnet -- your guesses are at least as good as mine. A VM on a workstation? A misconfigured WAP? Somebody re-sharing wifi? A machine that failed to DHCP?
The offending ubuntu server's MAC address ends in cd:80 so isn't listed in the dump. It should DHCP to 10.0.3.3
Thanks for any help. This ARP stuff is all voodoo to me. Packets just go to IP addresses, right? ;)