Last week proved me a veritable Cassandra: I've always said that it's a bad idea to have only one firewall/router, without a backup or failover. And thus our Cisco PIX went haywire, refusing to route properly. And of course, the only one available here on short notice is me, and while I'm quite grounded in Linux, I'm really a developer not a sysadmin (the fact that this hit me on sysadmin appreciation day is a bit ironic).
Anyway, this weekend I tried to hack up a temporary solution: I used an old server with enough NICs (two built-in, four on a card) to serve as a gateway and firewall. Due to some problems with the raid controller, I got only two router distros running, and between Untangle and Ebox I decided for the latter.
Now everything is quite okay. I've got all the different subnets we've got here (all with separate switches) talking to each other and even to the internet (Cisco 2800 router, T1 lines). But from time to time (20-60 minute intervals), I get a total routing failure. Our main, office subnet can't talk to our server subnet and can't connect to the internet. This is not the end of a gradual slowdown, either everything's working perfectly or I get a total lack of communication for about two minutes each time.
Now I'm a bit at wits end what to check. At least with the default EBox setup, nothing in /var/log shows anything weird and it doesn't exactly have lots of built-in monitoring tools. So I'm hoping someone here could give me some pointers about what to look out for. I did change the ethernet cable from the office switch to the firewall, with no results. I might change switches, although within the switch it seems to work ok enough.
Edit: I'm not sure whether this is the sole cause of the problem, but after I noticed a few DHCP entries just before the last drop of connectivity, I tried to reproduce that. And alas, whenever I renew a DHCP connection, I can't access other subnets anymore. Running ISC DHCPD 3.0.6.