Nagios DNX plugins
- by danneh3826
I'm toying with the idea of multiple Nagios instances setup to monitor our infrastructure. I've looked at all the various methods of distributed Nagios checks, and I think DNX comes out the closest. DNX handles failure of worker nodes, that's fine. What happens if the main DNX server fails though? Is there a way to replicate the server too? I'm using AWS EC2 primarily, so I can utilise Elastic Load Balancing for the web UI, but I need to be able to handle the AZ where the monitoring server is to fail over, and essentially for a second to pick up the checking load (active/passive, active/active, so long as it doesn't fail completely)
The other thing I'm trying to solve is an issue with routing. What I'd like is to have multiple nodes report a fault before Nagios confirms it as critical. Not the NRPE checks, as they're pretty self explanitory, but things more like check_ping. I often have routing issues out of AWS to certain datacenters, so Nagios can often report bad/no ping/timeout as a critical issue, even though the machine in question is working fine. Would it be possible to have a setup where a worker complains a service check is critical, and have a second worker node (positioned in another datacenter/AZ) also report the service as critical before the Nagios central server issues a critical alert?
I realise I might be asking a bit much (how far down the line do you go setting up failover systems before it starts to get ridiculous), however surely someone must have thought of this scenario when developing DNX?