How to collect the performance data of a server during an unreachable/down period using Nagios?
- by gsc-frank
Some time services and host stop responding due to a poor server performance. I mean, if for some reason (could be lot of concurrency services access, a expensive backup execution on the server or whatever that consume tons of server resources) a server performance is very degraded, that could lead that the server isn't capable to establish any "normal network communication" (without trigger whatever standards timeouts defined for such communication).
Knowing host's performance data (cpu, memory, ...) in case of available during that period (host is not down and despite of its performance degradation still allow plugins collect performance data) could be very useful for sysadmin to try to determine what cause the problem, or at least, if the host performance was good and don't interfered at all in the host/service down.
This problem could be solved using remote active (NRPE) or remote passive (NSCA) if such remote solutions could store (buffered) perf data to be send to central Nagios server when host performance or network outage allow it. I read the doc of both solutions and can't find any reference to such buffer mechanism neither what happened in case that NSCA can't reach Nagios server.
Any idea of how solve this lack of info? so useful for forensic analysis.
EDIT:
My questions isn about which tools I can use to debug perf problems or gather perf data to analysis, but is about how collect (using Nagios) host perf data even during a network outage for its posterior analysis (kind of forensic analysis). The idea is integrate such data to Nagios graphers like pnp4nagios and NagiosGrapther. I know that I could install tools like Cacti in each of my host, and have a kind of performance data collection redundancy, but I really want avoid that and try to solve all perf analysis requirements with one tools: Nagios