I am looking for suggestions on doing some simple monitoring of an ASP.Net web farm as close to real-time as possible. The objectives of this question are to:
Identify the best way to monitor several Windows Server production boxes during short (minutes long) period of ridiculous load
Receive near-real-time feedback on a few key metrics about each box. These are simple metrics available via WMI such as CPU, Memory and Disk Paging. I am defining my time constraints as soon as possible with 120 seconds delayed being the absolute upper limit.
Monitor whether any given box is up (with "up" being defined as responding web requests in a reasonable amount of time)
Here are more details, things I've tried, etc.
I am not interested in logging. We have logging solutions in place.
I have looked at solutions such as ELMAH which don't provide much in the way of hardware monitoring and are not visible across an entire web farm.
ASP.Net Health Monitoring is too broad, focuses too much on logging and is not acceptable for deep analysis.
We are on Amazon Web Services and we have looked into CloudWatch. It looks great but messages in the forum indicate that the metrics are often a few minutes behind, with one thread citing 2 minutes as the absolute soonest you could expect to receive the feedback. This would be good to have for later analysis but does not help us real-time
Stuff like JetBrains profiler is good for testing but again, not helpful during real-time monitoring.
The closest out-of-box solution I've seen is Nagios which is free and appears to measure key indicators on any kind of box, including Windows. However, it appears to require a Linux box to run itself on and a good deal of manual configuration. I'd prefer to not spend my time mining config files and then be up a creek when it fails in production since Linux is not my main (or even secondary) environment.
Are there any out-of-box solutions that I am missing? Obviously a windows-based solution that is easy to setup is ideal. I don't require many bells and whistles.
In the absence of an out-of-box solution, it seems easy for me to write something simple to handle what I need. I've been thinking a simple client-server setup where the server requests a few WMI metrics from each client over http and sticks them in a database. We could then monitor the metrics via a query or a dashboard or something. If the client doesn't respond, it's effectively down.
Any problems with this, best practices, or other ideas?
Thanks for any help/feedback.