Performance data collection for short-running, ephemeral servers
- by ErikA
We're building a medical image processing software stack, currently hosted on various AWS resources. As part of this application, we have a handful of long-running servers (database, load balancers, web application, etc.). Collecting performance data on those servers is quite simple - my go-to- recipe of Nagios (for monitoring/notifications) and Munin (for collection of performance data and displaying trends) will work just fine.
However - as part of this application, we are constantly starting up and terminating compute instances on EC2. In typical usage, these compute instances start up, configure themselves, receive a job from a message queue, and then get to work processing that job, which takes anywhere from 15 minutes to over 8 hours. After job completion, these instances get terminated, never to be heard from again.
What is a decent strategy for collecting performance data on these short-lived instances?
I don't necessarily need monitoring on them - if they fail for whatever reason, our application will detect this and handle re-starting the job on another instance or raising the flag so an administrator can take a look at things. However, it still would be useful to collect information like CPU (user, idle, iowait, etc.), memory usage, network traffic, disk read/write data, etc. In our internal database, we track the instance ID of the machine that runs each job, and it would be quite helpful to be able to look up performance data for a specific instance ID for troubleshooting and profiling.
Munin doesn't seem like a great candidate, as it requires maintaining a list of munin nodes in a text file - far from ideal for an environment with a high amount of churn, and for the short amount of time each node will be running, I'd rather keep the full-resolution data indefinitely than have RRD water down the data over time.
In the end, my guess is that this will require a monitoring engine that:
uses a database (MySQL, SQLite, etc.) for configuration and data storage
exposes an API for adding/removing hosts and services
Are there other things I should be thinking about when evaluating options?
Perhaps I'm over-thinking this, though, and just ought to run sar at 1-minute intervals on these short-lived instances and collect the sar db files prior to termination.