Can someone explain the "use-cases" for the default munin graphs?

Posted by exhuma on Server Fault
Published on 2011-11-30T09:20:29Z Indexed on 2011/11/30 10:02 UTC

When installing munin, it activates a default set of plugins (at least on Ubuntu). Alternatively, you can simply run munin-node-configure to figure out which plugins are supported on your system. Most of these plugins plot straightforward data. My question is not so much about the nature of the data (well... maybe for some) but about what you look for in these graphs.

It is easy to install munin and see fancy graphs. But having the graphs and not being able to "read" them renders them totally useless.

I am going to list the standard plugins which are enabled by default on my system, so it's going to be a long list. For completeness, I am also going to list plugins which I think I understand, with a short explanation of what I think each one is used for. Please correct me if I am wrong about any of them.

So let me split this question into three parts:

Plugins where I don't even understand the data
Plugins where I understand the data but don't know what I should look out for
Plugins which I think I understand

Plugins where I don't even understand the data

These may contain questions that are not necessarily aimed at munin alone. Not understanding the data usually means a gap in fundamental knowledge of operating systems/hardware... ;) Feel free to respond with a "giyf" answer.

These are plugins where I can only guess what's going on, and I would rather not read these graphs by guessing.

Disk IOs per device (IOs/second)
What's an IO? I know it stands for input/output, but that's as far as it goes.
Disk latency per device (Average IO wait)
Not a clue what an "IO wait" is...
IO Service Time
This one is a huge mess, and it's nearly impossible to see anything in the graph at all.
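
For reference, here is a minimal sketch of where numbers like these could come from on Linux, assuming the per-device counters in /proc/diskstats (completed reads/writes and milliseconds spent on them). The function names are hypothetical, and this is not necessarily how the munin plugins compute their values. It takes two samples of the file and turns the difference into IOs per second and an average time per IO.

```python
def parse_diskstats(text):
    """Return {device: (ios_completed, ms_spent_on_io)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip blank or unexpected lines
        name = fields[2]
        reads, ms_reading = int(fields[3]), int(fields[6])
        writes, ms_writing = int(fields[7]), int(fields[10])
        stats[name] = (reads + writes, ms_reading + ms_writing)
    return stats


def io_rates(before, after, interval_seconds):
    """Compare two samples: {device: (ios_per_second, avg_ms_per_io)}."""
    rates = {}
    for name, (ios_after, ms_after) in after.items():
        if name not in before:
            continue  # device appeared between the two samples
        ios_before, ms_before = before[name]
        ios = ios_after - ios_before
        ms = ms_after - ms_before
        avg_ms = ms / ios if ios else 0.0
        rates[name] = (ios / interval_seconds, avg_ms)
    return rates
```

To try it, read /proc/diskstats twice a few seconds apart, pass both texts through parse_diskstats, and hand the results to io_rates together with the interval.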

Plugins where I understand the data but don't know what I should look out for

IOStat (blocks/second read/written)
I assume the things to look out for here are spikes, which would mean that the device is in heavy use?
Available entropy (bytes)
I assume that this is important for random number generation? Why would I graph this? So far the value has always been near constant.
VMStat (running/I/O sleep processes)
What's the difference between this one and the "processes" graph? Both show running/sleeping processes, whereas the "Processes" graph seems to have more details.
Disk throughput per device (bytes/second read/written)
What's the difference between this one and the "IOStat" graph?
inode table usage
What should I look for in this graph?
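
To make the raw data behind two of these graphs concrete, here is a small sketch, assuming they are based on the Linux files /proc/sys/kernel/random/entropy_avail (a single integer) and /proc/sys/fs/inode-nr (allocated and free inode counts). That mapping is my assumption, and the function names are made up for illustration.

```python
def parse_entropy_avail(text):
    """Parse the single integer found in /proc/sys/kernel/random/entropy_avail."""
    return int(text.strip())


def parse_inode_nr(text):
    """Parse /proc/sys/fs/inode-nr, which holds '<allocated> <free>'."""
    allocated, free = (int(value) for value in text.split()[:2])
    return {"allocated": allocated, "free": free, "in_use": allocated - free}
```

Both functions take the file contents as a string, so they can be checked against a saved copy of the files as well as the live system.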

Plugins which I think I understand

I'll be guessing some things here... correct me if I am wrong.

Disk usage in percent (percent)
How much disk space is used/remaining. As this approaches 100%, you should consider cleaning up or extending the partition. This is extremely important for the root partition.
Firewall Throughput (packets/second)
The number of packets passing through the firewall. If this is spiking for a longer period of time, it could be a sign of a DoS attack (or we are simply receiving a large file). It can also give you an idea about your firewall performance. If it's levelling out and you need more "power", you should consider load balancing. If it's levelling out and you see a correlation with your CPU load, it could also mean that your hardware is not fast enough. Correlations with disk usage could point to excessive LOG targets in your FW config.
eth0 errors (packets in/out)
Network errors. If this value is increasing, it could be a sign of faulty hardware.
eth0 traffic (bits/second in/out)
Raw network traffic. This should correlate with Firewall throughput.
number of threads
An ever-increasing value might point to a process not properly closing threads. Investigate!
processes
Breakdown of active processes (including sleeping). A quick spike in here might point to a fork-bomb. A slowly but steadily increasing value might point to an application spawning sub-processes but not properly closing them. Investigate using ps faux.
process priority
This shows the distribution of process priorities. Having only high-priority processes is not of much use. Consider de-prioritizing some.
cpu usage
Fairly straightforward. If this is spiking, you may have an attack going on, or a process is hogging the CPU. If it's slowly increasing and approaching max in normal operations, you should consider upgrading your hardware (or load-balancing).
file table usage
Number of actively open files. If this is reaching max, you may have a process opening, but not properly releasing files.
load average
Shows a summarized value for the system load. Should correlate with CPU usage. Increasing values can come from a number of sources. Look for correlations with other graphs.
memory usage
A graphical representation of your memory. As long as you have a lot of unused+cache+buffers you are fine.
swap in/out
Shows the activity on your swap partition. This should always be 0. If you see activity on this, you should add more memory to your machine!
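
As a cross-check for the last two graphs, here is a minimal sketch, assuming the Linux files /proc/meminfo (for unused+cache+buffers) and /proc/vmstat (for the pswpin/pswpout swap counters). The function names are hypothetical, and the munin plugins may well compute their values differently.

```python
def parse_key_value(text):
    """Parse 'Key: 123 kB' or 'key 123' lines into {key: int}."""
    values = {}
    for line in text.splitlines():
        parts = line.replace(":", " ").split()
        if len(parts) >= 2 and parts[1].isdigit():
            values[parts[0]] = int(parts[1])
    return values


def unused_cache_buffers_kb(meminfo):
    """Sum of free memory, buffers and page cache, in kB."""
    return meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]


def swap_activity(vmstat_before, vmstat_after):
    """Pages swapped in and out between two /proc/vmstat samples."""
    swapped_in = vmstat_after["pswpin"] - vmstat_before["pswpin"]
    swapped_out = vmstat_after["pswpout"] - vmstat_before["pswpout"]
    return swapped_in, swapped_out
```

If swap_activity keeps returning (0, 0) between samples, that would match a flat swap graph.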
