Nagios plugin: check_sa.pl

There are a lot of useful Nagios add-ons out there. One of them, pnp4nagios, lets you graph all of your Nagios performance data with zero configuration. This is pretty nice, because your monitoring configuration lives in one place, rather than being maintained separately for Nagios and Cacti (or whatever you use).

I’ve always wanted to be able to monitor things like the number of open sockets, page faults, context switches, and other performance counters. Some of them are available through SNMP; others aren’t. And of the ones that are, not all are broken down by device. I wanted a little bit more detail.

The other problem with SNMP queries is that a Nagios check reads a point-in-time value rather than an average, and something that spikes for a minute is not the same as a condition that persists for several minutes or hours. I wanted to leverage the built-in accounting in sysstat to pull together something Nagios can actually make a little bit of sense out of.
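To make the spike-versus-average point concrete, here is a minimal Python sketch. The sample values, the one-minute interval, and the threshold of 5000 are all invented for illustration; they are not sysstat output and not how check_sa.pl itself works.

```python
from statistics import mean

# Hypothetical one-minute samples of some counter (say, context switches/s).
# The numbers are made up; one sample contains a short spike.
samples = [1200, 1150, 1180, 9800, 1210, 1190, 1170, 1160, 1220, 1205]

ALERT_THRESHOLD = 5000  # assumed threshold, for illustration only


def window_average(values, minutes):
    """Average of the last `minutes` one-minute samples."""
    return mean(values[-minutes:])


# A single poll that happens to land on the spike sees only that value.
at_spike = max(samples)

# The same spike, averaged over a 10-minute window.
averaged = window_average(samples, 10)

print(at_spike > ALERT_THRESHOLD)   # the instantaneous reading would alert
print(averaged > ALERT_THRESHOLD)   # the averaged reading would not
```

A one-minute spike trips the instantaneous check but barely moves the 10-minute average, which is the behaviour you want when you only care about sustained conditions.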

Anyway, I went ahead and created a Nagios plugin that parses the output of sadf (a frontend to the sa/sar performance counters). You can query multiple counters in one run, specifying separate alert thresholds for each (or none at all, if you just want performance data). You can specify, via shell-style glob patterns, which devices to include or exclude, so that you can, for example, exclude all “lo” and “tun*” devices from network statistics monitoring. You can also pick the sampling period, so if you want an average over the last 30 minutes, the plugin will produce it.
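The idea of per-counter thresholds, with some counters carrying none, can be sketched roughly as below. This is a Python illustration, not the plugin's actual code: the function names, the data layout, and the assumption that a higher value is always worse are all hypothetical. Only the exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL) follow the standard Nagios plugin convention.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def evaluate(value, warn=None, crit=None):
    """Status for one counter; a threshold of None means 'not checked'.

    Assumes higher is worse, which would not suit a counter like %idle.
    """
    if crit is not None and value >= crit:
        return CRITICAL
    if warn is not None and value >= warn:
        return WARNING
    return OK


def overall_status(readings, thresholds):
    """readings: {counter: value}; thresholds: {counter: (warn, crit)}.

    Counters without an entry in `thresholds` only contribute perfdata.
    """
    worst = OK
    for counter, value in readings.items():
        warn, crit = thresholds.get(counter, (None, None))
        worst = max(worst, evaluate(value, warn, crit))
    return worst


readings = {"%usr": 2.89, "%sys": 0.46, "%idle": 96.54}
thresholds = {"%usr": (80, 95)}  # only %usr is checked; values are made up

status = overall_status(readings, thresholds)
print(status)
```

The worst individual status becomes the status of the whole check, so one run can cover many counters while still raising a single alert.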

You can do stuff like this:

./check_sa.pl -i -C %usr -C %soft -C %sys -C %idle -D all
SA OK - All counters within specified thresholds. | %idle[cpu0]=96.84;; %idle[cpu1]=96.31;; %idle[cpu2]=97.23;; %idle[cpu3]=95.8;; %soft[cpu0]=0;; %soft[cpu1]=0.01;; %soft[cpu2]=0;; %soft[cpu3]=0.01;; %sys[cpu0]=0.4;; %sys[cpu1]=0.46;; %sys[cpu2]=0.36;; %sys[cpu3]=0.63;; %usr[cpu0]=2.67;; %usr[cpu1]=3.13;; %usr[cpu2]=2.27;; %usr[cpu3]=3.46;;

Or, if you prefer to summarize:

./check_sa.pl -i -C %usr -C %soft -C %sys -C %idle -d all
SA OK - All counters within specified thresholds. | %idle[all]=96.54;; %soft[all]=0;; %sys[all]=0.46;; %usr[all]=2.89;;
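The part after the `|` is Nagios performance data, which is what pnp4nagios graphs. As a rough illustration of the `counter[device]=value;;` layout seen above, here is a small Python sketch that builds such a string. The function names and the input structure are assumptions made for this example, and the number formatting is only inferred from the sample output; the plugin's own formatting code may differ.

```python
def format_value(value):
    """Render 95.80 as '95.8' and 0.00 as '0', like the sample output."""
    return f"{value:.2f}".rstrip("0").rstrip(".")


def perfdata(readings):
    """readings: {(counter, device): value} -> perfdata string.

    The two trailing semicolons are the empty warn and crit fields.
    """
    parts = []
    for (counter, device), value in sorted(readings.items()):
        parts.append(f"{counter}[{device}]={format_value(value)};;")
    return " ".join(parts)


# Values copied from the summarized example above.
readings = {
    ("%usr", "all"): 2.89,
    ("%sys", "all"): 0.46,
    ("%soft", "all"): 0.0,
    ("%idle", "all"): 96.54,
}

line = perfdata(readings)
print(line)
```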

It’s still a tiny bit slow (it takes about 500-600 ms to run on the systems I’ve tested), but this should be good enough to be useful without bogging down Nagios too badly.

The script requires the Text::Glob module to be installed, so it can convert shell-style globs into regular expressions to match against.
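For readers unfamiliar with the technique, here is a minimal Python sketch of glob-based device filtering. Python's standard `fnmatch.translate` stands in for Text::Glob here, and the two may differ in edge cases. The function names and the include/exclude interface are hypothetical and do not reflect the plugin's actual options.

```python
import fnmatch
import re


def glob_to_regex(pattern):
    """Compile a shell-style glob such as 'tun*' into a regular expression."""
    return re.compile(fnmatch.translate(pattern))


def select_devices(devices, include=("*",), exclude=()):
    """Keep devices matching any include glob and no exclude glob."""
    inc = [glob_to_regex(p) for p in include]
    exc = [glob_to_regex(p) for p in exclude]
    return [
        d for d in devices
        if any(r.match(d) for r in inc) and not any(r.match(d) for r in exc)
    ]


# Made-up interface names, echoing the "lo" and "tun*" example in the post.
devices = ["lo", "eth0", "eth1", "tun0", "tun1"]

selected = select_devices(devices, exclude=("lo", "tun*"))
print(selected)
```

Compiling each glob once and reusing the resulting regular expressions keeps the per-device matching cheap, which matters when a check runs every few minutes.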

View the project:

Posted in Sysadmin.



3 Responses


  1. GregM says

    I like this, and I think I’ll be using it in a test deployment. I have a lot of questions about specific use cases though.

    • Jeff says

      It’s sort of a quick-and-dirty thing that I wrote up to solve a particular problem at my job, but I’d be happy to answer any questions about it. Email me at jeff @ thisdomain.

  2. Brian says

    I’d suggest adding something like --sadf-options. If I want to use this to gather CPU information, and only as a summary of all CPUs, then the command to run would be:
    ./check_sa.pl -i -C %user -C %nice -C %system -C %iowait -C %idle --sa-log-dir=/var/log/sa --sadf-options -u

    Or for all CPUs individually:
    ./check_sa.pl -i -C %user -C %nice -C %system -C %iowait -C %idle --sa-log-dir=/var/log/sa --sadf-options "-P ALL -u"

    Etc.

    The advantage of this is that it significantly increases the speed of the check. With just -u it is ‘real 0m0.088s’, and with -u -P ALL it is ‘real 0m0.191s’. With -A it is ‘real 0m0.356s’. When checking 3000+ servers, every little bit helps :)


