Default monitoring alerts are awful

I’ve been putting some serious thought recently into how to improve the issue turnaround time of my operations team, and one really sore point that stuck out to me was the notifications coming out of our monitoring system. Like many shops, we’re using Nagios/Icinga, one of the most flexible monitoring packages ever written, and yet for a decade we’ve been running with default alerts that give you almost no context. They tell you what, not why.

Here’s a boilerplate Nagios notification email:
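
If you’ve used the stock notify-by-email command, you already know the shape of it. The host, service, and plugin output below are made up for illustration, but the rest is more or less what the default template spits out:

    Subject: ** PROBLEM Service Alert: web01/HTTP is CRITICAL **

    ***** Nagios *****

    Notification Type: PROBLEM

    Service: HTTP
    Host: web01
    Address: 10.0.0.15
    State: CRITICAL

    Date/Time: Sun May 26 08:06:00 EDT 2013

    Additional Info:

    CRITICAL - Socket timeout after 10 seconds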

For some people, especially in smaller environments, this is good enough: it tells you that something is not working, which is the most important thing a monitoring system does. For us, it’s horrendous! For starters:

  1. A brand-new operations engineer on the team who sees this alert will have absolutely no idea what the server is, how important the alert is, or who to contact about it if they need to escalate.
  2. There’s no context on other recent (and likely-related) incidents impacting the same host, and no record of recent maintenance that may be linked to the problem.
  3. It doesn’t provide an easy way to get more information about the outage.

So, we hacked on the notification a little bit. Before the alert is sent, the alert information is formatted as JSON and piped into a utility that performs data lookups and inserts the results into an email template.
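
On the Nagios side, the notification command just builds a JSON blob out of the relevant macros (printf is enough for that) and pipes it into the script. Here’s a minimal sketch of the script side of things; the field names, template, and addresses are illustrative, not our actual implementation:

    #!/usr/bin/env python
    # Minimal sketch of the enrichment utility: read the alert as JSON from
    # stdin, gather context, render a friendlier email. Everything named here
    # (fields, template, addresses) is illustrative.
    import json
    import smtplib
    import sys
    from email.mime.text import MIMEText

    TEMPLATE = """\
    {type}: {service} on {host} is {state}

    Plugin output:
      {output}

    Host details (from the CMDB):
    {cmdb_info}

    Recent incidents:
    {incidents}

    Recent maintenance:
    {maintenance}
    """


    def lookup_cmdb(host):
        # Query the CMDB (GLPI for us; PuppetDB works nicely too) for OS and
        # hardware details. Stubbed out here.
        return "  (CMDB details go here)"


    def lookup_incidents(host):
        # Pull the five most recent incidents touching this host out of the
        # ticketing system. Also stubbed.
        return "  (recent incident links go here)"


    def lookup_maintenance(host):
        # Same idea, for recent maintenance/change records.
        return "  (recent maintenance links go here)"


    def main():
        alert = json.load(sys.stdin)
        body = TEMPLATE.format(
            cmdb_info=lookup_cmdb(alert["host"]),
            incidents=lookup_incidents(alert["host"]),
            maintenance=lookup_maintenance(alert["host"]),
            **alert)

        msg = MIMEText(body)
        msg["Subject"] = "** {state} ** {service} on {host}".format(**alert)
        msg["From"] = "nagios@example.com"
        msg["To"] = alert["contact"]
        smtplib.SMTP("localhost").sendmail(msg["From"], [msg["To"]], msg.as_string())


    if __name__ == "__main__":
        main()

(The real script runs its lookups concurrently with gevent so that one slow backend doesn’t hold up the whole notification.)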

With a little bit of tender loving care, we ended up with this guy:

Right off the bat:

  1. OS information from our CMDB is included right in the message. We’re using GLPI (with OCS Inventory as the root source), but something like PuppetDB works nicely too.
  2. Up to five recent incidents are linked in the alert.
  3. Up to five recent maintenances are linked in the alert.
  4. The alert links to far more information about the host than an engineer should need in order to figure out what to actually do about it.

Some things we might consider adding in the future:

  1. Embedding the most recent system state (CPU/memory usage, swap activity, disk I/O, etc.) from Graphite right into the alert
  2. Pulling the most recent critical log messages out of Logstash/ElasticSearch and embedding them directly into the alert for context (a rough sketch of what both of these lookups might look like follows this list)
  3. That Kibana URL really needs to be run through a URL shortener. Plumbing that into the alert system was just one more headache I don’t need just yet.
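
For the first two, the lookups are mostly a matter of hitting the right HTTP endpoints. Here’s a rough sketch of what they might look like, assuming a fairly typical Graphite metric naming scheme and Logstash index layout (the endpoints and field names here aren’t our real ones):

    import json

    import requests

    GRAPHITE = "http://graphite.example.com"
    ELASTICSEARCH = "http://elasticsearch.example.com:9200"


    def recent_cpu(host):
        # Last 15 minutes of CPU usage from Graphite's render API.
        resp = requests.get(GRAPHITE + "/render", params={
            "target": "servers.%s.cpu.user" % host,  # assumed metric path
            "from": "-15min",
            "format": "json",
        })
        series = resp.json()
        return series[0]["datapoints"] if series else []


    def recent_critical_logs(host):
        # Five most recent error/critical log entries for this host from the
        # Logstash indices in Elasticsearch.
        query = {
            "query": {"query_string": {
                "query": 'host:"%s" AND level:(ERROR OR CRITICAL)' % host}},
            "sort": [{"@timestamp": {"order": "desc"}}],
            "size": 5,
        }
        resp = requests.post(ELASTICSEARCH + "/logstash-*/_search",
                             data=json.dumps(query),
                             headers={"Content-Type": "application/json"})
        return [hit["_source"] for hit in resp.json()["hits"]["hits"]]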

 

4 Comments

  1. Apparently you are reading my mind. Do you mind if I ask how you are doing this? Did you hack on Nagios itself, or is this a separate script?

    • Whoops, sorry, I guess I didn’t have email notifications turned on for this blog. Hope I’m not too late!

      Email alerts in Nagios are, like check plugins, just external commands that do a thing. In our case, we wrote a script (Python/gevent) that took data from stdin, did actual intelligent things with it like querying our Logstash, Graphite, JIRA and GLPI instances, and put it into a nicely-formatted email that gives a lot of information up front to our on-call engineer.

      I apparently created a GitHub repo, but never polished it up enough to post there — whoops again. If you want to watch this space, you’ll get notified when it updates:
      https://github.com/jgoldschrafe/revere

  2. Martin Cleaver

    26 May, 2013 at 8:06 AM

    Hi Jeff,

    Interesting setup!

    I do look forward to you sharing it on github!

    Best, Martin
