I’ve been putting some serious thought recently into how to improve the issue turnaround time of my operations team, and one really sore point that stuck out to me was the notifications coming out of our monitoring system. Like many shops, we’re using Nagios/Icinga, one of the most flexible monitoring packages ever to exist, and yet for a decade we’ve been running with default alerts that give you almost no context. They tell you what, not why.
Here’s a boilerplate Nagios notification email:
Notification Type: PROBLEM
Date/Time: Tue Nov 6 14:13:23 EST 2012
For some people, especially in smaller environments, this is good enough: it tells you that something is not working, which is the most important thing a monitoring system does. For us, it’s horrendous! For starters:
- A brand-new operations engineer on the team who sees this alert will have absolutely no idea what the server is, how important the alert is, or who to contact about it if they need to escalate.
- There’s no context on other recent (and likely-related) incidents impacting the same host, and no record of recent maintenance that may be linked to the problem.
- It doesn’t provide any really easy way to get more information about the outage.
So, we hacked on the notification a little bit. Before the alert is sent, the alert information is formatted as JSON and piped into a utility that performs data lookups and inserts the results into an email template.
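To give a feel for the shape of that pipeline, here is a minimal Python sketch. It is not the actual utility: the JSON field names (`host`, `service`, `output`), the `lookup_host` helper, the `FAKE_CMDB` dictionary, and the template text are all hypothetical stand-ins, and a real version would query the CMDB and ticket tracker instead of a hard-coded dictionary.

```python
import json
import sys
from string import Template

# Hypothetical stand-in for a CMDB query (GLPI, PuppetDB, ...).
FAKE_CMDB = {
    "it-graylog01": {
        "os": "Red Hat Enterprise Linux Server release 6.3 (Santiago)",
        "owner": "Information Technology",
    },
}

# Hypothetical email template; the field names are assumptions.
EMAIL_TEMPLATE = Template(
    "Host: $host\n"
    "Service: $service\n"
    "Output: $output\n"
    "OS: $os\n"
    "Owner: $owner\n"
)


def lookup_host(hostname):
    """Return extra context for a host, or placeholders if it is unknown."""
    return FAKE_CMDB.get(hostname, {"os": "unknown", "owner": "unknown"})


def render_alert(alert_json):
    """Merge the alert fields with the looked-up context and fill the template."""
    alert = json.loads(alert_json)
    fields = dict(alert)
    fields.update(lookup_host(alert["host"]))
    return EMAIL_TEMPLATE.substitute(fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read one JSON alert from stdin and write the rendered email body."""
    stdout.write(render_alert(stdin.read()))
```

Calling `main()` from a small wrapper script lets the monitoring system pipe the JSON in on stdin and hand the result to whatever sends the mail. Keeping the lookups in one function means a new data source only needs to add keys that the template can reference.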
With a little bit of tender loving care, we ended up with this guy:
Service: Graylog2 Server
Timestamp: Thu Oct 25 05:06:52 EDT 2012
PROCS CRITICAL: 0 processes with command name 'java', args 'graylog2-server'
OS: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Owner: Information Technology
Datacenter: Floating Virtual
Admin: Jeff Goldschrafe
- SYS-80 graylog2-server service out of Java heap space
- SYS-75 graylog2-server service hang
- SYS-130 it-graylog01 disk expansion
Right off the bat:
- OS information from our CMDB is included right in the message. We’re using GLPI (with OCS Inventory as the root source), but something like PuppetDB works nicely too.
- Up to five recent incidents are linked in the alert.
- Up to five recent maintenances are linked in the alert.
- The alert contains links to far more information about the host than an engineer should need to work out what to actually do with the alert.
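The "up to five recent" rule for incidents and maintenances could be sketched roughly like this in Python. The ticket fields (`key`, `summary`, `host`, `kind`, `opened`) and the sample tickets are made up for illustration; a real version would pull tickets from the issue tracker rather than a list in memory.

```python
from datetime import date, timedelta

MAX_LINKS = 5  # the alert links at most five tickets of each kind


def recent_tickets(tickets, hostname, kind, limit=MAX_LINKS):
    """Return the newest `limit` tickets of one kind for one host."""
    matching = [
        t for t in tickets
        if t["host"] == hostname and t["kind"] == kind
    ]
    matching.sort(key=lambda t: t["opened"], reverse=True)
    return matching[:limit]


def ticket_lines(tickets):
    """Format tickets as the '- KEY summary' lines used in the alert body."""
    return ["- {} {}".format(t["key"], t["summary"]) for t in tickets]


# Made-up sample data: seven incidents opened on consecutive days.
SAMPLE = [
    {
        "key": "EX-{}".format(n),
        "summary": "example incident {}".format(n),
        "host": "it-graylog01",
        "kind": "incident",
        "opened": date(2012, 10, 1) + timedelta(days=n),
    }
    for n in range(1, 8)
]

for line in ticket_lines(recent_tickets(SAMPLE, "it-graylog01", "incident")):
    print(line)
```

Sorting newest-first before truncating means the two oldest of the seven sample incidents are the ones dropped, so the alert stays short no matter how long a host's ticket history gets.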
Some things we might consider adding in the future:
- Embedding the most recent system state (CPU/memory usage, swap activity, disk I/O, etc.) from Graphite right into the alert
- Pulling the most recent critical log messages out of Logstash/ElasticSearch and embedding them directly into the alert for context
- That Kibana URL really needs to be run through a URL shortener. Plumbing that into the alert system was just one more headache I don’t need just yet.
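For the Graphite idea, one possible starting point is to build a URL against Graphite's render endpoint and fetch or link it from the alert. This Python sketch only constructs the URL; the Graphite hostname and the metric path are hypothetical, since metric naming depends entirely on how a given site feeds Graphite.

```python
from urllib.parse import urlencode


def graphite_render_url(base_url, target, window="-1h", fmt="json"):
    """Build a Graphite render URL for one metric over a recent time window."""
    query = urlencode({"target": target, "from": window, "format": fmt})
    return "{}/render?{}".format(base_url.rstrip("/"), query)


# Hypothetical Graphite host and metric path, for illustration only.
url = graphite_render_url(
    "http://graphite.example.com",
    "servers.it-graylog01.cpu.user",
)
print(url)
```

Asking for `format="png"` instead would give an image that could be attached or linked, which may be easier to read in an email than raw datapoints.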