When I’m configuring Sensu checks, especially ones that make direct use of variables in my Sensu configuration, I get annoyed that testing them is more difficult than it needs to be. I’ve hacked up a very quick-and-dirty tool called sensu-run for testing arbitrary commands and standalone checks. Give it a try and see how it works!
If you’ve followed my projects previously, you know that while I love Nagios and its stepbrother Icinga, they’re often a nuisance and the butt of lots of jokes (see: Jordan Sissel’s PuppetConf 2012 talk on Logstash). A big part of my work over the last several months has focused on how to make interacting with them more productive. Nagios is totally happy to blast you with alerts, but doesn’t give you a way to, say, silence a false positive when you’re on vacation in the middle of the mountains, miles away from Internet service reliable enough to run a VPN connection and a web browser.
I wasn’t happy with the state of email interaction with it, so I went ahead and wrote Koboli, a seriously extensible mail processor for Nagios and Icinga. Koboli is written in Python, and named after a mail sorter from The Legend of Zelda: The Wind Waker. It works out of the box with alerts in Nagios’s default format, but is easy enough to set up to extract fields from emails in whatever format you’ve decided to send them.
The basic idea of Koboli is that it gives you a simple #command syntax that allows you to interact easily with your monitoring system without leaving your email client. If you’ve ever worked with systems like Spiceworks, you’ve already got the basic idea down.
#comment NIC is flaking out and alerting every 10 minutes. Will look into on Monday.
This is useful enough when you’re just interacting with your monitoring system, but you can extend it to do lots of other cool things too. For example, this initial release can also create issues in JIRA.
With a one-line command, the alert is now in our incident database where we can track and remediate it appropriately.
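To make the #command idea concrete, here is a minimal sketch of how command lines could be pulled out of a reply’s body. This is not Koboli’s actual code: the regular expression, the `parse_commands` helper, and the `#ack` command in the usage note are all assumptions made up for illustration.

```python
import re

# Assumed syntax: a line that starts with '#', then a command word,
# then optional free-form arguments. Koboli's real grammar may differ.
COMMAND_RE = re.compile(r"^#(?P<name>\w+)(?:[ \t]+(?P<args>.*))?$")


def parse_commands(body):
    """Return a list of (name, args) tuples, one per '#command' line."""
    commands = []
    for line in body.splitlines():
        match = COMMAND_RE.match(line.strip())
        if match:
            args = (match.group("args") or "").strip()
            commands.append((match.group("name"), args))
    return commands
```

Fed a reply containing the `#comment` line shown earlier plus a hypothetical bare `#ack` line, this would hand back one tuple per command and ignore the ordinary text and quoted lines around them.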
This project is just in the beginning stages, and I hope some people find it useful — it was quite a bit more work than I thought.
Sometimes, we need to do SAN maintenance — firmware upgrades, disruptive fabric changes, and the like. When these situations come up, it’s useful to know if anything is in a condition where it will break if it loses its connection to SAN storage, especially if you’re a lowly storage administrator without admin access to any of the Windows systems connected up to the SAN.
I poked around, and could not find one single utility or tool for monitoring the Windows MPIO framework, so I whipped up a quick script using VBScript and WMI. The script is called like so:
cscript.exe //NoLogo scripts\CheckMpioPaths.vbs /paths 4
(4 paths are used because the server is multipathed on two fabrics, and each of the active/passive controllers is also on each fabric — the server should see 2 controllers on 2 fabrics each, for 4 paths.)
This will cause the script to issue a Nagios CRITICAL if any multipath-registered LUN shows fewer than the given number of paths.
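The threshold logic is simple enough to sketch in a few lines. The version below is in Python rather than VBScript, and it is not the script’s actual code: the `evaluate_paths` helper, the LUN names, and the choice to return UNKNOWN when no LUNs are found are my assumptions. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_paths(path_counts, expected):
    """Map {LUN name: paths seen} to a (exit_code, message) pair."""
    if not path_counts:
        # Assumption: seeing no multipath LUNs at all is reported as UNKNOWN.
        return UNKNOWN, "UNKNOWN: no multipath LUNs found"
    degraded = {lun: n for lun, n in path_counts.items() if n < expected}
    if degraded:
        detail = ", ".join(
            f"{lun} has {n}/{expected} paths"
            for lun, n in sorted(degraded.items())
        )
        return CRITICAL, f"CRITICAL: {detail}"
    return OK, f"OK: all {len(path_counts)} LUNs have at least {expected} paths"
```

A real plugin would print the message and exit with the returned code; in the actual script the path counts come from WMI rather than from a hand-built dictionary.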
As usual, you can find the script in the GitHub repository for CheckMpioPaths.
Ask yourself a question: for every piece of resiliency you supposedly have in your network, are you really positive that it’s not running in a degraded state? Really, really sure?
Sometimes, it’s basic: are you being alerted when any disk array attached to any server suffers a disk failure?
Very often, it’s not: for your SAN-attached systems, are you positive that the multipathing is green? If you’re connected to two storage processors or controllers, can the server see two paths to each of them? Are you getting alerted if you can’t?
Are your port channels running over the number of links that they’re supposed to? How about the ISLs on your FC fabrics?
If you have failover clusters where services run on preferred nodes, are you sure they’re actually located where they’re supposed to be? Are you monitoring that services are all running on their preferred nodes?
If you have asymmetric fall-back connections, like a gigabit switch uplink used to back up a 10-gigabit switch uplink, are you notified when it’s using the backup connection, or do you rely on your users to tell you that things seem to be running slowly?
There’s a difference between things running, and things running smoothly: making sure that your “redundant” equipment and services are actually redundant is the key to keeping issues from turning into problems.
If you’ve used IBM SAN products, particularly the DS4000, DS5000 and DS6000 series (which are rebranded LSI), one of the most obnoxious things about them is how you’re pretty much forced to roll your own monitoring tools. Compared to many mainstream vendors (and Sun/Oracle in particular), IBM’s performance monitoring and modelling tools have been lackluster at best and missing entirely at worst. The best tool you’ve got is the SMcli, which doesn’t supply a ton of good information, but at least provides you with a starting point for capacity planning.
I had originally wanted to make something like this for Cacti, which probably has a much broader install base than the pnp4nagios addon, but the Nagios way was just so easy, and I’d like to share it with anyone who doesn’t want to roll their own basic performance aggregator for it.
This tool gets the following statistics:
- Read percentage
- Cache hit percentage
It gets statistics at the following levels:
- Logical Unit
- Physical Array
It’s a little quick-and-dirty, but it works.
Like my other projects, it’s hosted on GitHub, so check out the GitHub project for check_smcli_io.
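For anyone curious how statistics like these reach pnp4nagios, here is a rough sketch of formatting percentages as Nagios performance data, the `label=value[UOM];warn;crit;min;max` text that follows the `|` in plugin output. It is not taken from check_smcli_io: the helper names and the metric labels in the usage note are invented for illustration.

```python
def format_perfdata(stats):
    """Render {label: percentage} as a Nagios perfdata string."""
    parts = []
    for label, value in sorted(stats.items()):
        # Labels containing spaces must be wrapped in single quotes.
        name = f"'{label}'" if " " in label else label
        # Warn/crit thresholds are left empty; min and max are 0 and 100.
        parts.append(f"{name}={value:.1f}%;;;0;100")
    return " ".join(parts)


def plugin_output(summary, stats):
    """Join the human-readable summary and perfdata with the '|' separator."""
    return f"{summary} | {format_perfdata(stats)}"
```

Calling `plugin_output("OK: stats collected", {"lun0_read_pct": 62.5, "lun0_cache_hit_pct": 80})` with those made-up labels yields a single line that pnp4nagios can graph, one series per label.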