Before starting with Rabbit, I worked at Cold Spring Harbor Laboratory as an IT manager. In mid-2013, our Web Development Manager position, a peer role to mine, had been open for six months with very few qualified applicants. While the job was not a glamorous one — “CMS developer in academia” doesn’t have the sex appeal of a startup — we weren’t getting any bites on the posting that HR had put out on the Internet. My director came to me and asked what we should do about the position.
I mulled over the posting for a few days before making a few judicious edits. What I handed back had Web Development Manager crossed off and replaced with Lead Web Developer. Underneath the Requirements section, “at least two years of management experience” was replaced with “at least two years as a manager, team lead, or senior developer.” After some discussion, my changes were approved and HR uploaded the revised job posting.
We had an offer out to a candidate within two weeks.
Attracting great talent anywhere is hard. We tend to obsess over the job descriptions that we post, trying to find new and interesting ways to sell the company with unlimited vacation policies and fully-stocked fridges. Sometimes we appeal to the reader’s ego directly by using words like rockstar or ninja. But we tend to focus very hard on descriptive words, and we frequently ignore the deeper context buried in those words.
What we wanted was somebody to develop frontend and backend code and delegate tasks to two other team members. While the job description included some management responsibilities, the core of the job was individual contribution. When we put Manager in the job title and over-emphasized management experience in our requirements, we immediately communicated to anyone reading the posting that whoever filled the position would spend most of the day doing manager-y things like talking to stakeholders and curating Gantt charts. Everything else we put in that job description might as well not have been there.
If you want to attract the best candidates you can, you might be taking the wrong approach by trying to sell them on the company first. Your goal should be to figure out why the day-to-day work is meaningful, tap into that, and tie the organization back into the pitch. And if the job title in the posting happens to be an impediment to getting that point across, don’t be afraid to change it.
After a couple of months of not receiving the TLC it deserved, I’ve pushed a major update to Metricinga on GitHub. Here are the highlights:
- Completely rewritten. I wasn’t really happy with the tight coupling of components in the old version; among other things, it made writing tests really hard. The new version uses extremely loose coupling between greenlets, so I can finally get around to writing proper regression tests. It should also make it a lot simpler to write metrics to multiple backends at once (StatsD, OpenTSDB, etc.) when that support is implemented.
- Better inotify support. Having up-to-date information is really important for some metrics, so I’ve made it a point to have reasonably well-functioning inotify support in Metricinga. It will start dumping metrics the second a file is closed for writing or moved into the directory.
- Better delete-file handling. In some cases, the old Metricinga could drop data if a file was deleted before all of its parsed metrics had been offloaded into Graphite. We now reference-count metrics sourced from each file, so a file is never deleted until its contents have been sent to Graphite successfully.
- Init script for CentOS/RHEL. Yay!
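The delete-file handling mentioned above boils down to a per-file reference count. Here’s a minimal sketch of the idea (illustrative only, not Metricinga’s actual code; all names are made up):

```python
class FileRefCounter:
    """Reference-count in-flight metrics per source file.

    The on_zero callback (e.g. os.unlink) only fires once every metric
    parsed out of a file has been acknowledged by the backend.
    """

    def __init__(self, on_zero):
        self.counts = {}
        self.on_zero = on_zero  # called with the path once it drains

    def acquire(self, path, n=1):
        # n metrics parsed out of `path` are now in flight
        self.counts[path] = self.counts.get(path, 0) + n

    def release(self, path):
        # one metric from `path` was sent to Graphite successfully
        self.counts[path] -= 1
        if self.counts[path] == 0:
            del self.counts[path]
            self.on_zero(path)  # now it's safe to delete the file
```

Until the count drains to zero, the source file sticks around, so a crash or disconnect mid-send can’t lose data.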
Grab it, file bugs, file pull requests, let me know what you think!
I like robust management infrastructures. They make me happy. But sometimes, tiny behaviors can send you on a wild goose chase.
Being fairly inexperienced with both MCollective and RabbitMQ, though, I ran into an interesting issue with ours off and on over the last couple of weeks. One night, our MCollective installation, which had been working fine for weeks or months, started to exhibit the following behavior from our control node:
- Issuing an mco ping would return a list of all the nodes in our environment.
- Issuing another mco ping would cause no nodes at all to turn up.
- Restarting the MCollective agent on any one node would cause that node to show up in the next mco ping, but not any subsequent one.
- Any activity besides mco ping would fail.
This would continue for a little while, then magically resolve itself until it would randomly present itself again a few days down the road.
Turning up the MCollective logging level on both the client and server, I could see that the agent was putting messages into the reply queue, but the client wasn’t receiving them, with no good indication why.
Digging deeper, I ran netstat -an to look at the connection state. I saw high Recv-Q and Send-Q counters on the connections, so epmd (the Erlang Port Mapper Daemon, not Erick and Parrish Making Dollars) and the broker behind it weren’t even pulling the data out of the socket. I took a look at some traffic dumps of MCollective running with a single agent, with the aes_security plugin disabled to make the payload easy to inspect, but that didn’t reveal much either, because Wireshark doesn’t have a dissector for STOMP.
So, I set up RabbitMQ on a temporary system to see what would happen. To my chagrin, that system’s MQ worked just fine. I poked around the logs on our production Puppet/MCollective/RabbitMQ system and found nothing of any value besides a bunch of notices that nodes had connected.
Since we recently upgraded the whole VMware environment that houses Puppet, MCollective and most of our other tools, I started to look into everything else. I upgraded the virtual hardware, VMware Tools, and accompanying drivers trying to figure out if it was related to our recent ESXi upgrade from 4.1 to 5.1. With the problem still occurring, I dumped the paravirtualized vmxnet3 driver entirely in favor of the standard emulated Intel e1000 driver. No dice. netstat continued to show high Recv-Q and Send-Q and the RabbitMQ management interface showed no messages traversing the system.
Getting more frustrated, I completely trashed the RabbitMQ configuration and set it up again from scratch, which, it turns out, didn’t help at all. mco ping, one response. mco ping again, no response. Restart the MCollective agent and mco ping again, one response. In a last-ditch effort, I updated MCollective to 2.3.1 (development) and RabbitMQ 3.0.3 (stable, released literally that day) and tried again. No luck.
After a bunch of digging and asking others for their thoughts, the consensus was that RabbitMQ was deliberately dropping connections for some reason. Finally, I stumbled upon this stupid thing:
It turns out I didn’t have enough disk free on the host. Because of disk hot-grow quirks in Linux, we run Linux VMs with very small root partitions (5 GB) and separate partitions for data volumes (/var/lib/mysql, etc.), so having less than 1 GB free on the root filesystem is a really common occurrence. The default RabbitMQ configuration doesn’t like this very much, and will throttle producers with exactly the behavior I was seeing.
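For reference, the knob in question is disk_free_limit in rabbitmq.config. If I’m reading the docs of that era correctly, the default was relative to installed memory ({mem_relative, 1.0}), which a 5 GB root partition usually can’t satisfy. Something like this lowers it to an absolute threshold (pick a value appropriate for your environment):

```erlang
%% /etc/rabbitmq/rabbitmq.config
%% Only throttle producers when free disk drops below ~500 MB.
[
  {rabbit, [
    {disk_free_limit, 500000000}
  ]}
].
```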
Dear RabbitMQ devs: a log message would be lovely when you start throttling messaging because of resource usage, thanks.
If you’ve followed my projects previously, you know that while I love Nagios and its stepbrother Icinga, they’re often a nuisance and the butt of lots of jokes (see: Jordan Sissel’s PuppetConf 2012 talk on Logstash). A big part of my work over the last several months has focused on making interacting with them more productive. Nagios is totally happy to blast you with alerts, but doesn’t give you a way to, say, silence a false positive when you’re on vacation in the middle of the mountains, miles away from Internet service reliable enough to run a VPN connection and a web browser.
I wasn’t happy with the state of email interaction with it, so I went ahead and wrote Koboli, a seriously extensible mail processor for Nagios and Icinga. Koboli is written in Python, and named after a mail sorter from The Legend of Zelda: The Wind Waker. It works out of the box with alerts in Nagios’s default format, but is easy enough to set up to extract fields from emails in whatever format you’ve decided to send them.
The basic idea of Koboli is that it gives you a simple #command syntax that allows you to interact easily with your monitoring system without leaving your email client. If you’ve ever worked with systems like Spiceworks, you’ve already got the basic idea down.
#comment NIC is flaking out and alerting every 10 minutes. Will look into on Monday.
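The parsing for this style of syntax is trivial. Here’s a minimal sketch of what a #command dispatcher can look like (hypothetical, not Koboli’s actual internals):

```python
def parse_command(line):
    """Split a '#command arguments...' line into (command, argument string).

    Returns None for lines that aren't commands, so ordinary reply text
    passes through untouched.
    """
    if not line.startswith("#"):
        return None
    command, _, args = line[1:].partition(" ")
    return command.lower(), args.strip()
```

Each parsed command then maps to a handler: a comment handler calls the monitoring system’s external command interface, a ticket handler calls an issue tracker API, and so on.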
This is useful enough when you’re just interacting with your monitoring system, but you can extend it to do lots of other cool things too. For example, this initial release can also create issues in JIRA:
With a one-line command, the alert is now in our incident database where we can track and remediate it appropriately.
This project is just in the beginning stages, and I hope some people find it useful — it was quite a bit more work than I thought.
In my group of systems engineers, we’ve all become very comfortable users of JIRA. JIRA has been a popular bug tracking tool for developers for a good number of years, but it has a lot of very powerful features that also make it incredibly useful as a Project Management Emporium for system administrators. Beyond bug tracking and supplementing project management, it turns out to be really good at a lot of other things. Here’s a summary of what we use it for:
- Project/task tracking
- Software builds and custom application packages
- Change management/maintenance calendar
- Incident management
I’ve never been big on paperwork. If I’m going to go through all the trouble of documenting everything my team is and will be doing, there had better be a payoff. JIRA keeps that overhead low. Getting friendly with a few documentation processes is an unfortunate reality if you run a hugely heterogeneous environment for many departments, but that doesn’t mean it needs to be a miserable, team-strangling mess of red tape. I’ll comment below on a few ways we try to keep our processes lean.
Getting under the hood
People who have used JIRA significantly know that it’s really a lot more than a bug tracker. Out of the box, it does work really well as a bug tracking system. The real core strength of JIRA, though, lies in its incredibly robust workflow system. It’s so central to the flexibility and power of the product that many of Atlassian’s marketing folks prefer to talk about JIRA as a workflow engine. (I swear I saw this explained in much better detail in a video from Atlassian Summit, but I can’t find it now.)
The idea of custom workflows tends to bore most systems people to death, but it’s easier to stomach if you think of it like a finite state machine. Change management is an easy use case for custom workflows (albeit one that lots of people hate). A change ticket is opened in Needs Review state. Once I look it over, I can make it Approved or Denied. Someone can take that Denied request, fix what’s wrong with it, and change it back to Needs Review. When the maintenance is done, it’s Closed. We then have a record of it that shows up in our Icinga alerts and other key places when something goes wrong.
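Viewed as a finite state machine, the whole workflow fits in a dictionary. Here’s a sketch in Python (the state names come from the workflow above; the code is purely illustrative, not anything JIRA runs):

```python
# Legal transitions in the change-management workflow described above.
TRANSITIONS = {
    "Needs Review": {"Approved", "Denied"},
    "Denied":       {"Needs Review"},
    "Approved":     {"Closed"},
    "Closed":       set(),
}

def transition(current, target):
    """Move a ticket to `target`, refusing any edge the workflow doesn't allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target
```

JIRA’s workflow editor is essentially a GUI for defining this transition table, plus conditions and post-functions hung off each edge.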
JIRA also allows you to create a pile of custom fields, and separate them out by the issue types they belong to. This is ideal for doing things like tracking the actual start/end times of your system maintenance versus the windows that you’ve scheduled, so you can report on the accuracy of your estimates. Label types, which are like tags, are also awesome for correlating related issues together.
We run a mostly-stock JIRA configuration, with a few tiny enhancements. But one thing about our environment that’s sort of interesting is that we actually only use one JIRA project for all of our internal items. It makes things much simpler than creating and managing a pile of top-level projects.
In particular, we like these plugins from the Atlassian Marketplace:
- JIRA Wallboards: This nifty little plugin is designed for converting standard JIRA dashboards into something that’s easy to read on a television or giant monitor from across the office. I use it more than the team I manage, but it’s really nice for being able to check on project priorities, due dates, and so forth at a glance.
- JIRA Calendar Plugin: This plugin is so obviously useful that I have no real understanding of why it doesn’t just ship as part of JIRA. Being able to easily view upcoming due dates on all our internal tasks, as well as dates of upcoming maintenance events, is way too useful to pass up.
- JIRA Charting Plugin: This is self-explanatory and also probably the least useful thing in this entire post.
If you have a six-month-long project involving tightly structured timelines, and you need to find the best way to parallelize the people doing work on the project and discover what the most risk-prone tasks are to your timeline, JIRA probably isn’t the best bet: that’s something much more well-suited to a tool like Microsoft Project. JIRA is really good at coordinating a lot of small projects at the same time, which is where most small IT departments spend a lot of their time. (There are Gantt chart plugins out there as well, but I haven’t found them terribly useful.)
JIRA is about as useful as any other ticketing system for managing dozens of tiny projects at the same time and keeping tabs on all of them. One thing that is nice about JIRA is that its subtask implementation, while limited (it doesn’t allow subtasks of subtasks), is fairly competent compared to most basic ticketing systems.
I find that this really shines in conjunction with a decent wallboard plugin, which can provide everyone on a project with a slick real-time view of what people are doing on that project.
We compile a lot of code. Our biggest first-class service is our high-performance compute cluster, which supports a really substantial number of scientific computing applications. JIRA helps us keep track of what we need to build and for whom, and by when, as well as being able to easily relate issues on that software. We’re not really doing much special or of interest in this area, though.
Change control and maintenance calendaring
I really hate change control for the same reasons you do. I do it anyway for the same reasons anyone else does. (We keep our change management scope limited so that people can actually, you know, get work done. But if someone reboots the primary AD DNS server in the middle of the business day, there’s hell to pay.)
When someone is looking to perform maintenance on a crucial system, they open a maintenance request. This contains typical fields: impact, backout plan, projected start and end times, and so forth.
Changes are tagged using a custom Label field called Impacted hosts containing the FQDN of each impacted host. This makes it very easy to programmatically search for all prior maintenance on a host. We have this integrated into our Icinga notification script so that maintenances are automatically flagged as something that should be investigated in connection with the alert. (I should probably post this script, because while it’s nothing groundbreaking from an engineering perspective, I think it’s pretty neat.)
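Because Impacted hosts is just a label-style field, the lookup is a one-line JQL query. Here’s a sketch of how a notification script might build it (the field name matches the post; everything else is an assumption, not our exact configuration):

```python
def maintenance_jql(fqdn, field='"Impacted hosts"'):
    """Build a JQL query finding all prior maintenance that touched a host.

    Quoting the field name is how JQL addresses custom fields with spaces.
    """
    return f'{field} = "{fqdn}" ORDER BY created DESC'
```

The resulting string goes to JIRA’s search API, and the first few matching issues get folded into the outgoing Icinga notification.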
Once a change request is approved, it becomes a maintenance. It exists on the maintenance calendar for the helpdesk and other IT organizations to look at.
Like any decent shop supporting dozens of applications, we keep a fairly good incident database. This is a log of things that go wrong on servers, what our diagnostic process was, what the impact of the problem was, and how we fixed it. This is a huge help at bringing new on-call engineers up to speed on the infrastructure.
The incident database makes use of the same Impacted hosts field that we use on the change control/maintenance calendar. This is awesome because we can open up an incident, click on the host’s FQDN, and see all the maintenance work and other incidents that have been performed on that host since we started using the system. As with the maintenance database, this is queryable through the JIRA API, and we do exactly that to provide a list of related incidents whenever any Icinga alert goes out via email.
- SCM integration: It would take a lot of work out if we could integrate the Git repositories storing our Puppet code into JIRA, and use that to feed the list of maintenances. Since Atlassian only supports Git hosted on GitHub (and doesn’t support GitHub Enterprise accounts at the time of this writing), we’ll end up exposing a read-only copy of the repository through git-svn and pumping data in through the SVN plugin.
- Better Icinga integration: We already have JIRA maintenances and incidents showing up in our Icinga alerts. But oh, how I would love the holy grail of Icinga creating entries in the incident database by itself. Right now we put them in manually if they’re anything more complex than “the user filled up the disk.”
I’ve been putting some serious thought recently into how to improve the issue turnaround time of my operations team, and one really sore point that stuck out to me was the notifications coming out of our monitoring system. Like many shops, we use Nagios/Icinga, one of the most flexible monitoring packages ever to exist, and yet for a decade we’ve been running with default alerts that give you almost no context. They tell you what, not why.
Here’s a boilerplate Nagios notification email:
Notification Type: PROBLEM
Date/Time: Tue Nov 6 14:13:23 EST 2012
For some people, especially in smaller environments, this is good enough — it tells you that something is not working, which is the most important thing a monitoring system does. For us, it’s horrendous! For starters:
- A brand-new operations engineer on the team who sees this alert will have absolutely no idea what the server is, how important the alert is, or who to contact about it if they need to escalate.
- There’s no context on other recent (and likely-related) incidents impacting the same host, and no record of recent maintenance that may be linked to the problem.
- It doesn’t provide any really easy way to get more information about the outage.
So, we hacked on the notification a little bit. Before the alert is sent, the alert information is formatted as JSON and piped into a utility that performs data lookups and inserts them into an email template.
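Structurally, that utility is simple: parse the piped-in JSON, enrich it with lookups, and substitute into an email template. A stripped-down sketch of the shape (field names and the template are hypothetical, not our production script):

```python
import json
from string import Template

# Hypothetical template; the real one carries CMDB fields, incident
# links, maintenance history, and so on.
EMAIL = Template(
    "Service: $service\n"
    "Host: $host\n"
    "Output: $output\n"
)

def render_alert(raw, lookup_extra):
    """Parse the piped-in alert JSON, enrich it, and fill the email template.

    lookup_extra is a callable that fetches additional fields for a host,
    e.g. from a CMDB or the JIRA API.
    """
    alert = json.loads(raw)
    alert.update(lookup_extra(alert["host"]))
    return EMAIL.substitute(alert)
```

The real script reads the JSON from stdin, runs its CMDB and JIRA lookups, and hands the rendered body off to the mailer.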
With a little bit of tender loving care, we ended up with this guy:
Service: Graylog2 Server
Timestamp: Thu Oct 25 05:06:52 EDT 2012
PROCS CRITICAL: 0 processes with command name 'java', args 'graylog2-server'
OS: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Owner: Information Technology
Datacenter: Floating Virtual
Admin: Jeff Goldschrafe
- SYS-80 graylog2-server service out of Java heap space
- SYS-75 graylog2-server service hang
- SYS-130 it-graylog01 disk expansion
Right off the bat:
- OS information from our CMDB is included right in the message. We’re using GLPI (with OCS Inventory as the root source), but something like PuppetDB works nicely too.
- Up to five recent incidents are linked in the alert.
- Up to five recent maintenances are linked in the alert.
- The alert contains links to far more information about the host than an engineer should need to figure out what to actually do with the alert.
Some things we might consider adding in the future:
- Embedding most recent system state (CPU/memory usage, swap activity, disk I/O, etc.) from Graphite right into the alert
- Pulling the most recent critical log messages out of Logstash/ElasticSearch and embedding them directly into the alert for context
- That Kibana URL really needs to be run through a URL shortener. Plumbing that into the alert system was just one more headache I don’t need just yet.
For a while, I’ve been using Shawn Sterling’s Graphios. It’s a neat little utility for forwarding performance data from Nagios/Icinga to Graphite. It had a few warts, though, and I wanted to take the opportunity to learn event-based programming using Python/gevent, so I’ve gone ahead and developed Metricinga, my own approach to the same problem.
Metricinga supports the following:
- Support for running as a daemon
- Directory watches using inotify*
- Automatic reconnection to Graphite in the event of a send failure
- Continued parsing of performance data files while Graphite server is unreachable
*Metricinga actually uses a priority queue for metrics parsing, and will submit newly-written files before processing old ones in the spool directory. This ensures that if for some reason you end up with a giant spool full of Nagios performance data, your most recent (and most important) operations metrics will end up in Graphite before your historical data. Yay! But if you’re not running on a Linux system, or otherwise can’t use inotify, don’t worry. Metricinga will poll the spool every 60 seconds instead!
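That newest-first behavior falls out naturally from a heap keyed on negated mtime. Here’s a sketch of the idea (illustrative, not Metricinga’s exact implementation):

```python
import heapq
import itertools

class SpoolQueue:
    """Pop the most recently written perfdata file first."""

    def __init__(self):
        self._heap = []
        self._tiebreak = itertools.count()  # keeps equal-mtime pops FIFO

    def push(self, path, mtime):
        # Negate mtime so the min-heap surfaces the newest file first.
        heapq.heappush(self._heap, (-mtime, next(self._tiebreak), path))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

Fresh files from inotify get pushed with their current mtime and jump ahead of anything older sitting in the backlog.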
Metricinga does not yet support the following:
- Nagios performance data names containing escaped single quotes (\')
The following additional features are planned:
- Actual documentation
- Init script and RPM package
- Better shutdown handling
- Forwarding to metrics receivers other than Graphite (OpenTSDB, statsd, MongoDB, etc.)
Note that if you’re an existing Graphios user, Metricinga is a little more stringent with its performance data format checking, and you might have some data not getting sent over if plugins output incorrectly-formatted performance data (as Nagios::Plugin does).
Link: Metricinga on GitHub
More updates to follow in the next few days.
As promised in my previous post, here’s the GitHub repo for my statsite RPM:
For the time being, this is still based against Armon Dadgar’s current upstream Git source with my daemonizing changes applied as a patch. So far, everything’s working pretty well on my test server, but please notify me of any bugs.
Note that the version number is 0, as there has not yet been any numbered official release.
Puppet is a fairly complicated little product once you start to look under the covers, and by now it’s pretty widely known that for larger environments, moving from 2.6 to 2.7 isn’t a particularly straightforward upgrade. Most pain points relate to the deprecation of dynamic scoping in favor of lexical scoping and parameterized classes, but there are some other gotchas that haven’t been as widely publicized. Here are a few.
Undefined template variables have changed
Previously, if you attempted to look up a variable from a template and that variable did not exist, the lookup would return a Ruby nil, a fairly intuitive and straightforward behavior that a lot of people came to rely on in their conditionals. In Puppet 2.7, however, the lookup now returns the symbol :undefined. Ensure that none of your templates assume that undefined variables return nil.
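In template terms, a conditional that used to test for nil needs to test for the symbol instead. A sketch of the pattern (the variable name is made up):

```erb
<%# Puppet 2.7: a missing variable comes back as :undefined, not nil %>
<% ntp = scope.lookupvar('ntp_server') %>
<% if ntp != :undefined and ntp %>
server <%= ntp %>
<% end %>
```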
Globbing imports are now considered undefined behavior
If you have this guy at the top of any of your manifests for some reason (like Puppet’s autoloader being horrendous until the 2.6 series):
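For the record, the kind of line in question looks something like this (the exact glob will vary):

```puppet
# Globbing import at the top of site.pp (now undefined behavior in 2.7)
import "classes/*.pp"
```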
Chances are that it will not work; instead, Puppet will complain that your class is not defined. Ensure that each of your classes and defines lives in its own appropriately-named .pp file and let the autoloader do its thing instead. It should work fine, even for nested classes inside subdirectories.
--show_diff is no longer enabled by default in --noop mode
Some people have operations toolchains that rely on Puppet’s --noop mode showing a diff for each file it’s going to modify on the next real run. Note that these scripts will need to be updated to explicitly specify the --show_diff option; the new default behavior is to log these diffs to syslog instead.
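If you’d rather not touch every command line in the toolchain, the same option can be pinned in puppet.conf instead; a sketch:

```ini
# puppet.conf: always print file diffs, including during --noop runs
[agent]
    show_diff = true
```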
Beyond these three, I had a fairly straightforward upgrade of our Puppet environment. Happy hunting!