Permanently setting FQDN in Google Compute Engine

Unlike Amazon’s EC2, Google Compute Engine allows you to choose the names for your instances, and takes meaningful actions with those names — like setting the hostname on the system for you. Unfortunately, this only affects the short hostname, not the fully-qualified domain name (FQDN) of the host, which can complicate some infrastructures. To set the FQDN at instance launch, we’ll need some startup script magic.

This script snippet checks for the domain or fqdn custom attribute on your instance and applies it to the host after the system receives a DHCP response. It’s based on Google’s own set-hostname hook included with the Google Startup Scripts package. Of course, you’ll need to bake this into your base GCE system image using Packer or another similar tool.

Place the following into /etc/dhcp/dhclient-exit-hooks.d/zzz-set-fqdn:
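(What follows is a sketch rather than a drop-in script: the fqdn and domain attribute names, the metadata paths, and the choice to set the full FQDN as the hostname outright are assumptions you should adapt to your environment.)

```bash
# /etc/dhcp/dhclient-exit-hooks.d/zzz-set-fqdn
# Sketch of a dhclient exit hook: look up an "fqdn" (or "domain") custom
# attribute in the GCE metadata server and apply it once the lease is set.

case "${reason}" in
  BOUND|RENEW|REBIND|REBOOT)
    md="http://metadata.google.internal/computeMetadata/v1/instance/attributes"
    hdr="Metadata-Flavor: Google"

    # Prefer an explicit "fqdn" attribute; otherwise build one from "domain".
    fqdn="$(curl --silent --fail -H "${hdr}" "${md}/fqdn")"
    if [ -z "${fqdn}" ]; then
      domain="$(curl --silent --fail -H "${hdr}" "${md}/domain")"
      [ -n "${domain}" ] && fqdn="$(hostname -s).${domain}"
    fi

    # Apply it. You may prefer to keep the short hostname and map the FQDN
    # in /etc/hosts instead, so that `hostname -f` resolves correctly.
    [ -n "${fqdn}" ] && hostname "${fqdn}"
    ;;
esac
```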


Using Google Compute Engine service accounts with Fog

Google Compute Engine has a great little feature, similar to EC2’s instance IAM roles, where you can create an instance-specific service account at instance creation. This account has the privileges you specify, and the auth token is accessible automagically through the instance metadata.

Unfortunately, Fog doesn’t support this very well. It expects you to pass in an email address and a key to access the Google Compute Engine APIs, neither of which you have yet. However, you can construct the client yourself, using a Google::APIClient::ComputeServiceAccount for authorization, and pass it in. This code snippet should help:
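Something along these lines should get you going. It's a rough sketch: the :google_client option, the placeholder application name, and the project name are the parts to verify against your Fog version.

```ruby
# Rough sketch: build the Google API client yourself, let
# ComputeServiceAccount fetch a token from the instance metadata server,
# and hand the finished client to Fog.
require 'fog'
require 'google/api_client'

client = Google::APIClient.new(
  application_name:    'my-app',   # placeholder
  application_version: '0.0.1'
)

# ComputeServiceAccount pulls its access token from the metadata server,
# so no email address or private key is needed.
client.authorization = Google::APIClient::ComputeServiceAccount.new
client.authorization.fetch_access_token!

compute = Fog::Compute.new(
  provider:       'Google',
  google_project: 'my-project',    # placeholder
  google_client:  client           # verify this option against your Fog version
)
```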

Follow Fog issue #2945 and assume this post to be outdated when it gets closed.

Replace annual reviews with individual retrospectives

In the past several decades, and particularly in the past few years, many forward-thinking managers have come to the conclusion that traditional yearly performance appraisals are a waste of time at best, or a net negative to morale at worst. This is a philosophy supported by many bright management thinkers, including W. Edwards Deming:

Evaluation of performance, merit rating, or annual review… The idea of a merit rating is alluring. The sound of the words captivates the imagination: pay for what you get; get what you pay for; motivate people to do their best, for their own good. The effect is exactly the opposite of what the words promise.

Bob Sutton and Huggy Rao, authors of Scaling Up Excellence, wrote in their book about Adobe’s experiences eliminating yearly performance appraisals from their organization:

Since the new system was implemented, involuntary departures have increased by 50%: this is because, as Morris explained, the new system requires executives and managers to have regular “tough discussions” with employees who are struggling with performance issues—rather than putting them off until the next performance review cycle comes around. In contrast, voluntary attrition at Adobe has dropped 30% since the “check-ins” were introduced; not only that, of those employees who opt to leave the company, a higher percentage of them are “non-regrettable” departures.

Clearly, many managers and their organizations have found annual performance reviews to be an ineffective tool for managing teams. But what if we could take the annual performance review and humanize it into a tool for good?

Retrospectives: a human approach

Performance reviews are a terrible source of anxiety and stress. A year’s worth of judgment, and the consequences of that judgment, are compressed and handed down in an instant. It’s often as nerve-wracking for the manager as for the subordinate.

So, when I worked as a manager, I used my annual meetings to do something slightly unconventional: to forsake any judgments or evaluations, and instead remind my staff of their accomplishments over the last year. An anniversary, if you will.

In technology, we rarely get the opportunity to think in time periods greater than a few months. If you work within an Agile shop, you might think in two-week sprints. A year ago is a world away, and for someone mired in a difficult project, it might be difficult to slog through the impostor syndrome and remember all the things you did for the organization. We can’t always see our professional development at a macro level, and an outside perspective with a long view can help figure out where we’re going.

The goal of management should be not just to improve short-term productivity, but to align the company’s goals with the career development goals of its employees over the long term. Removing annual performance appraisals is a great step towards removing unnecessary stress from the workplace, but aligning employees’ career goals over the long view is still crucial to maintaining an effective team.

On hiring developers

This post from Alex MacCaw on Sourcing.io has been making the rounds over the past couple of weeks. It’s a list of his favorite interview questions, which are largely technical in nature and focus heavily on the nuts and bolts of JavaScript. They aren’t bad questions, but I do think they’re the wrong questions.


An anecdote about job titles

Before starting with Rabbit, I worked with Cold Spring Harbor Laboratory as an IT manager. In mid-2013, our Web Development Manager position, a peer to mine, had been open for six months with very few qualified applicants. While the job was not a glamorous one — “CMS developer in academia” doesn’t have the sex appeal of a startup — we weren’t getting any bites on the job posting that HR was curating out on the Internet. My director came to me and asked what we should do about the position.

I mulled over the posting for a few days before making a few judicious edits. What I handed back had Web Development Manager crossed off and replaced with Lead Web Developer. Underneath the Requirements section, “at least two years of management experience” was replaced with “at least two years as a manager, team lead, or senior developer.” After some discussion, my changes were approved and HR uploaded the revised job posting.

We had an offer out to a candidate within two weeks.

Attracting great talent anywhere is hard. We tend to obsess over the job descriptions that we post, trying to find new and interesting ways to sell the company with unlimited vacation policies and fully-stocked fridges. Sometimes we appeal to the reader’s ego directly by using words like rockstar or ninja. But we tend to focus very hard on descriptive words, and we frequently ignore the deeper context buried in those words.

What we wanted was somebody to develop frontend and backend code and delegate tasks to two other team members. While there were management responsibilities as part of the job description, the core of the job was to be an individual contributor. When we put Manager in the job title, and over-emphasized management experience in our requirements, we immediately communicated to anyone reading the posting that whoever filled the position would spend most of the day doing manager-y things like talking to stakeholders and curating Gantt charts. Everything else we put in that job description might as well not have even been there.

If you want to attract the best candidates you can, you might be taking the wrong approach by trying to sell them on the company first. Your goal should be to figure out why the day-to-day work is meaningful, tap into that, and tie the organization back into the pitch. And if the job title in the posting happens to be an impediment to getting that point across, don’t be afraid to change it.

Mega updates to Metricinga

After a couple of months of not receiving the TLC it deserves, I’ve pushed a major update to Metricinga on GitHub. Here are the highlights:

  • Completely rewritten. I wasn’t really happy with the tight coupling of components in the old version; among other things, it made it really hard to write tests. The new version uses extremely loose coupling between Greenlets, so I can actually get around to writing proper regression tests. It should also make it a lot simpler to support writing metrics to multiple backends (StatsD, OpenTSDB, etc.), and writing to more than one at a time should be trivial.
  • Better inotify support. Having up-to-date information is really important for some metrics, so I’ve made it a point to have reasonably well-functioning inotify support in Metricinga. It will start dumping metrics the second a file is closed for writing or moved into the directory (see the sketch after this list).
  • Better delete-file handling. In some cases, the old Metricinga could drop data if the file was prematurely deleted before all the parsed metrics were successfully offloaded into Graphite. We now reference-count metrics sourced from a particular file, so files are never deleted until they’re completely sent into Graphite successfully.
  • Init script for CentOS/RHEL. Yay!
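As promised above, here’s roughly what watching for those inotify events looks like with pyinotify. This is an illustrative sketch, not Metricinga’s actual implementation, and the spool path is a placeholder:

```python
# Illustrative only: react to IN_CLOSE_WRITE (a writer closed the file)
# and IN_MOVED_TO (a file was moved into the spool directory).
import pyinotify

SPOOL_DIR = "/var/spool/metricinga"  # placeholder path

class SpoolHandler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        print("parse and ship:", event.pathname)

    def process_IN_MOVED_TO(self, event):
        print("parse and ship:", event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch(SPOOL_DIR, pyinotify.IN_CLOSE_WRITE | pyinotify.IN_MOVED_TO)
pyinotify.Notifier(wm, SpoolHandler()).loop()
```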

Grab it, file bugs, file pull requests, let me know what you think!

MCollective, RabbitMQ, and the Case of the Missing Pings

I like robust management infrastructures. They make me happy. But sometimes, tiny behaviors can send you on a wild goose chase.

Being fairly inexperienced with both MCollective and RabbitMQ, though, I ran into an interesting issue with ours off and on over the last couple of weeks. One night, our MCollective installation, which had been working fine for weeks or months, started to exhibit the following behavior from our control node:

  1. Issuing an mco ping would return a list of all the nodes in our environment.
  2. Issuing another mco ping would cause no nodes at all to turn up.
  3. Restarting the MCollective agent on any one node would cause that node to show up in the next mco ping, but not any subsequent one.
  4. Any activity besides mco ping would fail.

This would continue for a little while, then magically resolve itself until it would randomly present itself again a few days down the road.

Turning up the MCollective logging level on both the client and server, I could see that the agent was putting messages into the reply queue, but the client wasn’t receiving them, with no good indication why.

Digging deeper, I ran netstat -an to look into the connection state. I saw high Recv-Q and Send-Q counters associated with the connections, so the Erlang VM running RabbitMQ (not epmd, which is the Erlang Port Mapper Daemon, not Erick and Parrish Making Dollars, and not involved in client connections anyway) wasn’t even pulling the data out of the socket. I took a look at some traffic dumps of MCollective running with a single agent, with the aes_security plugin disabled to make the payload easy to inspect, but that didn’t reveal much either because Wireshark doesn’t have a dissector for STOMP.

So, I set up RabbitMQ on a temporary system to see what would happen. To my chagrin, that system’s MQ worked just fine. I poked around the logs on our production Puppet/MCollective/RabbitMQ system and found nothing of any value besides a bunch of notices that nodes had connected.

Since we had recently upgraded the whole VMware environment that houses Puppet, MCollective, and most of our other tools, I started to look into everything else. I upgraded the virtual hardware, VMware Tools, and accompanying drivers, trying to figure out if it was related to our recent ESXi upgrade from 4.1 to 5.1. With the problem still occurring, I dumped the paravirtualized vmxnet3 driver entirely in favor of the standard emulated Intel e1000 driver. No dice. netstat continued to show high Recv-Q and Send-Q, and the RabbitMQ management interface showed no messages traversing the system.

Getting more frustrated, I completely trashed the RabbitMQ configuration and set it up again from scratch, which, it turns out, didn’t help at all. mco ping, one response. mco ping again, no response. Restart the MCollective agent and mco ping again, one response. In a last-ditch effort, I updated MCollective to 2.3.1 (development) and RabbitMQ to 3.0.3 (stable, released literally that day) and tried again. No luck.

After a bunch of digging and asking others for their thoughts, the consensus was that RabbitMQ was deliberately dropping connections for some reason. Finally, I stumbled upon this stupid thing:

Disk-Based Flow Control

It turns out I didn’t have enough disk free on the host. Because of disk hot-grow quirks in Linux, we have Linux VMs with very small root partitions (5 GB) and separate partitions for data volumes (/var/lib/mysql, etc.), and having less than 1 GB free on the system is a really common occurrence. The default RabbitMQ configuration doesn’t like this very much, and will throttle producers with exactly the behavior I was seeing earlier.
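If you hit this, the threshold lives in RabbitMQ’s disk_free_limit setting; something along these lines in the classic rabbitmq.config format should do it (the 1 GB figure is only an example, and the defaults vary by RabbitMQ version, so check the docs for yours):

```erlang
%% /etc/rabbitmq/rabbitmq.config (classic Erlang-term format)
[
  {rabbit, [
    %% Block publishers when free disk on the node drops below this value.
    %% The number is in bytes; pick a value that fits your hosts.
    {disk_free_limit, 1073741824}
  ]}
].
```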

Dear RabbitMQ devs: a log message would be lovely when you start throttling messaging because of resource usage, thanks.

Koboli, an email interaction gateway for Nagios/Icinga

If you’ve followed my projects previously, you know that while I love Nagios, and its stepbrother Icinga, it’s often a nuisance and the butt of lots of jokes (see: Jordan Sissel’s PuppetConf 2012 talk on Logstash). A big part of my work over the last several months has focused on how to make interacting with it more productive. Nagios is totally happy to blast you with alerts, but doesn’t give you a way to, say, turn them off for some false positive when you’re on vacation in the middle of the mountains, miles away from Internet service reliable enough to run a VPN connection and a web browser.

I wasn’t happy with the state of email interaction with it, so I went ahead and wrote Koboli, a seriously extensible mail processor for Nagios and Icinga. Koboli is written in Python, and named after a mail sorter from The Legend of Zelda: The Wind Waker. It works out of the box with alerts in Nagios’s default format, but is easy enough to set up to extract fields from emails in whatever format you’ve decided to send them.

The basic idea of Koboli is that it gives you a simple #command syntax that allows you to interact easily with your monitoring system without leaving your email client. If you’ve ever worked with systems like Spiceworks, you’ve already got the basic idea down.

This is useful enough when you’re just interacting with your monitoring system, but you can extend it to do lots of other cool things too. For example, this initial release can also create issues in JIRA:
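(The command name below is hypothetical, so check the Koboli documentation for the real syntax; it’s only meant to show the shape of the interaction, which is just replying to the alert email with a one-liner.)

```
To: monitoring@example.org
Subject: Re: ** PROBLEM Service Alert: db01/Disk Space is CRITICAL **

#jira-create OPS "Root filesystem filling on db01"
```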

With a one-line command, the alert is now in our incident database where we can track and remediate it appropriately.

This project is just in the beginning stages, and I hope some people find it useful — it was quite a bit more work than I thought.

Koboli on GitHub

How we use JIRA for system administration at CSHL

In my group of systems engineers, we’re all becoming very comfortable users of JIRA. JIRA has been a very popular bug tracking tool for developers for a good number of years, but it has a lot of very powerful features that also make it incredibly useful as a Project Management Emporium for system administrators. It’s obviously very good at bug tracking and decent at supplementing project management, but it’s actually really good at a lot of other things. Here’s a summary of what we use it for:

  • Project/task tracking
  • Software builds and custom application packages
  • Change management/maintenance calendar
  • Incident management

I’ve never been big on paperwork. If I’m going to go through all the trouble to document everything my team is and will be doing, there had better be a payoff. JIRA is super-simple. Getting friendly with a few documentation processes is an unfortunate reality if you run a hugely heterogeneous environment for many departments, but that doesn’t mean it needs to be a miserable, team-strangling mess of red tape. I’ll comment below on a few ways we try to keep our processes leaner.

Getting under the hood

People who have used JIRA significantly know that it’s really a lot more than a bug tracker. Out of the box, it does work really well as a bug tracking system. The real core strength of JIRA, though, lies in its incredibly robust workflow system. It’s so central to the flexibility and power of the product that many of JIRA’s marketing folks prefer to talk about it as a workflow engine. (I swear I saw this explained in much better detail in a video from Atlassian Summit, but I can’t find it now.)

The idea of custom workflows tends to bore most systems people to death, but it’s easier to stomach if you think of it like a finite state machine. Change management is an easy use case for custom workflows (albeit one that lots of people hate). A change ticket is opened in Needs Review state. Once I look it over, I can make it Approved or Denied. Someone can take that Denied request, fix what’s wrong with it, and change it back to Needs Review. When the maintenance is done, it’s Closed. We then have a record of it that shows up in our Icinga alerts and other key places when something goes wrong.
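If it helps to picture the state machine, here’s a toy transition table for that workflow. The state names come from the paragraph above; everything else is illustrative, since JIRA keeps all of this in its workflow editor rather than in code:

```python
# Toy transition table for the change-management workflow described above.
TRANSITIONS = {
    "Needs Review": {"approve": "Approved", "deny": "Denied"},
    "Denied":       {"revise": "Needs Review"},
    "Approved":     {"complete": "Closed"},
    "Closed":       {},
}

def advance(state, action):
    """Return the next state, raising KeyError if the transition isn't allowed."""
    return TRANSITIONS[state][action]

assert advance("Needs Review", "deny") == "Denied"
```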

JIRA also allows you to create a pile of custom fields, and separate them out by the issue types they belong to. This is ideal for doing things like tracking the actual start/end times of your system maintenance versus the windows that you’ve scheduled, so you can report on the accuracy of your estimates. Label fields, which work like tags, are also awesome for correlating related issues.

Our configuration

We run a mostly-stock JIRA configuration, with a few tiny enhancements. But one thing about our environment that’s sort of interesting is that we actually only use one JIRA project for all of our internal items. It makes things much simpler than creating and managing a pile of top-level projects.

In particular, we like these plugins from the Atlassian Marketplace:

  • JIRA Wallboards: This nifty little plugin is designed for converting standard JIRA dashboards into something that’s easy to read on a television or giant monitor from across the office. I use it more than the team I manage, but it’s really nice for being able to check on project priorities, due dates, and so forth at a glance.
  • JIRA Calendar Plugin: This plugin is so obviously useful that I have no real understanding of why it doesn’t just ship as part of JIRA. Being able to easily view upcoming due dates on all our internal tasks, as well as dates of upcoming maintenance events, is way too useful to pass up.
  • JIRA Charting Plugin: This is self-explanatory and also probably the least useful thing in this entire post.

Project tracking

If you have a six-month-long project involving tightly structured timelines, and you need to find the best way to parallelize the people working on the project and discover which tasks pose the most risk to your timeline, JIRA probably isn’t the best bet: that’s something much better suited to a tool like Microsoft Project. JIRA is really good at coordinating a lot of small projects at the same time, which is where most small IT departments spend a lot of their time. (There are Gantt chart plugins out there as well, but I haven’t found them terribly useful.)

JIRA is about as useful as any other ticketing system for managing dozens of tiny projects at the same time and keeping tabs on all of them successfully. One thing that is nice about JIRA is that its subtask implementation, while limited (it doesn’t allow subtasks of subtasks), is fairly competent compared to most basic ticketing systems.

I find that this really shines in conjunction with a decent wallboard plugin, which can provide everyone on a project with a slick real-time view of what people are doing on that project.

Application builds

We compile a lot of code. Our biggest first-class service is our high-performance compute cluster, which supports a really substantial number of scientific computing applications. JIRA helps us keep track of what we need to build, for whom, and by when, as well as letting us easily relate issues about that software to one another. We’re not really doing anything special or of particular interest in this area, though.

Change control and maintenance calendaring

I really hate change control for the same reasons you do. I do it anyway for the same reasons anyone else does. (We keep our change management scope limited so that people can actually, you know, get work done. But if someone reboots the primary AD DNS server in the middle of the business day, there’s hell to pay.)

When someone is looking to perform maintenance on a crucial system, they open a maintenance request. This contains typical fields: impact, backout plan, projected start and end times, and so forth.

Changes are tagged using a custom Label field called Impacted hosts containing the FQDN of each impacted host. This makes it very easy to programmatically search for all prior maintenance on a host. We have this integrated into our Icinga notification script so that maintenances are automatically flagged as something that should be investigated in connection with the alert. (I should probably post this script, because while it’s nothing groundbreaking from an engineering perspective, I think it’s pretty neat.)
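The lookup itself boils down to a JQL search against the JIRA REST API. A rough sketch of the idea (the project key, field name, credentials, and hostname are placeholders, not our actual configuration):

```python
# Rough sketch: find recent issues tagged with a host's FQDN via the
# "Impacted hosts" label field, using JIRA's REST search endpoint.
import requests

JIRA_URL = "https://jira.example.org"
host = "db01.example.org"
jql = f'project = OPS AND "Impacted hosts" = "{host}" ORDER BY created DESC'

resp = requests.get(
    f"{JIRA_URL}/rest/api/2/search",
    params={"jql": jql, "maxResults": 5, "fields": "summary,created"},
    auth=("icinga-bot", "s3cret"),  # placeholder credentials
)
resp.raise_for_status()
for issue in resp.json()["issues"]:
    print(issue["key"], issue["fields"]["summary"])
```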

Once a change request is approved, it becomes a maintenance. It exists on the maintenance calendar for the helpdesk and other IT organizations to look at.

Incident database

Like any decent shop supporting dozens of applications, we keep a fairly good incident database. This is a log of things that go wrong on servers, what our diagnostic process was, what the impact of the problem was, and how we fixed it. This is a huge help at bringing new on-call engineers up to speed on the infrastructure.

The incident database makes use of the same Impacted hosts field that we use on the change control/maintenance calendar. This is awesome because we can open up an incident, click on the host’s FQDN, and see all the maintenance work performed on that host, plus every other incident recorded against it, since we started using the system. As with the maintenance database, this is queryable through the JIRA API, and we do exactly that to provide a list of related incidents whenever any Icinga alert goes out via email.

To do

  • SCM integration: It would take a lot of manual work out of the process if we could integrate the Git repositories storing our Puppet code into JIRA, and use that to feed the list of maintenances. Since Atlassian only supports Git hosted on GitHub (and doesn’t support GitHub Enterprise accounts at the time of this writing), we’ll end up exposing a read-only copy of the repository through git-svn and pumping data in through the SVN plugin.
  • Better Icinga integration: We already have JIRA maintenances and incidents showing up in our Icinga alerts. But oh, how I would love the holy grail of Icinga creating entries in the incident database by itself. Right now we put them in manually if they’re anything more complex than “the user filled up the disk.”

Default monitoring alerts are awful

I’ve been putting some serious thought recently into how to improve the issue turnaround time of my operations team, and one really sore point that stuck out to me was the notifications coming out of our monitoring system. Like many shops, we’re using Nagios/Icinga, one of the most flexible monitoring packages to ever exist, and yet for a decade we’ve been running with default alerts that give you almost no context. They tell you what, not why.

Here’s a boilerplate Nagios notification email:
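(The sample below is roughly what the stock notify-service-by-email command produces; the host, service, and plugin output are made up.)

```
Subject: ** PROBLEM Service Alert: db01/Disk Space is CRITICAL **

***** Nagios *****

Notification Type: PROBLEM

Service: Disk Space
Host: db01
Address: 10.1.2.3
State: CRITICAL

Date/Time: Sat Mar 2 03:12:11 EST 2013

Additional Info:

DISK CRITICAL - free space: / 402 MB (7% inode=64%)
```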

For some people, especially in smaller environments, this is good enough: it tells you that something is not working, which is the most important thing a monitoring system does. For us, it’s horrendous! For starters:

  1. A brand-new operations engineer on the team who sees this alert will have absolutely no idea what the server is, how important the alert is, or who to contact about it if they need to escalate.
  2. There’s no context on other recent (and likely-related) incidents impacting the same host, and no record of recent maintenance that may be linked to the problem.
  3. It doesn’t provide any really easy way to get more information about the outage.

So, we hacked on the notification a little bit. Before the alert is sent, the alert information is formatted as JSON and piped into a utility that performs data lookups and inserts them into an email template.
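A stripped-down sketch of the approach looks something like this (the lookup helpers, field names, and template path are placeholders rather than our production code):

```python
#!/usr/bin/env python
# Sketch: read alert fields as JSON from stdin, enrich them with CMDB and
# JIRA lookups, render an email body, and print it so the Nagios/Icinga
# notification command can pipe it to the mailer.
import json
import sys

from jinja2 import Template  # any templating engine works here

def cmdb_lookup(host):
    """Placeholder: fetch OS/owner details from GLPI, PuppetDB, etc."""
    return {"os": "unknown", "owner": "unknown"}

def recent_issues(host, issue_type, limit=5):
    """Placeholder: JQL search for recent incidents/maintenances (see earlier sketch)."""
    return []

alert = json.load(sys.stdin)
context = {
    "alert": alert,
    "cmdb": cmdb_lookup(alert["hostname"]),
    "incidents": recent_issues(alert["hostname"], "Incident"),
    "maintenances": recent_issues(alert["hostname"], "Maintenance"),
}

with open("/etc/icinga/alert-email.j2") as f:
    sys.stdout.write(Template(f.read()).render(**context))
```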

With a little bit of tender loving care, we ended up with this guy:
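(The example below is a mock-up with placeholder hosts, issue keys, and URLs rather than a verbatim copy of our template, but it captures the shape of it.)

```
Subject: ** PROBLEM Service Alert: db01/Disk Space is CRITICAL **

Host:    db01.example.org (10.1.2.3)
OS:      CentOS 6.4 (from GLPI)
Service: Disk Space, CRITICAL since 03:12

Recent incidents:
  OPS-1412  db01: root filesystem filled by stale job logs
  OPS-1398  db01: MySQL crashed during backup window

Recent maintenance:
  OPS-1433  db01: kernel and MySQL updates (2013-02-24 22:00-23:30)

More info:
  Icinga:   https://icinga.example.org/...
  Graphite: https://graphite.example.org/...
  Kibana:   https://kibana.example.org/...
```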

Right off the bat:

  1. OS information from our CMDB is included right in the message. We’re using GLPI (with OCS Inventory as the root source), but something like PuppetDB works nicely too.
  2. Up to five recent incidents are linked in the alert.
  3. Up to five recent maintenances are linked in the alert.
  4. The alert contains links to far more information about the host than an engineer should need in order to figure out what to actually do about the alert.

Some things we might consider adding in the future:

  1. Embedding most recent system state (CPU/memory usage, swap activity, disk I/O, etc.) from Graphite right into the alert
  2. Pulling the most recent critical log messages out of Logstash/ElasticSearch and embedding them directly into the alert for context
  3. That Kibana URL really needs to be run through a URL shortener. Plumbing that into the alert system was just one more headache I don’t need just yet.

 

© 2014 @jgoldschrafe

Theme by Anders NorenUp ↑