Runbooks are stupid and you’re doing them wrong

Well, maybe you are and maybe you aren’t. I have no idea. But if your shop is anything like the majority of IT shops I’ve seen, then this assessment is probably on the money.

The runbook is one of the most pervasively mediocre, poorly thought-out and badly-implemented concepts in the entire IT industry. For those of you who are unfamiliar with the term, the runbook is basically a “how can grandma run this application?” document.

Their use should be very strongly scrutinized.

When all you have is a hammer, the whole world looks like a nail; or, don’t use a runbook when you need a script

This is so obvious that it should never need documenting for anybody, for any reason, yet I’m constantly seeing people write runbooks that are just lists of actions for a system operator to take, one after the other, when something goes wrong. There are literally no decision points where a human needs to form an intelligent thought to execute this runbook. The runbook reads like a script.

A script. For a person.

Something has gone wrong. The train has flown off the rails.

Isn’t the entire point of technology to make people more productive? So why are we taking something that’s essentially a mechanical, computerized task, easily performed by a script or program, and turning it into a format that needs to be blasted from an output device into someone’s eyes, processed by a human brain, jammed onto a keyboard and mouse, and then back into the infrastructure? Shouldn’t we be skipping the middleman?
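
To make that concrete, here’s a minimal sketch of what it looks like when a typical “check it, restart it, escalate if that didn’t work” runbook becomes a script instead of a checklist. Every name in it (the queue tool, the service, the thresholds) is a placeholder, not something from any real environment:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: a three-step runbook ("check the queue depth,
restart the worker if it's backed up, escalate if that didn't help")
expressed as a script instead of a checklist for a human."""

import subprocess
import sys
import time

QUEUE_DEPTH_LIMIT = 10_000          # assumed threshold from the old runbook
WORKER_SERVICE = "example-worker"   # hypothetical service name


def queue_depth() -> int:
    # Assumption: some CLI exists that prints the current queue depth.
    out = subprocess.run(["example-queuectl", "depth"],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())


def main() -> int:
    if queue_depth() < QUEUE_DEPTH_LIMIT:
        print("queue is healthy; nothing to do")
        return 0

    subprocess.run(["systemctl", "restart", WORKER_SERVICE], check=True)
    time.sleep(60)  # give the worker a minute to start draining the backlog

    if queue_depth() < QUEUE_DEPTH_LIMIT:
        print("worker restarted; queue is draining")
        return 0

    # Only now does a human need to form an intelligent thought.
    print("restart didn't help; escalate to the on-call", file=sys.stderr)
    return 1


if __name__ == "__main__":
    sys.exit(main())
```

Wire the non-zero exit status into your alerting, and a human only gets paged for the one case the script can’t handle on its own.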

Working on fun, thought-provoking problems goes much further for staff happiness and retention than having a cabal of people whose job title is Guy Who Pushes the Button. I completely understand the need for staff engagement. But the right time for that is during build-out and engineering, not in the middle of a crisis where the business is losing money or pissing off customers. I assure you that they’re a lot madder about the outage than your ops team is about it being too easy to fix.

In a crisis, failures should be obvious and recovery should be automated as much as possible to minimize the impact of human error. This brings me to my next point.

A good monitoring system, not a dumb manual process, should tell you what’s wrong

There are always going to be exceptions to this, of course. Computers are bad at deriving context about why an application’s performance profile has changed. If your page views are a hundred times higher today than they were yesterday because your site ended up on the front page of Digg or Reddit, your site will not be performing the same as it did yesterday. There will always be times when you need a human keeping an eye on the performance charts (humans are much better at reading graphs than computers are) and trying to figure out why things aren’t working the way they’re supposed to.

(Those of you following the DevOps movement: look up that video about Etsy’s dashboards. The best and brightest ops people these days are keeping an eye on business metrics, like sales figures or numbers of code deployments, rather than low-level system metrics.)

But for a lot of other cases, the runbook is representative of somebody being lazy and not correctly integrating the process with the monitoring system. Any line saying “watch out for _____” should be immediately suspect. Human brains are really powerful, and really good at figuring out real problems. Your ops engineers should be focusing their time on unknown unknowns. If you know what the performance criteria are that signal a problem, you should be monitoring for those conditions automatically. There are a lot of statistical models that can help you, if you’re willing to put in the effort to use them.
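
As a sketch of what monitoring those conditions automatically can look like, here’s about the simplest statistical model there is: compare each new sample against a rolling mean and standard deviation. The window size, threshold and simulated metric are assumptions, not recommendations:

```python
"""Minimal sketch: turn a "watch out for ___" runbook line into an automatic
check by flagging samples that drift well outside their recent baseline."""

from collections import deque
from statistics import mean, stdev


class BaselineMonitor:
    def __init__(self, window: int = 288, sigmas: float = 3.0):
        # window=288 assumes one sample every five minutes over the last day
        self.samples = deque(maxlen=window)
        self.sigmas = sigmas

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= 30:  # need some history before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            anomalous = sigma > 0 and abs(value - mu) > self.sigmas * sigma
        self.samples.append(value)
        return anomalous


if __name__ == "__main__":
    # Simulated usage: steady traffic, then a sudden spike in p95 latency.
    import random
    monitor = BaselineMonitor(window=100)
    samples = [random.gauss(200, 10) for _ in range(100)] + [820, 835, 870]
    for value in samples:
        if monitor.observe(value):
            print(f"sample {value:.0f} is outside its recent baseline; alert")
```

The same shape of check works whether the metric is p95 latency, queue depth or sales per minute; the point is that the “watch out for” condition lives in the monitoring system instead of in a document.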

Systems should be self-healing

Even the best IT shops often simply don’t do this unless they’re integrating the component into a much bigger high-availability project. I’ve found two main reasons.

The first is that admins and engineers seem to believe that if they spend enough time building infrastructures correctly in the first place, there won’t be repeatable failures. If you’re going to put the effort into writing a bunch of code to make a system more reliable, shouldn’t you put that effort into just making sure the failure never happens?

Well, yes and no. Some failures are incredibly difficult to prevent but really easy to detect and really easy to recover from, especially if not all the factors are under your control. But other failures also have highly complex causes, and it may take several break-fix iterations before the problem actually disappears. If you’re building out a reliable service, isn’t it better to cut downtime by 95% for 90% of cases where the problem occurs, rather than eliminating 100% of downtime for 50% of cases?

I’m not saying that technical debt is somehow a good thing, but motivated operations people have accomplished really great things for their end-users with duct tape and staples. There’s nothing wrong with working around a problem as long as the fix isn’t fragile and it doesn’t impede your ability to maintain the application down the road. It doesn’t necessarily mean you’re avoiding the problem; rather, you’re finding better places to invest your time.

This brings us to the second reason: people don’t trust the idea that a system can automatically recover itself from failure. And, really, it ties into the first a little bit: we think our infrastructures are too good to suffer these minor outages, especially from obvious causes. But they aren’t, and a little creative engineering can keep a minor situation from turning into a minor outage, or a minor outage from turning into a major one. And we all have SLAs, even if there’s nothing formal and your boss’s idea of a service level is “keep the systems running well enough that I don’t feel compelled to fire you.”

Take special note of this if you don’t control your applications. As an aside, I used to work at a small web hosting business a number of years ago. We had a number of customers running ASP applications on top of IIS, Microsoft’s web server platform. Every once in a while, a customer’s website would suffer a crash of its application pool, because classic ASP wasn’t good at releasing resources if you weren’t a diligent coder. We couldn’t control the code our customers ran on their sites, but we could monitor their sites and restart the application pool if it started to toss errors in a very specific way.
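
We did it with the monitoring tools we had at the time, but the shape of the fix fits in a few lines. Here’s a hedged sketch of the idea; the URL, the error signature and the recycle command are placeholders, not the actual setup we ran:

```python
"""Sketch: watch a site for one very specific failure mode and recycle its
application pool when it appears. All of the specifics are placeholders."""

import subprocess
import time
import urllib.request
from urllib.error import HTTPError, URLError

SITE_URL = "http://www.example.com/"             # hypothetical customer site
ERROR_SIGNATURE = b"ASP 0115"                    # the "very specific" error
RECYCLE_CMD = ["appcmd", "recycle", "apppool",   # assumed IIS recycle command
               "/apppool.name:example-pool"]


def site_is_broken() -> bool:
    try:
        body = urllib.request.urlopen(SITE_URL, timeout=10).read()
    except HTTPError as err:
        body = err.read()          # 500s still carry the error page body
    except URLError:
        return False               # network trouble, not this failure mode
    return ERROR_SIGNATURE in body


while True:
    if site_is_broken():
        subprocess.run(RECYCLE_CMD, check=False)
        time.sleep(300)            # give the pool time to come back cleanly
    else:
        time.sleep(60)
```

A real version would also cap how many times it recycles in a row, for the same reason the OS-level supervisors below give up after X crashes.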

Many operating systems, like Solaris and Windows, take a very pragmatic approach to the problem. If the service crashes, restart it. If it crashes more than X times, leave it down and let the admin deal with it. These are obvious. Some non-obvious things you might want to consider regardless of how you’re handling high-availability:

  1. When the filesystem containing /var/log is nearly at capacity, compress or delete old logs before the volume fills up.
  2. When daily cronjob X fails because a network service is down, retry it a few times with an exponential backoff instead of waiting until cron runs it again tomorrow (see the sketch after this list).
  3. If an application crashes and writes a specific error message into the logs indicating what made it fail, identify the problem, fix it, and start the service back up without human intervention.
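
Number 2 is the easiest one to sketch. Assuming the nightly job is just a command that’s safe to re-run (the command and the retry limits below are placeholders), it could look something like this:

```python
"""Sketch of item 2: retry a failed nightly job with exponential backoff
instead of silently waiting for tomorrow's cron run. Command is a placeholder."""

import subprocess
import sys
import time

NIGHTLY_JOB = ["/usr/local/bin/nightly-sync"]   # hypothetical cron job
MAX_ATTEMPTS = 5


def run_with_backoff() -> bool:
    delay = 60  # seconds; doubles after every failed attempt
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if subprocess.run(NIGHTLY_JOB).returncode == 0:
            return True
        if attempt < MAX_ATTEMPTS:
            time.sleep(delay)   # absorbs a transient outage of the network service
            delay *= 2          # 1, 2, 4, 8 minutes between retries
    return False


if __name__ == "__main__":
    # A persistent failure still exits non-zero, so cron's mail or your
    # monitoring picks it up; a transient blip never wakes anyone.
    sys.exit(0 if run_with_backoff() else 1)
```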

Everyone has common, repeatable failures in their infrastructure, though the precise definition of “common” may vary from shop to shop. Not all of these issues will cause outages, especially if the infrastructure is designed for high availability, but let’s not pretend that all our applications are perfect. At the same time, let’s not delude ourselves into thinking that “eh, app crashed, restart the daemon” is always an adequate solution to a problem. Knowing when it is takes some thinking about the application and an understanding of how it fails.

Runbooks aren’t always bad

I can think of the following cases where runbooks are useful to an IT organization:

  1. Ensuring there’s a contingency plan if the script goes wrong and nobody knows how to fix it
  2. Orienting and coordinating staff in an emergency, so everyone knows the appropriate responsibilities, escalations and handoffs
  3. Solidifying a process that has so many moving parts that, even though it may take days to document, it might take weeks, months or years to get automated properly

Runbooks should only contain the pieces that are relevant to people and help them communicate better. If you can document the intent, you can translate it into code. Even if the code has bugs in it, they’re the same bugs everywhere, and a consistent behavior is almost always better than an ambiguous one.

Comments

  1. You’ve hit the nail on the head here, Jeff. That “we’re sitting in a room full of computers talking to a room full of even more computers” is a mantra I’ve been trotting out my entire professional life. If you’ve got some poor sod running around following a list, checking, say, RAID status or log output from last night’s batch on a daily basis, you’re doing it completely wrong.

    Nice article.

  2. Clearly none of you have worked in an enterprise environment where there are thousands of computers. When you get to that level, you’ll understand that every bit of help counts.

    • Do you think all those thousands of computers ought to be managed manually? For every X number of servers we scale out to, we must also hire X admins to run manual commands? What good technical reason is there for not automating this? I work at an enterprise with thousands of servers and the more runbooks we convert to script, the more standardized and consistent our environment becomes, which aids in everything from monitoring to troubleshooting to additional automation.

  3. I recently took over a new system and spent a week in a room with the original designer/architect. I documented everything that I could, mainly because I’m new to the company and their major applications are custom, in-house ones. I’m actually creating a runbook from scratch because nothing exists to describe the current environment or operations (except from a user’s perspective). My intent is to use the runbook as a baseline reference. Everything is manual right now and I need it documented in one place so I can start creating the automation/scripts to perform administrative actions. I think a runbook is most useful for knowledge transfer and getting everything in one document for reference.

    • That makes sense for initial documentation. You need to start your process somewhere. The next logical step is to convert the runbook to a script, and document the script properly, so that now you only need to maintain the script, not the runbook.

  4. Hi Ryan,

    Interesting article, you bring up many good points.

    I have a few comments too:

    1. I view a run book as a starting point, not an operations guide. I think I may have a different term to use in this context, but I have always just called it a “run book”. To me the run book is the book that the implementation team hands over to the operations team on day one, which allows the operations team to do all the things you mentioned in your article. A run book to me includes:
    a. all of the initial documentation on the system including manuals, marketing content, the RFP if there is one, the contracts, the maintenance agreements, the service contracts, etc.
    b. Any sequential changes that are observable but unique. For example if an upgrade was done and there was an error during the upgrade, a screen shot of the error would be captured and inserted into the run book. (Yes, I would assume this is in fact a WIKI or some other electronic version, but in the Physical Security industry, where I started on this kind of thing, it had to be printed by the integration team in triplicate… The old ROBOHELP product from Adobe was great for making one of these because it could also be ported out as content for a WIKI, and printed if necessary.)
    2. A run book is essentially the hand over that should logically move your manual first time execution of the business process into a change control system that then automates it.
    3. Long term, a good run book can be used as evidence to build a case should you need to ask for a refund from a vendor. The reason is the initial RFP, and the use cases that are the foundation of the project can be referenced in the future should one side or the other fail to perform.

    As a vendor, I hate to mention the last point, but there are two sides to this. First, if the vendor actually builds the first run book, it becomes a hand over check list which is ultimately the deciding criteria for determining if the project is effectively complete.

    Most of the run books I had to develop also included signature blocks for all the executives and engineers along the way, which ratified that the system was built to specification, and that it should operate as designed.

    Having this kind of rigor built into the process of handing over a system to a customer (or from one group internally to another) prevents one side or the other from trying to back out of their obligations. Delivery is every bit as important in this case as acceptance, and that is where the hand-over documents that establish a good run book come into play. When a run book is done right, it eliminates all drama and prevents surprises. People know exactly what is happening, and what is documented is not the script, as you put it (to me that is a training manual, or a DR guide), but a document designed to inform and protect all parties involved in a project.

    The last run book I created was about 900 pages long, and it was only that large because we had roughly 30K parts in the complete system from the component level up, and we had all the documentation for each item in our book. (Re-ordering parts on a 10 year old system is typically impossible, but having the part guides means you can get them reproduced later…)

    So again, there is a lot to a run book when it is for more than just software, but even in the IT only scenario, there are typically enough moving parts to make it worth documenting.

  5. Great post, most informative, didn’t realise devops were into this.
