Well, maybe you are and maybe you aren’t. I have no idea. But if your shop is anything like the majority of IT shops I’ve seen, then this assessment is probably on the money.
The runbook is one of the most pervasively mediocre, poorly thought-out, and badly implemented concepts in the entire IT industry. For those of you who are unfamiliar with the term, the runbook is basically a “how can grandma run this application?” document.
The use of runbooks should be very strongly scrutinized.
When all you have is a hammer, the whole world looks like a nail; or, don’t use a runbook when you need a script
This is so obvious that it should never need documenting for anybody, for any reason, yet I’m constantly seeing people write runbooks that are just lists of actions for a system operator to take, one after the other, when something goes wrong. There are literally no decision points where a human needs to form an intelligent thought to execute this runbook. The runbook reads like a script.
A script. For a person.
Something has gone wrong. The train has flown off the rails.
Isn’t the entire point of technology to make people more productive? So why are we taking something that’s essentially a mechanical, computerized task, easily performed by a script or program, and turning it into a format that needs to be blasted from an output device into someone’s eyes, processed by a human brain, jammed onto a keyboard and mouse, and then back into the infrastructure? Shouldn’t we be skipping the middleman?
Working on fun, thought-provoking problems goes much further for staff happiness and retention than having a cabal of people whose job title is Guy Who Pushes the Button. I completely understand the need for staff engagement. But the right time for that is during build-out and engineering, not in the middle of a crisis where the business is losing money or pissing off customers. I assure you that they’re a lot madder about the outage than your ops team is about it being too easy to fix.
In a crisis, failures should be obvious and recovery should be automated as much as possible to minimize the impact of human error. This brings me to my next point.
A good monitoring system, not a dumb manual process, should tell you what’s wrong
There are always going to be exceptions to this, of course. Computers are bad at deriving context about why an application’s performance profile has changed. If your page views are a hundred times higher today than they were yesterday because your site ended up on the front page of Digg or Reddit, your site will not be performing the same as it did yesterday. There will always be times when you need a human keeping an eye on the performance charts (humans are much better at reading graphs than computers are) and trying to figure out why things aren’t working the way they’re supposed to.
(Those of you following the DevOps movement: look up that video about Etsy’s dashboards. The best and brightest ops people these days are keeping an eye on business metrics, like sales figures or numbers of code deployments, rather than low-level system metrics.)
But for a lot of other cases, the runbook is representative of somebody being lazy and not correctly integrating the process with the monitoring system. Any line saying “watch out for _____” should be immediately suspect. Human brains are really powerful, and really good at figuring out real problems. Your ops engineers should be focusing their time on unknown unknowns. If you know what the performance criteria are that signal a problem, you should be monitoring for those conditions automatically. There are a lot of statistical models that can help you, if you’re willing to put in the effort to use them.
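As one illustration of monitoring for known conditions automatically: even a crude statistical model, like flagging samples that fall several standard deviations outside a recent baseline, beats a runbook line that says “watch out for high latency.” Here is a minimal sketch of that idea; the window size and threshold are arbitrary illustrative choices, not recommendations:

```python
import statistics

def is_anomalous(history, value, threshold=3.0):
    """Return True if `value` deviates from the recent `history`
    by more than `threshold` standard deviations."""
    if len(history) < 10:
        return False                      # too little data to judge
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean              # flat baseline: any change is odd
    return abs(value - mean) / stdev > threshold

# A steady latency baseline (ms), then two new samples:
baseline = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99]
print(is_anomalous(baseline, 101))  # ordinary sample -> False
print(is_anomalous(baseline, 250))  # clear outlier   -> True
```

A real monitoring system would feed this from a rolling window of metric samples and page someone (or trigger remediation) when it fires; the point is that the “known known” lives in code, not in a document.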
Systems should be self-healing
Even the best IT shops often simply don’t do this unless they’re integrating the component into a much bigger high-availability project. I’ve found two main reasons why.
The first is that admins and engineers seem to believe that if they spend enough time building infrastructures correctly in the first place, there won’t be repeatable failures. If you’re going to put the effort into writing a bunch of code to make a system recover from a failure, shouldn’t you put that effort into making sure the failure never happens at all?
Well, yes and no. Some failures are incredibly difficult to prevent but really easy to detect and really easy to recover from, especially if not all the factors are under your control. But other failures also have highly complex causes, and it may take several break-fix iterations before the problem actually disappears. If you’re building out a reliable service, isn’t it better to cut downtime by 95% for 90% of cases where the problem occurs, rather than eliminating 100% of downtime for 50% of cases?
I’m not saying that technical debt is somehow a good thing, but motivated operations people have accomplished really great things for their end-users with duct tape and staples. There’s nothing wrong with working around a problem as long as the fix isn’t fragile and it doesn’t impede your ability to maintain the application down the road. It doesn’t necessarily mean you’re avoiding the problem; rather, you’re finding better places to invest your time.
This brings us to the second reason: people don’t trust the idea that a system can automatically recover itself from failure. And, really, it ties into the first a little bit: we think our infrastructures are too good to suffer these minor outages, especially from obvious causes. But they aren’t, and a little creative engineering can keep a minor situation from turning into a minor outage, or a minor outage from turning into a major one. And we all have SLAs, even if there’s nothing formal and your boss’s idea of a service level is “keep the systems running well enough that I don’t feel compelled to fire you.”
Take special note of this if you don’t control your applications. As an aside, I used to work at a small web hosting business a number of years ago. We had a number of customers running ASP applications on top of IIS, which is Microsoft’s web server platform. Every once in a while, a customer’s application pool would crash, because classic ASP wasn’t good at releasing resources if you weren’t a diligent coder. We couldn’t control the code our customers ran on their sites, but we could monitor their sites and restart the application pool if it started to toss errors in a very specific way.
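That kind of watchdog boils down to “restart after N consecutive failed probes.” A minimal sketch of the decision logic, assuming names and a threshold I’ve made up for illustration; in a real watchdog each probe result would come from an HTTP request against the customer’s site, and each restart would recycle the application pool:

```python
def plan_restarts(probe_results, failure_limit=3):
    """Given a sequence of health-check results (True = site responded
    normally), return the probe indices at which a watchdog would
    restart the service: after `failure_limit` consecutive failures.
    The limit of 3 is an arbitrary illustrative choice."""
    restarts, streak = [], 0
    for i, healthy in enumerate(probe_results):
        streak = 0 if healthy else streak + 1
        if streak >= failure_limit:
            restarts.append(i)
            streak = 0        # assume the restart cleared the errors
    return restarts

# One transient blip, then two sustained error bursts:
checks = [True, False, True, False, False, False, True, False, False, False]
print(plan_restarts(checks))  # -> [5, 9]
```

Requiring several consecutive failures before acting is what keeps the watchdog from flapping on a single dropped request; tune the threshold to how noisy your probes are.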
Many operating systems, like Solaris and Windows, take a very pragmatic approach to the problem. If the service crashes, restart it. If it crashes more than X times, leave it down and let the admin deal with it. These are obvious. Some non-obvious things you might want to consider regardless of how you’re handling high-availability:
- When the filesystem containing /var/log is almost at capacity, compress or delete old logs before the volume fills up.
- When daily cronjob X fails because a network service is down, retry it a few times with an exponential backoff instead of waiting until cron runs it again tomorrow.
- If an application crashes and writes a specific error message into the logs indicating what made it fail, identify the problem, fix it, and start the service back up without human intervention.
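The cronjob case in that list, for instance, is a few lines of wrapper code. A minimal sketch of retry-with-exponential-backoff; the attempt count and base delay are illustrative defaults, and the sleep function is injectable only so the policy can be exercised without actually waiting:

```python
import time

def retry_with_backoff(task, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `task()`, retrying failures with exponentially growing
    delays (base_delay, then 2x, 4x, ...). Raises the last error
    if every attempt fails."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise                       # out of retries: surface it
            sleep(base_delay * (2 ** attempt))

# Simulate a network service that comes back on the third try:
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service still down")
    return "ok"

delays = []
print(retry_with_backoff(flaky_fetch, sleep=delays.append))  # -> ok
print(delays)  # -> [1.0, 2.0]
```

Wrap the nightly job’s entry point in something like this and a brief network hiccup costs you minutes instead of a whole day’s run.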
Everyone has common, repeatable failures in their infrastructure, though the precise definition of “common” may vary from shop to shop. Not all of these issues will cause outages, especially if the infrastructure is designed for high availability, but let’s not pretend that all our applications are perfect. At the same time, let’s not delude ourselves into thinking that “eh, app crashed, restart the daemon” is always an adequate solution to a problem. Doing this well takes some thinking about the application and real understanding of it.
Runbooks aren’t always bad
I can think of the following cases where runbooks are useful to an IT organization:
- Ensuring there’s a contingency plan if the script goes wrong and nobody knows how to fix it
- Orienting and coordinating staff in an emergency, so everyone knows the appropriate responsibilities, escalations and handoffs
- Solidifying a process that has so many moving parts that, even though it may take days to document, it might take weeks, months or years to get automated properly