Chris Siebenmann wrote another really thought-provoking piece on how sysadmins and developers use revision control differently. There are a couple of things that I really agree with, and a couple that I think are pretty telling of systems administration as a profession. I think, in many ways, that the way developers do things is correct, and the way system administrators do things isn’t. This isn’t because developers are, in general, smarter or more regimented; that’s an apples-to-oranges comparison that I’m not even going to begin to approach. But there are limitations on how developers can test that make their workflow more oriented towards identifying broad problems before the customer does. This focus on reproducibility and testing is something that sysadmins could really learn from.

Here’s the part that a lot of us take for granted:

Here is a thesis: sysadmins use modern version control systems differently than developers. Specifically, sysadmins generally use VCSes for documentation, while developers use them for development. By this I mean that when sysadmins make a commit, it’s for something that is already in use; for example, you change a file in /etc and then commit in order to document when and why you made the change.

This is very, very true. Revision control systems are best used for change control, not just by administrators, but by developers as well (see “blame” and similar commands in most VCSes). I very much advocate this approach. For small changes that can cause only minor performance regressions or other trivial breakages, it’s much simpler to design a system where regressions can be rolled back easily than one where every tiny change requires dozens of administrative hurdles that prevent the administrator from, you know, doing their job. If you have a good way of combining changesets into an easily-displayed view (I use Redmine to aggregate subproject activity), then it’s really easy to see exactly what changed on a system, when, and why.
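
As a concrete illustration of that documentation-style workflow, here’s a minimal sketch that records an already-applied change under /etc as a commit, assuming the directory is already a git working copy (etckeeper sets this sort of thing up for you); the file path and commit message are purely hypothetical:

```python
#!/usr/bin/env python
"""Record an already-applied /etc change as a commit, for documentation.

Assumes /etc is already a git repository (e.g. one set up by etckeeper).
The file path and commit message used below are hypothetical examples.
"""
import subprocess

ETC_REPO = "/etc"

def document_change(path, why):
    # Stage the file that was already edited by hand...
    subprocess.check_call(["git", "add", path], cwd=ETC_REPO)
    # ...and commit it with a message explaining why it changed;
    # the commit timestamp documents when.
    subprocess.check_call(["git", "commit", "-m", why], cwd=ETC_REPO)

if __name__ == "__main__":
    document_change(
        "ssh/sshd_config",
        "Disable password authentication on login hosts (ticket #1234)",
    )
```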

But I think this part of the post requires a little more scrutiny:

There are a number of important features of modern VCSes that are basically irrelevant if you are only using them for post-facto documentation. One obvious example is cherry-picking only some changes to commit; because all of the changes are already live, committing only some of them means that you are not documenting some active changes.

(There is some point to the ability, but needing to do it generally means that either someone forgot to commit several changes or that there was a crisis in a mixed directory.)

Sysadmins can use VCSes in a more development mode, but I think that it is somewhat foreign and is certainly going to take not insignificant amounts of work. (Consider the problem of testing your changes before you deploy them into the live version of the repository, for example.)

If you’re pushing changes that you haven’t tested into a production environment, then you’re probably doing something wrong. I hope this isn’t construed as an inflammatory statement, because I work in education too, and I understand the realities of that particular environment. This definitely isn’t meant as a knock on Chris, since I’m stuck having to make some of the same hard decisions (and they often leave a bad taste in my mouth). But for those of us with saner environments to manage, I think there’s something to learn if we look at our own practices a little more critically. The great challenge for me over the last two years has been wrangling and getting control over a maddeningly cobbled-together environment that, to use a predecessor’s soul-crushing term, “grew organically.” (The hidden truth in that statement is that crops grown organically have no pesticides.)

Developers, for the most part, work in separate development/production “environments” out of necessity. In its most basic form, this might mean the development environment is a working copy while production is the latest stable release posted on the website. People who write programs generally do at least a cursory test on their own testbeds to make sure something works before pushing it out to a customer. Sometimes, though rarely, it’s impossible to reproduce a particular issue on the development system, and squashing bugs involves a lot of guess-and-check work. For the most part, though, the developer is able to verify that a change works as intended before putting it into production (releasing a new version).

There aren’t many developers who release only nightly builds or development snapshots of projects that are considered production-ready, and the ones that do tend not to be very successful. Yet this is precisely the mentality many administrators take when managing systems. There are some fundamental differences between the models, of course; a developer can’t force a user to upgrade a broken version, while a hosted service can often be fixed transparently and with minimal interruption. But can’t we do better where it counts?

This takes a kind of diligence not often seen in the realm of systems administration, partly because it’s often not required and partly because it’s genuinely difficult. In many cases there are also substantial costs in licensing software purely for testing purposes. Most organizations, and the people who support them, can’t afford the man-hours to constantly set up clones of complex, interconnected and interdependent systems just to test simple changes, especially when those systems aren’t directly linked to generating revenue. Even with deployment automation tools like linked clones in VMware ESX, it’s extraordinarily difficult to perform this kind of testing correctly, and much of the time there’s very little reward or incentive for doing so.

I’m not convinced that this is because of any inherent complexity. I think it’s mostly because we, as smaller-scale system administrators, tend not to deploy our configurations correctly in the first place, and that makes it very difficult to create a good test environment programmatically. Large enterprises have it easier: large numbers of homogeneous systems make it straightforward to push identical or nearly-identical configurations out to a ton of grid computing nodes. For all of the complexities saddling organizations like Google or Goldman Sachs, pushing configurations out onto cluster nodes probably isn’t one of them. But in situations like academic research institutions, where you have huge amounts of heterogeneity and are forced to produce a huge number of one-off system configurations, things become very tricky.

But we’re pushing into 2010 now, and we can’t complain that we don’t have the tools any longer. Cfengine has been tolerable for a number of years, and better tools like Puppet, Chef and Cfengine 3 are beginning to gain a lot of traction. At this point it should be very easy to set up repeatable build environments, as long as we have the diligence to keep all of our configurations, or at least everything relevant to infrastructure, managed through a proper configuration management engine. Through proper use of subprojects/submodules, or whatever equivalent your VCS of choice provides, the branching and merging needed for parallel system development in staging/production trees should be straightforward. And with virtualized environments as pervasive as they are, it should be very simple to rebuild a system from the ground up using your configuration management product and then test whatever you need to test.
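
To make the staging/production idea concrete, here’s a minimal sketch of the branch-per-environment approach. It isn’t Puppet or Cfengine, just an illustration under assumed names: the environment name selects a branch of a hypothetical site-config repository, and files are only rewritten when they actually differ, so the same tree can be exercised on a throwaway VM before the production branch ever sees the change.

```python
#!/usr/bin/env python
"""A minimal sketch of branch-per-environment configuration management.

This isn't Puppet or Cfengine; it just illustrates the idea that a
"staging" or "production" environment maps to a branch of a configuration
repository, and that applying configuration should be idempotent. The
repository URL, branch names and manifest below are all hypothetical.
"""
import filecmp
import os
import shutil
import subprocess
import tempfile

CONFIG_REPO = "git://config.example.com/site-config.git"  # hypothetical

# Map a file in the checked-out repository to where it belongs on disk.
MANIFEST = {
    "ntp/ntp.conf": "/etc/ntp.conf",
    "ssh/sshd_config": "/etc/ssh/sshd_config",
}

def checkout(environment):
    """Clone the branch named after the environment (staging/production)."""
    workdir = tempfile.mkdtemp(prefix="site-config-")
    subprocess.check_call(
        ["git", "clone", "--branch", environment, CONFIG_REPO, workdir]
    )
    return workdir

def apply_manifest(workdir):
    """Copy managed files into place, but only if they actually differ."""
    for source, target in MANIFEST.items():
        source_path = os.path.join(workdir, source)
        if os.path.exists(target) and filecmp.cmp(source_path, target, shallow=False):
            continue  # already converged; nothing to do
        shutil.copy2(source_path, target)
        print("updated %s" % target)

if __name__ == "__main__":
    workdir = checkout("staging")  # exercise the change here first
    try:
        apply_manifest(workdir)
    finally:
        shutil.rmtree(workdir)
```

The point is less the specific mechanics than the property that running it twice changes nothing, which is what makes it safe to rehearse the same change on a clone before touching production.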

Proper release management has been a big part of corporate IT culture for decades. The idea isn’t that change is bad; you’ll find in many organizations, like Facebook, that change drives progress and provides a lot more competitive advantage than being unnecessarily risk-averse. However, I think the small guys have a lot to learn from the better-run IT shops when it comes to understanding that proper testing practices can go a long way in making life easier for your users. That, in the end, is what we need to strive for. While the ability to roll back changes is nice, it’s better to have a well-tested platform that’s consistent across all of the systems you manage. With a good configuration management system, you can roll back the appropriate changes in parallel across all of your systems automatically.
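
The parallel-rollback point is easy to sketch under the same assumptions as above: revert the offending commit once in the shared configuration repository, then have every managed host converge on the corrected tree. The checkout path, hostnames, commit id and apply-config command here are all hypothetical:

```python
#!/usr/bin/env python
"""Sketch of rolling back a bad change across every managed host.

Assumes hosts apply their configuration from a shared git repository (as
in the earlier sketch) and are reachable over ssh. The checkout path,
hostnames, commit id and apply-config command are all hypothetical.
"""
import subprocess

CONFIG_CHECKOUT = "/srv/site-config"    # hypothetical local checkout
HOSTS = ["web01", "web02", "db01"]      # hypothetical managed hosts
BAD_COMMIT = "abc1234"                  # hypothetical offending commit

# Revert the change once, in the configuration repository...
subprocess.check_call(["git", "revert", "--no-edit", BAD_COMMIT],
                      cwd=CONFIG_CHECKOUT)
subprocess.check_call(["git", "push"], cwd=CONFIG_CHECKOUT)

# ...then ask every host to converge on the corrected configuration.
for host in HOSTS:
    subprocess.check_call(["ssh", host, "apply-config --environment production"])
```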