(I just posted this as a comment on one of Chris Siebenmann’s posts, but it was long enough that I felt it warranted reposting here.)
For a sysadmin, testing software is really, really hard. We’re constantly stuck in a cycle where we either patch a piece of software and introduce unwanted bugs and regressions, or we leave it unpatched and often vulnerable, worse-performing and missing important new features. There are many tricks we’ve learned over the years to make the process easier, but it’s still a fundamentally difficult activity.
Since I came to system administration from the development side of the fence, I’ve always had a keen fascination with the similarities and differences in the way that software developers and system administrators (and the project managers who herd them) go about their jobs. In particular, I find it amazing that neither role generally has a good grasp of how the other functions, or of how the two could work together better.
I think an interesting part of this problem is that software developers have a much easier time of testing things than system administrators do. To understand my viewpoint, keep in mind that when a system administrator tests a new release of software before deploying it to a production system, it’s generally not to make sure that the new features are bug-free; it’s simple enough to document their problems and avoid those features until they’ve stabilized. Rather, the issue is that we need to identify regressions: pieces of code that used to work fine and are now broken.
In the software industry, this is what unit testing is for. Unit testing allows developers to provide a comprehensive set of test cases for a particular function, and to make sure that the function behaves properly for each of them and returns the expected result. Many agile developers believe in writing tests first, then code, and in aiming for 100% test coverage to minimize unintended regressions from rapid code changes.
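To make that concrete, here’s a minimal sketch of what a unit test looks like, using Python’s built-in unittest module. The normalize_hostname function and its test values are hypothetical; they’re only meant to illustrate the idea of pinning down expected results so that later changes can’t silently alter them.

```python
import unittest


def normalize_hostname(name):
    """Hypothetical function under test: lowercase a hostname and strip
    surrounding whitespace and any trailing dot."""
    return name.strip().rstrip(".").lower()


class TestNormalizeHostname(unittest.TestCase):
    def test_lowercases(self):
        self.assertEqual(normalize_hostname("WWW.Example.COM"), "www.example.com")

    def test_strips_trailing_dot(self):
        self.assertEqual(normalize_hostname("example.com."), "example.com")

    def test_strips_whitespace(self):
        self.assertEqual(normalize_hostname("  example.com\n"), "example.com")


if __name__ == "__main__":
    unittest.main()
```

If a later change to normalize_hostname broke any of those behaviors, the corresponding test would fail before the change ever shipped, which is exactly the kind of regression signal the rest of this post is about.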
I’m not recommending that system administrators automate the testing of other people’s software, because there’s no standardized model for business requirements. However, I do think that a little transparency into the development model of our upstream developers would help us figure out where testing is and isn’t necessary.
While it’s not adopted across the whole software industry, unit testing is very popular in many rapid development scenarios, and it has become more or less institutionalized in certain developer communities like CPAN. If you’re a developer, or at least if you develop software without gluing together huge numbers of third-party libraries, it’s pretty simple to gauge regressions in your own software, because you know (or can easily find out) what the test coverage is for your own project. If you have really thorough unit test coverage, and your test cases are properly written, you shouldn’t have any function- or method-level regressions slipping into production code when there’s an update. This doesn’t give developers much insight into the complex problems, like integration-level or system-level issues, but at least it provides basic assurance that no minor but insidious issues are creeping up the chain and causing undetected problems.
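As an aside, “finding out what the test coverage is” really is cheap for the developer. Here’s a rough sketch of how that might look for a Python project using the coverage.py package; the tests/ directory layout is an assumption on my part, and other languages have their own equivalents.

```python
# Sketch: measure statement coverage while running a Python test suite.
# Assumes the coverage package (coverage.py) is installed and that the
# project's tests live under a hypothetical "tests/" directory.
import unittest

import coverage

cov = coverage.Coverage()
cov.start()

# Discover and run the test suite while coverage measurement is active.
suite = unittest.defaultTestLoader.discover("tests")
unittest.TextTestRunner().run(suite)

cov.stop()
cov.save()

# Print a per-file summary of statement coverage.
cov.report()
```

Numbers like these exist for most actively developed projects; they just rarely make it downstream to the people installing the packages.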
The problem with unit testing is that the developers run the tests, and they run them on their own systems. This methodology can lead to some really bothersome problems for other people.
When you’re a system administrator, and especially if you’re a system administrator who deals with a lot of proprietary, closed-source software, it becomes very difficult to understand the development methodologies of every single piece of software you plan to update. There’s a certain amount of trust that goes into your Linux vendor’s ability to not break things like glibc that aren’t easily tested. I think the ability to trust a vendor’s stability track record is a wonderful thing, but it’s something that shouldn’t be necessary for system administrators. We should be able to validate the correctness of code on our systems, with our configurations, without fighting the developers for the right to do it.
There’s a constant impedance mismatch and communication gap between developers and sysadmins that needs to be bridged. Software developers need to understand that most sysadmins aren’t developers, and we need an easy way to perform basic correctness validation on the software we install, especially if we install it from the distribution’s or developer’s packages and aren’t running a “make test” or similar during the install process. We need to understand what’s being tested, we need to understand the significance of the test coverage, and we need to be able to figure out what does and doesn’t warrant further testing. As it stands, all the validation that developers are (or aren’t) doing is lost on us, because we don’t get a warm-and-fuzzy from tests that someone else is running and that we’ll probably never get to see.