Month: November 2009

Recording disk statistics with sysstat on RHEL/CentOS

Unlike on Debian-like systems, the default configuration for sysstat’s sa1 collector on RHEL/CentOS does not include disk statistics (like you would get from iostat) in the sa collection output. This is due to a missing flag in the cron.d fragment that calls sa1. The “-A” flag to sa1 defies reasonable assumption about its function, and does not include disk statistics, so we have to specify “-d” manually.

To enable disk statistics collection/trending, edit /etc/cron.d/sysstat and change the following:

*/10 * * * * root /usr/lib64/sa/sa1 1 1

to this:

*/10 * * * * root /usr/lib64/sa/sa1 -d 1 1

(Obviously, replace “lib64” with “lib” as appropriate for i386 systems.)

Either wait for the next sa log rotation (at midnight) for sa1 to begin collecting disk statistics, or delete your current day’s statistics. sa1, for whatever historical reason, does not add new counters to an existing sa log file.
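If you have more than a couple of hosts to change, the edit above can be scripted. What follows is only a minimal sketch: the function name add_disk_flag is made up for illustration, and it assumes the sa1 line in your cron fragment looks like the one shown above (a path ending in /sa/sa1 followed by a space). It reads the cron fragment on standard input, writes the modified version to standard output, and leaves lines that already have "-d" alone.

```shell
# Hypothetical helper: add "-d" to the sa1 invocation in a cron fragment.
# Reads the fragment on stdin and writes the result to stdout.
add_disk_flag() {
    # Skip lines that already call "sa1 -d"; otherwise insert "-d"
    # right after the sa1 path. Works for both lib and lib64 paths.
    sed -e '/\/sa\/sa1 -d /!s|\(/sa/sa1\) |\1 -d |'
}
```

You could run it as "add_disk_flag < /etc/cron.d/sysstat", review the output, and only then copy it into place.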

Interesting links for 11/25/2009

With all the busy-ness that this holiday weekend entails, I’m just going to leave you all with a bunch of links:

  • TaoSecurity has a really interesting writeup about the ethics of Shodan, a “computer search engine” which provides some very interesting tools for people trying to secure their systems or launch attacks on arbitrary ones. At the very least, it’s interesting seeing how many Nortel switches, Checkpoint firewalls and other devices people are actually running with telnet open to the Internet at large.
  • WWLTV in New Orleans ran a segment on how eastern European hackers are increasingly targeting American small businesses and stealing online banking credentials with malware. It’s nothing that you haven’t heard before, but it’s nice to see information security starting to get some mainstream attention, and people finally beginning to become aware of the real financial threat posed by bad information security. Hopefully banks will get the hint and start relying on multi-factor authentication for all business accounts.
  • Last In – First Out has a nice post on cargo cult system administration. Matt from Standalone Sysadmin has an amusing anecdote about it from Tom Limoncelli in the comments.
  • Icinga, the fork of Nagios, has finally released a demo of their new web interface. It’s snazzy in a “wow, some neat technology” way, but I don’t really see it being an improvement at all in the “this makes it easier to do my job” way. Ultimately, I’m not sure how to approach the project — the real fundamental problem with Nagios is that it’s, well, Nagios. I’m not sure how to fork it and make it better without utterly destroying compatibility. (This might not be a bad thing.)

Nagios plugin: check_sa.pl

There are a lot of useful Nagios addons out there. One of them, pnp4nagios, allows you to create graphs of all of your Nagios performance data with zero configuration. This is pretty nice, because your monitoring configurations are kept in one place, rather than having to separately maintain configurations for Nagios and Cacti (or whatever you use).

I’ve always wanted to be able to monitor things like number of open sockets, page faults, context switches, and other performance counters. Some of them are available through SNMP; others aren’t. The ones that are available aren’t all available by device. I wanted a little bit more detail.

The other problem with SNMP queries is that a Nagios check doesn’t query an average — something that spikes for a minute is not the same as a condition that persists for several minutes or hours. I wanted to leverage the built-in accounting in sysstat to pull together something Nagios can actually make a little bit of sense out of.

Anyway, I went ahead and created a Nagios plugin that will parse the output of sadf (which is a frontend to sa/sar performance counters). You can query multiple counters at a shot, specifying separate alert thresholds for each (or none at all, if you just want performance data). You can specify, via shell-style glob patterns, which devices you want to include or exclude, so that you can, for example, exclude all “lo” and “tun*” devices from network statistic monitoring. You can also pick the sampling period, so if you want an average of the last 30 minutes the plugin will produce it.

You can do stuff like this:

./check_sa.pl -i -C %usr -C %soft -C %sys -C %idle -D all
SA OK – All counters within specified thresholds. | %idle[cpu0]=96.84;; %idle[cpu1]=96.31;; %idle[cpu2]=97.23;; %idle[cpu3]=95.8;; %soft[cpu0]=0;; %soft[cpu1]=0.01;; %soft[cpu2]=0;; %soft[cpu3]=0.01;; %sys[cpu0]=0.4;; %sys[cpu1]=0.46;; %sys[cpu2]=0.36;; %sys[cpu3]=0.63;; %usr[cpu0]=2.67;; %usr[cpu1]=3.13;; %usr[cpu2]=2.27;; %usr[cpu3]=3.46;;

Or, if you prefer to summarize:

./check_sa.pl -i -C %usr -C %soft -C %sys -C %idle -d all
SA OK – All counters within specified thresholds. | %idle[all]=96.54;; %soft[all]=0;; %sys[all]=0.46;; %usr[all]=2.89;;

It’s still a tiny bit slow — it takes about 500-600 ms to run on the systems I’ve tested — but this should be good enough to be useful without bogging down Nagios too badly.

The script requires the Text::Glob module to be installed, so it can convert shell-style globs into regular expressions to match against.
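To illustrate the include/exclude idea, here is a rough shell sketch of glob-based device filtering. This is not the plugin's code (the plugin is Perl and uses Text::Glob); the function names matches_any and filter_devices are invented for this example, and it leans on the shell's own case-pattern matching instead of converting globs to regular expressions.

```shell
# Return success (0) if the name in $1 matches any of the remaining
# arguments, each of which is treated as a shell-style glob.
matches_any() {
    name=$1
    shift
    for pat in "$@"; do
        # An unquoted variable in a case pattern is matched as a glob.
        case $name in
            $pat) return 0 ;;
        esac
    done
    return 1
}

# Read device names (one per line) on stdin and print only those that
# match none of the exclude patterns given as arguments.
filter_devices() {
    while IFS= read -r dev; do
        matches_any "$dev" "$@" || printf '%s\n' "$dev"
    done
}
```

For example, piping a list of interface names through "filter_devices lo 'tun*'" drops the loopback and tunnel devices and keeps everything else.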

View the project:

Fedora 12 allows users to install signed packages…

Update: According to a post on LWN that I can’t find at the moment, they’ve already reverted this decision with a subsequent update. It should be resolved soon.

…without root privileges, without authenticating.

Yeah, you read that right. SANS has the writeup:

A “bug” created back in November against the latest Fedora release (12) indicates that, through the GUI, desktop users of the Fedora system are able to install signed packages without root privileges or root authentication.  Yes, you just read that correctly.  (I’ll give you a second re-read that sentence so I don’t have to retype it.)  Yes, “it’s a feature, not a bug”.

In all my travels I’ve only ran across one company, ever, that has Fedora rolled out as an enterprise operating system on every desktop.  But what kind of security implications does this have?  I obviously don’t have to explain why this is (may be) a bad idea to the readers of the ISC, as we are all security minded people.

Now, the restrictions.  This change does not affect yum on the command line.  This only affects installing things through the GUI.  (Not that helps any, as most users will be running the GUI anyway.)  You can also disable it.

Currently in the bug, there is some debate about if they should revert this feature.  So, this may be just temporary.

I’m sure this shouldn’t affect most people’s real deployments of anything, since Fedora has always been something of a moving target and has, in my experience, been completely unsuitable for widespread deployment in an organization for a wide variety of other reasons. But just because it’s not appropriate for enterprise customers doesn’t mean that desktop users have nothing to worry about.

That’s because this extends the attack surface for malicious intruders by a really impressive amount. By allowing users unauthenticated access to play with the package manager, you create a nearly infinite attack surface for anyone looking to obtain a local privilege escalation on the system. Imagine this: you don’t need to exploit any one specific system service, because once you find a hole in something, anything at all that can be targeted in a default out-of-the-package configuration, you can install it and exploit it.

I’m not 100% sure of the implications of how this is designed; I may be fundamentally misunderstanding something that’s going on in the back end, and this may not be a Really Bad Thing. But imagine this: someone finds a bug in Firefox, or Flash, or Java. They exploit it to gain the ability to run arbitrary code under the user’s account. They can now silently install Cfengine, Puppet, Bcfg2, or another root-configured service in the background using PolicyKit. They then attempt to exploit these services, which shouldn’t be running in the first place, and if they succeed, suddenly they have root access to do whatever they want.

Let me slip on my tinfoil hat for a minute: say some minor package maintainer gets through Fedora’s release engineering processes, and under the radar, slips a surreptitious backdoor into a package that only a handful of people use and nobody really keeps their eyes on. Where previously the damage might be so localized, from the package’s disuse, as to be pretty much useless, now that package can be slipped into anyone’s system at will through a local unprivileged user exploit.

SELinux mitigates this, absolutely, and unlike in Debian, most important things won’t start by themselves until they’re explicitly enabled by the administrator. But the back door is there even if it’s locked. It’s only a matter of time until someone finds a real-world way to abuse this, and I really wish they would seriously consider reverting this behavior to something a bit less dangerous. This could be a very useful tool in a corporate environment, but the way I understand the situation right now, it’s a very bad default.

44% of security products contain security problems

Slashdot linked to an interesting analysis of an ICSA Labs report, done by Help Net Security, about the underperformance of various network security products. The meat of the analysis focused on how most products fail to achieve certification on the first test, but I found this particular statistic incredibly enlightening:

Rounding out the top three is the startling finding that 44 percent of security products had inherent security problems. Security testing issues range from vulnerabilities that compromise the confidentiality or integrity of the system to random behavior that affects product availability. Even though it can be a demanding process, certification with a trusted, established third party is critical to verifying product quality, states the report. Product categories studied were: anti-virus, network firewall, Web application firewall, network IPS, IPSec VPN, SSL VPNs and custom testing.

The report has some caveats. For example:

Even the technology used to store and access test data has seen substantial change. We certainly cannot make the claim that a single, consistent data collection method was employed across all products throughout the timeframe of this study.

Check out the rest of the report; it’s a good read. I’ve long been of the belief that most high-end security products (beyond typical endpoint stuff) are snake oil and don’t provide any kind of real ROI; this report does nothing to change my opinion, especially in the IPS space, where a really remarkably huge portion of the sampled products failed to achieve certification.

More on CentOS 5.3 to 5.4

So, here’s a humbling, humiliating and slightly funny follow-up to my last blog post:

I’ve always done my due diligence in making sure upgrades go smoothly. As a result, I have a habit of tirelessly poring over release notes and the “known issues” section therein. However, I got burned this week when I failed to read all of the release notes.

CentOS has a documentation page for the 5.0 series. And as of this writing, the documentation page links to a document called Release Notes. It does not, however, link to a completely different document that also is called Release Notes. I had read the release notes on the documentation page, but not the CentOS-specific release notes document which was only linked from the front page. I suppose it’s my fault for not noticing that 5.0 through 5.3 all have CentOS release notes links pointing to the wiki, and thinking that the wiki might be a good place to look.

Upon asking about my upgrade issues, the always-helpful folks in #centos berated me for not Googling correctly for the release notes, accused me of trolling when I pointed out that I did find (and read) the release notes but that there was a documentation problem, and asked me why I would dare to criticize the free efforts done by a volunteer in maintaining the documentation. Obviously, after finding the only document called “Release Notes” listed under CentOS’s documentation for 5.4, on the page where this documentation would normally be, the perfectly reasonable, thinking man’s approach to the problem would be to Google for CentOS release notes.

After much soul-searching and reflection, and a few minutes spent filing a bug report about the documentation page, I did find the answers I was looking for in the CentOS-specific release notes tucked away on their wiki:

  • CentOS 5.4 includes glibc and kernel updates. For yum updates the recommended procedure is:

yum clean all
yum update glibc\*
yum update yum\* rpm\* python\*
yum clean all
yum update
shutdown -r now
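
If you are repeating this on several machines, the steps can be wrapped in a small script. The following is just a sketch: the names run, staged_update and DRY_RUN are made up for this example, and the reboot is deliberately left out so you can check the result first. Setting DRY_RUN to any non-empty value prints the commands instead of running them. The globs are quoted rather than backslash-escaped, which has the same effect of passing them to yum unexpanded.

```shell
# Run a command, or just print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# The staged update from the release notes, stopping at the first failure.
# The final "shutdown -r now" is left to the administrator.
staged_update() {
    run yum clean all &&
    run yum update 'glibc*' &&
    run yum update 'yum*' 'rpm*' 'python*' &&
    run yum clean all &&
    run yum update
}
```

Running "DRY_RUN=1 staged_update" shows the five yum commands in order without touching the system.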

So, here are the morals of the story:

  • If you try to run the whole upgrade at once using yum upgrade, there is a good chance that you will break your system going from 5.3 to 5.4. Follow the documentation, and update your packages in the order given above, and you should be just fine.
  • If you think you’re missing an important piece of documentation, you probably are.

Did you ever have one of those weeks where everything you learned seemed to be choreographed into place? I think that I’m learning much broader lessons this week about the nature and the danger of assumptions, as the Lone Sysadmin would tell you about me. (Bob Plankers, it turns out, is very much not a “goon,” and one can make a very big ass of themselves by assuming other people are familiar with the other meanings of such a word.)

CentOS 5.3 to 5.4 upgrade woes

I’ve been pushing out CentOS 5.4 on a number of test systems this week, and I came upon a very interesting, very insidious, and very annoying problem.

When running the upgrade, yum upgrade seems to run without a hitch, and returns completely successfully with no errors or warnings. However, what actually happens in the background is that the cleanup process breaks silently, and the package manager gets completely filled up with entries for duplicate packages that shouldn’t be allowed to coexist. I was alerted to the problem by rkhunter, which notified me during its post-reboot run that several files were mismatched versus what the package manager thought they should contain. Now, if you rpm -qa a package with matching versions installed, the order they come back is arbitrary and depends on how they end up in your RPM BDB database. When rkhunter called rpm --verify, it was running against the older version, which was failing the checksum comparison.

The number of package errors that rkhunter actually caught paled in comparison to the huge number of screwed up package entries on the system.

This usually doesn’t cause a problem. In most cases, if the cleanup portion fails, you can just run yum-complete-transaction and it will pick up where it left off. For whatever reason, this doesn’t work here.

After hitting this problem, if you try to run another update, you get output like this:

I cooked up a hairy one-liner to find the duplicates:

rpm -qa --queryformat="%{name}.%{arch}\n" | sort | uniq -d | perl -ne 's/(.*)\.(.*)/\1/g; print' | xargs rpm -qa --queryformat="%{name}-%{version}-%{release}.%{arch}\n" | sort

(It’s only so long because you need to match the arch on x86_64, and rpm -qa doesn’t play nicely with packagename.arch-format names. Interestingly, though, I’ve only experienced the problem on the i386 servers that I’ve upgraded.)
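
The same idea can be written as a single pass over the package list, which avoids the second rpm query. This is only a sketch of an alternative, not the one-liner above: the function name list_duplicates is invented here, and it assumes its input is one package per line in the form "NAME.ARCH NAME-VERSION-RELEASE.ARCH".

```shell
# Read "NAME.ARCH FULLNAME" lines on stdin and print the FULLNAME of every
# package whose NAME.ARCH key appears more than once, sorted.
list_duplicates() {
    awk '
        # Count each key and remember the full names seen under it.
        { count[$1]++; names[$1] = names[$1] $2 "\n" }
        # At the end, emit the full names for keys seen more than once.
        END { for (k in count) if (count[k] > 1) printf "%s", names[k] }
    ' | LC_ALL=C sort
}
```

To feed it real data, something like rpm -qa --queryformat '%{NAME}.%{ARCH} %{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' | list_duplicates should work.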

Here’s the output on one host following a supposedly successful upgrade:


apr-1.2.7-11.el5_3.1.i386
apr-1.2.7-11.i386
apr-util-1.2.7-7.el5_3.2.i386
apr-util-1.2.7-7.el5.i386
audit-libs-1.7.13-2.el5.i386
audit-libs-1.7.7-6.el5_3.3.i386
autofs-5.0.1-0.rc2.102.i386
autofs-5.0.1-0.rc2.131.el5_4.1.i386
centos-release-5-3.el5.centos.1.i386
centos-release-5-4.el5.centos.1.i386
cpio-2.6-20.i386
cpio-2.6-23.el5.i386
cpuspeed-1.2.1-5.el5.i386
cpuspeed-1.2.1-8.el5.i386
crash-4.0-7.2.3.el5.centos.i386
crash-4.0-8.9.1.el5.centos.i386
cups-libs-1.3.7-11.el5_4.3.i386
cups-libs-1.3.7-8.el5_3.4.i386
device-mapper-1.02.28-2.el5.i386
device-mapper-1.02.32-1.el5.i386
device-mapper-event-1.02.28-2.el5.i386
device-mapper-event-1.02.32-1.el5.i386
device-mapper-multipath-0.4.7-23.el5_3.4.i386
device-mapper-multipath-0.4.7-30.el5_4.2.i386
dmidecode-2.10-2.el5_4.i386
dmidecode-2.7-1.28.2.el5.i386
dos2unix-3.1-27.1.i386
dos2unix-3.1-27.2.el5.i386
e2fsprogs-1.39-20.el5.i386
e2fsprogs-1.39-23.el5.i386
e2fsprogs-libs-1.39-20.el5.i386
e2fsprogs-libs-1.39-23.el5.i386
ethtool-6-2.el5.i386
ethtool-6-3.el5.i386
gcc-c++-4.1.2-44.el5.i386
gcc-c++-4.1.2-46.el5_4.1.i386
glibc-2.5-34.i686
glibc-2.5-42.i686
iptables-1.3.5-4.el5.i386
iptables-1.3.5-5.3.el5_4.1.i386
iptables-ipv6-1.3.5-4.el5.i386
iptables-ipv6-1.3.5-5.3.el5_4.1.i386
kernel-2.6.18-128.1.10.el5.i686
kernel-2.6.18-92.1.18.el5.i686
kernel-headers-2.6.18-128.1.10.el5.i386
kernel-headers-2.6.18-164.6.1.el5.i386
kpartx-0.4.7-23.el5_3.4.i386
kpartx-0.4.7-30.el5_4.2.i386
krb5-devel-1.6.1-31.el5_3.3.i386
krb5-devel-1.6.1-36.el5.i386
krb5-libs-1.6.1-31.el5_3.3.i386
krb5-libs-1.6.1-36.el5.i386
krb5-workstation-1.6.1-31.el5_3.3.i386
krb5-workstation-1.6.1-36.el5.i386
less-394-5.el5.i386
less-394-6.el5.i386
lftp-3.5.1-2.fc6.i386
lftp-3.7.11-4.el5.i386
libgcc-4.1.2-44.el5.i386
libgcc-4.1.2-46.el5_4.1.i386
libgomp-4.3.2-7.el5.i386
libgomp-4.4.0-6.el5.i386
libselinux-1.33.4-5.1.el5.i386
libselinux-1.33.4-5.5.el5.i386
libselinux-devel-1.33.4-5.1.el5.i386
libselinux-devel-1.33.4-5.5.el5.i386
libsemanage-1.9.1-3.el5.i386
libsemanage-1.9.1-4.4.el5.i386
libsepol-1.15.2-1.el5.i386
libsepol-1.15.2-2.el5.i386
libstdc++-4.1.2-44.el5.i386
libstdc++-4.1.2-46.el5_4.1.i386
libuser-0.54.7-2.1.el5_4.1.i386
libuser-0.54.7-2.el5.5.i386
libX11-1.0.3-11.el5.i386
libX11-1.0.3-9.el5.i386
libxml2-2.6.26-2.1.2.7.i386
libxml2-2.6.26-2.1.2.8.i386
libxml2-python-2.6.26-2.1.2.7.i386
libxml2-python-2.6.26-2.1.2.8.i386
m2crypto-0.16-6.el5.3.i386
m2crypto-0.16-6.el5.6.i386
mysql-5.0.45-7.el5.i386
mysql-5.0.77-3.el5.i386
mysql-devel-5.0.45-7.el5.i386
mysql-devel-5.0.77-3.el5.i386
mysql-server-5.0.45-7.el5.i386
mysql-server-5.0.77-3.el5.i386
neon-0.25.5-10.el5_4.1.i386
neon-0.25.5-10.el5.i386
newt-0.52.2-12.el5_4.1.i386
newt-0.52.2-12.el5.i386
nfs-utils-lib-1.0.8-7.2.z2.i386
nfs-utils-lib-1.0.8-7.6.el5.i386
nscd-2.5-34.i386
nscd-2.5-42.i386
nss-3.12.2.0-4.el5.centos.i386
nss-3.12.3.99.3-1.el5.centos.2.i386
numactl-0.9.8-7.el5.i386
numactl-0.9.8-8.el5.i386
openssl-devel-0.9.8e-12.el5.i386
openssl-devel-0.9.8e-7.el5.i386
pam-0.99.6.2-4.el5.i386
pam-0.99.6.2-6.el5.i386
perl-5.8.8-18.el5_3.1.i386
perl-5.8.8-27.el5.i386
redhat-rpm-config-8.0.45-29.el5.noarch
redhat-rpm-config-8.0.45-32.el5.centos.noarch
rsh-0.17-38.el5.i386
rsh-0.17-40.el5.i386
ruby-1.8.5-5.el5_2.6.i386
ruby-1.8.5-5.el5_3.7.i386
ruby-irb-1.8.5-5.el5_2.6.i386
ruby-irb-1.8.5-5.el5_3.7.i386
samba-client-3.0.33-3.15.el5_4.i386
samba-client-3.0.33-3.7.el5.i386
sqlite-3.3.6-2.i386
sqlite-3.3.6-5.i386
strace-4.5.18-2.el5_3.3.i386
strace-4.5.18-5.el5.i386
tzdata-2009f-1.el5.noarch
tzdata-2009o-2.el5.noarch
udev-095-14.20.el5_3.i386
udev-095-14.21.el5.i386
vim-enhanced-7.0.109-4.el5_2.4z.i386
vim-enhanced-7.0.109-6.el5.i386
vim-minimal-7.0.109-4.el5_2.4z.i386
vim-minimal-7.0.109-6.el5.i386
ypbind-1.19-11.el5.i386
ypbind-1.19-12.el5.i386
yum-metadata-parser-1.1.2-2.el5.i386
yum-metadata-parser-1.1.2-3.el5.centos.i386

You need to go through these and remove the outdated package versions, one by one. (If you’re confused about which is newer, you can run rpm -qi <packagename> and see, among other details, the date the package was built.) This should be a safe operation; the package manager reference-counts files, and won’t remove a file belonging to multiple packages until all of those packages have been removed, even though you should never have multiple packages owning the same file in the first place. I’m fairly sure that removing these packages manually shouldn’t trigger the %postun scripts, and that the package manager will figure out that removing one version while you have a newer one installed means it’s an upgrade instead of an uninstall. If you’re worried, though, you can do an rpm -e --justdb to remove only the package database entries for the files while not running the scripts or actually removing any files.
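
Picking out the older entry by eye gets tedious with this many packages, so here is a rough sketch of how the build-date comparison could be automated. The function name list_stale_packages is hypothetical, and it assumes input lines of the form "BUILDTIME NAME.ARCH NAME-VERSION-RELEASE.ARCH", with BUILDTIME as the epoch seconds rpm reports. For each NAME.ARCH it keeps the most recently built entry and prints the rest. Note that kernel packages are normally allowed to coexist in several versions, so review the output before removing anything.

```shell
# Read "BUILDTIME NAME.ARCH FULLNAME" lines on stdin. For each NAME.ARCH,
# print every FULLNAME except the one with the newest build time.
list_stale_packages() {
    # Sort by key, then by build time descending, so that the newest
    # entry comes first within each key.
    LC_ALL=C sort -k2,2 -k1,1nr | awk '
        $2 == prev { print $3; next }  # same key as the line before: older
        { prev = $2 }                  # first line for this key: newest
    '
}
```

Real input could come from rpm -qa --queryformat '%{BUILDTIME} %{NAME}.%{ARCH} %{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' | list_stale_packages, and the reviewed list can then be handed to rpm -e as described above.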

Following the removal of the stale packages, a yum -y upgrade should fix the remaining issues.

It’s important to note that the packages do all upgrade: running an rpm --verify on the package after removing the old version does not result in any checksum mismatches or any other visible strangeness. The old versions simply don’t get removed from the package manager, which wreaks havoc on your dependency graphs.

I don’t know what’s causing the problem, but I think it might have something to do with where the upgrades to rpm/yum are placed in the middle of the transaction. Will report back after the next batch of updates, in which I will update rpm and yum first before proceeding with the remainder of the upgrade.

How (not) to interview technical candidates

First, my sincerest apologies for the length of this one. I usually don’t spit out this much at once.

Technical interviews are hard. Really, really hard. This is why a lot of big corporations continue to hire IT employees on a six-month contract, followed by an offer for continued employment if they work out: many of them don’t have adequate confidence in their hiring processes.

Deciding who to hire and who not to hire is one of the most difficult parts of trying to run a team, and it’s even harder in a profession where people overstate, exaggerate and lie in order to get through an incompetent HR-driven resume-screening process. It’s a zero-sum game of escalation: jobless workers fire scattershot resumes to all kinds of positions they aren’t qualified for, and large corporations are left with little choice but to do the same thing we as IT people do: apply some remarkably naive filtering to stop the spam. This generally consists of awful keyword filters.

By now, I’ve sat in on, and helped conduct, a ton of technical interviews. Whereas most IT organizations will have a manager or two interview a candidate, along with the head of the department, my group decided a long time ago that whenever we had a position to fill, we would conduct a group interview: the candidate would interview with the entire team they would be working with. While this is very time-consuming, and most technical people hate meetings, everyone in our group was able to see the importance of getting the job done right the first time. As in most professions, it’s often more harmful to have someone do an IT job incorrectly than it is not to hire anyone at all.

I’m not going to zero in on things that occur above IT’s head, like HR’s screening and review processes; there’s nothing that most technical people can do about that besides complain. Instead, I’m going to focus on the technical side of the interviewing process, which lots of companies get just as wrong. I’m not going to attempt to paraphrase the entirety of Joel Spolsky’s famous guide to interviewing, which has some great ideas and some ideas that I don’t always agree with. I will, however, rant for a very long time.

In understanding why I think so many people get the interviewing process wrong, you should familiarize yourself with the concept of illusory superiority. In particular, read the bit about the Kruger and Dunning experiments:

All groups put themselves above average. This meant that the lowest-scoring group (the bottom 25%) showed a very large illusory superiority. Although their test scores were in the 12.5th percentile on average, they estimated themselves to be in the 62nd. Kruger and Dunning explained that those who were worst at the tasks were also worst at recognising skill in those tasks. This was supported by the fact that, given training, the worst subjects improved their estimate of their rank as well as getting better at the tasks.

Curiously, I’m not bringing this up because I think most candidates think they’re much better than they are. I’m bringing this up because a lot of technical interviewers think they’re really clever, and this trait gets them into trouble. What gets them into even more trouble is their eagerness to prove it.

Don’t be a jerk.

This was last, but I just put it first because it’s probably the single biggest issue I’ve seen with most engineer-driven technical interviews.

Technical professions, particularly programming and systems and network administration/engineering jobs, have this really cool personal habit: whenever someone finds themselves in a new position, they spend a couple of months cursing the stupid decisions made by the idiots that came before them. I’ve been guilty of it before, and I’m still guilty of it. Sometimes they’re right, and sometimes they’re wrong.

But most technical people hiring people to work with them seem to be very insecure, and very afraid of people saying the same thing about their work. As a result, they make a habit of being as rude and condescending to the candidate as they can in order to prove their technical prowess, and establish themselves as alpha wolf before the person ever gets hired. This is a good way to drive great candidates away.

As an interviewer, you have to be very perceptive of your tone, and very open to accepting different solutions to a problem. As a manager, especially one conducting a group interview, you have to know where to step in and tell your people when to knock it off.

This isn’t to say that putting pressure on a candidate is a bad thing. I’m not even saying that being a jerk to an applicant is always a bad thing. I’ve known lots of people with other companies who would interview for heavily customer-facing positions, whether helpdesk or directorship positions, and intentionally try to get a rise out of the interviewee. In these cases, if the employee overreacts or blows up, it’s a pretty good indicator that they don’t have the professional rapport and demeanor to be a good choice for the position, because if they can’t take it from the interviewer, they probably can’t take it from the executives above them when something goes wrong.

The ability to memorize useless crap doesn’t make somebody valuable.

If you’re reading this post, you almost certainly know this, so I’m not going to dwell on it: asking someone what the “-q” option to GNU grep does simply isn’t a good interview question at any level. Neither is asking about some random detail of some random programming API, or asking about some Gang of Four design pattern that nobody’s ever consciously related to that name.

You know how it was always a lot harder to BS the right answer on an essay question than on a multiple choice question? That’s because it requires you to actually demonstrate comprehension and understanding of the topic as a whole. Beyond the initial screening process, there’s very little value to questions which have a simple right or wrong answer.

Your questions probably aren’t as clear as you think they are.

Someone was once ranting about a candidate who did horribly in an interview. The interviewer had asked the candidate to guess the state he was born in. The candidate asked, “Tennessee?” to which the interviewer replied that the candidate should think bigger. The candidate asked, “Texas?”

The interviewer was flabbergasted. There were so many clever ways to approach the problem! This guy was doing an O(n) comparison! If he was pragmatic, he could have tried doing a binary search, dividing the search area into regions. If he was a smart aleck, and had a different idea of what the interviewer meant by the word “state,” he could have responded with, “Naked?”

But the problem was one of connotation: that simple, innocuous word “guess” was wrong, and the question wasn’t half as clear as he thought it was. A better way to phrase the scenario might have been something like: “I was born in a certain state, and I want you to figure out what it is. To do this, you’re going to ask me a series of questions; I will answer ‘yes’ or ‘no’ to each question.”

“But Jeff,” you may say, “this takes all of the imagination out of the interview process.” I disagree. I think that if your problem ceases to be difficult once you’ve explained it clearly, the issue might just be with the questions you’re asking. In most cases, a genuinely complex question with a number of genuinely complex answers will yield much more fruitful results for you than asking vague riddles.

It’s not merely a question of open-mindedness. The questioning methodology exposes a number of cultural biases which can be exclusionary to candidates who are completely qualified for the position otherwise.

Complex questions aren’t always relevant questions.

Large software companies like Google and Microsoft have attracted a lot of attention over the years by having some interview questions that are, by traditional standards, completely outlandish. Some of these questions have included:

  • How many piano tuners are there in the entire world?
  • Why are manhole covers round?
  • You have to get from point A to point B. You don’t know if you can get there. What would you do?

What doesn’t get the attention are questions like these, which are all for different positions:

  • What’s a creative way of marketing Google’s brand name and product?
  • How would you boost the GMail subscription base?
  • What is multithreaded programming? What is a deadlock?

Even at Google and Microsoft, the interview process is dry. They still vet the candidates on the basis of their technical merits and their applicability to the position. The riddle-type questions are ancillary to the other types of questioning, and only really apply to people who really need to be delivering clever solutions to problems as part of their day-to-day job responsibilities — project managers, architects, lead programmers and the like.

If there’s anything we’ve learned from Steve McConnell, Andy Hunt, Dave Thomas and the like, it’s that there’s a fine balance that needs to be struck between cleverness and clarity, and most of the time, that balance leans much more towards clarity than cleverness. Cleverness in technical fields has an awful tendency to be identical to obtuseness, and you should apply cleverness really sparingly where it will actually help you out.

I’ll close out this section with a quote from an ex-Google employee, from Michael Arrington’s TechCrunch article from January:

I left Microsoft to work for Google in 2005. I stayed 10 months. I was demoralized. I shouldn’t have ever taken that job. I was disenchanted the whole time, and yes, like you, my regret over the poor bargain I’d made affected my performance.

As I was saying. Google actually celebrates its hiring process, as if its ruthless inefficiency and interminable duration were a sure proof of thoroughness, a badge of honor. Perhaps it is thorough. But I would be willing to wager that Microsoft’s hiring process, which takes a fraction of the time,  does not result in a lower-skilled workforce or result in a higher rate of attrition. And let me say this: if Larry Page is still reviewing resumes, shareholders should organize a rebellion. That is a scandalous waste of time for someone at that level, and the fact that it’s “quirky” is no mitigation.

The person you’re interviewing would like to do a job.

They may even have a job already. They may be busy people who don’t have the time or inclination to jump through needless hoops.

It’s important to make sure that the person you’re interviewing is a good fit for the team. This is why my team will almost always conduct group interviews when filling a position, and we all take the time to ask our own personal questions to try to get a feel for the interviewee’s personality. Some people, however, take the interpersonal aspect a bit too far.

Somebody on a thread I was following a couple of weeks ago was posting about how their interviewing process involved the candidate burning an 80-minute CD of their favorite music, watching the movie Idiocracy and being prepared to discuss it at the interview. Job interviews aren’t high school book reports or college entrance exams. The people applying for positions have jobs and families and other obligations that prevent them from sinking endless hours into interview prerequisites. Ultimately, if you’re going to test your candidates or give them assignments outside of the interview room, keep it really short or make sure it’s restricted to people who are already in a tight race for the job. Otherwise, it’s just a waste of everybody’s time.

Management is about using the right metrics.

In most fields, management is all about numbers. Units are shipped, dollars are earned, money is saved. In most organizations, IT is a cost center and an internal customer service organization. That makes it really difficult to figure out metrics. Lots of people in IT hate the idea of metrics as a whole, because in many cases they emphasize taking the quick way out instead of finding the right solution to a problem. This leads to systemic issues that are not easy to undo later. Metrics are only valuable if they’re both relevant and useful. The length of time spent working on a ticket is not a useful metric: some tickets may appear to be simple issues but may reflect systemic problems. The number of tickets closed in a reporting period is not a useful metric either, because the issues may not be resolved in a satisfactory way.

The interviewing process is the same way: if you get it wrong, you will introduce systemic issues. And in nearly all cases, it’s better to use no metrics at all than to use the wrong ones.

Last week, Valleywag wrote of Google:

Google strives to hire “the world’s best engineers,” and has crafted an “interminable” interview process dotted with puzzles and brainteasers to do so. One little problem: the process tends to give the worst scores to the best future employees.

That’s according to Peter Norvig (pictured), Google’s director of research, former Google director of search quality and former head of the Computational Sciences division at the NASA Ames research center. Here’s what Norvig tells Peter Seibel in a Q&A in the new book Coders at Work (emphasis added):

One of the interesting things we’ve found, when trying to predict how well somebody we’ve hired is going to perform when we evaluate them a year or two later, is one of the best indicators of success within the company was getting the worst possible score on one of your interviews. We rank people from one to four, and if you got a one on one of your interviews, that was a really good indicator of success.

Small suggestion: Maybe Google can take these genius employees and have them, hmmm, we dunno, debug the frickin’ broken interview process. Those who demanded they be hired should probably also be enlisted in the debugging effort. Writes Norvig:

Ninety-nine percent of the people who got a one in one of their interviews we didn’t hire. But the rest of them, in order for us to hire them somebody else had to be so passionate that they pounded on the table and said, “I have to hire this person because I see something in him…”

That last line is the key, and it leads me to my last point:

The interview process is completely subjective, even when you think it isn’t.

Especially when you think it isn’t.

There is no magic mathematical formula that will tell you whether or not it’s a good idea to hire somebody, and one of the most dangerous things that you can do is convince yourself that there’s a rubric you should be following in terms of scoring your candidates. Google manages to hire great employees, but they do so exactly because they ignore their own metrics and certain staff members pound on the table to give people a chance. Interviewing takes, above all else, a critical eye, and no criteria on paper will ever replace that.

I’d love to hear some discussion of your own experiences on either side of the table.

Ransomware gets smarter

El Reg writes:

Devious virus writers have come up with a new twist on ransomware-style malware.

A new strain of Trojan encrypts recently-opened files on compromised Windows PCs. But instead of demanding a ransom for a decryption key to unlock files, the malware relies on users to search the web for a possible way-out.

Hackers have cleverly baited searches for likely terms, with links to sites offering a supposed fix actually developed by the crooks behind the ruse.

A fuller explanation of the scam can be found Symantec’s write-up on the Ramvicrype Trojan here and in a blog posting by Symantec researcher Shunichi Imano here. ®

Say what you will about the data-centric approach of The New School of Information Security, there’s one fact that’s undeniable: money drives malware in the 21st century, and malware authors are getting smarter and smarter about how they take it.

ZFS Inline Deduplication

Those of you who have been following the lists, the bug trackers or Planet OpenSolaris know this already, but for the rest of you, Sun’s ZFS filesystem has just seen inline dedupe support merged into OpenSolaris trunk, presumably to appear in the next major OS release.

Jeff Bonwick has, as always, a very detailed blog entry about it, but here’s the only part you really need to know:

If you have a storage pool named ‘tank’ and you want to use dedup, just type this:

zfs set dedup=on tank

That’s it.

Just as simple as you would have imagined, given how easy everything else is in ZFS and OpenSolaris.

ZFS’s implementation is pretty neat. The filesystem was already pretty well-tuned for deduplication because ZFS has always kept end-to-end checksums of data in the first place to ensure the integrity of all data on the system. Now those checksums just happen to be used for something other than ensuring data integrity.
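To make the checksum idea concrete, here is a toy sketch in Python of block-level deduplication keyed on a content checksum. This is not ZFS’s actual implementation or on-disk layout: the DedupStore class, the 4-byte block size and the choice of SHA-256 are all assumptions I made for illustration. The point is only that the same checksum can both identify duplicate blocks on write and verify integrity on read.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to follow


class DedupStore:
    """Toy content-addressed block store (illustrative only, not ZFS internals)."""

    def __init__(self):
        self.blocks = {}    # checksum -> block bytes, each unique block stored once
        self.refcount = {}  # checksum -> number of references to that block

    def write(self, data):
        """Split data into fixed-size blocks and return their checksums."""
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # First time this content is seen: store the block itself.
                self.blocks[digest] = block
            # Duplicate or not, record one more reference to the block.
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            pointers.append(digest)
        return pointers

    def read(self, pointers):
        """Reassemble data, verifying each block against its checksum."""
        out = []
        for digest in pointers:
            block = self.blocks[digest]
            if hashlib.sha256(block).hexdigest() != digest:
                raise IOError("checksum mismatch for block " + digest)
            out.append(block)
        return b"".join(out)


store = DedupStore()
pointers = store.write(b"AAAABBBBAAAA")  # three logical blocks, two unique
restored = store.read(pointers)
```

In this toy run, three logical blocks are written but only two are actually stored, and the reference count on the repeated block is what tells the store when it is safe to free it.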

© 2014 @jgoldschrafe
