Your Performance Metrics Are Terrible

Most tech people I know are aware of the difficulty inherent in measuring their performance. Companies try lots of ways to figure this out, for a variety of well-meaning but misguided reasons, often simply trying to determine fairly who deserves bonuses, how big raises should be, etc.

This is not a new problem. Joel Spolsky was ranting about the silliness of metrics a decade ago, and with good reason: they simply don’t work. They invariably end up becoming a fetishized proxy for the thing you actually want to measure. In the case of software development/IT, most everyone who does actual work knows why any given metric is broken:

Source Lines Of Code
I don’t think anybody actually uses this, because it’s so obviously bad: judging programmer performance based on how big their code is? Keeping in mind that code size is (depending on the compiler) often inversely proportional to its speed, you’re rewarding the guy who writes the slowest code. It’s also trivial to pad; see the sketch after this list.
Variants of SLOC: lines removed, lines per feature, etc.
If SLOC is a bad metric, then why would variants of SLOC be any better? “I deleted a bunch of corner-case handling! Sure, you’ll spend hours trying to fix the bugs that come in as a result, but give me a raise now!” Also, define “feature.”
Bug Severity
A bug can lie dormant for years before springing up and ending up on the front page of TechCrunch, so do you delay the bonus until the code is no longer in use?
Bug Count / Visibility
I think we can all agree that a minor issue which impacts half of all users will be reported more often and more vocally than a severe data corruption/remote execution bug which impacts one user in ten thousand, even though the latter is indisputably worse.
Bugs Closed Per Interval
This one will pretty much guarantee your team spends as much time as possible fixing one-liners. Depending on how honest they are, they may or may not spend the rest of the time inserting bugs that can be fixed by one-liners.
Story Points Per Interval
This is a scrum-ism, of course, and at best it will be a stand-in for “did you accomplish what you said you would accomplish,” assuming your team does not suffer from feature creep or shit’s-easy syndrome, keeps its estimates honest and its backlog in decent shape; in other words, assuming you are not screwing up any of the myriad things you could screw up with scrum. At the other end of the scale, you get nonsense like estimate inflation, sliding all the way down into price-fixing cartels and the like.
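To make the gaming concrete, here is a minimal sketch in Python (hypothetical code, not from any real codebase or tool mentioned above): two functions with identical behavior, one padded to several times the line count, and a naive line-counting “metric” that dutifully rewards the padding.

```python
# Hypothetical illustration of why raw SLOC rewards padding:
# both functions below do exactly the same thing.
import inspect

def total_terse(values):
    # One-line implementation.
    return sum(v for v in values if v > 0)

def total_padded(values):
    # The same logic, stretched out to inflate the line count.
    result = 0
    for v in values:
        if v > 0:
            result = result + v
        else:
            pass
    return result

def sloc(func):
    # Naive "source lines of code" metric: count non-blank lines.
    source = inspect.getsource(func)
    return sum(1 for line in source.splitlines() if line.strip())

if __name__ == "__main__":
    data = [3, -1, 4, -1, 5]
    assert total_terse(data) == total_padded(data)  # identical behavior
    print("terse SLOC: ", sloc(total_terse))   # small number, same feature
    print("padded SLOC:", sloc(total_padded))  # big number, same feature
```

The padded version “wins” by a factor of several under the metric while delivering nothing extra; most of the variants above fall to the same trick, just inverted or divided by something.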

Obviously, these are just the metrics I can think of off the top of my head which apply exclusively to software development. Improper metrics used in quasi-technical fields outside of software development itself are typically even more destructive: time-to-resolve on the support queue means your staff will spend their time trying to get customers off the phone, or otherwise close the ticket. Time-to-reply means your staff will be penalized for actually fixing problems, rather than obsessively clicking Send/Receive (or Reload). As a customer, you can probably recall the negative effects of these metrics from your own interactions with tech support.

At any rate, if someone seriously suggested that hiring/firing decisions would be based on these metrics, engineers would rightly view that person as a bit off. It’s the sort of thing you would expect out of some huge, hideously dysfunctional bureaucracy where sweeping policy decisions are made by the tragically uninformed. It’s something out of a Dilbert cartoon.

People with “enlightened” management experience recognize the problem is that none of these things is measuring what you really care about: how good is the programmer writing this code? Lots of organizations have gotten this far in their support metrics—my own personal opinion is that this is why you’re getting all that “customer experience survey” spam these days. Unfortunately for those who prefer bean counting, the skills of your employees can only be adequately measured by an honest appraisal of the respect they have from their peers, with the tacit assumption that people can be both honest and ruthless when assessing the skills of peers in their field.

In most of the companies I’ve worked for, that sort of appraisal—though never, ever stated explicitly—is exactly how decisions on hiring and firing are made. Nobody wants to work with people they don’t respect or lose the people that everybody respects, but they also don’t want to feel like an asshole if someone does get the boot.

This is why companies that decide to try to make this “respect factor” explicit and twist it into some crazy policy fail. There may have always been consequences to the “rockstar or asshat” discussions, but part of the reason they work is that they are not viewed or treated as part of the formal process—they allow the people participating in them to (effectively) get someone fired without feeling like they just got someone fired.

By making that a formal policy, you’ve just made the obviously stupid move of turning your company into a particularly banal episode of Survivor—on par with announcing that veal will be served daily, and that employees must kill and dress their own calf. Beyond that, you’ve made the even more subtle and fundamental error of assuming that 20% of any given population deserves to be fired (a really good team will have nobody that deserves to be fired, whereas a really bad team may have lots of people who do).

It is uncontroversial to point all this out. No one who has ever worked as a programmer is silly enough to suggest that the lack of a viable “objective” performance metric for programmers is somehow an intolerable situation that must be rectified. And we know that if a manager using silly metrics with demonstrable exploits were deciding who to fire based on those metrics, they would be crazy. We know that managers who refuse to abandon the use of metrics altogether because they haven’t been given a better metric are equally foolish, and we would have no problem telling them that the entire idea (or paradigm, if that word gets you going) of using so-called “objective” metrics in this way is flawed in both concept and execution: subject to abuse, gaming, and outright cheating, and a way of focusing on something other than broader outcomes.

That is, of course, until we stop talking about ourselves and start talking about fields outside our expertise and the people working in them. Then we can’t get enough of performance metrics, even to the point of hanging on to methods and measurements we know are dumb, as though we and our guts (i.e. no information) would make better decisions if only we could replace the lack of information with bad information.

To take one such example, let’s go straight to a topic I had a heated discussion about recently: public education in the United States. For some background, the Bush Administration’s No Child Left Behind Act instituted a national system of performance metrics for teachers, based on standardized tests of students.

I would hope that most people are aware of the general problems with standardized tests:

Cultural Bias
I realize that most members of dominant cultures (e.g. straight, white, male, English-speaking knowledge workers living in the United States) will find this controversial, but that’s because most members of dominant cultures, myself included, are over-privileged pantywaists, often either unable or unwilling to think beyond the tasteless beige box they grew up in. To put it another way, if your college admissions exam depended on your correct pronunciation of Malay or French, chances are you’d fail pretty hard too.
Test anxiety
You freak yourself out about the test, and make yourself an emotional wreck when you’re taking it. Unsurprisingly, your test scores suck, even if you knew the information.
Information Loss
When you reduce a person’s knowledge to a point scale, you immediately lose all the commentary that would go along with it. Even partial credit won’t tell you if the person deduced the answer from the question on the spot (very sharp) or mostly-memorized it in advance (who knows).
Gaming
Somewhat related, if you as a teacher simply have your students memorize the responses to questions in advance, they will score reasonably well on most standardized tests, but all you have done is reduce them to literate farm animals.
Validity
Namely, does the test actually measure what it claims to measure? For example, does the biology exam cover evolutionary theory—because, as has famously been noted, it’s damned hard to make heads or tails of biology without it.

These objections are, of course, why smart businesses prefer rambling, stream-of-consciousness wrong answers provided in person to instant right answers provided on paper—all the right answer tells me is that I’m asking you to solve a problem you’ve already studied, or at best, that your pattern matching centers are quick enough to understand when the problem you’re facing is part of a class of related problems that you’ve already studied. This may be useful information for me to have, but certainly doesn’t tell me how your brain will work when we’re outside your direct experience, which is a far better test of your abilities.

It’s also why most businesses worth working for will eschew the use of skills assessment tests as a final answer to whether someone is hired or not: they may or may not be a legitimate guide to a person’s skills, but certainly can’t tell you anything more than what you’re going to get from a few one-on-one interviews with a person.

Of course, with schools, we take this a step further: not only do we assume that these sort of rote tests are adequate indicators of a student’s knowledge (which they may or may not be), we assume that they are adequate indicators of the quality of their instructors as well. So start with corrupt data, then assume that there is only one primary factor (instruction quality) to explain the result, then fire people, close schools, set budgets, and push political agendas based on those results.

The results are what one would expect: the people who now live under this system do what anyone else whose livelihood depends on silly metrics does: they game the system, since that is what their future will be judged on. Of course, this system ends up creating a generation of students who have learned (at best) how to take pencil tests and regurgitate information they have already seen, often with no more than the dimmest comprehension of it.

So why do we do this? Why do we say that schools must operate under conditions that programmers (as a class) would rightly reject as counterproductive, stupid, and hostile? There certainly are a lot of possible answers, but I get the impression that there’s a deeper problem we’re after: why do people prefer data they know is wrong, biased, or otherwise corrupted to examining their premises?
