Monday 1 December 2014

How much time should you spend fixing bugs in legacy code?

There's a huge amount written about dealing with greenfield code. You start with practices such as test-driven development, walking skeletons and thin vertical stripes of functionality. Legacy code is much harder. Given hundreds of thousands of lines of poorly structured code, where do you start? Working Effectively with Legacy Code gives some great pointers: put seams in, get the tests in place and TDD the new feature work. I'm interested in the next level up: how do you balance feature work against bug fixing?

I've got an interesting problem. We've got a clump of legacy software that product management tell me needs new features, but we also know from support that the number of bugs is a worry. From my point of view as a development manager, I want data that allows me to make the right decision, and that requires evidence and an understanding of the scale and scope of the problem.

It is impossible to find any domain in which humans outperformed crude extrapolation algorithms, less still sophisticated ones (Expert Political Judgement: How good is it? How can we know?, via How to Measure Anything: Finding the Value of Intangibles in Business).

I'd like to move from a faith-based to a science-based approach to balancing new feature work against bug fixing.

One field that provides some inspiration is population estimation. Given a small sample size, how do you estimate the total population?

Mark and recapture is a common method for population estimation. Capture 100 animals, tag them and release them. Later, capture another sample: the proportion of tagged animals in that second sample should approximate the proportion of tagged animals in the whole population. If we had no morals whatsoever, we could release an update to 1000 users and record the bugs they find. We could then release the same update to another 1000 users and see how many of those bugs turn up again.
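To convince yourself that this works, here's a quick simulation sketch (the population size and sample sizes below are entirely made up for illustration):

```python
import random

TRUE_POPULATION = 1000   # the unknown total we are trying to estimate
SAMPLE_SIZE = 100        # animals captured on each visit

animals = range(TRUE_POPULATION)

# First visit: capture, tag and release.
tagged = set(random.sample(animals, SAMPLE_SIZE))

# Second visit: capture another sample and count how many carry a tag.
second_sample = random.sample(animals, SAMPLE_SIZE)
recaptured = sum(1 for a in second_sample if a in tagged)

# The tagged fraction of the second sample approximates the tagged
# fraction of the whole population, which is what lets us estimate N.
print(f"Tagged fraction in sample:     {recaptured / SAMPLE_SIZE:.3f}")
print(f"Tagged fraction in population: {len(tagged) / TRUE_POPULATION:.3f}")
if recaptured:
    print(f"Estimated population: {len(tagged) * SAMPLE_SIZE / recaptured:.0f}")
```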

Deliberately shipping the same bugs to two groups of users isn't a great way to do things, but it does give us a simple formula. If we use the same notation as Wikipedia, then

  • N is the total number of bugs
  • K is the number of bugs found by the first group
  • n is the number of bugs found by the second group
  • k is the number of bugs seen for a second time

This gives us a simple formula that we could use (N = Kn / k). For bugs in released products, it's even simpler. Since we can match bug reports against each other automatically, we can estimate N without doing anything too amoral. We can take the data from the latest release, arbitrarily divide the users in half, count how many bugs each half finds and count how many are reported by both.
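To make that concrete, a rough sketch of the estimate might look something like this (the reports_by_user structure and the worked numbers are illustrative assumptions, not real data):

```python
def estimate_total_bugs(reports_by_user):
    """Estimate the total number of bugs using N = K * n / k.

    reports_by_user maps a user id to the set of (deduplicated) bug ids
    that user reported; this shape is an assumption for illustration.
    """
    users = sorted(reports_by_user)
    half = len(users) // 2
    # Arbitrarily split the user base into two groups.
    first = set().union(*(reports_by_user[u] for u in users[:half]))
    second = set().union(*(reports_by_user[u] for u in users[half:]))

    K = len(first)            # bugs found by the first group
    n = len(second)           # bugs found by the second group
    k = len(first & second)   # bugs seen by both groups
    if k == 0:
        raise ValueError("no overlap between the groups, cannot estimate")
    return K * n / k

# For example: if the first half finds 50 bugs, the second half finds 40,
# and 10 of them are duplicates, the estimate is 50 * 40 / 10 = 200 bugs.
```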

After a bit of searching around, I found that this isn't a novel application of the idea. "How many errors are left to find?" covers the same ground from a software-testing perspective (and seems to have generated some controversy, judging by the response "Another silly quantitative model").

There are a lot of caveats with model-based approaches like this (what exactly counts as a bug, anyway?), but it's better than nothing.