Saturday 1 February 2014

The First International Conference on Software Archaeology

I recently attended The First International Conference on Software Archaeology, much more memorably shortened to #ticosa.

It was a slightly strange conference, in that it was never particularly clear what software archaeology was, but that was a good thing as it gave a great variety of talks encompassing everything from metrics, to tools for understanding, to philosophical thoughts on the architecture of information.

Process Echoes in Code

Michael Feathers opened the proceedings with a question: what's the real point of version control systems? The most common answer is that a VCS helps you roll back to previous revisions should something go wrong, or supports multiple product lines. The truth is that this doesn't really happen. If your team deploys something to production that goes wrong, then I imagine you'll revert the deploy (not the VCS) and simply deploy again. The real purpose of source control is change logging. By looking at those changes we can see the traces of the way we work, indelibly written in the version control system.

Michael demonstrated a tool (delta-flora) to explore the traces left in the source code. The tool was a simple Ruby program that mapped the Git commit history (SHA1, files changed, author, code diff) into method event objects (methods added, changed and deleted). This is a simple transformation, but one that seems to yield a vast amount of useful information.
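To make the transformation concrete, here is a minimal sketch in Python. It is not delta-flora itself (which is Ruby); the MethodEvent class, the method_events function and the idea of feeding it two {method name: body} snapshots are all assumptions made for illustration, and extracting those snapshots from a real diff is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodEvent:
    """Hypothetical event record; the field names are assumptions."""
    sha: str     # commit that produced the event
    method: str  # qualified method name, e.g. "Order.total"
    kind: str    # "added", "changed" or "deleted"


def method_events(sha, before, after):
    """Compare two {method_name: body} snapshots and emit one event per difference."""
    events = []
    for name in sorted(before.keys() | after.keys()):
        if name not in before:
            events.append(MethodEvent(sha, name, "added"))
        elif name not in after:
            events.append(MethodEvent(sha, name, "deleted"))
        elif before[name] != after[name]:
            events.append(MethodEvent(sha, name, "changed"))
    return events
```

Running this over every commit in turn gives a flat list of events that can then be grouped by method, class, author or date.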

Exploring the temporal correlation of class changes seems like an incredibly useful way of identifying an area of related objects. I'm working on a large, badly understood code-base. We're already finding that adding features requires touching multiple files. By mining information from the past, maybe we can make more educated decisions in the future?
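As a rough sketch of what mining that correlation might look like, the snippet below counts how often pairs of files change in the same commit. The function name and the input shape (one list of file names per commit) are my assumptions; something like the output of git log --name-only could be parsed into that shape.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each pair of files appears in the same commit.

    `commits` is an iterable of file-name lists, one list per commit.
    """
    pairs = Counter()
    for files in commits:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical history, invented for illustration.
history = [
    ["Order.java", "Invoice.java"],
    ["Order.java", "Invoice.java", "Tax.java"],
    ["Tax.java"],
]
most_coupled = co_change_counts(history).most_common(1)
```

Pairs with a high count relative to how often each file changes on its own are candidates for "these belong together" (or "these are tangled together").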

Another area Michael mentioned that sent my synapses firing was analysing classes by closure date. Even if you have a huge code-base, identifying the closed classes (those that haven't changed) helps reduce the surface area you have to understand. One graph he showed (plotting the set of active classes against the open classes) was particularly interesting.

I'd love to plot this on a real code-base, but my understanding is that whilst you've got open classes, chances are you haven't finished a feature and the code-base is in an unstable phase. Looking forward to trying this one out.
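A minimal sketch of how the open/closed split could be computed, assuming the history has already been reduced to (date, class name) change pairs. The function names are hypothetical, and "closed" here only means "no later change in the history we have", which a future commit can always overturn.

```python
from datetime import date


def closure_dates(changes):
    """Map each class to (first_change, last_change) from (date, class_name) pairs."""
    spans = {}
    for day, cls in changes:
        first, last = spans.get(cls, (day, day))
        spans[cls] = (min(first, day), max(last, day))
    return spans


def open_classes(changes, on):
    """Classes that already exist on `on` and still have a change after it."""
    spans = closure_dates(changes)
    return sorted(cls for cls, (first, last) in spans.items() if first <= on < last)


# Hypothetical change log, invented for illustration.
changes = [
    (date(2013, 1, 1), "A"), (date(2013, 6, 1), "A"),
    (date(2013, 2, 1), "B"),
    (date(2013, 2, 1), "C"), (date(2013, 12, 1), "C"),
]
```

Calling open_classes for each date in the history gives the series to plot.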

Are you a lost raider in the code of doom?

Daniel Brolund followed with a quick overview of the Mikado Method. The Mikado Method provides a pragmatic way of dealing with a big ball of mud. We've probably all experienced the "shockwave" refactoring (or refucktoring?) where we've attempted to make a change, only to find that change requires another change, then another and before you know it you have a change set with 500 files in and little or no confidence that anything works.

The Mikado Method helps you tackle problems like this by recognizing that doing things wrong and reverting is not a no-op. You've gained knowledge. Briefly, the method seems to consist of trying the simplest possible thing, then using the compiler (and any other feedback you have) to find the pre-requisites (e.g. "if only that class was in a separate package..."). By repeatedly finding the dependent refactorings you can arrange a safe sequence of refactorings to tackle larger problems.

I completely agree with this approach. Big bang refactorings on branches are no longer (if they ever were!) acceptable ways to work. Successful refactoring keeps you compiling and keeps you working in the smallest possible batch size. I liked the observation that the pre-requisites form a graph. I've previously worked in pairs where we've kept a stack of refactorings (the Yak stack?), but it's an interesting observation that sometimes it's a graph.
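If the pre-requisites really are a graph, a topological sort gives an order in which every refactoring's pre-requisites are already done. The sketch below uses Python's standard graphlib module (3.9+); the goal names and the safe_order helper are invented for illustration and are not part of the Mikado Method itself.

```python
from graphlib import TopologicalSorter

# Hypothetical Mikado graph: each goal maps to the pre-requisites discovered
# by trying it naively and then reverting.
prerequisites = {
    "extract billing module": {
        "move Invoice to own package",
        "break Order -> Invoice cycle",
    },
    "move Invoice to own package": {"break Order -> Invoice cycle"},
    "break Order -> Invoice cycle": set(),
}


def safe_order(graph):
    """Return the refactorings leaf-first, so pre-requisites always come earlier."""
    return list(TopologicalSorter(graph).static_order())


plan = safe_order(prerequisites)
```

A stack is just the special case where each goal has exactly one pre-requisite. If the graph contains a cycle, static_order raises graphlib.CycleError, which is itself useful information about the design.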

How much should I refactor?

Matt Wynne gave a great metaphor for keeping code clean. If you imagine that software engineers are chefs and their output is meals, then the code base is the kitchen. What does your kitchen look like?

Matt had an exemplar code base (Cucumber rewrite), created as greenfield code, test-first, small-team, small commits and no commercial pressures. By analysing commits, a rough and ready guess was that 75% of commits were pure refactoring.

So, how much should I refactor? The answer is simple.

More than you currently do.

Code Metrics

Keith Braithwaite gave us a talk about metrics and in particular the dangers of not knowing what you are doing.

He gave some examples from an earlier analysis that (allegedly) demonstrated that TDD produced bigger methods than test-last. This doesn't fit our intuition, and indeed a closer look showed that the analysis had based its results on the mean. If we plot the method length distribution, we find it's not a normal distribution but a power-law distribution, so the mean is dominated by a few very long methods. Doing a more statistically sound analysis actually gives the opposite result.
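A tiny illustration of how the mean can mislead on skewed data. The numbers below are invented (they are not the data from the talk), and the summarise helper is just a convenience for this example.

```python
from statistics import mean, median

# Hypothetical method lengths in lines, invented for illustration.
mostly_short = [2, 2, 3, 3, 3, 4, 4, 5, 5, 120]          # one huge outlier
uniformly_medium = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16]


def summarise(lengths):
    """Return (mean, median) so the two summaries can be compared side by side."""
    return mean(lengths), median(lengths)


skewed_mean, skewed_median = summarise(mostly_short)
even_mean, even_median = summarise(uniformly_medium)
```

Going by the mean, the first code base has the longer methods (15.1 lines against 13). Going by the median, the ranking flips (3.5 against 13), because a single 120-line method drags the mean upwards.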

The moral of the story for me was that reducing a data set without knowing what you are doing is very dangerous!

Visualizing Project History

Dmitry Kandalov showed us an amazing analysis of a number of open source projects by mining the version control history (see here). This was the highlight of the conference for me: seeing the interactive history of real code bases. Neat!

I really enjoyed seeing the way Scala and Clojure have evolved. Scala has progressively added more complexity and more code. Clojure however, has stabilised. Draw from that what you will!

Tools for Software Business Intelligence

Stephane Ducasse gave us an overview of some of the tools he used for software business intelligence. There was a call to action that we need dedicated tools for understanding code bases and I couldn't agree more with that. There were many interesting links:

Understanding Historical Design Decisions

Stuart Curran gave a presentation on "Understanding Historical Design Decisions". Stuart's perspective was very different as he comes from an information architecture / design background and didn't consider himself a programmer.

Some books to add to my ever-growing reading list:

Confronting Complexity

Robert Smallshire gave a talk on Confronting Complexity and returned us to metrics (see also notes from Software Architect 2013).

We started by analysing how to calculate cyclomatic complexity. One interesting observation was that cyclomatic complexity gives us a lower bound on the number of tests we need to get code coverage. If you follow this through, then if you add a conditional statement every fifth line, every five lines of code you write demand another test. Ouch.
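For a feel of the calculation, here is an approximate counter for Python source using the standard ast module: one plus the number of decision points. Which node types count as a decision point is my own simplification, so treat the result as a sketch rather than a faithful McCabe implementation.

```python
import ast

# Node types treated as decision points (an assumption of this sketch).
_BRANCHES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCHES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands, i.e. two extra branches.
            complexity += len(node.values) - 1
    return complexity


example = (
    "def f(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
)
score = cyclomatic_complexity(example)
```

The example function scores 4: the baseline of 1, plus the if, the and, and the for loop.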

We looked at a simpler proxy for code complexity, Whitespace Integrated over Lines of Text (WILT). This is a really simple measure and incredibly quick to calculate so it lends itself to visualizing code data quickly.
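A minimal sketch of the idea: add up the leading whitespace of every non-blank line. Treating a tab as four spaces and skipping blank lines are assumptions of this sketch, not necessarily part of the definition given in the talk.

```python
def wilt(source, tab_width=4):
    """Sum the indentation width of every non-blank line in `source`."""
    total = 0
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no indentation information
        expanded = line.expandtabs(tab_width)
        total += len(expanded) - len(expanded.lstrip(" "))
    return total


flat = "a = 1\nb = 2\n"
nested = "def f(x):\n    if x:\n        return 1\n    return 0\n"
flat_score, nested_score = wilt(flat), wilt(nested)
```

Flat code scores 0 and the nested function scores 16 (0 + 4 + 8 + 4). Because the measure needs no parser, it works on any language that is indented conventionally.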

There was a really good quote attributed to Rob Galanakis (technical director at Eve Online):

How many if statements does it take to add a feature?

Again, this comes back to one of the recurring themes of the conference, Bertrand Meyer's open-closed principle. One of my takeaways from this was to pay much more attention to OCP!

Rob mentioned that refactoring reduces complexity and gave the example of "Replace switch with polymorphism". I'd agree with this for the most part, but there are exceptions. Rename, for example, preserves code complexity but increases code comprehensibility: the two don't always align. It'd be interesting to hook a plugin into refactoring tools to calculate WILT before and after refactorings and report on the cumulative benefits.

Rob finished off by presenting an alternative model-driven approach to software engineering. The visualizations were neat and helped show the range of possibilities. That immediately seems like an improvement over other models such as COCOMO. Interestingly, going back to COCOMO shows that developer half-life isn't considered in the model, nor is complexity of the code produced (I guess the assumption is that complexity of the product => complexity of the code?).

Lightning Talks

Finally, we ended up with a set of lightning talks. Nat Pryce gave a quick demo of using neo4j to analyse a heap dump. Graph databases are cool!

Ivan Moore gave a few opinions on how you can protect your software for archaeologists from the future.

  • Ship your source with your product
  • Put your documentation in source control
  • Put your dependencies in source control (reminded me of nuget package restore considered harmful)
  • Make sure you put instructions to build the product in source control (chef!)

There was a presentation towards the end that showed how adding sound to a running program (initially for the purposes of accessibility) produced some interesting effects. I've done this kind of thing before (creating animations for log files). Sometimes you can just rely on your brain to find the interesting things when you present it in another way.


TICOSA was a great conference. There was a good line up of speakers and lots of interesting content to muse over. What would I like to see next year? I'd really like to hear more war stories. I'd love to hear stories of archaeological digs. I'd especially love to hear about restorations. My general impression is that very few code bases start a restore process and come out better at the end (usually you hear about the big rewrite and sometimes those fail too), but I'd love to hear otherwise!

I'm looking forward to getting back to work on Monday and scraping through the commit logs to see what I can uncover!