Thursday, 9 May 2013

Choosing Container Types

Choosing a container type is easy; just pick ArrayList and be done with it. Right?

Unfortunately, if you choose the wrong data type and don't encapsulate it properly then you're saddled with the results for the foreseeable future.

Using data structures


One of the basic principles of object-oriented design is that you can hide details (encapsulation). One of the best things to hide is the way you've got your data stored. At least that way if you do choose a list of pairs instead of a map, you can at least fix it locally!

In most of the legacy code bases I've seen, developers have gone the other way. They've spot-welded the choice of data structure onto objects and shouted it proudly about the code base.
// I AM DEFINITELY A LIST
class Customers : List<Customer> {

}
I'm not sure what particular brand of insanity causes this. Either customers is a List (in which case this is probably an abstraction too far) or it represents a group of customers that has specific behaviour.

Slightly better is to at least hide some details. Interface implementation isn't quite so spot-welded as implementation inheritance. At least now you can change the underlying implementation without telling the outside world. I've seen people object to this because of the "noise" it generates (all those delegating members). This monotony can often be auto-generated (thanks ReSharper) and these delegating members are often a stepping stone to a clearer design.
class Customers : IList<Customer> {
    private List<Customer> implementation;

    // a million and one delegating members
 }
Better yet is to completely hide the details. Make Customers an object in its own right. Give it methods to manipulate its data. Make it a living breathing object, and not just a pale copy of a collection.
class Customers {
   public Invoice GenerateInvoice();

   private Set<Customer> customers;    
}
Once you've got an object wrapping your collection, make sure you don't leak any more details than you need to.
class Customer {
   public Set<Customer> GetCustomers() {
       return customers;
   }
}
ARGH! Don't do this! In a few years time, someone will inadvertently capture that reference and do something bonkers.
var set = customer.GetCustomers();
logger.Debug("Found {0} customers", set.size());

// Clear the memory, clearly we don't need it.
set.Clear();
Ideally, don't expose the information. Your Customers object is responsible for managing that collection. If you expose its internal details to the world then it's got no chance!

If you have to, use the types (and if you must the objects) in your language that make sure you won't get aliasing problems (Collections.unmodifiableSet) or IEnumerable). As a slight aside, I strongly dislike how Java's iterators define remove, and instead I have to rely on a run-time guarantee of immutability (never did get a great answer to this question).

Choosing the right data structure


There are two parts to every data structure. The first is the abstract data type - what operations does the data structure allow? The second is the underlying implementation of the ADT. What are the trade-offs? Is it a linked list or an array list?

In my experience people tend to focus on the latter (how is it implemented) instead of choosing the right data structure in the first place. Every time I see a list of pairs instead of a map, or a list without duplicates instead of a set, I sigh a little.

As a quick example of the perils of choosing the wrong data structure, I've recently been working on refactoring some code of this form.
var xs = new List();
var ys = new List();

foreach(var y in ys) 
  if (xs.Contains(y))
    DoSomething();
Even with a moderate number of elements (10000 or so in both collections) this code has serious performance problems. Running the code through a profiler showed that the equality operator was executed over 60 million times! Testing the results was taking 10 times as long as performing the operation itself.

There's a number of problems with the code above, but the root cause is choosing the wrong data structure. Despite the rich collection libraries in Java/C# most developers plump for a list. Premature optimization is one thing, but choosing the wrong data structure is just as bad.

Hide your decisions about collections. At least that way you can fix it locally.