Thursday, 9 May 2013

Choosing Container Types

Choosing a container type is easy; just pick ArrayList and be done with it. Right?

Unfortunately, if you choose the wrong data type and don't encapsulate it properly then you're saddled with the results for the foreseeable future.

Using data structures


One of the basic principles of object-oriented design is that you can hide details (encapsulation). One of the best things to hide is the way you've got your data stored. At least that way if you do choose a list of pairs instead of a map, you can at least fix it locally!

In most of the legacy code bases I've seen, developers have gone the other way. They've spot-welded the choice of data structure onto objects and shouted it proudly about the code base.
// I AM DEFINITELY A LIST
class Customers : List<Customer> {

}
I'm not sure what particular brand of insanity causes this. Either customers is a List (in which case this is probably an abstraction too far) or it represents a group of customers that has specific behaviour.

Slightly better is to at least hide some details. Interface implementation isn't quite so spot-welded as implementation inheritance. At least now you can change the underlying implementation without telling the outside world. I've seen people object to this because of the "noise" it generates (all those delegating members). This monotony can often be auto-generated (thanks ReSharper) and these delegating members are often a stepping stone to a clearer design.
class Customers : IList<Customer> {
    private List<Customer> implementation;

    // a million and one delegating members
 }
Better yet is to completely hide the details. Make Customers an object in its own right. Give it methods to manipulate its data. Make it a living breathing object, and not just a pale copy of a collection.
class Customers {
   public Invoice GenerateInvoice();

   private Set<Customer> customers;    
}
Once you've got an object wrapping your collection, make sure you don't leak any more details than you need to.
class Customer {
   public Set<Customer> GetCustomers() {
       return customers;
   }
}
ARGH! Don't do this! In a few years time, someone will inadvertently capture that reference and do something bonkers.
var set = customer.GetCustomers();
logger.Debug("Found {0} customers", set.size());

// Clear the memory, clearly we don't need it.
set.Clear();
Ideally, don't expose the information. Your Customers object is responsible for managing that collection. If you expose its internal details to the world then it's got no chance!

If you have to, use the types (and if you must the objects) in your language that make sure you won't get aliasing problems (Collections.unmodifiableSet) or IEnumerable). As a slight aside, I strongly dislike how Java's iterators define remove, and instead I have to rely on a run-time guarantee of immutability (never did get a great answer to this question).

Choosing the right data structure


There are two parts to every data structure. The first is the abstract data type - what operations does the data structure allow? The second is the underlying implementation of the ADT. What are the trade-offs? Is it a linked list or an array list?

In my experience people tend to focus on the latter (how is it implemented) instead of choosing the right data structure in the first place. Every time I see a list of pairs instead of a map, or a list without duplicates instead of a set, I sigh a little.

As a quick example of the perils of choosing the wrong data structure, I've recently been working on refactoring some code of this form.
var xs = new List();
var ys = new List();

foreach(var y in ys) 
  if (xs.Contains(y))
    DoSomething();
Even with a moderate number of elements (10000 or so in both collections) this code has serious performance problems. Running the code through a profiler showed that the equality operator was executed over 60 million times! Testing the results was taking 10 times as long as performing the operation itself.

There's a number of problems with the code above, but the root cause is choosing the wrong data structure. Despite the rich collection libraries in Java/C# most developers plump for a list. Premature optimization is one thing, but choosing the wrong data structure is just as bad.

Hide your decisions about collections. At least that way you can fix it locally.

Tuesday, 23 April 2013

Lamenting the lack of RAII in C# / Java

One of the hardest problems in programming is managing resources. How do you make sure you don't have a memory leak, reclaim that opened file handle, or give up that network connection?


In the C language you get a file handle via the fopen function, and you absolutely must remember to return it with fclose. If you forget to call fclose then the underlying file descriptor in the operating system is not closed and eventually you'll run out of file handles. In the simple case, it's easy. Just open the file and close it. If you aren't careful, it starts to get pretty complicated though. What if you have multiple exit points in your function; you've got to remember to fclose everywhere. Manually managing resources is a really tough problem. Here's a simple example that leaves a file descriptor open. Can you see why?

int copyFile(char* szIn, char* szOut) {
    FILE* in;
    FILE* out;

    in = fopen(szIn,"r");
    if (in == NULL) return 7;

    out = fopen(szOut,"w");
    if (out == null) return 8;

    /* read from in, write to out */
    fclose(in);
    fclose(out);

    return 0;
}

C++ introduced the "RAII" pattern (resource acquisition is initialization) which is a fancy way of saying acquire in your constructor and release in your destructor. By using RAII I can write a file copy and not need to worry about remembering to close files.

int copyFilePlusPlus(char* szIn, char* szOut) {
   ifstream in(szIn);
   ofstream out(szOut);

   if (!in.is_open()) return 7;
   if (!out.is_open()) return 8;

   /* do the copy */

   return 0;
}

The knowledge about the resources can now be completely hidden in the object. Progress!


The most common resource to manage is memory. Garbage collection frees the programmer from having to worry about reclaiming memory (and hence makes it more difficult to leak memory). Unfortunately, the implementation of garbage collection cedes control of when objects will be released. The lack of deterministic finalization means that resources handling becomes more difficult. Without a destructor that runs when an object goes out of scope you can't hide the details of resource ownership quite as elegantly as you can with RAII. (Brian Harry discusses some of the reasons why C# doesn't have deterministic finalization here)


C# introduces the using syntax to deal with this problem. This gives you deterministic finalization giving you precise control over the lifetime of a resource.

using (var f = File.OpenRead("foo.txt")) {
  // use f here
}  

FileStream f = null;
try {
    f = File.OpenRead("foo.txt");
}
finally {
    if (f != null) f.Dispose();
}

The using statement expands out and lets us avoid writing a whole lot of boilerplate code to deal with it. This is progress of a kind, but it does mean we have to explicitly remember to call Dispose and we can't insulate the knowledge of the resource behind an object as we can in C++. Objects with unmanaged resources must implement the IDisposable interface and people who use these objects must remember to use using blocks.


Java was a bit slow in adopting this convention, but eventually got around to copying it with Java 7 (see try-with-resources).


Without the RAII pattern, Java and C# are more susceptible to leaking resources. As Herb Sutter says


I called this Dispose pattern “a coding pattern that is off by default and causing correctness or performance problems when it is forgotten.“ That is all true. You have to remember to write it, and you have to write it correctly. In constrast, in C++ you just put stuff in scopes or delete it when you're done, whether said stuff has a nontrivial destructor or not.


Tools like Gendarme can perform static analysis to attempt to find instances where variables haven't been disposed, but it's not perfect.


So what are the best practises to manage resources in Java/C#?


  • You should always dispose of locally created IDisposable or AutoCloseable objects. If you create a file stream to read in, remember to close it! Use tools like Gendarme or Resharper to check for violations.

  • If you write a class that has a member variable that is disposable, then that same class should also be disposable.

  • Have a clear convention for denoting who owns a given return value. For example, GetXYZ is unclear, does this return a new XYZ that I must dispose, or does it return a shared reference to an XYZ?

  • Make object lifetimes clear!

  • Fail fast - don't let disposed objects be used again (see the Disposable pattern)


The main point to take away is that resource management is hard, but it's often easy to assume it isn't until it goes wrong!

Sunday, 10 March 2013

You don't need tests to refactor

Whenever I do refactoring, the first step is always the same. I need to build a solid set of tests for that section of code. The tests are essential because even though I follow refactorings structured to avoid most of the opportunities for introducing bugs, I'm still human and still make mistakes. Thus I need solid tests. (Refactoring: Improving the Design of Existing Code, Fowler et al, 1999)

When I first read Refactoring, I believed that tests were a necessary prerequisite before making structural changes to the code. However, when I worked with legacy code this often left me paralysed: I can't change the code because it doesn't have any tests, but I can't write the tests because I can't change the code.

Reading Working Effectively with Legacy Code gave me escape route.  I could write simpler characterization tests to capture the behaviour of existing code and then refactor. Instead of writing a unit test to capture correctness, I could write a characterization test to capture the behaviour; if the behaviour is preserved, I've not broken the code.

In the years since Refactoring was written tools (such as Eclipse, IntelliJ and Resharper) have well and truly crossed the Rubicon. Over the years I've used these tools, they've gradually earned my trust to the point where I wouldn't consider writing Java/C# without the aid of such as tool.

Using a refactoring tool changes the way you approach code. I feel quite confident in saying that refactoring doesn't always require tests [1]. Instead of considering legacy code as an inanimate collection of methods, I view it as a lump of putty I can shape as needed. Refactoring tools allow you to quickly explore the design space before settling on a decision.

[1] In so much as you trust the refactoring tool! There's always the chance that the refactoring tool will break your code, but I believe these chances are vanishingly small in most circumstances (just don't use reflection!).