Friday, 12 November 2010

Google Refine

Google Refine is a toy for playing with data, originally developed by the team behind Freebase. It's a tool for taking messy data sources and getting some structure into them. This is exactly the sort of thing I need for my super-secret-world-changing-awesome-side-project. OK, it's not that super yet, it isn't released, and it probably won't change the world, but it is fun, and it does require adding some structure to the messy data sets that are so easy to find on the web.

Refine can import data from a variety of sources, either on the web or in a data file on disk. To have a quick play with this, I grabbed the list of UK Prime Ministers from here and spent a couple of minutes in Emacs reformatting it as a CSV file. Once you've imported the data into Refine, you can automatically reconcile it against structured information available online (such as Freebase). This gives you the best matches for each name. The screen below shows the results after reconciling with Freebase.
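To give a feel for what "best matches for each name" means, here is a minimal sketch of name reconciliation in Python. It is not how Refine or Freebase actually does it: it swaps in plain string similarity from the standard library's difflib, and the candidate table, the IDs, the `reconcile` function and the 0.8 cutoff are all made up for illustration.

```python
import difflib

# Hypothetical stand-in for an online reference source.
# The names are real prime ministers; the IDs are invented placeholders.
CANDIDATES = {
    "Robert Walpole": "id-001",
    "William Pitt the Younger": "id-002",
    "Winston Churchill": "id-003",
}


def reconcile(name, candidates=CANDIDATES, cutoff=0.8):
    """Return (matched_name, candidate_id, score) for the most similar
    candidate, or None if no candidate scores at least `cutoff`."""
    best = None
    for candidate, candidate_id in candidates.items():
        # Case-insensitive similarity ratio between 0.0 and 1.0.
        score = difflib.SequenceMatcher(
            None, name.lower(), candidate.lower()
        ).ratio()
        if score >= cutoff and (best is None or score > best[2]):
            best = (candidate, candidate_id, score)
    return best


if __name__ == "__main__":
    for raw in ["Winston Churchill", "Sir Winston Churchill", "Zaphod Beeblebrox"]:
        print(raw, "->", reconcile(raw))
```

A real reconciliation service presumably uses far more than spelling (types, dates, aliases), so treat the cutoff as a knob to play with rather than a recommendation.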

I'm not sure if it's entirely clear from the image, but about 40% of the prime ministers have automatically been matched to the appropriate entry in Freebase! A few more clicks and the entire data set can be reconciled against an online source and then exported in a variety of formats, including HTML, CSV and JSON. This is exactly the kind of data matching I'm after, because once you've got a Freebase ID you can look up all the extra information very easily. So easy, it almost feels like cheating!
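For comparison, here is a rough sketch of what exporting a reconciled data set as CSV and JSON might look like if you did it by hand in Python. The row layout, the `match_id` column and the placeholder ID are assumptions for illustration, not Refine's actual export format and not real Freebase IDs.

```python
import csv
import io
import json

# Hypothetical reconciled rows: an empty match_id means "no match yet".
ROWS = [
    {"name": "Winston Churchill", "match_id": "id-003"},
    {"name": "Spencer Perceval", "match_id": ""},
]


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["name", "match_id"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Serialise the rows as an indented JSON array of objects."""
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    print(to_csv(ROWS))
    print(to_json(ROWS))
```

Keeping the ID in its own column is the useful part: anything downstream can use it as a key to pull in the extra information.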