The job of a search engine is simple: given some search terms, find the relevant documents containing those terms. I thought it'd be fun to write a very simple one and see what I could learn along the way.
search.boringtosser.co.uk contains the end result.
The most fundamental data structure in a search engine is the inverted index, which maintains a mapping from content (such as words or numbers) to the documents that contain those terms. For example:
Doc 1 = quick brown fox
Doc 2 = brown owl
Doc 3 = fox tango
quick => { 1 }
brown => { 1, 2 }
fox => { 1, 3 }
owl => { 2 }
tango => { 3 }
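To make the idea concrete, here is a minimal in-memory sketch in Haskell (illustrative only; later on the real index lives in Redis):

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

type DocId = Int
type Index = M.Map String (S.Set DocId)

-- Build an inverted index from (document id, contents) pairs by
-- emitting a (term, doc) pair per word and merging the doc sets.
buildIndex :: [(DocId, String)] -> Index
buildIndex docs = M.fromListWith S.union
  [ (term, S.singleton docId) | (docId, text) <- docs, term <- words text ]
```

Running buildIndex [(1, "quick brown fox"), (2, "brown owl"), (3, "fox tango")] reproduces the mapping above.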
An inverted index makes it simple to go from a search term to the documents containing that term. There are (at least!) a couple of problems with blindly building an inverted index from the data, though. First, how do you make sure that variations of the same word map to the same entry? For example, if I search for "bananas" I'd like documents containing the term "banana" returned as well. Secondly, how do you deal with words like "the" that are going to occur in almost every single document?
Stemming is a process that reduces words to their base form. For example, "eating", "eaten" and "eat" all result in the same stemmed word, "eat". Most search engines deal with common words by filtering them out. These words are known as stop words, and handling them is as simple as filtering against a known list.
Users communicate with a search engine using a query language. For this toy example, I'll use a simple query language that supports keyword matching and the Boolean operators AND and OR.
In order to show off the search engine we'll also need a data set. I've used the Jar of Fortune data as this gives lots of short, simple documents.
(image generated using Jim's fortune cookie image generator and text from XKCD Fortune Cookies)
Implementation
Where should the inverted index be stored? You could simply keep this data in a hash table, but then you have to worry about persisting the data, scaling to large amounts of it, and so on. A key/value store provides a very efficient way of storing this map-based data. My current favourite store is Redis, as it's well documented, stable, ultra-fast and has a very simple API usable from a variety of languages, including Haskell.
To make things simpler, I'll also store the documents themselves in Redis using the standard string functions. This is OK for this example, as I'm just storing small strings; indexing larger files would probably call for some disk-based structure. I'll use Redis's sorted set feature to store the word-to-document mapping. A sorted set stores a set of strings, where each entry has an associated score; here the score represents how many times the word has occurred in a document, so the higher the score, the more relevant the search result.
The first task is to turn a lump of text into a series of terms to index into Redis. Stop word removal is a bit of a no-brainer: just grab a giant list of stop words and filter them out with a predicate such as isStopWord :: T.Text -> Bool. The complete code is dull, unexciting and here!

Stemming is more interesting. Developed in 1980, the Porter stemming algorithm applies five steps of simple suffix-stripping rules to reduce a word to its stem. Better still, there is already an existing Haskell implementation. I used that implementation and adapted it ever so slightly to use Data.Text. The full version of the stemming code is here.
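Putting the pieces together, the term pipeline looks something like this sketch (stopWords here is a tiny stand-in for the giant list, and the stem step is a placeholder where the adapted Porter implementation slots in):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Set as S
import qualified Data.Text as T

-- A tiny stand-in; the real list holds a few hundred entries.
stopWords :: S.Set T.Text
stopWords = S.fromList ["the", "a", "an", "and", "of", "to", "is"]

isStopWord :: T.Text -> Bool
isStopWord w = w `S.member` stopWords

-- Lowercase, split into words, drop stop words, then stem.
termsOf :: T.Text -> [T.Text]
termsOf = map stem . filter (not . isStopWord) . T.words . T.toLower
  where stem = id  -- placeholder: the Porter stemmer goes here
```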
The Jar of Fortune data is in two different formats (one '%' separated, the other blank-line separated). A quick script parses those from one giant text string into individual elements, and adds a reference to the document for each term. For example, if the quote is "quick fox" and the quote is #123 in file foo, then two entries would be created: (quick -> (foo#123,1), fox -> (foo#123,1)). The core of the code for indexing cookies is below.
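A rough sketch of that indexing step, assuming the hedis client library (the real code is on github and may differ in detail):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Database.Redis as R
import Control.Monad (forM_, void)

-- Stand-in for the full tokenise/stop-word/stem pipeline above.
termsOf :: T.Text -> [T.Text]
termsOf = T.words . T.toLower

-- Index one fortune cookie: store its text under doc:<id>, then bump
-- the document's score in the sorted set of every indexable term.
indexDoc :: T.Text -> T.Text -> R.Redis ()
indexDoc docId body = do
  void $ R.set (TE.encodeUtf8 ("doc:" `T.append` docId)) (TE.encodeUtf8 body)
  forM_ (termsOf body) $ \term ->
    void $ R.zincrby (TE.encodeUtf8 ("term:" `T.append` term)) 1 (TE.encodeUtf8 docId)
```

Indexing quote #123 of file foo as "quick fox" therefore issues ZINCRBY term:quick 1 foo#123 and ZINCRBY term:fox 1 foo#123, giving exactly the two entries above.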
Next up is a way of performing the search, but before we can proceed with that we need to define what a search actually is. Most search engines support a simple query language, providing features like Boolean expressions and exact phrase matching. For my little toy engine, I'll stick with supporting exact matches for single words and the Boolean operators 'AND' and 'OR'. Implicit ORs are placed between adjacent terms; for example, searching for resplendent fish is treated as resplendent OR fish.
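The AST for such a language needs only three cases; a sketch:

```haskell
import qualified Data.Text as T

data Query = Contains T.Text  -- exact match for a single word
           | And Query Query
           | Or  Query Query
           deriving Show
```

Under this representation, resplendent fish parses to Or (Contains "resplendent") (Contains "fish").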
Once the query is parsed we need to evaluate the AST and get the results. The leaf case is simple: given a "Contains" query, the only task is to build the search term from the text and look up the values in the inverted index. The Boolean operators are more interesting. "And" and "or" correspond to set intersection and set union respectively. One way to implement this would be to look up the values for the left-hand side and the right-hand side and calculate the intersection ourselves, but this is a poor choice because it involves pulling all the matching documents out of Redis into the application. A better choice is to make Redis do the heavy lifting: it provides ZINTERSTORE and ZUNIONSTORE for just this purpose. Given two (or more) existing keys, they calculate the intersection or union and store the result under a new key. Storing intermediate results in new keys runs the risk that memory usage will grow unbounded, but it does allow query-fragment caching (as long as the intermediate key identifies the query). Although this doesn't make much difference for such a small data set, on a large data set caching frequently used query strings is likely to give some benefit. Redis provides commands such as TTL, EXPIRE and PERSIST to help address the memory-consumption issue. I've made sure that intermediate keys have a short TTL, which should (hopefully) mean that memory use doesn't grow without bound.
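A sketch of that evaluation strategy, again assuming hedis (the tmp: key-naming scheme and the 60-second TTL are illustrative choices rather than the exact values in the real code):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Database.Redis as R
import Control.Monad (void)

data Query = Contains T.Text | And Query Query | Or Query Query  -- as above

-- Evaluate a query, returning the Redis key that holds the result
-- set. Leaves map straight onto index keys; AND/OR are pushed down
-- to ZINTERSTORE/ZUNIONSTORE so documents never leave the server.
evalQuery :: Query -> R.Redis ByteString
evalQuery (Contains t) = return ("term:" `B.append` TE.encodeUtf8 t)
evalQuery (And l r)    = combine R.zinterstore "&" l r
evalQuery (Or  l r)    = combine R.zunionstore "|" l r

combine :: (ByteString -> [ByteString] -> R.Aggregate
              -> R.Redis (Either R.Reply Integer))
        -> ByteString -> Query -> Query -> R.Redis ByteString
combine op sep l r = do
  kl <- evalQuery l
  kr <- evalQuery r
  -- the destination key encodes the sub-query, so a repeated query
  -- fragment is served from its cached intermediate result
  let dest = B.concat ["tmp:", kl, sep, kr]
  void $ op dest [kl, kr] R.Sum
  void $ R.expire dest 60  -- short ttl stops intermediates piling up
  return dest
```

Fetching the actual results is then just a range query over the returned key, e.g. R.zrevrangeWithscores dest 0 9 for the ten highest-scoring documents.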
I used Parsec to turn a query string into an abstract syntax tree (mostly based on the example code on the Haskell wiki). The query function processes a search request and returns a key containing the results, which can be retrieved using getKeyValues.
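The parser itself can be small. Here's a sketch of the approach (names like word, keyword and parseQuery are my own; the real code follows the wiki example more closely):

```haskell
import Text.Parsec
import qualified Data.Text as T

data Query = Contains T.Text | And Query Query | Or Query Query
  deriving Show  -- as above

type P = Parsec String ()

-- A bare search word: alphanumeric characters, but never a keyword.
word :: P T.Text
word = try $ do
  w <- many1 alphaNum <* spaces
  if w `elem` ["AND", "OR"] then parserZero else return (T.pack w)

-- A keyword must end at a word boundary, so "ANDROID" stays a term.
keyword :: String -> P ()
keyword k = try (string k *> notFollowedBy alphaNum *> spaces)

-- Terms chained left-associatively by AND, OR, or an implicit OR.
expr :: P Query
expr = chainl1 (Contains <$> word) op
  where op =  (And <$ keyword "AND")
          <|> (Or  <$ keyword "OR")
          <|> pure Or  -- implicit OR between adjacent terms

parseQuery :: String -> Either ParseError Query
parseQuery = parse (spaces *> expr <* eof) "query"
```

In ghci, parseQuery "resplendent fish" gives Right (Or (Contains "resplendent") (Contains "fish")).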
After all this I have a query engine that works in ghci. That's a bit boring, so it's time to get it deployed on a web server somewhere. I used Yesod to create a website. For a simple one-page site, the only code I had to write was to initialize the Redis connection and the logic to execute the search request. The getRootR function just loads the templates to display the home page, and getSearchR performs the search.
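A minimal sketch of that skeleton, assuming a recent Yesod (the real site's routes, templates and Redis wiring live in the repo):

```haskell
{-# LANGUAGE OverloadedStrings, QuasiQuotes, TemplateHaskell, TypeFamilies #-}
import Yesod
import Data.Maybe (fromMaybe)

data App = App

mkYesod "App" [parseRoutes|
/       RootR   GET
/search SearchR GET
|]

instance Yesod App

-- Home page: just a search box.
getRootR :: Handler Html
getRootR = defaultLayout
  [whamlet|
    <form action=@{SearchR}>
      <input name=q>
      <input type=submit value=Search>
  |]

-- Search: pull out the q parameter and run it against the engine.
getSearchR :: Handler Html
getSearchR = do
  q <- fromMaybe "" <$> lookupGetParam "q"
  -- parse q, evaluate it against Redis and render the hits here
  defaultLayout [whamlet|<p>Results for #{q}|]

main :: IO ()
main = warp 3000 App
```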
The biggest problem I ran into was messages such as <socket: 7>: hFlush: resource vanished (Broken pipe). This was because I had tried to keep a single connection permanently open to Redis instead of using a pool. After some helpful people in #haskell pointed me in the right direction, I switched over to using Data.Pool and all was good.
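The fix looks roughly like this, sketched with resource-pool's classic createPool API and hedis (the pool parameters are illustrative):

```haskell
import Data.Pool (Pool, createPool, withResource)
import qualified Database.Redis as R
import Control.Monad (void)

-- createPool takes: create, destroy, stripe count, idle time, max size
mkRedisPool :: IO (Pool R.Connection)
mkRedisPool = createPool
  (R.connect R.defaultConnectInfo)    -- open a fresh connection
  (\c -> void $ R.runRedis c R.quit)  -- close it when it idles out
  1                                   -- one stripe is plenty here
  60                                  -- keep idle connections for 60s
  10                                  -- at most ten connections

-- Each request borrows a connection instead of sharing one socket.
runQuery :: Pool R.Connection -> R.Redis a -> IO a
runQuery pool action = withResource pool (\c -> R.runRedis c action)
```

(With hedis specifically, connect already hands back a pooled connection, so an explicit Data.Pool layer mainly matters for clients that expose a single socket.)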
The deployed application is available at http://search.boringtosser.co.uk and the complete set of code is available on github (and hopefully it comes with a working cabal file this time!). Any suggestions on how to improve the code, or anything I'm just doing stupidly, are much appreciated!