Bloom filters

Tuesday 23 March 2004This is almost 21 years old. Be careful.

A Bloom filter is a clever data structure that lets you check if an element is a member of a set, while storing much less data than the entire set. The tradeoff is that there are false positives: you might ask if a member is in the set, and get the answer “yes”, when it should have been “no”.

Li Fan provides a thorough overview of the technique in his paper about a Summary Cache that uses the technique, but here it is in a nutshell:

The set is stored as a large bit array (say 220 bits). To add an element to the set, you hash it a few different ways to produce a handful of different 20-bit hashes. Each hash value is an index into the bit array. Set each indexed bit to 1. Some of these bits will already be 1 (from previous entries).

To test if an element is in the set, hash it with all the hash functions, and check the indexed bits. If any of them is 0, then the element is not in the set. If all of them are 1, then it probably is in the set. They are probably all 1 because the element had been hashed into the set before, but they could be all 1 because of a collection of other entries which happened to each set some of the bits. This is the false positive. Adjusting the number of bits in the array and the number of hashes changes the probability that a false positive will occur.

The Fan paper has more about it all, including tables of false positive probabilities. Another source is Some Motley Bloom Tricks.

An aside about discovery: this is one of those web items that would be hard to credit properly with a simple “via” reference. The other day, Bob Congdon linked to the Dictionary of Algorithms and Data Structures. I poked around a bit, and glanced at their page about Bloom filters (which doesn’t say much) before eventually finding the subject of yesterday’s post about exact string matching.

Then today, I was trying to parse Mark Pilgrim’s rant about TypeKey (I think he’s against it, I can’t tell. Sarcasm doesn’t always work), when I saw his reference to LOAF (not the joke LOAF, this one seems to be a real proposal). The LOAF proposal is a way to securely send your address book with email messages so that anyone can check if an address is in your book, but can’t actually read your book. They do it with a Bloom filter, natch.

That made two references in two days to something I’d never heard of before, so I had to look into it more.

Then I wondered about Burton Bloom, the inventor of the technique. The citation on his CACM paper listed him as working at Computer Usage Company, in nearby Newton Upper Falls, and there seems to be little else about him on the web. This excursion into Bloom filters as mimics of human memory credits him with AI Memo 47 about chess.

A search on Computer Usage Company (which I had also never heard of) turns up a claim that it was the first software company, and that it was founded in 1955 by Elmer C. Kubie. His name struck me, since I work for Kubi Software, no relation.

In a final coincidence, Elmer Kubie died just two weeks ago in Ithaca.

Comments

[gravatar]
And what's an even wilder coincedence for me is that Elmer Kubie was one of my inspirations, as my high-school calculus and computer science teacher, and then ultimately my senior-year thesis advisor. He was a fascinating person to listen to, though his recital of Benny Hill segments would probably be considered un-PC today. We knew he had been successful -- there was always talk about his cars, etc.

However, he was a tremendous inspiration as a brilliant engineer and a motivating teacher. And more importantly, his presence in the classroom putting up with the shenanigans of high-schoolers every day gave us a great lesson: a lesson about the societal and personal rewards of completing a successful entrepreneurial business career and then using your "retirement" to confer that wisdom to students. I hope to be able to follow in his footsteps in this regard.

In fact, I still have a vivid memory of his description of the optimizations they did to programs running on the IBM 650, an early computer with magnetic drum storage. They wrote a program optimizer that placed the next instruction at the disk location the head would be over when the previous instruction finished executing, thus avoiding excessive waits for the rotation of the rotation. (Described in more detail midway down at
http://mywebpages.comcast.net/georgetrimble/A.html)
[gravatar]
In fact, I still have a vivid memory of his description of the optimizations they did to programs running on the IBM 650, an early computer with magnetic drum storage.
[gravatar]
Elmer was my Uncle, a more funny and charming fellow you couldn't meet on a bet. I managed a Master's from Clarkson university and still couldn't understand half the stuf he would try to explain to me.
[gravatar]
elmer kubie was my teacher in high school in the 70s and I remember him to this day -- a wonderful pleasant guy.

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
Comment text is Markdown.