Bloom filters

For some reason the subject of Bloom Filters has come up a few times in conversation recently, and if you’re not solid on them they’re definitely worth exploring- both for their simplicity and because they’re foundational elements in many PB-sized data stores.

Alright, what are they?

The really really short answer is that they’re a fast, space-efficient way of asking whether a thing can be found within a huge set of other things. A Bloom Filter can only ever give one of two answers:

  1. Maybe
  2. No, definitely not

If you need to test for set membership where a set may consist of millions/billions/trillions+ of entries, they’re the go-to data structure. If you use some kind of NoSQL database, you’re probably leveraging them already.
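To make that contract concrete, the whole surface area is just two operations. Here’s a hypothetical interface for it- not from any particular library, though Guava’s BloomFilter, for instance, names them put and mightContain:

// Hypothetical interface- just the two operations a Bloom Filter supports.
public interface ProbabilisticSet<T> {
    void add(T element);             // record an element
    boolean mightContain(T element); // true = "maybe", false = "no, definitely not"
}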

How do they do that?

A Bloom filter is named after Burton Howard Bloom, a seemingly publicity-shy MIT grad who came up with the idea that hashing could be harnessed as a means not just of testing identity, but also of testing set membership [i.e. can I find object x in collection y?] at arbitrary scale.

Ok, so let’s agree that a hash function takes a bunch of data and produces some number. Let’s also agree, for the purposes of this discussion, that the function is deterministic- otherwise we’ll be here all day. Most languages provide a facility for hashing their built-in types, for example in Java:

"Hello".hashCode(); // will return 6960965

A Bloom Filter builds on this by fitting the output to a slot in a fixed-size bitmap, initially all zeroed out. For example, let’s say it has a size of 8*:

[figure: an empty 8-slot bitmap, all zeroes]

modding the hash value (69609650 % 8) says we should flip the bit at index 2:

[figure: the bitmap with the bit at index 2 flipped to 1]
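In code that single step is just a hash, a mod and a bitwise OR- a rough sketch using a long as our 8-slot bitmap (only the low 8 bits get used):

long bitmap = 0L;                                  // the 8 slots, all zeroed
int index = Math.floorMod("Hello".hashCode(), 8);  // 69609650 % 8 == 2
bitmap |= 1L << index;                             // flip the bit at that index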

What we have so far is in essence a crappy hash table- a Bloom Filter builds on this by using multiple hash functions. With a single hash function every collision looks like a hit and membership tests quickly become useless; by requiring several independently-chosen bits to all be set, a stray collision on any one of them is no longer enough to produce a false positive.

Imagine then that we have 2 more hash functions which, after modding, flip the bits at indices 5 and 7 for our “Hello” string:

[figure: the bitmap with the bits at indices 2, 5 and 7 set]
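Pulling that together, the insert side of a filter might look something like this- a minimal sketch with whatever hash functions you care to plug in, nothing production-grade:

import java.util.BitSet;
import java.util.List;
import java.util.function.ToIntFunction;

// Minimal sketch of a Bloom Filter over Strings with pluggable hash functions.
public class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final List<ToIntFunction<String>> hashFunctions;

    public TinyBloomFilter(int size, List<ToIntFunction<String>> hashFunctions) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashFunctions = hashFunctions;
    }

    // Run the element through every hash function, mod each result into the
    // bitmap and set the corresponding bits.
    public void add(String element) {
        for (ToIntFunction<String> hash : hashFunctions) {
            bits.set(Math.floorMod(hash.applyAsInt(element), size));
        }
    }
}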

Devising good quality hash functions is a black art, and also potentially expensive depending on what kind of objects you’re hashing. Commonly used implementations such as Google’s Guava take an idea from a paper by Kirsch and Mitzenmacher showing that two base hash outputs, combined in a loop as hash1 + i*hash2, behave well enough to do away with a bunch of unique implementations. There’s a little more to it but the core code is here.
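In rough terms- and this is a sketch of the idea, not Guava’s actual code- you hash once with a decent 128-bit function, split the result into two 64-bit halves, then derive every bit index from the running combination of the two:

// Sketch of the Kirsch-Mitzenmacher trick: derive k bit indices from two
// base hash values (e.g. the two halves of one 128-bit murmur hash),
// instead of running k genuinely distinct hash functions.
static int[] bitIndices(long hash1, long hash2, int numHashFunctions, int bitmapSize) {
    int[] indices = new int[numHashFunctions];
    long combined = hash1;
    for (int i = 0; i < numHashFunctions; i++) {
        indices[i] = (int) ((combined & Long.MAX_VALUE) % bitmapSize); // clear the sign bit, then mod
        combined += hash2; // equivalent to hash1 + i * hash2
    }
    return indices;
}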

The Test

We’re pretty much done- I was trying to build up to a great climactic dénouement but frankly, my writing style is too shoddy and Bloom Filters are too simple for that :(

There’s one last thing- how do we test for membership? We just put our object through the hash functions again. If all the bits at the corresponding indices are 1s, it’s a maybe. If any of them are still zeroed, it’s a no.

So if we see “Hello” again:

[figure: all three of the “Hello” bits are already set]

…we’re in business. Maybe. If we see “Goodbye”, and we don’t hit all ones:

[figure: the check for “Goodbye”- at least one of its bits is still zero]

…no dice.
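Bolted onto the earlier TinyBloomFilter sketch, the test is the exact mirror of add():

// Added to the TinyBloomFilter sketch from above.
public boolean mightContain(String element) {
    for (ToIntFunction<String> hash : hashFunctions) {
        if (!bits.get(Math.floorMod(hash.applyAsInt(element), size))) {
            return false; // a zeroed bit- definitely not present
        }
    }
    return true; // every bit set- maybe present
}

And in use, with a few throwaway stand-in hash functions plugged in (correlated and terrible- fine for a demo, nothing else):

TinyBloomFilter filter = new TinyBloomFilter(8, List.of(
        String::hashCode,
        s -> s.hashCode() * 31,
        s -> s.hashCode() * 17));

filter.add("Hello");
filter.mightContain("Hello");   // always true- maybe present
filter.mightContain("Goodbye"); // false, unless we’re unlucky enough to collide on every bit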

And that really is it; the most elegant data structures are always simple. Wikipedia has some useful material on false positive probabilities if you are in the position of sizing one of these things. If you were just curious, hopefully you can see how Bloom Filters are a cheap and effective way of avoiding costly lookups on a main data store- or even acting as a probabilistic replacement for it where the data is infinite or can’t be replayed.
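For reference- these are the standard textbook formulas, not anything specific to one implementation- for n expected entries and a target false positive rate p you want roughly m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. As code:

// Standard Bloom Filter sizing arithmetic for n expected entries and a
// target false positive probability p.
static long optimalNumBits(long n, double p) {
    return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}

static int optimalNumHashFunctions(long n, long m) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}

// e.g. 1 million entries at a 1% false positive rate works out at roughly
// 9.6 million bits (about 1.2 MB) and 7 hash functions.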


*I chose a power of 2 because many commercial implementations do this. x % y is the same as x & (y - 1) where y is a power of 2, and a single logical AND is much cheaper. The modulo operation is on a hot path, being called once for every hash function in the filter; the filters themselves are often called many thousands of times a second in place of brute-forcing the backing data store, so small constants like this add up.
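The equivalence in code, if you want to convince yourself:

int size = 8;                          // a power of 2
int hash = "Hello".hashCode();         // 69609650
int slow = Math.floorMod(hash, size);  // 2
int fast = hash & (size - 1);          // also 2- a single AND, no division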