Bloom filters

For some reason the subject of Bloom Filters has come up a few times recently in conversations and if you’re not solid on them, they’re definitely worth exploring both for their simplicity and because they are foundational elements in many PB-sized data stores.

Alright, what are they?

The really really short answer is that they’re a fast, space-efficient way of asking whether a thing can be found within a huge set of other things. A Bloom Filter can reply with two answers:

  1. Maybe
  2. No, definitely not

If you need to test for set membership where a set may consist of millions/billions/trillions+ of entries, they’re the go-to data structure. If you use some kind of NoSQL database, you’re probably leveraging them already.

How do they do that?

A Bloom filter, named after Burton Howard Bloom, a seemingly publicity-shy MIT grad came up with the idea that hashing could be harnessed as a means not just of testing identity, but also set membership [i.e. can I find object x in collection y?] at arbitrary scale.

Ok, so let’s agree that a hash function takes a bunch of data and produces some number. Let’s also agree, for the purposes of this discussion that the function is deterministic, otherwise we’ll be here all day. Most languages provide a facility for identifying primitives based on a hash, for example in Java:

"Hello".hashCode(); // will return 6960965

A Bloom Filter builds on this by fitting the output to a slot in a fixed-size bitmap, initially all zeroed out. For example, let’s say it has a size of 8*:

empty-bitmask

modd’ing the hash value (690965 % 8) says we should flip the bit at index 5:

1-bit-flipped

What we have so far in essence is a crappy hash table- a Bloom Filter builds on this by using multiple hash functions. Without this our bitmap would quickly fill up regardless of size and the collisions would render membership tests useless.

Imagine then that we have 2 more hash functions that we modded and, for our “Hello” string, flipped bits in indices 2 and 7:

fully-hashed

Devising good quality hash functions is a black art and also potentially expensive depending on what kind of objects you’re hashing. Commonly used implementations such as Google’s Guava take an idea from a paper by Kirsch and Mitzenmacher showing that taking a pair of positive function outputs and repeatedly multiplying them by a loop variable introduces enough entropy to do away with a bunch of unique implementations. There’s a little more to it but the core code is here.

The Test

We’re pretty much done- I was trying to build up to a great climactic dénouement but frankly, my writing style is too shoddy and Bloom Filters are too simple for that :(

There’s one last thing- how do we test for membership? We just put our object through the hash functions again. If all the bits at the corresponding indices are 1s, it’s a maybe. If any of them are still zeroed, it’s a no.

So if we see “Hello” again:

present

…we’re in business. Maybe. If we see “Goodbye”, and we don’t hit all ones:

not-present

…no dice.

And that really is it; the most elegant data structures are always simple. Wikipedia has some useful material on false positive probabilities if you are in the position of sizing one of these things. If you were just curious, hopefully you can see how Bloom Filters are a cheap and effective way of reducing costly lookups on main data stores- or even a probabilistic replacement for them in the case where data is infinite or can’t be replayed.


*I chose a power of 2 because many commercial implementations do this. x % y is the same as x & (y – 1) where y is a power of 2, and a single logical AND is much cheaper. The modulo operation is on a hot path, being called for every hash function in the filter; filters themselves are often called many 000′ of times a second in place of brute-forcing the backing data store, so small constants like this add up.

Graal has a website!

It’s been quite an intense couple of days so far at the JVMLS; I’m going to do a proper write-up of what I learned, but in the meantime, I have to give a big shout out to Christian Wimmer from Oracle Labs for pointing me to Graal’s shiny new website here. It was explained to me by other Oracle emplyees who shall remain nameless how hard it is to make stuff like this happen, but better late than never :)

More pertinently for me and other lang implementors is the subsection on Truffle here.

Stealthy addictions

I try to keep these blog posts pertinent to engineering, so here’s a curve-ball that hopefully still relates to tech in some capacity…

A few disparate observations I’ve been replaying in my mind converged today and while at first sight they simply pissed me off or depressed me, I started asking myself why they were happening; the common answer for all of these things is addiction:

  • I’m in a toilet cubicle at work and hear a guy playing some dumb Facebook game in the adjacent cubicle while he’s supposedly answering a call of nature
  • I was stuck in a traffic jam with the roof down [much better rear visibility JFTR] a few weeks ago- three cars in a row behind me were openly using their phones (texting or otherwise staring at smart-phones for long periods of time)
  • I have a lot of friends in London and have become increasingly cognizant of the phenomenon of moped-based cellphone theft- it seems the most vulnerable place where this happens is on a zebra crossing where the victim is more often than not either listening to music or engrossed in something on the device

Like many addictions, it’s easy to draw an imaginary demarcation line between the idiots who fall prey to such stupid behaviour and ‘me’, who would never possibly get caught up in anything so dumb.

Except that for the most part we’re dealing with the whole of society here, not just a portion with intrinsically addictive personalities or IQs of less than, I dunno, 90. I’ve seen plenty of amusing and ugly manifestations of addiction at various startups I’ve worked at, many amongst Ivy-leaguers and almost all of which are NSFW. And stock brokers and golden-circle lawyers play Zynga games on the shitter and get their phones robbed at junctions too.

An actual behavioural addiction- e.g. needing to communicate with somebody who isn’t in the same physical room (or world!) as you still somewhat stands out. I’m not a clinical psychologist but I can look at all of these examples and say, “that’s fucked up”.

What people who work in tech, or indeed anybody who stares at a screen all day needs to be aware of is the omnipresence of much more subtle, insidious but equally unrewarding & addictive uses of our time: ridding your Outlook calendar of every single conflict, sinking hours on a code review where the proposer has no intention of listening to you anyway, fixing everything that is wrong with your deeply fucked up team (was the last one perfect anyway?). Wait until the end of a working day and try and articulate what you did in sequence from start to finish. It’s surprisingly hard, but you’ll probably come up with some much better examples of recurring wasteful behaviour too.

It’s not my place to exhaustively catalogue addictive sources here because they are never-ending and growing in number, probably exponentially by virtue of the internet. But if you can occasionally stand outside yourself and say, “hey, that’s a not so smart- I could cut that out with no productive loss” about something you’re doing, you’ll probably be a slightly happier, if not more productive person.