Monday, October 07, 2013

Ruminating on Bloom Filters

Of late, I have been trying to understand "Bloom Filters" and the reason why they are used in many popular NoSQL datastores such as Cassandra, HBase, etc.  Even Google Chrome brower uses it to match URLs of malicious sites.

So what exactly is a Bloom filter? A Bloom filter is a data-structure; essentially a byte array (aka byte vector). Using a bloom filter, we can quickly test for the containment of a particular element in a given set. For e.g. Suppose you have a set of 100K URLs that are malicious. How do you check if the entered URL belongs to the list? You can store all the 100K URLs in memory and iterate through them, but that would be expensive in terms of CPU cycles and memory requirements.

A Bloom filter is a very simple and efficient way to test for containment - with a one-sided error probability. What this means is that if the Bloom filter states that an element is NOT present in the list, then it is 100% true. But if it returns true for the containment test, then it means that the element "MAY" be there; i.e. you could have the risk of a 'false positive'. You can design your Bloom filter data structure in such a way, so that you can predict the probability of 'false positives'; for e.g. 3% probability in a set of 100 million entries. Hence a bloom filter can return 'false positives' but cannot return 'false negatives'.

With the above understanding, you can guess why Bloom filters are popular with NoSQL data-stores. It is used to quickly check whether a row exists in the database or not, before going for disk-access.

The following articles would give you a quick understading of how Bloom Filters work:
http://billmill.org/bloomfilter-tutorial/
http://www.michaelnielsen.org/ddi/why-bloom-filters-work-the-way-they-do/

In a nutshell, in simple terms the following steps are taken:
  1. Take a byte array (all elements of the array set to 0)
  2. For each element in your set, do the following:
    • Hash the element k times (with different hash algorithms). Let's say you hashed it 2 times and the values were 7 and 10. Set the bits at these indexes in the byte array to 1. 
    • Do this for all the elements in the set. Thus you would end up with the byte array (Bloom filter) having 0s and 1s scattered around. 
  3. Now to test whether an element is present in the set of not, we hash the element using the same hash functions. Now check if there is a 1 or 0 in the Bloom filter at those index positions. If there is a 0, that means the element is definitely not in the list. If there is a 1, then the element may be there and you can proceed with the costly disk-access or service call. 
This stackoverflow discussion also has a good explanation of how bloom filters work.
Google's popular Guava library aslo has a Bloom Filter class that be used in Java projects. The trick is to select the right size of the byte array and the hash functions to use for excellent performance.
.NET implementations are available here and here