Tags

, , ,

I recently learned about Bloom filters (and was then able to fully understand the joke in this xkcd comic). While I don’t have a good application for them myself, I found them interesting enough to play around with a little.

Python uses them under the hood in a way that has some potential for other applications. In this post I’ll explain Bloom filter basics and go over some simple implementations.

Bloom filters (invented by B. Howard Bloom in 1970) offer a very fast way to “pre-search” for objects in a dataset. They have a constant runtime almost entirely independent of dataset size. Sounds ideal, but they are filters, not search routines. They reliably tell you whether the object you’re searching for isn’t in the dataset but can only say maybe when it is.

So, there are no false negatives, but there can be false positives. If the filter returns false, the search is over. If it returns true, you have to actually search the dataset (and may not find your target).

A Bloom filter depends on two concepts. Firstly, a way of hashing each item in the dataset to get a numeric hash. Secondly, a vector of bits or numbers. (The length of which is an important parameter.)

Figure 1. Adding an item to a Bloom filter sets bits (one per hash function) to one. The diagram shows what happens after adding the first item to the filter (with three hash functions).

As an example, suppose the dataset consists of various images. We’ll use a binary hashing function that returns the hash value as a byte string with 16, 32, 64 or even more bytes. We’ll use a bit vector with 64 bits (which we can fit into 64 ÷ 8 = 8 bytes). Since it takes 6 bits to index to 64 (actually 0-63), we mask out all the bits of the hash except the last 6 and use what’s left to index the bit vector.

Using that index, we set the indexed bit to one. Typically, an implementation uses multiple hash functions to derive multiple (different) hashes, masks out the indexing bits of each, and sets bits in the vector for each hash value. I’ll discuss why below.

Once we process all the images we want to test for, the filter is ready.

Figure 2. Testing a Bloom filter for a match. If any of the indexed bits is zero, the object is definitely not in the filter. If all bits are one (as in the example), then the object may be in the filter.

When we want to test whether an image is in the dataset, we run it through the same hashes to get the same set of indexes. We check each indexed bit. If any of them are zero, the image is definitely not in the dataset. If each bit we check is set to one, then the image may be in the dataset — we’ll have to compare it with each one to find out.

Obviously, this speeds things up since only possible matches require a search.


We’ll start with a much simpler implementation and work towards the one described above (and a bit beyond). This first example is a Python version of the technique implemented in C that Python uses under the hood for optimizing some string-handling functions.

The goal here is to identify whether a given character belongs to a specified list of characters. For example, perhaps to see if a character is a number or other legit number character (such as a plus, a minus, or a decimal point). Rather than check each given character against the list, we can speed things up with a character-oriented Bloom filter.

To create the filter, for each character we want in the filter, we get the ordinal value and treat that as an index into a bit vector. Because Unicode characters can have high ordinal values, we’ll mask out any higher bits. Here’s the filter:

001| Sizes = [16, 32, 64, 128, 256]
002| 
003| def bf_charset (charset, size=32):
004|     ”’Bloom filter for a small set of chars.”’
005|     assert size in Sizes, ValueError(f’Invalid size: {size}.’)
006| 
007|     # Create mask for indexing vector…
008|     mask = size1
009| 
010|     # Create vector of specified size…
011|     vector = [0]*size
012| 
013|     # Add each char in charset to filter…
014|     for ch in charset:
015|         # Use ordinal value as index…
016|         cx = ord(ch) & mask
017|         # Set vector “bit” to one…
018|         vector[cx] = 1
019| 
020|     # Return the filter vector…
021|     return vector
022| 
023| def bf_char_check (bf, char, size=32):
024|     ”’Test for charset Bloom filter.”’
025|     assert size in Sizes, ValueError(f’Invalid size: {size}.’)
026| 
027|     # Create mask for indexing vector…
028|     mask = size1
029| 
030|     # Use ordinal value as index…
031|     cx = ord(char) & mask
032| 
033|     # Return filter bit…
034|     return bf[cx]
035| 

The bf_charset function (lines #3 to #21) creates and returns a filter given charset (a string or list of characters for the filter). The size parameter specifies the length of the vector. To simplify things, rather than a bit vector, this implementation just uses a list.

The bf_char_check function (lines #23 to #34) checks the given filter, bf, for a given character, char. The function returns True (1) or False (0). Recall that False indicates the character is definitely not in the filter while True indicates it might be. Note that this function only has four lines of code, all very fast to operate. This is what can make a Bloom filter useful (it can be made even faster by saving the size and mask values for reuse).

The size parameter is restricted to multiples of 2ⁿ because we want 2ⁿ-1 to give us the all-ones mask (lines #8 and #28).

In the first function, vector is initialized to all zeros (line #11). Then, for each character in charset, we get the character’s ordinal value, mask out the lower bits (line #16 and #31), and use the value to index vector and set the indexed “bit”. Then, return vector.

The second function uses the same “hashing” process on char, the character to check. If the indexed slot of vector is zero, then char isn’t in the filter. If the indexed slot is one, then char may be in the filter. The difference comes from the possibility of more than one filter character setting the same indexed bit. Depending on how many bits we mask out, and what characters we try to add to the filter, index collisions are guaranteed to happen.

Here’s the character filter in action:

001| from examples import bf_charset, bf_char_check
002| 
003| # Special characters we’re searching for…
004| charset = ‘\t\r\n\f\xa0’
005| 
006| # Some simple test text…
007| text = ‘\x80Foo\tBar\tGum\tZip\xa0\r\n’
008| 
009| # Create a Bloom filter for special chars…
010| bf = bf_charset(charset, size=16)
011| print(bf)
012| print()
013| 
014| # For each character in the text…
015| for ix,ch in enumerate(text):
016|     # If the Bloom filter returns a (maybe) match…
017|     if bf_char_check(bf, ch, size=16):
018|         # Print it (regardless)…
019|         print(f’found 0x{ord(ch):02x} @ index {ix}’)
020| print()
021| 

The charset string (set in line #4) contains the characters we’re searching for in the text string (set in line #7). We created a character-based Bloom filter from charset (line #10). Lines #15 through #20 iterate through the characters in text looking for matches. Notice that we’re accepting all filter hits as valid.

Because we only have a handful of characters to add to the filter, we set the size to 16 “bits” (though for simplicity we’re actually using a list of numbers). It’s important that we remember to set the size in both the filter (line #10) and the checker (line #17).

When run, this prints:

[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0]

found 0x80 @ index 0
found 0x09 @ index 4
found 0x09 @ index 8
found 0x6d @ index 11
found 0x09 @ index 12
found 0x5a @ index 13
found 0x69 @ index 14
found 0x70 @ index 15
found 0xa0 @ index 16
found 0x0d @ index 17
found 0x0a @ index 18

We added the Tab (0x09), Carriage Return (0x13), Line Feed (Newline) (0x10), Form Feed (Page Break) (0x12), and Non-breaking Space (0xA0) characters to the filter. Each set an indexed bit in vector. We see that vector has plenty of zeros. It’s important a Bloom filter end up with enough zeros in the filter to ensure good negative results. The constraint then is how large vector has to be.

The filter found all the characters we were searching for but notice how it also found a number of false positives. All three letters of “Zip” snuck through, and so did the “m” from “Gum”. They did so because they all “hash” to the same value as one of the characters in the filter:

>>> mask = (16 - 1)
>>> text[11]
    'm'
>>> text[13]
    'Z'
>>> text[14]
    'i'
>>> text[15]
    'p'
>>> ord(text[13]) & mask
    10
>>> ord(text[13]) & mask
    9
>>> ord(text[14]) & mask
    0
>>> ord(text[15]) & mask
    13

I did it this way to demonstrate the false positives. We need to add a check against the search characters for all filter hits.

Here’s another use case example that does that:

001| from os import path
002| from examples import bf_charset, bf_char_check
003| 
004| BasePath = r’C:\users\Wyrd\blog\hcc\source’
005| 
006| # Load a text file to scan…
007| fn = path.join(BasePath, ‘chriscarol.txt’)
008| with open(fn, mode=‘r’, encoding=‘utf8’) as fp:
009|     text = fp.read()
010| 
011| # Create a Bloom filter for special chars…
012| charval = [0x201c, 0x201d, 0x2018, 0x2019]
013| charset = [chr(v) for v in charval]
014| 
015| bf = bf_charset(charset, size=64)
016| print(.join(str(b) for b in bf))
017| print()
018| 
019| # Scan through text file char by char…
020| for ix,ch in enumerate(text):
021|     # If Bloom filter returns True (possible match)…
022|     if bf_char_check(bf, ch, size=64):
023|         # Look for char in charset…
024|         if ch in charset:
025|             # List the char…
026|             print(f'{ix:04}: ({ch}) [0x{ord(ch):02x}]’)
027| print()
028| 

In this case we have a text file that may contain the “fancy” (non-ASCII!) single- and double-quote characters. We want to find all occurrences of these. The four we’re looking for have the ordinal values shown in line #12 (with line #13 converting the numbers to characters for the filter).

Most importantly, note how filter hits are checked against the charset for a match. This avoids the false positives. When run, this prints:

0000000000000000000000001110111000000000000000000000000000000000

0207: (’) [0x2019]
0229: (’) [0x2019]
0331: (’) [0x2019]
0651: (’) [0x2019]
0868: (’) [0x2019]
1235: (’) [0x2019]
1477: (’) [0x2019]
1747: (’) [0x2019]
1803: (’) [0x2019]
1854: (’) [0x2019]
2771: (’) [0x2019]
3061: (’) [0x2019]
3208: (“) [0x201c]
3218: (”) [0x201d]
3323: (“) [0x201c]
3383: (”) [0x201d]
3464: (’) [0x2019]
3587: (’) [0x2019]
3754: (“) [0x201c]
3809: (”) [0x201d]
4000: (“) [0x201c]
4005: (”) [0x201d]
4912: (’) [0x2019]
5099: (’) [0x2019]
5175: (’) [0x2019]

The numbers along the left are the offsets into the text file where the character was found (as in the above examples). As you can see, no false positives (and we know there can be no false negatives).


Now that we’ve explored the basic principle behind Bloom filters, let’s take the next step. The character filter used a very simple “hashing” scheme, the character’s ordinal number masked to keep only enough low-order bits to index the bit vector. To help reduce false positives, Bloom filters usually use multiple hashes and set multiple bits in the vector. The idea is that a false positive isn’t likely to match them all. False positives still occur if an item happens to hash to a set of indexes with bits that are all set to one.

Here’s an implementation of a Bloom filter abstract base class (the reason for making it a class will become apparent as we progress):

001| import hashlib
002| 
003| class BloomFilterBase:
004|     ”’Bloom Filter Base class.”’
005| 
006|     def __init__(self, size):
007|         ”’New BloomFilterBase instance.”’
008| 
009|         # Size of the “bit” vector…
010|         self.size = size
011| 
012|         # Table of hash function constructors…
013|         self.hash_functions = [hashlib.md5,
014|                                hashlib.sha1,
015|                                hashlib.sha384,
016|                                hashlib.sha256,
017|                                hashlib.sha512]
018| 
019|         # Size of the hash function table…
020|         self.nhash = len(self.hash_functions)
021| 
022| 
023|     def _hash(self, data:str, which:int) -> int:
024|         ”’Return a hash of data.”’
025| 
026|         # Get a valid index into the table of hashes…
027|         wx = which % len(self.hash_functions)
028| 
029|         # Get hash function constructor from the table…
030|         hf = self.hash_functions[wx]
031| 
032|         # Get hash function…
033|         hasher = hf()
034| 
035|         # Send data through the hasher…
036|         hasher.update(data.encode(‘utf-8’))
037| 
038|         # Get the digest as hex string and convert to int…
039|         digest = int(hasher.hexdigest(), 16)
040| 
041|         # Return lower N bits of hash number…
042|         return digest % self.size
043| 

This class acts as a base the concrete implementations below. It has only two methods, a constructor (lines #6 to #20) and the _hash method (lines #23 to #42).

The constructor takes a size parameter that determines the length of the vector (which subclasses are expected to implement). It populates the hash_functions list with a handful of Python hash function constructors from hashlib. [See Python hashlib for details.]

The _hash method is meant to be for internal use by subclasses. It takes data, a string to hash, and which, an index to one of the hash functions.

The first step (line #27) is to ensure a valid index into the table of hash functions. The next step (line #30) is getting the indexed function’s constructor. In line #33, we use it to get a new hasher instance. Then we use it to hash the data (line #36) and get its digest (its hash value, line #39). Finally (line #42) we return the masked out lower bits of the hash value.

Here’s an implementation of a Bloom filter subclass:

001| from examples import BloomFilterBase
002| 
003| class BloomFilter (BloomFilterBase):
004|     ”’Basic Bloom Filter.”’
005| 
006|     def __init__ (self, size):
007|         ”’New BloomFilter instance.”’
008|         super().__init__(size)
009| 
010|         # Number of bytes required to store size number of elements
011|         self.nbytes = (size + 7) // 8
012| 
013|         # Initialize a bytearry with all zeros
014|         self.bit_vector = bytearray(([0] * self.nbytes))
015| 
016|     def query (self, item) -> bool:
017|         ”’Test filter to see if item exists.”’
018| 
019|         # For each hash function in the table…
020|         for ix in range(self.nhash):
021| 
022|             # Get a hashed index for the item…
023|             index = self._hash(item, ix)
024| 
025|             # Get the byte and bit indexes…
026|             byte_index, bit_index = divmod(index, 8)
027| 
028|             # Create a bit mask for the bit index…
029|             mask = 1 << bit_index
030| 
031|             # If the matching bit in the vector is 0…
032|             if (self.bit_vector[byte_index] & mask) == 0:
033|                 # Item is not in the filter…
034|                 return False
035| 
036|         # If none of the bits were 0, item MAY be in the filter…
037|         return True
038| 
039|     def add (self, item) -> None:
040|         ”’Add an item to the filter.”’
041| 
042|         # For each hash function in the table…
043|         for ix in range(self.nhash):
044| 
045|             # Get a hashed index for the item…
046|             index = self._hash(item, ix)
047| 
048|             # Get the byte and bit indexes…
049|             byte_index, bit_index = divmod(index, 8)
050| 
051|             # Create a bit mask for the bit index…
052|             mask = 1 << bit_index
053| 
054|             # Set the indexe bit in the vector…
055|             self.bit_vector[byte_index] |= mask
056| 
057|     def blanks (self) -> int:
058|         ”’Return number of unused vector slots.”’
059|         func = lambda b:sum(0 if b & pow(2,bx) else 1 for bx in range(8))
060|         return sum(func(b) for b in self.bit_vector)
061| 
062|     def __len__ (self) -> int: return self.bit_vector
063|     def __getitem__ (self, ix) -> bytes: return self.bit_vector[ix]
064|     def __iter__ (self): return iter(self.bit_vector)
065| 
066|     def __str__ (self) -> str:
067|         return self._vector2bits()
068| 
069|     def _byte2bits (self, b) -> str:
070|         ”’Convert a byte to a string of ‘1’ and ‘0’.”’
071|         bs = [(‘1’ if b & pow(2,bx) else ‘0’) for bx in range(8)]
072|         return .join(reversed(bs))
073| 
074|     def _vector2bits (self) -> str:
075|         ”’Convert vector into a string of bits.”’
076|         bs = [self._byte2bits(b) for b in reversed(self.bit_vector)]
077|         return ‘.’.join(bs)
078| 
079| 
080| bf = BloomFilter(64)
081| for w in [‘who’, ‘what’, ‘why’, ‘where’, ‘when’]:
082|     bf.add(w)
083| 
084| print(f'{bf=!s}’)
085| print(f’> blanks={bf.blanks()}’)
086| print()
087| print(f’test “” : {bf.query(“”)}’)
088| print(f’test “when” : {bf.query(“when”)}’)
089| print(f’test “went” : {bf.query(“went”)}’)
090| print(f’test “why” : {bf.query(“why”)}’)
091| print(f’test “why not”: {bf.query(“why not”)}’)
092| print(f’test “where” : {bf.query(“where”)}’)
093| print(f’test “who” : {bf.query(“who”)}’)
094| print(f’test “wh” : {bf.query(“wh”)}’)
095| print(f’test “am” : {bf.query(“am”)}’)
096| print()
097| 

One advantage of a class is only having to provide the vector size in the constructor. In the subclass constructor (lines #6 to #14), we determine how many bytes are needed to contain the number of bits specified by size (line #11) and then create bit_vector as a bytearray (line #14).

The query method (lines #16 to #37) tests the filter for a match, and the add method (lines #39 to #55) adds new items to it. The blanks method (lines #57 to #60) returns a count of zero bits. We can use it to determine how “full” the filter is. If the number of zeros gets too small, we can increase the filter size.

We also implement methods to support iterating over the bits in the vector (lines #62 to #64) as well as dunder str (lines #66 and #67). Lastly, a couple of methods for “pretty printing” the bit vector (lines #69 to #77).

We create a filter for the famous newspaper words (lines #80 to #82) and then print some stuff to test the filter. When run, the code prints:

bf=00101000.00000000.00010001.00011000.11000001.00011000.11001111.01010010
> blanks=44

test ""       : False
test "when"   : True
test "went"   : False
test "why"    : True
test "why not": False
test "where"  : True
test "who"    : True
test "wh"     : False
test "am"     : True

We obviously have more than enough blanks. Note how the last test, with “am”, returned a false positive.


You might have noticed, while the Bloom filter class has an add function, there is no delete function. Because the hashes from different objects can set the same bits, it’s not possible to delete a filter object without potentially removing bits that must be set for other objects still in the filter.

But there is a way around this, a counting Bloom filter. Rather than set bit flags, we use integers and increment them when we set a “bit”. This allows removing a filter object by decrementing the count on its hash indexes.

Here’s a reimplementation of the Bloom filter above as a counting filter:

001| from examples import BloomFilterBase
002| 
003| class BloomFilterCounting (BloomFilterBase):
004|     ”’Counting Bloom Filter.”’
005| 
006|     def __init__ (self, size):
007|         ”’New BloomFilterCounting instance.”’
008|         super().__init__(size)
009| 
010|         # Initialize number vector with all zeros
011|         self.nbr_vector = [0]*self.size
012| 
013|     def query (self, item:str) -> bool:
014|         ”’Test filter to see if item exists.”’
015| 
016|         # For each hash function in the table…
017|         for ix in range(self.nhash):
018| 
019|             # Get a hashed index for the item…
020|             index = self._hash(item, ix)
021| 
022|             # If slot is zero, object isn’t in filter…
023|             if self.nbr_vector[index] == 0:
024|                 return False
025| 
026|         # If none of the slots was 0, item MAY be in the filter…
027|         return True
028| 
029|     def add (self, item) -> None:
030|         ”’Add an item to the filter.”’
031| 
032|         # For each hash function in the table…
033|         for ix in range(self.nhash):
034| 
035|             # Get a hashed index for the item…
036|             index = self._hash(item, ix)
037| 
038|             # Increment the indexed slot…
039|             self.nbr_vector[index] += 1
040| 
041|     def delete (self, item) -> None:
042|         ”’Remove an item from the filter.”’
043| 
044|         # If the item isn’t in the filter, raise an error…
045|         if not self.query(item):
046|             raise RuntimeError(‘Item Not Found.’)
047| 
048|         # For each hash function in the table…
049|         for ix in range(self.nhash):
050| 
051|             # Get a hashed index for the item…
052|             index = self._hash(item, ix)
053| 
054|             # Increment the indexed slot…
055|             self.nbr_vector[index] -= 1
056| 
057|             # This should never happen…
058|             if self.nbr_vector[index] < 0:
059|                 raise RuntimeError(‘Counter Underflow.’)
060| 
061|     def blanks (self) -> int:
062|         ”’Return number of unused vector slots.”’
063|         return int(sum(1 if n==0 else 0 for n in self.nbr_vector))
064| 
065|     def __len__ (self) -> int: return self.nbr_vector
066|     def __getitem__ (self, ix) -> bytes: return self.nbr_vector[ix]
067|     def __iter__ (self): return iter(self.nbr_vector)
068| 
069|     def __str__ (self) -> str:
070|         return .join([str(n) for n in self.nbr_vector])
071| 
072| bf = BloomFilterCounting(64)
073| for w in [‘who’, ‘what’, ‘why’, ‘where’, ‘when’]:
074|     bf.add(w)
075| 
076| print(f'{bf=!s}’)
077| print(f’> blanks={bf.blanks()}’)
078| print()
079| print(f’test “” : {bf.query(“”)}’)
080| print(f’test “when” : {bf.query(“when”)}’)
081| print(f’test “went” : {bf.query(“went”)}’)
082| print(f’test “why” : {bf.query(“why”)}’)
083| print(f’test “why not”: {bf.query(“why not”)}’)
084| print(f’test “where” : {bf.query(“where”)}’)
085| print(f’test “who” : {bf.query(“who”)}’)
086| print(f’test “wh” : {bf.query(“wh”)}’)
087| print(f’test “am” : {bf.query(“am”)}’)
088| print()
089| print(‘delete(“where”)’)
090| bf.delete(‘where’)
091| print(f’test “where” : {bf.query(“where”)}’)
092| print()
093| 

The two implementations are similar, but this one doesn’t have to deal with bytes, bits, or a bitmask, and it has a delete method for removing a filter object.

Except for how it accesses the vector, the query method is nearly the same but note how the add method increments the vector slot rather than just setting a bit. The delete method is similar to the add method but decrements the vector slot. It also checks to ensure the object is in the filter and that the decrement doesn’t underflow.

When run, this prints:

bf=0100102012110011000110001000001100011000200020000000000000010200
> blanks=44

test ""       : False
test "when"   : True
test "went"   : False
test "why"    : True
test "why not": False
test "where"  : True
test "who"    : True
test "wh"     : False
test "am"     : True

delete("where")
test "where"  : False

The first line shows how several filter slots are set to 2 rather than 1. These are slots where two objects hashed the same index. Since we used the underlying base class with the same hash functions, the results are the same (except we can delete a filter object, in this case the word “where”).

Be aware that, although we’re using five separate hashes, using only the low-order bits makes it possible for more than one hash to end up indexing the same vector slot. This is generally harmless but can be a surprise during debugging.


In summary, Bloom filters are handy if the objects being searched for are expensive to access. The lack of false negatives means they quickly reject mismatches. They probably work best if the items being searched for aren’t too many in number. The vector length obviously needs to grow as the number of items in the filter grows.

Another aspect of this is the hashing function used. Ideally it should have good distribution across the vector length. The hashing function depends a lot on what’s being hashed, document, images, or whatever. Also keep in mind that every object added to the filter gets hashed, and so does every object used to query the filter. To make a Bloom filter useful, hashing function needs to be fast.


Link: Zip file containing all code fragments used in this post.