3.13.1. Bloom Filter

Google’s crawler needs to crawl a large number of web pages every day. This raises a question: whenever the crawler extracts a URL from a web page, should that URL be crawled? How do we know whether it has already been crawled? A simple idea is to store every URL in a hash table and look it up each time. But if each URL occupies 40 bytes, then 1 billion URLs occupy about 40 GB of memory for the URLs alone! Can this be made more space-efficient?
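The hash-table baseline described above can be sketched as follows. The names seen_urls and should_crawl are hypothetical, chosen only for this illustration; the point is that every URL is stored in full, which is where the memory goes.

```cpp
#include <string>
#include <unordered_set>

// Exact "seen" set (hypothetical helper): stores every URL in full,
// so memory grows with the total size of all URLs.
struct seen_urls {
    std::unordered_set<std::string> urls;

    // Returns true the first time a URL is offered (crawl it),
    // false if it was already recorded (skip it).
    bool should_crawl(const std::string& url) {
        return urls.insert(url).second;
    }
};
```

This structure never gives a wrong answer, which is the property the space-saving alternatives give up.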

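Can we avoid storing the URL itself and so greatly reduce the space required? A classic approach is a bitmap: hash the URL to get an index into the bitmap, and check whether the bit at that position is 1. This saves space, but it suffers from hash collisions: two different URLs can map to the same bit, so a URL that was never crawled may be reported as already crawled.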
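A minimal sketch of the single-hash bitmap idea, to make the collision problem concrete. The struct name url_bitmap, the size SLOTS, and the use of std::hash are all assumptions made for this illustration, not part of the original text.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// One bit per slot and a single hash function (illustrative only).
// Two different URLs that hash to the same slot are indistinguishable.
struct url_bitmap {
    static constexpr std::size_t SLOTS = 1 << 20;  // assumed size
    std::bitset<SLOTS> bits;

    std::size_t slot(const std::string& url) const {
        return std::hash<std::string>{}(url) % SLOTS;
    }
    // Record that a URL has been crawled.
    void mark(const std::string& url) { bits.set(slot(url)); }
    // true means "possibly crawled"; false means "definitely not crawled".
    bool maybe_seen(const std::string& url) const { return bits.test(slot(url)); }
};
```

With one hash function, a single collision is enough to produce a wrong "seen" answer.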

A Bloom filter is a probabilistic data structure for membership queries, and it has been widely used in various fields to reduce memory consumption and improve system performance. The idea is to use multiple independent hash functions to reduce the effect of hash collisions: an element is reported as present only if all of its bit positions are set. A Bloom filter is a set-like data structure that is more space-efficient than traditional set-like data structures such as hash tables or trees. The catch is that a Bloom filter can tell you with 100% certainty that something is not in the set, but it cannot tell you with 100% certainty that something is in the set. In other words, false positives are possible but false negatives are not.

Figure 3.13.1 Bloom Filter: w is not in the set with 100% certainty
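How often a false positive occurs can be estimated with a standard approximation, sketched here under the assumption that the k hash functions are independent and uniform. The symbols m (number of bits) and n (number of inserted elements) are introduced for this sketch and do not appear in the text above.

```latex
\[
\Pr[\text{a given bit is still } 0]
  = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m},
\qquad
p_{\text{false positive}} \approx \left(1 - e^{-kn/m}\right)^{k}
\]
```

For example, with m/n = 10 bits per element and k = 7, this gives p ≈ 0.008, a little under 1%.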

3.13.1.1. Standard Bloom Filter

Implement a standard Bloom filter that supports the following methods:

StandardBloomFilter(k): constructor in which you need to create k hash functions.

add(string): add a string to the Bloom filter.

contains(string): check whether a string exists in the Bloom filter.

Example

StandardBloomFilter(3)
add("hello")
add("code")
contains("hello") // return true
contains("world") // return false

The first challenge is how to implement several independent hash functions. The following is one example, in which each (cap, seed) pair yields a different function:

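struct hash_function {
    int cap, seed; // cap: results fall in [0, cap); seed: multiplier
    hash_function(int cap_, int seed_): cap(cap_), seed(seed_) {}
    int hash(const string& value) const {
        long long ret = 0; // 64-bit intermediate avoids overflow
        // unsigned char keeps non-ASCII bytes from producing a negative index
        for (unsigned char c : value) {
            ret += seed * ret + c;
            ret %= cap;
        }
        return (int)ret;
    }
};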

💡 Here we need to use STL’s bitset. The class template bitset<N> represents a fixed-size sequence of N bits, where N must be known at compile time. Bitsets can be manipulated by standard logic operators and converted to and from strings and integers. Useful methods: count, set, reset, test, flip, to_string.

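const int LEN = 100000; // number of bits; 100000 is an assumed value

struct standard_bloom_filter {
    // k is the number of hash functions
    standard_bloom_filter(int k) {
        // Each function gets its own range (at most LEN) and its own multiplier.
        while (k--)
            hash_func.push_back(hash_function(LEN - k, 2 * k + 3));
    }
    // Set the bit chosen by every hash function.
    void add(string& word) {
        for (auto& h: hash_func)
            bits.set(h.hash(word));
    }
    // false: definitely absent; true: possibly present.
    bool contains(string& word) {
        for (auto& h: hash_func)
            if (!bits.test(h.hash(word))) return false;
        return true;
    }
    vector<hash_function> hash_func;
    bitset<LEN> bits; // 💣 LEN must be a compile-time constant
};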
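The filter above fixes its size and takes k from the caller. As a hedged sketch of how these two parameters are commonly chosen, the helper below applies the standard sizing formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2, which assume independent, uniform hash functions. The names bloom_params and suggest_bloom_params, and the symbols n (expected number of elements) and p (target false positive rate), are hypothetical and not part of the original text.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical result type: m bits and k hash functions.
struct bloom_params {
    std::size_t bits;  // m: size of the bit array
    int hashes;        // k: number of hash functions
};

// n: expected number of inserted elements (n > 0)
// p: target false positive rate (0 < p < 1)
bloom_params suggest_bloom_params(std::size_t n, double p) {
    const double ln2 = std::log(2.0);
    // m = -n * ln(p) / (ln 2)^2
    double m = -static_cast<double>(n) * std::log(p) / (ln2 * ln2);
    // k = (m / n) * ln 2, rounded to the nearest integer, at least 1
    double k = m / static_cast<double>(n) * ln2;
    bloom_params r;
    r.bits = static_cast<std::size_t>(std::ceil(m));
    r.hashes = std::max(1, static_cast<int>(std::lround(k)));
    return r;
}
```

Under these formulas, a 1% false positive rate needs about 9.6 bits per element and 7 hash functions, so 1 billion URLs would need roughly 1.2 GB instead of tens of gigabytes.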

3.13.1.2. Counting Bloom Filter

In addition to false positives, the standard Bloom filter has another shortcoming: it does not support deletion. The Counting Bloom Filter (CBF) solves this problem.

Implement a counting Bloom filter that supports the following methods:

add(string): add a string to the Bloom filter.

contains(string): check whether a string exists in the Bloom filter.

remove(string): remove a string from the Bloom filter.

Example

CountingBloomFilter(3)
add("lint")
add("code")
contains("lint") // return true
remove("lint")
contains("lint") // return false

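A CBF maintains a counter at each position instead of a single bit marked 0 or 1. To add an element, hash it with the k hash functions to obtain k positions, and add 1 to the counter at each of them. To delete it, decrement the same k counters by 1. An element is reported as present only if all k of its counters are non-zero. Deleting an element that was never added can corrupt the counters of other elements, so only elements known to be present should be removed.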

So, how many bits should each counter occupy? Allocating too many wastes space; allocating too few is prone to overflow. According to statistical analysis, 4 bits per counter are enough in practice. For simplicity, the code here uses one char (8 bits) per counter, replacing the bitset<N> of the standard Bloom filter with vector<char>.

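struct counting_bloom_filter {
    // k is the number of hash functions
    counting_bloom_filter(int k) {
        // Hash ranges are at most 100000 + k - 1, so every index fits in bits.
        bits.resize(100000 + k);
        for (int i = 0; i < k; ++i)
            hash_func.push_back(hash_function(100000 + i, 2 * i + 3));
    }
    void add(string& word) {
        for (auto& h: hash_func) {
            char& c = bits[h.hash(word)];
            if (c < 127) ++c; // saturate instead of overflowing
        }
    }
    void remove(string& word) {
        if (!contains(word)) return; // never decrement for an absent word
        for (auto& h: hash_func) {
            char& c = bits[h.hash(word)];
            if (c > 0 && c < 127) --c; // a saturated counter is left alone
        }
    }
    bool contains(string& word) {
        for (auto& h: hash_func)
            if (bits[h.hash(word)] <= 0)
                return false;
        return true;
    }
    vector<hash_function> hash_func;
    vector<char> bits; // 💣 one 8-bit counter per position, not one bit
};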
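The text above says 4 bits per counter are enough, while the code stores one char per counter. A minimal sketch of how two 4-bit counters could be packed into each byte is shown here. The name nibble_counters and its methods are hypothetical, and the saturation policy (a counter that reaches 15 is never changed again) is one common choice assumed for this sketch, not something the original specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Array of 4-bit counters, two per byte (illustrative sketch).
// Counters saturate at 15 rather than wrapping around.
struct nibble_counters {
    std::vector<std::uint8_t> bytes;

    explicit nibble_counters(std::size_t n) : bytes((n + 1) / 2, 0) {}

    // Even indices live in the low nibble, odd indices in the high nibble.
    unsigned get(std::size_t i) const {
        std::uint8_t b = bytes[i / 2];
        return (i % 2 == 0) ? (b & 0x0Fu) : (b >> 4);
    }
    void put(std::size_t i, unsigned v) {  // v must be in [0, 15]
        std::uint8_t& b = bytes[i / 2];
        if (i % 2 == 0) b = static_cast<std::uint8_t>((b & 0xF0u) | v);
        else            b = static_cast<std::uint8_t>((b & 0x0Fu) | (v << 4));
    }
    void increment(std::size_t i) {
        unsigned v = get(i);
        if (v < 15) put(i, v + 1);           // stop at 15
    }
    void decrement(std::size_t i) {
        unsigned v = get(i);
        if (v > 0 && v < 15) put(i, v - 1);  // saturated counters stay at 15
    }
};
```

Swapping such a structure in for vector<char> would halve the counter storage, at the cost of slightly more work per access.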

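One drawback of CBF is that its space usage is not very efficient [1]: with 4-bit counters, it needs four times the space of a standard Bloom filter with the same number of positions.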

Footnotes