Bloom filter

A Bloom filter is a data structure that allows computers to see if a given element occurs in a set.^[1]^[2] Bloom filters use hash functions to do this. For each element that is added, a hash value is calculated. When a new element is added, its hash value is compared to that of the other elements in the set. A Bloom filter is a probabilistic data structure. It is possible to get a false positive, but not to get a false negative.^[3] In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed.^[2] For each added element, the probability of getting a false positive grows.^[2]

Edward Bloom proposed the Bloom filter in 1970. In the article, Bloom supposes there is an algorithm to hyphenate words at the end of a line. According to the example, most words have simple hyphenation patterns. But about 10% of the words require time-consuming lookups to fetch the correct rule. His case was that of hyphenating about 500,000 words. He saw that using the "normal" error-free hashing techniques, storing the hyphenation patterns, would require a lot of memory. He found that using his technique, he could eliminate most lookups. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses.

More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.^[4]

Bloom Filter Media

An example of a Bloom filter, representing the set ${x, y, z}$ . The colored arrows show the positions in the bit array that each set element is mapped to. The element $w$ is not in the set ${x, y, z}$ , because it hashes to one bit-array position containing 0. For this figure, $m = 18$ and $k = 3$ .
Bloom filter used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times.
The false positive probability p as a function of number of elements n in the filter and the filter size m. An optimal number of hash functions k= (m/n) \ln 2 has been assumed.
Turning on the Bloom Filter for cache filtering in a content cache decreases the rate of disk writes by nearly one half because objects accessed only once are not cached and hence not written to disk. Paper Citation: "Algorithmic nuggets in content delivery", Bruce M. Maggs and Ramesh K. Sitaraman, ACM SIGCOMM Computer Communication Review (CCR), 15 pages, July 2015.
Distributed Single Shot Bloom filter for duplicate detection with false positive rate: 6 elements are distributed over 3 PEs, each with a bit array of length 4. During the first communication step PE 1 receives the hash '2' twice and sends it back to either PE 2 or 3, depending on who sent it later.
Attenuated Bloom Filter Example: Search for pattern 11010, starting from node n1.

References

↑ Burton H. Bloom (July 1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors" (PDF), Communications of the ACM, 13 (7): 422–426, doi:10.1145/362686.362692, S2CID 7931252
↑ ^2.0 ^2.1 ^2.2 Intelligent Data Engineering and Automated Learning - IDEAL 2010: 11th international conference, Paisley, UK, September 1-3, 2010 : proceedings, eds. Colin Fyfe; et al (Berlin: SpringerLink, 2010), p. 162
↑ Rayan Chikhi; Guillaume Rizk, 'Space-efficient and exact de Bruijn graph representation based on a Bloom filter', Biomedical Research, Volume 36 (2014), p. 3 of 9
↑ Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George (2006), "An Improved Construction for Counting Bloom Filters", Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science, vol. 4168, pp. 684–695, doi:10.1007/11841036_61, ISBN 978-3-540-38875-3

[Bloom1970-1] Burton H. Bloom (July 1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors" (PDF), Communications of the ACM, 13 (7): 422–426, doi:10.1145/362686.362692, S2CID 7931252

[Fyfe162-2] 2.0 ^2.1 ^2.2 Intelligent Data Engineering and Automated Learning - IDEAL 2010: 11th international conference, Paisley, UK, September 1-3, 2010 : proceedings, eds. Colin Fyfe; et al (Berlin: SpringerLink, 2010), p. 162

[3] Rayan Chikhi; Guillaume Rizk, 'Space-efficient and exact de Bruijn graph representation based on a Bloom filter', Biomedical Research, Volume 36 (2014), p. 3 of 9

[4] Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George (2006), "An Improved Construction for Counting Bloom Filters", Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science, vol. 4168, pp. 684–695, doi:10.1007/11841036_61, ISBN 978-3-540-38875-3

[1]

[2]

[3]

[4]