Bloom filter
A Bloom filter is a data structure that allows computers to see if a given element occurs in a set.[1][2] Bloom filters use hash functions to do this. For each element that is added, a hash value is calculated. When a new element is added, its hash value is compared to that of the other elements in the set. A Bloom filter is a probabilistic data structure. It is possible to get a false positive, but not to get a false negative.[3] In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed.[2] For each added element, the probability of getting a false positive grows.[2]
Edward Bloom proposed the Bloom filter in 1970. In the article, Bloom supposes there is an algorithm to hyphenate words at the end of a line. According to the example, most words have simple hyphenation patterns. But about 10% of the words require time-consuming lookups to fetch the correct rule. His case was that of hyphenating about 500,000 words. He saw that using the "normal" error-free hashing techniques, storing the hyphenation patterns, would require a lot of memory. He found that using his technique, he could eliminate most lookups. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses.
More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.[4]
Bloom Filter Media
- Bloom filter.svg
An example of a Bloom filter, representing the set {x, y, z} . The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z} , because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.
- Bloom filter fp probability.svg
The false positive probability p as a function of number of elements n in the filter and the filter size m. An optimal number of hash functions k = (m / n) ln 2 has been assumed.
- BloomFilterDisk.png
Turning on the Bloom Filter for cache filtering in a content cache decreases the rate of disk writes by nearly one half because objects accessed only once are not cached and hence not written to disk. Paper Citation: "Algorithmic nuggets in content delivery", Bruce M. Maggs and Ramesh K. Sitaraman, ACM SIGCOMM Computer Communication Review (CCR), 15 pages, July 2015.
- AttenuatedBloomFilter2.png
Attenuated Bloom filter example: Search for pattern 11010, starting from node n1.
References
- ↑ Lua error in Module:Citation/CS1/Identifiers at line 630: attempt to index field 'known_free_doi_registrants_t' (a nil value).
- ↑ 2.0 2.1 2.2 Intelligent Data Engineering and Automated Learning - IDEAL 2010: 11th international conference, Paisley, UK, September 1-3, 2010 : proceedings, eds. Colin Fyfe; et al (Berlin: SpringerLink, 2010), p. 162
- ↑ Rayan Chikhi; Guillaume Rizk, 'Space-efficient and exact de Bruijn graph representation based on a Bloom filter', Biomedical Research, Volume 36 (2014), p. 3 of 9
- ↑ Lua error in Module:Citation/CS1/Identifiers at line 630: attempt to index field 'known_free_doi_registrants_t' (a nil value).