Bloom filter
A Bloom filter is a data structure that allows computers to see if a given element occurs in a set.[1][2] Bloom filters use hash functions to do this. For each element that is added, a hash value is calculated. When a new element is added, its hash value is compared to that of the other elements in the set. A Bloom filter is a probabilistic data structure. It is possible to get a false positive, but not to get a false negative.[3] In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed.[2] For each added element, the probability of getting a false positive grows.[2]
Edward Bloom proposed the Bloom filter in 1970. In the article, Bloom supposes there is an algorithm to hyphenate words at the end of a line. According to the example, most words have simple hyphenation patterns. But about 10% of the words require time-consuming lookups to fetch the correct rule. His case was that of hyphenating about 500,000 words. He saw that using the "normal" error-free hashing techniques, storing the hyphenation patterns, would require a lot of memory. He found that using his technique, he could eliminate most lookups. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses.
More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.[4]
Bloom Filter Media
Turning on the Bloom Filter for cache filtering in a content cache decreases the rate of disk writes by nearly one half because objects accessed only once are not cached and hence not written to disk. Paper Citation: "Algorithmic nuggets in content delivery", Bruce M. Maggs and Ramesh K. Sitaraman, ACM SIGCOMM Computer Communication Review (CCR), 15 pages, July 2015.
Distributed Single Shot Bloom filter for duplicate detection with false positive rate: 6 elements are distributed over 3 PEs, each with a bit array of length 4. During the first communication step PE 1 receives the hash '2' twice and sends it back to either PE 2 or 3, depending on who sent it later.
References
- ↑ Burton H. Bloom (July 1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors" (PDF), Communications of the ACM, 13 (7): 422–426, doi:10.1145/362686.362692, S2CID 7931252
- ↑ 2.0 2.1 2.2 Intelligent Data Engineering and Automated Learning - IDEAL 2010: 11th international conference, Paisley, UK, September 1-3, 2010 : proceedings, eds. Colin Fyfe; et al (Berlin: SpringerLink, 2010), p. 162
- ↑ Rayan Chikhi; Guillaume Rizk, 'Space-efficient and exact de Bruijn graph representation based on a Bloom filter', Biomedical Research, Volume 36 (2014), p. 3 of 9
- ↑ Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George (2006), "An Improved Construction for Counting Bloom Filters", Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science, vol. 4168, pp. 684–695, doi:10.1007/11841036_61, ISBN 978-3-540-38875-3