
I have read about "probabilistic" data structures like Bloom filters and skip lists.

What are the common characteristics of probabilistic data structures and what are they used for?

5 Answers


There are probably a lot of different (and good) answers, but in my humble opinion, the common characteristic of probabilistic data structures is that they provide you with an approximate, not a precise, answer.

How many items are here? About 1523425 with probability of 99%

Update: a quick search turned up a link to a decent article on the topic:

https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

  • Isn't a hashtable without a linked list in each slot (thus storing only "yes, the item exists" or "no, the item doesn't exist" in each slot) also a probabilistic data structure?
    – Pacerier
    May 6, 2017 at 14:02
  • @Pacerier, what you are describing is actually a Bloom filter with k=1 hash function. While it can say with certainty that an item does not exist, it cannot say with certainty that an item does exist. That's why, yes, it is a probabilistic data structure.
    – sha-1
    Dec 3, 2017 at 8:45
  • A skip list can be treated as a probabilistic structure, but it gives an exact answer. In this case probability affects performance, not correctness.
    – freakish
    Sep 12, 2018 at 8:18
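The Bloom-filter behavior discussed in the comments above can be sketched in a few lines of Python. The class name, bit-array size, and hashing scheme below are illustrative choices of mine, not from any particular library; the point is the "definitely absent / probably present" asymmetry:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers are 'definitely not present'
    or 'probably present', never a guaranteed 'present'."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions from a single digest; a real
        # implementation would use independent hash functions.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))   # True
print(bf.might_contain("durian"))  # False (almost certainly)
```

With num_hashes=1 this degenerates into exactly the hashtable-without-chaining described in the first comment.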

If you are interested in probabilistic data structures, you might want to read my recently published book "Probabilistic Data Structures and Algorithms for Big Data Applications" (ISBN: 9783748190486, available at Amazon), where I explain many such space-efficient data structures and fast algorithms that are extremely useful in modern Big Data applications.

In this book, you can find state-of-the-art algorithms and data structures that help to handle such common problems in Big Data processing as:

  • Membership querying (Bloom filter, Counting Bloom filter, Quotient filter, Cuckoo filter).
  • Cardinality (Linear counting, probabilistic counting, LogLog, HyperLogLog, HyperLogLog++).
  • Frequency (Majority algorithm, Frequent, Count Sketch, Count-Min Sketch).
  • Rank (Random sampling, q-digest, t-digest).
  • Similarity (LSH, MinHash, SimHash).

You can get a free preview and all related information about the book at https://pdsa.gakhov.com
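As a small taste of the cardinality topic listed above, here is a toy linear counting sketch in Python. The class name, bitmap size, and hash choice are illustrative assumptions of mine, not taken from the book: distinct items set bits in a fixed bitmap, and the count is estimated from the fraction of bits still empty.

```python
import hashlib
import math

class LinearCounter:
    """Toy linear counting: estimate the number of distinct items
    from the fraction of bitmap slots left empty."""
    def __init__(self, size=4096):
        self.size = size
        self.bits = [False] * size

    def add(self, item):
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        self.bits[h % self.size] = True

    def estimate(self):
        empty = self.bits.count(False)
        if empty == 0:
            return float("inf")  # bitmap saturated; estimate unusable
        # n ~ -m * ln(V), where V is the fraction of empty slots
        return -self.size * math.log(empty / self.size)

lc = LinearCounter()
for i in range(1000):
    lc.add(f"user-{i}")
print(round(lc.estimate()))  # close to 1000
```

Adding the same item twice sets the same bit, so duplicates do not inflate the estimate; that is what makes this a cardinality (distinct-count) sketch rather than a plain counter.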


Probabilistic data structures can't give you a definite answer; instead, they provide you with a reasonable approximation of the answer and a way to quantify the error of that estimation. They are extremely useful for big data and streaming applications because they allow you to dramatically decrease the amount of memory needed (in comparison to data structures that give you exact answers).

In the majority of cases these data structures use hash functions to randomize the items. Because they ignore collisions, they keep their size constant, but this is also the reason why they can't give you exact values. The advantages they bring:

  • they use a small amount of memory (and you can control how much)
  • they are easily parallelizable (hashes are independent)
  • they have constant query time (not even amortized constant, as in a dictionary)

Frequently used probabilistic data structures are:

  • Bloom filter (membership testing)
  • HyperLogLog (cardinality estimation)
  • Count-Min Sketch (frequency estimation)
  • MinHash (set similarity)
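To make the trade-offs above concrete, here is a toy Count-Min sketch in Python (the names and table dimensions are my own illustrative choices): it uses a fixed amount of memory regardless of how many items it sees, answers frequency queries in constant time, and because it ignores collisions it may overcount, but never undercounts.

```python
import hashlib

class CountMinSketch:
    """Toy Count-Min sketch: frequency estimates that may
    overcount (due to hash collisions) but never undercount."""
    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash position per row, derived from a seeded digest.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Take the minimum across rows: the least-collided counter.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for _ in range(5):
    cms.add("apple")
cms.add("banana")
print(cms.estimate("apple"))  # at least 5; very likely exactly 5
```

Note that the memory used is width x depth counters, fixed up front, which is the "you can control how much" knob mentioned above.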


There is a list of probabilistic data structures on Wikipedia for your reference: https://en.wikipedia.org/wiki/Category:Probabilistic_data_structures

There are different definitions of what a "probabilistic data structure" is. IMHO, a probabilistic data structure is one that uses some randomized algorithm or takes advantage of some probabilistic characteristics internally, but it doesn't have to behave probabilistically or non-deterministically from the data structure user's perspective.

  • There are many "probabilistic data structures" with probabilistic behavior, such as the Bloom filter and HyperLogLog mentioned in the other answers.

  • At the same time, there are other "probabilistic data structures" with deterministic behavior (from a user's perspective), such as the skip list. Users can use a skip list much like a balanced binary search tree, but it is implemented with a probability-related idea internally. According to the skip list's author, William Pugh:

    Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.
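To illustrate the point, here is a minimal skip list sketch in Python (my own illustrative code, not Pugh's): searches always return exact answers, while coin flips decide each node's height, and it is those coin flips that give the expected O(log n) search time.

```python
import random

class SkipListNode:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * level  # one successor pointer per level

class SkipList:
    """Minimal skip list: exact answers, randomized structure."""
    MAX_LEVEL = 16

    def __init__(self):
        self.head = SkipListNode(None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        # Repeated coin flips: each node is promoted one level
        # higher with probability 1/2. This is the only randomness.
        level = 1
        while random.random() < 0.5 and level < self.MAX_LEVEL:
            level += 1
        return level

    def insert(self, value):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
            update[i] = node
        new_level = self._random_level()
        self.level = max(self.level, new_level)
        new_node = SkipListNode(value, new_level)
        for i in range(new_level):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

    def contains(self, value):
        # Descend from the highest level, moving right while smaller.
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.value == value

sl = SkipList()
for v in [3, 1, 2]:
    sl.insert(v)
print(sl.contains(2), sl.contains(5))  # True False
```

Whatever the coin flips produce, `contains` is always correct; only the shape of the structure, and hence the running time, is random.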

  • I don't believe that skip lists are probabilistic data structures. They are guaranteed to give the correct result, and there is no error rate, unlike true probabilistic data structures such as Bloom filters or HyperLogLog. Interestingly, the linked Wikipedia article has a hyperlink on "probabilistic" which leads to the Wikipedia article on "Randomized Algorithms", a totally unrelated topic. This calls into question the validity of the linked Wikipedia page.
    – Shuklaswag
    Jul 8, 2018 at 14:35
  • @Shuklaswag Define "true probabilistic data structure", or even "probabilistic data structure". I disagree with you. Even though skip lists always give correct answers, probability is an essential part of the algorithm that makes it useful.
    – freakish
    Sep 12, 2018 at 8:24

Probabilistic data structures allow for constant memory use and extremely fast processing while still maintaining a low error rate with a specified degree of uncertainty.

Some use cases are:

  • Checking for the presence of a value in a data set
  • Counting the frequency of events
  • Estimating the approximate size of a data set
  • Ranking and grouping
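For the grouping use case, similarity sketches such as MinHash (mentioned in another answer here) are typical. A toy version in Python, with illustrative function names and parameters: the fraction of matching minimum-hash slots between two signatures estimates the Jaccard similarity of the underlying sets.

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """Toy MinHash: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the set's items."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching slots estimates |A intersect B| / |A union B|.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = {"the", "quick", "brown", "fox"}
b = {"the", "quick", "brown", "dog"}
sig_a = minhash_signature(a)
sig_b = minhash_signature(b)
print(estimated_jaccard(sig_a, sig_b))  # near the true Jaccard 3/5 = 0.6
```

The signatures are fixed-size regardless of how large the sets are, which is what makes this usable for grouping similar documents at scale.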
