25

I am looking for some precise math on the likelihood of collisions for MD5, SHA1, and SHA256 based on the birthday paradox.

I am looking for something like a graph that says "If you have 10^8 keys, this is the probability. If you have 10^13 keys, this is the probability and so on"

I have looked at tons of articles but I am having a tough time finding something that gives me this data. (Ideal option for me would be a formula or code that calculates this for any provided hash size)

5
  • Why? For MD5 (and SHA-1 to a degree) for example it depends heavily on what your inputs are. MD5 has known collision attacks so if malicious users controls (part of) the input of the hashing algorithm then that significantly impacts the likelyhood of collisions. For the theoretical lower bound a perfect hashing algorithm should behave no different than a perfect random number generator. Jun 30, 2020 at 19:25
  • Mostly just curiosity. I should probably clarify that I'm looking for theoretical hash collision probabilities for a perfect hash based on the key sizes like 128 bits, 160 bits, and 256 bits etc. Jun 30, 2020 at 19:29
  • It's all on wikipedia, and also on wikipedia. Jun 30, 2020 at 20:01
  • @DarkNebula, yes, that's an important clarification, as it's a much easier problem, mathematically speaking. Jun 30, 2020 at 20:24
  • 2
    I invite you to try my calculator at bdayprob.com which allows input in log 2. For example, to test likelihood of a collision with 10^8 ≈ (2^(3.36))^8 = 2^(26.56) < 2^(27) keys in SHA256 (a hash with 256 bits), simply solve P with N=2^(27) and D= 2^(256). You'll find that the risk is negligible and virtually 0%. Solutions are attempted using different methods, all of which can be found in the Wikipedia article. Let me know if your find it useful :) Oct 2, 2021 at 10:39

1 Answer 1

61

Let's imagine we have a truly random hash function that hashes from strings to n-bit numbers. This means that there are 2n possible hash codes, and each string's hash code is chosen uniformly at random from all of those possibilities.

The birthday paradox specifically says that once you've seen roughly √(2k) items, there's a 50% chance of a collision, where k is the number of distinct possible outputs. In the case where the hash function hashes to an n-bit output, this means that you'll need roughly 2n/2 hashes before you get a collision. This is why we typically pick hashes that output 256 bits; it means that we'd need a staggering 2128 ≈1038 items hashed before there's a "reasonable" chance of a collision. With a 512-bit hash, you'd need about 2256 to get a 50% chance of a collision, and 2256 is approximately the number of protons in the known universe.

The exact formula for the probability of getting a collision with an n-bit hash function and k strings hashed is

1 - 2n! / (2kn (2n - k)!)

This is a fairly tricky quantity to work with directly, but we can get a decent approximation of this quantity using the expression

1 - e-k2/2n+1

So, to get (roughly) a probability p chance of a collision, we can solve to get

p ≈ 1 - e-k2/2n+1

1 - p ≈ e-k2/2n+1

ln(1 - p) ≈ -k2/2n+1

-ln(1 - p) ≈ k2/2n+1

-2n+1 ln(1 - p) ≈ k2

2(n+1)/2 √(-ln(1 - p)) ≈ k

As one last approximation, assume we're dealing with very small choices of p. Then ln(1 - p) ≈ -p, so we can rewrite this as

k ≈ 2(n+1)/2 √p

Notice that there's still a monster 2(n+1)/2 term here, so for a 256-bit hash that leading term is 2128.5, which is just enormous. For example, how many items must we see to get a 2-50 chance of a collision with a 256-bit hash? That would be approximately

2(256+1)/2 √2-50

= 2257/2 2-50/2

= 2207/2

= 2103.5.

So you'd need a staggeringly huge number of hashes to have a vanishingly small chance of getting a collision. Figure that 2103.5 is about 1031, which at one nanosecond per hash computed would take you longer than the length of the universe to compute. And after all that, you'd get a success probability of 2-50, which is about 10-15.

In fact, this precisely why we pick such large numbers of bits for our hashes! It makes it extremely unlikely for a collision to occur by chance.

(Note that the hash functions we have today aren't actually truly random functions, which is why people advise against using MD5, SHA1, and others that have had security weaknesses exposed.)

Hope this helps!

4
  • Awesome answer! Even though the parenthetical note at the end is technically outside the scope of the question, I'd personally highlight it a bit more, because it means that judging a hashing algorithm purely by the size of its output is a bad idea, because they are known to not actually be perfect. Jul 1, 2020 at 15:31
  • 2
    This is awesome and gives the exact math I was looking for, in a way that I can work it out easily for any custom values. I also want to call out the wikipedia article mentioned in another comment (en.wikipedia.org/wiki/Birthday_problem#Probability_table) which shows some general values for this. Jul 2, 2020 at 23:03
  • "The exact formula for the probability of getting a collision with an n-bit hash..." No, it is not the formula from the birthday paradox. The formula that you have here is exactly the opposite; what is the probability not to have any collisions.
    – yujaiyu
    Oct 7, 2020 at 5:23
  • 2
    @foki Thanks for catching that! Turns out I had two separate errors here that cancelled out. First, there’s the one you pointed out. Second, the approximation I’d given wasn’t the approximation for the quantity I’d written, but for the quantity I’d intended to write. So fixing the original formula made all the rest of the logic still consistent and correct - at least, I think that’s now fixed. :-) Oct 7, 2020 at 5:35

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.