33

I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary files' names to my application. My application must process each file within a limited period of time, otherwise it is shut down - that's a watchdog-like safety measure. Processing a file is likely to take a long time, so I need to design the application to handle this scenario. If my application gets shut down, the next time the search system wants to index the same file it will likely give it a different temporary name.

The obvious solution is to provide an intermediate layer between the search system and the backend. It queues each request to the backend and waits for the result to arrive. If the request times out in the intermediate layer - no problem: the backend keeps working, only the intermediate layer is restarted, and it can retrieve the result from the backend when the search system later repeats the request.
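
To make this concrete, here is a rough sketch of the bookkeeping I have in mind for the intermediate layer (all names and interfaces here are hypothetical; file_key is the stable identifier discussed below):

import hashlib

def file_key(path, chunk=64 * 1024):
    # Stable identifier for a file regardless of its temporary name;
    # how safe a content hash is for this purpose is the question below.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

class IntermediateLayer:
    def __init__(self, backend):
        self.backend = backend
        self.results = {}      # file key -> processing result
        self.pending = set()   # keys the backend is still working on

    def handle_request(self, path):
        key = file_key(path)              # stable across renamed copies
        if key in self.results:
            return self.results.pop(key)  # repeat request: result is ready
        if key not in self.pending:
            self.pending.add(key)
            self.backend.enqueue(key, path)
        return None                       # caller times out and retries later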

The problem is how to identify the files. Their names change randomly. I intend to use a hash function like MD5 to hash the file contents. I'm well aware of the birthday paradox and used an estimation from the linked article to compute the probability. If I assume I have no more than 100,000 files, the probability of two files having the same MD5 (128 bits) is about 1.47×10⁻²⁹.
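
(For reference, that figure follows directly from the standard birthday-bound approximation p ≈ n(n-1)/2 / 2**128:)

n = 100_000                      # number of files
p = n * (n - 1) / 2 / 2**128     # birthday-bound approximation
print(p)                         # -> ~1.47e-29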

Should I care about such a collision probability, or can I just assume that equal hash values mean equal file contents?

3
  • is this a hash on content or filename? May 14, 2009 at 10:20
  • The content is hashed. There's no point in hashing the filenames - they change randomly.
    – sharptooth
    May 14, 2009 at 10:54
  • 3
    If you're worried about collisions, consider both the file size and the hash. May 7, 2012 at 17:43

6 Answers

46

Equal hash means equal file, unless someone malicious is messing around with your files and injecting collisions (which could be the case if the files are downloaded from the internet). If that is a risk, go for a SHA-2 based function.

There are no accidental MD5 collisions; 1.47×10⁻²⁹ is a really, really, really small number.

To overcome the issue of rehashing big files, I would use a 3-phase identity scheme.

  1. Filesize alone
  2. Filesize + a hash of 4 × 64K blocks taken at different positions in the file
  3. A full hash

So if you see a file with a new size, you know for certain you do not have a duplicate. And so on.
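
A rough sketch of phases 2 and 3 in Python (the sampling positions and helper names are my own choices, not a canonical scheme):

import hashlib
import os

CHUNK = 64 * 1024

def quick_signature(path):
    # Phase 2: filesize plus a hash of four 64K blocks sampled
    # at different positions in the file.
    size = os.path.getsize(path)
    h = hashlib.md5()
    with open(path, "rb") as f:
        for pos in (0, size // 3, 2 * size // 3, max(size - CHUNK, 0)):
            f.seek(pos)
            h.update(f.read(CHUNK))
    return (size, h.hexdigest())

def full_hash(path):
    # Phase 3: hash of the entire contents, computed only when
    # phases 1 and 2 both match.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()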

4
  • @sharptooth see this question for some tricks you can use: stackoverflow.com/questions/788761/… May 14, 2009 at 11:32
  • 6
    I've got my first MD5 collision after 25K images already in DB
    – Valentin H
    Jan 21, 2015 at 12:21
  • 5
    "There are no accidental MD5 collisions" - this statement is false.
    – ColinM
    Sep 30, 2016 at 19:25
  • Loved the idea of hashing some small portions before hashing the whole file. Could greatly speed up duplicate file checking. Nov 5, 2019 at 8:52
4

Just because the probability is 1/X, it does not mean that it won't happen to you until you have X records. It's like the lottery: you're not likely to win, but somebody out there will win.

With the speed and capacity of computers these days (not even talking about security, just reliability) there is really no reason not to use a bigger/better hash function than MD5 for anything critical. Stepping up to SHA-1 should help you sleep better at night, but if you want to be extra cautious then go to SHA-256 and never think about it again.

If performance is truly an issue, then use BLAKE2, which is actually faster than MD5 but supports 256+ bits to make collisions less likely. However, while BLAKE2 has been well adopted, it would probably require adding a new dependency to your project.
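
For illustration, Python's standard hashlib has shipped BLAKE2 since 3.6, so in that case there is no extra dependency at all (the helper name below is mine):

import hashlib

def blake2_digest(path, chunk=64 * 1024):
    h = hashlib.blake2b(digest_size=32)   # 256-bit digest
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()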

2
  • 1
    With the lottery, though, you have a guaranteed winner. Whereas there are no known SHA256 collisions, and it's technically possible for there to never be one up until full exhaustion, right? Jul 24, 2017 at 18:54
  • Good point, in general for a file-hashing app you can pretty safely assume that SHA-256 will never produce a collision (unlike SHA1 which is used by git and collisions have occurred in large real-world projects). However, if using SHA-256 to hash random input bits (such as to generate a session id) you should still consider that the chances of a RNG collision are the same for a given number of input bits regardless of the hashing method used. That is, hashing a random 32-bit integer with SHA-256 is still just 32 bits of data so collisions are likely to occur.
    – ColinM
    Aug 9, 2017 at 17:17
3

I think you shouldn't.

However, you should if your system treats two files with equal content as distinct (by their real names, not their MD5). For example, in a search system two documents might have exactly the same content but still be distinct because they are located in different places.

1
  • That's the problem of the search system, not of my application. My application only needs to extract text from passed files.
    – sharptooth
    May 14, 2009 at 9:20
2

I came up with a Monte Carlo approach to be able to sleep safely while using UUID for distributed systems that have to serialize without collisions.

from random import randint
from math import log
from collections import Counter

def colltest(exp):
    uniques = set()                      # values seen so far
    while True:
        r = randint(0, 2**exp - 1)       # random value in a 2**exp keyspace
        if r in uniques:
            # return the order of magnitude (base 2) of the draw count
            return int(log(len(uniques) + 1, 2))
        uniques.add(r)

for k, v in sorted(Counter(colltest(20) for _ in range(1000)).items()):
    print(k, "hash orders of magnitude events before collision:", v)

would print something like:

5 hash orders of magnitude events before collision: 1
6 hash orders of magnitude events before collision: 5
7 hash orders of magnitude events before collision: 21
8 hash orders of magnitude events before collision: 91
9 hash orders of magnitude events before collision: 274
10 hash orders of magnitude events before collision: 469
11 hash orders of magnitude events before collision: 138
12 hash orders of magnitude events before collision: 1

I had heard the formula before: if you need to store about 2**(x/2) keys, use a hash function with a keyspace of at least 2**x.

Repeated experiments (1000 runs in a 20-bit space) show that you sometimes get a collision as early as 2**(x/4) draws.

For uuid4, which has 122 random bits, that means I sleep safely while several computers pick random UUIDs until I have about 2**31 items. Peak load in the system I am thinking about is roughly 10-20 events per second; I'm assuming an average of 7. That gives me an operating window of roughly 10 years, even with that extreme paranoia.
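
(Sanity-checking that window with the assumed numbers:)

items = 2**31                    # comfortable item count for 122 random bits
rate = 7                         # assumed average events per second
years = items / rate / (60 * 60 * 24 * 365)
print(round(years, 1))           # -> ~9.7 years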

1

Here's an interactive calculator that lets you estimate the probability of collision for any hash size and number of objects: http://everydayinternetstuff.com/2015/04/hash-collision-probability-calculator/

2
  • The question is not about estimating the probability. I know the probability. The question is what I do next.
    – sharptooth
    Apr 23, 2015 at 14:05
  • 2
    What you do next is simple: you choose a hash function with more bits and preferably a better distribution, such as SHA-1, and then describe the chance of a collision, what happens when it occurs, and what the consequences are. Oct 31, 2015 at 15:30
0

For temporary file names, use OS functionality. It likely combines an intermediate subdirectory, the Unix epoch, etc.
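
For example, with Python's standard tempfile module (the prefix and suffix here are arbitrary):

import tempfile

# Let the OS pick a unique temporary name instead of inventing one.
with tempfile.NamedTemporaryFile(prefix="index-", suffix=".tmp", delete=False) as f:
    print(f.name)   # OS-chosen unique path, e.g. under /tmp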
