I am making an application that stores documents and gives each one a UID based on a SHA1 digest of a few things including the timestamp. The digest has a lot of characters, and I want to allow users to identify the documents by using the first x characters of the full digest. What's a good value for x if the number of documents is maybe around 10K - 100K?
6 Answers
Adapting the formulas on on wikipedia for the Birthday problem, you can approximate the probability of collision as 1 - e^(-n^2/(2^(b+1)))
, where n
is the document count and b
is the number of bits. Graphing this formula with n=100,000, it looks like you'll want b > 45 at least. I'd be more inclined to go with 64 to make it a nice and round number. That said, do have a plan to deal with collisions if they occur (maybe alter the timestamp slightly, or add a nonce?)
For that matter, if the sha1 is based on more than just the content of the document, why not simply make it a random ID? In this case collisions are less of a problem, as you can always generate a new random number and try again (the probability of a collision with a single try is the same, however).
-
Small nit - Isn't the formuala e^(-n^2/(2^(b+1)))? It changes answer slightly to b >40. Jan 26, 2011 at 11:24
-
@Fakrudeen, indeed - I made an error when transcribing it into the answer. The graph was correct though..... although I just now realize stackoverflow didn't make a link for it :|– bdonlanJan 26, 2011 at 11:27
-
I've updated the answer to have the correct formula as agreed upon in the comments. May 23, 2012 at 18:52
-
Biham/Chen offer examples of Near Collisions; and Knudsen demonstrates Truncated Differentials. Both are problems for truncated hashes; neither are instances of the birthday paradox.– jwwDec 18, 2012 at 5:25
-
2Marc Stevens is engineering prefix collisions on MD5 and SHA. He's doing something to cause a breakdown in the expected distribution :) So its probably not enough to simply truncate and expect the smaller hash has the equivalent security of the reduced bit size.– jwwMay 26, 2015 at 12:49
Be careful of truncation as there is no reduction in proof that the smaller hash is secure. See Kelsey's http://csrc.nist.gov/groups/ST/hash/documents/Kelsey_Truncation.pdf. Kelsey gives to heuristic arguments stating the same ("Related Hash Outputs" and "Near Collisions"). Biham/Chen offer examples of Near Collisions; and Knudsen demonstrates Truncated Differentials.
In the end, you probably want to feed your data into an HMAC with the truncated size (the size is digested by the HMAC, too) and then use the truncated HMAC.
-
Hi JWW, about the NIST-PDF, how you interpret it? The @bdonlan's formula,
e^(-n^2/(2^(b+1)))
, is a good approximation to estimate truncations or not? If it is not, what the formula or algorithm to check minimal number of bits (bmin) for a SHA1 truncation? Nov 21, 2017 at 12:44
There really isn't a value for this; part of what makes SHA a good general-purpose hashing algorithm is that similar data does not necessarily produce similar hashed values. Your best bet (without knowing anything else about your system) would just be to search the list of documents whose hashes start with the value supplied by the user, then either present them with a list of documents to select from or go directly to the document if there's only one.
-
1
It's a generalization of the birthday problem. In you case n is number of documents, and instead of constant 365 you'd have number of possibilities the cutoff gives you (so for k bits it's 2k).
Of course exact calculation is out of the question, but you might use approximation.
-
Biham/Chen offer examples of Near Collisions; and Knudsen demonstrates Truncated Differentials. Both are problems for truncated hashes; neither are instances of the birthday paradox.– jwwDec 18, 2012 at 5:25
Here are a few key things to consider:
What is the theoretical risk of a collision. As others have mentioned, this is a generalization of the "birthday problem".
What is the SHA-1 hash chance of a collision, compared to the theoretical one. Ideally, it should be the same, as in the case of a truly random distribution. You can take a look at some practical tests that show pretty good results for cases similar to yours.
What you consider to be an acceptible risk. Also, how bad for you application it would be if a collision did happen. For some a 10% risk might be acceptable. For others a 0.0001% risk might be unacceptable.
Depending on your case, a good strategy might be to use more bits internally, but only display a part of them to the user. Consider what git does - it uses all the 160 bits of a SHA-1 hash internally, but usually only a short prefix of the whole HEX representation is displayed to the user.
With all that in mind, for up to 100K documents, I'd probably use a 64-bit hash. With that, the risk for a collision would be lower than the average's person risk of getting hit by a thunder at least twice in their lifetime. Most DBs and programming languages have a native 64-bit integer type, so that would be convenient, fast and memory efficient.
Then, you need to decide what the best way to display that hash to your users is. If you display it as a HEX, that would be 16 hex-chars for the whole 64-bit hash. I'd consider displaying it in a format that is easier for humans to visually compare values. Example: 72af-88c1-d2a6-5c2e
.
Well, here's a possibly too simplistic of an answer..
If with full sha1 you get about 1 in 2^160 chance of collision, then by truncating one character you increase the chances of collision by 16 (all possible values of the truncated character)... which is 2^4.. So, if you truncate x characters you get 1 in 2^(160 - 4*x) chances of collision.. right?
-
1For a single document this is true, but the probability of any collision occuring for any pair of documents increases much more rapidly– bdonlanJan 24, 2011 at 16:34
-
Biham/Chen offer examples of Near Collisions; and Knudsen demonstrates Truncated Differentials. Both are problems for truncated hashes; neither are instances of the birthday paradox.– jwwDec 18, 2012 at 5:25