The birthday paradox in action: calculating the probability of a hash collision (solipsys.co.uk)
100 points by ColinWright on Nov 7, 2012 | hide | past | favorite | 43 comments


Hidden gem on the right side:

[...] People often ask why they need to memorise formulas, or why they need to practice solving equations, when they can simply look stuff up whenever they need it, and on-line computer algebra systems can solve equations faster than they can, and more reliably.

But this is an example of why the ability simply to look stuff up is near useless on its own. Searches are deep and wide, and you need intuition to guide you. You need to recognise what might work, things you've seen before, directions to take that are more likely to be fruitful. [...]


That's a neat formula, and directly applicable to having a bunch of objects in a git repository.

Applying the formula to 160-bit SHA-1, you need 1.7e23 objects to get a 1% chance of a collision. The current Linux kernel repository has 2.7 million objects, so to get a collision you'd need a repository 6e16 times larger. That should be plenty.

For some wacky perspective, that's 10 million kernel-sized contributions for every man, woman, and child on Earth, all in a single repository. It seems git will hit plenty of other bottlenecks before SHA-1 becomes a problem...
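The figures in this comment can be reproduced with a short script (a sketch, not the article's own code), using the standard birthday-bound approximation n ≈ sqrt(2N · ln(1/(1−p))):

```python
import math

def birthday_bound(bits, p):
    """Approximate number of samples drawn uniformly from a space of
    2**bits values needed for a collision probability of p, using
    n ~ sqrt(2 * N * ln(1 / (1 - p)))."""
    space = 2 ** bits
    return math.sqrt(2 * space * math.log(1 / (1 - p)))

objects = birthday_bound(160, 0.01)   # SHA-1, 1% collision chance
print(f"{objects:.2e}")               # ~1.71e+23 objects
print(f"{objects / 2.7e6:.1e}")       # ~6.3e+16 kernel-sized repos
```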


It is a huge number. Of course, the fun part is that after precisely calculating that probability and proving that we have nothing to worry about, the next commit we make could collide... and it would be the only git collision in the history of the human race.

Probabilities are fun, because anything non-zero is non-zero ;) (I don't expect to ever witness a git hash collision of course)


At my day job, SHA-1 is the basis for our deduplication. As such, we love talking about frames of reference for the probability of a SHA-1 collision. One of our presentations includes such facts as "the probability of dying in a vending machine accident", etc.

If you enjoy this stuff, you'll get a pleasant tickle out of this one:

What is the probability that there exists a SHA-1 hash which hashes to itself? In other words, are there any fixed points?

(This post was edited to fix the phrasing mistake nicely pointed out by a commenter. Originally, I asked "probability that a SHA-1" when I meant "probability that any SHA-1")


This sounds snarky, but I mean it as a genuine question about your revised post: what do you mean by asking about the probability of something that either definitely happens or definitely doesn't? That is to say, either there is a fixed point or there isn't; in either case, I can't see even an informal way to assign any probability other than 0 or 1 to the question.


We can say:

    Without using any knowledge about how they are
    computed, & only using the fact that a hash is,
    in effect, a result chosen uniformly at random
    from a set of results, what is the probability
    that ...
So in particular, consider a collection of numbers, and for each one, choose a target uniformly at random from the same collection. So for the set X you have a function f:X->X. What is the probability that there is a fixed point x such that f(x)=x?

This, by the way, is a very well known calculation/result in my area of math.
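A minimal Monte Carlo sketch of the abstract problem described above (a random function f: X→X, assuming nothing about SHA-1 itself) shows the probability of at least one fixed point settling near 1 − 1/e ≈ 0.632:

```python
import math
import random

def has_fixed_point(n, rng):
    """Draw a random function f: {0..n-1} -> {0..n-1} and report
    whether some x satisfies f(x) == x."""
    return any(rng.randrange(n) == x for x in range(n))

rng = random.Random(42)
trials = 10_000
n = 1_000
hits = sum(has_fixed_point(n, rng) for _ in range(trials))
print(hits / trials)       # close to 1 - 1/e
print(1 - 1 / math.e)      # ~0.632
```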


Colin explained it, but I wanted to apologize for the dissonance. Essentially, I'm phrasing the abstract "pick an element, then pick again" problem in terms of SHA collisions. Just for fun.


Pleasant tickle? Isn't it just 1/max(SHA-1)?

I do get a pleasant tickle from expanding that to the question of whether there is any fixed point in SHA-1, but the chance for a particular value is the answer you always get for "what is the chance of a specific hash".


Yeah, that was a phrasing mistake in my original post. I meant whether there are ANY hashes that hash to themselves (any fixed point). Thanks for noticing.


Nitpick: proving that (as opposed to showing it to be likely by hashing a few zillion candidate keys and computing stats on their hashes) may be a bit of a challenge.

I am making this remark because the question reminded me of the question "If it is the 13th of the month, what is the probability that it is Friday?" (no tricks with weird calendars; you should just use the Gregorian one).
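The Friday-the-13th question has an exact answer, because the Gregorian calendar repeats every 400 years (146,097 days, which is exactly 20,871 weeks). Counting over one full cycle, as a sketch:

```python
from collections import Counter
from datetime import date

# The Gregorian calendar repeats every 400 years, so counting the
# weekday of the 13th over any 400-year span gives exact frequencies.
counts = Counter(
    date(year, month, 13).weekday()   # Monday == 0 ... Sunday == 6
    for year in range(2000, 2400)
    for month in range(1, 13)
)
total = 400 * 12                      # 4800 thirteenths per cycle
print(counts[4], total)               # Fridays: 688 of 4800
print(counts[4] / total)              # ~0.14333, slightly above 1/7
```

So given that it's the 13th, the probability that it's a Friday is 688/4800 ≈ 0.1433, not exactly 1/7 ≈ 0.1429: the 13th is in fact fractionally more likely to fall on a Friday than on any other weekday.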


If there are max(SHA-1) hashes, and a 1/max(SHA-1) chance any particular one will hash to itself, does that mean we would expect only a single instance of a hash hashing to itself?

I have no idea how you might go about finding that one expected instance.


You would expect about 1 per unflawed hash function tested. But the law of averages is less powerful when you focus on a specific hash. The chance of having no fixed points in a particular unflawed hash is 1/e, so there is probably a fixed point, and possibly more than one.


> Pleasant tickle? Isn't it just 1/max(SHA-1)?

Only if there is a unique fixed point ….


Not really. If we're talking about abstract probability over hash functions, it's the same no matter how many fixed points SHA-1 happens to have. If we talk about concrete numbers, then it's actually 1 or 0 for each hash.


bar zvahf qrenatrzragf bire crezhgngvbaf

frr uggcf//ra.jvxvcrqvn.bet/jvxv/Qrenatrzrag#Yvzvg_bs_engvb_bs_qrenatrzrag_gb_crezhgngvba_nf_a_nccebnpurf_.R2.88.9R


ROT13ed and fixed the link:

one minus derangements over permutations

see

https://en.wikipedia.org/wiki/Derangement#Limit_of_ratio_of_...
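For a random permutation (rather than a general random function), the spoilered answer works out as 1 − D(n)/n!, where D(n) counts derangements; that ratio also tends to 1 − 1/e. A quick check via the standard recurrence, as a sketch:

```python
import math

def derangements(n):
    """D(n) via the recurrence D(n) = (n - 1) * (D(n - 1) + D(n - 2))."""
    if n == 0:
        return 1
    if n == 1:
        return 0
    d_prev2, d_prev = 1, 0            # D(0), D(1)
    for k in range(2, n + 1):
        d_prev2, d_prev = d_prev, (k - 1) * (d_prev + d_prev2)
    return d_prev

n = 20
p_fixed_point = 1 - derangements(n) / math.factorial(n)
print(p_fixed_point)                  # ~0.6321, i.e. about 1 - 1/e
```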


It is very important to note that this is the probability that any two hashes collide within a set of hashes. It is not the probability of finding a second plaintext that hashes to the same hash as a given one. Finding a colliding value for a given hash is far harder than just grabbing a group of values and having two of them hash to the same thing.


Indeed, pretty much exactly that point is made in the third box on the right:

    In case you are one of the many, many people who get this wrong
    and are astonished, you may be thinking of a different question.
    My birthday is a specific date. If we assume birthdays are
    uniformly distributed, how many people must you ask before you
    find someone who has the same birthday as me?

    On average, about 182.

    If you ask some 150 to 200 people, the odds are about 50% that
    you'll find someone who shares my specific birthday.

    But that's not the question I asked. They don't just need to share
    my birthday, it can be that any birthday is shared among them, and
    that changes the odds dramatically.


I don't see how that point relates to Xylakant's comment. The text you quoted basically points to the difference between these two cases:

1) that if you have a specific hash, how many more hashes would have to be generated to get the same hash.

2) how many hashes would you have to generate before a collision occurs between any of them.

So in the two cases above we're dealing with sets of hashes, and we don't care how they are generated.

Xylakant's comment was about a different type of collision. This collision happens because you are mapping an arbitrarily large text into a fixed-size hash, which means that different texts can map to the same hash, hence creating a collision.


Xylakant:

  > It is very important to note that this is the probability
  > that any two hashes collide in a set of hashes. It is not
  > the probability to find a second plaintext that hashes to
  > the same hash as a given one.
ColinWright:

  > Indeed, pretty much exactly that point is made ...

  > > In case you are one of the many, many people who get this wrong
  > > and are astonished, you may be thinking of a different question.
  > > My birthday is a specific date. If we assume birthdays are
  > > uniformly distributed, how many people must you ask before you
  > > find someone who has the same birthday as me?

  > >  ...

  > > But that's not the question I asked. They don't just need to share
  > > my birthday, it can be that any birthday is shared among them, and
  > > that changes the odds dramatically.
	
dsego:

  > I don't see how that point relates to Xylakant's comment.
OK, I'll try to explain my thinking:

  > The text you quoted basically points to the difference
  > between these two cases:

  > 1) that if you have a specific hash, how many more hashes
  >    would have to be generated to get the same hash.

  > 2) how many hashes would you have to generate before a
  >    collision occurs between any of them.
Yes.

  > So in the two cases above we're dealing with sets of hashes,
  > and we don't care how they are generated.
That's true.

  > Xylakant's comment was about a different type of collision.
  > This collision happens because you are mapping an arbitrary
  > large text into a fixed size hash, which means that different
  > texts can map to the same hash, hence creating a collision.
If you're using cryptographic hashes then I believe these amount to the same thing.

The distinguishing characteristics of a cryptographic hash are that it has no internal structure, and it is completely unpredictable based on knowledge of similar source texts. In particular, techniques such as differential analysis don't work.

In that sense, finding a text to map to a hash to give a collision is exactly the same as picking a random number that happens to match. The act of starting with a text and using that to create a hash is identical to the act of choosing a random element of the hash space. If these two things were not identical, then the hash would have some sort of predictability or structure that related it to the source text.

The situation is different if you are using non-cryptographic hashes such as CRCs. Perhaps I should edit the original article to help make that distinction, but I'm not really sure it's worth it.


Actually, I was trying to express that the birthday paradox does not apply when you want to find a plaintext that evaluates to a known hash, as you'd need to do for password cracking or forging a signed email, for example [1]. The birthday paradox only gives you the probability that, in a random set of hashes, two are the same: "if I generate n random pieces of plaintext, how big is the chance that two of them generate the same hash?" More precisely, it gives you the probability of picking the same element twice from a finite set of elements when you pick n times.

[1] When generating emails more restrictions apply: The colliding plaintexts have to be at least somewhat coherent and probably should express something the attacker wants to express.

Edit: Forgot a negation in a crucial place. Darn.
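The gap between the two attacks is easy to feel on a deliberately weakened hash. Truncating SHA-1 to 32 bits (purely an illustration, not anything a real system should do), a birthday collision turns up after a few tens of thousands of attempts (order 2^16), whereas a preimage of one specific digest would still cost on the order of 2^32:

```python
import hashlib

def h32(data: bytes) -> bytes:
    """SHA-1 truncated to 32 bits -- weakened purely for demonstration."""
    return hashlib.sha1(data).digest()[:4]

# Birthday search: hash distinct inputs until any two outputs match.
seen = {}
i = 0
while True:
    msg = str(i).encode()
    digest = h32(msg)
    if digest in seen:
        break                          # msg and seen[digest] collide
    seen[digest] = msg
    i += 1

print(f"collision after {i + 1} hashes")   # on the order of 2**16
print(seen[digest], "and", msg, "share a 32-bit digest")
```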


Yes, and that's exactly the point that is addressed in the box I quoted.

You're exactly right, and I believe it's covered. I didn't go into detail about the problems involved in generating a plaintext that hashes to a specific hash, such as you mention; I simply mentioned that the problem I'm talking about is not that one.

So I don't understand the point dsego was making, because I think my reference is relevant.

At this point I'm no longer sure it really matters.


Sorry, I think I was a bit confused. Now I see that it is the same thing. Sorry for wasting your time :(


You haven't wasted my time. You've made me think harder, you've made me reconsider my explanation, and I've helped you become unconfused.

Not a problem, and you're welcome!


I once made a little web page that calculates the probability of a hash collision: http://davidjohnstone.net/pages/hash-collision-probability


My favorite header ever: "Mathematicians aren't normal people". The numbers, they're everywhere!!!


In what sense is the birthday paradox a paradox?


It is not a logical paradox, but it is very counterintuitive. With just 23 samples you'll have a birthday collision with a 50% chance; many people would think that to get 50% you'd need 183 samples.

It is also counterintuitive because it is unavoidable: even with a very large hash range, one can force a collision with a relatively small number of samples.
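Both figures fall out of two one-line formulas (assuming 365 equally likely birthdays and ignoring leap days); a sketch:

```python
import math

def p_any_shared(n, days=365):
    """Probability that some pair among n people shares a birthday."""
    return 1 - math.prod((days - k) / days for k in range(n))

def p_shares_mine(n, days=365):
    """Probability that at least one of n OTHER people shares MY birthday."""
    return 1 - ((days - 1) / days) ** n

print(p_any_shared(23))    # ~0.507 -- just past 50% with 23 people
print(p_shares_mine(22))   # ~0.059 -- nowhere near 50%
print(p_shares_mine(253))  # ~0.500 -- ~253 others needed for 50%
```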


It's sometimes called the Birthday Problem to avoid the paradox label. (See, for example, the wikipedia article.)

The paradoxy bits of the problem are the things that people find confusing. "Only 23?!" or "So if there are 23 people in a room there's a 50% chance that one of them will share a birthday with me?"


Is "So if there are 23 people in a room there's a 50% chance that one of them will share a birthday with me?" correct? If you pick a fixed person (you) that would break the pigeonhole principle, I think.

I would think that saying "So if there are 23 people in the room there is a 50% chance any 2 of those people share the same birthday" is better.

Edit: I misunderstood, you're saying that other people are misunderstanding. My mistake!


You're misquoting or misunderstanding. The comment to which you are replying said:

    The paradoxy bits of the problem are the things that
    people find confusing. "Only 23?!" or "So if there are
    23 people in a room there's a 50% chance that one of
    them will share a birthday with me?"
In that comment he is saying that people misunderstand the question, and assume that it means that once there are 23 people in the room, there's a 50:50 chance someone will share a birthday with them specifically.

And that's exactly the wrong question, as you point out. So when you say:

    Is "So if there are 23 people in a room there's a 50% chance
    that one of them will share a birthday with me?" correct?
No, that's not correct, but it is what people think they hear, and it's that confusion that makes this whole thing sometimes called a paradox.

So let's be clear:

    If you're in a room with 22 other people, the chance
    that one of them shares a birthday specifically with
    you is nowhere near 50%

    However, the chance that among the 23 people in the
    room there is, somewhere, a shared birthday, is indeed
    slightly greater than 50%
And my experience is that it really doesn't matter how carefully you word this, some people simply will not understand it.


You're right, I misunderstood; the comment I replied to was using that as an example of how other people misunderstand the problem. D'oh. Edited!


The same sense in which the Banach–Tarski paradox[1] is: it's a counter-intuitive mathematical truth.

[1]: https://en.wikipedia.org/wiki/Banach%E2%80%93Tarski_paradox


"a seemingly absurd or self-contradictory statement or proposition that when investigated or explained may prove to be well founded or true: in a paradox, he has discovered that stepping back from his job has increased the rewards he gleans from it."


Interesting analysis on the collision angle.

For those interested in another (basic) look at the Birthday Paradox, check out http://alexanderle.com/blog/2011/birthday-paradox.html !


So has anyone yet done the calculation to work out how many files GitHub would need to store to have a non-negligible chance of hash collisions in the git file names? Or perhaps the Linux kernel?


The formula given at the end of the article implies that to have a 5% chance of a collision you would need 3.87e23 objects. (SHA-1 is 160 bits.) Even storing 1 trillion objects gives a collision probability of only about 3e-25.

So this is unlikely to be a concern, ever.
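As a sanity check on those orders of magnitude (a sketch using the same assumptions as the article: uniform 160-bit outputs, and the small-p approximation p ≈ n²/2N):

```python
import math

N = 2 ** 160                               # size of the SHA-1 output space

# Objects needed for a 5% collision chance:
n_5pct = math.sqrt(2 * N * math.log(1 / 0.95))
print(f"{n_5pct:.2e}")                     # ~3.87e+23

# Collision probability for a trillion objects (p ~ n^2 / 2N):
n = 10 ** 12
p = n * (n - 1) / (2 * N)
print(f"{p:.1e}")                          # ~3.4e-25
```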


Hmmm, I wonder how many objects are currently stored in S3? And what rate that's growing at? Or how many documents Google has stored in Bigtable?

But yeah, 160 bits is probably "enough" :-)


It doesn't even need to be filenames; the contents of files can collide as well.


I believe bigiain is referring to the fact that git objects are stored by their SHA-1 digest (so a hash collision would cause a collision in these filenames), not the filenames of the contents. (The objects are stored under the .git/objects directory in the format xx/y...y, where the directory xx is the first two hex digits of the digest and the 38 y's are the rest of the digest.)




I just took an exam on this....



