The birthday paradox in action: calculating the probability of a hash collision (solipsys.co.uk)
100 points by ColinWright on Nov 7, 2012 | hide | past | favorite | 43 comments


Hidden gem on the right side:

[...] People often ask why they need to memorise formulas, or why they need to practice solving equations, when they can simply look stuff up whenever they need it, and on-line computer algebra systems can solve equations faster than they can, and more reliably.

But this is an example of why the ability simply to look stuff up is near useless on its own. Searches are deep and wide, and you need intuition to guide you. You need to recognise what might work, things you've seen before, directions to take that are more likely to be fruitful. [...]


That's a neat formula, and directly applicable to having a bunch of objects in a git repository.

Applying the formula to 160-bit SHA-1, you need 1.7e23 objects to get a 1% chance of a collision. The current Linux kernel repository has 2.7 million objects, so to get a collision you'd need a repository 6e16 times larger. That should be plenty.

For some wacky perspective, that's 10 million kernel-sized contributions for every man, woman, and child on Earth, all in a single repository. It seems git will hit plenty of other bottlenecks before SHA-1 becomes a problem...
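The figures in this comment can be reproduced with a short script (a sketch, not the article's own code), using the standard birthday-bound approximation n ≈ sqrt(2N · ln(1/(1−p))):

```python
import math

def birthday_bound(bits, p):
    """Approximate number of samples drawn uniformly from a space of
    2**bits values needed for a collision probability of p, using
    n ~ sqrt(2 * N * ln(1 / (1 - p)))."""
    space = 2 ** bits
    return math.sqrt(2 * space * math.log(1 / (1 - p)))

objects = birthday_bound(160, 0.01)   # SHA-1, 1% collision chance
print(f"{objects:.2e}")               # ~1.71e+23 objects
print(f"{objects / 2.7e6:.1e}")       # ~6.3e+16 kernel-sized repos
```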


It is a huge number. Of course, the fun part is that after precisely calculating that probability and proving that we have nothing to worry about, the next commit we make could collide... and it would be the only git collision in the history of the human race.

Probabilities are fun, because anything non-zero is non-zero ;) (I don't expect to ever witness a git hash collision of course)


At my day job, SHA-1 is the basis for our deduplication. As such, we love talking about frames of reference for the probability of a SHA-1 collision. One of our presentations includes such facts as "the probability of dying in a vending machine accident", etc.

If you enjoy this stuff, you'll get a pleasant tickle out of this one:

What is the probability that there exists a SHA-1 hash which hashes to itself? In other words, are there any fixed points?

(This post was edited to fix the phrasing mistake nicely pointed out by a commenter. Originally, I asked "probability that a SHA-1" when I meant "probability that any SHA-1")


This sounds snarky, but I mean it as a genuine question about your revised post: what do you mean by asking about the probability of something that either definitely happens or definitely doesn't? That is to say, either there is a fixed point or there isn't; in either case, I can't see even an informal way to assign any probability other than 0 or 1 to the question.


We can say:

    Without using any knowledge about how they are
    computed, & only using the fact that a hash is,
    in effect, a result chosen uniformly at random
    from a set of results, what is the probability
    that ...
So in particular, consider a collection of numbers, and for each one, choose a target uniformly at random from the same collection. So for the set X you have a function f:X->X. What is the probability that there is a fixed point x such that f(x)=x?

This, by the way, is a very well known calculation/result in my area of math.
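A minimal Monte Carlo sketch of the abstract problem described above (a random function f: X→X, assuming nothing about SHA-1 itself) shows the probability of at least one fixed point settling near 1 − 1/e ≈ 0.632:

```python
import math
import random

def has_fixed_point(n, rng):
    """Draw a random function f: {0..n-1} -> {0..n-1} and report
    whether some x satisfies f(x) == x."""
    return any(rng.randrange(n) == x for x in range(n))

rng = random.Random(42)
trials = 10_000
n = 1_000
hits = sum(has_fixed_point(n, rng) for _ in range(trials))
print(hits / trials)       # close to 1 - 1/e
print(1 - 1 / math.e)      # ~0.632
```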


Colin explained it, but I wanted to apologize for the dissonance. Essentially, I'm phrasing the abstract "pick an element, then pick again" problem in terms of SHA collisions. Just for fun.


Pleasant tickle? Isn't it just 1/max(SHA-1)?

I do get a pleasant tickle from expanding that to the question of whether there is any fixed point in SHA-1, but the chance for a particular value is the answer you always get for "what is the chance of a specific hash".


Yeah, that was a phrasing mistake in my original post. I meant whether there are ANY hashes that hash to themselves (any fixed point). Thanks for noticing.


Nitpick: proving that (as opposed to showing it to be likely by hashing a few zillion candidate keys and computing stats on their hashes) may be a bit of a challenge.

I am making this remark because the question reminded me of the question "If it is the 13th of the month, what is the probability that it is Friday?" (no tricks with weird calendars; you should just use the Gregorian one).
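The Friday-the-13th question has an exact answer, because the Gregorian calendar repeats every 400 years (146,097 days, which is exactly 20,871 weeks). Counting over one full cycle, as a sketch:

```python
from collections import Counter
from datetime import date

# The Gregorian calendar repeats every 400 years, so counting the
# weekday of the 13th over any 400-year span gives exact frequencies.
counts = Counter(
    date(year, month, 13).weekday()   # Monday == 0 ... Sunday == 6
    for year in range(2000, 2400)
    for month in range(1, 13)
)
total = 400 * 12                      # 4800 thirteenths per cycle
print(counts[4], total)               # Fridays: 688 of 4800
print(counts[4] / total)              # ~0.14333, slightly above 1/7
```

So given that it's the 13th, the probability that it's a Friday is 688/4800 ≈ 0.1433, not exactly 1/7 ≈ 0.1429: the 13th is in fact fractionally more likely to fall on a Friday than on any other weekday.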


If there are max(SHA-1) hashes, and a 1/max(SHA-1) chance any particular one will hash to itself, does that mean we would expect only a single instance of a hash hashing to itself?

I have no idea how you might go about finding that one expected instance.


You would expect about 1 per unflawed hash function tested. But the law of averages is less powerful when you focus on a specific hash. The chance of having no fixed points in a particular unflawed hash is 1/e, so there is probably a fixed point, and possibly more than one.


> Pleasant tickle? Isn't it just 1/max(SHA-1)?

Only if there is a unique fixed point ….


Not really. If we're talking about abstract probability over hash functions, it's the same no matter how many fixed points SHA-1 happens to have. If we talk about concrete numbers, then it's actually 1 or 0 for each hash.


bar zvahf qrenatrzragf bire crezhgngvbaf

frr uggcf//ra.jvxvcrqvn.bet/jvxv/Qrenatrzrag#Yvzvg_bs_engvb_bs_qrenatrzrag_gb_crezhgngvba_nf_a_nccebnpurf_.R2.88.9R


ROT13ed and fixed the link:

one minus derangements over permutations

see

https://en.wikipedia.org/wiki/Derangement#Limit_of_ratio_of_...
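For a random permutation (rather than a general random function), the spoilered answer works out as 1 − D(n)/n!, where D(n) counts derangements; that ratio also tends to 1 − 1/e. A quick check via the standard recurrence, as a sketch:

```python
import math

def derangements(n):
    """D(n) via the recurrence D(n) = (n - 1) * (D(n - 1) + D(n - 2))."""
    if n == 0:
        return 1
    if n == 1:
        return 0
    d_prev2, d_prev = 1, 0            # D(0), D(1)
    for k in range(2, n + 1):
        d_prev2, d_prev = d_prev, (k - 1) * (d_prev + d_prev2)
    return d_prev

n = 20
p_fixed_point = 1 - derangements(n) / math.factorial(n)
print(p_fixed_point)                  # ~0.6321, i.e. about 1 - 1/e
```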


It is very important to note that this is the probability that any two hashes collide within a set of hashes. It is not the probability of finding a second plaintext that hashes to the same hash as a given one. Finding a colliding value for a given hash is far harder than just grabbing a group of values and having two of them hash to the same thing.


Indeed, pretty much exactly that point is made in the third box on the right:

    In case you are one of the many, many people who get this wrong
    and are astonished, you may be thinking of a different question.
    My birthday is a specific date. If we assume birthdays are
    uniformly distributed, how many people must you ask before you
    find someone who has the same birthday as me?

    On average, about 182.

    If you ask some 150 to 200 people, the odds are about 50% that
    you'll find someone who shares my specific birthday.

    But that's not the question I asked. They don't just need to share
    my birthday, it can be that any birthday is shared among them, and
    that changes the odds dramatically.


I don't see how that point relates to Xylakant's comment. The text you quoted basically points to the difference between these two cases:

1) that if you have a specific hash, how many more hashes would have to be generated to get the same hash.

2) how many hashes would you have to generate before a collision occurs between any of them.

So in the two cases above we're dealing with sets of hashes, and we don't care how they are generated.

Xylakant's comment was about a different type of collision. This collision happens because you are mapping an arbitrarily large text into a fixed-size hash, which means that different texts can map to the same hash, hence creating a collision.


Xylakant:

  > It is very important to note that this is the probability
  > that any two hashes collide in a set of hashes. It is not
  > the probability to find a second plaintext that hashes to
  > the same hash as a given one.
ColinWright:

  > Indeed, pretty much exactly that point is made ...

  > > In case you are one of the many, many people who get this wrong
  > > and are astonished, you may be thinking of a different question.
  > > My birthday is a specific date. If we assume birthdays are
  > > uniformly distributed, how many people must you ask before you
  > > find someone who has the same birthday as me?

  > >  ...

  > > But that's not the question I asked. They don't just need to share
  > > my birthday, it can be that any birthday is shared among them, and
  > > that changes the odds dramatically.
	
dsego:

  > I don't see how that point relates to Xylakant's comment.
OK, I'll try to explain my thinking:

  > The text you quoted basically points to the difference
  > between these two cases:

  > 1) that if you have a specific hash, how many more hashes
  >    would have to be generated to get the same hash.

  > 2) how many hashes would you have to generate before a
  >    collision occurs between any of them.
Yes.

  > So in the two cases above we're dealing with sets of hashes,
  > and we don't care how they are generated.
That's true.

  > Xylakant's comment was about a different type of collision.
  > This collision happens because you are mapping an arbitrary
  > large text into a fixed size hash, which means that different
  > texts can map to the same hash, hence creating a collision.
If you're using cryptographic hashes then I believe these amount to the same thing.

The distinguishing characteristics of a cryptographic hash are that it has no internal structure, and it is completely unpredictable based on knowledge of similar source texts. In particular, techniques such as differential analysis don't work.

In that sense, finding a text to map to a hash to give a collision is exactly the same as picking a random number that happens to match. The act of starting with a text and using that to create a hash is identical to the act of choosing a random element of the hash space. If these two things were not identical, then the hash would have some sort of predictability or structure that related it to the source text.

The situation is different if you are using non-cryptographic hashes such as CRCs. Perhaps I should edit the original article to help make that distinction, but I'm not really sure it's worth it.


Actually, I was trying to express that the birthday paradox does not apply when you want to find a plaintext that evaluates to a known hash, as you'd need to do for password cracking or forging a signed email, for example [1]. The birthday paradox only gives you the probability that, in a random set of hashes, two are the same: "if I generate n random pieces of plaintext, how big is the chance that two of them generate the same hash?" More precisely, it gives you the probability of picking the same element twice from a finite set of elements when you pick n times.

[1] When generating emails more restrictions apply: The colliding plaintexts have to be at least somewhat coherent and probably should express something the attacker wants to express.

Edit: Forgot a negation in a crucial place. Darn.
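The gap between the two attacks is easy to feel on a deliberately weakened hash. Truncating SHA-1 to 32 bits (purely an illustration, not anything a real system should do), a birthday collision turns up after a few tens of thousands of attempts (order 2^16), whereas a preimage of one specific digest would still cost on the order of 2^32:

```python
import hashlib

def h32(data: bytes) -> bytes:
    """SHA-1 truncated to 32 bits -- weakened purely for demonstration."""
    return hashlib.sha1(data).digest()[:4]

# Birthday search: hash distinct inputs until any two outputs match.
seen = {}
i = 0
while True:
    msg = str(i).encode()
    digest = h32(msg)
    if digest in seen:
        break                          # msg and seen[digest] collide
    seen[digest] = msg
    i += 1

print(f"collision after {i + 1} hashes")   # on the order of 2**16
print(seen[digest], "and", msg, "share a 32-bit digest")
```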


Yes, and that's exactly the point that is addressed in the box I quoted.

You're exactly right, and I believe it's covered. I didn't go into detail about the problems involved in generating a plaintext that hashes to a specific hash, such as you mention; I simply mentioned that the problem I'm talking about is not that one.

So I don't understand the point dsego was making, because I think my reference is relevant.

At this point I'm no longer sure it really matters.


Sorry, I think I was a bit confused. Now I see that it is the same thing. Sorry for wasting your time :(


You haven't wasted my time. You've made me think harder, you've made me reconsider my explanation, and I've helped you become unconfused.

Not a problem, and you're welcome!


I once made a little web page that calculates the probability of a hash collision: http://davidjohnstone.net/pages/hash-collision-probability


My favorite header ever: "Mathematicians aren't normal people". The numbers, they're everywhere!!!


In what sense is the birthday paradox a paradox?


It is not a logical paradox, but it is very counterintuitive. With just 23 samples you'll have a birthday collision with a 50% chance; many people would think that to get 50% you'd need 183 samples.

It is also counterintuitive because it is unavoidable: even with a very large hash range, one can force a collision with a relatively small number of samples.
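Both figures fall out of two one-line formulas (assuming 365 equally likely birthdays and ignoring leap days); a sketch:

```python
import math

def p_any_shared(n, days=365):
    """Probability that some pair among n people shares a birthday."""
    return 1 - math.prod((days - k) / days for k in range(n))

def p_shares_mine(n, days=365):
    """Probability that at least one of n OTHER people shares MY birthday."""
    return 1 - ((days - 1) / days) ** n

print(p_any_shared(23))    # ~0.507 -- just past 50% with 23 people
print(p_shares_mine(22))   # ~0.059 -- nowhere near 50%
print(p_shares_mine(253))  # ~0.500 -- ~253 others needed for 50%
```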


It's sometimes called the Birthday Problem to avoid the paradox label. (See, for example, the wikipedia article.)

The paradoxy bits of the problem are the things that people find confusing. "Only 23?!" or "So if there are 23 people in a room there's a 50% chance that one of them will share a birthday with me?"


Is "So if there are 23 people in a room there's a 50% chance that one of them will share a birthday with me?" correct? If you pick a fixed person (you) that would break the pigeonhole principle, I think.

I would think that saying "So if there are 23 people in the room there is a 50% chance any 2 of those people share the same birthday" is better.

Edit: I misunderstood, you're saying that other people are misunderstanding. My mistake!


You're misquoting or misunderstanding. The comment to which you are replying said:

    The paradoxy bits of the problem are the things that
    people find confusing. "Only 23?!" or "So if there are
    23 people in a room there's a 50% chance that one of
    them will share a birthday with me?"
In that comment he is saying that people misunderstand the question, and assume that it means that once there are 23 people in the room, there's a 50:50 chance someone will share a birthday with them specifically.

And that's exactly the wrong question, as you point out. So when you say:

    Is "So if there are 23 people in a room there's a 50% chance
    that one of them will share a birthday with me?" correct?
No, that's not correct, but it is what people think they hear, and it's that confusion that makes this whole thing sometimes called a paradox.

So let's be clear:

    If you're in a room with 22 other people, the chance
    that one of them shares a birthday specifically with
    you is nowhere near 50%

    However, the chance that among the 23 people in the
    room there is, somewhere, a shared birthday, is indeed
    slightly greater than 50%
And my experience is that it really doesn't matter how carefully you word this, some people simply will not understand it.


You're right, I misunderstood; the comment I replied to was using that as an example of how other people misunderstand the problem. D'oh. Edited!


The same sense in which the Banach–Tarski paradox[1] is: it's a counter-intuitive mathematical truth.

[1]: https://en.wikipedia.org/wiki/Banach%E2%80%93Tarski_paradox


"a seemingly absurd or self-contradictory statement or proposition that when investigated or explained may prove to be well founded or true: in a paradox, he has discovered that stepping back from his job has increased the rewards he gleans from it."


Interesting analysis on the collision angle.

For those interested in another (basic) look at the Birthday Paradox, check out http://alexanderle.com/blog/2011/birthday-paradox.html !


So has anyone yet done the calculation to work out how many files GitHub would need to store to have a non-negligible chance of hash collisions in the git file names? Or perhaps the Linux kernel?


The formula given at the end of the article implies that to have a 5% chance of a collision you would need 3.87e23 objects. (SHA-1 is 160 bits.) Even storing 1 trillion objects gives a collision probability of only about 3e-25.

So this is unlikely to be a concern, ever.
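As a sanity check on those orders of magnitude (a sketch using the same assumptions as the article: uniform 160-bit outputs, and the small-p approximation p ≈ n²/2N):

```python
import math

N = 2 ** 160                               # size of the SHA-1 output space

# Objects needed for a 5% collision chance:
n_5pct = math.sqrt(2 * N * math.log(1 / 0.95))
print(f"{n_5pct:.2e}")                     # ~3.87e+23

# Collision probability for a trillion objects (p ~ n^2 / 2N):
n = 10 ** 12
p = n * (n - 1) / (2 * N)
print(f"{p:.1e}")                          # ~3.4e-25
```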


Hmmm, I wonder how many objects are currently stored in S3? And what rate that's growing at? Or how many documents Google has stored in Bigtable?

But yeah, 160 bits is probably "enough" :-)


It doesn't even need to be filenames; the contents of files can collide as well.


I believe bigiain is referring to the fact that git objects are stored by their SHA-1 digest (so a hash collision would cause a collision in these filenames), not the filenames of the contents. (The objects are stored under the .git/objects directory in the format xx/y...y, where the directory xx is the first two hex digits of the digest and the 38 y's are the rest of the digest.)




I just took an exam on this....



