[LLVMdev] How to deal with potentially unlimited count/length symbol names?

Wed Jun 19 15:39:34 PDT 2013

> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Sean Silva
> Sent: Wednesday, June 19, 2013 11:45 AM
> To: edA-qa mort-ora-y
> Cc: <llvmdev at cs.uiuc.edu>
> Subject: Re: [LLVMdev] How to deal with potentially unlimited count/length symbol names?
>
> On Wed, Jun 19, 2013 at 1:04 AM, edA-qa mort-ora-y <eda-qa at disemia.com> wrote:
>
> > The problem is that if I derive the name from what the type contains the
> > length of that name is essential unbound. So how does one generate
> > names?  I'm thinking of just using a long hash and hoping I don't get
> > accidental collisions. Surely there must be a better way?
>
> Just a cryptographic hash (e.g. SHA1) to avoid the need to "hope" that there are no collisions.
>
> -- Sean Silva 

Cryptographic hashes don't guarantee you get no accidental collisions;
their goal is to make it super hard to produce a collision _on purpose_.
What you need is an algorithm designed for string inputs, with good
uniformity, and an adequate output size; there are many.

Accidental collisions are essentially the Birthday Problem:
http://en.wikipedia.org/wiki/Birthday_problem
See particularly the end of the "Square approximation" section, which
specifically discusses the application to hashes, relating probability
of collision to bit-width and number of inputs.
For example, I worked out that with a 128-bit hash you need around
2^50 inputs before you get a collision probability of 1-in-a-billion.
(For comparison, a 64-bit hash gets you about 185,000 inputs for the
same collision probability.)

If you want a "name-brand" algorithm, I'd suggest MD5 over SHA-1.
It produces 128-bit output (versus 160-bit) and the fact that it is
"cryptographically broken" is irrelevant to your use-case.

--paulr