[PATCH] InstrProf: Calculate a better function hash

Duncan P. N. Exon Smith dexonsmith at apple.com
Tue Mar 25 13:59:37 PDT 2014


Thanks for this review Chandler.  Lots of good information in here.

On Mar 25, 2014, at 12:31 PM, Chandler Carruth <chandlerc at google.com> wrote:

> 
> On Tue, Mar 25, 2014 at 12:11 PM, Justin Bogner <mail at justinbogner.com> wrote:
> Chandler Carruth <chandlerc at google.com> writes:
> > FNV is actually based on the same principles as Bernstein's -- it is relying
> > on multiplication to spread the bits throughout an integer's state, and xor (or
> > addition as you originally wrote the patch, many variations on Bernstein's use
> > xor though).

Yeah, I just stole the one in HashString, which uses addition.
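For reference, here's a sketch of the two Bernstein variants under discussion; both multiply by 33 to spread bits through the accumulator and differ only in how the next byte is combined.  The function names are mine, not from the patch:

```cpp
#include <cstdint>
#include <string_view>

// Addition variant, as in llvm::HashString.
static uint32_t bernstein_add(std::string_view S) {
  uint32_t H = 5381;
  for (unsigned char C : S)
    H = H * 33 + C; // combine next byte with addition
  return H;
}

// Xor variant, common in other formulations of Bernstein's hash.
static uint32_t bernstein_xor(std::string_view S) {
  uint32_t H = 5381;
  for (unsigned char C : S)
    H = (H * 33) ^ C; // combine next byte with xor
  return H;
}
```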

> > These all will have reasonably frequent collisions in addition to being poorly
> > distributed over the space. You've indicated you don't care about the
> > distribution, but do care about collisions.
> >
> > Also, you've asserted speed claims without data. Both Bernstein's hash (in its
> > original formulation, your code was actually a strange variant of it that
> > didn't operate on bytes or octets

The domain of the enum is within a byte (actually, within 4 bits); still strange?

> ) and FNV are necessarily a byte-at-a-time
> > and thus *quite* slow for inputs of even several hundred bytes.

When I looked at MD5.cpp earlier, I misread it terribly.  I imagined multiplies in each STEP, and misunderstood update() to be called for each character.  Given how badly I misread it, I didn’t need data :).

The byte-at-a-time access seems irrelevant if the math is trivial (i.e., Bernstein), since we're getting the data one byte at a time at each relevant AST node visit anyway, and we have other work to do there.  Even if there are several hundred relevant AST nodes, it's hard to imagine that Bernstein will be an IRGen bottleneck.

> > We actually have a variation of CityHash that I implemented which is a bit
> > faster than CityHash (and for strings of bytes more than 128 bytes, several
> > times faster than Bernstein's) but has similarly strong collision resistance.

Since this needs to be stable, I’d need to lift this out of hashing::detail and into a namespace like hashing::based_on_city_2012.  If Hashing.h changes its hash, we’d need to keep hashing::based_on_city_2012 around permanently.  I’d also have to expose a way to send in a stable seed, rather than relying on get_execution_seed().

Thoughts?  If that sounds good, what should the namespace be called?

> > But how much data are we talking about? And how frequently are you computing
> > this? MD5 is actually reasonably fast on modern hardware. The reference
> > benchmarks have shown roughly 500 cycles to compute the MD5 of an 8-byte
> > message, and 800 or 900 cycles to compute the MD5 of a 64-byte message. I
> > would expect traversing the AST to build the inputs for this to be
> > significantly slower due to cache misses, but I think benchmarks would help
> > here.
> 
> This won't be much data per hash, but it needs to be calculated once per
> function being compiled. My gut says any of these hashes will be
> sufficient, but it'll help to describe the problem domain.
> 
> I've read all the patch and am fairly familiar with the design...
>  
> The hash is based on the structure of the AST for a function, so:
> 
> - The input domain is a set of "interesting" statements and decls, which
>   Duncan's patch represents in the ASTHash enum. There are currently 16
>   distinct values, and I wouldn't expect this to grow much.
> 
> - The length of the input is directly correlated with the amount of
>   control flow in a function.  This will often be quite short (a few if
>   statements and loops, say) but may be quite long in the presence of a
>   giant state machine implemented by a switch, or some other monstrous
>   function. I'd expect this to usually be counted in tens, rather than
>   hundreds, and it won't be uncommon for it to be one or two.
> 
> Yep.
>  
> 
> - The collisions we're concerned about are ones that are likely to occur
>   from a function being changed. When someone modifies the control flow
>   of a function, our profile is no longer valid for that function. We
>   also check the number of counters though, so only collisions for the
>   same length of input matter at all.
> 
> Yes, so things like single bit flips in the message.
>  
> 
> This sounds to me like it would be pretty similar to hashing short
> strings, which bernstein is generally considered to be reasonably good
> at IIRC, but I'm no expert on the subject.
> 
> Bernstein's hash happens to be effective on short strings of *ascii* printable characters. It is not highly resilient to single bit flips in all bits of the input. Notably, as pointed out by Bob Jenkins and others, there is a funnel where 0x21 and 0x100 have the same hash (33). Bernstein's hash became popular in no small part because it has unusual properties with *ascii* text: lowercase alpha strings of 6 characters or smaller have zero collisions in 32-bits. I don't see any way that it is a useful hashing algorithm for something like this.

Given that, why not add 0x20 (or even 0x61) and be done with it?  The domain is small enough to be trivially mapped to ASCII.
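Concretely, the idea is to shift each 4-bit token into the lowercase range before feeding it to Bernstein's hash, so the input stream looks like a short lowercase-ascii string.  A minimal sketch, with hypothetical stand-ins for a few of the ASTHash enumerators (the real patch has 16 values, all fitting in 4 bits):

```cpp
#include <cstdint>
#include <initializer_list>

// Hypothetical enumerator names; only the 4-bit domain matters here.
enum class ASTHash : uint8_t { EndOfScope = 0, IfStmt = 1, WhileStmt = 2, SwitchStmt = 15 };

// Map each 4-bit token into 'a'..'p' (0x61..0x70) so Bernstein's hash sees
// the lowercase-ascii domain it is known to distribute well on.
static uint32_t hashAST(std::initializer_list<ASTHash> Tokens) {
  uint32_t H = 5381;
  for (ASTHash T : Tokens)
    H = H * 33 + (static_cast<uint8_t>(T) + 0x61);
  return H;
}
```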

> FNV is slightly better in that it works for any byte stream rather than being carefully chosen to work with ascii characters. This is primarily because it uses a prime multiplier. However, it *requires* fast integer multiplies (and thus is often quite slow on non-Intel chips) and has a tendency to scale *very* poorly to large input messages due to the byte-stream nature of the beast.

Forget FNV.  It looked better than MD5 when I thought MD5 had oodles of multiplies.

So I see three viable options, and I’m happy to defer to your judgement.

  1. Bernstein with the input mapped to ASCII.

  2. Grab the upper 64 bits of MD5.

  3. Lift the modified CityHash implementation out of hashing::detail and adjust the API to accept a stable seed.

Poorly informed as I am, I still gravitate to option (1).  It’s quick, easy to maintain, and apparently effective.  If you still have a technical concern, let me know whether you prefer option (2) or option (3).
