[PATCH] InstrProf: Calculate a better function hash

Justin Bogner mail at justinbogner.com
Tue Mar 25 12:11:58 PDT 2014


Chandler Carruth <chandlerc at google.com> writes:
> FNV is actually based on the same principles as Bernstein's -- it relies
> on multiplication to spread the bits throughout an integer's state, and
> xor (or addition, as you originally wrote the patch; many variations on
> Bernstein's use xor, though).
>
> These will all have reasonably frequent collisions, in addition to being
> poorly distributed over the space. You've indicated you don't care about the
> distribution, but do care about collisions.
>
> Also, you've asserted speed claims without data. Both Bernstein's hash (in
> its original formulation; your code was actually a strange variant of it
> that didn't operate on bytes or octets) and FNV are necessarily
> byte-at-a-time, and thus *quite* slow for inputs of even several hundred
> bytes.
>
> We actually have a variation of CityHash that I implemented which is a bit
> faster than CityHash (and for strings of bytes more than 128 bytes, several
> times faster than Bernstein's) but has similarly strong collision resistance.
>
> But how much data are we talking about? And how frequently are you computing
> this? MD5 is actually reasonably fast on modern hardware. The reference
> benchmarks have shown roughly 500 cycles to compute the MD5 of an 8-byte
> message, and 800 or 900 cycles to compute the MD5 of a 64-byte message. I
> would expect traversing the AST to build the inputs for this to be
> significantly slower due to cache misses, but I think benchmarks would help
> here.
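
For reference, 64-bit FNV-1a is just a byte-at-a-time loop along these
lines (standard FNV constants; an illustrative sketch, not code from
either patch):

#include <cstdint>
#include <cstddef>

// 64-bit FNV-1a: xor a byte into the state, then multiply by the FNV
// prime to spread the bits. One xor and one multiply per input byte,
// with a serial dependency between iterations, which is why it slows
// down on inputs of hundreds of bytes.
uint64_t fnv1a(const uint8_t *Data, size_t Len) {
  uint64_t Hash = 14695981039346656037ULL; // FNV offset basis
  for (size_t I = 0; I != Len; ++I) {
    Hash ^= Data[I];
    Hash *= 1099511628211ULL; // FNV prime
  }
  return Hash;
}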

This won't be much data per hash, but it needs to be calculated once per
function being compiled. My gut says any of these hashes will be
sufficient, but it'll help to describe the problem domain.

The hash is based on the structure of the AST for a function, so:

- The input domain is a set of "interesting" statements and decls, which
  Duncan's patch represents in the ASTHash enum. There are currently 16
  distinct values, and I wouldn't expect this to grow much (see the
  sketch after this list).

- The length of the input is directly correlated with the amount of
  control flow in a function.  This will often be quite short (a few if
  statements and loops, say) but may be quite long in the presence of a
  giant state machine implemented by a switch, or some other monstrous
  function. I'd expect this to usually be counted in tens, rather than
  hundreds, and it won't be uncommon for it to be one or two.

- The collisions we're concerned about are ones that are likely to occur
  from a function being changed. When someone modifies the control flow
  of a function, our profile is no longer valid for that function. We
  also check the number of counters, though, so only collisions between
  inputs of the same length matter at all.
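
Here's the sketch I mentioned above of the kind of input I have in
mind. The enum values are invented for illustration; the real ones are
whatever Duncan's patch defines in ASTHash:

#include <cstdint>
#include <vector>

// Invented stand-in for the ASTHash enum: one small value per
// "interesting" statement kind, so 16 distinct values fit in a byte.
enum class ASTHash : uint8_t {
  IfStmt,
  WhileStmt,
  ForStmt,
  SwitchStmt,
  CaseStmt
  // ... roughly 16 values in total
};

// A function like
//
//   void f(int N) {
//     if (N) { /* ... */ }
//     for (int I = 0; I < N; ++I) { /* ... */ }
//   }
//
// flattens to a very short input stream:
std::vector<ASTHash> Input = {ASTHash::IfStmt, ASTHash::ForStmt};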

This sounds to me like it would be pretty similar to hashing short
strings, which Bernstein's hash is generally considered reasonably good
at, IIRC, but I'm no expert on the subject.
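
Concretely, I'm picturing the classic Bernstein loop over that stream,
something like this (traditional djb2 constants; again a sketch, not
the exact code I'd propose):

#include <cstdint>
#include <vector>

// Bernstein's hash over the flattened statement-kind stream. With
// inputs usually measured in tens of bytes, the serial byte-at-a-time
// loop costs very little per function.
uint64_t hashASTStream(const std::vector<uint8_t> &Input) {
  uint64_t Hash = 5381;
  for (uint8_t V : Input)
    Hash = Hash * 33 + V; // many variants xor instead of adding
  return Hash;
}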


