<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, Mar 25, 2014 at 11:02 AM, Duncan P. N. Exon Smith <span dir="ltr"><<a href="mailto:dexonsmith@apple.com" target="_blank">dexonsmith@apple.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">On Mar 25, 2014, at 10:20 AM, Raul Silvera <<a href="mailto:rsilvera@google.com">rsilvera@google.com</a>> wrote:<br>


<br>

> How about an FNV hash? That is very simple to implement, fast, and will be stronger at detecting changes.<br>

<br>

</div>FNV looks great; thanks!  I’ll resubmit with FNV-1a [1].<br>

<br>

<a href="http://isthe.com/chongo/tech/comp/fnv/#FNV-1a" target="_blank">http://isthe.com/chongo/tech/comp/fnv/#FNV-1a</a></blockquote><div><br></div><div>FNV is actually based on the same principles as Bernstein's hash: it relies on multiplication to spread the bits throughout an integer's state, and on xor to mix in each new byte (your original patch used addition, though many variations on Bernstein's use xor).</div>
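<div><br></div><div>For concreteness, here is a minimal sketch of 64-bit FNV-1a as described on the page linked above. The function name fnv1a64 and its signature are made up for illustration and are not taken from the patch; the two constants are the published 64-bit FNV offset basis and prime.</div>

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative 64-bit FNV-1a. The name and signature are hypothetical,
// not code from the patch under review.
inline std::uint64_t fnv1a64(const void *Data, std::size_t Len) {
  const unsigned char *Bytes = static_cast<const unsigned char *>(Data);
  std::uint64_t Hash = 0xcbf29ce484222325ULL;      // FNV-1a 64-bit offset basis
  const std::uint64_t Prime = 0x100000001b3ULL;    // FNV 64-bit prime
  for (std::size_t I = 0; I != Len; ++I) {
    Hash ^= Bytes[I];  // xor mixes the next byte into the low bits
    Hash *= Prime;     // the multiply spreads those bits through the state
  }
  return Hash;
}
```

<div>Swapping the xor for an addition and changing the constants gives the Bernstein family, which is why the two behave so similarly.</div>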

<div><br></div><div>These will all have reasonably frequent collisions in addition to being poorly distributed over the space. You've indicated you don't care about the distribution, but do care about collisions.</div>

<div><br></div><div>Also, you've asserted speed claims without data. Both Bernstein's hash (in its original formulation; your code was actually a strange variant of it that didn't operate on bytes or octets) and FNV necessarily process their input a byte at a time and are thus *quite* slow for inputs of even several hundred bytes.</div>
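<div><br></div><div>To illustrate the byte-at-a-time structure, here is a rough sketch of Bernstein's hash in its commonly cited byte-wise form (seed 5381, multiplier 33). The name bernsteinHash and the UseXor flag are hypothetical, introduced only for this example. Each iteration needs the previous value of H, so the loop cannot consume more than one byte per step.</div>

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative byte-wise Bernstein hash. The name and the UseXor flag are
// hypothetical; this is not the variant from the patch.
inline std::uint32_t bernsteinHash(const void *Data, std::size_t Len,
                                   bool UseXor = false) {
  const unsigned char *Bytes = static_cast<const unsigned char *>(Data);
  std::uint32_t H = 5381;  // conventional starting value
  for (std::size_t I = 0; I != Len; ++I) {
    // Loop-carried dependency: this multiply cannot start until the
    // previous byte has been folded into H.
    std::uint32_t Scaled = H * 33u;
    H = UseXor ? (Scaled ^ Bytes[I]) : (Scaled + Bytes[I]);
  }
  return H;
}
```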

<div><br></div><div>We actually have a variation of CityHash that I implemented which is a bit faster than CityHash (and, for inputs longer than 128 bytes, several times faster than Bernstein's) but has similarly strong collision resistance.<br>

</div><div><br></div><div>But how much data are we talking about? And how frequently are you computing this? MD5 is actually reasonably fast on modern hardware. The reference benchmarks have shown roughly 500 cycles to compute the MD5 of an 8-byte message, and 800 or 900 cycles to compute the MD5 of a 64-byte message. I would expect traversing the AST to build the inputs for this to be significantly slower due to cache misses, but I think benchmarks would help here.</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">> Should the hashing computation be split from PGO into its own utility? Having a general hashing for functions may have other uses; in particular MergeFunc comes to mind.<br>

</div></blockquote><div><br></div><div>We have many, many implementations of hash functions in LLVM already. I am strongly opposed to adding more without specific concrete use cases.</div></div></div></div>