<div dir="ltr"><div class="gmail_default" style="font-family:verdana,sans-serif">While I agree that FNV is only slightly better than Bernstein's hash in the big picture, for this specific use case it is dramatically better, given the weaknesses of Bernstein's hash that you've described.</div>
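<div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">To make the difference concrete, here is a minimal sketch of both hashes in Python. It is my own illustration, not code from the patch. It assumes the additive form of Bernstein's hash (h = h * 33 + byte), the commonly used seed 5381, and the 32-bit FNV-1a variant; the function names are hypothetical. I am also reading the "0x21 and 0x100" funnel from the quoted message as the byte strings {0x21} and {0x01, 0x00} hashed with a seed of 0, which is an assumption on my part.</div>

```python
MASK32 = 0xFFFFFFFF  # keep all arithmetic in 32 bits


def bernstein_hash(data: bytes, seed: int = 5381) -> int:
    """Additive Bernstein hash: multiply by 33, then add the next byte."""
    h = seed
    for byte in data:
        h = (h * 33 + byte) & MASK32
    return h


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: xor in the next byte, then multiply by the FNV prime."""
    h = 0x811C9DC5  # FNV offset basis (2166136261)
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & MASK32  # FNV prime (16777619)
    return h


# Two inputs of the same length that differ in both bytes.
a = bytes([0x00, 0x21])
b = bytes([0x01, 0x00])

# 33 * 0x00 + 0x21 == 33 * 0x01 + 0x00, so Bernstein collides for any seed.
print(bernstein_hash(a) == bernstein_hash(b))  # True
print(fnv1a_32(a) == fnv1a_32(b))              # False

# The funnel from the quoted message, assuming a seed of 0.
print(bernstein_hash(b"\x21", seed=0), bernstein_hash(b"\x01\x00", seed=0))  # 33 33
```

<div class="gmail_default" style="font-family:verdana,sans-serif">The Bernstein collision follows directly from the multiplier: raising one byte by 1 is cancelled by lowering the next byte by 33. This says nothing about speed, only about how the two hashes react to small changes in the input.</div>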

<div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">In terms of performance, FNV is probably better than MD5 for this use case, where we need to compute a large number of hashes over small inputs. MD5 processes its input in 64-byte blocks, so a short message would need a significant amount of padding, all of which has to be processed.</div>
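<div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">As a rough sketch of the padding cost: MD5 appends one 0x80 byte, then zero bytes, then an 8-byte length field, so that the total is a multiple of 64 bytes. The helper below (a hypothetical name, my own illustration) computes the padded size. The sample lengths assume one byte per hashed AST element, which is an assumption and not something the patch specifies.</div>

```python
BLOCK_SIZE = 64   # MD5 processes input in 64-byte blocks
LENGTH_FIELD = 8  # the 64-bit message length appended at the end


def md5_padded_length(message_len: int) -> int:
    """Bytes MD5 actually processes for a message of message_len bytes."""
    # One mandatory 0x80 byte plus the length field, rounded up to a full block.
    needed = message_len + 1 + LENGTH_FIELD
    return ((needed + BLOCK_SIZE - 1) // BLOCK_SIZE) * BLOCK_SIZE


for n in (2, 16, 55, 56, 64):
    padded = md5_padded_length(n)
    print(f"{n:3d} input bytes -> {padded:3d} bytes processed "
          f"({padded // BLOCK_SIZE} block(s))")
```

<div class="gmail_default" style="font-family:verdana,sans-serif">Under these assumptions, an input of one or two bytes still costs a full 64-byte block, and a 64-byte input costs two blocks, which is consistent with the cycle counts quoted below for 8-byte and 64-byte messages.</div>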

<div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Of course, some actual data would be great.</div><div class="gmail_default" style="font-family:verdana,sans-serif">

<br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Mar 25, 2014 at 12:31 PM, Chandler Carruth <span dir="ltr"><<a href="mailto:chandlerc@google.com" target="_blank">chandlerc@google.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On Tue, Mar 25, 2014 at 12:11 PM, Justin Bogner <span dir="ltr"><<a href="mailto:mail@justinbogner.com" target="_blank">mail@justinbogner.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>Chandler Carruth <<a href="mailto:chandlerc@google.com" target="_blank">chandlerc@google.com</a>> writes:<br>


> FNV is actually based on the same principles as Bernstein's -- it is relying<br>

> on multiplication to spread the bits throughout an integer's state, and xor (or<br>

> addition as you originally wrote the patch, many variations on Bernstein's use<br>

> xor though).<br>

><br>

> These all will have reasonably frequent collisions in addition to being poorly<br>

> distributed over the space. You've indicated you don't care about the<br>

> distribution, but do care about collisions.<br>

><br>

> Also, you've asserted speed claims without data. Both Bernstein's hash (in its<br>

> original formulation, your code was actually a strange variant of it that<br>

> didn't operate on bytes or octets) and FNV are necessarily byte-at-a-time<br>

> and thus *quite* slow for inputs of even several hundred bytes.<br>

><br>

> We actually have a variation of CityHash that I implemented which is a bit<br>

> faster than CityHash (and for strings of bytes more than 128 bytes, several<br>

> times faster than Bernstein's) but has similarly strong collision resistance.<br>

><br>

> But how much data are we talking about? And how frequently are you computing<br>

> this? MD5 is actually reasonably fast on modern hardware. The reference<br>

> benchmarks have shown roughly 500 cycles to compute the MD5 of an 8-byte<br>

> message, and 800 or 900 cycles to compute the MD5 of a 64-byte message. I<br>

> would expect traversing the AST to build the inputs for this to be<br>

> significantly slower due to cache misses, but I think benchmarks would help<br>

> here.<br>

<br>

</div>This won't be much data per hash, but it needs to be calculated once per<br>

function being compiled. My gut says any of these hashes will be<br>

sufficient, but it'll help to describe the problem domain.<br></blockquote><div><br></div></div></div><div>I've read all the patch and am fairly familiar with the design...</div><div class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


The hash is based on the structure of the AST for a function, so:<br>

<br>

- The input domain is a set of "interesting" statements and decls, which<br>

  Duncan's patch represents in the ASTHash enum. There are currently 16<br>

  distinct values, and I wouldn't expect this to grow much.<br>

<br>

- The length of the input is directly correlated with the amount of<br>

  control flow in a function.  This will often be quite short (a few if<br>

  statements and loops, say) but may be quite long in the presence of a<br>

  giant state machine implemented by a switch, or some other monstrous<br>

  function. I'd expect this to usually be counted in tens, rather than<br>

  hundreds, and it won't be uncommon for it to be one or two.<br></blockquote><div><br></div></div><div>Yep.</div><div class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

- The collisions we're concerned about are ones that are likely to occur<br>

  from a function being changed. When someone modifies the control flow<br>

  of a function, our profile is no longer valid for that function. We<br>

  also check the number of counters though, so only collisions for the<br>

  same length of input matter at all.<br></blockquote><div><br></div></div><div>Yes, so things like single bit flips in the message.</div><div class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

This sounds to me like it would be pretty similar to hashing short<br>

strings, which Bernstein is generally considered to be reasonably good<br>

at IIRC, but I'm no expert on the subject.</blockquote></div></div><br>Bernstein's hash happens to be effective on short strings of printable *ASCII* characters. It is not highly resilient to single bit flips in all bits of the input. Notably, as pointed out by Bob Jenkins and others, there is a funnel where 0x21 and 0x100 have the same hash (33). Bernstein's hash became popular in no small part because it has unusual properties with *ASCII* text: lowercase alpha strings of 6 characters or fewer have zero collisions in 32 bits. I don't see any way that it is a useful hashing algorithm for something like this.</div>


<div class="gmail_extra"><br></div><div class="gmail_extra">FNV is slightly better in that it works for any byte stream rather than being carefully chosen to work with ASCII characters. This is primarily because it uses a prime multiplier. However, it *requires* fast integer multiplies (and thus is often quite slow on non-Intel chips) and has a tendency to scale *very* poorly to large input messages due to the byte-stream nature of the beast.</div>


</div>

<br>_______________________________________________<br>

cfe-commits mailing list<br>

<a href="mailto:cfe-commits@cs.uiuc.edu">cfe-commits@cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits</a><br>

<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div dir="ltr"><div><font size="4" face="arial black, sans-serif" style="background-color:rgb(0,0,0)" color="#b45f06"> Raúl E. Silvera </font></div><div><br>

</div></div>

</div>