<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On 10 January 2017 at 13:20, Francois Fayard via llvm-dev <span dir="ltr"><<a target="_blank" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><span class="gmail-"><br>
> On Jan 10, 2017, at 9:36 AM, Bruce Hoult <<a href="mailto:bruce@hoult.org">bruce@hoult.org</a>> wrote:<br>
><br>
</span><span class="gmail-">> Both are not very sophisticated.<br>
> You should also look at the different MurmurHash versions, and descendants such as CityHash.<br>
<br>
</span>I did a few benchmark this morning, trying to tweak the hashing for pointers (as many people seem to use pointers as keys). The hash function in LLVM is quite simple, but it gives very good results. The benchmark I did was compiling a C++ file with clang. I did 10 tests for every version of clang/llvm compiled with the hash function, and here are the best timings:<br>
<br>
- Original LLVM: hash(k) = (k >> 4) ^ (k >> 9)<br>
3.40s<br>
- Stupid hash: hash(k) = k<br>
3.55s<br>
- A bit less stupid: hash(k) = k >> unused_bits (defined by sizeof(T) = 2^unused_bits)<br>
3.47s<br>
- Murmurhash3<br>
3.47s<br>
<br>
So different hashing functions make a difference. For pointers, it seems that the one used by LLVM is quite good. It gives the best performance here. It is quite surprising that Murmurhash3 does not behave better than the “A bit less stupid” hash. My guess is that the LLVM hash function is the best because it is tailored to the distribution of pointers returned by malloc. As I said, the (k << 4) seems to be related to the fact that pointers are 16-bytes aligned on most OS. I still don’t understand the 9.<br></blockquote><div><br></div><div>It is not clear what the test-case is (what source, what compiler options). My suspicion is that your differences are in the noise, and most of the time is spent doing other things than hashing. Did you profile the a run, and check how much of the total time is spent in the hash-function [you may need to tweak the code a bit to not inline the actual hash function]. Also publishing the RANGE for each set of tests, since the variation between "best and worst" may be more important than the overall best time.<br><br></div><div>Counting the number of collisions or some such would probably also help.<br><br></div><div>[I have no useful answer as to why `k >> 9` is used. `k >> 4` because pointers [allocated from the heap] are often aligned to 16 or greater does make sense. I suspect 9 is just some arbitrary value that gives some more entropy in the low part of the value [and since the value is later used in a modulo buckets or some such, having more variation in the lower bits does help - upper bits are discarded, but also often the same for large number of allocations!] - whether someone spent minutes, hours or days picking exactly 9 I have no idea - nor whether it still is the "right" value, or there should be some other number there...<br></div><div><br>--<br></div><div>Mats <br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">
<br>
I’ll do some benchmarks for hashing integers tomorrow. This is the hashing that really looks suspicious to me but it does not seem to be used that much in LLVM.<br>
<div class="gmail-HOEnZb"><div class="gmail-h5"><br>
François Fayard<br>
______________________________<wbr>_________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>
<a target="_blank" rel="noreferrer" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>
</div></div></blockquote></div><br></div></div>