[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

Tue Jan 10 07:03:36 PST 2017

On 10 January 2017 at 13:20, Francois Fayard via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
> > On Jan 10, 2017, at 9:36 AM, Bruce Hoult <bruce at hoult.org> wrote:
> >
> > Both are not very sophisticated.
> > You should also look at the different MurmurHash versions, and
> descendants such as CityHash.
>
> I did a few benchmark this morning, trying to tweak the hashing for
> pointers (as many people seem to use pointers as keys). The hash function
> in LLVM is quite simple, but it gives very good results. The benchmark I
> did was compiling a C++ file with clang. I did 10 tests for every version
> of clang/llvm compiled with the hash function, and here are the best
> timings:
>
> - Original LLVM: hash(k) = (k >> 4) ^ (k >> 9)
>   3.40s
> - Stupid hash: hash(k) = k
>   3.55s
> - A bit less stupid: hash(k) = k >> unused_bits (defined by sizeof(T) =
> 2^unused_bits)
>   3.47s
> - Murmurhash3
>   3.47s
>
> So different hashing functions make a difference. For pointers, it seems
> that the one used by LLVM is quite good. It gives the best performance
> here. It is quite surprising that Murmurhash3 does not behave better than
> the “A bit less stupid” hash. My guess is that the LLVM hash function is
> the best because it is tailored to the distribution of pointers returned by
> malloc. As I said, the (k << 4) seems to be related to the fact that
> pointers are 16-bytes aligned on most OS. I still don’t understand the 9.
>

It is not clear what the test-case is (what source, what compiler options).
My suspicion is that your differences are in the noise, and most of the
time is spent doing other things than hashing. Did you profile the a run,
and check how much of the total time is spent in the hash-function [you may
need to tweak the code a bit to not inline the actual hash function]. Also
publishing the RANGE for each set of tests, since the variation between
"best and worst" may be more important than the overall best time.

Counting the number of collisions or some such would probably also help.

[I have no useful answer as to why `k >> 9` is used. `k >> 4` because
pointers [allocated from the heap] are often aligned to 16 or greater does
make sense. I suspect 9 is just some arbitrary value that gives some more
entropy in the low part of the value [and since the value is later used in
a modulo buckets or some such, having more variation in the lower bits does
help - upper bits are discarded, but also often the same for large number
of allocations!] - whether someone spent minutes, hours or days picking
exactly 9 I have no idea - nor whether it still is the "right" value, or
there should be some other number there...

--
Mats

>
> I’ll do some benchmarks for hashing integers tomorrow. This is the hashing
> that really looks suspicious to me but it does not seem to be used that
> much in LLVM.
>
> François Fayard
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170110/f252e875/attachment.html>