[LLVMdev] a different hash for APInts

Jeffrey Yasskin jyasskin at google.com
Thu Mar 12 10:20:54 PDT 2009


On Thu, Mar 12, 2009 at 3:14 AM, Cédric Venet <cedric.venet at laposte.net> wrote:
> Stuart Hastings a écrit :
>>
>> {
>>     {0x00000000}, {0x33800000}, {0x34000000}, {0x34400000},
>>     {0x34800000}, {0x34a00000}, {0x34c00000}, {0x34e00000},
>>     {0x35000000}, {0x35100000}, {0x35200000}, {0x35300000},
>>     {0x35400000}, {0x35500000}, {0x35600000}, {0x35700000},
>> ...
>>     {0xfffd8000}, {0xfffda000}, {0xfffdc000}, {0xfffde000},
>>     {0xfffe0000}, {0xfffe2000}, {0xfffe4000}, {0xfffe6000},
>>     {0xfffe8000}, {0xfffea000}, {0xfffec000}, {0xfffee000},
>>     {0xffff0000}, {0xffff2000}, {0xffff4000}, {0xffff6000},
>>     {0xffff8000}, {0xffffa000}, {0xffffc000}, {0xffffe000},
>> };
>>
>> The module has over 16K lines, so that's about 64K entries (1<<16).
>> There is no executable code in this testcase.
>>
>> The cause seems to be the APInt::getHashValue() function (near line
>> 626 of .../lib/Support/APInt.cpp).  Some investigation using the
>> debugger suggests that every value was hashing into about three
>> buckets, and the DenseMap code was looping excessively over the
>> extremely long chains in those three buckets.
>>
>
>  From what I can see the old hash function can be good, if the number of
> bucket is not a power of two. The problem is that dense map use 64
> buckets initially and grow but doubling this number. So what happen here
> (I think, I only did a fast reading of the code) is the hash for a value
> i is about h=(i+32), and the bucket selection is done by h%64 or
> h%(1<<n) so only the low bits are taken into acount.

The bucket selection is actually done by
  BucketT *ThisBucket = BucketsPtr + (BucketNo & (NumBuckets-1));
in DenseMap::LookupBucketFor
(http://llvm.org/doxygen/DenseMap_8h-source.html#l00358)

There are really two kinds of hash tables: prime-sized ones and
power-of-two-sized ones. Alternately, we might call them
"simple-hash+MOD" and "good-hash+AND". The gcc hash_map
implementations go with simple-hash+MOD. They compile in a list of
primes to use for the hashtable size, which makes them tolerant of
users who define very simple hash functions like (size_t)my_ptr.
DenseMap appears to have gone the other way, which lets it use AND but
requires better hash functions. Why bother with the better hash
functions? http://www.burtleburtle.net/bob/c/lookup3.c claims that
"mod is sooo slow!" I haven't verified that myself, but since
lookup3.c was written in 2006 it may still be true. If so, that would
indicate that good-hash+AND maps are likely to be faster, but to be
sure we'd need a benchmark.

I would recommend that anyone interested in this read
http://www.burtleburtle.net/bob/hash/index.html.

Jeffrey




More information about the llvm-dev mailing list