[LLVMbugs] [Bug 10971] New: Due to bad hash functions, unordered_set can easily give horrible performance

bugzilla-daemon at llvm.org bugzilla-daemon at llvm.org
Tue Sep 20 18:15:04 PDT 2011


http://llvm.org/bugs/show_bug.cgi?id=10971

           Summary: Due to bad hash functions, unordered_set can easily
                    give horrible performance
           Product: libc++
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: All Bugs
        AssignedTo: hhinnant at apple.com
        ReportedBy: oneill+llvmbugs at cs.hmc.edu
                CC: llvmbugs at cs.uiuc.edu


Created an attachment (id=7311)
 --> (http://llvm.org/bugs/attachment.cgi?id=7311)
Program to show bad hashing

libc++ implements std::hash<int> and friends as the identity function; this
design choice, coupled with the implentation choices for unordered_set (i.e.,
not to compensate for this deficiency) result in hash tables that can easily
offer poor performance (and are trivial to “game”).

Here's some output from my (slightly contrived not not implausible) test
program, badhash.cpp:

unix% badhash | head -25
42 -> bucket 42
1073 -> bucket 42
2104 -> bucket 42
3135 -> bucket 42
4166 -> bucket 42
5197 -> bucket 42
6228 -> bucket 42
7259 -> bucket 42
8290 -> bucket 42
9321 -> bucket 42
10352 -> bucket 42
11383 -> bucket 42
12414 -> bucket 42
13445 -> bucket 42
14476 -> bucket 42
15507 -> bucket 42
16538 -> bucket 42
17569 -> bucket 42
18600 -> bucket 42
19631 -> bucket 42
20662 -> bucket 42
21693 -> bucket 42
22724 -> bucket 42
23755 -> bucket 42
24786 -> bucket 42

Here's the results using a better hash function that is still just a few cycles
to compute:

unix% badhash | head -25
42 -> bucket 994
1073 -> bucket 969
2104 -> bucket 477
3135 -> bucket 975
4166 -> bucket 189
5197 -> bucket 458
6228 -> bucket 195
7259 -> bucket 972
8290 -> bucket 709
9321 -> bucket 978
10352 -> bucket 192
11383 -> bucket 461
12414 -> bucket 341
13445 -> bucket 173
14476 -> bucket 855
15507 -> bucket 179
16538 -> bucket 338
17569 -> bucket 693
18600 -> bucket 344
19631 -> bucket 176
20662 -> bucket 858
21693 -> bucket 690
22724 -> bucket 570
23755 -> bucket 839
24786 -> bucket 53

Saving a couple of cycles in computing the hash function is a false economy.
Instead, please compute a *good* hash. Good hashes do a good job of masking the
structure in their input, and thus appear to give randomly distributed values.

Also note that the hash functions for types larger than size_t (e.g., the one
for long long [on 32 bit]) don't look like they are good either. They don't
even satisty the requirements of the C++11 spec which says that it should be
statistically unlikely that two distinct keys will hash to the same hash value.
Structure in the input can lead to high collision probability.

A web search will find plenty of resouces on hashing, hash tables, good hash
functions, etc.  But I could contribute a patch if no one else wants to.

-- 
Configure bugmail: http://llvm.org/bugs/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.



More information about the llvm-bugs mailing list