[LLVMbugs] [Bug 10971] New: Due to bad hash functions, unordered_set can easily give horrible performance
bugzilla-daemon at llvm.org
bugzilla-daemon at llvm.org
Tue Sep 20 18:15:04 PDT 2011
http://llvm.org/bugs/show_bug.cgi?id=10971
Summary: Due to bad hash functions, unordered_set can easily
give horrible performance
Product: libc++
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P
Component: All Bugs
AssignedTo: hhinnant at apple.com
ReportedBy: oneill+llvmbugs at cs.hmc.edu
CC: llvmbugs at cs.uiuc.edu
Created an attachment (id=7311)
--> (http://llvm.org/bugs/attachment.cgi?id=7311)
Program to show bad hashing
libc++ implements std::hash<int> and friends as the identity function; this
design choice, coupled with the implentation choices for unordered_set (i.e.,
not to compensate for this deficiency) result in hash tables that can easily
offer poor performance (and are trivial to “game”).
Here's some output from my (slightly contrived not not implausible) test
program, badhash.cpp:
unix% badhash | head -25
42 -> bucket 42
1073 -> bucket 42
2104 -> bucket 42
3135 -> bucket 42
4166 -> bucket 42
5197 -> bucket 42
6228 -> bucket 42
7259 -> bucket 42
8290 -> bucket 42
9321 -> bucket 42
10352 -> bucket 42
11383 -> bucket 42
12414 -> bucket 42
13445 -> bucket 42
14476 -> bucket 42
15507 -> bucket 42
16538 -> bucket 42
17569 -> bucket 42
18600 -> bucket 42
19631 -> bucket 42
20662 -> bucket 42
21693 -> bucket 42
22724 -> bucket 42
23755 -> bucket 42
24786 -> bucket 42
Here's the results using a better hash function that is still just a few cycles
to compute:
unix% badhash | head -25
42 -> bucket 994
1073 -> bucket 969
2104 -> bucket 477
3135 -> bucket 975
4166 -> bucket 189
5197 -> bucket 458
6228 -> bucket 195
7259 -> bucket 972
8290 -> bucket 709
9321 -> bucket 978
10352 -> bucket 192
11383 -> bucket 461
12414 -> bucket 341
13445 -> bucket 173
14476 -> bucket 855
15507 -> bucket 179
16538 -> bucket 338
17569 -> bucket 693
18600 -> bucket 344
19631 -> bucket 176
20662 -> bucket 858
21693 -> bucket 690
22724 -> bucket 570
23755 -> bucket 839
24786 -> bucket 53
Saving a couple of cycles in computing the hash function is a false economy.
Instead, please compute a *good* hash. Good hashes do a good job of masking the
structure in their input, and thus appear to give randomly distributed values.
Also note that the hash functions for types larger than size_t (e.g., the one
for long long [on 32 bit]) don't look like they are good either. They don't
even satisty the requirements of the C++11 spec which says that it should be
statistically unlikely that two distinct keys will hash to the same hash value.
Structure in the input can lead to high collision probability.
A web search will find plenty of resouces on hashing, hash tables, good hash
functions, etc. But I could contribute a patch if no one else wants to.
--
Configure bugmail: http://llvm.org/bugs/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
More information about the llvm-bugs
mailing list