[LLVMdev] We need better hashing
Chris Lattner
clattner at apple.com
Tue Feb 14 02:44:36 PST 2012
On Feb 13, 2012, at 2:00 AM, Talin wrote:
> Just out of curiosity, why not MurmurHash3 ? This page seems to
> suggest that #2 has some flaw, and #3 is better all round:
>
> https://sites.google.com/site/murmurhash/
>
> The main reason is because there's no incremental version of 3.
I think that that is a great reason.
> LLVM's needs, on the other hand, are fairly modest. I'm guessing that most DenseMaps contain less than a few thousand entries. Even a bad hash function wouldn't hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.
Indeed. It can't be hard to be better than our existing adhoc stuff :). That said, please do not change the hash function used by StringMap without do careful performance analysis of the clang preprocessor. The identifier uniquing is a very hot path in "clang -E" performance.
>
> Would it be possible to use CityHash instead for strings?
>
> http://code.google.com/p/cityhash/
>
> OK by me. My intention is that "Hashing.h" should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I'm working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.
>
> It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.
I think that this is a great way to stage it. I personally care more about the interface than the implementation. Someone can tweak and tune it after enough code is using the API to get reasonable performance numbers.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/DataTypes.h"
#include "llvm/Support/PointerLikeTypeTraits.h"
#include "llvm/Support/type_traits.h"
Do you actually need all of these includes? PointerLikeTypeTraits doesn't seem necessary. Is type_traits?
enum {
BufferSize = 32,
BufferSize is dead.
/// Add a pointer value
template<typename T>
void add(const T *PtrVal) {
addImpl(
reinterpret_cast<const uint32_t *>(&PtrVal),
reinterpret_cast<const uint32_t *>(&PtrVal + 1));
}
This violates TBAA rules and looks pretty dangerous to expose as public API. Is this really needed? Also, addImpl is dereferencing the pointers as uint32_t's, but there is nothing that guarantees that T is a multiple of 4 bytes. The ArrayRef version has the same problem.
Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.
/// Add a float
void add(float Data) {
addImpl(
reinterpret_cast<const uint32_t *>(&Data),
reinterpret_cast<const uint32_t *>(&Data + 1));
}
/// Add a double
void add(double Data) {
addImpl(
reinterpret_cast<const uint32_t *>(&Data),
reinterpret_cast<const uint32_t *>(&Data + 1));
}
Similarly, these need to go through a union to avoid TBAA problems.
void add(StringRef StrVal) {
addImpl(StrVal.data(), StrVal.size());
}
I'm contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data? The Bernstein hash function works really well and is much cheaper to compute than Murmur. It is used by HashString (and thus by StringMap).
// Add a possibly unaligned sequence of bytes.
void addImpl(const char *I, size_t Length) {
This should probably be moved out of line to avoid code bloat.
Overall, the design of the class is making sense to me! Thanks for tackling this!
-Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120214/eca27ace/attachment.html>
More information about the llvm-dev
mailing list