[LLVMdev] We need better hashing
Chris Lattner
clattner at apple.com
Wed Feb 15 12:01:04 PST 2012
On Feb 14, 2012, at 10:47 PM, Talin wrote:
> /// Add a pointer value
> template<typename T>
> void add(const T *PtrVal) {
> addImpl(
> reinterpret_cast<const uint32_t *>(&PtrVal),
> reinterpret_cast<const uint32_t *>(&PtrVal + 1));
> }
>
> This violates TBAA rules and looks pretty dangerous to expose as public API. Is this really needed? Also, addImpl is dereferencing the pointers as uint32_t's, but there is nothing that guarantees that T is a multiple of 4 bytes. The ArrayRef version has the same problem.
>
> So as far as hashing pointer values is concerned, I was just copying the code from FoldingSet. Since a lot of the keys that we're going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to "addPointer" and adding a comment might be in order. Similarly, I can make the ArrayRef version 'addPointers' and have it take an ArrayRef<T*>.
Ah, Jay was right, I misread this code!
> Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.
>
> /// Add a float
> void add(float Data) {
> addImpl(
> reinterpret_cast<const uint32_t *>(&Data),
> reinterpret_cast<const uint32_t *>(&Data + 1));
> }
>
> /// Add a double
> void add(double Data) {
> addImpl(
> reinterpret_cast<const uint32_t *>(&Data),
> reinterpret_cast<const uint32_t *>(&Data + 1));
> }
>
> Similarly, these need to go through a union to avoid TBAA problems.
>
> I'm not sure how that works. Can you give an example?
Just use:
void add(double Data) {
union {
double D; uint64_t I;
};
D = Data;
add(I);
}
>
> void add(StringRef StrVal) {
> addImpl(StrVal.data(), StrVal.size());
> }
>
> I'm contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data? The Bernstein hash function works really well and is much cheaper to compute than Murmur. It is used by HashString (and thus by StringMap).
>
> So, MurmurHash is intended for blocks of arbitrary binary data, which may contain character data, integers, or whatever - it's designed to do such a thorough job of mixing the bits that it really doesn't matter what data types you feed it. You are correct that for purely string data, you'd want to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap as the Bernstein hash and is AFAIK more mathematically sound.)
Ok, so what's the answer? :) We can do different things for ArrayRef<char> and StringRef.
-Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120215/d43c6f03/attachment.html>
More information about the llvm-dev
mailing list