On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <span dir="ltr"><<a href="mailto:clattner@apple.com">clattner@apple.com</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word"><div><div class="im"><div>On Feb 13, 2012, at 2:00 AM, Talin wrote:</div><blockquote type="cite"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Just out of curiosity, why not MurmurHash3 ? This page seems to<br>
suggest that #2 has some flaw, and #3 is better all round:<br>
<br>
<a href="https://sites.google.com/site/murmurhash/" target="_blank">https://sites.google.com/site/murmurhash/</a><br>
<br></blockquote><div>The main reason is because there's no incremental version of 3. </div></div></blockquote><div><br></div></div><div>I think that that is a great reason.</div><div class="im"><br><blockquote type="cite">
<div class="gmail_quote"><div>LLVM's needs, on the other hand, are fairly modest. I'm guessing that most DenseMaps contain less than a few thousand entries. Even a bad hash function wouldn't hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.</div>
</div></blockquote><div><br></div></div>Indeed. It can't be hard to be better than our existing adhoc stuff :). That said, please do not change the hash function used by StringMap without do careful performance analysis of the clang preprocessor. The identifier uniquing is a very hot path in "clang -E" performance.</div>
<div class="im"><div><br></div></div></div></blockquote><div>I'm not planning on touching StringMap. </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word">
<div class="im"><div><blockquote type="cite"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br>Would it be possible to use CityHash instead for strings?<br>
<br>
<a href="http://code.google.com/p/cityhash/" target="_blank">http://code.google.com/p/cityhash/</a><br>
<br></blockquote><div>OK by me. My intention is that "Hashing.h" should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I'm working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.</div>
<div><br></div><div>It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.</div>
</div></blockquote><br></div></div><div>I think that this is a great way to stage it. I personally care more about the interface than the implementation. Someone can tweak and tune it after enough code is using the API to get reasonable performance numbers.</div>
<div><br></div><div><br></div><div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(181,36,34)"><span style="color:#1738f5">#include </span>"llvm/ADT/ArrayRef.h"</div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(181,36,34)"><span style="color:#1738f5">#include </span>"llvm/ADT/StringRef.h"</div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(181,36,34)"><span style="color:#1738f5">#include </span>"llvm/Support/Compiler.h"</div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(181,36,34)"><span style="color:#1738f5">#include </span>"llvm/Support/DataTypes.h"</div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(181,36,34)"><span style="color:#1738f5">#include </span>"llvm/Support/PointerLikeTypeTraits.h"</div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(181,36,34)"><span style="color:#1738f5">#include </span>"llvm/Support/type_traits.h"</div>
<div><br></div><div>Do you actually need all of these includes? PointerLikeTypeTraits doesn't seem necessary. Is type_traits?</div><div><br></div></div></div></blockquote><div>Ooops, this was a cut & paste error from FoldingSet.cpp.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div></div><div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(23,56,245)">
<span style="color:rgb(0,0,0)"> <span style="color:#1738f5">enum</span> {</span></div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo"> BufferSize = 32,</div>
</div><div><br></div><div>BufferSize is dead.</div><div><br></div><div><br></div><div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(0,143,41)">
<span style="color:#000000"> </span>/// Add a pointer value</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(23,56,245)"><span style="color:#000000"> </span>template<span style="color:#000000"><</span>typename<span style="color:#000000"> T></span></div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo"> <span style="color:#1738f5">void</span> add(<span style="color:#1738f5">const</span> T *PtrVal) {</div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo"> addImpl(</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">
<span style="color:#1738f5">reinterpret_cast</span><<span style="color:#1738f5">const</span> uint32_t *>(&PtrVal),</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">
<span style="color:#1738f5">reinterpret_cast</span><<span style="color:#1738f5">const</span> uint32_t *>(&PtrVal + 1));</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">
}</div></div><div><br></div><div>This violates TBAA rules and looks pretty dangerous to expose as public API. Is this really needed? Also, addImpl is dereferencing the pointers as uint32_t's, but there is nothing that guarantees that T is a multiple of 4 bytes. The ArrayRef version has the same problem.</div>
<div><br></div></div></div></blockquote><div>So as far as hashing pointer values is concerned, I was just copying the code from FoldingSet. Since a lot of the keys that we're going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to "addPointer" and adding a comment might be in order. Similarly, I can make the ArrayRef version 'addPointers' and have it take an ArrayRef<T*>.</div>
<div><br></div><div>Now, as to the 4 bytes issue, I think I can solve that with some clever template methods.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word"><div><div></div><div>Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.</div><div><br></div><div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(0,143,41)">
<span style="color:#000000"> </span>/// Add a float</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo"> <span style="color:#1738f5">void</span> add(<span style="color:#1738f5">float</span> Data) {</div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo"> addImpl(</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">
<span style="color:#1738f5">reinterpret_cast</span><<span style="color:#1738f5">const</span> uint32_t *>(&Data),</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">
<span style="color:#1738f5">reinterpret_cast</span><<span style="color:#1738f5">const</span> uint32_t *>(&Data + 1));</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">
}</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;min-height:13px"><br></div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(0,143,41)">
<span style="color:#000000"> </span>/// Add a double</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo"> <span style="color:#1738f5">void</span> add(<span style="color:#1738f5">double</span> Data) {</div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo"> addImpl(</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">
<span style="color:#1738f5">reinterpret_cast</span><<span style="color:#1738f5">const</span> uint32_t *>(&Data),</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">
<span style="color:#1738f5">reinterpret_cast</span><<span style="color:#1738f5">const</span> uint32_t *>(&Data + 1));</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">
}</div></div><div><br></div><div>Similarly, these need to go through a union to avoid TBAA problems.</div><div><br></div></div></div></blockquote><div>I'm not sure how that works. Can you give an example?</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word"><div><div><br></div><div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo"> <span style="color:#1738f5">void</span> add(StringRef StrVal) {</div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo"> addImpl(StrVal.data(), StrVal.size());</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">
}</div></div><div><br></div><div>I'm contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data? The Bernstein hash function works really well and is much cheaper to compute than Murmur. It is used by HashString (and thus by StringMap).</div>
<div><br></div></div></div></blockquote><div>So, MurmurHash is intended for blocks of arbitrary binary data, which may contain character data, integers, or whatever - it's designed to do such a thorough job of mixing the bits that it really doesn't matter what data types you feed it. You are correct that for purely string data, you'd want to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap as the Bernstein hash and is AFAIK more mathematically sound.)</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div></div><div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo;color:rgb(0,143,41)">
<span style="color:#000000"> </span>// Add a possibly unaligned sequence of bytes.</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo"> <span style="color:#1738f5">void</span> addImpl(<span style="color:#1738f5">const</span> <span style="color:#1738f5">char</span> *I, size_t Length) {</div>
</div><div><br></div><div>This should probably be moved out of line to avoid code bloat.</div></div></div></blockquote><div><br></div><div>OK </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word"><div><div><br></div><div>Overall, the design of the class is making sense to me! Thanks for tackling this!</div><div><br></div><font color="#888888"><div>-Chris</div></font></div><div><br>
</div><br></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>-- Talin<br>