[PATCH] D42740: Implement a case-folding version of DJB hash

Adrian Prantl via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 21 12:42:49 PST 2018


aprantl added inline comments.


================
Comment at: llvm/trunk/lib/Support/DJB.cpp:27
 uint32_t llvm::djbHash(StringRef Buffer, uint32_t H) {
   for (char C : Buffer.bytes())
+    H = djbHashChar(C, H);
----------------
aprantl wrote:
> labath wrote:
> > The known-value tests have already paid off in that they detected that the result of the hash function depends on the signedness of `char` (which did seem a bit dodgy to me when I looked at it, but I assumed we knew what we were doing).
> > 
> > For the new function the fix is obvious (as unsigned char, as dwarf5 specifies), but it's not clear to me how to handle the existing hash function, whose broken hashes are already in the wild.
> > 
> > @aprantl, @JDevlieghere: How should we handle this? This is only used on Apple platforms, and I expect most Apple code to be compiled on x86 hosts (right?), where `char` is signed, so explicitly setting this to signed for the Apple hash would probably have the least impact while maintaining consistency for the future. OTOH, lldb already implements this hash function via unsigned char, and the discrepancy has gone unnoticed, so maybe it does not matter? (This effect is only visible for char values >= 128, so US-ASCII is fine, but any UTF-8 character would trigger it.)
> > This is only used on apple platforms, and I expect most of apple code to be compiled on x86 hosts (right?)
> 
> I think that's the key, yeah. Ever since the Intel transition it is safe to assume that pretty much all code that runs on Apple platforms was compiled on some form of an x86 machine, so we should fix the signedness to hardcode the x86 behavior and call it signedDjbHash and say that it's used for the apple accelerator tables.
> 
> But now it gets weird. I just looked at the LLDB source code and it seems to implement the DJB function with an unsigned char!?
Well, since it only affects characters outside the 7-bit ASCII range, I think it would be fine to just fix the LLVM implementation to match the ones in LLDB and DWARF. It looks like this would never have worked for non-ASCII strings in the past, and fixing the implementation will make it work from here on.


Repository:
  rL LLVM

https://reviews.llvm.org/D42740




