[PATCH] D33488: [ELF] - Optimization for populating stringpool when building .gdb_index.

Thu May 25 12:58:13 PDT 2017

George Rimar <grimar at accesssoftek.com> writes:

>>The hash function being used is
>>
>>static uint32_t hash(StringRef Str) {
>>  uint32_t R = 0;
>>  for (uint8_t C : Str)
>>    R = R * 67 + tolower(C) - 113;
>>  return R;
>>}
>>
>>What happens we have the string "Foo" and "foo"? Do we have a guarantee
>>that that never happens?
>>
>>Cheers,
>>Rafael
>
> It is not a problem here. See llvm::StringTableBuilder implements both add() via CachedHashStringRef:
>   size_t add(CachedHashStringRef S);
>   size_t add(StringRef S) { return add(CachedHashStringRef(S)); }
>
> So collision of hashes can happen in both implementations actually I believe. 
> But that is not harmfull, because internal implementation uses
> DenseMap<CachedHashStringRef, size_t> StringIndexMap to store strings. 
>
> And 2 CachedHashStringRef are equal only when both their hash and values are equal:
>   static bool isEqual(const CachedHashStringRef &LHS,
>                       const CachedHashStringRef &RHS) {
>     return LHS.hash() == RHS.hash() &&
>            DenseMapInfo<StringRef>::isEqual(LHS.val(), RHS.val());
>   }
>
> So I mean that following code would produce 2 different entries anyways:
>   llvm::StringTableBuilder Bar(StringTableBuilder::ELF, 1);
>   Bar.add({StringRef("foo"), 2});
>   Bar.add({StringRef("Foo"), 2});

Good point. But now that I think of it, any idea why the hash is defined
with a tolower? Should we be canonicalazing the strings in the set? If
so the hash could just assert that the characters are lowercase.

Trying to benchmark this myself I hit:
https://bugs.llvm.org/show_bug.cgi?id=33173.

Could you take a look?

Ignoring the error the difference I see when linking clang is in the
noise.

Cheers,
Rafael