[PATCH] D33428: [PDB] Hash types up front when merging types instead of using StringMap

Tue May 23 11:19:17 PDT 2017

rnk added a subscriber: chandlerc.
rnk added inline comments.

================
Comment at: llvm/lib/DebugInfo/CodeView/TypeSerializer.cpp:23-24
+  uint64_t Hash;
+  const uint8_t *Data;
+  unsigned Size; // FIXME: Go to uint16_t?
+  unsigned Index;
----------------
zturner wrote:
> `ArrayRef<uint8_t>` or `StringRef`?
This saves 4 bytes. We know type records are always shorter than 0xFF00 bytes, so it's safe to go all the way to uint16_t as the comment suggests.

================
Comment at: llvm/lib/DebugInfo/CodeView/TypeSerializer.cpp:25
+  unsigned Size; // FIXME: Go to uint16_t?
+  unsigned Index;
+};
----------------
zturner wrote:
> `TypeIndex`?
Yeah, we can do that. I kind of like `unsigned` or `uint32_t` because it doesn't depend on llvm::support::ulittle32_t, which uses all this memcpy craziness.

================
Comment at: llvm/lib/DebugInfo/CodeView/TypeSerializer.cpp:53-54
+      return LHS == RHS;
+    if (LHS->Hash != RHS->Hash || LHS->Size != RHS->Size)
+      return false;
+    return ::memcmp(LHS->Data, RHS->Data, LHS->Size) == 0;
----------------
zturner wrote:
> I haven't looked at the implementation of `DenseMap`, but is this check necessary?  I would imagine you could assert that the hashes are different, otherwise why would `DenseMap` be calling this function?
DenseMap takes the low bits of the hash and uses them to index into its table, so there can be hash collisions. So, if the table has 256 entries and two records hash to 0x100 and 0x200, this comparison will save us from doing a full memcmp.

================
Comment at: llvm/lib/DebugInfo/CodeView/TypeSerializer.cpp:59-60
+
+/// Private implementation so that we don't leak our DenseMap instantiations to
+/// users.
+class llvm::codeview::TypeHasher {
----------------
zturner wrote:
> Is this an important consideration?  Seems like it sacrifices readability
I really wanted HashedType and HashedTypePtr to be in anonymous namespaces, and this is what I had to do to accomplish that. They really aren't general purpose types that should float around in public headers.

================
Comment at: llvm/lib/DebugInfo/CodeView/TypeSerializer.cpp:98-99
+  // Compute the hash up front so we can store it in the key.
+  HashedType TempHashedType = {hash_value(Record), Record.data(),
+                               unsigned(Record.size()), NextTypeIndex};
+
----------------
zturner wrote:
> Why not `xxHash64`?  `hash_value` appears to operate on one char at a time.
@chandlerc said we'll probably have to delete xxHash64, so I didn't want to use it. hash_value actually hashes 64 bytes at a time if it can. I'm pretty confident we're hitting that overload, but I could do more to check. It also worried me.

https://reviews.llvm.org/D33428