[PATCH] D87805: [PDB] Merge types in parallel when using ghashing

Fri Sep 18 13:49:12 PDT 2020

aganea added a comment.

Thanks a lot for working on this Reid, really neat!

This is a huge change, how do you validate that the resulting PDB is equivalent to that of the previous version (before patch) or non-ghash merging?

================
Comment at: lld/COFF/DebugTypes.cpp:598
+  // Copy the type into our mutable buffer.
+  assert(ty.length() <= codeview::MaxRecordLength);
+  size_t offset = merged.recs.size();
----------------
Fun fact, microsoft-pdb does (mistakenly?) `<` not `<=`: https://github.com/microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/tpi.cpp#L1130
However it does reserve `cbRecMax` bytes (L942).

================
Comment at: lld/COFF/DebugTypes.cpp:609
+    for (size_t i = ty.length(); i < newSize; ++i)
+      newRec[i] = LF_PAD0 + (newSize - i);
+  }
----------------
We're doing this all over the place, it'd be nice to eventually converge all variants (later).

================
Comment at: lld/COFF/DebugTypes.cpp:807
+namespace {
+/// A concurrent hash table for global type hashing. It is based on this paper:
+/// Concurrent Hash Tables: Fast and General(?)!
----------------
I am wondering if this doesn't belong in a new file? Since the code is quite small, we //could// possibly have different implementations (in the future), depending on the dataset. 32-bit, 64-bit or 128-bit with no ghash indirection (if the CPU supports 128-bit CAS).

================
Comment at: lld/COFF/DebugTypes.cpp:873
+  GloballyHashedType getGHash() const {
+    return TpiSource::instances[getTpiSrcIdx()]->ghashes[getGHashIdx()];
+  }
----------------
I can't help thinking that this smells like Clang's SourceManager index, where all sources all collated into a single index (not a 2D array).
If you did that, it would reduce the size of the cell data to 32-bit, iff we limit ourselves to 2^32 input records.
Am I being too optimistic? ;-)

================
Comment at: lld/COFF/DebugTypes.cpp:932
+    // Advance the probe. Wrap around to the beginning if we run off the end.
+    ++idx;
+    idx = idx == tableSize ? 0 : idx;
----------------
It'd be interesting to collect statistics on how many collisions you get. And also compare linear (current behavior) vs. quadratic probing.

One issue I can see is that since the table will be 99.9% full at the end of the insertion pass, there will lots of collisions toward the end. What about making the table 25% bigger, like DenseHash does?

================
Comment at: lld/COFF/DebugTypes.cpp:974
+  // Cap the table size so that we can use 32-bit cell indices. Type indices are
+  // also 32-bit, so this is an inherent PDB file format limit anyway.
+  tableSize = std::min(size_t(INT32_MAX), tableSize);
----------------
Looks like the limit in the PDB is 28-bit wide indices, probably because the PDB limit is 4 GB and because the smallest type record cannot be less that 8 bytes (4-byte header + 1 byte payload + padding).

https://github.com/microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/dbiimpl.h#L62

In practice, I never saw more that a few tens of millions of type records in a 2-GB PDB. It is very unlikely that we'll ever reach this 28-bit limit.

However in this case you're talking about the cumulative (input) records count, right? That can be pretty large, I've seen 1 billion input type records (when we link our games without Unity/Jumbo files).

How many input type records do you see on your largest EXE/DLL? (we could add the total input type records count to `/summary`)

================
Comment at: lld/COFF/DebugTypes.cpp:1011
+  //   - source 1, type 0...
+  std::vector<GHashCell> entries;
+  for (const GHashCell &cell :
----------------
.reserve?

================
Comment at: lld/COFF/DebugTypes.cpp:1093
+}
+;
+  }
----------------
Remove ;

================
Comment at: lld/COFF/PDB.cpp:898
+  // already been merged.
+  if (!config->debugGHashes) {
+    ScopedTimer t(typeMergingTimer);
----------------
After this patch, `/DEBUG:GHASH` could become the new default?

================
Comment at: lld/test/COFF/pdb-procid-remapping.test:1
-# RUN: yaml2obj %p/Inputs/pdb1.yaml -o %t1.obj
-# RUN: yaml2obj %p/Inputs/pdb2.yaml -o %t2.obj
+# RUN: yaml2obj < %p/Inputs/pdb1.yaml > %t1.obj
+# RUN: yaml2obj < %p/Inputs/pdb2.yaml > %t2.obj
----------------
I know @MaskRay recently changed all < to cmd-line input and > to -o. Do you need < > here?

================
Comment at: llvm/lib/DebugInfo/PDB/Native/TpiStreamBuilder.cpp:70

-  TypeRecords.push_back(Record);
+  TypeRecBuffers.push_back(Record);
+  // FIXME: Require it.
----------------
Since you probably already know how many records/hashes you're inserting, can you `.reserve()` `TypeRecBuffers` and `TypeHashes` in advance?

================
Comment at: llvm/lib/DebugInfo/PDB/Native/TpiStreamBuilder.cpp:93
+
+  TypeRecBuffers.push_back(Types);
+  TypeHashes.insert(TypeHashes.end(), Hashes.begin(), Hashes.end());
----------------
Same here (.reserve).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D87805/new/

https://reviews.llvm.org/D87805