[PATCH] D87805: [PDB] Merge types in parallel when using ghashing

Reid Kleckner via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 18 15:20:33 PDT 2020


rnk marked an inline comment as done.
rnk added a comment.

In D87805#2282836 <https://reviews.llvm.org/D87805#2282836>, @aganea wrote:

> Thanks a lot for working on this Reid, really neat!

Thanks. :)

> This is a huge change, how do you validate that the resulting PDB is equivalent to that of the previous version (before patch) or non-ghash merging?

Honestly, I don't think I've done enough validation yet. But the resulting PDB should actually be identical: it should contain the same type records, in the same order as before.

---

TODOs:

- More validation
- Look at that stale comment
- Stale ; } thing
- Maybe gather more numbers in other scenarios (concurrent links)



================
Comment at: lld/COFF/DebugTypes.cpp:598
+  // Copy the type into our mutable buffer.
+  assert(ty.length() <= codeview::MaxRecordLength);
+  size_t offset = merged.recs.size();
----------------
aganea wrote:
> Fun fact, microsoft-pdb does (mistakenly?) `<` not `<=`: https://github.com/microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/tpi.cpp#L1130
> However it does reserve `cbRecMax` bytes (L942).
Right, I remember this was a source of differences between clang type records and MSVC type records. It comes up pretty regularly in the LF_FIELDLIST record of a long enum (the LLVM intrinsics enum, for example). With an off-by-one difference, you get cascading differences. Emitting byte-identical types to MSVC isn't really a goal for the compiler, though; the mismatch just results in extra type info.


================
Comment at: lld/COFF/DebugTypes.cpp:609
+    for (size_t i = ty.length(); i < newSize; ++i)
+      newRec[i] = LF_PAD0 + (newSize - i);
+  }
----------------
aganea wrote:
> We're doing this all over the place, it'd be nice to eventually converge all variants (later).
It's true. This patch does reimplement a lot of library code, rather than improving the library, which is unfortunate. I just found it really difficult to restructure the library in a way that would still be high performance.
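
For what it's worth, a minimal sketch of what a shared padding helper might look like if we did converge the variants later; the name `appendPadded` and the raw byte-buffer interface are hypothetical, not existing library API:

  #include <cstdint>
  #include <cstring>
  #include <vector>

  // CodeView pad bytes: LF_PAD0 is 0xF0, and each pad byte encodes the
  // distance remaining to the start of the next record.
  constexpr uint8_t LF_PAD0 = 0xF0;

  // Hypothetical helper: append a record and pad it to a 4-byte boundary.
  static void appendPadded(std::vector<uint8_t> &buf, const uint8_t *rec,
                           size_t len) {
    size_t newSize = (len + 3) & ~size_t(3); // round up to a multiple of 4
    size_t offset = buf.size();
    buf.resize(offset + newSize);
    std::memcpy(buf.data() + offset, rec, len);
    for (size_t i = len; i < newSize; ++i)
      buf[offset + i] = LF_PAD0 + (newSize - i);
  }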


================
Comment at: lld/COFF/DebugTypes.cpp:807
+namespace {
+/// A concurrent hash table for global type hashing. It is based on this paper:
+/// Concurrent Hash Tables: Fast and General(?)!
----------------
aganea wrote:
> I am wondering if this doesn't belong in a new file? Since the code is quite small, we //could// possibly have different implementations (in the future), depending on the dataset. 32-bit, 64-bit or 128-bit with no ghash indirection (if the CPU supports 128-bit CAS).
Maybe it does, but I really wanted `GHashTable::insert` to get internal linkage from the anonymous namespace. If this becomes a template, then it matters less.


================
Comment at: lld/COFF/DebugTypes.cpp:813
+/// 1. concurrent insertion
+/// 2. concurrent lookup
+/// It does not support deletion or rehashing. It uses linear probing.
----------------
Hah, this comment is stale. The table actually doesn't support lookup at all anymore. It used to, before I figured out the trick of saving the insert position from the parallel insert step.
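
Rough illustration of that trick, using a plain map as a stand-in for the lock-free table (the names here are made up for the example, not the actual patch):

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  using GHash = uint64_t; // stand-in for a ghash cell

  // While merging one source, remember the slot each ghash landed in. The
  // later pass that assigns final type indices walks insertedSlots directly,
  // so the table never needs a lookup operation.
  void mergeGHashes(std::unordered_map<GHash, uint32_t> &table,
                    const std::vector<GHash> &ghashes,
                    std::vector<uint32_t> &insertedSlots) {
    insertedSlots.reserve(ghashes.size());
    for (GHash g : ghashes) {
      auto it = table.try_emplace(g, (uint32_t)table.size()).first;
      insertedSlots.push_back(it->second);
    }
  }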


================
Comment at: lld/COFF/DebugTypes.cpp:873
+  GloballyHashedType getGHash() const {
+    return TpiSource::instances[getTpiSrcIdx()]->ghashes[getGHashIdx()];
+  }
----------------
aganea wrote:
> I can't help thinking that this smells like Clang's SourceManager index, where all sources are collated into a single index (not a 2D array).
> If you did that, it would reduce the size of the cell data to 32-bit, iff we limit ourselves to 2^32 input records.
> Am I being too optimistic? ;-)
It's an idea, but it's expensive to decompose a SourceLocation into a file id and file offset. However... we could build one giant array of ghashes indexed by this new combined index. That would mean copying all the .debug$H sections, but it could be worth it: it would save a level of indirection when resolving collisions during ghash insertion, which might be worth a lot. Hm.

Another thing to consider is that MSVC emits many more types than clang. Users mostly use /Zi, which pre-deduplicates them, but if they used /Z7, it would probably break this 32-bit limit on the number of input type records. There are already perf issues with lld /debug:ghash + cl /Z7 (the extra .debug$T pass), so maybe it's not worth worrying about.
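
If we went that route, the flat layout might look roughly like this (hypothetical names, with a stand-in struct instead of codeview::GloballyHashedType):

  #include <cstdint>
  #include <vector>

  struct GHash { uint8_t bytes[8]; }; // stand-in for GloballyHashedType

  // One giant array of ghashes, addressed by a single 32-bit global index.
  // srcBase is a prefix sum over the per-source ghash counts, so the
  // two-level TpiSource::instances[src]->ghashes[idx] lookup becomes one load.
  struct FlatGHashArray {
    std::vector<GHash> allGHashes; // every source's ghashes, concatenated
    std::vector<uint32_t> srcBase; // index of each source's first ghash

    const GHash &get(uint32_t srcIdx, uint32_t ghashIdx) const {
      return allGHashes[srcBase[srcIdx] + ghashIdx];
    }
  };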


================
Comment at: lld/COFF/DebugTypes.cpp:932
+    // Advance the probe. Wrap around to the beginning if we run off the end.
+    ++idx;
+    idx = idx == tableSize ? 0 : idx;
----------------
aganea wrote:
> It'd be interesting to collect statistics on how many collisions you get. And also compare linear (current behavior) vs. quadratic probing.
> 
> One issue I can see is that since the table will be 99.9% full at the end of the insertion pass, there will be lots of collisions toward the end. What about making the table 25% bigger, like DenseHash does?
I don't have collision stats, but I can say that the load factor in the tests I was using goes from 70% (small PDBs) down to 14% (big programs with lots of duplicate types to eliminate). So, the more inputs you run it on, the more memory gets allocated, the fewer collisions there are, and the shorter the probe chains are.
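
For anyone following along, here is a simplified, self-contained sketch of the insertion scheme (linear probing plus compare-and-swap); it is not the actual cell encoding or table implementation from the patch:

  #include <atomic>
  #include <cassert>
  #include <cstdint>
  #include <vector>

  // Insert-only concurrent hash table with linear probing, in the spirit of
  // the cited paper. A zero cell means "empty"; equal cells are deduplicated.
  class SketchGHashTable {
    std::vector<std::atomic<uint64_t>> cells;

  public:
    explicit SketchGHashTable(size_t size) : cells(size) {}

    // Returns the index of the cell that ends up holding newCell, whether
    // this thread won the race or another thread inserted the same value.
    size_t insert(uint64_t newCell, size_t startIdx) {
      assert(newCell != 0 && "zero is reserved for empty cells");
      size_t idx = startIdx % cells.size();
      while (true) {
        uint64_t expected = 0;
        if (cells[idx].compare_exchange_strong(expected, newCell) ||
            expected == newCell)
          return idx;
        // Advance the probe; wrap around when we run off the end.
        ++idx;
        if (idx == cells.size())
          idx = 0;
      }
    }
  };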


================
Comment at: lld/COFF/DebugTypes.cpp:974
+  // Cap the table size so that we can use 32-bit cell indices. Type indices are
+  // also 32-bit, so this is an inherent PDB file format limit anyway.
+  tableSize = std::min(size_t(INT32_MAX), tableSize);
----------------
aganea wrote:
> Looks like the limit in the PDB is 28-bit-wide indices, probably because the PDB limit is 4 GB and because the smallest type record cannot be less than 8 bytes (4-byte header + 1-byte payload + padding).
> 
> https://github.com/microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/dbiimpl.h#L62
> 
> In practice, I never saw more than a few tens of millions of type records in a 2-GB PDB. It is very unlikely that we'll ever reach this 28-bit limit.
> 
> However, in this case you're talking about the cumulative (input) records count, right? That can be pretty large; I've seen 1 billion input type records (when we link our games without Unity/Jumbo files).
> 
> How many input type records do you see on your largest EXE/DLL? (we could add the total input type records count to `/summary`)
Yeah, this is input records. The table size ends up being really large and allocates a lot of memory, but remember, the .debug$T sections were in theory already memory-mapped anyway, and this hash table is smaller than that: 8 bytes per cell vs. a minimum of 8 bytes per record.

I logged the load factor and capacity of the table later, and this is what I got for chrome.dll:
  lld-link: ghash table load factor: 26.25% (size 17307224 / capacity 65942084)
That is 65,942,084 input type records, and essentially 73.75% of them ended up being duplicates.


================
Comment at: lld/COFF/DebugTypes.cpp:1011
+  //   - source 1, type 0...
+  std::vector<GHashCell> entries;
+  for (const GHashCell &cell :
----------------
aganea wrote:
> .reserve?
We don't know how many cells are empty until we iterate over the table. The load factor varies widely depending on the input. I think it's better in this case to dynamically resize.
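
A stripped-down sketch of the compaction being discussed, just to show why the final size isn't known in advance (illustrative code, not the patch):

  #include <cstdint>
  #include <vector>

  // Copy out only the non-empty cells (zero means empty in this sketch).
  // Their count isn't known until the whole table has been walked, so the
  // output vector is left to grow geometrically instead of reserving the
  // full table capacity.
  std::vector<uint64_t> collectNonEmptyCells(const std::vector<uint64_t> &table) {
    std::vector<uint64_t> entries;
    for (uint64_t cell : table)
      if (cell != 0)
        entries.push_back(cell);
    return entries;
  }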


================
Comment at: lld/COFF/PDB.cpp:898
+  // already been merged.
+  if (!config->debugGHashes) {
+    ScopedTimer t(typeMergingTimer);
----------------
aganea wrote:
> After this patch, `/DEBUG:GHASH` could become the new default?
I hope so, but let's do it separately.


================
Comment at: lld/test/COFF/pdb-procid-remapping.test:1
-# RUN: yaml2obj %p/Inputs/pdb1.yaml -o %t1.obj
-# RUN: yaml2obj %p/Inputs/pdb2.yaml -o %t2.obj
+# RUN: yaml2obj < %p/Inputs/pdb1.yaml > %t1.obj
+# RUN: yaml2obj < %p/Inputs/pdb2.yaml > %t2.obj
----------------
aganea wrote:
> I know @MaskRay recently changed all < to cmd-line input and > to -o. Do you need < > here?
Oh, I probably flubbed the conflict resolution. No reason.


================
Comment at: llvm/lib/DebugInfo/PDB/Native/TpiStreamBuilder.cpp:70
 
-  TypeRecords.push_back(Record);
+  TypeRecBuffers.push_back(Record);
+  // FIXME: Require it.
----------------
aganea wrote:
> Since you probably already know how many records/hashes you're inserting, can you `.reserve()` `TypeRecBuffers` and `TypeHashes` in advance?
I can't; this is the old entry-point API, which takes one type record at a time from the caller.
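
Roughly, the difference between the two entry points (simplified names and types, not the real TpiStreamBuilder signatures):

  #include <cstdint>
  #include <vector>

  struct BuilderSketch {
    std::vector<std::vector<uint8_t>> typeRecBuffers;
    std::vector<uint32_t> typeHashes;

    // Old-style entry point: one record per call, so the total count is
    // unknown and there is nothing useful to reserve().
    void addTypeRecord(std::vector<uint8_t> rec, uint32_t hash) {
      typeRecBuffers.push_back(std::move(rec));
      typeHashes.push_back(hash);
    }

    // Bulk entry point: one pre-merged batch per source (see the reserve()
    // discussion below).
    void addTypeRecords(std::vector<uint8_t> batch,
                        const std::vector<uint32_t> &hashes) {
      typeRecBuffers.push_back(std::move(batch));
      typeHashes.insert(typeHashes.end(), hashes.begin(), hashes.end());
    }
  };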


================
Comment at: llvm/lib/DebugInfo/PDB/Native/TpiStreamBuilder.cpp:93
+
+  TypeRecBuffers.push_back(Types);
+  TypeHashes.insert(TypeHashes.end(), Hashes.begin(), Hashes.end());
----------------
aganea wrote:
> Same here (.reserve).
I could reserve here, but that might actually defeat the dynamic resizing. Consider that this loop is O(n^2):
  std::vector<int> vec;
  for (int i = 0; i < n; ++i) {
    // Every iteration grows the capacity by exactly one, forcing a
    // reallocation and a full copy each time.
    vec.reserve(vec.size() + 1);
    vec.push_back(i);
  }

addTypeRecords gets called once per TpiSource, so we would end up reallocating the vector for every type source that contributes types, and maybe not growing the capacity enough to keep the overall cost O(n). IMO it's better to let resizing do its thing here.
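
For contrast, a minimal sketch of the case where reserve() does pay off, i.e. a single up-front reservation when the total count is known:

  #include <vector>

  std::vector<int> fillLinear(int n) {
    std::vector<int> vec;
    vec.reserve(n); // one reservation up front, no repeated reallocation
    for (int i = 0; i < n; ++i)
      vec.push_back(i);
    return vec;
  }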


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D87805/new/

https://reviews.llvm.org/D87805


