[PATCH] D87805: [PDB] Merge types in parallel when using ghashing

Alexandre Ganea via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Sep 19 07:54:27 PDT 2020


aganea added a comment.

In D87805#2283132 <https://reviews.llvm.org/D87805#2283132>, @rnk wrote:

> Truly, I think I haven't done enough validation. But the resulting PDB should actually be identical: it should contain the same type records, in the same order as before.

I suppose a simple way to validate would be to temporarily retain the old algorithm, run both side by side, and assert if the results differ.
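
Something along these lines (a rough sketch only; `mergeTypesWithGHash` and `mergeTypesSequential` are hypothetical stand-ins for the two code paths, not functions in the patch):

  #ifndef NDEBUG
  // Run both merging algorithms and compare the merged type record streams.
  std::vector<llvm::codeview::CVType> byGHash = mergeTypesWithGHash(sources);
  std::vector<llvm::codeview::CVType> bySerial = mergeTypesSequential(sources);
  assert(byGHash.size() == bySerial.size() && "merged record count differs");
  for (size_t i = 0; i < byGHash.size(); ++i)
    assert(byGHash[i].data() == bySerial[i].data() && "merged record differs");
  #endif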



================
Comment at: lld/COFF/DebugTypes.cpp:873
+  GloballyHashedType getGHash() const {
+    return TpiSource::instances[getTpiSrcIdx()]->ghashes[getGHashIdx()];
+  }
----------------
rnk wrote:
> aganea wrote:
> > I can't help thinking that this smells like Clang's SourceManager index, where all sources are collated into a single index (not a 2D array).
> > If you did that, it would reduce the size of the cell data to 32 bits, iff we limit ourselves to 2^32 input records.
> > Am I being too optimistic? ;-)
> It's an idea, but it's expensive to decompose a SourceLocation into a file id and file offset. However... we could build one giant array of ghashes indexed by this new combined index. This would copy all .debug$H sections, but it could be worth it. This would save a level of indirection during ghash insertion collision resolution, which might be worth a lot. Hm.
> 
> Another thing to consider is that MSVC emits many more types than clang. Users mostly use /Zi, which pre-deduplicates them, but if they use /Z7, it would probably exceed this 32-bit limit on the number of input type records. There are already perf issues with lld /debug:ghash + cl /Z7 (extra .debug$T pass), so maybe it's not worth worrying about.
> This would copy all .debug$H sections
I am wondering if we couldn't combine all ghashes into a contiguous virtual range of memory (both the pre-existing .debug$H sections and the locally computed, "owned" ones). The `MapViewOfFile2/3` APIs allow changing the destination `BaseAddress`. There would be some dangling data around the .debug$H mappings, because mapping only works on 64 KB ranges, but that may still be better than copying around a few GB worth of .debug$H sections (which also implies duplicating the memory storage for ghashes, because of the file mapping, unless we `munmap` after each copy).
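
For illustration, the flattened single-index scheme could look roughly like this (a sketch; the names are mine, not from the patch):

  // One flat ghash array plus per-source base offsets: a cell then only
  // needs a single 32-bit combined index (assuming < 2^32 input records).
  std::vector<llvm::codeview::GloballyHashedType> allGHashes; // all sources, concatenated
  std::vector<uint32_t> srcBase; // srcBase[i] = offset of source i into allGHashes

  uint32_t combinedIdx(uint32_t tpiSrcIdx, uint32_t ghashIdx) {
    return srcBase[tpiSrcIdx] + ghashIdx;
  }
  llvm::codeview::GloballyHashedType getGHash(uint32_t idx) {
    return allGHashes[idx]; // a single load; no per-source indirection
  }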

> There are already perf issues with lld /debug:ghash + cl /Z7 (extra .debug$T pass)
Like I mentioned in D55585, once ghash computation is parallelized, it is faster on a 6-core machine to use `/DEBUG:GHASH` than the default `/DEBUG`. Were you thinking of anything else when you say "there are already perf issues"? We've been using MSVC cl+LLD+D55585 for a long time, and the timings of LLD are close to those of Clang+LLD.


================
Comment at: lld/COFF/DebugTypes.cpp:974
+  // Cap the table size so that we can use 32-bit cell indices. Type indices are
+  // also 32-bit, so this is an inherent PDB file format limit anyway.
+  tableSize = std::min(size_t(INT32_MAX), tableSize);
----------------
rnk wrote:
> aganea wrote:
> > Looks like the limit in the PDB is 28-bit wide indices, probably because the PDB limit is 4 GB and because the smallest type record cannot be less than 8 bytes (4-byte header + 1-byte payload + padding).
> > 
> > https://github.com/microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/dbiimpl.h#L62
> > 
> > In practice, I never saw more than a few tens of millions of type records in a 2-GB PDB. It is very unlikely that we'll ever reach this 28-bit limit.
> > 
> > However, in this case you're talking about the cumulative (input) record count, right? That can be pretty large; I've seen 1 billion input type records (when we link our games without Unity/Jumbo files).
> > 
> > How many input type records do you see on your largest EXE/DLL? (we could add the total input type records count to `/summary`)
> Yeah, this is input records. This table size ends up being really large and this allocates a lot of memory, but remember, the .debug$T sections were in theory already memory mapped anyway, and this hash table is smaller than that: 8 bytes per cell vs. a minimum of 8 bytes per record.
> 
> I logged the load factor and capacity of the table later, and this is what I got for chrome.dll:
>   lld-link: ghash table load factor: 26.25% (size 17307224 / capacity 65942084)
> That is 65,942,084 input type records, and essentially 73.75% of them ended up being duplicates.
It is clearer now, thanks. I am wondering if LLD could report an optimal table size to the user and let them provide that value on the command line. But then it is a trade-off between an increased number of collisions (which imply an extra ghash indirection) and a smaller table size that reduces cache misses. Just thinking out loud.
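
For scale, some back-of-the-envelope math using the chrome.dll numbers above (my arithmetic, not from the patch):

  // Rough memory cost of the ghash table for chrome.dll (illustrative only).
  size_t capacity = 65942084;            // cells allocated
  size_t used = 17307224;                // unique records inserted
  size_t tableBytes = capacity * 8;      // 8-byte cells => ~528 MB
  double load = double(used) / capacity; // ~26.25%
  // Sizing for ~50% load would need ~34.6M cells (~277 MB), traded against
  // more probing/collisions on insertion.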


================
Comment at: lld/COFF/DebugTypes.cpp:1011
+  //   - source 1, type 0...
+  std::vector<GHashCell> entries;
+  for (const GHashCell &cell :
----------------
rnk wrote:
> aganea wrote:
> > .reserve?
> We don't know how many cells are empty until we iterate over the table. The load factor varies widely depending on the input. I think it's better in this case to dynamically resize.
Never mind, I misunderstood how the algorithm worked. It is clear now. For some reason I thought you were constructing equivalence classes.
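
For the archive, my current understanding of this step, as a simplified sketch (names approximated from the patch context above, not verbatim patch code; `cells` stands for the table's underlying cell array):

  // Walk the hash table once, keep the surviving (non-empty) cells, then
  // sort so unique types come out in deterministic (source, index) order.
  std::vector<GHashCell> entries;
  for (const GHashCell &cell : cells)
    if (!cell.isEmpty())
      entries.push_back(cell);
  llvm::sort(entries); // cells order by (tpiSrcIdx, ghashIdx)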


================
Comment at: lld/COFF/PDB.cpp:898
+  // already been merged.
+  if (!config->debugGHashes) {
+    ScopedTimer t(typeMergingTimer);
----------------
rnk wrote:
> aganea wrote:
> > After this patch, `/DEBUG:GHASH` could become the new default?
> I hope so, but let's do it separately.
Sure.


================
Comment at: llvm/lib/DebugInfo/PDB/Native/TpiStreamBuilder.cpp:93
+
+  TypeRecBuffers.push_back(Types);
+  TypeHashes.insert(TypeHashes.end(), Hashes.begin(), Hashes.end());
----------------
rnk wrote:
> aganea wrote:
> > Same here (.reserve).
> I could reserve here, but that might actually defeat the dynamic resizing. Consider that this loop is O(n^2):
>   std::vector<int> vec;
>   for (int i = 0; i < n; ++i) {
>     vec.reserve(vec.size() + 1);
>     vec.push_back(i);
>   }
> 
> addTypeRecords gets called for each TpiSource, so we would end up reallocating the vector for every type source that contributes types, and maybe not increasing the size enough to remain O(n). IMO it's better to let resizing do its thing here.
Yes, you're right.
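
For the record, if reservation were ever desired here, growing geometrically would preserve the amortized O(n) behavior. A sketch (not a request for this patch), reusing the `TypeHashes`/`Hashes` names from the context above:

  // Reserve with geometric growth so repeated appends stay amortized O(n).
  if (TypeHashes.capacity() < TypeHashes.size() + Hashes.size())
    TypeHashes.reserve(std::max(2 * TypeHashes.capacity(),
                                TypeHashes.size() + Hashes.size()));
  TypeHashes.insert(TypeHashes.end(), Hashes.begin(), Hashes.end());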


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D87805/new/

https://reviews.llvm.org/D87805


