[PATCH] D55585: RFC: [LLD][COFF] Parallel GHASH generation at link-time

Reid Kleckner via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Dec 21 15:12:52 PST 2018


rnk added a comment.

Neat! I didn't have a chance to review with a fine-toothed comb, but I like the idea, and here's how I analyze it. We've already talked about how merging a type record without ghash is O(size of type), and with a content hash you can make that O(size of hash), i.e. look at 8 or 20 bytes instead of up to 64K. But computing the hash requires reading the type record data, so we made the compiler do it and emit the hashes into .debug$H.
If one can't do that because MSVC is being used, parallelizing the hash computation means the linker only spends O(size of all types / n cores) wall time producing hashes, and then does O(# types) (not **size** of types) work to deduplicate them.
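
To make that concrete, here's a tiny sketch of the dedup-by-hash idea (GHash, DedupMap, and mergeOne are made-up names, and std::unordered_map just stands in for whatever table we end up with):

  #include <cstdint>
  #include <unordered_map>

  // Hypothetical 8-byte content hash of a full type record, standing in for
  // the SHA1-derived ghash the compiler emits into .debug$H.
  using GHash = uint64_t;

  // Keyed by the hash, each dedup lookup touches 8 bytes instead of
  // re-comparing up to 64K of record bytes.
  std::unordered_map<GHash, uint32_t> DedupMap; // ghash -> merged type index

  uint32_t mergeOne(GHash Hash, uint32_t NextIndex) {
    auto Result = DedupMap.insert({Hash, NextIndex});
    return Result.first->second; // existing index if seen before, else NextIndex
  }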

So it looks like this is the list of things to do:

1. custom ghash densemap: this is probably the most valuable, but probably the most complex and needs the most work.
2. parallelize publics sorting: small enough to send and commit separately ASAP
3. parallelize ghash fetch or computation: actually small in isolation as well; let's do it, I like the code for it (rough sketch below)
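
For item 3, this is roughly the shape I have in mind, using the same Parallel.h primitives as the publics sort further down. ObjFiles, getDebugH, fetchHashes, TypeHashes, and getCVTypes are stand-in names for whatever the patch actually calls them; GloballyHashedType::hashTypes is the existing CodeView helper that hashes a full type stream.

  // Fetch precomputed hashes when the compiler emitted .debug$H, otherwise
  // hash the whole type stream ourselves; either way, one object per task.
  for_each(parallel::par, ObjFiles.begin(), ObjFiles.end(), [](ObjFile *F) {
    ArrayRef<uint8_t> DebugH = getDebugH(F); // contents of .debug$H, if present
    if (!DebugH.empty())
      F->TypeHashes = fetchHashes(DebugH);   // fast path: the compiler did the work
    else
      F->TypeHashes = GloballyHashedType::hashTypes(F->getCVTypes()); // MSVC objects
  });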

Another thing about the custom hash table: we might end up ripping it out if we try to parallelize type merging.



================
Comment at: lld/trunk/COFF/PDB.cpp:1599
+           ComputeH);
+  GlobalHashTimer.stop();
+
----------------
All work up to here (except for dependency loading) can be kicked off from SymbolTable::addFile, which is the earliest point that we know we are including an object file in the link. The way it is written here, it's easy to understand that we are computing or fetching hashes in parallel. That clarity is worth something. Do you think it's worth restructuring things so that addFile is responsible for starting ghash computation and loading PDB and PCH dependencies? I think it might help reduce link time further by overlapping independent work, but it might make things unnecessarily complicated. Maybe it's best to keep things this way for now.
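
If we did restructure it, one way it could look, sketched with llvm/Support/ThreadPool.h (GHashPool, startGHashWork, and loadDependencies are hypothetical names):

  void SymbolTable::addFile(InputFile *File) {
    // ... existing symbol resolution ...
    if (auto *Obj = dyn_cast<ObjFile>(File)) {
      // Kick off hash fetch/computation as soon as we know the object is in
      // the link, so it overlaps with the rest of symbol resolution.
      GHashPool.async([=] { startGHashWork(Obj); });
      loadDependencies(Obj); // PDB/PCH type server inputs discovered here too
    }
  }

  // Later, before type merging starts in PDB.cpp:
  GHashPool.wait(); // every object's hashes are ready before merging begins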


================
Comment at: lld/trunk/COFF/PDB.cpp:1604-1618
+  // Merge Types in PDB/PCH files
+  for (InputFile *File : Depends) {
+    CVIndexMap TIStorage;
+    auto R = mergeTypes(File, TIStorage);
+    if (!R)
+      return R.takeError();
+
----------------
We need to find a way to parallelize type merging... but that is future work.


================
Comment at: lld/trunk/COFF/PDB.cpp:1658
     // Sort the public symbols and add them to the stream.
-    std::sort(Publics.begin(), Publics.end(),
+    sort(parallel::par, Publics.begin(), Publics.end(),
+         [](const PublicSym32 &L, const PublicSym32 &R) {
----------------
Wow, I'm surprised that matters. Feel free to send it separately, it seems like a small win.


================
Comment at: llvm/trunk/include/llvm/DebugInfo/CodeView/GlobalTypeDenseMap.h:77
+          typename BucketT = GloballyHashedInfo>
+class GlobalTypeDenseMap : public DebugEpochBase {
+  template <typename T>
----------------
I haven't read this implementation yet (it's quite long), but broadly I'm in favor of having a custom hash table here. This is the most performance-critical thing LLD does.

This seems like a good separable change, since this map isn't actually used in parallel.
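
For reference, this is the core of what I'd expect from a ghash-keyed table, stripped to the bone (GHashCell, GHashTable, and insertOrFind are made-up names, not what the patch uses; growth and tombstones are omitted):

  #include <cstdint>
  #include <vector>

  struct GHashCell {
    uint64_t Key;       // truncated content hash; 0 means "empty bucket"
    uint32_t TypeIndex; // index of the merged record
  };

  class GHashTable {
    std::vector<GHashCell> Buckets; // value-initialized, so all keys start at 0
    uint64_t Mask;

  public:
    explicit GHashTable(size_t PowerOfTwoSize)
        : Buckets(PowerOfTwoSize), Mask(PowerOfTwoSize - 1) {}

    // Return the index already stored for Key, or insert NewIndex and return
    // it. The key *is* the hash, so probing never rehashes and never touches
    // the full record bytes.
    uint32_t insertOrFind(uint64_t Key, uint32_t NewIndex) {
      for (uint64_t I = Key & Mask;; I = (I + 1) & Mask) {
        if (Buckets[I].Key == Key)
          return Buckets[I].TypeIndex;   // duplicate record, reuse its index
        if (Buckets[I].Key == 0) {
          Buckets[I].Key = Key;          // first occurrence of this record
          Buckets[I].TypeIndex = NewIndex;
          return NewIndex;
        }
      }
    }
  };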


================
Comment at: llvm/trunk/include/llvm/DebugInfo/CodeView/GlobalTypeDenseMap.h:342
+                                                     sys::Memory::MF_WRITE |
+                                                     llvm::sys::Memory::MF_HUGE,
+        EC);
----------------
How much do huge pages matter relative to the custom hash table?


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55585/new/

https://reviews.llvm.org/D55585




