[PATCH] D108850: [LLD] Remove global state in lldCommon

Alexandre Ganea via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 31 13:23:10 PDT 2021


aganea added a comment.

Thanks for taking a look!

In D108850#2970244 <https://reviews.llvm.org/D108850#2970244>, @amccarth wrote:

> This is a larger change than I anticipated.

It's a first basis for discussion. Please let me know if I should take this in a different direction.

> I was surprised by the need to have the pointer be thread local (and then ensure that the thread local pointers for new threads are set up, too).  If I'm understanding correctly, the idea is that a single program could have n threads doing n links at the same time.  Is that right?  I'm not arguing against that.  I'm just trying to make sure I understand why I was surprised.

Yes that is correct.

The dream is to get to a point where `LLVM_PARALLEL_LINK_JOBS` wouldn't be needed (when linking in-process). Currently each LLD process spawns its own thread pool, and the processes don't talk to each other. That increases memory usage and (surprisingly) can sometimes starve the CPU, for example when doing ThinLTO. I feel we could have better control over resources if we had a single "build/compile/link" application.
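
For illustration, here's a rough sketch of that idea (made-up names, not the actual patch API): each in-process link attaches its own state to a thread_local pointer, so several threads can run independent links concurrently inside one process.

  // Minimal sketch, assuming a hypothetical LinkContext in place of
  // lldCommon's real globals struct.
  #include <cassert>
  #include <string>
  #include <thread>
  #include <vector>
  
  struct LinkContext {
    std::string outputPath; // per-link state lives here, not in true globals
  };
  
  thread_local LinkContext *currentCtx = nullptr;
  
  void runOneLink(const std::string &out) {
    LinkContext ctx{out};
    currentCtx = &ctx;        // attach this link's state to this thread
    // ... a link driver would run here, reaching its state via currentCtx ...
    assert(currentCtx->outputPath == out);
    currentCtx = nullptr;     // detach before the context goes away
  }
  
  int main() {
    std::vector<std::thread> links;
    for (int i = 0; i < 4; ++i)
      links.emplace_back(runOneLink, "out" + std::to_string(i) + ".exe");
    for (auto &t : links)
      t.join();
  }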

> It also introduces more indirection than I expected.

Indeed, more memory loads are now required to reach the globals. In the vast majority of cases this shouldn't matter, but in tight loops we will need to pass the globals pointer down explicitly.
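
As a toy illustration of the difference (hypothetical names, nothing from the patch):

  // Reading through a thread_local pointer costs a TLS lookup plus a load on
  // each access; tight loops can hoist that into a plain pointer up front.
  #include <cstdint>
  
  struct Globals { uint64_t counter = 0; };
  thread_local Globals *tlsGlobals = nullptr;
  
  // Convenient, but re-reads the thread_local unless the optimizer hoists it.
  void bumpViaTls(unsigned n) {
    for (unsigned i = 0; i < n; ++i)
      tlsGlobals->counter += i;
  }
  
  // Hot-loop shape: resolve the pointer once, then use it directly.
  void bumpHoisted(unsigned n) {
    Globals *g = tlsGlobals;
    for (unsigned i = 0; i < n; ++i)
      g->counter += i;
  }
  
  int main() {
    Globals g;
    tlsGlobals = &g;
    bumpViaTls(1000);
    bumpHoisted(1000);
  }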

> I believe Rui (and others) favored globals in order to go as fast as possible.  Are there lld benchmarks to give us an idea of the cost of this?  If there's pushback from the performance die-hards, it would be nice to have an actual measurement or two to show (I hope) that the cost for this flexibility is low.

There seems to be only a tiny departure from the baseline after this change, but it's hard to tell since LLD is mostly I/O-bound.
The data set is ~5 GB of input .OBJs, an approx. 2.5 GB .PDB, and a 475 MB executable.
Sometimes just running the same command line again swaps the outcome: "before" comes out ahead of "after" or vice versa:

  Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz (6 cores)
  >hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
  Benchmark #1: before\lld-link.exe @link.rsp                                                                                     
    Time (mean ± σ):     13.740 s ±  0.578 s    [User: 0.0 ms, System: 53.5 ms]                                                         
    Range (min … max):   13.391 s … 15.321 s    10 runs
  
  Benchmark #2: after\lld-link.exe @link.rsp
    Time (mean ± σ):     13.737 s ±  0.198 s    [User: 4.4 ms, System: 55.8 ms]                                                         
    Range (min … max):   13.566 s … 14.262 s    10 runs
  
  Summary
    'after\lld-link.exe @link.rsp' ran
      1.00 ± 0.04 times faster than 'before\lld-link.exe @link.rsp'
  
  
  AMD Ryzen 9 5950X 16-Core Processor @ 3.40 GHz
  >hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
  Benchmark #1: before\lld-link.exe @link.rsp                                                                      
    Time (mean ± σ):     10.178 s ±  0.091 s    [User: 3.0 ms, System: 5.2 ms]                                           
    Range (min … max):    9.978 s … 10.303 s    10 runs
  
  Benchmark #2: after\lld-link.exe @link.rsp
    Time (mean ± σ):     10.243 s ±  0.088 s    [User: 4.5 ms, System: 5.2 ms]                                           
    Range (min … max):   10.115 s … 10.357 s    10 runs
  
  Summary
    'before\lld-link.exe @link.rsp' ran
      1.01 ± 0.01 times faster than 'after\lld-link.exe @link.rsp'
  
  
  AMD Ryzen Threadripper PRO 3975WX 32-Cores @ 3.50 GHz
  >hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
  Benchmark #1: before\lld-link.exe @link.rsp                                                                          
    Time (mean ± σ):     15.451 s ±  0.545 s    [User: 1.4 ms, System: 8.6 ms]                                               
    Range (min … max):   15.024 s … 16.849 s    10 runs
  
  Benchmark #2: after\lld-link.exe @link.rsp
    Time (mean ± σ):     15.243 s ±  0.087 s    [User: 0.0 ms, System: 8.6 ms]                                               
    Range (min … max):   15.109 s … 15.392 s    10 runs
  
  Summary
    'after\lld-link.exe @link.rsp' ran
      1.01 ± 0.04 times faster than 'before\lld-link.exe @link.rsp'
  
  
  2x Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz (48 cores, dual CPU)
  >hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
  Benchmark #1: before\lld-link.exe @link.rsp                                                                                                   
    Time (mean ± σ):     17.433 s ±  0.852 s    [User: 0.0 ms, System: 6.5 ms]                                                                        
    Range (min … max):   15.202 s … 18.267 s    10 runs
  
  Benchmark #2: after\lld-link.exe @link.rsp
    Time (mean ± σ):     17.661 s ±  0.876 s    [User: 1.4 ms, System: 2.6 ms]                                                                        
    Range (min … max):   15.304 s … 18.442 s    10 runs
  
  Summary
    'before\lld-link.exe @link.rsp' ran
      1.01 ± 0.07 times faster than 'after\lld-link.exe @link.rsp'

Profiling shows a difference in functions that seem to read from memory-mapped files and incur hard page faults:
F18767391: lldCommon TLS.PNG <https://reviews.llvm.org/F18767391>
First line (the lambda) is the first loop in `GSIHashStreamBuilder::finalizeBuckets`.
The second line is the `JamCRC` calculation in `createSectionContrib`.

I thought some cache lines could be evicted by the read of the `CommonGlobals` TLS. However, two random runs of the "before" build exhibit the same difference, so it really seems we're only seeing statistical fluctuation in the tests above. I'd be happy to run tests on anything that you judge to be a representative data set.



================
Comment at: lld/COFF/Chunks.cpp:12
+#include "SymbolTable.h"
 #include "Symbols.h"
 #include "Writer.h"
----------------
amccarth wrote:
> I guess we're using ASCIIbetical rather than alphabetical ordering. :-)
;-)


================
Comment at: lld/MachO/Driver.cpp:149
         if (fs::is_directory(buffer)) {
-          paths.push_back(saver.save(buffer.str()));
+          paths.push_back(saver(&cg).save(buffer.str()));
           found = true;
----------------
int3 wrote:
> why not just `saver()`?
Changed. Please see my other comment at lld/MachO/DriverUtils.cpp, L263.


================
Comment at: lld/MachO/DriverUtils.cpp:263
       if (exists)
-        return saver.save(location.str());
+        return saver(&cg).save(location.str());
     }
----------------
amccarth wrote:
> In most cases, it's just `saver()`.  Is there something special about these instances where you explicitly pass a pointer to the common globals?
Only the fact that it's in a loop, and I wanted to make the intent explicit rather than rely on the optimizer to (maybe) avoid repeated calls to `commonGlobals()` inside the loop. But I switched to just `saver()`, since other operations in this function would dominate anyway.

Ideally we should be passing state down the call stack rather than relying on TLS. This could be done for time-critical functions.
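
Something along these lines, as a rough sketch only (hypothetical names; `Context` stands in for the per-link globals and `Context::save()` for the StringSaver-style `saver()`):

  #include <deque>
  #include <string>
  
  struct Context {
    std::deque<std::string> storage; // deque keeps references stable on push_back
    const std::string &save(const std::string &s) {
      storage.push_back(s);
      return storage.back();
    }
  };
  
  thread_local Context *tlsCtx = nullptr;
  
  // TLS style: every helper re-resolves the context through the thread_local.
  const std::string &saveViaTls(const std::string &s) {
    return tlsCtx->save(s);
  }
  
  // Explicit style: the caller resolves the context once and passes it down,
  // which is the shape we'd want for time-critical paths.
  const std::string &saveViaParam(Context &ctx, const std::string &s) {
    return ctx.save(s);
  }
  
  int main() {
    Context ctx;
    tlsCtx = &ctx;
    saveViaTls("via-tls");
    saveViaParam(ctx, "via-param");
  }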


================
Comment at: lld/tools/lld/lld.cpp:190
+  // Delete global state.
+  delete cg;
+  return !r ? 1 : 0;
----------------
amccarth wrote:
> Given that this is the not-exiting-early path, should this thread detach from the global state before deleting it?  That might help catch instances where something in the chain of destructors inadvertently tries to reference a global.
It detaches just at the end of `~CommonGlobals()`.
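
A minimal sketch of that ordering (hypothetical names, not the patch's actual class):

  #include <cassert>
  
  struct GlobalsSketch {
    static thread_local GlobalsSketch *tls;
  
    GlobalsSketch() { tls = this; }  // attach on construction
    ~GlobalsSketch() {
      // ... cleanup in the destructor body can still reach the globals
      // through `tls` ...
      assert(tls == this);
      tls = nullptr;                 // detach as the very last step
    }
  };
  thread_local GlobalsSketch *GlobalsSketch::tls = nullptr;
  
  int main() {
    auto *g = new GlobalsSketch();
    delete g;                        // after this point, tls is null again
  }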


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D108850/new/

https://reviews.llvm.org/D108850


