[PATCH] D108850: [LLD] Remove global state in lldCommon
Alexandre Ganea via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Aug 31 13:23:10 PDT 2021
aganea added a comment.
Thanks for taking a look!
In D108850#2970244 <https://reviews.llvm.org/D108850#2970244>, @amccarth wrote:
> This is a larger change than I anticipated.
It's a first basis for discussion. Please let me know if I should take this in a different direction.
> I was surprised by the need to have the pointer be thread local (and then ensure that the thread local pointers for new threads are set up, too). If I'm understanding correctly, the idea is that a single program could have n threads doing n links at the same time. Is that right? I'm not arguing against that. I'm just trying to make sure I understand why I was surprised.
Yes, that is correct.
The dream is to get to a point where `LLVM_PARALLEL_LINK_JOBS` wouldn't be needed (when linking in-process). Currently, each LLD process spawns its own thread pool, and the processes don't talk to each other. That increases memory usage and can (surprisingly) sometimes starve the CPU, for example when doing ThinLTO. I feel we could have better control over resources if we had a single "build/compile/link" application.
> It also introduces more indirection than I expected.
Indeed, more memory loads are now required to get to the global state. In the vast majority of cases it shouldn't matter, but in tight loops we will need to pass the globals pointer down explicitly.
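To make the pattern concrete, here is a minimal, self-contained sketch of the idea (not the actual patch; `LinkContext`, `context()`, and the rest are hypothetical names): each thread running a link attaches a thread-local pointer to its own state, and every access to the former globals pays one extra TLS load.

```
// Minimal sketch of the thread-local-context pattern; names are
// hypothetical, not taken from the patch.
#include <cassert>
#include <string>
#include <thread>
#include <vector>

struct LinkContext {
  // Per-link state that previously lived in mutable globals.
  std::vector<std::string> errors;
};

// Each thread running a link points this at its own context; spawned
// worker threads must have it set up too.
thread_local LinkContext *ctx = nullptr;

LinkContext *context() {
  assert(ctx && "thread not attached to a link");
  return ctx; // one extra TLS load per access
}

void link() {
  LinkContext localCtx;
  ctx = &localCtx;           // attach this thread to its own "globals"
  context()->errors.clear(); // every access goes through the TLS pointer
  ctx = nullptr;             // detach before the context goes away
}

int main() {
  // n threads can run n independent links concurrently.
  std::thread t1(link), t2(link);
  t1.join();
  t2.join();
}
```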
> I believe Rui (and others) favored globals in order to go as fast as possible. Are there lld benchmarks to give us an idea of the cost of this? If there's pushback from the performance die-hards, it would be nice to have an actual measurement or two to show (I hope) that the cost for this flexibility is low.
There seems to be a really tiny departure from the baseline after this change, but it's hard to tell since LLD is mostly I/O-bound.
The data set is ~5 GB of input .OBJs, producing an approx. 2.5 GB .PDB and a 475 MB executable.
Sometimes just running the same command line again flips the outcome, with "before" finishing ahead of "after" or vice versa:
Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz (6 cores):

```
>hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
Benchmark #1: before\lld-link.exe @link.rsp
  Time (mean ± σ):     13.740 s ±  0.578 s    [User: 0.0 ms, System: 53.5 ms]
  Range (min … max):   13.391 s … 15.321 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp
  Time (mean ± σ):     13.737 s ±  0.198 s    [User: 4.4 ms, System: 55.8 ms]
  Range (min … max):   13.566 s … 14.262 s    10 runs

Summary
  'after\lld-link.exe @link.rsp' ran
    1.00 ± 0.04 times faster than 'before\lld-link.exe @link.rsp'
```

AMD Ryzen 9 5950X 16-Core Processor @ 3.40 GHz:

```
>hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
Benchmark #1: before\lld-link.exe @link.rsp
  Time (mean ± σ):     10.178 s ±  0.091 s    [User: 3.0 ms, System: 5.2 ms]
  Range (min … max):    9.978 s … 10.303 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp
  Time (mean ± σ):     10.243 s ±  0.088 s    [User: 4.5 ms, System: 5.2 ms]
  Range (min … max):   10.115 s … 10.357 s    10 runs

Summary
  'before\lld-link.exe @link.rsp' ran
    1.01 ± 0.01 times faster than 'after\lld-link.exe @link.rsp'
```

AMD Ryzen Threadripper PRO 3975WX 32-Cores @ 3.50 GHz:

```
>hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
Benchmark #1: before\lld-link.exe @link.rsp
  Time (mean ± σ):     15.451 s ±  0.545 s    [User: 1.4 ms, System: 8.6 ms]
  Range (min … max):   15.024 s … 16.849 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp
  Time (mean ± σ):     15.243 s ±  0.087 s    [User: 0.0 ms, System: 8.6 ms]
  Range (min … max):   15.109 s … 15.392 s    10 runs

Summary
  'after\lld-link.exe @link.rsp' ran
    1.01 ± 0.04 times faster than 'before\lld-link.exe @link.rsp'
```

2x Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz (48 cores, dual CPU):

```
>hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
Benchmark #1: before\lld-link.exe @link.rsp
  Time (mean ± σ):     17.433 s ±  0.852 s    [User: 0.0 ms, System: 6.5 ms]
  Range (min … max):   15.202 s … 18.267 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp
  Time (mean ± σ):     17.661 s ±  0.876 s    [User: 1.4 ms, System: 2.6 ms]
  Range (min … max):   15.304 s … 18.442 s    10 runs

Summary
  'before\lld-link.exe @link.rsp' ran
    1.01 ± 0.07 times faster than 'after\lld-link.exe @link.rsp'
```
Profiling shows a difference in functions that seem to read from memory-mapped files, incurring hard page faults:
F18767391: lldCommon TLS.PNG <https://reviews.llvm.org/F18767391>
The first line (the lambda) is the first loop in `GSIHashStreamBuilder::finalizeBuckets`.
The second line is the `JamCRC` calculation in `createSectionContrib`.
I thought some cache lines could be evicted by the read of the `CommonGlobals` TLS. However, two random runs of "before" builds exhibit the same difference, so it really seems we're only seeing statistical fluctuation in the tests above. That said, I'd be happy to run tests on anything that you judge a representative data set.
================
Comment at: lld/COFF/Chunks.cpp:12
+#include "SymbolTable.h"
#include "Symbols.h"
#include "Writer.h"
----------------
amccarth wrote:
> I guess we're using ASCIIbetical rather than alphabetical ordering. :-)
;-)
================
Comment at: lld/MachO/Driver.cpp:149
if (fs::is_directory(buffer)) {
- paths.push_back(saver.save(buffer.str()));
+ paths.push_back(saver(&cg).save(buffer.str()));
found = true;
----------------
int3 wrote:
> why not just `saver()`?
Changed. Please see my other comment at lld/MachO/DriverUtils.cpp, L263.
================
Comment at: lld/MachO/DriverUtils.cpp:263
if (exists)
- return saver.save(location.str());
+ return saver(&cg).save(location.str());
}
----------------
amccarth wrote:
> In most cases, it's just `saver()`. Is there something special about these instances where you explicitly pass a pointer to the common globals?
Only the fact that it's a loop, and I wanted to make the intent explicit rather than rely on the optimizer to (maybe) avoid the repeated calls to `commonGlobals()`. But I switched to just `saver()`, since other operations in this function would dominate.
Ideally we should be passing state down the call stack instead of relying on TLS. This could be done in time-critical functions.
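To illustrate the trade-off, here is a self-contained sketch (the names below are stand-ins, not the patch's actual API): the first loop re-resolves the thread-local pointer on every iteration, while the second hoists the lookup out and passes the state down explicitly.

```
// Hedged sketch with stand-in names; not the patch's actual API.
#include <string>
#include <vector>

struct CommonGlobals {
  std::vector<std::string> savedStrings; // stand-in for the string saver
};

thread_local CommonGlobals *cg = nullptr;

CommonGlobals *commonGlobals() { return cg; } // a TLS load on every call

// Relies on the optimizer to (maybe) avoid the repeated TLS lookup.
void saveAllViaTls(const std::vector<std::string> &inputs) {
  for (const std::string &s : inputs)
    commonGlobals()->savedStrings.push_back(s);
}

// Hoists the lookup out of the loop and passes the state down.
void saveAllExplicit(CommonGlobals *g,
                     const std::vector<std::string> &inputs) {
  for (const std::string &s : inputs)
    g->savedStrings.push_back(s);
}
```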
================
Comment at: lld/tools/lld/lld.cpp:190
+ // Delete global state.
+ delete cg;
+ return !r ? 1 : 0;
----------------
amccarth wrote:
> Given that this is the not-exiting-early path, should this thread detach from the global state before deleting it? That might help catch instances where something in the chain of destructors inadvertently tries to reference a global.
It detaches right at the end of `~CommonGlobals()`.
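In other words (hedged sketch with hypothetical member names, not the actual patch), the TLS pointer is cleared as the last statement of the destructor body, so a stray access afterwards sees `nullptr` and fails loudly instead of touching freed state:

```
// Hedged sketch; member names are hypothetical.
struct CommonGlobals {
  static thread_local CommonGlobals *tls; // per-thread attachment

  ~CommonGlobals() {
    // ... tear down the link's state ...
    // Detach as the very last step of the destructor body. Member
    // destructors run after this, so a stray TLS access from one of
    // them would hit nullptr rather than freed state.
    tls = nullptr;
  }
};
thread_local CommonGlobals *CommonGlobals::tls = nullptr;
```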
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D108850/new/
https://reviews.llvm.org/D108850