[PATCH] D122922: [lld][common][lld-macho][lld-coff] Support per-thread allocators and StringSavers
Fangrui Song via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Apr 22 19:29:17 PDT 2022
MaskRay added a comment.
In D122922#3467683 <https://reviews.llvm.org/D122922#3467683>, @oontvoo wrote:
> In D122922#3467647 <https://reviews.llvm.org/D122922#3467647>, @aganea wrote:
>
>>> Right, but if the MachO part of lld does allocations from multiple threads, then it's essentially unreliable unless this option is enabled? So anyone building lld wanting to use the MachO part of it would need to enable it.
>>
>> This question was left unanswered -- how do you plan on using `LLD_THREAD_SAFE_MEMORY`? Will it be enabled only when building for Darwin? What happens for cross-compilation or if `LLD_THREAD_SAFE_MEMORY` isn't enabled? The make/makePerThread/makeUnsafe APIs seem a bit error-prone to me. Ideally it is a choice that should be avoided, if possible, by the business logic/driver developer.
>
> Sorry, forgot to follow up on this. From offline chat with int3, I think decided to remove this and to just use the existing `LLD_ENABLE_THREADS` instead. That is to say, code would have thread safe saver/make() by default, unless it goes out of its way to call the unsafe() variants.
>
> It also wasn't clear from @MaskRay's comment whehter the 1% slow down was for llvm::ThreadLocal alone or if it also applied to the native TLS approach. Was not able to reproduce the regression from my ends with either approach, and I didn't want to keep updating the patch .
I have a chrome build on Linux x86-64. Say /tmp/c/0 has a lld built from main branch. /tmp/c/1 is a lld built with this patch.
I have measured this:
% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8
Time (mean ± σ): 5.584 s ± 0.035 s [User: 9.164 s, System: 2.318 s]
Range (min … max): 5.536 s … 5.656 s 16 runs
Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8
Time (mean ± σ): 5.637 s ± 0.048 s [User: 9.205 s, System: 2.305 s]
Range (min … max): 5.565 s … 5.765 s 16 runs
Summary
'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8' ran
1.01 ± 0.01 times faster than 'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8'
% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-23 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=4"
Benchmark 1: numactl -C 20-23 /tmp/c/0 -flavor gnu @response.txt --threads=4
Time (mean ± σ): 6.227 s ± 0.044 s [User: 8.547 s, System: 2.122 s]
Range (min … max): 6.165 s … 6.353 s 16 runs
Benchmark 2: numactl -C 20-23 /tmp/c/1 -flavor gnu @response.txt --threads=4
Time (mean ± σ): 6.260 s ± 0.033 s [User: 8.605 s, System: 2.096 s]
Range (min … max): 6.200 s … 6.325 s 16 runs
Summary
'numactl -C 20-23 /tmp/c/0 -flavor gnu @response.txt --threads=4' ran
1.01 ± 0.01 times faster than 'numactl -C 20-23 /tmp/c/1 -flavor gnu @response.txt --threads=4'
Note that I use mimalloc and tend to care about performance more with a better malloc (mimalloc/tcmalloc/jemalloc/snmalloc/etc) than the glibc malloc.
> I think it's a bit unfortunate that the library-fication work here is making this change hard to land
I share the same concern. I know that I replied a +1 on that thread, and at that time I did not pay more attention on the performance.
Now I am probably more on the fence. I think the original CommonLinkerContext.h change probably caused 1+% regression. Adding another 1% will be too much.
I am pretty sure `llvm/Support/ThreadLocal.h` will not be an acceptable solution for ELF due to the pthread_getspecific cost.
`thread_local` shall be fine and as mentioned previous has been used in several places in llvm and clang.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D122922/new/
https://reviews.llvm.org/D122922
More information about the llvm-commits
mailing list