[PATCH] D113073: [lld-macho] Cache library paths from findLibrary

Tue Nov 9 22:23:47 PST 2021

smeenai added subscribers: ruiu, thakis, rnk.
smeenai added a comment.

In D113073#3118817 <https://reviews.llvm.org/D113073#3118817>, @keith wrote:

> In D113073#3118477 <https://reviews.llvm.org/D113073#3118477>, @oontvoo wrote:
>
>> In D113073#3106528 <https://reviews.llvm.org/D113073#3106528>, @keith wrote:
>>
>>> In D113073#3106474 <https://reviews.llvm.org/D113073#3106474>, @oontvoo wrote:
>>>
>>>> In D113073#3105039 <https://reviews.llvm.org/D113073#3105039>, @keith wrote:
>>>>
>>>>> In D113073#3104904 <https://reviews.llvm.org/D113073#3104904>, @int3 wrote:
>>>>>
>>>>>> By the way, thanks for contributing all these optimizations! I was quite surprised to hear that ld64 was faster, given that LLD is typically much faster for our own workloads, but I guess you have rather different inputs. Hopefully we can make LLD the fastest Mach-O linker for all builds :)
>>>>>
>>>>> Thanks for all the reviews! For context our project is a huge iOS application with on the order of thousands of static libraries, and many iOS system framework + system library dependencies. Is this a case you've benchmarked? If so I'd be interested to dig a bit deeper into the differences to try and understand why it has been slower for us.
>>>>
>>>> For one of our largest iOS apps that I've measured (different from int3's):
>>>>
>>>> - ~7100 archives
>>>> - 56 frameworks  (including system ones)
>>>> - 12 weak frameworks
>>>
>>> Thanks for the info! What's your bottleneck at this point? After all my changes here, our biggest remaining cost is `lld::macho::readFile`, where we now spend 20+ seconds.
>>
>> *With* all of the input-loading patches applied, linking the app above is only slightly faster than our optimised LD64:
>> (cross-linking, on Linux - Broadwell (yeah, kind of old) RAM 32GB)
>>
>>   x ./LLD_with_cache_patches.txt
>>   + ./LD64_local_imprv.txt
>>   +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>   |x     *           x  *x  x  * *          +      +   +                         +                                                                +                      +|
>>   |        |_|________A__M______|__________________M____________A____________________________________________________|                                                    |
>>   +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>       N           Min           Max        Median           Avg        Stddev
>>   x  10         23.64         23.89         23.82        23.799   0.077667382
>>   +  10         23.69         25.01         24.04        24.146     0.4379041
>>   Difference at 95.0% confidence
>>   	0.347 +/- 0.295482
>>   	1.45804% +/- 1.24157%
>>   	(Student's t, pooled s = 0.314478)
>>
>> Looking at the trace, the bottleneck is still in loading input and writing output. 
>> It seems it could be improved further ...
>>
>> F20184452: Screen Shot 2021-11-09 at 10.20.13 AM.png <https://reviews.llvm.org/F20184452>
>
> Thanks for the data point! You might have mentioned before but does this mean these changes _were_ an improvement for you as well? Do you have a time for this benchmark before these changes?
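
For anyone reading the ministat output quoted above: the "Difference at 95.0% confidence" lines can be reproduced from the per-run summary rows. The sketch below is only an illustration of the standard pooled two-sample Student's t computation. The names `Sample`, `Diff` and `compareMeans` are made up for this example, and the critical value 2.101 (two-sided 95%, 18 degrees of freedom) is passed in by hand rather than computed.

```cpp
#include <cassert>
#include <cmath>

// Summary of one set of timing runs, as printed in a ministat row.
struct Sample {
  int n;         // number of runs
  double mean;   // "Avg" column
  double stddev; // "Stddev" column
};

struct Diff {
  double delta;     // difference of the means (b - a)
  double pooled;    // pooled standard deviation
  double halfWidth; // half-width of the confidence interval
};

// Pooled two-sample comparison. tCrit is the two-sided Student's t critical
// value for (a.n + b.n - 2) degrees of freedom; the caller supplies it.
inline Diff compareMeans(const Sample &a, const Sample &b, double tCrit) {
  assert(a.n > 1 && b.n > 1 && "need at least two runs per sample");
  double dof = a.n + b.n - 2;
  double pooledVar = ((a.n - 1) * a.stddev * a.stddev +
                      (b.n - 1) * b.stddev * b.stddev) / dof;
  double pooled = std::sqrt(pooledVar);
  double halfWidth = tCrit * pooled * std::sqrt(1.0 / a.n + 1.0 / b.n);
  return {b.mean - a.mean, pooled, halfWidth};
}
```

Feeding in the two rows from the table (10 runs each, averages 23.799 and 24.146) gives a difference of about 0.347 s with a half-width of about 0.295 s. The interval excludes zero, so the roughly 1.5% advantage for LLD is statistically distinguishable from noise, if small.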

This is interesting. LLD has been at least 2x faster for us than ld64 (and also faster than zld, which is a public ld64 fork with speed improvements), and I believe it was 3 to 4x faster for @thakis. You have some impressive ld64 improvements :)

It's interesting that you have the large gap between parsing input files and the next trace. I haven't seen that on our end yet.

I spent some time looking at the performance for loading inputs. Most of the time was going into parsing symbols and relocations, as expected.

Doing symbol resolution in parallel seems pretty tricky because of archive loading semantics and weak symbols. A while back, @ruiu had the idea to at least parallelize all the symbol table insertions (https://lists.llvm.org/pipermail/llvm-dev/2019-April/131902.html), but I don't think he prototyped it (or at least I can't find the results from the prototype anywhere). More recently, @int3 was looking into parallelizing bits of LLD (https://lists.llvm.org/pipermail/llvm-dev/2021-April/149651.html), and there was some interesting discussion there (particularly Jez's follow-up in https://lists.llvm.org/pipermail/llvm-dev/2021-April/149675.html, and @rnk's parallel symbol resolution idea in https://lists.llvm.org/pipermail/llvm-dev/2021-April/149689.html).

For relocation processing specifically, the expensive parts on our end were (a) fetching embedded addends <https://github.com/llvm/llvm-project/blob/b4f6f1c9369ec4bb1c10852283a8c7e8c39e1a8d/lld/MachO/InputFiles.cpp#L444> from the instruction stream (which results in lots of cache misses), and (b) for relocations pointing to a section (i.e. where the `r_extern` bit isn't set), finding the subsection to associate them with <https://github.com/llvm/llvm-project/blob/b4f6f1c9369ec4bb1c10852283a8c7e8c39e1a8d/lld/MachO/InputFiles.cpp#L474>. (a) seems kinda unavoidable. For (b), I tried two things: packing the offsets tightly into their own array instead of having them spread out across a struct, which didn't help, and doing a linear search instead of a binary search, which was much worse.

I also had the idea of parallelizing relocation parsing across all InputFiles (it happens after symbol resolution, so it should be trivially parallelizable), but surprisingly that didn't make any difference either (I only tried it with one link, and I haven't looked into why). For reference, a WIP sketch of the parallel relocation processing, which won't work for LTO, is in D113542 <https://reviews.llvm.org/D113542>.
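
To make the two lookup variants from (b) concrete, here is a minimal standalone sketch. It is not the code in InputFiles.cpp: `findSubsectionIndex` and `findSubsectionIndexLinear` are hypothetical names, and the sketch assumes the subsection start offsets are already packed into a sorted `std::vector<uint32_t>` whose first entry is 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not LLD's API): given the sorted start offsets of a
// section's subsections, return the index of the subsection containing `off`.
inline size_t findSubsectionIndex(const std::vector<uint32_t> &starts,
                                  uint32_t off) {
  assert(!starts.empty() && starts.front() == 0 &&
         "first subsection must start at offset 0");
  // upper_bound finds the first start strictly greater than `off`; the
  // subsection containing `off` is the one immediately before it.
  auto it = std::upper_bound(starts.begin(), starts.end(), off);
  return static_cast<size_t>(it - starts.begin()) - 1;
}

// Linear-scan variant of the same lookup, for comparison.
inline size_t findSubsectionIndexLinear(const std::vector<uint32_t> &starts,
                                        uint32_t off) {
  assert(!starts.empty() && starts.front() == 0 &&
         "first subsection must start at offset 0");
  size_t i = 0;
  // Advance while the next subsection still starts at or before `off`.
  while (i + 1 < starts.size() && starts[i + 1] <= off)
    ++i;
  return i;
}
```

Both functions return the same index for any offset. The binary search does O(log n) comparisons per relocation, while the scan does O(n), which fits the linear version being much worse on sections with many subsections.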

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113073/new/

https://reviews.llvm.org/D113073