[llvm-dev] [lld] We call SymbolBody::getVA redundantly a lot...

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Thu Mar 2 16:48:22 PST 2017


On Wed, Mar 1, 2017 at 5:08 PM, Rui Ueyama <ruiu at google.com> wrote:

> On Wed, Mar 1, 2017 at 2:31 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>>
>>
>> On Tue, Feb 28, 2017 at 11:39 PM, Rui Ueyama <ruiu at google.com> wrote:
>>
>>> I also did some quick profiling a few months ago and noticed, just like
>>> you, that scanRelocations consumes a fairly large percentage of overall
>>> execution time. That caught my attention because at the time I was looking
>>> for a place that I could parallelize.
>>>
>>> scanRelocations is not parallelizable as-is because it performs global
>>> operations as side effects, such as adding new symbols to the GOT/PLT. My
>>> idea is to split the function into two halves. In the first half, we scan
>>> relocations and mark some symbols in parallel. In the second half, we add
>>> them to the GOT/PLT.
>>>
>>
>> Won't that require scanning all symbols in the second pass? I think that
>> we want to do O(number of GOT/PLT entries) work (and cache footprint) in
>> the second pass instead of O(symbols).
>>
>
> It will, but it might not be as bad as you imagine, because the number of
> global symbols is much smaller than the number of relocations. For
> example, at SymbolTable.cpp#673
> <https://github.com/llvm-project/llvm-project/blob/0ad2ea48e8fa9b6243c2ec20b57b4268c279f9bf/lld/ELF/SymbolTable.cpp#L673>,
> it iterates over all non-local symbols, but that is pretty cheap (which
> surprised me at first).
>

Interesting. I guess it makes sense though.
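
For concreteness, here is a minimal sketch of the two-pass split described
above. The types and flags are made-up stand-ins rather than LLD's actual
classes; the point is just that the first pass only sets per-symbol flags (so
it can run in parallel over relocations), and the second pass serially
allocates GOT/PLT slots by walking the much smaller set of global symbols.

```
// Hypothetical sketch of the two-pass scheme; Symbol, Reloc and the
// relocation-kind flags below are simplified stand-ins, not LLD's classes.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Symbol {
  std::atomic<bool> NeedsGot{false};
  std::atomic<bool> NeedsPlt{false};
  size_t GotIndex = SIZE_MAX;
};

struct Reloc {
  Symbol *Sym;
  bool NeedsGot; // whether this relocation kind requires a GOT entry
  bool NeedsPlt; // whether this relocation kind requires a PLT entry
};

// Pass 1: can run in parallel over relocations because it only sets
// per-symbol flags and mutates no shared containers.
void markSymbols(std::vector<Reloc> &Relocs) {
  for (Reloc &R : Relocs) { // in LLD this loop would use a parallel for-each
    if (R.NeedsGot)
      R.Sym->NeedsGot.store(true, std::memory_order_relaxed);
    if (R.NeedsPlt)
      R.Sym->NeedsPlt.store(true, std::memory_order_relaxed);
  }
}

// Pass 2: serial, but proportional to the number of global symbols,
// which is much smaller than the number of relocations.
void allocateGot(std::vector<Symbol *> &Globals, std::vector<Symbol *> &Got) {
  for (Symbol *S : Globals) {
    if (S->NeedsGot.load(std::memory_order_relaxed)) {
      S->GotIndex = Got.size();
      Got.push_back(S);
    }
  }
}
```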


>
>
>
>>
>>> I think it's doable, although I haven't actually tried it yet.
>>> Relocation processing is one of the most complex things in LLD, at least
>>> partly due to the nature of the ELF format, and it is not easy to hack
>>> that code to try out new ideas. But it should be worth a try.
>>>
>>
>> Yes, I noticed too that this code has a lot of essential complexity.
>> However, it looks like it has grown organically, so there are likely
>> opportunities to simplify it.
>>
>>
>>>
>>> I believe it has the potential to make LLD 10% or more faster, as
>>> scanRelocations currently takes more than 20% of total execution time.
>>>
>>
>> Wasn't there also the idea of merging this scan into the final
>> relocation processing?
>>
>> In the code, it says:
>> ```
>> // The reason we have to do this early scan is as follows
>> // * To mmap the output file, we need to know the size
>> // * For that, we need to know how many dynamic relocs we will have.
>> ```
>>
>> But I think you can resize a memory mapping (e.g. with ftruncate) to add
>> data at the end. Or conservatively overallocate and then truncate at the
>> end (pages we don't touch won't be allocated, so the overallocation
>> shouldn't cost too much).
>>
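
(For what it's worth, the overallocate-and-truncate idea would look roughly
like the POSIX sketch below; the file name and the size bound are arbitrary
placeholders, and, as pointed out in the reply that follows, multiple
variable-sized sections make it hard to apply directly.)

```
// Rough sketch of "overallocate, then shrink" with POSIX calls; the file
// name and the upper bound are arbitrary values for illustration only.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

int main() {
  const size_t UpperBound = size_t(1) << 30; // conservative output-size bound
  int FD = open("output.tmp", O_RDWR | O_CREAT | O_TRUNC, 0755);
  if (FD < 0)
    return 1;
  // Reserve the upper bound up front. Pages that are never written are not
  // materialized by the OS, so the reservation itself is cheap.
  if (ftruncate(FD, UpperBound) != 0)
    return 1;
  void *Buf =
      mmap(nullptr, UpperBound, PROT_READ | PROT_WRITE, MAP_SHARED, FD, 0);
  if (Buf == MAP_FAILED)
    return 1;
  // ... write the output into Buf, tracking how much was actually used ...
  size_t ActualSize = 4096; // placeholder for the real final size
  munmap(Buf, UpperBound);
  ftruncate(FD, ActualSize); // shrink the file once the real size is known
  return close(FD);
}
```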
>
> Putting a variable-length thing at the end of a file works only if you
> have just one variable-sized thing, but a linker can have several, such
> as the string table, GOT, PLT, thunks, etc. So it is probably hard to do
> that. Also, it doesn't work with linker scripts if the script enforces
> some specific section layout.
>

Gah, that's annoying :(

-- Sean Silva


>
>
>> -- Sean Silva
>>
>>
>>>
>>> On Tue, Feb 28, 2017 at 11:15 PM, Sean Silva <chisophugis at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Feb 28, 2017 at 12:10 PM, Rui Ueyama <ruiu at google.com> wrote:
>>>>
>>>>> I don't think getVA is particularly expensive, and if it is not
>>>>> expensive I wouldn't cache its result. Did you experiment with caching
>>>>> getVA results? I think you can do that fairly easily by adding a
>>>>> std::atomic_uint64_t to SymbolBody and using it as a cache for getVA.
>>>>>
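
A minimal sketch of what that cache might look like, assuming UINT64_MAX can
serve as a "not computed yet" sentinel; this is not LLD's actual SymbolBody,
and computeVA() stands in for the existing uncached logic.

```
// Hypothetical illustration of the std::atomic cache idea; SymbolBody's real
// interface differs, and computeVA() stands in for the existing computation.
#include <atomic>
#include <cstdint>

struct SymbolBody {
  uint64_t getVA() const {
    uint64_t V = CachedVA.load(std::memory_order_relaxed);
    if (V == UINT64_MAX) {
      V = computeVA(); // the original, uncached computation
      // Benign race: every thread computes the same value, so it does not
      // matter which store wins.
      CachedVA.store(V, std::memory_order_relaxed);
    }
    return V;
  }

private:
  uint64_t computeVA() const; // existing getVA logic would live here
  // UINT64_MAX is used as the "not computed yet" sentinel.
  mutable std::atomic<uint64_t> CachedVA{UINT64_MAX};
};
```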
>>>>
>>>>
>>>> You're right, caching it didn't have any significant effect (though I
>>>> wasn't measuring super precisely). I think I was remembering the profile
>>>> wrong. I remember measuring that we had some very bad cache/TLB misses
>>>> here, but I guess those aren't too important on the current profile (at
>>>> least, not on this test case; the locality of these accesses depends a lot
>>>> on the test case).
>>>>
>>>> Also, it seems like our performance is a lot more stable w.r.t.
>>>> InputSectionBase::relocate than it used to be (or maybe my current CPU
>>>> is just less affected; it's a desktop-class processor instead of a Xeon).
>>>>
>>>>
>>>> I took a quick profile of this workload and it looks like it is:
>>>>
>>>> 65% in the writer ("backend")
>>>> 30% in the "frontend" (everything called by SymbolTable::addFile)
>>>>
>>>> The frontend work seems to be largely dominated by ObjectFile::parse
>>>> (as you would expect), though there is about 10% of total runtime slipping
>>>> through the cracks here in various other "frontend" tasks.
>>>>
>>>> The backend work is split about evenly between scanRelocations and
>>>> OutputSection::writeTo. InputSectionBase::relocate is only about 10% of the
>>>> total runtime (part of OutputSection::writeTo).
>>>>
>>>> Some slightly cleaned up `perf report` output with some more details:
>>>> https://reviews.llvm.org/P7972
>>>>
>>>> So it seems like overall, the profile is basically split 3 ways (about
>>>> 30% each):
>>>> - frontend (reading input files and building the symbol table and
>>>> associated data structures)
>>>> - scanRelocations (initial pass over relocations)
>>>> - writeTo (mostly IO and InputSectionBase::relocate)
>>>>
>>>> -- Sean Silva
>>>>
>>>>
>>>>>
>>>>> On Tue, Feb 28, 2017 at 4:19 AM, Sean Silva <chisophugis at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> tl;dr: it looks like we call SymbolBody::getVA about 5x more often
>>>>>> than we need to.
>>>>>>
>>>>>> Should we cache it or something? (careful with threads).
>>>>>>
>>>>>>
>>>>>> Here is a link to a PDF of my Mathematica notebook which has all the
>>>>>> details of my investigation:
>>>>>> https://drive.google.com/open?id=0B8v10qJ6EXRxVDQ3YnZtUlFtZ1k
>>>>>>
>>>>>>
>>>>>> There seem to be two main regimes that we redundantly call
>>>>>> SymbolBody::getVA:
>>>>>>
>>>>>> 1. Most redundant calls on the same symbol (about 80%) happen in
>>>>>> quick succession with few intervening calls for other symbols. Most
>>>>>> likely we are processing a bunch of relocations right next to each
>>>>>> other that all refer to the same symbol (or a small set of symbols),
>>>>>> e.g. within a TU.
>>>>>>
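
If regime 1 dominates, even a trivial one-entry cache in the relocation loop
would remove most of the repeated calls. A hypothetical sketch (the types are
simplified stand-ins, and this is not something LLD currently does):

```
// Hypothetical one-entry cache exploiting the locality described above;
// SymbolBody and Relocation are simplified stand-ins.
#include <cstddef>
#include <cstdint>

struct SymbolBody { uint64_t getVA() const; };
struct Relocation { SymbolBody *Sym; uint64_t Offset; };

void applyRelocations(const Relocation *Rels, size_t N) {
  const SymbolBody *LastSym = nullptr;
  uint64_t LastVA = 0;
  for (size_t I = 0; I != N; ++I) {
    if (Rels[I].Sym != LastSym) {
      LastSym = Rels[I].Sym;
      LastVA = LastSym->getVA(); // only recomputed when the symbol changes
    }
    // ... use LastVA to patch the output buffer at Rels[I].Offset ...
    (void)LastVA;
  }
}
```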
>>>>>> 2. There is a long-ish tail (about 20% of calls to SymbolBody::getVA)
>>>>>> of calls that happen at a long temporal distance from any previous
>>>>>> call to SymbolBody::getVA on the same symbol. I don't know off the top
>>>>>> of my head where these are coming from, but it doesn't sound like
>>>>>> relocations. A quick grep shows a bunch of source locations that match
>>>>>> getVA, so it's hard to tell at a glance. Any ideas where these other
>>>>>> calls are coming from?
>>>>>>
>>>>>> The particular link I was looking at was a release (without debug
>>>>>> info) link, using `-O0 --no-gc-sections --no-threads`. The test case
>>>>>> is LLD itself.
>>>>>>
>>>>>> -- Sean Silva
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>