[llvm-dev] LLD: Possible optimization for TargetInfo

Rui Ueyama via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 30 18:47:02 PDT 2016


On Wed, Mar 30, 2016 at 6:42 PM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Wed, Mar 30, 2016 at 6:17 PM, Rui Ueyama <ruiu at google.com> wrote:
>
>> On Wed, Mar 30, 2016 at 5:34 PM, Sean Silva <chisophugis at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Mar 30, 2016 at 4:25 PM, Rui Ueyama <ruiu at google.com> wrote:
>>>
>>>> On Wed, Mar 30, 2016 at 4:20 PM, Sean Silva <chisophugis at gmail.com>
>>>> wrote:
>>>>
>>>>> I believe the relocation stuff that Rafael is currently working on
>>>>> will make this a non-issue (it will make relocation application much
>>>>> friendlier for the CPU).
>>>>>
>>>>
>>>> I don't think Rafael's patch would make this a non-issue. He's making
>>>> scanRelocs create data, which would reduce the number of calls to the
>>>> virtual functions, but it wouldn't reduce them to zero.
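>>>>
>>>> (As a rough illustration of that direction, with made-up names such as
>>>> ResolvedReloc and applyResolved, not the actual patch: the scan pass
>>>> would make the target-specific decisions up front, so the apply loop
>>>> only touches plain data.)
>>>>
>>>>   #include <cstdint>
>>>>   #include <cstring>
>>>>   #include <vector>
>>>>
>>>>   // Produced once by the scan pass, possibly via virtual TargetInfo calls.
>>>>   struct ResolvedReloc { uint64_t Offset; uint64_t Value; uint8_t Size; };
>>>>
>>>>   // The apply loop then sees only plain data; the virtual calls are paid
>>>>   // in the scan pass rather than here. (Little-endian sketch.)
>>>>   void applyResolved(uint8_t *Buf, const std::vector<ResolvedReloc> &Rs) {
>>>>     for (const ResolvedReloc &R : Rs)
>>>>       std::memcpy(Buf + R.Offset, &R.Value, R.Size);
>>>>   }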
>>>>
>>>> However, even in the current scheme, since the target is fixed, all the
>>>>> indirect call sites should be monomorphic and so there shouldn't be much
>>>>> branch-prediction cost (certainly nothing that would cause 1.8% performance
>>>>> delta for the entire link).
>>>>>
>>>>
>>>> Agreed. We could template the functions that call TargetInfo's member
>>>> functions on the target type to eliminate the virtual function calls.
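>>>>
>>>> A minimal sketch of that idea, with hypothetical names (Reloc,
>>>> X86_64Target, applyRelocations; the real LLD interfaces differ), where
>>>> the hot loop is instantiated once per target so each call is direct and
>>>> inlinable:
>>>>
>>>>   #include <cstdint>
>>>>   #include <vector>
>>>>
>>>>   struct Reloc { uint64_t Offset; uint32_t Type; uint64_t Value; };
>>>>
>>>>   // Non-virtual, target-specific policy type.
>>>>   struct X86_64Target {
>>>>     void relocateOne(uint8_t *Loc, uint32_t Type, uint64_t Val) const {
>>>>       // ... patch the bytes at Loc for this relocation type ...
>>>>     }
>>>>   };
>>>>
>>>>   // Instantiated once per target; relocateOne is a direct call here.
>>>>   template <class TargetT>
>>>>   void applyRelocations(uint8_t *Buf, const std::vector<Reloc> &Relocs,
>>>>                         const TargetT &Target) {
>>>>     for (const Reloc &R : Relocs)
>>>>       Target.relocateOne(Buf + R.Offset, R.Type, R.Value);
>>>>   }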
>>>>
>>>
>>> From what has been presented I would not conclude that virtual calls are
>>> actually the problem (or a problem at all). A root-cause analysis is
>>> necessary. As r263227 shows, the relocation application loop is very
>>> sensitive to small changes.
>>>
>>> One quick thing you may also want to try as a sanity check is inserting
>>> nops in different places in the function. I suspect you'll find that the
>>> performance swings (both speedups and slowdowns) from doing that are
>>> similar in magnitude to what you are seeing. You may also want to try
>>> editing the indirect call instruction to a direct call without otherwise
>>> modifying the binary; if that reproduces the 1.8% speedup then it will be
>>> convincing.
>>>
>>
>> Honestly, I was somewhat skeptical about what you wrote here, but I
>> observed a 0.4% *slowdown* when I used gcc to compile it, so it looks like
>> I was wrong. It is possible that devirtualization was effective for the
>> clang-generated code, but it is more likely that the speedup was just
>> performance variation caused by some other factor.
>>
>> The relocation handling loop is really a tight loop and therefore
>> sensitive to small changes. How can we optimize this? Maybe PGO?
>>
>
> Rafael's change will fix it. That is why he is doing it in the first place
> :)
> (The idea came when we were crunching the numbers for "LLD performance
> w.r.t. local symbols (and --build-id)" and looked at r263227. I suggested
> that this looked like the loop getting long enough to exhaust the CPU's
> reorder buffer while waiting on cache misses, preventing it from seeing
> the memory accesses of the next iteration and thus failing to pipeline
> memory accesses across iterations. Small changes in scheduling,
> instruction count, etc. will tickle this and cause large performance
> swings. The solution is to make the relocation application loop tighter,
> especially by splitting the work that currently sits inside one huge loop
> into separate loops, but also by making each loop tighter.)
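>
> (A toy sketch of the loop-fission idea, with invented names: split the one
> big loop into passes whose bodies are small enough for the CPU to overlap
> the cache misses of consecutive iterations.)
>
>   #include <cstddef>
>   #include <cstdint>
>   #include <cstring>
>   #include <vector>
>
>   struct Reloc { uint64_t Offset; uint64_t SymAddr; uint64_t Addend; };
>
>   void applyAll(uint8_t *Buf, const std::vector<Reloc> &Relocs) {
>     // Pass 1: compute the values to write; the loop body is tiny, so the
>     // CPU can keep many iterations in flight.
>     std::vector<uint64_t> Vals(Relocs.size());
>     for (std::size_t I = 0; I < Relocs.size(); ++I)
>       Vals[I] = Relocs[I].SymAddr + Relocs[I].Addend;
>
>     // Pass 2: stream the writes into the output buffer.
>     for (std::size_t I = 0; I < Relocs.size(); ++I)
>       std::memcpy(Buf + Relocs[I].Offset, &Vals[I], sizeof(uint64_t));
>   }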
>

Well, just as I'm now skeptical about my own result, I'm skeptical that it
would really fix it, but maybe we should just wait and measure once
Rafael's patch is ready. :)

> -- Sean Silva
>
>
>>
>>> If you haven't read it, I think you would enjoy this paper:
>>> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37077.pdf
>>>
>>> -- Sean Silva
>>>
>>>
>>>>
>>>>
>>>>> Notice that 1.8% is smaller than the performance variation from r263227,
>>>>> which is a very innocuous-looking change but caused a ~2-3% slowdown
>>>>> for ScyllaDB (see the thread "LLD performance w.r.t. local symbols (and
>>>>> --build-id)").
>>>>>
>>>>> -- Sean Silva
>>>>>
>>>>> On Wed, Mar 30, 2016 at 3:39 PM, Rui Ueyama via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> I was wondering how much overhead the virtual function calls to
>>>>>> TargetInfo member functions add. TargetInfo handles platform-specific
>>>>>> details, and we have target-specific subclasses of that class; the
>>>>>> subclasses override functions defined in TargetInfo.
>>>>>>
>>>>>> The TargetInfo member functions are called multiple times for each
>>>>>> relocation, so the cost of the virtual function calls may be
>>>>>> non-negligible. That was the motivation for the following test.
>>>>>>
>>>>>> As a test, I removed all TargetInfo subclasses except for x86-64, moved
>>>>>> all the code from X86_64TargetInfo into TargetInfo, and removed
>>>>>> `virtual` from TargetInfo.
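>>>>>>
>>>>>> (Roughly, the difference between the two configurations; illustrative
>>>>>> declarations only, not the exact LLD code:)
>>>>>>
>>>>>>   #include <cstdint>
>>>>>>
>>>>>>   namespace before {
>>>>>>   // Every call goes through the vtable: one indirect call per use.
>>>>>>   struct TargetInfo {
>>>>>>     virtual void relocateOne(uint8_t *Loc, uint32_t Type,
>>>>>>                              uint64_t Val) const = 0;
>>>>>>     virtual ~TargetInfo() = default;
>>>>>>   };
>>>>>>   } // namespace before
>>>>>>
>>>>>>   namespace after {
>>>>>>   // X86_64 code folded in and `virtual` removed: every call site
>>>>>>   // becomes a direct, inlinable call.
>>>>>>   struct TargetInfo {
>>>>>>     void relocateOne(uint8_t *Loc, uint32_t Type, uint64_t Val) const {}
>>>>>>   };
>>>>>>   } // namespace after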
>>>>>>
>>>>>> The original LLD links itself (with debug info) in 7.499 seconds. The
>>>>>> de-virtualized version did the same thing in 7.364 seconds, so
>>>>>> de-virtualization improved link time by about 1.8%.
>>>>>>
>>>>>> I'm just pointing out that there's room to improve performance here;
>>>>>> I'm not suggesting we do anything about it right now. We probably
>>>>>> shouldn't, because the current code is pretty straightforward. But I'd
>>>>>> expect that we will eventually want to do something about it in the
>>>>>> future.
>>>>>>
>>>>>
>>>>
>>>
>>
>