[lld] Dealing with limited branch reach?

Rui Ueyama via llvm-commits llvm-commits at lists.llvm.org
Tue Oct 20 18:23:30 PDT 2015


On Tue, Oct 20, 2015 at 6:15 PM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Tue, Oct 20, 2015 at 6:09 PM, Rui Ueyama <ruiu at google.com> wrote:
>
>> On Tue, Oct 20, 2015 at 6:03 PM, Sean Silva <chisophugis at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Tue, Oct 20, 2015 at 5:32 PM, Rui Ueyama via llvm-commits <
>>> llvm-commits at lists.llvm.org> wrote:
>>>
>>>> COFF ARM has the same problem, which has not been solved yet. On ARM,
>>>> just like PPC, a jump instruction's displacement is limited, so if the
>>>> relocation target is too far away, the linker has to create a stub that
>>>> jumps to the desired function using an instruction with a larger
>>>> displacement, and rewrite the relocation to point to that stub.
>>>>
>>>> There are, I think, a few requirements for a production-quality linker
>>>> handling such relocations.
>>>>
>>>>  - We don't want to use stubs unless needed: The simplest "solution"
>>>> would be to always create stubs at the end of each function and always use
>>>> them, but that would make all function calls indirect. That is not
>>>> acceptable from a performance perspective.
>>>>
>>>>  - We don't want to create unnecessary room between functions: We could
>>>> leave room after each function and backfill that space with stubs when
>>>> they are needed. That's simple, but it bloats the code, so it's probably
>>>> unacceptable.
>>>>
>>>> This is an interesting packing problem because we don't know the exact
>>>> distance between two arbitrary instructions until we lay out the output
>>>> sections. But once we fix the layout, it's too late to make room for stubs.
>>>>
>>>> I don't really know the best way to lay out sections for such
>>>> architectures, but what I was thinking for ARM is something like this:
>>>>
>>>> 1. Layout output sections without considering relocations
>>>> 2. Visit all sections and all relocations to check if all displacements
>>>> are within their range. If not, create a new "stub section", insert it
>>>> after the current section, and rewrite the relocations.
>>>> 3. Re-assign VMA and file offsets for each section.
>>>> 4. Repeat steps 2 and 3 until convergence.
>>>>
>>>> We need step 4 because inserting a stub may push some previously
>>>> reachable relocations out of range. (I think that happens rarely, so it
>>>> should converge quickly.)
>>>>
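The four steps above can be sketched as a small fixpoint loop. This is a hedged toy model, not lld's actual code: the +/- 32 MiB branch range and 16-byte stub size are assumptions, and sections are plain tuples.

```python
BRANCH_RANGE = 1 << 25  # assumed: +/- 32 MiB reach of a 26-bit signed branch
STUB_SIZE = 16          # assumed size of one long-branch stub

def layout(sections):
    """Steps 1 and 3: assign a VMA to each section, laid out back to back."""
    vmas, addr = {}, 0
    for name, size, _relocs in sections:
        vmas[name] = addr
        addr += size
    return vmas

def add_stubs_until_converged(sections):
    """Steps 2 and 4: insert stub sections and re-lay-out until stable.
    A section is (name, size, relocs); a reloc is (offset, target_name)."""
    while True:
        vmas = layout(sections)
        out, changed = [], False
        for name, size, relocs in sections:
            near, far = [], []
            for off, target in relocs:
                delta = vmas[target] - (vmas[name] + off)
                (far if abs(delta) > BRANCH_RANGE else near).append((off, target))
            if far:
                # Step 2: put a stub section right after this one and
                # rewrite the out-of-range relocations to point at it.
                stub = name + ".stub"
                out.append((name, size, near + [(off, stub) for off, _ in far]))
                out.append((stub, STUB_SIZE * len(far), []))
                changed = True
            else:
                out.append((name, size, relocs))
        sections = out
        if not changed:  # step 4: converged, every branch is in range
            return sections, layout(sections)
```

Note that inserting a stub section grows every later address, which is exactly what can push another relocation out of range and force one more pass.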
>>>
>>> A variation of this approach is to do the opposite: initially lay out
>>> everything using stubs, then iteratively relax. The advantage of the
>>> relaxation approach is that the result is always "correct" and so we can
>>> choose to always run only a single iteration of relaxation (or a fixed
>>> number). This avoids a pathological worst-case behavior.
>>>
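A toy model of this relaxation direction (assumed sizes, not lld code): every call starts out routed through a stub, and each pass deletes the stubs whose calls can now reach their targets directly. Because removing stubs only moves sections closer together, stopping after any fixed number of passes still leaves a correct (merely suboptimal) layout.

```python
BRANCH_RANGE = 1 << 25  # assumed +/- 32 MiB direct-branch reach
STUB_SIZE = 16          # assumed size of one stub

def layout_with_stubs(sections, calls):
    """Assign VMAs; a section's still-stubbed calls get stub space
    immediately after it. sections: [(name, size)]."""
    vmas, addr = {}, 0
    for name, size in sections:
        vmas[name] = addr
        addr += size
        addr += STUB_SIZE * sum(1 for c in calls
                                if c["sec"] == name and c["stub"])
    return vmas

def relax(sections, calls, max_passes=1):
    """calls: [{'sec', 'off', 'target', 'stub': True}] -- all stubbed
    initially. Each pass drops stubs that are no longer needed."""
    for _ in range(max_passes):
        vmas = layout_with_stubs(sections, calls)
        changed = False
        for c in calls:
            src = vmas[c["sec"]] + c["off"]
            if c["stub"] and abs(vmas[c["target"]] - src) <= BRANCH_RANGE:
                c["stub"] = False  # branch directly; addresses only shrink
                changed = True
        if not changed:
            break
    return calls, layout_with_stubs(sections, calls)
```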
>>
>> That's an interesting approach; I hadn't thought of that. It might be
>> too pessimistic, though, as it's going to create lots of stubs (one stub
>> per function call).
>>
>
> The first round of relaxation should remove most of the stubs.
>
>
>> My approach can be relaxed to mitigate worst-case behavior by also
>> creating stubs for relocations that only barely reach their targets.
>>
>
> I don't think that will necessarily fix the worst case. A relocation might
> potentially jump over multiple inserted stubs, so "barely" actually has to
> be quite pessimistic to guarantee, say, <50 iterations.
>

I'm not trying to fix the worst case. As long as it converges for sane
inputs in a reasonable amount of time, that's fine. Experiments are needed
to conclude which approach is better.


>
> -- Sean Silva
>
>
>>> -- Sean Silva
>>>>
>>>> On Tue, Oct 20, 2015 at 4:56 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>
>>>>> Hi Rui, Rafael, et al.,
>>>>>
>>>>> In order to move PPC64 support in lld to a point where it can
>>>>> self-host, we need to deal with the following problem:
>>>>>
>>>>> On PPC, a relative branch can only have a signed 24-bit displacement
>>>>> (which is really a 26-bit signed displacement, once the two assumed
>>>>> lower-order bits are tacked on). Thus, the range is limited to about
>>>>> +/- 32 MB, and if there is more code than that, we need to make other
>>>>> arrangements.
>>>>>
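The displacement arithmetic above works out to a reach of just under +/- 32 MiB. A quick sanity check (the encoding details are assumed from the ISA description, not taken from lld's sources):

```python
FIELD_BITS = 24                 # signed field stored in the branch instruction
DISP_BITS = FIELD_BITS + 2      # two low-order bits are implied zero
MAX_FORWARD = (1 << (DISP_BITS - 1)) - 4   # largest encodable forward offset
MAX_BACKWARD = -(1 << (DISP_BITS - 1))     # most negative offset: -32 MiB

def in_branch_range(source, target):
    """Can a PPC relative branch at `source` reach `target` directly?"""
    delta = target - source
    return MAX_BACKWARD <= delta <= MAX_FORWARD and delta % 4 == 0
```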
>>>>> As I understand it, other architectures (AArch64, for example), have
>>>>> similar limitations.
>>>>>
>>>>> Existing linkers handle this situation by inserting branch stubs, and
>>>>> placing the branch stubs close enough to the call sites.
>>>>>
>>>>> Here's a quick example:
>>>>>
>>>>> $ cat main.c
>>>>> void foo();
>>>>> int main() {
>>>>>   foo();
>>>>>   asm(".fill 50000000, 4, 0x60000000"); // lots of nops
>>>>>   return 0;
>>>>> }
>>>>>
>>>>> $ cat foo.c
>>>>> void foo() {}
>>>>>
>>>>> $ gcc -o btest main.c foo.c
>>>>>
>>>>> Now running objdump -d btest shows this relevant bit:
>>>>>
>>>>> 0000000010000500 <0000003a.plt_branch.foo+0>:
>>>>>     10000500:   3d 82 ff ff     addis   r12,r2,-1
>>>>>     10000504:   e9 6c 7f e8     ld      r11,32744(r12)
>>>>>     10000508:   7d 69 03 a6     mtctr   r11
>>>>>     1000050c:   4e 80 04 20     bctr
>>>>>
>>>>> 0000000010000510 <.main>:
>>>>>     10000510:   7c 08 02 a6     mflr    r0
>>>>>     10000514:   f8 01 00 10     std     r0,16(r1)
>>>>>     10000518:   fb e1 ff f8     std     r31,-8(r1)
>>>>>     1000051c:   f8 21 ff 81     stdu    r1,-128(r1)
>>>>>     10000520:   7c 3f 0b 78     mr      r31,r1
>>>>>     10000524:   4b ff ff dd     bl      10000500
>>>>> <0000003a.plt_branch.foo+0>
>>>>>     10000528:   60 00 00 00     nop
>>>>>     1000052c:   60 00 00 00     nop
>>>>>     10000530:   60 00 00 00     nop
>>>>>     10000534:   60 00 00 00     nop
>>>>> ...
>>>>>
>>>>> So it has taken the actual call target address and stuck it in a data
>>>>> section (referenced from the TOC base pointer), and the stub loads the
>>>>> address and jumps there.
>>>>>
>>>>> Currently, lld seems to write each input section that is part of an
>>>>> output section, in order, consecutively into that output section. Dealing
>>>>> properly with long-branch stubs, however, seems to require inserting
>>>>> intervening stub segments in between other .text sections. This affects
>>>>> not only direct calls but also calls into the .plt (since those too need
>>>>> to be in range); alternatively, we need to split (and perhaps duplicate)
>>>>> .plt entries to make sure they're close enough as well.
>>>>>
>>>>> One possible way to do this is:
>>>>>
>>>>>  if (total size < some threshold) {
>>>>>    everything will fit, so do what we do now
>>>>>  } else {
>>>>>    group the input text segments so that each group (including the
>>>>> size of stubs) is below the threshold (we can scan each segment for branch
>>>>> relocations to determine if stubs are necessary)
>>>>>    insert the necessary stub segments after each grouping
>>>>>  }
>>>>>
>>>>> Various heuristics can make the groupings chosen more or less optimal,
>>>>> but perhaps that's another matter.
>>>>>
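The grouping step could be realized with a simple greedy pass. This is a hedged sketch: the threshold and stub size are assumed, and a real heuristic would also consider which sections call which.

```python
BRANCH_RANGE = 1 << 25  # assumed threshold: direct-branch reach
STUB_SIZE = 16          # assumed size of one stub

def group_sections(sections, limit=BRANCH_RANGE):
    """Greedily pack sections into groups such that each group's code plus
    its worst-case trailing stub area fits within `limit`, so every branch
    in a group can at least reach that group's stubs.
    sections: [(name, size, n_branch_relocs)]."""
    groups, cur, cur_size = [], [], 0
    for name, size, n_relocs in sections:
        need = size + STUB_SIZE * n_relocs  # reserve worst-case stub space
        if cur and cur_size + need > limit:
            groups.append(cur)              # close the group; stubs go here
            cur, cur_size = [], 0
        cur.append(name)
        cur_size += need
    if cur:
        groups.append(cur)
    return groups
```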
>>>>> Thoughts?
>>>>>
>>>>> Thanks again,
>>>>> Hal
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Assistant Computational Scientist
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>>>
>>>>