[PATCH] D148701: [LLD][ELF][AArch64] Add AArch64 short range thunk support

Thu Apr 20 02:29:23 PDT 2023

peter.smith added a comment.

In D148701#4281776 <https://reviews.llvm.org/D148701#4281776>, @MaskRay wrote:

> Thanks for the patch. AArch64ADRPThunk and AArch64ABSLongThunk duplicate `writeTo` and `getMayUseShortThunk` now. Shall we define a base class for the two classes to share code?

That is possible, as the short branch code will be the same in both cases. I had judged that with only two classes the duplication might end up simpler than a base class, whereas with three or more it would not. I can certainly change that.

>> This makes it suitable for use a short range thunk in the same way as short thunks are implemented in Arm and PPC.
>
> Is Arm ambiguous here? AArch32 and PPC64?

I meant that this uses the same strategy, a single direct branch, as the ARM/Thumb (AArch32) thunks and, I think, PPC64.

> ---
>
> Is there any analysis how frequently this short thunk mode is going to trigger?

The vast majority of user-space AArch64 programs need no range-extension thunks at all, as the executable segment is contiguous and smaller than 128 MiB. We have seen a small number of programs in the 128 MiB to 256 MiB range, such as a fully instrumented Chromium build; some Haskell programs also get this large naturally. For programs in this size range with a contiguous text segment, I'd expect short thunks to replace the larger ones.

I would not expect it to trigger for linker scripts that split the code into disjoint OutputSections.

> We need a thunk for `b far` or `bl far`. By using a `b` instruction, we can reach from +-128MiB at the thunk section location (instead of the original call site).
> This does not guarantee a 256MiB range without indirect branches, as we don't necessarily place the thunk 128MiB from the call site.
> This is in part because `ThunkCreator::getThunk` picks the first available thunk, not the best one.
>
> Let's use `aarch64-call26-thunk.s` as an example. If I change `Inputs/abs.s` to use `big = 0x8210120` (shorter addresses don't need a thunk), I'll get a short range thunk. If I use `big = 0x8210124` or higher, I'll get a long range thunk.

The way the code is written today is not optimal for short thunks, but it can work reasonably well for large contiguous OutputSections such as .text. The initial pools are spaced at roughly branch-range intervals, so in a 256 MiB .text OutputSection there will be two pools, one of them roughly central. Callers at the start of the .text section branching to a destination near the end can nearly double their range, although the closer a caller gets to a pool, the lower the benefit of range extension.

Arm's proprietary linker, armlink, has a much more complicated thunk assignment algorithm. Essentially, it works out the valid address range for thunk insertion from the callers' addresses and the destination address, and uses the mid-point of that range. This has drawbacks: while the address ranges are close to continuous, the valid insertion points between sections are not, so the only valid insertion points can end up being in the middle of a section, which requires special-case code. All of this is possible, but it adds quite a lot of complexity. It also makes the binary layout a lot messier and harder to predict, as thunks are scattered around the image.

I'll send an update with a base class. My summary is that I think this will help programs of 128 MiB to 256 MiB with a single .text OutputSection, which is most likely only a small handful of programs to date. Short thunks are much more useful on Arm (Thumb), as many programs are larger than 16 MiB.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148701/new/

https://reviews.llvm.org/D148701