[llvm-dev] RFC: LLD range extension thunks

Wed Jan 4 10:34:47 PST 2017

I'm about to start working on range extension thunks in lld. This is
an attempt to summarize the approach I'd like to take and what the
impact will be on lld outside of thunks. I'd like to know if anyone has
any constraints the approach will break, has alternative suggestions, or
is working on something I'll need to take account of.

I expect range extension thunks to be important for ARM and useful for
AArch64. In principle any target with a limited range branch immediate
instruction could benefit from them.

I've put some detail about range extension thunks at the end of this
message for those not familiar with them. The name range extension
thunks is by no means universal; for example, ARM's ABI documentation
calls them veneers and the GNU linker calls them stubs.

Summary of existing thunk implementation (ARM interworking and Mips
PIC to non-PIC calls):
- A Regular, Shared or Undefined symbol may have a single thunk
- For each relocation to a symbol S, if we need a thunk we use the
thunk for S as the target of the relocation rather than S. The thunk
will transfer control to S.
- Thunks are assigned to an InputSection and are written out when
the InputSection is written. The InputSection with the Thunk contains
either the caller (ARM) or callee (Mips).
- For all existing thunks, the decision of whether a thunk is needed
is not dependent on address. A Thumb branch to ARM will always need a
thunk, no matter the distance. Thunks can therefore be generated
relatively early in Writer::run().

High level impact of range extension thunks:

There may be more than one thunk per callee:
- A range extension thunk must be placed within range of the caller.
There may be cases where no single thunk for a callee is in range of
all callers.
- An ARM Target may need a different Thunk for ARM and Thumb callers.

Address information is needed to determine if a range extension thunk is needed:
- The more precise the address information available, the fewer thunks
will be generated. The most precise case is when the final addresses of
both caller and callee are known at thunk creation time; the least
precise is when neither address is known.
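
As an illustration of why addresses matter, here is a minimal sketch in
Python (not lld code) of the range check for an ARM state BL. The
function name and the slack parameter are hypothetical. The slack models
imprecise addresses: the less certain the estimate, the larger the
margin that has to be subtracted from the usable range, and the more
thunks get created.

```python
# Hypothetical helper, not part of lld: decide whether an ARM state BL at
# address `caller` can reach `callee` without a range extension thunk.
ARM_PC_BIAS = 8              # in ARM state the PC reads as instruction address + 8
ARM_BL_MIN = -(1 << 25)      # signed 24-bit immediate, shifted left by 2
ARM_BL_MAX = (1 << 25) - 4


def needs_range_thunk(caller: int, callee: int, slack: int = 0) -> bool:
    """Return True if the branch offset does not fit in the BL immediate.

    `slack` (an assumption of this sketch) shrinks the usable range to
    allow for addresses that are only estimates; 0 means final addresses.
    """
    offset = callee - (caller + ARM_PC_BIAS)
    return not (ARM_BL_MIN + slack <= offset <= ARM_BL_MAX - slack)
```

With final addresses (slack of 0) a callee just inside the limit needs
no thunk, while the same pair with a non-zero slack is conservatively
given one.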

Range extension thunks can be combined with or replace other thunks:
- Thunks may also be used for instruction set interworking (ARM) or
for calling between position independent and non-position independent
code (Mips). Either a chain of thunks or a combined thunk that does
both operations is needed. For ARM all range extension thunks can
trivially be interworking thunks as well.

Range extension thunk placement can be important
- Many callers may need a range extension. Placing a range extension
thunk so that it is in range of the most callers minimizes the number of
thunks needed.
- Thunks may be better as synthetic sections rather than as additions
to input sections.
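
To show how placement affects the thunk count, here is a minimal Python
sketch of one possible greedy strategy for the callers of a single
callee. It is an illustration only, not a proposal for lld: the function
name is hypothetical, the branch range is treated as symmetric, and it
assumes a thunk may be placed at any address, which a real linker cannot
do because thunks have to sit between sections.

```python
def place_thunks(caller_addresses, branch_range):
    """Greedy placement sketch (hypothetical helper).

    Walk the callers in address order. Whenever a caller cannot reach the
    most recently placed thunk, place a new thunk as far forward as that
    caller can reach, so it also serves as many later callers as possible.
    """
    placements = []
    for caller in sorted(caller_addresses):
        if placements and abs(placements[-1] - caller) <= branch_range:
            continue  # an existing thunk is already in range
        placements.append(caller + branch_range)
    return placements
```

For example, with a range of 100, callers at 0, 10 and 150 can share one
thunk at 100, whereas giving each caller its own thunk would need three.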

Adding/removing content must not break the range calculations used in
range extension thunks.
- If any caller, callee or thunk address is changed after range
extension thunks are calculated it could invalidate the range
calculation.
- Ideally range extension thunks are the last operation the linker
does prior to resolving relocations.

I think that there are two separate areas to a range extension thunk
implementation that can be considered separately.
1.) Moving thunk generation to a later stage. At a minimum we need an
estimate of the address of each caller and callee; in an ideal world
we know the final address of each caller and callee. This could mean
assigning section addresses multiple times.
2.) The alterations to the core data structures to permit more than
one Thunk per symbol and the logic to select the "right" Thunk for
each relocation.
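
The second area can be sketched roughly as follows. This is a minimal
Python model, not a design for the lld data structures: the class name,
method names and the symmetric BRANCH_RANGE constant are all assumptions
made for illustration. Each symbol maps to a list of thunk addresses, and
selecting the "right" thunk for a relocation means finding one that the
caller can reach.

```python
BRANCH_RANGE = 32 * 1024 * 1024  # simplified, symmetric range (assumption)


class ThunkTable:
    """Hypothetical table allowing more than one thunk per symbol."""

    def __init__(self):
        self.thunks = {}  # symbol name -> list of thunk addresses

    def add(self, symbol, address):
        self.thunks.setdefault(symbol, []).append(address)

    def select(self, symbol, caller):
        """Return the address of a thunk for `symbol` that is in range of
        `caller`, or None if a new thunk would have to be created."""
        for address in self.thunks.get(symbol, ()):
            if abs(address - caller) < BRANCH_RANGE:
                return address
        return None
```

A target such as ARM, where ARM and Thumb callers may need different
thunks, could extend the key to include the caller's instruction set.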

The design I'd like to aim at moves thunk creation into
finalizeSections() at a point where the sizes and addresses of all the
SyntheticSections are known. This would mean that the final address of
each caller and callee could be used, and after thunk creation there
would be no further content changes. This would mean:
- Any offset in an OutputSection that is calculated by code running
prior to thunk creation may be altered by the addition of thunks. In
particular
scanRelocs() calculates the offset in the OutputSection of some
relocations. We would need to find alternative ways of handling these
cases so that they could either survive thunk creation or be patched
up afterwards.
- assignAddresses() will need to run at least twice if thunks are
created. At least once to give the thunk creation the caller and
callee addresses, and at least once after all thunks have been
created.
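
The need for repeated address assignment can be shown with a small
Python model. Everything here is a simplification for illustration and
none of it is lld code: the function names only echo assignAddresses(),
the range is a tiny made-up value, distances are measured between
section start addresses, and a thunk is simply appended to the caller's
section. Because adding a thunk grows a section, a call that was in
range on one pass can be out of range on the next, so the loop repeats
until a pass adds nothing.

```python
BRANCH_RANGE = 64  # tiny made-up range so the numbers are easy to follow
THUNK_SIZE = 12    # made-up thunk size in bytes


def assign_addresses(sections):
    """Lay sections out back to back, including any thunks they hold."""
    address = 0
    for section in sections:
        section["address"] = address
        address += section["size"] + THUNK_SIZE * len(section["thunks"])


def layout_with_thunks(sections, calls):
    """Repeat address assignment and thunk creation until no new thunk is
    added. Returns the number of address assignment passes performed."""
    passes = 0
    while True:
        assign_addresses(sections)
        passes += 1
        added = False
        for caller, callee in calls:
            src, dst = sections[caller], sections[callee]
            if callee in src["thunks"]:
                continue  # this caller section already has a thunk
            if abs(dst["address"] - src["address"]) > BRANCH_RANGE:
                src["thunks"].append(callee)
                added = True
        if not added:
            return passes  # the last pass produced the final addresses
```

The loop terminates because thunks are only ever added and there is at
most one per caller section and callee.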

There is an alternative design that only uses estimates of caller and
callee address to decide if a thunk is needed. In effect we use a
heuristic to predict how much extra synthetic content, such as plt and
got size, will be added after Thunk creation when deciding if a Thunk
is needed. I'm not in favour of this approach as, from bitter
experience, it tends to result in hard-to-debug problems when the
heuristics break down. Precise addresses would also allow errata
patching thunks [*].

I've not thought too hard about how to alter the core data structures
yet. I think this will mostly be implementation detail though.

Next steps:
I'd like to proceed with the following plan:
1.) Move the existing thunk implementation to where it would need to
be in finalizeSections(). This should flush out all the non-thunk
related assumptions about addresses without adding any additional
complexity to the Thunk implementation.
2.) Add support for multiple thunks per symbol
3.) If it turns out to be a good idea, implement thunks as SyntheticSections
4.) Add support for range extensions.

I think the first implementation of range extension thunks should be
simple and not try too hard to minimize the number of thunks needed.
If there is a need to optimize it can be done later as the changes
should be within the thunk creation module.

Thanks for reading

Peter

The remainder of the message is a brief explanation of range extension
and errata patching thunks.

What are range extension thunks?
Many architectures have branch instructions that have a finite range
that is insufficient to reach all possible program locations. For
example the ARM branch immediate instruction has an immediate that
encodes an offset of +/-32MB from the branch instruction. A range
extension thunk is a linker generated code sequence, inserted between
the caller and the callee, that completes the transfer of control to
the callee when the distance between the caller and callee exceeds the
range of the branch instruction. A simple example in pseudo assembly
for a non-position independent ARM function call:

source:
    BL long_range_thunk_to_target
    ...
long_range_thunk_to_target:
    LDR r12, target_address ; r12 (ip) is the corruptible
                            ; intra-procedure-call scratch register
    BX r12
target_address:
    .word target
    ...
target:
    ...
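
For concreteness, the thunk above could be emitted as bytes roughly like
this. This is a minimal Python sketch, not how lld writes thunks: the
function name is hypothetical and it assumes a little-endian ARM state
thunk with the literal word placed directly after the BX, so that the
load can use a PC-relative offset of zero.

```python
import struct

# ARM (A32) instruction encodings for the two instructions in the thunk.
LDR_IP_PC = 0xE59FC000  # ldr r12, [pc]  (PC reads as this instruction + 8,
                        #                 which is the literal word below)
BX_IP = 0xE12FFF1C      # bx r12


def encode_arm_abs_thunk(target: int) -> bytes:
    """Return the 12 bytes of a non-position independent thunk that
    transfers control to the absolute address `target`."""
    return struct.pack("<III", LDR_IP_PC, BX_IP, target & 0xFFFFFFFF)
```

Because the transfer is done with BX, bit 0 of the literal word selects
the instruction set of the callee, which is why such a thunk can double
as an interworking thunk.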

What is an errata patching thunk?
Some CPU errata (hardware bugs) can be fixed at link time by
replacing an instruction with a branch to a sequence of equivalent
instructions that are guaranteed not to trigger the erratum. In
some cases the trigger sequence is dependent on precise addresses such
as immediates crossing page boundaries, for example
https://sourceware.org/ml/binutils-cvs/2015-04/msg00012.html . Errata
patching is out of the scope of implementing range extension thunks
but can be seen as a generalization of it.