[PATCH] PowerPC support for the ELFv2 ABI (powerpc64le-linux)

Fri Jul 18 16:02:08 PDT 2014

----- Original Message -----
> From: "Ulrich Weigand" <Ulrich.Weigand at de.ibm.com>
> To: "LLVM Commits" <llvm-commits at cs.uiuc.edu>
> Cc: "Hal Finkel" <hfinkel at anl.gov>
> Sent: Monday, July 14, 2014 10:16:40 AM
> Subject: [PATCH] PowerPC support for the ELFv2 ABI (powerpc64le-linux)
> 
> 
> 
> Hello,
> 
> this patch series implements support in LLVM for the PowerPC ELFv2
> ABI.
> Together with a companion patch to clang (posted on cfe-commits),
> this
> makes clang/LLVM fully usable on powerpc64le-linux.  Overall the
> patch
> series passed the following testing (both on powerpc64-linux (ELFv1)
> and
> powerpc64le-linux (ELFv2)):
> - building LLVM & clang, running the regression test suite
> - running projects/test-suite
> - full 3-stage bootstrap of clang
> - GCC ABI compatibility test suite GCC vs. clang  [*]
> 
> [*] There are some failures due to GCC features clang does not
> implement
> (or implements slightly differently than GCC, like
> attribute((aligned)) on
> bit field base types), but those seem platform-independent, and are
> the
> same on ELFv1 and ELFv2.
> 
> I've broken up ELFv2 support into the following pieces:

Hi Uli,

I've commented on each of the patches below...

> 
> 
> - MC support for .abiversion directive
> 
> ELFv2 binaries are marked by a bit in the ELF header e_flags field.
>  A new
> assembler directive .abiversion can be used to set that flag.  This
> patch
> implements support in the PowerPC MC streamers to emit the
> .abiversion
> directive (both into assembler and ELF binary output), as well as
> support
> in the assembler parser to parse the .abiversion directive.
> 
> (See attached file: diff-llvm-elfv2-abiversion)

LGTM.
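
For anyone following along, here is a minimal sketch of what the flag looks like on the reader side. As far as I know from ELF.h, the ABI version occupies the low two bits of e_flags (EF_PPC64_ABI = 3), so it is a small field rather than a single bit. The helper names below are hypothetical and are not taken from the patch.

```cpp
#include <cstdint>

// Assumption: mirrors EF_PPC64_ABI in ELF.h (low two bits of e_flags).
constexpr uint32_t kPPC64AbiMask = 3;

// Hypothetical helper: 0 = unspecified, 1 = ELFv1, 2 = ELFv2.
constexpr unsigned ppc64AbiVersion(uint32_t EFlags) {
  return EFlags & kPPC64AbiMask;
}

// Hypothetical helper: the effect of ".abiversion N" on the header flags.
// Bits outside the ABI field are left untouched.
constexpr uint32_t setPPC64AbiVersion(uint32_t EFlags, unsigned Version) {
  return (EFlags & ~kPPC64AbiMask) | (Version & kPPC64AbiMask);
}
```

A consumer such as the dynamic loader would then test ppc64AbiVersion(EFlags) == 2 to select ELFv2 behavior.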

> 
> 
> - MC support for .localentry directive
> 
> A second binutils feature needed to support ELFv2 is the .localentry
> directive.  In the ELFv2 ABI, functions may have two entry points:
> one for
> calling the routine locally via "bl", and one for calling the
> function via
> function pointer (either at the source level, or implicitly via a PLT
> stub
> for global calls).  The two entry points share a single ELF symbol,
> where
> the ELF symbol address identifies the global entry point address,
> while the
> local entry point is found by adding a delta offset to the symbol
> address.
> That offset is encoded into three platform-specific bits of the ELF
> symbol
> st_other field.
> 
> The .localentry directive instructs the assembler to set those fields
> to
> encode a particular offset.  This is typically used by a function
> prologue
> sequence like this:
> 
> func:
>         addis r2, r12, (.TOC.-func)@ha
>         addi r2, r2, (.TOC.-func)@l
>         .localentry func, .-func
> 
> Note that according to the ABI, when calling the global entry point,
> r12
> must be set to point to the global entry point address itself; while
> when
> calling the local entry point, r2 must be set to point to the TOC
> base.
> The two instructions between the global and local entry point in the
> above
> example translate the first requirement into the second.
> 
> The following patch implements support in the PowerPC MC streamers
> to emit
> the .localentry directive (both into assembler and ELF object
> output), as
> well as support in the assembler parser to parse the .localentry
> directive.
> 
> In addition, there is another change required in MC fixup/relocation
> handling to properly deal with relocations targeting function symbols
> with
> two entry points: When the target function is known local, the MC
> layer
> would immediately handle the fixup by inserting the target address --
> this
> is wrong, since the call may need to go to the local entry point
> instead.
> The GNU assembler handles this case by *not* directly resolving
> fixups
> targeting functions with two entry points, but always emits the
> relocation
> and relies on the linker to handle this case correctly.  This patch
> changes
> LLVM MC to do the same (this is done via the processFixupValue
> routine).
> 
> Similarly, there are cases where the assembler would normally emit a
> relocation, but "simplify" it to a relocation targeting a *section*
> instead
> of the actual symbol.  For the same reason as above, this may be
> wrong when
> the target symbol has two entry points.  The GNU assembler again
> handles
> this case by not performing this simplification in that case, but
> leaving
> the relocation targeting the full symbol, which is then resolved by
> the
> linker.  This patch changes LLVM MC to do the same (the
> needsRelocateWithSymbol routine).   NOTE: the LLVM code is actually
> overly
> pessimistic, since the needsRelocateWithSymbol routine currently does
> not
> have access to the actual target symbol, and thus must always assume
> that
> it might have two entry points.  This can be improved upon by
> modifying
> common code to pass the target symbol when calling
> needsRelocateWithSymbol
> (probably best done as a follow-on patch).

Yes, please provide such a patch.

> 
> (See attached file: diff-llvm-elfv2-localentry)

+static inline int64_t
+PPC64_LOCAL_ENTRY_OFFSET(unsigned Other) {
+  unsigned Val = (Other & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT;
+  return ((1 << Val) >> 2) << 2;
+}
+static inline unsigned
+PPC64_SET_LOCAL_ENTRY_OFFSET(int64_t Offset) {
+  unsigned Val = (Offset >= 4 * 4
+                  ? (Offset >= 8 * 4
+                     ? (Offset >= 16 * 4 ? 6 : 5)
+                     : 4)
+                  : (Offset >= 2 * 4
+                     ? 3
+                     : (Offset >= 1 * 4 ? 2 : 0)));
+  return Val << STO_PPC64_LOCAL_BIT;
+}
+

inline functions are fine, but please follow the LLVM coding convention for naming them (and specifically, they should not look like macros).

Otherwise, LGTM.
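
To illustrate the naming point and the encoding itself, here is a sketch of the same two helpers with LLVM-style names. The function names are only a suggestion (hypothetical), and the constants are written out locally on the assumption that they match STO_PPC64_LOCAL_BIT (5) and STO_PPC64_LOCAL_MASK (7 << 5) from ELF.h. The logic is intended to be equivalent to the hunk above.

```cpp
#include <cstdint>

// Assumption: same values as STO_PPC64_LOCAL_BIT / STO_PPC64_LOCAL_MASK.
constexpr unsigned kLocalBit = 5;
constexpr unsigned kLocalMask = 7u << kLocalBit;

// Hypothetical LLVM-style name for PPC64_LOCAL_ENTRY_OFFSET.
// Field values 0 and 1 decode to offset 0 (no separate local entry point);
// values 2..6 decode to 4, 8, 16, 32 and 64 bytes.
inline int64_t decodePPC64LocalEntryOffset(unsigned Other) {
  unsigned Val = (Other & kLocalMask) >> kLocalBit;
  return ((int64_t(1) << Val) >> 2) << 2;
}

// Hypothetical LLVM-style name for PPC64_SET_LOCAL_ENTRY_OFFSET.
// Rounds down to the nearest encodable offset, as the original does.
inline unsigned encodePPC64LocalEntryOffset(int64_t Offset) {
  unsigned Val = Offset >= 64 ? 6
               : Offset >= 32 ? 5
               : Offset >= 16 ? 4
               : Offset >= 8  ? 3
               : Offset >= 4  ? 2
                              : 0;
  return Val << kLocalBit;
}
```

For the two-instruction prologue in the example (".localentry func, .-func" with an 8-byte delta), the encoded field value is 3, and decoding it gives back 8.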

> 
> 
> - ELFv2 function call changes: two entry points instead of function
> descriptors
> 
> This patch builds upon the two preceding MC changes to implement the
> basic
> ELFv2 function call convention.  In the ELFv1 ABI, a "function
> descriptor"
> was associated with every function, pointing to both the entry
> address and
> the related TOC base (and a static chain pointer for nested
> functions).
> Function pointers would actually refer to that descriptor, and the
> indirect
> call sequence needed to load up both entry address and TOC base.
> 
> In the ELFv2 ABI, there are no more function descriptors, and
> function
> pointers simply refer to the (global) entry point of the function
> code.
> Indirect function calls simply branch to that address, after loading
> it up
> into r12 (as required by the ABI rules for a global entry point).
>  Direct
> function calls continue to just do a "bl" to the target symbol; this
> will
> be resolved by the linker to the local entry point of the target
> function
> if it is local, and to a PLT stub if it is global.  That PLT stub
> would
> then load the (global) entry point address of the final target into
> r12 and
> branch to it.  Note that when performing a local function call, r2
> must be
> set up to point to the current TOC base: if the target ends up local,
> the
> ABI requires that its local entry point is called with r2 set up; if
> the
> target ends up global, the PLT stub requires that r2 is set up.
> 
> This patch implements all LLVM changes to implement that scheme:
> - No longer create a function descriptor when emitting a function
> definition (in EmitFunctionEntryLabel)
> - Emit two entry points *if* the function needs the TOC base (r2)
> anywhere
> (this is done in EmitFunctionBodyStart; note that this cannot be done in
> EmitFunctionEntryLabel because the global entry point prologue code
> must be
> *part* of the function as covered by debug info).
> - In order to make use tracking of r2 (as needed above) work
> correctly,
> mark direct function calls as implicitly using r2.
> - Implement the ELFv2 indirect function call sequence (no function
> descriptors; load target address into r12).
> - When creating an ELFv2 object file, emit the .abiversion 2
> directive to
> tell the linker to create the appropriate version of PLT stubs.
> 
> Note that all this is triggered by a predicate isELFv2ABI.  This is
> currently hard-coded to be true iff the "little-endian 64-bit
> SVR4" (ppc64le) triple is selected.  To be fully compatible with GCC,
> we
> should really implement the -mabi=elfv1 / -mabi=elfv2 option pair and
> support both ELFv1 and ELFv2 on both powerpc64-linux and
> powerpc64le-linux
> targets, with big-endian defaulting to ELFv1 and little-endian
> defaulting
> to ELFv2.  However, since the BE ELFv2 and LE ELFv1 case are only
> theoretical options at this point (there's no library support for
> those in
> any current or planned Linux distribution), I haven't implemented
> this yet.
> It should be straightforward to add this support as a follow-on patch
> by
> just implementing the option machinery and hooking it up to the
> isELFv2ABI
> predicate.
> 
> (See attached file: diff-llvm-elfv2-funcdesc)

+/// EmitFunctionBodyStart - Emit a global entry point prefix for ELFv2.
+void PPCLinuxAsmPrinter::EmitFunctionBodyStart() {
+  if (Subtarget.isELFv2ABI()
+      && !MF->getRegInfo().use_empty(PPC::X2)) {

Please add a comment here explaining this check -- some of the text above would be good ;)

Otherwise, LGTM.

> 
> 
> - ELFv2 stack space reductions
> 
> The ELFv2 ABI reduces the amount of stack required to implement an
> ABI-compliant function call in two ways:
> * the "linkage area" is reduced from 48 bytes to 32 bytes by
> eliminating
> two unused doublewords
> * the 64-byte "parameter save area" is now optional and need not be
> present
> in certain cases
>    (it remains mandatory in functions with variable arguments, and
> functions that have any parameter that is passed on the stack)
> 
> The following patch implements these required changes:
> - reducing the linkage area, and associated relocation of the TOC
> save
> slot, in getLinkageSize / getTOCSaveOffset
>   (this requires updating all callers of these routines to pass in
>   the
> isELFv2ABI flag).
> - (partially) handling the case where the parameter save area is
> optional
> 
> This latter part requires some extra explanation:  Currently, we
> still
> always allocate the parameter save area when *calling* a function.
>  That is
> certainly always compliant with the ABI, but may cause code to
> allocate
> stack unnecessarily.  This can be addressed by a follow-on
> optimization
> patch.
> 
> On the *callee* side, in LowerFormalArguments, we *must* track
> correctly
> whether the ABI guarantees that the caller has allocated the
> parameter save
> area for our use, and the patch does so. However, there is one
> complication: the code that handles incoming "byval" arguments will
> currently *always* write to the parameter save area, because it has
> to
> force incoming register arguments to the stack since it must return
> an
> *address* to implement the byval semantics.  This is already
> inefficient in
> some cases in the ELFv1 ABI, but in the ELFv2 ABI it would be
> actually
> buggy since it would write to the argument save area that the caller
> actually did *not* allocate.
> 
> There are two options to fix this: One would be that the
> LowerFormalArguments code could keep its overall logic, except it
> writes
> arguments to a freshly allocated stack slot on the function's own
> stack
> frame instead of the argument save area in those cases where that
> area is
> not present.  I chose *not* to implement this, since writing
> arguments that
> already fit fully in registers to the stack *is* inefficient.
>  Instead I
> chose the second option: have the front-end pass such arguments in a
> way
> that does *not* use the "byval" scheme in the first place.  This is
> implemented in the diff-llvm-elfv2-aggregates patch below and the
> associated clang patch.  In this patch I simply verify that if there
> is no
> argument save area guaranteed by the ABI, we have no byval arguments,
> and
> report a fatal LLVM ERROR otherwise.   This unfortunately makes the
> platform-independent DebugInfo/2010-10-01-crash.ll case fail since it
> uses
> a byval parameter in a way that is now unsupported.
> 
> (See attached file: diff-llvm-elfv2-stack)

This last part makes me a bit uncomfortable. Clang is not the only frontend, and placing a restriction like that on byval, which is a fairly generic parameter attribute, seems unnecessary. Moreover, the Clang change and the question of whether we support byval seem orthogonal. Can't we keep the Clang change (which is an optimization), and also have a slow-path byval implementation that uses a local stack slot when necessary? I suspect the answer is yes, and if I'm right, please implement that.

+  // area, and parameter passing area.  We start with at least 48/32 bytes,

Please write 48(ELFv1)/32(ELFv2) bytes.

Otherwise, LGTM.
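
For reference, the numbers in the description work out as in the sketch below. These are hypothetical standalone functions covering only the 64-bit SVR4 cases; the real getLinkageSize / getTOCSaveOffset take additional parameters. The offsets 40 (ELFv1) and 24 (ELFv2) follow from the TOC save slot being the last doubleword of the linkage area.

```cpp
// Hypothetical 64-bit-only version of getLinkageSize.
// ELFv1: back chain, CR save, LR save, two reserved doublewords, TOC save.
// ELFv2: back chain, CR save, LR save, TOC save.
constexpr unsigned linkageSize64(bool IsELFv2ABI) {
  return (IsELFv2ABI ? 4 : 6) * 8;
}

// Hypothetical 64-bit-only version of getTOCSaveOffset.
constexpr unsigned tocSaveOffset64(bool IsELFv2ABI) {
  return linkageSize64(IsELFv2ABI) - 8;
}

// Callee-side view: may the function assume that its caller allocated the
// 64-byte parameter save area? Under ELFv1 it always may; under ELFv2 only
// for vararg functions or when some parameter is passed on the stack.
constexpr bool hasParamSaveArea(bool IsELFv2ABI, bool IsVarArg,
                                bool AnyArgOnStack) {
  return !IsELFv2ABI || IsVarArg || AnyArgOnStack;
}

// Caller-side frame minimum while the save area is still always allocated.
constexpr unsigned minCallFrame64(bool IsELFv2ABI) {
  return linkageSize64(IsELFv2ABI) + 8 * 8;
}
```

With these definitions the minimum call frame is 112 bytes for ELFv1 and 96 bytes for ELFv2.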

> 
> 
> - ELFv2 explicit CFI for CR fields
> 
> This is a minor improvement in the ELFv2 ABI.   In ELFv1, DWARF CFI
> would
> represent a saved CR word (holding CR fields CR2, CR3, and CR4) using
> just
> a single CFI record referring to CR2.   In ELFv2 instead, each of the
> CR
> fields is represented by its own CFI record.  The advantage is that
> the
> compiler can now choose to save just a single (or two) CR fields
> instead of
> all of them, if those are the only ones that actually need saving.
>  That
> can lead to more efficient code using mf(o)crf instead of the (slow)
> mfcr
> instruction.
> 
> Note that the following patch does not (yet) implement this more
> efficient
> code generation, but it does implement the part that is required to
> be ABI
> compliant: creating multiple CFI records if multiple CR fields are
> saved.
> 
> (See attached file: diff-llvm-elfv2-crsave)

Please add FIXME near the current code-generation logic (in PPCFrameLowering::emitPrologue?). Otherwise, LGTM.

> 
> 
> - ELFv2 aggregate passing support
> 
> This patch is intended to work together with the clang companion
> patch.
> The LLVM patch provides infrastructure that allows the clang side to
> implement the missing pieces of the ELFv2 ABI relating to aggregates
> passed
> by value.  Specifically, we need to:
> - pass (and return) "homogeneous" floating-point or vector aggregates
> in
> FPRs and VRs (this is similar to the ARM homogeneous aggregate ABI)
> - return aggregates of up to 16 bytes in one or two GPRs
> - pass aggregates that fit fully in registers without using the
> "byval"
> mechanism (see discussion of the diff-llvm-elfv2-stack)
> 
> As infrastructure to enable those changes, this LLVM patch adds
> support for
> passing array types directly.  These can be used by the front-end to
> pass
> aggregate types (coerced to an appropriate array type).  The details
> of the
> array type being used inform the back-end about ABI-relevant
> properties.
> Specifically, the array element type encodes:
> - whether the parameter should be passed in FPRs, VRs, or just
> GPRs/stack
> slots  (for float / vector / integer element types, respectively)
> - what the alignment requirements of the parameter are when passed in
> GPRs/stack slots  (8 for float / 16 for vector / the element type
> size for
> integer element types) -- this corresponds to the "byval align" field
> 
> The following patch uses the
> functionArgumentNeedsConsecutiveRegisters
> callback to encode that special treatment is required for all
> directly-passed array types.  The isInConsecutiveRegs /
> isInConsecutiveRegsLast bits set as a result are then used to
> implement
> the required size and alignment rules in CalculateStackSlotSize /
> CalculateStackSlotAlignment etc.
> 
> As a related change, the ABI routines have to be modified to support
> passing floating-point types in GPRs.  This is necessary because with
> homogeneous aggregates of 4-byte float type we can now run out of
> FPRs
> *before* we run out of the 64-byte argument save area that is
> shadowed by
> GPRs.  Any extra floating-point arguments that no longer fit in FPRs
> must
> now be passed in GPRs until we run out of those too.  Note that there
> was
> already code to pass floating-point arguments in GPRs used with
> vararg
> parameters, which was done by writing the argument out to the
> argument save
> area first and then reloading into GPRs.  The patch re-implements
> this,
> however, in favor of code packing float arguments directly via
> extension/truncation, BITCAST, and BUILD_PAIR operations.  This has
> some
> advantages:
> - we no longer rely on the argument save area being present
> - while the BITCASTs will currently often also result in values being
> written to the stack and then reloaded, this should improve once we
> implement the Power8 GPR<->FPR move instructions.
> 
> The final part of the patch enables up to 8 FPRs and VRs for argument
> return in PPCCallingConv.td; this is required to support returning
> ELFv2
> homogeneous aggregates.  (Note that this doesn't affect other ABIs
> since
> LLVM will only look for which register to use if the parameter is
> marked as
> "direct" return anyway.)
> 
> (See attached file: diff-llvm-elfv2-aggregates)

+      // then the parameter save area.  For now, put all arguments to vararg
+      // routines always in both locations (FPR *and* GPR or stack slot).
+      bool NeedGPROrStack = isVarArg || FPR_idx == NumFPRs;

What does "For now" mean? Does this not make it even more expensive to call a vararg routine, even under ELFv1? If I'm correct about it making things more expensive, please don't make this change for ELFv1.

(I certainly understand that on the P8 this will be better, but as I'm sure you know, for older architectures the only way to do this FPR -> GPR transfer is via memory, which is not fast -- and transferring via GPR means two memory load/store trips instead of one).

Generally speaking, I suppose that Clang will only use array types for ELFv2, but Clang is not the only possible frontend, and I'm concerned that a lot of these changes don't have any kind of ABI version check (especially for things that are more expensive on older architectures for ELFv2 than for ELFv1).

In addition, the comments don't do a good job of explaining which parts of this behavior are specific to ELFv2 and which apply to both ABI versions; this definitely needs to be improved.
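
As an example of the kind of thing the comments could spell out, here is my reading of the element-type mapping from the description, written as a minimal sketch. All names are hypothetical and none of this is code from the patch; it only restates the rule "FPRs / VRs / GPRs or stack slots, with alignment 8 / 16 / element size".

```cpp
// Hypothetical classification of a directly-passed array argument.
enum class ElemKind { Float, Vector, Integer };
enum class RegClass { FPR, VR, GPROrStack };

struct ArrayArgInfo {
  RegClass Regs;   // where the elements are passed first
  unsigned Align;  // alignment in bytes when passed in GPRs/stack slots
};

// ElemSizeInBytes is only consulted for integer element types.
inline ArrayArgInfo classifyArrayArg(ElemKind Kind, unsigned ElemSizeInBytes) {
  switch (Kind) {
  case ElemKind::Float:
    return {RegClass::FPR, 8};
  case ElemKind::Vector:
    return {RegClass::VR, 16};
  case ElemKind::Integer:
    return {RegClass::GPROrStack, ElemSizeInBytes};
  }
  return {RegClass::GPROrStack, ElemSizeInBytes};
}
```

If the real rules differ from this between ELFv1 and ELFv2, that difference is exactly what the comments should state.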

> 
> 
> - ELFv2 dynamic loader support
> 
> This is the final piece of ELFv2 support in LLVM: it enables the new
> ABI in
> the runtime dynamic loader.  The loader has to implement the
> following
> features:
> - In the ELFv2 ABI, do not look up a function descriptor in .opd, but
> instead use the local entry point when resolving a direct call.
> - Update the TOC restore code to use the new TOC slot linkage area
> offset.
> - Create PLT stubs appropriate for the ELFv2 ABI.
> 
> Note that this patch also adds common-code changes. These are
> necessary
> because the loader must check the newly added ELF flags: the e_flags
> header
> bits encoding the ABI version, and the st_other symbol table entry
> bits
> encoding the local entry point offset.  There is currently no way to
> access
> these, so I've added ObjectFile::getPlatformFlags and
> SymbolRef::getOther
> accessors.
> 
> (See attached file: diff-llvm-elfv2-dyld)

+          Value.Addend += ELF::PPC64_LOCAL_ENTRY_OFFSET (SymOther);

Remove the space after OFFSET (and fix the function name as noted earlier).

Otherwise, LGTM.

Thanks again,
Hal

> 
> 
> I'd appreciate any review of the patch series!   I'm aware it's a lot
> of
> code, but I'd really like to see clang/LLVM usable out-of-the-box on
> powerpc64le-linux soon (hopefully even in 3.5)!
> 
> 
> Mit freundlichen Gruessen / Best Regards
> 
> Ulrich Weigand
> 
> --
>   Dr. Ulrich Weigand | Phone: +49-7031/16-3727
>   STSM, GNU/Linux compilers and toolchain
>   IBM Deutschland Research & Development GmbH
>   Vorsitzende des Aufsichtsrats: Martina Koederitz |
>   Geschäftsführung: Dirk
> Wittkopp
>   Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
> Stuttgart, HRB 243294

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory