[lld] Add target support for SystemZ (s390x) (PR #75643)

Fangrui Song via llvm-commits llvm-commits at lists.llvm.org
Mon Feb 12 16:51:44 PST 2024


MaskRay wrote:

> Thanks for the thorough review - I think everything should be addressed with the latest patch.
> 
> Thanks also for your comments on our ABI at http://maskray.me/blog/2024-02-11-toolchain-notes-on-z-architecture - you found a number of issues we'd indeed like to change if that were possible without breaking the ABI ...
> 
> Some comments on your blog post:
>
> > r6 being non-volatile for argument storage seems uncommon compared to other architectures.
> 
> This is an unfortunate historical accident. If we ever define a new ABI, that would be the first thing to change.
> 
> > Only 4 registers are used for integer argument storage, which is inadequate.
> 
> That would be the second thing to change. At least r7 should be volatile and an argument register as well.
> 
> > It is unclear why r0 and r1 are not used.
> 
> We need at least one and sometimes two scratch registers during the call sequence (e.g. within a PLT stub). The ABI reserves r0 (which is limited in usability anyway, as it cannot be used to form addresses) and r1 for that purpose.
> 
> > The stack alignment is 8 bytes. Most 64-bit architectures employ 16-byte alignment.
> 
> Agreed, this is also something we would like to change.

I appreciate your comments!
It's true that complaining is often easier than creating.
The many technical hurdles overcome when designing an ABI 20+ years ago aren't always apparent from the outside...
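
As an aside, the r0/r1 reservation mentioned above is directly visible in the shape of a PLT stub, which needs exactly one scratch register. A minimal sketch (`func` is a hypothetical symbol):

```asm
larl  %r1,func@GOTENT    # %r1 = address of func's GOT slot (PC-relative)
lg    %r1,0(%r1)         # load the resolved target address
br    %r1                # tail-jump, clobbering only %r1
```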

> > s390x: 64-bit thread pointer split across a0 and a1, both still 32-bit.
> 
> Back when we designed the 64-bit TLS ABI, we thought this would still be preferable to reserving one of the 64-bit GPRs as thread pointer. (It's an easier choice for POWER which has 32 GPRs anyway ...) It is still debatable which choice would have been better - in code bases where the thread pointer needs to be reloaded from ARs frequently, the current logic is indeed somewhat inefficient.

I agree that using a non-GPR for the thread pointer is better than using a GPR when there are only 16 GPRs.
The instruction formats had already fixed the number of registers at 16.

I guess there are reasons that access registers were not extended to 64 bits.
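
For reference, materializing the 64-bit thread pointer from the two 32-bit access registers takes three instructions (a sketch of the usual compiler-emitted sequence; the target register is arbitrary):

```asm
ear   %r1,%a0        # bits 32-63 of %r1 = a0 (high half of the TP)
sllg  %r1,%r1,32     # shift it into the high half
ear   %r1,%a1        # bits 32-63 of %r1 = a1 (low half of the TP)
```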

> > Inefficient tls_index argument (similar to AArch32): This requires an extra lookup in .data.rel.ro.
> 
> We modeled the 64-bit TLS ABI closely after the 32-bit version, which in turn was modeled after the x86 version. Other platforms used the 32-bit -> 64-bit transition to make TLS improvements, which we neglected to do. (In part because the initial 64-bit ISA didn't yet have some of the instructions we're now used to, like LGRL or LGFI ...)

Understood. These nice instructions appear to come from the extended-immediate and general-instructions-extension facilities, introduced in September 2005 and February 2008, respectively.
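
A sketch of what those facilities buy (`.LC0` and the constants are made up for illustration); before them, such values had to come from a base-register-addressed literal pool:

```asm
.LC0:   .quad 0x123456789a       # an 8-byte constant in the literal pool
        ...
        lgfi  %r2,123456         # extended-immediate (2005): direct 32-bit signed immediate
        lgrl  %r2,.LC0           # general-instructions extension (2008): PC-relative load
```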

> > Unnecessary use of _GLOBAL_OFFSET_TABLE_: PC-relative addressing is available. Instead of loading a@TLSGD and then adding _GLOBAL_OFFSET_TABLE_, it is easier to just load the GOT entry address directly.
> 
> Back in 32-bit days we didn't have LARL, so using the offset allowed us to avoid relocations in position-independent code.
> 
> > Redundant argument: __tls_get_offset takes the GOT offset instead of the direct GOT entry address.
> 
> This was the most obvious blunder when moving to 64-bit, yes. In the 32-bit ABI, nearly every function in PIC code had set up r12 anyway, so this was not really causing any overhead. In (modern) 64-bit code, nothing uses r12 anymore, except for this TLS sequence (because of the old __tls_get_offset ABI).

Yes. It seems that most 32-bit architectures designed before 2000 struggled with `_GLOBAL_OFFSET_TABLE_`...
The nice facilities that addressed the inconvenience were late additions.
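
For instance, once LARL exists, the GOT pointer no longer needs the old `bras`-plus-literal-pool dance; one PC-relative instruction suffices (a sketch):

```asm
larl  %r12,_GLOBAL_OFFSET_TABLE_   # GOT base in a single instruction
```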

> > Indirect return value: Instead of returning the final TLS symbol address directly, __tls_get_offset only provides an offset, requiring an extra instruction for addition with the TP.
> 
> This was a deliberate decision to simplify relaxation: after relaxation, we need to add the TP anyway, so if we add TP before relaxation as well, that part doesn't need to be rewritten by the linker.

This is indeed an interesting design: the linker only needs to patch one instruction, instead of four for PPC64.
It still seems preferable to have the return value include the TP and to patch `lgf %r2, 0(%r2,%r1)` to a NOP, like PPC.
The downside would be one more relocation.
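
To make the trade-off concrete, here is roughly what the general-dynamic sequence looks like per the ABI supplement (a sketch; `foo` is a placeholder symbol):

```asm
        ear   %r7,%a0                    # build the 64-bit TP in %r7
        sllg  %r7,%r7,32
        ear   %r7,%a1
        larl  %r12,_GLOBAL_OFFSET_TABLE_ # GOT pointer for __tls_get_offset
        lgrl  %r2,.LC0                   # GOT offset of foo's tls_index
        brasl %r14,__tls_get_offset@plt:tls_gdcall:foo
        la    %r2,0(%r2,%r7)             # compiler-emitted TP addition
        ...
.LC0:   .quad foo@tlsgd                  # R_390_TLS_GD64
```

With this shape, GD-to-LE relaxation rewrites only the `brasl` (into the 6-byte nop `brcl 0,.`) and resolves the `.quad` to the TP offset; the compiler-emitted `la` stays untouched.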

> Also, because now the addition can be emitted by the _compiler_ instead of the linker during relaxation, it will usually be folded for free into base+index address generation.

Seems so for local-dynamic. For general-dynamic, this folding seems not to kick in unless the general-dynamic to local-dynamic compiler optimization also applies?
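
Concretely, the folding in question: with the TP in, say, %r7 and the TP-relative offset returned in %r2, the addition can fold into a single base+index memory operand (a sketch):

```asm
lg   %r0,0(%r2,%r7)    # load the TLS variable: index %r2 (offset) + base %r7 (TP)
```

instead of materializing the address first:

```asm
la   %r2,0(%r2,%r7)    # address = TP + offset
lg   %r0,0(%r2)        # then load
```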


https://github.com/llvm/llvm-project/pull/75643

