[lld] [lld] Add target support for SystemZ (s390x) (PR #75643)

Tue Feb 13 11:00:12 PST 2024

MaskRay wrote:

> > I guess there are reasons that access registers were not extended to 64-bit.
> 
> Well, access registers have an architected purpose and are supposed to hold an "access-list-entry token", a 32-bit value, so there's no reason to extend access registers beyond that. It's just that we're not using the access-list mechanism on Linux at all, and therefore can use access registers to hold arbitrary values. That's similar to how a segment register is used on x86 for TLS, even though Linux doesn't really use segments ...

Thanks for the additional context.

> > > This was a deliberate decision to simplify relaxation: after relaxation, we need to add the TP anyway, so if we add TP before relaxation as well, that part doesn't need to be rewritten by the linker.
> > 
> > 
> > This is indeed an interesting design that the linker only needs to patch one instruction, instead of four for PPC64. It still seems preferable to include the TP value and patch `lgf %r2, 0(%r2,%r1)` to a NOP, like PPC. The downside will be one more relocation.
> 
> No, the LGF would still be needed - note that this instruction loads the _value_ of the thread-local variable, not its address. The address would be given by %r2 + %r1; this addition is never explicitly performed, but done implicitly by using %r2 and %r1 as index and base registers for a memory access. If `__tls_get_offset` were to return the full address, we'd still need an LGF, it just could use only a base register: `lgf %r2, 0(%r2)`. But there's no performance difference whether an index is used or not.

You are right. If `__tls_get_offset` were to return the full address, we'd still need an LGF, but the code sequence can omit `TP` computation (3 instructions).
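
To make the trade-off concrete, here is a rough C model of the two pieces discussed above: assembling the 64-bit TP from the two 32-bit access registers (the usual three-instruction `ear`/`sllg`/`ear` sequence) and the implicit base + index + displacement addition that `lgf %r2, 0(%r2,%r1)` performs. The function names and sample values are made up for illustration; this is only a sketch of the arithmetic, not code from lld or glibc.

```c
#include <stdint.h>

/* Hypothetical helper: models "ear %r1,%a0; sllg %r1,%r1,32; ear %r1,%a1".
 * %a0 supplies the high 32 bits of the thread pointer, %a1 the low 32 bits. */
static uint64_t thread_pointer(uint32_t a0, uint32_t a1) {
    return ((uint64_t)a0 << 32) | (uint64_t)a1;
}

/* Hypothetical helper: models D(X,B) address generation. The sum is formed
 * as part of the memory access, so no separate add instruction is needed.
 * Unsigned arithmetic wraps, so a negative offset stored in `index`
 * (as a two's-complement value) subtracts from the base as expected. */
static uint64_t effective_address(uint64_t base, uint64_t index, uint32_t disp) {
    return base + index + (uint64_t)disp;
}
```

With the current design, the value returned by `__tls_get_offset` plays the role of `index` and the TP the role of `base`. If the function returned the full address instead, `thread_pointer` would not be needed at the call site, but the load itself would remain.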

> > > Also, because now the addition can be emitted by the _compiler_ instead of the linker during relaxation, it will usually be folded for free into base+index address generation.
> > 
> > 
> > Seems so for local-dynamic. For general-dynamic, this folding seems to not kick in unless general-dynamic to local-dynamic compiler optimization also applies?
> 
> Actually, it's only local-dynamic where the addition is explicitly performed (the `la %r2, 0(%r2, %r4)` in your example).

https://github.com/llvm/llvm-project/pull/75643