[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?

Fangrui Song via llvm-dev llvm-dev at lists.llvm.org
Mon Mar 2 11:02:47 PST 2020


On 2020-02-28, Rui Ueyama wrote:
>On Fri, Feb 28, 2020 at 11:34 AM Fangrui Song <maskray at google.com> wrote:
>
>> I met with the Propeller team today (we work for the same company but it
>> was my first time meeting two members on the team:) ).
>> One thing I was reassured about:
>>
>> * There is no general disassembly work. General
>> disassembly work would assuredly frighten off developers. (It is
>> inherently unreliable, memory heavy, and difficult to reconcile with
>> CFI, debug information, etc.)
>>
>> A minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
>> acceptable: locating the jump relocation, detecting the jump type,
>> inverting the direction of a jump, and deleting trailing bytes of an
>> input section. Existing linker relaxation schemes already do similar
>> things. Deleting a trailing jump is similar to RISC-V, where sections
>> can shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX
>> are what I have in mind). (binutils supports deleting bytes for a few
>> other architectures, e.g. msp430, sh, mips, ft32, rl78.) With just
>> this minimal amount of work, the framework should conceptually not be
>> too hard to port to another target.
>>
>> One thing I was not aware of (perhaps the description did not make it
>> clear) is that Propeller intends to **reorder basic block sections
>> across translation units**. This is something that full LTO can do
>> while ThinLTO cannot. Our internal systems cannot afford a full LTO
>> (**Can we fix the bottleneck of full LTO** [1]?) for large
>> executables, and I believe some other users are in the same camp.
>>
>> Now, with ThinLTO, the post-link optimization scheme will inevitably
>> require help from the linker/compiler. It seems we have two routes:
>>
>> ## Route 1: Current Propeller framework
>>
>> lld does whole-program reordering of basic block sections. We can
>> extend it in the future to overalign some sections and pad gaps with
>> NOPs. What else can we do? Source code/IR/MCInst is lost at this
>> stage. Without general disassembly work, it may be difficult to do
>> more optimization.
>>
>> This makes me concerned about another thing: Intel's Jump Conditional
>> Code Erratum.
>>
>> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>>
>> To put it in the simplest way, a Jcc instruction whose address ≡ 30 or
>> 31 (mod 32) should be avoided. There are assembler-level (MC)
>> mitigations (function sections are overaligned to 32), but because we
>> use basic block sections (sh_addralign < 32) and need reordering, we
>> have to redo some work at the linking stage.
>>
>> After losing the representation of MCInst, it is not clear to me how we can
>> insert NOPs/segment override prefixes without doing disassembly work in
>> the linker.
>>
>
>I'm not sure how the basic-block sections feature makes it hard to
>implement a mitigation for that specific JCC erratum. I may be missing
>something, but don't BB sections actually make it easier to implement,
>since a Jcc occurs only at the end of each basic block, and with BB
>sections we know what the ending instruction of each block is? I mean,
>when we are reordering sections, if we find that some BB's Jcc is not
>at a desired address, we can add padding before that BB.

The loss of MachineInstr/MCInst (what we have are ELF object files) and
refraining from disassembly make it hard to implement the Intel JCC
erratum mitigation.

Inserting padding can increase the distance of JCC_1/JMP_1 instructions,
so a JCC_1/JMP_1 may need to be relaxed to JCC_4/JMP_4.

jb 1f; nop; 1:                # 72 01             jb 0x3
jb 2f; .space 0x7f, 0x90; 2:  # 72 7f             jb 0x84
jb 3f; .space 0x80, 0x90; 3:  # 0f 82 80 00 00 00 jb 0x10a
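
The relaxation check behind these encodings can be sketched in a few
lines (an illustrative model, not lld's actual code). A rel8
displacement is measured from the end of the jump instruction, so a
2-byte JCC_1 only reaches targets within [-128, 127] of the next
instruction:

```python
def needs_relaxation(jcc_addr: int, target: int, short_len: int = 2) -> bool:
    """Decide whether a short Jcc (rel8) must be relaxed to the rel32 form.

    The rel8 displacement is relative to the end of the jump instruction.
    """
    disp = target - (jcc_addr + short_len)
    return not (-128 <= disp <= 127)

# The three cases from the assembly snippet above:
assert not needs_relaxation(0x0, 0x3)    # disp 1: stays JCC_1
assert not needs_relaxation(0x3, 0x84)   # disp 0x7f: just fits in rel8
assert needs_relaxation(0x84, 0x10a)     # disp 0x84 > 127: relaxed to JCC_4
```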

Without disassembly, we can only add NOPs, not the superior segment
override prefixes. Note also that x86 has several instructions,
documented in the Intel SDM ("Table 24-3. Format of Interruptibility
State"), after which interrupts are inhibited until exactly one
instruction later (e.g. STI, or a write to the SS segment register).
Inserting a NOP after such an instruction allows an interrupt to arrive
before the execution of the following instruction, which changes
semantic behavior.

   # NOP before jmp can change semantic behavior.
   sti; jmp baz
   movl %esi, %ss; jmp baz
   movw (%rsi), %ss; jmp baz
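
Under the simplified rule above (avoid a Jcc starting at offset 30 or 31
mod 32), the NOP padding a linker would need before a 2-byte Jcc can be
computed as below. This is only a sketch of the alignment arithmetic:
the real erratum also covers fused compare-and-branch sequences and
longer jump encodings, and, as just noted, padding is unsafe right after
STI or a move to %ss.

```python
def jcc_padding(addr: int) -> int:
    """NOP bytes needed before a 2-byte Jcc at `addr` so that it does not
    start at offset 30 or 31 within a 32-byte window (simplified model of
    the JCC erratum mitigation)."""
    off = addr % 32
    return 32 - off if off >= 30 else 0

assert jcc_padding(0x1d) == 0  # offset 29: allowed under the mod-32 rule
assert jcc_padding(0x1e) == 2  # offset 30: pad to the next 32-byte boundary
assert jcc_padding(0x3f) == 1  # offset 31: one NOP suffices
```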

Well, we may go the MSP430 route: always generate the maximum-range
branch instructions and rely on the linker to relax them. This is also
what RISC-V does. (I mentioned both MSP430 and RISC-V in my previous
message :) ) There are many challenges here.
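
That scheme can be modeled with a tiny fixed-point pass (purely
illustrative; the item encoding and the 6/2-byte sizes are made up):
start every branch at its maximum-range encoding, then repeatedly shrink
branches whose targets fit in rel8 range, since shrinking one branch can
pull other targets into range.

```python
LONG, SHORT = 6, 2  # hypothetical rel32 / rel8 branch encodings, in bytes

def relax(items):
    """items: list of ('branch', label), ('data', size), or ('label', name).
    Returns (label addresses, per-branch sizes) after iterating to a
    fixed point: shrinking one branch can bring another target into
    rel8 range, so a single pass is not enough in general."""
    size = {i: LONG for i, it in enumerate(items) if it[0] == 'branch'}
    while True:
        addr, labels, pos = {}, {}, 0
        for i, it in enumerate(items):       # lay out with current sizes
            addr[i] = pos
            if it[0] == 'branch':
                pos += size[i]
            elif it[0] == 'data':
                pos += it[1]
            else:
                labels[it[1]] = pos
        changed = False
        for i, it in enumerate(items):       # try to shrink long branches
            if it[0] == 'branch' and size[i] == LONG:
                disp = labels[it[1]] - (addr[i] + SHORT)
                if -128 <= disp <= 127:
                    size[i], changed = SHORT, True
        if not changed:
            return labels, size

labels, sizes = relax([('branch', 'end'), ('data', 120), ('label', 'end')])
assert sizes[0] == SHORT and labels['end'] == 122  # shrunk from 6 to 2 bytes
```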


Coming back to the topic: we really have two routes.

a) Start from ELF object files, add sufficient metadata that supports post-link optimization
b) Start from machine level representations (MIR-like), remove unneeded details to make whole-program post-link optimization affordable.

Our current route is a). To avoid creating JCC_1/JMP_1 instructions that
are close to the jump distance limit, we may have to implement something
similar to TargetInstrInfo::getInstSizeInBytes for x86 and add
BranchRelaxation to addPreEmitPass(), so that such JCC_1/JMP_1 can be
avoided in the first place. Unfortunately, getInstSizeInBytes is not
accurate in the presence of inline assembly, because a pseudo
instruction or a macro can expand to multiple real instructions. In this
regard, LLVM is better than GCC, which just counts the statements
(https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html#Size-of-an-asm).
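
A sketch of why conservative size bounds matter for such a pre-emit pass
(toy model with illustrative names, not LLVM's actual API): if a block's
size is underestimated, a rel8 branch chosen at compile time can end up
out of range after layout. Bounding inline asm by statement count times
the maximum x86 instruction length is one conservative choice, though it
still breaks down when a statement is a macro expanding to several
instructions, which is exactly the weakness mentioned above.

```python
MAX_X86_INSN = 15  # longest legal x86 instruction, in bytes

def conservative_size(insns):
    """Upper-bound a basic block's byte size (toy model).
    Real instructions contribute their exact encoded length; inline asm
    is bounded by statement count * MAX_X86_INSN, i.e. we deliberately
    over-approximate rather than guess a tight size."""
    total = 0
    for kind, payload in insns:
        if kind == 'insn':
            total += payload  # exact encoded length, known to the backend
        else:  # 'inline_asm': payload is the raw asm string
            stmts = [s for s in payload.replace(';', '\n').splitlines()
                     if s.strip()]
            total += len(stmts) * MAX_X86_INSN
    return total

assert conservative_size([('insn', 2), ('insn', 5)]) == 7
assert conservative_size([('inline_asm', 'sti; jmp baz')]) == 30
```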

Debug information is another hassle. We only have relocation records,
and we will have to use conservative forms so that debug information
does not break after Propeller optimization. If we started from
DwarfDebug, we would be more confident that nothing breaks.
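
One concrete instance of a "conservative form": once a function's basic
blocks live in separate sections and get reordered, a single
DW_AT_low_pc/DW_AT_high_pc pair can no longer describe the function, and
a DWARF range list (DW_AT_ranges) must be used instead. A toy check of
that condition (illustrative, not DwarfDebug code):

```python
def needs_ranges(blocks):
    """blocks: list of (address, size) for a function's basic-block
    sections after layout. A single [low_pc, high_pc) interval is valid
    only if they remain contiguous; any gap forces DW_AT_ranges."""
    spans = sorted(blocks)
    return any(a + sz != b for (a, sz), (b, _) in zip(spans, spans[1:]))

assert not needs_ranges([(0x10, 8), (0x18, 4)])  # contiguous: low/high pc ok
assert needs_ranges([(0x10, 8), (0x40, 4)])      # reordered: need a range list
```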

I understand that our focus has always been a), and the (huge)
investment would become sunk cost if we changed direction to b). I am
just concerned about how many new things we will discover in the future
that must be annotated in ELF object files.

Starting from b) is a challenge, because it will push us to rethink
various LLVM infrastructures. The nice thing in return for Propeller is
immediate reuse of facilities already provided by the compiler. The nice
thing in the long term is a flexible framework that will benefit the
overall LLVM project and other future optimizations.

Additionally, Sri told me that we would also do compiler-inserted
prefetching (this topic has been thoroughly studied in prior art, so I
don't think exposing the information is sensitive at all). I cannot
imagine how we would do that without more machine-level information.

>> Route 2 does heavy lifting work in the compiler, which can naturally
>> reuse the assembler-level mitigation, CFI and debug information
>> generation, and probably other stuff. (How much will debug
>> information be bloated?)
>>
>> ## Route 2: Add another link stage, similar to a Thin Link as used by
>> ThinLTO.
>>
>> Regular ThinLTO with minimized bitcode files:
>>
>>         all: compile thin_link thinlto_backend final_link
>>
>>         compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
>>                 $(clang) -O2 -c -flto=thin
>> -fthin-link-bitcode=a.indexing.o a.c
>>                 $(clang) -O2 -c -flto=thin
>> -fthin-link-bitcode=b.indexing.o b.c
>>
>>         thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp:
>> a.indexing.o b.indexing.o
>>                 $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp
>> -Wl,--thinlto-prefix-replace=';lto'
>> -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o
>> b.indexing.o
>>
>>         thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc
>> lto/b.o.thinlto.bc
>>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o
>> lto/a.o
>>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o
>> lto/b.o
>>
>>         final_link exe: lto/a.o lto/b.o a.rsp
>>                 # Propeller does basic block section reordering here.
>>                 $(clang) -fuse-ld=lld @a.rsp -o exe
>>
>> We need to replace the two stages thinlto_backend and final_link with
>> three.
>>
>> Propelled ThinLTO with minimized bitcode files:
>>
>>         propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o
>> lto/a.o.thinlto.bc lto/b.o.thinlto.bc
>>                 # Propeller emits something similar to a Machine IR file.
>>                 # a.o and b.o are all IR files.
>>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc
>> -fpropeller a.o -o lto/a.mir
>>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc
>> -fpropeller b.o -o lto/b.mir
>>
>>         propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
>> # Propeller collects input Machine IR files and
>> # spawns threads to generate object files in parallel.
>>                 $(clang) -fpropeller-backend
>> -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir
>>
>>         final_link exe: propeller/a.o propeller/b.o
>>                 # GNU ld/gold/lld links object files.
>>                 $(clang) $^ -o exe
>>
>> A .mir file may be much larger than an object file. So lto/a.mir may
>> actually be an object file annotated with some information, or some
>> representation lower-level than Machine IR (there should be a
>> guarantee that the produced object file keeps the basic block
>> structure unchanged; otherwise basic block profiling information will
>> not be very useful).
>>
>>
>>
>> [1]: **Can we fix the bottleneck of full LTO?**
>>
>> I wonder whether we have reached a "local maximum" of ThinLTO.
>> If full LTO were nearly as fast as ThinLTO, how would we design a
>> post-link optimization framework?
>> Apparently, if full LTO did not have the scalability problem, we would
>> not do so much work in the linker?
>>
