[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?

David Blaikie via llvm-dev llvm-dev at lists.llvm.org
Mon Mar 2 11:13:41 PST 2020


On Mon, Mar 2, 2020 at 11:03 AM Fangrui Song via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> On 2020-02-28, Rui Ueyama wrote:
> >On Fri, Feb 28, 2020 at 11:34 AM Fangrui Song <maskray at google.com> wrote:
> >
> >> I met with the Propeller team today (we work for the same company but it
> >> was my first time meeting two members of the team :) ).
> >> One thing I have been reassured of:
> >>
> >> * There is no general disassembly work. General disassembly work would
> >> assuredly frighten off developers: it is inherently unreliable, memory
> >> heavy, and difficult to reconcile with CFI, debug information, etc.
> >>
> >> A minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
> >> acceptable: locating the jump relocation, detecting the jump type,
> >> inverting the direction of a jump, and deleting trailing bytes of an
> >> input section. Existing linker relaxation schemes already do similar
> >> things. Deleting a trailing jump is similar to RISC-V, where sections
> >> can shrink (not implemented in lld; I have R_RISCV_ALIGN and
> >> R_RISCV_RELAX in mind). binutils supports deleting bytes for a few other
> >> architectures, e.g. msp430, sh, mips, ft32, rl78. With just this minimal
> >> amount of disassembly work, the framework should conceptually not be too
> >> hard to port to another target.
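> >>
> >> To make the plumbing concrete, here is a minimal sketch (hypothetical
> >> label names, not the exact mechanics of D68065) of what the linker can
> >> do at the end of a basic block section:
> >>
> >>   # As emitted by the compiler:
> >>   jne  .LBB0_2          # conditional jump to another BB section
> >>   jmp  .LBB0_1          # unconditional jump to the original fallthrough
> >>
> >>   # If the linker places .LBB0_1 immediately after this section, it just
> >>   # deletes the trailing jmp. If it places .LBB0_2 next instead, it uses
> >>   # the jump relocation to invert the condition and then deletes the jmp:
> >>   je   .LBB0_1          # inverted; .LBB0_2 now falls through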
> >>
> >> One thing I was not aware of (perhaps the description did not make it
> >> clear) is that Propeller intends to **reorder basic block sections
> >> across translation units**. This is something that full LTO can do while
> >> ThinLTO cannot. Our internal systems cannot afford a full LTO (**Can we
> >> fix the bottleneck of full LTO?** [1]) for large executables, and I
> >> believe some other users are in the same camp.
> >>
> >> Now, with ThinLTO, the post-link optimization scheme will inevitably
> >> require help from the linker/compiler. It seems we have two routes:
> >>
> >> ## Route 1: Current Propeller framework
> >>
> >> lld does whole-program reordering of basic block sections. We can extend
> >> it in the future to overalign some sections and pad gaps with NOPs. What
> >> else can we do? Source code/IR/MCInst is lost at this stage. Without
> >> general disassembly work, it may be difficult to do more optimization.
> >>
> >> This makes me concerned about another thing: Intel's Jump Conditional
> >> Code Erratum.
> >>
> >>
> >> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
> >>
> >> To put it in the simplest way, a Jcc instruction whose address is ≡ 30
> >> or 31 (mod 32) should be avoided. There are assembler-level (MC)
> >> mitigations (function sections are overaligned to 32), but because we
> >> use basic block sections (sh_addralign < 32) and need reordering, we
> >> have to redo some work at the linking stage.
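> >>
> >> (A worked example with hypothetical addresses, using the 2-byte "72 xx"
> >> encoding of jb:)
> >>
> >>   # 0x3e ≡ 30 (mod 32): jb occupies 0x3e-0x3f, ends on the 0x40 boundary -> affected
> >>   # 0x3f ≡ 31 (mod 32): jb occupies 0x3f-0x40, crosses the 0x40 boundary -> affected
> >>   # 0x3c ≡ 28 (mod 32): jb occupies 0x3c-0x3d, stays within the window   -> fine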
> >>
> >> After losing the MCInst representation, it is not clear to me how we can
> >> insert NOPs/segment override prefixes without doing disassembly work in
> >> the linker.
> >>
> >
> >I'm not sure how the basic-block sections feature makes it hard to
> >implement a mitigation for that specific JCC erratum. I may be missing
> >something, but don't BB sections actually make it easier to implement? A
> >JCC occurs only at the end of a basic block, and with BB sections we know
> >what the ending instruction of each block is. I mean, when we are
> >reordering sections, if we find some BB whose JCC is not at a desired
> >address, we can add padding before that BB.
>
> The loss of MachineInstr/MCInst (what we have are ELF object files) and
> refraining from disassembly make it hard to implement the Intel JCC
> Erratum mitigation.
>
> Inserting padding can increase the jump distance of JCC_1/JMP_1
> instructions, so JCC_1/JMP_1 may need to be relaxed to JCC_4/JMP_4:
>
> jb 1f; nop; 1:                # 72 01             jb 0x3
> jb 2f; .space 0x7f, 0x90; 2:  # 72 7f             jb 0x84
> jb 3f; .space 0x80, 0x90; 3:  # 0f 82 80 00 00 00 jb 0x10a
>
> Without disassembly, we can only add NOPs, but not the superior segment
> override prefixes. Note that x86 has several instructions which are
> documented ("Table 24-3. Format of Interruptibility State") as enabling
> interrupts exactly one instruction after the one which changes the SS
> segment register. Inserting a NOP there allows an interrupt to arrive
> before the following instruction executes, which changes semantic
> behaviour.
>
>    # NOP before jmp can change semantic behavior.
>    sti; jmp baz
>    movl %esi, %ss; jmp baz
>    movw (%rsi), %ss; jmp baz
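>
>    # For comparison, a sketch of the prefix-based alternative (assuming we
>    # still have instruction-level knowledge): a redundant segment override
>    # prefix is decoded as part of the jmp itself, so no new instruction
>    # boundary is introduced and the interrupt shadow still covers the jmp.
>    sti; .byte 0x3e; jmp baz      # 0x3e = ds segment override prefix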
>
> Well, we may go the MSP430 route: always generate the maximum range branch
> instructions, and rely on the linker to relax instructions. This is also
> what RISC-V does. (I mentioned both MSP430 and RISC-V in my previous
> message:) ) There are many challenges here.
>
>
> Coming back to the topic: we really have two routes.
>
> a) Start from ELF object files and add sufficient metadata to support
> post-link optimization.
> b) Start from machine-level representations (MIR-like) and remove
> unneeded details to make whole-program post-link optimization affordable.
>
> Our current route is a). To avoid creating JCC_1/JMP_1 instructions which
> are close to the jump distance limit, we may have to implement something
> similar to TargetInstrInfo::getInstSizeInBytes for x86 and add
> BranchRelaxation to addPreEmitPass(), so that problematic JCC_1/JMP_1 can
> be avoided in the first place. Unfortunately, getInstSizeInBytes is not
> great when there is inline assembly, because a pseudo instruction or a
> macro can expand to multiple real instructions. In this regard, LLVM is
> better than GCC, because GCC just counts the statements
> (https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html#Size-of-an-asm).
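>
> To illustrate the estimation problem (a hypothetical assembler macro, for
> illustration only): a single inline-assembly "statement" can expand to
> several real instructions, so statement counting or a fixed per-statement
> size estimate can be far off.
>
>    .macro SAVE_SCRATCH
>      push %rax
>      push %rcx
>      push %rdx
>    .endm
>    SAVE_SCRATCH    # one statement in the source, three pushes emitted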
>
> Debug information is another hassle. We only have relocation records,
> and we will use conservative forms to keep debug information from
> breaking after Propeller optimization. If we start from DwarfDebug, we
> will be more confident that nothing will break.
>

FWIW on this point in particular (debug info + Propeller), I've worked with
Sri & others to validate, to the best of my ability, the DWARF produced by
Propeller. There are some things that still need generalizing & improving
(essentially, the ability to detect whether a label difference is known at
compile time, and then using that to select slightly different DWARF output
- for now the Propeller prototype hardcodes a few of these cases rather
than using such a general mechanism), but in general I'm satisfied DWARF
can handle this in a way that doesn't require special handling by the
linker. There are some opportunities (applicable without Propeller, but
perhaps a bit more significant with Propeller) to do some selective,
content-aware changes in the linker (e.g. range list collapsing - this
doesn't require reading all the DWARF, just the range lists, which are much
smaller/simpler).


>
> I understand that our focus has always been a), and it is a (huge) sunk
> cost if we change direction to b). I am just concerned about how many new
> things we will discover in the future that must be annotated in ELF
> object files.
>
> Starting from b) is a challenge, because it will push us to rethink
> various LLVM infrastructures. The nice thing in return for Propeller is
> immediate reuse of facilities already provided by the compiler. The nice
> thing in the long term is flexible frameworks that will benefit the
> overall LLVM project and other future optimizations.
>
> Additionally, Sri told me that we would also do compiler-inserted
> prefetching (this topic has been thoroughly studied by prior art, so I
> don't think exposing the information is sensitive at all). I cannot
> imagine how we would do it without more machine-level information.
>
> >> Route 2 does the heavy lifting in the compiler, which can naturally
> >> reuse the assembler-level mitigation, CFI and debug information
> >> generation, and probably other stuff. (How will debug information be
> >> bloated?)
> >>
> >> ## Route 2: Add another link stage, similar to a Thin Link as used by
> >> ThinLTO.
> >>
> >> Regular ThinLTO with minimized bitcode files:
> >>
> >>         all: compile thin_link thinlto_backend final_link
> >>
> >>         compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
> >>                 $(clang) -O2 -c -flto=thin -fthin-link-bitcode=a.indexing.o a.c
> >>                 $(clang) -O2 -c -flto=thin -fthin-link-bitcode=b.indexing.o b.c
> >>
> >>         thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp: a.indexing.o b.indexing.o
> >>                 $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp -Wl,--thinlto-prefix-replace=';lto' -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o b.indexing.o
> >>
> >>         thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
> >>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
> >>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o
> >>
> >>         final_link exe: lto/a.o lto/b.o a.rsp
> >>                 # Propeller does basic block section reordering here.
> >>                 $(clang) -fuse-ld=lld @a.rsp -o exe
> >>
> >> We need to replace the two stages thinlto_backend and final_link with
> >> three.
> >>
> >> Propelled ThinLTO with minimized bitcode files:
> >>
> >>         propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
> >>                 # Propeller emits something similar to a Machine IR file.
> >>                 # a.o and b.o are both IR files.
> >>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc -fpropeller a.o -o lto/a.mir
> >>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc -fpropeller b.o -o lto/b.mir
> >>
> >>         propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
> >>                 # Propeller collects the input Machine IR files and
> >>                 # spawns threads to generate object files in parallel.
> >>                 $(clang) -fpropeller-backend -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir
> >>
> >>         final_link exe: propeller/a.o propeller/b.o
> >>                 # GNU ld/gold/lld links the object files.
> >>                 $(clang) $^ -o exe
> >>
> >> A .mir may be much larger than an object file. So lto/a.mir may actually
> >> be an object file annotated with some information, or a representation
> >> at a lower level than Machine IR (there should be a guarantee that the
> >> produced object file keeps the basic block structure unchanged =>
> >> otherwise basic block profiling information will not be too useful).
> >>
> >>
> >>
> >> [1]: **Can we fix the bottleneck of full LTO?**
> >>
> >> I wonder whether we have reached a "local maximum" with ThinLTO.
> >> If full LTO were nearly as fast as ThinLTO, how would we design a
> >> post-link optimization framework? Presumably, if full LTO did not have
> >> the scalability problem, we would not need to do so much work in the
> >> linker.
> >>