[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?

Mon Mar 2 23:55:22 PST 2020

On Thu, Feb 27, 2020 at 6:34 PM Fangrui Song via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> I met with the Propeller team today (we work for the same company, but it
> was my first time meeting two members of the team :) ).
> One thing I have been reassured of:
>
> * There is no general disassembly work. General
> disassembly work would assuredly frighten off developers.  (It is inherently
> unreliable, memory-hungry, and makes CFI, debug information, etc.
> difficult to deal with.)
>
> A minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
> acceptable: locating the jump relocation, detecting the jump type,
> inverting the direction of a jump, and deleting trailing bytes of an
> input section. The existing linker relaxation schemes already do similar
> things. Deleting a trailing jump is similar to RISC-V, where sections can
> shrink (not implemented in lld; I have R_RISCV_ALIGN and R_RISCV_RELAX in
> mind). binutils supports deleting bytes for a few other
> architectures, e.g. msp430, sh, mips, ft32, rl78. With just a minimal
> amount of disassembly work, the framework should conceptually not be too
> hard to port to another target.
>
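
To make the "plumbing" above concrete, here is a minimal sketch in Python (not the code from D68065) of the two byte-level operations for x86-64. It relies on the x86 encoding fact that flipping the low bit of a Jcc opcode's condition code negates the condition. The helper names `invert_jcc` and `drop_trailing_jmp` are hypothetical, and the sketch assumes the linker already knows the jump's offset from its relocation, so no general disassembly is involved.

```python
def invert_jcc(code: bytes, offset: int) -> bytes:
    """Return a copy of `code` with the x86 Jcc at `offset` inverted.

    Flipping the low bit of the condition code negates the condition,
    e.g. JE (0x74) <-> JNE (0x75). `offset` is assumed to come from the
    jump relocation, not from decoding the section.
    """
    out = bytearray(code)
    op = out[offset]
    if 0x70 <= op <= 0x7F:
        # Short form: Jcc rel8.
        out[offset] = op ^ 1
    elif op == 0x0F and offset + 1 < len(out) and 0x80 <= out[offset + 1] <= 0x8F:
        # Near form: 0F 8x rel32.
        out[offset + 1] ^= 1
    else:
        raise ValueError("not a conditional jump")
    return bytes(out)


def drop_trailing_jmp(code: bytes) -> bytes:
    """Delete a trailing `JMP rel32` (E9 + 4 bytes) from a section.

    Assumption: the caller has already established, via the relocation,
    that the last 5 bytes are a JMP to the section placed right after
    this one. The 0xE9 test is only a sanity check, not a decoder.
    """
    if len(code) >= 5 and code[-5] == 0xE9:
        return code[:-5]
    return code
```

A real implementation would also have to retarget the relocations when a jump is inverted; that part is omitted here.
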
> One thing I was not aware of (perhaps the description did not make it
> clear) is that
> Propeller intends to **reorder basic block sections across translation
> units**.
> This is something that full LTO can do while ThinLTO cannot.
> Our internal systems cannot afford doing a full LTO (**Can we fix the
> bottleneck of full LTO** [1]?)
> for large executables and I believe some other users are in the same camp.
>

Right, beyond distributed build systems, even on a single machine and for
"small" projects like clang, building on a laptop with FullLTO can be
challenging in terms of memory consumption, and iterative development
is just not practical.

>
> Now, with ThinLTO, the post link optimization scheme will inevitably
> require help from the linker/compiler. It seems we have two routes:
>
> ## Route 1: Current Propeller framework
>
> lld does whole-program reordering of basic block sections.  We can extend
> it in the future to overalign some sections and pad gaps with NOPs.  What
> else can we do? Source code/IR/MCInst is lost at this stage. Without general
> disassembly work, it may be difficult to do more optimization.
>
> This makes me concerned about another thing: Intel's Jump Conditional Code
> Erratum.
>
> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>
> Put in the simplest way, a Jcc instruction whose address ≡ 30 or 31
> (mod 32) should be avoided.  There are assembler level (MC) mitigations
> (function sections are overaligned to 32), but because we use basic
> block sections (sh_addralign<32) and need reordering, we have to redo
> some work at the linking stage.
>
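
As a rough illustration of the address condition above: Intel's document describes the affected case as a jump that crosses or ends on a 32-byte boundary, which for a 2-byte Jcc reduces to "address ≡ 30 or 31 (mod 32)". The Python sketch below checks that condition and computes how many NOP bytes would push the jump to the next boundary. The function names are hypothetical, and padding up to the boundary is only one simple option, not necessarily what the MC mitigation or a linker would choose.

```python
BOUNDARY = 32  # bytes; the boundary size the erratum is concerned with


def touches_boundary(addr: int, size: int) -> bool:
    """True if the instruction occupying [addr, addr + size) crosses or
    ends on a 32-byte boundary."""
    # The first byte and the one-past-the-end address fall in different
    # 32-byte windows exactly when the instruction crosses a boundary or
    # its last byte is the final byte of a window.
    return addr // BOUNDARY != (addr + size) // BOUNDARY


def nop_padding(addr: int, size: int) -> int:
    """Number of padding bytes to insert before the jump so that it
    starts at the next 32-byte boundary (0 if it is already fine)."""
    if not touches_boundary(addr, size):
        return 0
    return (-addr) % BOUNDARY
```

With basic block sections whose alignment is below 32, the final address of each jump is only known once the linker has fixed the layout, which is why the check would have to be redone at link time.
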
> After losing the representation of MCInst, it is not clear to me how we can
> insert NOPs/segment override prefixes without doing disassembly work in
> the linker.
>
> Route 2 does the heavy lifting in the compiler, which can naturally reuse
> the assembler-level mitigation, CFI and debug information generation, and
> probably other stuff.
> (How much will debug information be bloated?)
>
> ## Route 2: Add another link stage, similar to a Thin Link as used by ThinLTO.
>
> Regular ThinLTO with minimized bitcode files:
>
>         all: compile thin_link thinlto_backend final_link
>
>         compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
>                 $(clang) -O2 -c -flto=thin \
>                     -fthin-link-bitcode=a.indexing.o a.c
>                 $(clang) -O2 -c -flto=thin \
>                     -fthin-link-bitcode=b.indexing.o b.c
>
>         thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp: \
>                         a.indexing.o b.indexing.o
>                 $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp \
>                     -Wl,--thinlto-prefix-replace=';lto' \
>                     -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' \
>                     a.indexing.o b.indexing.o
>
>         thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc \
>                         lto/b.o.thinlto.bc
>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o \
>                     -o lto/a.o
>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o \
>                     -o lto/b.o
>
>         final_link exe: lto/a.o lto/b.o a.rsp
>                 # Propeller does basic block section reordering here.
>                 $(clang) -fuse-ld=lld @a.rsp -o exe
>
> We need to replace the two stages thinlto_backend and final_link with
> three.
>
> Propelled ThinLTO with minimized bitcode files:
>
>         propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o \
>                         lto/a.o.thinlto.bc lto/b.o.thinlto.bc
>                 # Propeller emits something similar to a Machine IR file.
>                 # a.o and b.o are both IR files.
>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc \
>                     -fpropeller a.o -o lto/a.mir
>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc \
>                     -fpropeller b.o -o lto/b.mir
>
>         propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
>                 # Propeller collects the input Machine IR files and
>                 # spawns threads to generate object files in parallel.
>                 $(clang) -fpropeller-backend \
>                     -fpropeller-prefix-replace='lto;propeller' \
>                     lto/a.mir lto/b.mir
>
>         final_link exe: propeller/a.o propeller/b.o
>                 # GNU ld/gold/lld links object files.
>                 $(clang) $^ -o exe
>

There was an interesting talk last week at the LLVM performance workshop,
"Global Machine Outliner for ThinLTO" <https://llvm.org/devmtg/2020-02-23/#kl>,
which introduced a similar stage in ThinLTO (for another purpose, though). I
believe they avoid serializing MIR by running CodeGen twice instead (once to
collect the cross-module information, and a second time to use this
information).
CC'ing the author in case the slides are already available online.
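
To illustrate the "run CodeGen twice" idea in the abstract, here is a toy Python sketch. It is not the outliner's actual implementation: `codegen` is a hypothetical stand-in for a per-module backend run, modules are modelled as plain lists of strings, and the "cross-module information" is just an occurrence count.

```python
from collections import Counter


def codegen(module, global_info=None):
    """Hypothetical stand-in for one per-module CodeGen run.

    Returns (output, local_info). When global_info is None this is the
    first, information-gathering run and no output is produced.
    """
    local_info = Counter(module)  # e.g. how often each sequence occurs
    if global_info is None:
        return None, local_info
    # Second run: flag items that repeat anywhere in the whole program.
    output = [(item, global_info[item] > 1) for item in module]
    return output, local_info


def two_round_build(modules):
    """Run codegen once per module to collect information, merge it,
    then run codegen again with the merged, program-wide view."""
    global_info = Counter()
    for m in modules:
        _, info = codegen(m)
        global_info += info
    return [codegen(m, global_info)[0] for m in modules]
```

The trade-off in this shape is extra compile time for the repeated run in exchange for not having to serialize an intermediate representation between the two stages.
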

>
> A .mir may be much larger than an object file. So lto/a.mir may be
> actually an object file annotated with some information, or some lower
> level representation than a Machine IR (there should be a guarantee that
> the produced object file will keep the basic block structure unchanged
> => otherwise basic block profiling information will not be too useful).
>
>
>
> [1]: **Can we fix the bottleneck of full LTO?**
>
> I wonder whether we have reached a "local maximum" of ThinLTO.
> If full LTO were nearly as fast as ThinLTO, how would we design a
> post-link optimization framework?
> Apparently, if full LTO did not have the scalability problem, we would
> not do so much work in the linker?
>

A lot of work went into ThinLTO because the scalability issue of LTO was
considered inherent to the design. It isn't clear what you're suggesting
here, though?

-- 
Mehdi