<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 2, 2020 at 11:03 AM Fangrui Song via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 2020-02-28, Rui Ueyama wrote:<br>

>On Fri, Feb 28, 2020 at 11:34 AM Fangrui Song <<a href="mailto:maskray@google.com" target="_blank">maskray@google.com</a>> wrote:<br>

><br>

>> I met with the Propeller team today (we work for the same company but it<br>

>> was my first time meeting two members on the team:) ).<br>

>> One thing I have been reassured:<br>

>><br>

>> * There is no general disassembly work. General<br>

>> disassembly work would assuredly frighten off developers.  (Inherently<br>

>> unreliable, memory usage heavy and difficult to deal with CFI, debug<br>

>> information, etc)<br>

>><br>

>> Minimal amount of plumbing work (<a href="https://reviews.llvm.org/D68065" rel="noreferrer" target="_blank">https://reviews.llvm.org/D68065</a>) is<br>

>> acceptable: locating the jump relocation, detecting the jump type,<br>

>> inverting the direction of a jump, and deleting trailing bytes of an<br>

>> input section. The existing linker relaxation schemes already do similar<br>

>> things. Deleting a trailing jump is similar to RISC-V where sections can<br>

>> shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in<br>

>> my mind)) (binutils supports deleting bytes for a few other<br>

>> architectures, e.g.  msp430, sh, mips, ft32, rl78).  With just minimal<br>

>> amount of disassembly work, conceptually the framework should not be too<br>

>> hard to be ported to another target.<br>

>><br>

>> One thing I was not aware of (perhaps the description did not make it<br>

>> clear) is that<br>

>> Propeller intends to **reorder basic block sections across translation<br>

>> units**.<br>

>> This is something that full LTO can do while ThinLTO cannot.<br>

>> Our internal systems cannot afford doing a full LTO (**Can we fix the<br>

>> bottleneck of full LTO** [1]?)<br>

>> for large executables and I believe some other users are in the same camp.<br>

>><br>

>> Now, with ThinLTO, the post link optimization scheme will inevitably<br>

>> require<br>

>> help from the linker/compiler. It seems we have two routes:<br>

>><br>

>> ## Route 1: Current Propeller framework<br>

>><br>

>> lld does whole-program reordering of basic block sections.  We can extend<br>

>> it in<br>

>> the future to overalign some sections and pad gaps with NOPs.  What else<br>

>> can we<br>

>> do? Source code/IR/MCInst is lost at this stage. Without general assembly<br>

>> work, it may be difficult to do more optimization.<br>

>><br>

>> This makes me concerned of another thing: Intel's Jump Condition Code<br>

>> Erratum.<br>

>><br>

>> <a href="https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf" rel="noreferrer" target="_blank">https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf</a><br>

>><br>

>> Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31<br>

>> (mod 32) should be avoided.  There are assembler level (MC) mitigations<br>

>> (function sections are overaligned to 32), but because we use basic<br>

>> block sections (sh_addralign<32) and need reordering, we have to redo<br>

>> some work at the linking stage.<br>

>><br>

>> After losing the representation of MCInst, it is not clear to me how we can<br>

>> insert NOPs/segment override prefixes without doing disassembly work in<br>

>> the linker.<br>

>><br>

><br>

>I'm not sure how the basic-block sections feature makes it hard to<br>

>implement a mitigation for that specific JCC erratum. I may be missing<br>

>something, but doesn't the BB sections actually make it easier to<br>

>implement, as the JCC occurs only at the ending of each basic block, and<br>

>with the BB sections we know what the ending instruction is for each block?<br>

>I mean, when we are reordering sections, and if we found some BB with JCC<br>

>is not at a desired address, we can add a padding before that BB.<br>

<br>

Loss of MachineInstr/MCInst (what we have are ELF object files) and<br>

refraining from disassembly makes it hard to implement the Intel JCC<br>

Erratum mitigation.<br>

<br>

Inserting padding can increase the distance of JCC_1/JMP_1 instructions.<br>

JCC_1/JMP_1 may need to be relaxed to JCC_4/JMP_4.<br>

<br>

jb 1f; nop; 1:                # 72 01             jb 0x3<br>

jb 2f; .space 0x7f, 0x90; 2:  # 72 7f             jb 0x84<br>

jb 3f; .space 0x80, 0x90; 3:  # 0f 82 80 00 00 00 jb 0x10a<br>

<br>

Without disassembly, we can only add NOPs, but not the superior segment<br>

override prefixes. Note that x86 has several instructions which are<br>

documented (Table 24-3. Format of Interruptibility State") as enabling<br>

interrupts exactly one instruction after the one which changes the SS<br>

segment register. Inserting a nop allows an interrupt to arrive before<br>

the execution of the following instruction which changes semantic<br>

behaviour.<br>

<br>

   # NOP before jmp can change semantic behavior.<br>

   sti; jmp baz<br>

   movl %esi, %ss; jmp baz<br>

   movw (%rsi), %ss; jmp baz<br>

<br>

Well, we may go the MSP430 route: always generate the maximum range branch<br>

instructions, and rely on the linker to relax instructions. This is also<br>

what RISC-V does. (I mentioned both MSP430 and RISC-V in my previous<br>

message:) ) There are many challenges here.<br>

<br>

<br>

Coming back to the topic. We really have 2 routes.<br>

<br>

a) Start from ELF object files, add sufficient metadata that supports post-link optimization<br>

b) Start from machine level representations (MIR-like), remove unneeded details to make whole-program post-link optimization affordable.<br>

<br>

Our current route is a). To avoid creating JCC_1/JMP_1 which are close<br>

to the jump distance limit, we may have to implement something similar<br>

to TargetInstrInfo::getInstSizeInBytes for x86, add BranchRelaxation<br>

to addPreEmitPass(), so that JCC_1/JMP_1 can be avoided in the first place.<br>

Unfortunately, getInstSizeInBytes is not great when there is inline<br>

assembly, because a pseudo instruction or a macro can expand to multiple<br>

real instructions. In this regard, LLVM is better than GCC because GCC<br>

just counts the statements<br>

(<a href="https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html#Size-of-an-asm" rel="noreferrer" target="_blank">https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html#Size-of-an-asm</a>)<br>

<br>

Debug information is another hassle. We only have relocation records,<br>

and we will use conservative forms to make debug information not break<br>

after Propeller optimization. If we start from DwarfDebug, we will be<br>

more confident that nothing will break.<br></blockquote><div><br></div><div>FWIW on this point in particular (Debug info + Propeller), I've worked with Sri & others to fairly well (to the best of my ability) validate the DWARF produced by propeller. There are some things that still need generalizing & improving (essentially, the ability to detect when a label difference is known at compile-time or not & then using that to select slightly different DWARF output - for now the Propeller prototype hardcodes a few of these cases rather than using such a general mechanism) but in general I'm satisfied DWARF can handle this in a way that doesn't require special handling by the linker. There are some opportunities (applicable without Propeller, but perhaps a bit more significant with Propeller) to do some selectively content-aware changes in the linker (eg: range list collapsing - doesn't require reading all the DWARF, just the range lists which are much smaller/simpler).</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

I understand that out focus has always been b), and it is (huge) sunk<br>

cost if we change the direction to a). I am just concerned how many new<br>

things we will discover in the future which is mandatory to annotate ELF<br>

object files.<br>

<br>

Starting from b) is a challenge, because it will push us to rethink<br>

various LLVM infrastructures. The nice thing in return for Propeller is<br>

immediate reuse of facility already provided by the compiler. The nice<br>

thing in a long term is flexible frameworks that will benefit the<br>

overall LLVM project and other future optimizations.<br>

<br>

Additionaly, Sri told me that we would also do compiler-inserted<br>

prefetching (this topic has been thoroughly studied by prior art, so I<br>

don't think exposing the information is sensitive at all). I cannot<br>

imagine how we would do it without more machine level information.<br>

<br>

>>Route 2 does heavy lifting work in the compiler, which can naturally reuse<br>

>> the assembler level mitigation,<br>

>> CFI and debug information generating, and probably other stuff.<br>

>> (How will debug information be bloated?)<br>

>><br>

>> ## Route 2: Add another link stage, similar to a Thin Link as used by<br>

>> ThinLTO.<br>

>><br>

>> Regular ThinLTO with minimized bitcode files:<br>

>><br>

>>         all: compile thin_link thinlto_backend final_link<br>

>><br>

>>         compile a.o b.o a.indexing.o b.indexing.o: a.c b.c<br>

>>                 $(clang) -O2 -c -flto=thin<br>

>> -fthin-link-bitcode=a.indexing.o a.c<br>

>>                 $(clang) -O2 -c -flto=thin<br>

>> -fthin-link-bitcode=b.indexing.o b.c<br>

>><br>

>>         thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp:<br>

>> a.indexing.o b.indexing.o<br>

>>                 $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp<br>

>> -Wl,--thinlto-prefix-replace=';lto'<br>

>> -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o<br>

>> b.indexing.o<br>

>><br>

>>         thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc<br>

>> lto/b.o.thinlto.bc<br>

>>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o<br>

>> lto/a.o<br>

>>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o<br>

>> lto/b.o<br>

>><br>

>>         final_link exe: lto/a.o lto/b.o a.rsp<br>

>>                 # Propeller does basic block section reordering here.<br>

>>                 $(clang) -fuse-ld=lld @a.rsp -o exe<br>

>><br>

>> We need to replace the two stages thinlto_backend and final_link with<br>

>> three.<br>

>><br>

>> Propelled ThinLTO with minimized bitcode files:<br>

>><br>

>>         propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o<br>

>> lto/a.o.thinlto.bc lto/b.o.thinlto.bc<br>

>>                 # Propeller emits something similar to a Machine IR file.<br>

>>                 # a.o and b.o are all IR files.<br>

>>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc<br>

>> -fpropeller a.o -o lto/a.mir<br>

>>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc<br>

>> -fpropeller b.o -o lto/b.mir<br>

>><br>

>>         propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir<br>

>>                 # Propeller collects input Machine IR files,<br>

>>                 # spawn threads to generate object files parallelly.<br>

>>                 $(clang) -fpropeller-backend<br>

>> -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir<br>

>><br>

>>         final_link exe: propeller/a.o propeller/b.o<br>

>>                 # GNU ld/gold/lld links object files.<br>

>>                 $(clang) $^ -o exe<br>

>><br>

>> A .mir may be much large than an object file. So lto/a.mir may be<br>

>> actually an object file annotated with some information, or some lower<br>

>> level representation than a Machine IR (there should be a guarantee that<br>

>> the produced object file will keep the basic block structure unchanged<br>

>> => otherwise basic block profiling information will not be too useful).<br>

>><br>

>><br>

>><br>

>> [1]: **Can we fix the bottleneck of full LTO** [1]?<br>

>><br>

>> I wonder whether we have reached a "local maximum" of ThinLTO.<br>

>> If full LTO were nearly as fast as ThinLTO, how would we design a<br>

>> post-link optimization framework?<br>

>> Apparently, if full LTO did not have the scalability problem, we would<br>

>> not do so much work in the linker?<br>

>><br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>

<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

</blockquote></div></div>