[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?

Fri Mar 20 21:55:31 PDT 2020

Below we describe our plan for handling the mitigation for Intel's JCC
erratum in Propeller.

TL;DR: By computing basic block groupings early, the compiler can form
larger clusters of basic blocks (each cluster in its own section), which
allows Propeller to simply reuse the assembler’s mitigation. Our experiments
show that JCC mitigation causes only a 0.2% slowdown for Propeller,
compared to the 0.6% slowdown incurred for the vanilla configuration.

A slightly longer summary:

   - We evaluated a Propeller prototype that reuses the existing assembler
     mitigation in LLVM, -mbranches-within-32B-boundaries, which currently
     uses only NOPs for mitigation.

   - With some changes, Propeller is able to reuse the existing assembler
     mitigation. To do this, we form large basic block clusters (sections
     containing multiple basic blocks) in the compiler by computing the
     basic block layout earlier.

   - Vanilla clang benchmark (no Propeller) regresses by ~0.6% with this
     flag.

   - With Propeller, the exact same flag regresses clang only by ~0.2%,
     reducing the total speedup from 7.8% to 7.6%.

   - In general, problems of this kind are best solved in the linker.
     However, for this particular problem, it appears that the assembler's
     mitigation is good enough when combined with Propeller.

Background

The JCC erratum
<https://www.intel.com/content/www/us/en/support/articles/000055650/processors.html>
is a CPU bug affecting Skylake processors which results in unpredictable
behaviour under complex micro-architectural states involving the Decoded
I-cache, specifically, when executing branches which cross a cache line.

MicroCode Update (MCU) Mitigation

The CPU avoids this bug by bypassing the Decoded ICache for branches
crossing 32B boundaries. This sacrifices some performance (0-4%) in return
for correctness. The compiler can alleviate this effect by aligning the
code such that branches do not cross a 32B boundary. There are two ways
that the compiler can do this:

   1. Inserting NOP instructions

   2. Inserting prefixes for instructions
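
To make the alignment condition concrete, here is a minimal Python sketch of
the check and of the NOP padding needed to satisfy it. It is an illustration
only, not the LLVM implementation. It assumes, following Intel's published
mitigation guidance, that a branch is affected when it crosses or ends on a
32-byte boundary, and it ignores refinements such as macro-fused instruction
pairs. The function names are hypothetical.

```python
BOUNDARY = 32  # bytes; the window size used by the mitigation


def touches_boundary(addr: int, size: int, boundary: int = BOUNDARY) -> bool:
    """True if an instruction occupying [addr, addr + size) crosses or ends
    on a `boundary`-byte boundary."""
    # The byte just past the instruction lands in a different window both
    # when the instruction crosses a boundary and when it ends exactly on one.
    return addr // boundary != (addr + size) // boundary


def nop_padding(addr: int, size: int, boundary: int = BOUNDARY) -> int:
    """Bytes of padding to place before the instruction so that it no longer
    touches a boundary. Assumes the instruction is shorter than a window."""
    if size >= boundary:
        raise ValueError("instruction does not fit inside one window")
    if not touches_boundary(addr, size, boundary):
        return 0
    # Push the instruction to the start of the next window.
    return boundary - (addr % boundary)


if __name__ == "__main__":
    # A 2-byte Jcc at offset 30 ends on a boundary; at offset 31 it crosses.
    for addr in (29, 30, 31):
        print(addr, touches_boundary(addr, 2), nop_padding(addr, 2))
```

The same number of bytes could in principle be absorbed by prefixes on
earlier instructions instead of NOPs; the sketch only computes how much
space has to be found.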

The current solution shipped with clang-10 (under
-mbranches-within-32B-boundaries) aligns every function at 32B and uses NOPs
between instructions. Our experiment shows that enabling this option results in
0.6% performance degradation for Clang. There have been some efforts to
improve this using instruction prefixes (https://reviews.llvm.org/D72225,
https://reviews.llvm.org/D75268) even though there has been some
uncertainty about the available headroom (
https://reviews.llvm.org/D72225#1818149).

JCC Mitigation in Propeller

Propeller modifies the code layout by emitting basic blocks into sections
and reordering them at link time. This means the assembler’s mitigation,
which is computed for the original layout, could be invalidated by Propeller.

There are two ways in which Propeller can solve the problem:

   1. Redo the full mitigation in the linker

   2. Reuse the mitigation that is being implemented in the assembler

Next we discuss each of the two strategies in more detail.

Full Mitigation in the Linker

The current compiler solution is implemented in the assembler backend and
its scope is limited to one function at a time (with -function-sections),
which requires excessive alignment of 32B for the function entry.

As a post-link optimization infrastructure, Propeller has a global view
of all sections at link time and is in a better position to perform a
globally optimal JCC mitigation. The challenge for Propeller is finding the
locations of the affected branch instructions and inserting padding or
prefixes at the right places (some instructions cannot be preceded by
prefixes or NOPs). This is easier for the assembler, as it has higher-level
information about instructions and can use the MC layer structures (such as
MCRelaxableFragment) to emit variable-sized padding or prefixes.

As we discuss next, our prototype relying on the assembler's mitigation
incurs no significant overhead and therefore we do not plan to address this
problem in the linker.

Relying on the Assembler’s Mitigation

Propeller can use the assembler’s mitigation on every basic block section.
However, this means every basic block would be aligned at 32 bytes. The
padding between basic blocks may consist of NOPs that are actually executed,
which would put significant pressure on the CPU's frontend.

To reduce the NOP paddings, we would need to emit BB sections at a coarser
level of granularity, which would mean emitting multiple basic blocks in
the same section. However, currently, Propeller delays the basic block
layout computation until link time and hence the actual group of basic
blocks (cluster) is only available at link time.

To make this work, we implemented a prototype by moving the layout
computation before the final round of Propeller compilation. After the
layout is computed, basic block partitions of each function are extracted
and passed to the compiler.

For example, consider the following BB layout for a program consisting of
two functions foo (with 5 basic blocks) and bar (with a single basic block).

   foo
   foo.BB.1
   foo.BB.2
   bar
   foo.BB.3
   foo.BB.4

The extracted BB partitions are as follows:

   foo: { [foo, foo.BB.1, foo.BB.2], [foo.BB.3, foo.BB.4] }

   bar: { [bar] }
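
The extraction step for this example can be sketched in a few lines of
Python. This is an illustration under simple assumptions, not the prototype's
code: the layout is modelled as a list of (function, block) pairs in final
order, and a new partition starts whenever the owning function changes. The
name extract_partitions is hypothetical.

```python
from collections import defaultdict


def extract_partitions(layout):
    """Group a global block order into per-function partitions.

    `layout` is a list of (function, block) pairs in final layout order.
    Each maximal run of consecutive blocks from the same function becomes
    one partition (one section) of that function.
    """
    partitions = defaultdict(list)
    previous_function = None
    for function, block in layout:
        if function != previous_function:
            # The owning function changed, so a new partition begins here.
            partitions[function].append([])
        partitions[function][-1].append(block)
        previous_function = function
    return dict(partitions)


layout = [
    ("foo", "foo"),
    ("foo", "foo.BB.1"),
    ("foo", "foo.BB.2"),
    ("bar", "bar"),
    ("foo", "foo.BB.3"),
    ("foo", "foo.BB.4"),
]

if __name__ == "__main__":
    for function, parts in extract_partitions(layout).items():
        print(function, parts)
```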

We instruct the compiler to emit foo’s basic blocks in two sections and bar’s
single basic block in one section. The assembler applies JCC mitigation to
each of the three sections by aligning them at 32 bytes and inserting
minimal padding between instructions within every section. The only change
compared to the baseline mitigation with -function-sections is the
additional 32-byte alignment for foo.BB.3. However, the padding this
introduces is never executed (though it may put slight pressure on the
instruction cache and TLB).
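
As a rough illustration of the cost of the extra alignment, the following
Python sketch places sections back to back at 32-byte alignment and reports
the padding in front of each one. The section sizes are made up for the
example and do not come from our measurements. The point is only that each
additional cluster costs fewer than 32 bytes of padding, and that this
padding sits between sections rather than on an executed path.

```python
def place_sections(sizes, align=32):
    """Place sections consecutively, aligning each start to `align` bytes.

    Returns a list of (start_offset, padding_before) tuples, one per section.
    """
    placed = []
    offset = 0
    for size in sizes:
        start = -(-offset // align) * align  # round offset up to alignment
        placed.append((start, start - offset))
        offset = start + size
    return placed


if __name__ == "__main__":
    # Hypothetical sizes in bytes for the three sections of the example:
    # [foo, foo.BB.1, foo.BB.2], [bar], [foo.BB.3, foo.BB.4]
    for start, padding in place_sections([70, 20, 45]):
        print(f"start={start} padding={padding}")
```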

We note that the layout algorithm would scatter a function’s basic blocks
across multiple partitions judiciously and only if it is advantageous for
the performance. For intra-procedural layout, only two clusters are created
(hot and cold). Nonetheless, the non-executed paddings for clusters will
have minimal impact on performance.

On another note, better code layout could reduce the overhead of JCC
mitigation because the hot code would be packed together and the paddings
for the cold blocks will not affect the hot code.

Results

We evaluated Clang’s performance under different optimizations with and
without JCC mitigation. We used PGO + ThinLTO for all configurations. We
tested two Propeller code layouts: inter-procedural and intra-procedural.
The intra-procedural layout results in at most two clusters for every
function, while the inter-procedural layout could lead to more.

To use JCC mitigation, we use
"-Wl,-mllvm,--x86-branches-within-32B-boundaries
-mbranches-within-32B-boundaries".

We ran the clang bootstrap test 10 times for each configuration and
measured the average cpu time (user + sys in seconds).

We note that our evaluation is performed on a machine without the microcode
update installed.

                             Mitigation Enabled   Mitigation Disabled
baseline (PGO + ThinLTO)     545.362              542.012
Propeller intra-procedural   506.828              504.861
Propeller inter-procedural   503.23               502.136

Clang's cpu time relative to the baseline, for different optimization
flavors, with and without JCC mitigation

First, JCC mitigation results in a 0.6% slowdown when applied to the
baseline. With Propeller, JCC mitigation incurs a 0.4% slowdown for the
intra-procedural layout and 0.2% for the inter-procedural layout. The smaller
slowdowns for the Propeller configurations show the impact of better code
layout. When hot and cold code are mixed together, the padding in the cold
part could put more pressure on the I-cache and I-TLB.
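
For reference, the slowdown figures follow directly from the measured CPU
times. The short Python sketch below recomputes them, taking the slowdown to
be the ratio of the time with mitigation to the time without mitigation,
minus one. The variable and function names are only for illustration.

```python
# CPU time in seconds (user + sys): (mitigation enabled, mitigation disabled)
cpu_seconds = {
    "baseline (PGO + ThinLTO)": (545.362, 542.012),
    "Propeller intra-procedural": (506.828, 504.861),
    "Propeller inter-procedural": (503.23, 502.136),
}


def slowdown_percent(enabled: float, disabled: float) -> float:
    """Relative increase in CPU time caused by enabling the mitigation."""
    return (enabled / disabled - 1.0) * 100.0


if __name__ == "__main__":
    for name, (enabled, disabled) in cpu_seconds.items():
        print(f"{name}: {slowdown_percent(enabled, disabled):.1f}%")
```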

Conclusion

Using BB clusters, we can reuse the assembler’s JCC mitigation with no
significant impact on performance. In fact, the slowdown caused by JCC
mitigation is lower for Propeller because of the better code layout.

Finally, we would like to stress once again that Propeller has the
potential to do a better job for problems like this JCC mitigation.
However, for this particular problem, we have shown that the assembler's
mitigation is good enough to be used along with Propeller.

On Mon, Mar 2, 2020 at 11:56 PM Mehdi AMINI via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
>
> On Thu, Feb 27, 2020 at 6:34 PM Fangrui Song via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> I met with the Propeller team today (we work for the same company but it
>> was my first time meeting two members on the team:) ).
>> One thing I have been reassured:
>>
>> * There is no general disassembly work. General
>> disassembly work would assuredly frighten off developers.  (Inherently
>> unreliable, memory usage heavy and difficult to deal with CFI, debug
>> information, etc)
>>
>> Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
>> acceptable: locating the jump relocation, detecting the jump type,
>> inverting the direction of a jump, and deleting trailing bytes of an
>> input section
>
> . The existing linker relaxation schemes already do similar
>> things. Deleting a trailing jump is similar to RISC-V where sections can
>> shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
>> my mind)) (binutils supports deleting bytes for a few other
>> architectures, e.g.  msp430, sh, mips, ft32, rl78).  With just minimal
>> amount of disassembly work, conceptually the framework should not be too
>> hard to be ported to another target.
>>
>> One thing I was not aware of (perhaps the description did not make it
>> clear) is that
>> Propeller intends to **reorder basic block sections across translation
>> units**.
>> This is something that full LTO can do while ThinLTO cannot.
>> Our internal systems cannot afford doing a full LTO (**Can we fix the
>> bottleneck of full LTO** [1]?)
>> for large executables and I believe some other users are in the same camp.
>>
>
> Right, beyond distributed build system, even on a single machine and for
> "small" projects like clang: building on a laptop with FullLTO can be
> challenging in terms of memory consumption, and the iterative development
> is just not practical.
>
>
>>
>> Now, with ThinLTO, the post link optimization scheme will inevitably
>> require
>> help from the linker/compiler. It seems we have two routes:
>>
>> ## Route 1: Current Propeller framework
>>
>> lld does whole-program reordering of basic block sections.  We can extend
>> it in
>> the future to overalign some sections and pad gaps with NOPs.  What else
>> can we
>> do? Source code/IR/MCInst is lost at this stage. Without general assembly
>> work, it may be difficult to do more optimization.
>>
>> This makes me concerned of another thing: Intel's Jump Condition Code
>> Erratum.
>>
>> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>>
>> Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
>> (mod 32) should be avoided.  There are assembler level (MC) mitigations
>> (function sections are overaligned to 32), but because we use basic
>> block sections (sh_addralign<32) and need reordering, we have to redo
>> some work at the linking stage.
>>
>> After losing the representation of MCInst, it is not clear to me how we
>> can
>> insert NOPs/segment override prefixes without doing disassembly work in
>> the linker.
>>
>> Route 2 does heavy lifting work in the compiler, which can naturally
>> reuse the assembler level mitigation,
>> CFI and debug information generating, and probably other stuff.
>> (How will debug information be bloated?)
>>
>> ## Route 2: Add another link stage, similar to a Thin Link as used by
>> ThinLTO.
>>
>> Regular ThinLTO with minimized bitcode files:
>>
>>         all: compile thin_link thinlto_backend final_link
>>
>>         compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
>>                 $(clang) -O2 -c -flto=thin
>> -fthin-link-bitcode=a.indexing.o a.c
>>                 $(clang) -O2 -c -flto=thin
>> -fthin-link-bitcode=b.indexing.o b.c
>>
>>         thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp:
>> a.indexing.o b.indexing.o
>>                 $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp
>> -Wl,--thinlto-prefix-replace=';lto'
>> -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o
>> b.indexing.o
>>
>>         thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc
>> lto/b.o.thinlto.bc
>>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o
>> lto/a.o
>>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o
>> lto/b.o
>>
>>         final_link exe: lto/a.o lto/b.o a.rsp
>>                 # Propeller does basic block section reordering here.
>>                 $(clang) -fuse-ld=lld @a.rsp -o exe
>>
>> We need to replace the two stages thinlto_backend and final_link with
>> three.
>>
>> Propelled ThinLTO with minimized bitcode files:
>>
>>         propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o
>> lto/a.o.thinlto.bc lto/b.o.thinlto.bc
>>                 # Propeller emits something similar to a Machine IR file.
>>                 # a.o and b.o are all IR files.
>>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc
>> -fpropeller a.o -o lto/a.mir
>>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc
>> -fpropeller b.o -o lto/b.mir
>>
>>         propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
>>                 # Propeller collects input Machine IR files,
>>                 # spawn threads to generate object files parallelly.
>>                 $(clang) -fpropeller-backend
>> -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir
>>
>>         final_link exe: propeller/a.o propeller/b.o
>>                 # GNU ld/gold/lld links object files.
>>                 $(clang) $^ -o exe
>>
>
> There was an interesting talk last week at the LLVM performance workshop: Global
> Machine Outliner for ThinLTO <https://llvm.org/devmtg/2020-02-23/#kl> which
> introduced a similar stage in ThinLTO (for another purpose though). I
> believe they avoid the serialization of MIR by running the CodeGen twice
> instead (once to collect the cross-module informations, and the second time
> using these informations).
> CC the author in case the slides are already available online.
>
>
>
>>
>> A .mir may be much large than an object file. So lto/a.mir may be
>> actually an object file annotated with some information, or some lower
>> level representation than a Machine IR (there should be a guarantee that
>> the produced object file will keep the basic block structure unchanged
>> => otherwise basic block profiling information will not be too useful).
>>
>>
>>
>> [1]: **Can we fix the bottleneck of full LTO** [1]?
>>
>> I wonder whether we have reached a "local maximum" of ThinLTO.
>> If full LTO were nearly as fast as ThinLTO, how would we design a
>> post-link optimization framework?
>> Apparently, if full LTO did not have the scalability problem, we would
>> not do so much work in the linker?
>>
>
> At lot of work went into ThinLTO because the scalability issue of LTO was
> considered inherent to the design. It isn't clear what you're suggesting
> here though?
>
> --
> Mehdi
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
