[llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

Tue Aug 4 17:31:06 PDT 2020

Greetings,

We present “Machine Function Splitter”, a codegen optimization pass which
splits functions into hot and cold parts. This pass leverages the basic
block sections feature recently introduced in LLVM from the Propeller
project. The pass targets functions with profile coverage, identifies cold
blocks and moves them to a separate section. The linker groups all cold
blocks across functions together, decreasing fragmentation and improving
icache and itlb utilization. Our experiments show >2% performance
improvement on clang bootstrap, ~1% improvement on Google workloads and
1.6% mean performance improvement on SPEC IntRate 2017.
Motivation

Recent work at Google has shown that aggressive, profile-driven inlining
for performance has led to significant code bloat and icache
fragmentation (AsmDB
- Ayers et al ‘2019 <https://research.google/pubs/pub48320/>). We find that
most functions 5 KiB or larger have inlined children more than 10 layers
deep bringing in exponentially more code at each inline level, not all of
which is necessarily hot. Generally, in roughly half of even the hottest
functions, more than 50% of the code bytes are never executed, but likely
to be in the cache.

Function splitting is a well known compiler transformation primarily
targeting improved code locality to improve performance. LLVM has a
middle-end, target agnostic hot cold splitting pass
<https://llvm.org/devmtg/2019-10/slides/Kumar-HotColdSplitting.pdf> as well
as a partial inlining pass
<https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/IPO/PartialInlining.cpp>
which performs similar transformations, as noted by the authors in a recent
email thread
<https://lists.llvm.org/pipermail/llvm-dev/2020-June/142429.html>. However,
due to the timing of the respective passes as well as the code extraction
techniques employed, the overall gains on large, complex applications leave
headroom for improvement. By deferring function splitting to the codegen
phase we can maximize the opportunity to remove cold code as well as refine
the code extraction technique. Furthermore, by performing function
splitting very late, earlier passes can perform more aggressive
optimizations.
Implementation

We propose a new machine function splitting pass which leverages the basic
block sections feature <https://reviews.llvm.org/D68063> to split functions
without the caveats of code extraction in the middle-end. The pass uses
profile information to identify cold basic blocks very late in LLVM
CodeGen, after regalloc and all other machine passes have executed. This
allows our implementation to be precise in its assessment of cold regions
while maximizing opportunity.

Each function is split into two parts. The hot cluster includes the
function entry and all blocks which are not cold. All the cold blocks are
grouped together as a Cold Section cluster
<https://github.com/llvm/llvm-project/blob/5934df0c9abe94fc450fbcf0ceca21cf838840e9/llvm/include/llvm/CodeGen/MachineBasicBlock.h#L63>.
With basic block sections, the cold blocks are assigned appropriate debug
and call frame information and emitted as part of the .text.unlikely
section. Unlike Propeller
<https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html>,
which is presently the main user of the basic block sections feature, this
pass does not require an additional round of profiling and uses existing
instrumentation based FDO or CSFDO profile information.

[image: Machine Function Splitter.png]

In the illustration above, the functions foo and bar contain a cold block
each, index 5 and E respectively. We show a possible layout for these
functions which optimizes for fall throughs. Note that all the blocks are
kept in a contiguous region described by the symbols foo and bar. Using the
machine function splitter, the cold blocks (5 and E) are moved to a
separate section. These blocks can then be grouped along with other cold
blocks (and functions) in a separate output section in the final binary.
The key highlights of this approach are:

   -

   Profile driven, profile type agnostic approach.
   -

   Cold basic blocks are split out using jumps.
   -

   No additional instructions are added to the function for setup/teardown.
   -

   Runs as the last step before emitting assembly, no
   analysis/optimizations are hindered.

Exceptions

All eh pads are grouped together regardless of their coldness and are part
of the original function. There are outstanding issues with splitting eh
pads if they reside in separate sections in the binary. This remains as
part of future work.

DebugInfo and CFI

Debug information and CFI directives are updated and kept consistent by the
underlying basic block sections framework. Support added in the following
patches

   -

   DebugInfo (https://reviews.llvm.org/D78851)
   -

   CFI (https://reviews.llvm.org/D79978).

Distinction between Machine Function Splitter and Propeller

Full Propeller optimizations include function splitting and layout
optimizations, however it requires an additional round of profiling using
perf on top of the peak (FDO/CSFDO + ThinLTO) binary. In this work we
experiment with applying function splitting using the instrumented profile
in the build instead of adding an additional round of profiling.

Link to Propeller RFC
<https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html>

Split Binary Characteristics

Binaries produced by the compiler with function splitting enabled contain
additional symbols. A function which has been split into a hot and cold
part is non-contiguous. The symbol table entry for the hot part retains the
symbol name of the original function with type FUNC. The symbol for the
cold part contains a “.cold” suffix attached to the original symbol name,
the type is not set for this symbol. Using a suffix has been the norm for
such optimizations e.g. -hot-cold-split in LLVM and the prior GCC
implementation detailed earlier. We expect standardized tooling to handle
split functions appropriately, e.g demangling works as expected --

$ c++filt _Z3foov.cold

foo() [clone .cold]

Contrast with HotColdSplit (HCS)

Function splitting in the middle-end in LLVM employs extraction of cold
single-entry-single-exit (SESE) regions into separate functions. In
general, the pass has been found to be impactful in reducing code size by
deduplication of cold regions; however our experiments show it does not
improve performance of large workloads.

The key differences are:

Extraction methodology and tradeoffs

HCS extracts cold code from SESE regions using a function call. This may
incur a spill and fill of caller registers along with additional setup and
teardown if live values modified in the cold region need to be communicated
back to the original function. This has a couple of implications

   1.

   The “residue” of each extracted region is non-trivial and there is a
   tradeoff between the amount of code that needs to be cold before it is
   profitable to extract. Thus the cost of mischaracterization is high.
   2.

   Since each SESE region is extracted separately the net reduction in code
   size of the original function is less.

In contrast, the machine function splitter extracts cold code into a
separate section. Control is transferred to cold code via jumps. More often
than not these jumps may already exist as part of the original layout thus
incurring no additional cost. No additional instructions are inserted to
accommodate splitting. Finally, no additional setup/teardown is necessary
for live values modified in cold regions.

Pass timing and interaction with other optimizations

The HCS pass is run on the IR in the optimizer. This allows it to be target
agnostic and allow later stages to merge identical code if necessary.
However, there are some drawbacks to this approach. In particular,

   1.

   Splitting early may miss opportunities introduced by later passes such
   as library call inlining and CFG simplification resulting from a
   combination of optimizations. Furthermore, this may not play well with
   optimization passes such as MachineOutliner.
   2.

   Synergistic optimizations are harder to reason about due to the pass
   timing. For example, inlining can be more aggressive if any cold code
   introduced is trimmed.

In contrast, the machine function splitter runs as the last step in
codegen. This ensures that the opportunity for splitting is maximised
without hindering existing analyses and synergistic decisions can be made
in earlier optimization passes. We rely on accurate profile count
propagation across optimizations to maximise opportunities. This works
particularly well for instrumented profiles while improving the pass for
sampled profiles is ongoing work.

We have provided a contrived example in the Appendix which demonstrates the
code generated for both approaches. The key differences are highlighted
inline.

Evaluation

In this section, we present an in-depth evaluation of the impact on clang
bootstrap and summary results for two google internal workloads, Search1
and Search2 as well overall results on the SPECInt 2017 benchmarks. All
experiments are conducted on Intel Skylake based systems unless otherwise
noted. Profile guided optimizations using instrumented profiles are enabled
for all builds.

clang-bootstrap

We pick 500 compiler invocations from a bootstrap build of clang and then
evaluate the performance of a PGO+ThinLTO optimized version with that of
PGO+ThinLTO+Split compiler. For the latter, the final optimized build
includes the machine function splitter.

Results:

We observe a mean 2.33% improvement in end to end runtime. The improvements
in runtime are driven by reduction in icache and TLB miss rates. The table
below summarizes our experiment, each data point is averaged over multiple
iterations. The observed variation for each metric is < 1%.

Event

Split (MPKI)

Baseline (MPKI)

% Reduction

itlb_miss

0.87

1.28

31.70

stlb_miss

0.08

0.12

32.51

l1i_miss

5.98

6.61

9.56

l2_miss

0.27

0.34

20.02

In this experiment, the function splitting pass moved cold code from ~30K
functions in .text and .text.hot. We present a comparison of the binary
contents using bloaty <https://github.com/google/bloaty>

    FILE SIZE        VM SIZE

 --------------  --------------

   +23% +8.26Mi   +23% +8.26Mi    .text.unlikely

  +6.5%  +761Ki  [ = ]       0    .strtab

  +4.8%  +247Ki  +4.8%  +247Ki    .eh_frame

  +6.1%  +193Ki  [ = ]       0    .symtab

  +8.5% +63.1Ki  +8.5% +63.1Ki    .eh_frame_hdr

  +0.3% +31.3Ki  +0.3% +31.3Ki    .rodata

  +0.4%      +3  [ = ]       0    [Unmapped]

  -0.3%      -8  -0.3%      -8    .init_array

  [ = ]       0 -33.3%      -8    [LOAD #4 [RW]]

  [ = ]       0  -0.2%    -416    .bss

 -57.1% -4.04Mi -57.1% -4.04Mi    .text.hot

 -48.4% -4.13Mi -48.4% -4.13Mi    .text

  +1.6% +1.35Mi  +0.6%  +430Ki    TOTAL

We see that 48% and 57% of code in .text and .text.hot respectively was
moved to the .text.unlikely section. We also note a small increase in
overall binary size due to the following reasons:

   -

   Some additional jump instructions may be inserted.
   -

   Small increase in associated metadata, e.g. debug information.
   -

   Additional symbols of type foo.cold for cold parts.
   -

   Alignment requirements for both original and split function parts.

Comparison with HotColdSplit

For the clang-bootstrap benchmark we also compared the performance of the
hot-cold-split pass with split-machine-functions. We summarize the results
for performance and the characteristics of the binary built by each pass in
the table below. Each metric is presented as change vs the baseline, an FDO
optimized build of clang.

Hot Cold Split

Machine Function Splitter

Performance

1.10%

2.65%

.text size

-41.5% -2.89Mi

-49.2% -3.43Mi

.text.hot size

-46.9% -2.52Mi

-57.1% -3.07Mi

Full binary size

9.6% +7.56Mi

1.7% +1.37Mi

Note that the increase in overall binary size increase for HCS is due to
the increase in .eh_frame (+61% +3.03Mi). HCS extracts each cold SESE
region as a separate function whereas the machine function splitter
extracts the cold code as a single region thus incurring a constant
overhead per function.

Google workloads

We evaluated the impact of function splitting on a couple of search
workloads, Search1 and Search2. A key difference with respect to the clang
experiment above is the use of huge pages for code. Overall, we find that
on Intel Skylake the key benefit is from reduction of iTLB misses whereas
on AMD the key benefit is from the reduction of icache misses. This is due
to the fewer iTLB entries available for hugepages on Intel architectures.
We find that overall throughput for Search1 and Search2 improve between
0.8% to 1.2%; a significant improvement on these benchmarks. The workloads
are built with FDO and CSFDO respectively. On Intel Skylake, iTLB misses
reduce by 16% to 35%, sTLB misses reduce by 62% to 67%. On AMD, L1 icache
misses improve by 1.2% to 2.6% whereas L2 instruction misses improve by
4.8% to 5.1%.

Comparison with HotColdSplit

An evaluation of the hot-cold-split pass did not yield performance
improvements on google workloads.

SPECInt 2017

We evaluated the impact of the machine function splitter on SPECInt 2017
using the int rate metrics. Overall, we found a 1.6% geomean intrate
improvement for the benchmarks where performance improved (500.perlbench_r,
502.gcc_r, 505.mcf_r, 520.omnetpp_r). For the benchmarks that didn’t
improve performance, the average degradation was 0.6% (523.xalancbmk_r,
525.x264_r, 531.deepsjeng_r, 541.leela_r).

We note that the instruction footprint of SPEC workloads are smaller than
most modern workloads and our work is primarily focused on reducing the
footprint to improve performance. These experiments were performed on Intel
Haswell machines.

Appendix

Example to illustrate hot-cold-split and split-machine-functions

Input IR

```

@i = external global i32, align 4

define i32 @foo(i32 %0, i32 %1) nounwind !prof !1 {

  %3 = icmp eq i32 %0, 0

  br i1 %3, label %6, label %4, !prof !2

4:                                                ; preds = %2

  %5 =  call i32 @L1()

  br label %9

6:                                                ; preds = %2

  %7 = call i32 @R1()

  %8 = add nsw i32 %1, 1

  br label %9

9:                                               ; preds = %6, %4

  %10 = phi i32 [ %1, %4 ], [ %8, %6 ]

  %11 = load i32, i32* @i, align 4

  %12 = add nsw i32 %10, %11

  store i32 %12, i32* @i, align 4

  ret i32 %12

}

declare i32 @L1()

declare i32 @R1() cold nounwind

!1 = !{!"function_entry_count", i64 7}

!2 = !{!"branch_weights", i32 0, i32 7}

```

Code generated by Machine Function Splitter

$ llc < example.ll -mtriple=x86_64-unknown-linux-gnu
-split-machine-functions

```

        .text

        .file   "<stdin>"

        .globl  foo                             # -- Begin function foo

        .p2align        4, 0x90

        .type   foo, at function

foo:                                    # @foo

# %bb.0:

        pushq   %rbx

        movl    %esi, %ebx

        testl   %edi, %edi

        je      foo.cold                # Jump to cold code

# %bb.1:

        callq   L1

.LBB0_2:

        addl    i(%rip), %ebx

        movl    %ebx, i(%rip)

        movl    %ebx, %eax

        popq    %rbx

        retq

        .section        .text.unlikely.foo,"ax", at progbits

foo.cold:

        callq   R1

        incl    %ebx                    # Directly increment value

        jmp     .LBB0_2

.LBB_END0_3:

        .size   foo.cold, .LBB_END0_3-foo.cold

        .text

.Lfunc_end0:

        .size   foo, .Lfunc_end0-foo

                                        # -- End function

        .section        ".note.GNU-stack","", at progbits

```

Code generated by Hot Cold Split

$ clang -c -O2 -S -mllvm --hot-cold-split -mllvm --hotcoldsplit-threshold=0
-x ir example.ll

```

        .text

        .file   "example.ll"

        .globl  foo                             # -- Begin function foo

        .p2align        4, 0x90

        .type   foo, at function

foo:                                    # @foo

# %bb.0:

        pushq   %rbx

        subq    $16, %rsp

        movl    %esi, %ebx

        testl   %edi, %edi

        jne     .LBB0_1

# %bb.2:                                # Residue block in original function

        leaq    12(%rsp), %rsi

        movl    %ebx, %edi              # Pass param to increment

        callq   foo.cold.1              # Call to cold code

        movl    12(%rsp), %ebx          # Fill incremented value from stack

.LBB0_3:

        addl    i(%rip), %ebx

        movl    %ebx, i(%rip)

        movl    %ebx, %eax

        addq    $16, %rsp

        popq    %rbx

        retq

.LBB0_1:

        callq   L1

        jmp     .LBB0_3

.Lfunc_end0:

        .size   foo, .Lfunc_end0-foo

                                        # -- End function

        .p2align        4, 0x90                         # -- Begin function
foo.cold.1

        .type   foo.cold.1, at function

foo.cold.1:                             # @foo.cold.1

# %bb.0:                                # %newFuncRoot

        pushq   %rbp

        pushq   %rbx

        pushq   %rax

        movq    %rsi, %rbx

        movl    %edi, %ebp

        callq   R1

        incl    %ebp

        movl    %ebp, (%rbx)

        addq    $8, %rsp

        popq    %rbx

        popq    %rbp

        retq

.Lfunc_end1:

        .size   foo.cold.1, .Lfunc_end1-foo.cold.1

                                        # -- End function

        .cg_profile foo, L1, 0

        .cg_profile foo, foo.cold.1, 7

        .section        ".note.GNU-stack","", at progbits

        .addrsig

        .addrsig_sym foo.cold.1

```

Thanks,
Snehasish Kumar
Software Engineer, Google
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200804/21ed1330/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Machine Function Splitter.png
Type: image/png
Size: 49967 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200804/21ed1330/attachment-0001.png>