[llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

Tue Aug 11 09:05:51 PDT 2020

Thank you for the nice writeup.

This sounds like a useful thing to have in tree.  As you point out, 
there are obvious tradeoffs between the IR level and late codegen 
approaches.  There are always going to be cases where one wins and the 
other loses.  Having both in tree, and tuning the heuristics of each to 
focus on its complementary wins, seems like a very reasonable approach.

As an aside, one interesting idea on the IR level would be to explore 
cases where we can specifically split cold suffixes (that is, paths not 
rejoining hot paths before function return).  We have musttail support 
(for the branch lowering), and should be able to adjust the calling 
convention for the call if it ends up with one caller.  This might 
address some of the challenges with IR level splitting.
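
To make the cold-suffix idea concrete, here is a rough source-level analogy in C++ rather than IR. It is only a sketch: `process` and `handleError` are hypothetical names, and it assumes a GCC- or Clang-compatible compiler. Where the `clang::musttail` statement attribute is available it is used; otherwise the macro expands to nothing and the call is an ordinary (possibly tail) call.

```cpp
#include <cstdio>

// Use the musttail statement attribute only where the compiler offers it.
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(clang::musttail)
#define MUSTTAIL [[clang::musttail]]
#endif
#endif
#ifndef MUSTTAIL
#define MUSTTAIL
#endif

// Hypothetical cold suffix: it runs to function return and never rejoins
// the hot path of its caller.
__attribute__((noinline, cold)) int handleError(int code) {
  std::fprintf(stderr, "error %d\n", code);
  return -1;
}

// The caller and the cold suffix share a signature, so control can be
// transferred with a tail call and nothing needs to be passed back.
int process(int value) {
  if (value < 0)
    MUSTTAIL return handleError(value);
  return value * 2; // hot path
}
```

Because the suffix returns directly to the original caller, no live values have to be communicated back through memory.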

Philip

On 8/4/20 5:31 PM, Snehasish Kumar via llvm-dev wrote:
>
>
>   Greetings,
>
> We present “Machine Function Splitter”, a codegen optimization pass 
> which splits functions into hot and cold parts. This pass leverages 
> the basic block sections feature recently introduced in LLVM from the 
> Propeller project. The pass targets functions with profile coverage, 
> identifies cold blocks and moves them to a separate section. The 
> linker groups all cold blocks across functions together, decreasing 
> fragmentation and improving icache and itlb utilization. Our 
> experiments show >2% performance improvement on clang bootstrap, ~1% 
> improvement on Google workloads and 1.6% mean performance improvement 
> on SPEC IntRate 2017.
>
>
>     Motivation
>
> Recent work at Google has shown that aggressive, profile-driven 
> inlining for performance has led to significant code bloat and icache 
> fragmentation (AsmDB - Ayers et al ‘2019 
> <https://research.google/pubs/pub48320/>). We find that most functions 
> 5 KiB or larger have inlined children more than 10 layers deep 
> bringing in exponentially more code at each inline level, not all of 
> which is necessarily hot. Generally, in roughly half of even the 
> hottest functions, more than 50% of the code bytes are never executed, 
> but likely to be in the cache.
>
>
> Function splitting is a well known compiler transformation primarily 
> targeting improved code locality to improve performance. LLVM has a 
> middle-end, target agnostic hot cold splitting pass 
> <https://llvm.org/devmtg/2019-10/slides/Kumar-HotColdSplitting.pdf> as 
> well as a partial inlining pass 
> <https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/IPO/PartialInlining.cpp> which 
> performs similar transformations, as noted by the authors in a recent 
> email thread 
> <https://lists.llvm.org/pipermail/llvm-dev/2020-June/142429.html>. 
> However, due to the timing of the respective passes as well as the 
> code extraction techniques employed, the overall gains on large, 
> complex applications leave headroom for improvement. By deferring 
> function splitting to the codegen phase we can maximize the 
> opportunity to remove cold code as well as refine the code extraction 
> technique. Furthermore, by performing function splitting very late, 
> earlier passes can perform more aggressive optimizations.
>
>
>     Implementation
>
> We propose a new machine function splitting pass which leverages the 
> basic block sections feature <https://reviews.llvm.org/D68063> to split 
> functions without the caveats of code extraction in the middle-end. 
> The pass uses profile information to identify cold basic blocks very 
> late in LLVM CodeGen, after regalloc and all other machine passes have 
> executed. This allows our implementation to be precise in its 
> assessment of cold regions while maximizing opportunity.
>
>
> Each function is split into two parts. The hot cluster includes the 
> function entry and all blocks which are not cold. All the cold blocks 
> are grouped together as a Cold Section cluster 
> <https://github.com/llvm/llvm-project/blob/5934df0c9abe94fc450fbcf0ceca21cf838840e9/llvm/include/llvm/CodeGen/MachineBasicBlock.h#L63>. 
> With basic block sections, the cold blocks are assigned appropriate 
> debug and call frame information and emitted as part of the 
> .text.unlikely section. Unlike Propeller 
> <https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html>, 
> which is presently the main user of the basic block sections feature, 
> this pass does not require an additional round of profiling and uses 
> existing instrumentation based FDO or CSFDO profile information.
>
>
> [Figure: Machine Function Splitter.png]
>
>
> In the illustration above, the functions foo and bar contain a cold 
> block each, index 5 and E respectively. We show a possible layout for 
> these functions which optimizes for fall throughs. Note that all the 
> blocks are kept in a contiguous region described by the symbols foo 
> and bar. Using the machine function splitter, the cold blocks (5 and 
> E) are moved to a separate section. These blocks can then be grouped 
> along with other cold blocks (and functions) in a separate output 
> section in the final binary. The key highlights of this approach are:
>
>  * Profile driven, profile type agnostic approach.
>  * Cold basic blocks are split out using jumps.
>  * No additional instructions are added to the function for
>    setup/teardown.
>  * Runs as the last step before emitting assembly, no
>    analysis/optimizations are hindered.
>
>
> Exceptions
>
> All eh pads are grouped together regardless of their coldness and are 
> part of the original function. There are outstanding issues with 
> splitting eh pads if they reside in separate sections in the binary. 
> This remains as part of future work.
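
The partitioning rules described so far (the entry block stays hot, cold blocks move out, EH pads stay with the original function) can be sketched as follows. This is a simplified illustration and not the pass itself: `Block`, `SplitResult` and `splitFunction` are hypothetical names, and treating a zero profile count as "cold" is an assumption made only for this example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a machine basic block.
struct Block {
  std::string name;
  std::uint64_t profileCount; // execution count taken from the profile
  bool isEntry;               // function entry block
  bool isEHPad;               // exception handling landing pad
};

struct SplitResult {
  std::vector<std::string> hot;  // stays in the original function
  std::vector<std::string> cold; // moved to the cold section cluster
};

// Assumption for this sketch: a block is cold when its count is zero.
SplitResult splitFunction(const std::vector<Block> &blocks) {
  SplitResult result;
  for (const Block &b : blocks) {
    // The entry block and all EH pads stay in the original function
    // regardless of their counts.
    bool keepInHot = b.isEntry || b.isEHPad || b.profileCount > 0;
    (keepInHot ? result.hot : result.cold).push_back(b.name);
  }
  return result;
}
```

For a function with blocks `entry`, `hot`, `cold` and `lpad`, where only `hot` has a non-zero count, this sketch moves only `cold` out.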
>
>
> DebugInfo and CFI
>
> Debug information and CFI directives are updated and kept consistent 
> by the underlying basic block sections framework. Support was added in 
> the following patches:
>
>  * DebugInfo (https://reviews.llvm.org/D78851)
>  * CFI (https://reviews.llvm.org/D79978)
>
>
> Distinction between Machine Function Splitter and Propeller
>
> Full Propeller optimizations include function splitting and layout 
> optimizations, however it requires an additional round of profiling 
> using perf on top of the peak (FDO/CSFDO + ThinLTO) binary. In this 
> work we experiment with applying function splitting using the 
> instrumented profile in the build instead of adding an additional 
> round of profiling.
>
> Link to Propeller RFC 
> <https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html>
>
> Split Binary Characteristics
>
> Binaries produced by the compiler with function splitting enabled 
> contain additional symbols. A function which has been split into a hot 
> and cold part is non-contiguous. The symbol table entry for the hot 
> part retains the symbol name of the original function with type FUNC. 
> The symbol for the cold part contains a “.cold” suffix attached to the 
> original symbol name, the type is not set for this symbol. Using a 
> suffix has been the norm for such optimizations e.g. -hot-cold-split 
> in LLVM and the prior GCC implementation detailed earlier. We expect 
> standardized tooling to handle split functions appropriately, e.g. 
> demangling works as expected --
>
> $ c++filt _Z3foov.cold
> foo() [clone .cold]
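
A tool that attributes samples or addresses in the cold part back to the original function could strip the suffix, for example as below. This is a minimal sketch: `parentSymbol` is a hypothetical helper, and it only handles the plain ".cold" suffix described above, not the numbered ".cold.N" names produced by -hot-cold-split.

```cpp
#include <string>

// Returns the symbol of the original function for a cold-part symbol,
// or the input unchanged if it does not end in ".cold".
std::string parentSymbol(const std::string &symbol) {
  const std::string suffix = ".cold";
  if (symbol.size() > suffix.size() &&
      symbol.compare(symbol.size() - suffix.size(), suffix.size(),
                     suffix) == 0)
    return symbol.substr(0, symbol.size() - suffix.size());
  return symbol;
}
```

With this helper, `parentSymbol("_Z3foov.cold")` yields `_Z3foov`, which can then be demangled as usual.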
>
>
>     Contrast with HotColdSplit (HCS)
>
> Function splitting in the middle-end in LLVM employs extraction of 
> cold single-entry-single-exit (SESE) regions into separate functions. 
> In general, the pass has been found to be impactful in reducing code 
> size by deduplication of cold regions; however our experiments show it 
> does not improve performance of large workloads.
>
>
> The key differences are:
>
>
> Extraction methodology and tradeoffs
>
> HCS extracts cold code from SESE regions using a function call. This 
> may incur a spill and fill of caller registers along with additional 
> setup and teardown if live values modified in the cold region need to 
> be communicated back to the original function. This has a couple of 
> implications
>
> 1. The “residue” of each extracted region is non-trivial and there is
>    a tradeoff between the amount of code that needs to be cold before
>    it is profitable to extract. Thus the cost of mischaracterization
>    is high.
>
> 2. Since each SESE region is extracted separately the net reduction
>    in code size of the original function is less.
>
> In contrast, the machine function splitter extracts cold code into a 
> separate section. Control is transferred to cold code via jumps. More 
> often than not these jumps may already exist as part of the original 
> layout thus incurring no additional cost. No additional instructions 
> are inserted to accommodate splitting. Finally, no additional 
> setup/teardown is necessary for live values modified in cold regions.
>
>
> Pass timing and interaction with other optimizations
>
> The HCS pass is run on the IR in the optimizer. This allows it to be 
> target agnostic and allow later stages to merge identical code if 
> necessary. However, there are some drawbacks to this approach. In 
> particular,
>
> 1. Splitting early may miss opportunities introduced by later passes
>    such as library call inlining and CFG simplification resulting
>    from a combination of optimizations. Furthermore, this may not
>    play well with optimization passes such as MachineOutliner.
>
> 2. Synergistic optimizations are harder to reason about due to the
>    pass timing. For example, inlining can be more aggressive if any
>    cold code introduced is trimmed.
>
> In contrast, the machine function splitter runs as the last step in 
> codegen. This ensures that the opportunity for splitting is maximised 
> without hindering existing analyses and synergistic decisions can be 
> made in earlier optimization passes. We rely on accurate profile count 
> propagation across optimizations to maximise opportunities. This works 
> particularly well for instrumented profiles while improving the pass 
> for sampled profiles is ongoing work.
>
>
> We have provided a contrived example in the Appendix which 
> demonstrates the code generated for both approaches. The key 
> differences are highlighted inline.
>
>
>     Evaluation
>
> In this section, we present an in-depth evaluation of the impact on 
> clang bootstrap and summary results for two Google-internal workloads, 
> Search1 and Search2, as well as overall results on the SPECInt 2017 
> benchmarks. All experiments are conducted on Intel Skylake based 
> systems unless otherwise noted. Profile guided optimizations using 
> instrumented profiles are enabled for all builds.
>
>
> clang-bootstrap
>
> We pick 500 compiler invocations from a bootstrap build of clang and 
> then evaluate the performance of a PGO+ThinLTO optimized version with 
> that of PGO+ThinLTO+Split compiler. For the latter, the final 
> optimized build includes the machine function splitter.
>
>
> Results:
>
> We observe a mean 2.33% improvement in end to end runtime. The 
> improvements in runtime are driven by reduction in icache and TLB miss 
> rates. The table below summarizes our experiment, each data point is 
> averaged over multiple iterations. The observed variation for each 
> metric is < 1%.
>
>
> Event      Split (MPKI)  Baseline (MPKI)  % Reduction
> itlb_miss  0.87          1.28             31.70
> stlb_miss  0.08          0.12             32.51
> l1i_miss   5.98          6.61              9.56
> l2_miss    0.27          0.34             20.02
>
>
> In this experiment, the function splitting pass moved cold code from 
> ~30K functions in .text and .text.hot. We present a comparison of the 
> binary contents using bloaty <https://github.com/google/bloaty>
>
>
>     FILE SIZE        VM SIZE
>  --------------  --------------
>    +23% +8.26Mi   +23% +8.26Mi    .text.unlikely
>   +6.5%  +761Ki  [ = ]       0    .strtab
>   +4.8%  +247Ki  +4.8%  +247Ki    .eh_frame
>   +6.1%  +193Ki  [ = ]       0    .symtab
>   +8.5% +63.1Ki  +8.5% +63.1Ki    .eh_frame_hdr
>   +0.3% +31.3Ki  +0.3% +31.3Ki    .rodata
>   +0.4%      +3  [ = ]       0    [Unmapped]
>   -0.3%      -8  -0.3%      -8    .init_array
>   [ = ]       0 -33.3%      -8    [LOAD #4 [RW]]
>   [ = ]       0  -0.2%    -416    .bss
>  -57.1% -4.04Mi -57.1% -4.04Mi    .text.hot
>  -48.4% -4.13Mi -48.4% -4.13Mi    .text
>   +1.6% +1.35Mi  +0.6%  +430Ki    TOTAL
>
>
> We see that 48% and 57% of code in .text and .text.hot respectively 
> was moved to the .text.unlikely section. We also note a small increase 
> in overall binary size due to the following reasons:
>
>  * Some additional jump instructions may be inserted.
>  * Small increase in associated metadata, e.g. debug information.
>  * Additional symbols of type foo.cold for cold parts.
>  * Alignment requirements for both original and split function parts.
>
>
> Comparison with HotColdSplit
>
> For the clang-bootstrap benchmark we also compared the performance of 
> the hot-cold-split pass with split-machine-functions. We summarize the 
> results for performance and the characteristics of the binary built by 
> each pass in the table below. Each metric is presented as change vs 
> the baseline, an FDO optimized build of clang.
>
>
>
> Metric            Hot Cold Split   Machine Function Splitter
> Performance       1.10%            2.65%
> .text size        -41.5% -2.89Mi   -49.2% -3.43Mi
> .text.hot size    -46.9% -2.52Mi   -57.1% -3.07Mi
> Full binary size  9.6% +7.56Mi     1.7% +1.37Mi
>
>
> Note that the increase in overall binary size for HCS is due 
> to the increase in .eh_frame (+61% +3.03Mi). HCS extracts each cold 
> SESE region as a separate function whereas the machine function 
> splitter extracts the cold code as a single region thus incurring a 
> constant overhead per function.
>
>
> Google workloads
>
> We evaluated the impact of function splitting on a couple of search 
> workloads, Search1 and Search2. A key difference with respect to the 
> clang experiment above is the use of huge pages for code. Overall, we 
> find that on Intel Skylake the key benefit is from reduction of iTLB 
> misses whereas on AMD the key benefit is from the reduction of icache 
> misses. This is due to the fewer iTLB entries available for hugepages 
> on Intel architectures. We find that overall throughput for Search1 
> and Search2 improves by 0.8% to 1.2%, a significant improvement on 
> these benchmarks. The workloads are built with FDO and CSFDO 
> respectively. On Intel Skylake, iTLB misses reduce by 16% to 35%, sTLB 
> misses reduce by 62% to 67%. On AMD, L1 icache misses improve by 1.2% 
> to 2.6% whereas L2 instruction misses improve by 4.8% to 5.1%.
>
>
> Comparison with HotColdSplit
>
> An evaluation of the hot-cold-split pass did not yield performance 
> improvements on Google workloads.
>
>
> SPECInt 2017
>
> We evaluated the impact of the machine function splitter on SPECInt 
> 2017 using the int rate metrics. Overall, we found a 1.6% geomean 
> intrate improvement for the benchmarks where performance improved 
> (500.perlbench_r, 502.gcc_r, 505.mcf_r, 520.omnetpp_r). For the 
> benchmarks that didn’t improve performance, the average degradation 
> was 0.6% (523.xalancbmk_r, 525.x264_r, 531.deepsjeng_r, 541.leela_r).
>
>
> We note that the instruction footprint of SPEC workloads is smaller 
> than that of most modern workloads and our work is primarily focused on 
> reducing the footprint to improve performance. These experiments were 
> performed on Intel Haswell machines.
>
>
>     Appendix
>
> Example to illustrate hot-cold-split and split-machine-functions
>
>
> Input IR
>
> ```
> @i = external global i32, align 4
>
> define i32 @foo(i32 %0, i32 %1) nounwind !prof !1 {
>   %3 = icmp eq i32 %0, 0
>   br i1 %3, label %6, label %4, !prof !2
>
> 4:                                                ; preds = %2
>   %5 = call i32 @L1()
>   br label %9
>
> 6:                                                ; preds = %2
>   %7 = call i32 @R1()
>   %8 = add nsw i32 %1, 1
>   br label %9
>
> 9:                                                ; preds = %6, %4
>   %10 = phi i32 [ %1, %4 ], [ %8, %6 ]
>   %11 = load i32, i32* @i, align 4
>   %12 = add nsw i32 %10, %11
>   store i32 %12, i32* @i, align 4
>   ret i32 %12
> }
>
> declare i32 @L1()
> declare i32 @R1() cold nounwind
>
> !1 = !{!"function_entry_count", i64 7}
> !2 = !{!"branch_weights", i32 0, i32 7}
> ```
>
>
> Code generated by Machine Function Splitter
>
> $ llc < example.ll -mtriple=x86_64-unknown-linux-gnu -split-machine-functions
>
>
> ```
>         .text
>         .file   "<stdin>"
>         .globl  foo                             # -- Begin function foo
>         .p2align        4, 0x90
>         .type   foo,@function
> foo:                                    # @foo
> # %bb.0:
>         pushq   %rbx
>         movl    %esi, %ebx
>         testl   %edi, %edi
>         je      foo.cold                # Jump to cold code
> # %bb.1:
>         callq   L1
> .LBB0_2:
>         addl    i(%rip), %ebx
>         movl    %ebx, i(%rip)
>         movl    %ebx, %eax
>         popq    %rbx
>         retq
>         .section        .text.unlikely.foo,"ax",@progbits
> foo.cold:
>         callq   R1
>         incl    %ebx                    # Directly increment value
>         jmp     .LBB0_2
> .LBB_END0_3:
>         .size   foo.cold, .LBB_END0_3-foo.cold
>         .text
> .Lfunc_end0:
>         .size   foo, .Lfunc_end0-foo
>                                         # -- End function
>         .section        ".note.GNU-stack","",@progbits
> ```
>
>
> Code generated by Hot Cold Split
>
> $ clang -c -O2 -S -mllvm --hot-cold-split -mllvm --hotcoldsplit-threshold=0 -x ir example.ll
>
>
> ```
>         .text
>         .file   "example.ll"
>         .globl  foo                             # -- Begin function foo
>         .p2align        4, 0x90
>         .type   foo,@function
> foo:                                    # @foo
> # %bb.0:
>         pushq   %rbx
>         subq    $16, %rsp
>         movl    %esi, %ebx
>         testl   %edi, %edi
>         jne     .LBB0_1
> # %bb.2:                                # Residue block in original function
>         leaq    12(%rsp), %rsi
>         movl    %ebx, %edi              # Pass param to increment
>         callq   foo.cold.1              # Call to cold code
>         movl    12(%rsp), %ebx          # Fill incremented value from stack
> .LBB0_3:
>         addl    i(%rip), %ebx
>         movl    %ebx, i(%rip)
>         movl    %ebx, %eax
>         addq    $16, %rsp
>         popq    %rbx
>         retq
> .LBB0_1:
>         callq   L1
>         jmp     .LBB0_3
> .Lfunc_end0:
>         .size   foo, .Lfunc_end0-foo
>                                         # -- End function
>         .p2align        4, 0x90         # -- Begin function foo.cold.1
>         .type   foo.cold.1,@function
> foo.cold.1:                             # @foo.cold.1
> # %bb.0:                                # %newFuncRoot
>         pushq   %rbp
>         pushq   %rbx
>         pushq   %rax
>         movq    %rsi, %rbx
>         movl    %edi, %ebp
>         callq   R1
>         incl    %ebp
>         movl    %ebp, (%rbx)
>         addq    $8, %rsp
>         popq    %rbx
>         popq    %rbp
>         retq
> .Lfunc_end1:
>         .size   foo.cold.1, .Lfunc_end1-foo.cold.1
>                                         # -- End function
>         .cg_profile foo, L1, 0
>         .cg_profile foo, foo.cold.1, 7
>         .section        ".note.GNU-stack","",@progbits
>         .addrsig
>         .addrsig_sym foo.cold.1
> ```
>
>
> Thanks,
> Snehasish Kumar
> Software Engineer, Google
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Machine Function Splitter.png
Type: image/png
Size: 49967 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200811/b24dc216/attachment-0001.png>