[llvm-dev] Basic block cloning enhancement for Propeller
Fatih Bakir via llvm-dev
llvm-dev at lists.llvm.org
Fri Sep 25 23:08:25 PDT 2020
TL;DR
Propeller introduced link time function layout optimizations that improves
clang, SPEC and internal Google workload run times and relevant PMU
metrics. I worked on enhancing Propeller with basic block cloning, which
manages to get up to 1% speed up over Propeller on the same loads. This
work enhances Propeller with the ability to clone machine basic blocks to
take advantage of certain layout opportunities that are currently left on
the table.
Motivation
Propeller
<https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html> is a
profile guided function layout optimization using LLVM. It attempts to
improve cache and branch behaviour by favouring frequently executed edges
and making them fall throughs in the optimized binary.
However, in cases where multiple edges are good candidates, Propeller has
to pick one option and potentially leave some performance on the table. For
instance, when presented with the following CFG fragment, it has to pick
one of the edges:
┌───┐ 50000 ┌───┐ 49999 ┌───┐
│ A │ ──────▶ │ C │ ◀────── │ B │
└───┘ └───┘ └───┘
However, the BC edge is equally hot and having it to be a jump is a missed
opportunity since we cannot make it a fallthrough, especially if the
execution pattern looks like AC,AC,AC…,AC,BC,BC,BC,...,BC, i.e. the edges
execute in separate phases of the program rather than being interleaved. We
call blocks that have multiple hot incoming edges such as C hot join blocks.
To be able to make both edges fallthroughs in such cases, I worked on basic
block cloning for Propeller during the summer and got some promising
results. Part of the changes were in LLVM itself, specifically, adding the
ability to clone Machine Basic Blocks in codegen.
Implementation
While LLVM has the ability to clone IR Basic Blocks through CloneBasicBlock,
there is no corresponding facility to clone MachineBasicBlocks.
I implemented the support for MBB cloning in BasicBlockSections pass as a 2
step process: making the clone and adjusting its predecessor to make it
reachable and have a correct CFG with the new clone.
The cloning step makes a copy of the cloned block in the original block’s
MachineFunction verbatim, copying all the instructions and successors. To
ensure correctness, if the original block has a layout fallthrough, we
insert an unconditional jump to the same block.
After a clone is made, it is made reachable by modifying one of the
predecessors. For instance, in the below example, after the clone is made,
the predecessor B is modified to now jump to C’ rather than C:
┌───┐ 50000 ┌───┐
│ A │ ──────▶T │ C │
└───┘ └───┘
┌───┐ 49999 ┌───┐
│ B │ ──────▶ │ C'│
└───┘ └───┘
We use TargetInstrInfo::removeBranch and TargetInstrInfo::insertBranch to
modify the branches accordingly.
Which blocks to clone are decided in create_llvm_prof
<https://github.com/shenhanc78/autofdo>. To accommodate the cloning
information, we add a new cloning decisions section which contains recipes
to make clones for every function. The decision algorithm discovers
frequently executed paths by cloning hot join blocks and updating the CFG
weights correctly by querying the Last Branch Records (LBR).
Due to the inaccurate nature of profiles, or any other change that renders
the profile inaccurate, it is possible to have clone decisions that are
impossible to perform in LLVM. Therefore, there also exists the correct
handling of bad clone decisions. In such cases where a clone decision is
impossible to perform (missing blocks, wrong predecessor information etc.),
we back out of that function and make no clones.
Evaluation
With path profile guided cloning decisions, we see consistent performance
and branch characteristic improvements across the board and some increased
i-cache pressure compared to vanilla Propeller. There is also no
significant binary bloat in any programs.
clang
150 Clones
Baseline
BB Cloning
Difference
Run time
135.217
134.023
*-0.88%*
Taken branches
31051306225
30524253874
*-1.7%*
Not-taken branches
56745337469
56954834930
*+0.37%*
I-cache misses
15979272719
16161290846
+1.1%
Binary size
75613543
75615575
+0.002%
(2032 bytes)
SPEC2017
All the numbers are percent changes from propeller. With 100 Clones in all
loads.
gcc
perl
mcf
omnetpp
xalancbmk
x264
deepsjeng
leela
xz
Run time
-0.30%
-2.59%
+2.09%
-0.16%
0%
-2.17%
0%
+1.13%
-0.71%
Taken branch
-1.8%
+2.2%
-10.6%
-1.3%
-2%
-1.3%
-6.1%
-9.2%
-11%
Not taken branch
+0.8%
0%
+3.4%
+0.9%
+12%
-4.6%
+1%
+2.7%
+15.2%
I-cache miss
+0.7%
+2%
+6.3%
+15.2%
-2.4%
+6.5%
+63.3%
+131.3%
0%
Google workloads
On the Search2 benchmark with 2000 clones.
Difference
Throughput
+0.05%
Taken branches
*-1.53%*
Not-taken branches
*+0.23%*
I-cache misses
+2.62%
Binary size
0.004%
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200925/c72e3e71/attachment-0001.html>
More information about the llvm-dev
mailing list