[llvm-dev] Basic block cloning enhancement for Propeller

Fri Sep 25 23:08:25 PDT 2020

TL;DR

Propeller introduced link time function layout optimizations that improves
clang, SPEC and internal Google workload run times and relevant PMU
metrics. I worked on enhancing Propeller with basic block cloning, which
manages to get up to 1% speed up over Propeller on the same loads. This
work enhances Propeller with the ability to clone machine basic blocks to
take advantage of certain layout opportunities that are currently left on
the table.
Motivation

Propeller
<https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html> is a
profile guided function layout optimization using LLVM. It attempts to
improve cache and branch behaviour by favouring frequently executed edges
and making them fall throughs in the optimized binary.

However, in cases where multiple edges are good candidates, Propeller has
to pick one option and potentially leave some performance on the table. For
instance, when presented with the following CFG fragment, it has to pick
one of the edges:

┌───┐  50000   ┌───┐  49999   ┌───┐

│ A │ ──────▶ │ C │ ◀────── │ B │

└───┘          └───┘          └───┘

However, the BC edge is equally hot and having it to be a jump is a missed
opportunity since we cannot make it a fallthrough, especially if the
execution pattern looks like AC,AC,AC…,AC,BC,BC,BC,...,BC, i.e. the edges
execute in separate phases of the program rather than being interleaved. We
call blocks that have multiple hot incoming edges such as C hot join blocks.

To be able to make both edges fallthroughs in such cases, I worked on basic
block cloning for Propeller during the summer and got some promising
results. Part of the changes were in LLVM itself, specifically, adding the
ability to clone Machine Basic Blocks in codegen.
Implementation

While LLVM has the ability to clone IR Basic Blocks through CloneBasicBlock,
there is no corresponding facility to clone MachineBasicBlocks.

I implemented the support for MBB cloning in BasicBlockSections pass as a 2
step process: making the clone and adjusting its predecessor to make it
reachable and have a correct CFG with the new clone.

The cloning step makes a copy of the cloned block in the original block’s
MachineFunction verbatim, copying all the instructions and successors. To
ensure correctness, if the original block has a layout fallthrough, we
insert an unconditional jump to the same block.

After a clone is made, it is made reachable by modifying one of the
predecessors. For instance, in the below example, after the clone is made,
the predecessor B is modified to now jump to C’ rather than C:

┌───┐  50000   ┌───┐

│ A │ ──────▶T │ C │

└───┘          └───┘

┌───┐  49999   ┌───┐

│ B │ ──────▶ │ C'│

└───┘          └───┘

We use TargetInstrInfo::removeBranch and TargetInstrInfo::insertBranch to
modify the branches accordingly.

Which blocks to clone are decided in create_llvm_prof
<https://github.com/shenhanc78/autofdo>. To accommodate the cloning
information, we add a new cloning decisions section which contains recipes
to make clones for every function. The decision algorithm discovers
frequently executed paths by cloning hot join blocks and updating the CFG
weights correctly by querying the Last Branch Records (LBR).

Due to the inaccurate nature of profiles, or any other change that renders
the profile inaccurate, it is possible to have clone decisions that are
impossible to perform in LLVM. Therefore, there also exists the correct
handling of bad clone decisions. In such cases where a clone decision is
impossible to perform (missing blocks, wrong predecessor information etc.),
we back out of that function and make no clones.
Evaluation

With path profile guided cloning decisions, we see consistent performance
and branch characteristic improvements across the board and some increased
i-cache pressure compared to vanilla Propeller. There is also no
significant binary bloat in any programs.
clang

150 Clones

Baseline

BB Cloning

Difference

Run time

135.217

134.023

*-0.88%*

Taken branches

31051306225

30524253874

*-1.7%*

Not-taken branches

56745337469

56954834930

*+0.37%*

I-cache misses

15979272719

16161290846

+1.1%

Binary size

75613543

75615575

+0.002%

(2032 bytes)

SPEC2017

All the numbers are percent changes from propeller. With 100 Clones in all
loads.

gcc

perl

mcf

omnetpp

xalancbmk

x264

deepsjeng

leela

xz

Run time

-0.30%

-2.59%

+2.09%

-0.16%

0%

-2.17%

0%

+1.13%

-0.71%

Taken branch

-1.8%

+2.2%

-10.6%

-1.3%

-2%

-1.3%

-6.1%

-9.2%

-11%

Not taken branch

+0.8%

0%

+3.4%

+0.9%

+12%

-4.6%

+1%

+2.7%

+15.2%

I-cache miss

+0.7%

+2%

+6.3%

+15.2%

-2.4%

+6.5%

+63.3%

+131.3%

0%

Google workloads

On the Search2 benchmark with 2000 clones.

Difference

Throughput

+0.05%

Taken branches

*-1.53%*

Not-taken branches

*+0.23%*

I-cache misses

+2.62%

Binary size

0.004%
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200925/c72e3e71/attachment-0001.html>