[llvm-dev] [RFC] Delaying phi-to-select transformation until later in the pass pipeline

Tue Aug 14 10:17:35 PDT 2018

Summary
=======

I'm planning on adjusting SimplifyCFG so that it doesn't turn two-entry phi
nodes into selects until later in the pass pipeline, to give passes which can
understand phis but not selects more opportunity to optimize. The thing I'm
trying to do which made me think of doing this is described below, but from the
benchmarking I've done it looks like this is overall a good idea regardless of
if I manage to get that done or not.

Motivation
==========

My goal is to get clang to optimize some code containing a call to
std::min_element which is dereferenced, so something like:

  float min_element_example(float *data, int size)
  {
    return *std::min_element(data, data+size);
  }

which, after inlining a specialization, looks like:

  float min_element_example_inlined(float *first, float *last)
  {
    for (float *p = first; p != last; ++p)
    {
      if (*p < *first)
        first = p;
    }
    return *first;
  }

There are two loads in the loop, *p and *first, but actually the load *p can be
eliminated by using either the previous load *p or the previous *first,
depending on if the if-condition was taken or not. However the
"if (*p < *first) first = p" gets turned by simplifycfg into a select and this
makes optimizing this a lot harder because you no longer have distinct paths
through the CFG.

I have some ideas on how to do the optimization (see my previous RFC "Making GVN
able to visit the same block more than once" posted in April, though I've
decided that the specific idea presented there isn't the right way to do it),
but I think the first step is to make sure we don't have a select when we try
to optimise this.

Approach
========

I've posted a patch to https://reviews.llvm.org/D50723 showing what I'm
intending to do. An extra parameter is added to SimplifyCFG to control whether
two-entry phi nodes are converted into select, and this is set to false in all
instances before the end of module simplification. At the end of module
simplification we do SimplifyCFG, then Instcombine to optimise the selects that
are introduced, then EarlyCSE to eliminate common subexpressions introduced by
instcombine.

Benchmark Results
=================

These are performance differences reported by LNT when running llvm-test-suite,
spec2000, and spec2006 at -O3 with and without the patch linked above (using
trunk llvm from a week or so ago).

AArch64 results on ARM Cortex-A72:

Performance Regressions - execution_time                              Change
SingleSource/Benchmarks/Shootout/Shootout-ary3                         9.48%
MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt                    3.79%
SingleSource/Benchmarks/CoyoteBench/huffbench                          1.40%

Performance Improvements - execution_time                             Change
MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl              -23.74%
External/SPEC/CINT2000/256.bzip2/256.bzip2                            -9.82%
MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt               -9.57%
MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt       -4.38%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -3.94%
MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl                   -3.44%
External/SPEC/CFP2006/453.povray/453.povray                           -2.50%
SingleSource/Benchmarks/Adobe-C++/stepanov_vector                     -1.49%

X86_64 results on Intel Xeon E5-2690:

Performance Regressions - execution_time           Change
MultiSource/Benchmarks/Ptrdist/yacr2/yacr2          5.62%

Performance Improvements - execution_time          Change
SingleSource/Benchmarks/Misc-C++/Large/sphereflake -4.43%
External/SPEC/CINT2006/456.hmmer/456.hmmer         -2.50%
External/SPEC/CINT2006/464.h264ref/464.h264ref     -1.60%
MultiSource/Benchmarks/nbench/nbench               -1.19%
SingleSource/Benchmarks/Adobe-C++/functionobjects  -1.07%

I had a brief look at the regressions and they all look to be caused by
getting bad luck with branch mispredictions: I looked into the Shootout-ary3 and
yacr2 cases and in both the hot code path was the same with and without the
patch, but with more mispredictions probably caused by changes elsewhere.

John