[llvm-dev] [RFC] Delaying phi-to-select transformation until later in the pass pipeline

Tue Aug 14 10:45:14 PDT 2018

I think the CFG simplification options, including the limit on the
phi->select transformation, should be provided by TTI. We have some
code on Hexagon that would benefit from converting not just 1 phi to a
select (the current limit), but 4 of them.

-Krzysztof

On 8/14/2018 12:17 PM, John Brawn via llvm-dev wrote:
> Summary
> =======
> 
> I'm planning on adjusting SimplifyCFG so that it doesn't turn two-entry phi
> nodes into selects until later in the pass pipeline, to give passes which can
> understand phis but not selects more opportunity to optimize. The optimization
> that prompted this is described below, but the benchmarking I've done suggests
> the change is a good idea overall, whether or not I manage to get that
> optimization done.
> 
> Motivation
> ==========
> 
> My goal is to get clang to optimize code that dereferences the result of a
> call to std::min_element, so something like:
> 
>    float min_element_example(float *data, int size)
>    {
>      return *std::min_element(data, data+size);
>    }
> 
> which, after inlining a specialization, looks like:
> 
>    float min_element_example_inlined(float *first, float *last)
>    {
>      for (float *p = first; p != last; ++p)
>      {
>        if (*p < *first)
>          first = p;
>      }
>      return *first;
>    }
> 
> There are two loads in the loop, *p and *first, but the load *first can be
> eliminated by reusing either the previous *p or the previous *first, depending
> on whether the if-body was executed in the previous iteration. However,
> SimplifyCFG turns "if (*p < *first) first = p" into a select, which makes
> this much harder to optimize because there are no longer distinct paths
> through the CFG.
> 
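
For illustration, here is a hand-written, source-level sketch of the two
forms discussed above. The function names are made up for this example, and
neither function is the output of any LLVM pass. The first mirrors the select
form, where `first` is a merged value and *first has to be reloaded. The
second shows roughly what the desired result would look like, assuming a
non-empty range (the inlined code above already dereferences *first
unconditionally) and no stores to the array inside the loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical name. Source-level analogue of the select form: a single
// straight-line path per iteration, with *first reloaded each time.
float min_select_form(float *first, float *last)
{
  for (float *p = first; p != last; ++p)
    first = (*p < *first) ? p : first;
  return *first;
}

// Hypothetical name. The value of *first is carried in a local, so each
// iteration performs only the load of *p.
float min_cached_form(float *first, float *last)
{
  float best = *first;   // the only load through the original 'first'
  for (float *p = first; p != last; ++p)
  {
    float cur = *p;      // single load per iteration
    if (cur < best)
    {
      first = p;         // branch taken: *first is now the value just loaded
      best = cur;
    }
    // branch not taken: *first is unchanged, so 'best' is still valid
  }
  return best;
}
```

The second form depends on the two paths through the loop body being
distinguishable, which is the information that is lost once the branch has
been folded into a select.
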
> I have some ideas on how to do the optimization (see my previous RFC "Making GVN
> able to visit the same block more than once" posted in April, though I've
> decided that the specific idea presented there isn't the right way to do it),
> but I think the first step is to make sure we don't have a select when we try
> to optimize this.
> 
> Approach
> ========
> 
> I've posted a patch at https://reviews.llvm.org/D50723 showing what I
> intend to do. An extra parameter is added to SimplifyCFG to control whether
> two-entry phi nodes are converted into selects, and it is set to false in all
> instances before the end of module simplification. At the end of module
> simplification we run SimplifyCFG, then InstCombine to optimize the selects
> that are introduced, then EarlyCSE to eliminate common subexpressions
> introduced by InstCombine.
> 
> Benchmark Results
> =================
> 
> These are performance differences reported by LNT when running llvm-test-suite,
> SPEC2000, and SPEC2006 at -O3 with and without the patch linked above (using
> trunk LLVM from a week or so ago).
> 
> AArch64 results on ARM Cortex-A72:
> 
> Performance Regressions - execution_time                              Change
> SingleSource/Benchmarks/Shootout/Shootout-ary3                         9.48%
> MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt                    3.79%
> SingleSource/Benchmarks/CoyoteBench/huffbench                          1.40%
> 
> Performance Improvements - execution_time                             Change
> MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl              -23.74%
> External/SPEC/CINT2000/256.bzip2/256.bzip2                            -9.82%
> MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt               -9.57%
> MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt       -4.38%
> MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -3.94%
> MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl                   -3.44%
> External/SPEC/CFP2006/453.povray/453.povray                           -2.50%
> SingleSource/Benchmarks/Adobe-C++/stepanov_vector                     -1.49%
> 
> X86_64 results on Intel Xeon E5-2690:
> 
> Performance Regressions - execution_time           Change
> MultiSource/Benchmarks/Ptrdist/yacr2/yacr2          5.62%
> 
> Performance Improvements - execution_time          Change
> SingleSource/Benchmarks/Misc-C++/Large/sphereflake -4.43%
> External/SPEC/CINT2006/456.hmmer/456.hmmer         -2.50%
> External/SPEC/CINT2006/464.h264ref/464.h264ref     -1.60%
> MultiSource/Benchmarks/nbench/nbench               -1.19%
> SingleSource/Benchmarks/Adobe-C++/functionobjects  -1.07%
> 
> I had a brief look at the regressions, and they all appear to be caused by bad
> luck with branch mispredictions: in both the Shootout-ary3 and yacr2 cases the
> hot code path was the same with and without the patch, but there were more
> mispredictions, probably caused by changes elsewhere.
> 
> John

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation