[LLVMdev] Enabling the SLP-vectorizer by default for -O3

Sun Jul 28 00:20:24 PDT 2013

Sorry for not posting sooner. I forgot to send an update with the results.

I also have some benchmark data. It confirms much of what you posted --
binary size increase is essentially 0, performance increases across the
board. It looks really good to me.

However, there was one crash, and I'd like to check whether it still fires.
I will update later today (feel free to ping me if you don't hear anything).

That said, why -O3? I think we should just enable this across the board, as
it doesn't seem to cause any size regression under any mode, and the
compile time hit is really low.

On Sat, Jul 27, 2013 at 11:54 PM, Nadav Rotem <nrotem at apple.com> wrote:

> Hi,
>
> Below you can see the updated benchmark results for the new
> SLP-vectorizer.  As you can see, there is a small number of compile time
> regressions, a single major runtime *regression, and many performance
> gains. There is a tiny increase in code size: 30k for the whole test-suite.
> Based on the numbers below I would like to enable the SLP-vectorizer by
> default for -O3. Please let me know if you have any concerns.
>
> Thanks,
> Nadav
>
>
> * - I now understand the Olden/BH regression better. BH is slower because
> of a store-buffer stall. This means that the store buffer fills up and the
> CPU has to wait for some stores to finish. I can think of two reasons
> that may cause this problem. First, our vectorized stores are followed by
> a memcpy that's expanded to a list of scalar reads and writes to the same
> addresses as the vector store. Maybe the processors can’t prune multiple
> stores to the same address with different sizes (Section 2.2.4 in the
> optimization guide has some info on this). Another possibility (less
> likely) is that we increase the critical path by adding a new pshufd
> instruction before the last vector store and that affects the store-buffer
> somehow. In any case, there is not much we can do at the IR-level to
> predict this.
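
To make the first hypothesis above concrete, here is a minimal C sketch of the kind of source pattern being described. It is not taken from Olden/BH: the `vec2` type and the `scale_and_copy` function are hypothetical names used only for illustration, and whether a given compiler merges the two stores into one vector store, or expands the `memcpy` into scalar accesses, depends on the target and optimization flags.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical two-lane struct, used only for illustration (not from BH). */
typedef struct { double x, y; } vec2;

/* Two adjacent scalar stores into *tmp are the sort of pair an SLP
 * vectorizer may merge into a single 16-byte vector store. The memcpy
 * that follows reads the same 16 bytes back; if it is expanded inline
 * into 8-byte scalar loads and stores, those narrower accesses hit the
 * same addresses as the wider store that was just issued.
 * Assumption: dst and tmp do not overlap. */
void scale_and_copy(vec2 *dst, vec2 *tmp, const vec2 *src, double k) {
    assert(dst && tmp && src);
    tmp->x = src->x * k;            /* candidate vector lane 0 */
    tmp->y = src->y * k;            /* candidate vector lane 1 */
    memcpy(dst, tmp, sizeof *tmp);  /* small fixed-size copy of the same bytes */
}
```

The function computes the same result whether or not it is vectorized; only the mix of store and load widths in the generated code changes, which is the effect the paragraph above speculates about.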
>
>
>
> Performance Regressions - Compile Time                                        Δ  Previous   Current       σ
> MultiSource/Benchmarks/VersaBench/beamformer/beamformer                  18.98%    0.0722    0.0859  0.0003
> MultiSource/Benchmarks/FreeBench/pifft/pifft                              5.66%    0.5003    0.5286  0.0015
> MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt     4.85%    0.4084    0.4282  0.0014
> MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt   4.36%    0.3856    0.4024  0.0018
> MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt               2.62%    0.4424    0.4540  0.0019
> External/SPEC/CINT2006/401_bzip2/401_bzip2                                1.50%    1.0613    1.0772  0.0010
> MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4                              1.23%   12.1337   12.2831  0.0296
> MultiSource/Applications/kimwitu++/kc                                     1.15%    9.3690    9.4769  0.0186
> SingleSource/Benchmarks/Misc-C++-EH/spirit                                1.13%    3.2769    3.3139  0.0079
> External/SPEC/CFP2000/188_ammp/188_ammp                                   1.01%    1.8632    1.8820  0.0059
>
>
> Performance Regressions - Execution Time          Δ  Previous   Current       σ
> MultiSource/Benchmarks/Olden/bh/bh           19.24%    1.1551    1.3773  0.0021
> SingleSource/Benchmarks/SmallPT/smallpt       3.75%    5.8779    6.0983  0.0146
> SingleSource/Benchmarks/Misc-C++/Large/ray    1.08%    1.8194    1.8390  0.0009
>
>
> Performance Improvements - Execution Time                                    Δ  Previous   Current       σ
> SingleSource/Benchmarks/Misc/matmul_f64_4x4                            -53.67%    1.4064    0.6516  0.0007
> External/Nurbs/nurbs                                                   -19.47%    2.5389    2.0445  0.0029
> MultiSource/Benchmarks/Olden/power/power                               -18.49%    1.2572    1.0248  0.0004
> SingleSource/Benchmarks/Misc/flops-4                                   -15.93%    0.7767    0.6530  0.0348
> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt        -14.72%    2.3925    2.0404  0.0013
> SingleSource/Benchmarks/Misc/flops-6                                   -11.05%    1.1427    1.0164  0.0009
> SingleSource/Benchmarks/Misc/flops-5                                   -10.43%    1.2771    1.1439  0.0015
> MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt   -8.10%    2.3468    2.1568  0.0195
> SingleSource/Benchmarks/Misc/pi                                         -7.18%    0.6042    0.5608  0.0000
> External/SPEC/CFP2006/444_namd/444_namd                                 -4.01%    9.6053    9.2200  0.0064
> SingleSource/Benchmarks/Linpack/linpack-pc                              -3.85%   95.5313   91.8522  1.1151
> MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl         -3.52%    3.1962    3.0837  0.0063
> MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl   -2.93%    2.9336    2.8477  0.0037
> MultiSource/Benchmarks/VersaBench/beamformer/beamformer                 -2.79%    0.8845    0.8598  0.0026
> SingleSource/Benchmarks/Misc-C++/Large/sphereflake                      -2.79%    1.8517    1.8001  0.0014
> External/SPEC/CFP2000/177_mesa/177_mesa                                 -2.15%    1.7214    1.6844  0.0017
> SingleSource/Benchmarks/CoyoteBench/fftbench                            -2.05%    0.7280    0.7131  0.0049
> MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl         -1.96%    3.1494    3.0878  0.0034
> SingleSource/Benchmarks/Misc/oourafft                                   -1.70%    3.4625    3.4035  0.0009
> SingleSource/Benchmarks/Misc/flops                                      -1.31%    7.0775    6.9845  0.0014
> MultiSource/Applications/JM/lencod/lencod                               -1.12%    4.5972    4.5455  0.0050
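
For anyone reading the Δ column: the reported values are consistent with a plain relative change of Current against Previous. The helper below is a small sketch under that assumption. `percent_change` is a hypothetical name, not part of the test-suite tooling, and because the Previous and Current columns are rounded to four decimals, the result can differ from the reported Δ in the last digit.

```c
#include <assert.h>

/* Relative change of `current` against `previous`, in percent.
 * Positive values mean the time went up (a regression);
 * negative values mean the time went down (an improvement).
 * Assumption: this is how the delta column is derived; the formula
 * matches the rows above but is not stated in the original report. */
double percent_change(double previous, double current) {
    assert(previous != 0.0);
    return (current - previous) / previous * 100.0;
}
```

For example, `percent_change(1.1551, 1.3773)` gives roughly 19.24 for the Olden/bh row, and `percent_change(1.4064, 0.6516)` gives roughly -53.67 for matmul_f64_4x4.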
>
>