[LLVMdev] Enabling the SLP vectorizer by default for -O3

Sun Jul 14 22:55:42 PDT 2013

On Jul 14, 2013, at 9:52 PM, Chris Lattner <clattner at apple.com> wrote:

> 
> On Jul 13, 2013, at 11:30 PM, Nadav Rotem <nrotem at apple.com> wrote:
> 
>> Hi, 
>> 
>> LLVM’s SLP-vectorizer is a new pass that combines similar independent instructions in straight-line code.  It is currently not enabled by default, and people who want to experiment with it can use the clang command line flag “-fslp-vectorize”.  I ran LLVM’s test suite with and without the SLP vectorizer on a Sandy Bridge Mac (using SSE4, w/o AVX).  Based on my performance measurements (below) I would like to enable the SLP-vectorizer by default at -O3.  I would like to hear what others in the community think about this and give other people the opportunity to perform their own performance measurements. 
> 
> This looks great, Nadav.  The performance wins are really big.  Have you investigated the bh and bullet regressions, though?  

Thanks.  Yes, I looked at both.  The hot function in BH is “gravsub”.  The vectorized IR looks fine and the assembly looks fine, but for some reason Instruments reports that the first vector-subtract instruction takes 18% of the time. The regression happens both with the VEX prefix and without. I suspect that the problem is the two movupd instructions that load xmm0 and xmm1. I started looking at some performance counters on Friday, but I have not found anything suspicious yet. 

+0x00 movupd              16(%rsi), %xmm0
+0x05 movupd              16(%rsp), %xmm1
+0x0b subpd                %xmm1, %xmm0    <---- 18% of the runtime of bh?
+0x0f movapd               %xmm0, %xmm2
+0x13 mulsd                %xmm2, %xmm2
+0x17 xorpd                %xmm1, %xmm1
+0x1b addsd                %xmm2, %xmm1 
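
For readers who do not read x86 assembly, the listing above corresponds roughly to the C shape sketched below. This is only an illustration: the function name and parameters are hypothetical, and it is not the actual gravsub source from bh. It assumes two adjacent double subtractions (the kind of pair that can be packed into a single subpd) followed by a scalar square-and-accumulate.

```c
/* Hypothetical stand-in for the hot pattern; NOT the actual bh source. */
double first_lane_dist_sq(const double *pos0, const double *pos1, double dr[2])
{
    /* Two independent subtractions on adjacent elements: these are the
     * candidates for packing into one two-lane vector subtract (subpd). */
    dr[0] = pos0[0] - pos1[0];
    dr[1] = pos0[1] - pos1[1];

    /* Scalar tail: start from zero (xorpd), square lane 0 (mulsd),
     * and accumulate it (addsd). */
    double drsq = 0.0;
    drsq += dr[0] * dr[0];
    return drsq;
}
```

In the packed form, both input pairs have to be loaded with unaligned vector loads before the subtract, which is where the two movupd instructions come from.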

I spent less time on Bullet.  Bullet also has one hot function (“resolveSingleConstraintRowLowerLimit”).  For this code the vectorizer generates several trees that use the <3 x float> type. This is risky because loads and stores of <3 x float> are inefficient, but unfortunately triples of RGB and XYZ are very popular in some domains and we do want to vectorize them.  I skimmed through the IR and the assembly and I did not see anything too bad. The next step would be to do a binary search on the places where the vectorizer fires to locate the bad pattern. 
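
To make the <3 x float> concern concrete, here is a minimal sketch of the kind of source that could give rise to such trees. The Vec3 type and vec3_add function are hypothetical and are not taken from Bullet.

```c
/* Hypothetical 3-component vector; the name Vec3 is illustrative only.
 * Three floats typically occupy 12 bytes, which is less than a 16-byte
 * SSE register, so the data does not fill a full vector load or store. */
typedef struct { float x, y, z; } Vec3;

Vec3 vec3_add(Vec3 a, Vec3 b)
{
    Vec3 r;
    /* Three similar, independent additions on adjacent fields: a natural
     * candidate for packing into a single three-lane vector add. */
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    return r;
}
```

The arithmetic packs nicely, but moving three floats in and out of a four-lane register generally takes more than one memory operation, and that extra cost can cancel the gain.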

On AVX we have another regression that I did not mention: Flops-7.  When we vectorize we cause more spills because we do a poor job scheduling non-destructive source instructions (related to PR10928). Hopefully Andy’s scheduler will fix this regression once it is enabled. 

I did not measure code size, but I did measure compile time.  There are 4-5 workloads (not counting workloads that run for less than 0.5 seconds) where the compile time increase is more than 5%.  I am aware of a problem in the (quadratic) code that looks for consecutive stores. This code calls SCEV too many times. I plan to fix this. 
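
As a rough illustration of why a quadratic search is costly, the sketch below counts how many pairwise address queries such a search performs. It is not the vectorizer's code: plain integer offsets stand in for store addresses, is_consecutive stands in for an expensive SCEV-based comparison, and all names are hypothetical.

```c
#include <stddef.h>

/* Counts how many times the (hypothetically expensive) query is made. */
size_t g_queries = 0;

/* Stand-in for an address comparison: is b exactly `size` bytes after a? */
static int is_consecutive(long a, long b, long size)
{
    ++g_queries;
    return b - a == size;
}

/* Compare every store against every other store: n * (n - 1) queries. */
size_t count_consecutive_pairs(const long *offsets, size_t n, long size)
{
    size_t pairs = 0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            if (i != j && is_consecutive(offsets[i], offsets[j], size))
                ++pairs;
    return pairs;
}
```

With n stores in a block, this shape makes n * (n - 1) queries, so reducing the cost means either making fewer queries or making each one cheaper.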

Thanks,
Nadav  

> We should at least understand what is going wrong there.  bh is pretty tiny, so it should be straight-forward.  It would also be really useful to see what the code size and compile time impact is.
> 
> -Chris
> 
>> 
>> — Performance Gains — 
>> SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
>> MultiSource/Benchmarks/Olden/power/power  -18.55%
>> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
>> SingleSource/Benchmarks/Misc/flops-6  -11.02%
>> SingleSource/Benchmarks/Misc/flops-5  -10.03%
>> MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.37%
>> External/Nurbs/nurbs  -7.98%
>> SingleSource/Benchmarks/Misc/pi -7.29%
>> External/SPEC/CINT2000/252_eon/252_eon  -5.78%
>> External/SPEC/CFP2006/444_namd/444_namd -4.52%
>> External/SPEC/CFP2000/188_ammp/188_ammp -4.45%
>> MultiSource/Applications/SIBsim4/SIBsim4  -3.58%
>> MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -3.52%
>> SingleSource/Benchmarks/Misc-C++/Large/sphereflake  -2.96%
>> MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -2.75%
>> MultiSource/Benchmarks/VersaBench/beamformer/beamformer -2.70%
>> MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl -1.95%
>> SingleSource/Benchmarks/Misc/flops  -1.89%
>> SingleSource/Benchmarks/Misc/oourafft -1.71%
>> MultiSource/Benchmarks/mafft/pairlocalalign -1.16%
>> External/SPEC/CFP2006/447_dealII/447_dealII -1.06%
>> 
>> — Regressions — 
>> MultiSource/Benchmarks/Olden/bh/bh  22.47%
>> MultiSource/Benchmarks/Bullet/bullet  7.31%
>> SingleSource/Benchmarks/Misc-C++-EH/spirit  5.68%
>> SingleSource/Benchmarks/SmallPT/smallpt 3.91%
>> 
>> Thanks,
>> Nadav
>> 
>> 
