<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On Jul 14, 2013, at 9:52 PM, Chris Lattner &lt;<a href="mailto:clattner@apple.com">clattner@apple.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><br>On Jul 13, 2013, at 11:30 PM, Nadav Rotem &lt;<a href="mailto:nrotem@apple.com">nrotem@apple.com</a>&gt; wrote:<br><br><blockquote type="cite">Hi,<span class="Apple-converted-space"> </span><br><br>LLVM’s SLP-vectorizer is a new pass that combines similar independent instructions in straight-line code. It is currently not enabled by default, and people who want to experiment with it can use the clang command line flag “-fslp-vectorize”. I ran LLVM’s test suite with and without the SLP vectorizer on a Sandy Bridge Mac (using SSE4, without AVX). Based on my performance measurements (below) I would like to enable the SLP-vectorizer by default at -O3. I would like to hear what others in the community think about this and give other people the opportunity to perform their own performance measurements.<span class="Apple-converted-space"> </span><br></blockquote><br>This looks great, Nadav. The performance wins are really big. Have you investigated the bh and bullet regressions though? </div></blockquote><div dir="auto"><br></div><div dir="auto">Thanks. Yes, I looked at both. The hot function in BH is “gravsub”. The vectorized IR looks fine and the assembly looks fine, but for some reason Instruments reports that the first vector-subtract instruction takes 18% of the time. The regression happens both with the VEX prefix and without. I suspect that the problem is the movupd instructions that load xmm0 and xmm1. 
I started looking at some performance counters on Friday, but I have not found anything suspicious yet. </div><div dir="auto"><br></div>+0x00 movupd 16(%rsi), %xmm0<br>+0x05 movupd 16(%rsp), %xmm1<br>+0x0b subpd %xmm1, %xmm0 &lt;---- 18% of the runtime of bh?<br>+0x0f movapd %xmm0, %xmm2<br>+0x13 mulsd %xmm2, %xmm2<br>+0x17 xorpd %xmm1, %xmm1<br><div dir="auto">+0x1b addsd %xmm2, %xmm1 </div><div dir="auto"><br></div><div dir="auto">I spent less time on Bullet. Bullet also has one hot function (“resolveSingleConstraintRowLowerLimit”). For this code the vectorizer generates several trees that use the &lt;3 x float&gt; type. This is risky because loads and stores of that type are inefficient, but unfortunately triples of RGB and XYZ are very popular in some domains and we do want to vectorize them. I skimmed through the IR and the assembly and I did not see anything too bad. The next step would be to bisect over the places where the vectorizer fires to locate the bad pattern. </div><div dir="auto"><br></div><div dir="auto">On AVX we have another regression that I did not mention: Flops-7. When we vectorize we cause more spills because we do a poor job scheduling non-destructive source instructions (related to PR10928). Hopefully Andy’s scheduler will fix this regression once it is enabled. </div><div dir="auto"><br></div><div dir="auto">I did not measure code size, but I did measure compile time. There are 4-5 workloads (not counting workloads that run below 0.5 seconds) where the compile-time increase is more than 5%. I am aware of a problem in the (quadratic) code that looks for consecutive stores. This code calls SCEV too many times. I plan to fix this. 
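<br><br>To illustrate the kind of change I have in mind for the consecutive-store search, here is a minimal sketch in Python (the pass itself is C++, and this is not the actual fix). It assumes each store pointer has already been analysed once and reduced to a hypothetical (base, offset) pair, with the offset counted in stored elements. Bucketing by base and looking up neighbouring offsets in a map replaces the pairwise comparison, so the expensive analysis runs once per store rather than once per pair.<br>

```python
from collections import defaultdict


def consecutive_store_chains(stores):
    """Group stores into runs that write consecutive addresses.

    `stores` is a list of (base, offset) pairs. This representation is an
    assumption made for illustration: it stands in for the result of
    analysing each store pointer exactly once. Offsets are in units of the
    stored element. Returns a list of runs; each run is a list of indices
    into `stores` whose offsets increase by 1.
    """
    # Bucket stores by base pointer, mapping offset -> store index.
    by_base = defaultdict(dict)
    for index, (base, offset) in enumerate(stores):
        # Assumption: if two stores hit the same address, keep the first.
        by_base[base].setdefault(offset, index)

    chains = []
    for offsets in by_base.values():
        for start in sorted(offsets):
            if start - 1 in offsets:
                continue  # this store belongs to a run that starts earlier
            chain = []
            cur = start
            while cur in offsets:
                chain.append(offsets[cur])
                cur += 1
            if len(chain) > 1:
                chains.append(chain)
    return chains


if __name__ == "__main__":
    example = [("a", 0), ("b", 4), ("a", 1), ("a", 3), ("b", 5), ("a", 2)]
    print(consecutive_store_chains(example))
```

Each run found this way would be a candidate seed for a vectorization tree. The sort makes the grouping cost roughly n log n per base instead of quadratic in the number of stores.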
</div><div dir="auto"><br></div><div dir="auto">Thanks,</div><div dir="auto">Nadav </div><div dir="auto"><br></div><div dir="auto"><br></div><blockquote type="cite"><div style="letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">We should at least understand what is going wrong there. bh is pretty tiny, so it should be straight-forward. It would also be really useful to see what the code size and compile time impact is.<br><br>-Chris<br><br><blockquote type="cite"><br>— Performance Gains —<span class="Apple-converted-space"> </span><br>SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%<br>MultiSource/Benchmarks/Olden/power/power -18.55%<br>MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%<br>SingleSource/Benchmarks/Misc/flops-6 -11.02%<br>SingleSource/Benchmarks/Misc/flops-5 -10.03%<br>MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.37%<br>External/Nurbs/nurbs -7.98%<br>SingleSource/Benchmarks/Misc/pi -7.29%<br>External/SPEC/CINT2000/252_eon/252_eon -5.78%<br>External/SPEC/CFP2006/444_namd/444_namd -4.52%<br>External/SPEC/CFP2000/188_ammp/188_ammp -4.45%<br>MultiSource/Applications/SIBsim4/SIBsim4 -3.58%<br>MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -3.52%<br>SingleSource/Benchmarks/Misc-C++/Large/sphereflake -2.96%<br>MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -2.75%<br>MultiSource/Benchmarks/VersaBench/beamformer/beamformer -2.70%<br>MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl -1.95%<br>SingleSource/Benchmarks/Misc/flops -1.89%<br>SingleSource/Benchmarks/Misc/oourafft -1.71%<br>MultiSource/Benchmarks/mafft/pairlocalalign -1.16%<br>External/SPEC/CFP2006/447_dealII/447_dealII -1.06%<br><br>— Regressions —<span class="Apple-converted-space"> </span><br>MultiSource/Benchmarks/Olden/bh/bh 
22.47%<br>MultiSource/Benchmarks/Bullet/bullet 7.31%<br>SingleSource/Benchmarks/Misc-C++-EH/spirit 5.68%<br>SingleSource/Benchmarks/SmallPT/smallpt 3.91%<br><br>Thanks,<br>Nadav<br><br><br>_______________________________________________<br>LLVM Developers mailing list<br><a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a><span class="Apple-converted-space"> </span> <a href="http://llvm.cs.uiuc.edu/">http://llvm.cs.uiuc.edu</a><br><a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a></blockquote></div></blockquote></div><br></body></html>