[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Sean Silva chisophugis at gmail.com
Tue Sep 9 13:47:53 PDT 2014


On Tue, Sep 9, 2014 at 12:53 PM, Quentin Colombet <qcolombet at apple.com>
wrote:

> Hi Chandler,
>
> I have observed both improvements and regressions with the new lowering.
>
> Here are the numbers for an Ivy Bridge machine fixed at 2900MHz.
>
> I’ll look into the regressions to provide test cases.
>
> ** Numbers **
>
> Smaller is better. Only tests that run for at least one second are
> reported. Reference is the default lowering; Test is the new lowering.
> The Os numbers are overall neutral, but the O3 numbers mainly expose
> regressions.
>
> Note: I can attach the raw numbers if you want.
>

That would be great. Please do.

-- Sean Silva


>
> * Os *
> Benchmark_ID                             Reference      Test  Expansion  Percent
> -------------------------------------------------------------------------------
> External/Nurbs/nurbs                        2.3302    2.3122       0.99      -1%
> External/SPEC/CFP2000/183.equake/183.eq     3.2606    3.2419       0.99      -1%
> External/SPEC/CFP2006/447.dealII/447.de    16.4638   16.1313       0.98      -2%
> External/SPEC/CFP2006/470.lbm/470.lbm       2.0159    1.9931       0.99      -1%
> External/SPEC/CINT2000/164.gzip/164.gzi     8.7611    8.6981       0.99      -1%
> External/SPEC/CINT2006/456.hmmer/456.hm     2.5674    2.5819       1.01      +1%
> External/SPEC/CINT2006/462.libquantum/4     1.2924     1.347       1.04      +4%
> MultiSource/Benchmarks/TSVC/CrossingThr     2.4703    2.4852       1.01      +1%
> MultiSource/Benchmarks/TSVC/LoopRerolli     2.6611    2.5668       0.96      -4%
> MultiSource/Benchmarks/mafft/pairlocala     24.676   24.5372       0.99      -1%
> SingleSource/Benchmarks/Adobe-C++/simpl     1.0579    1.1048       1.04      +4%
> SingleSource/Benchmarks/Linpack/linpack     4.2817    4.3298       1.01      +1%
> SingleSource/Benchmarks/Misc-C++/stepan     4.1821     4.226       1.01      +1%
> SingleSource/Benchmarks/Misc/oourafft       3.0305    3.1777       1.05      +5%
> -------------------------------------------------------------------------------
> Min (14)                                         -         -       0.96        -
> -------------------------------------------------------------------------------
> Max (14)                                         -         -       1.05        -
> -------------------------------------------------------------------------------
> Sum (14)                                        79        79          1      +0%
> -------------------------------------------------------------------------------
> A.Mean (14)                                      -         -       1.01      +1%
> -------------------------------------------------------------------------------
> G.Mean 2 (14)                                    -         -       1.01      +1%
> -------------------------------------------------------------------------------
>
> * O3 *
> Benchmark_ID                             Reference      Test  Expansion  Percent
> -------------------------------------------------------------------------------
> External/Nurbs/nurbs                        2.2322    2.2131       0.99      -1%
> External/Povray/povray                      2.2638    2.2762       1.01      +1%
> External/SPEC/CFP2000/177.mesa/177.mesa     1.6675    1.6828       1.01      +1%
> External/SPEC/CFP2000/188.ammp/188.ammp    10.9309   11.1191       1.02      +2%
> External/SPEC/CFP2006/433.milc/433.milc     6.9214    7.1696       1.04      +4%
> External/SPEC/CINT2000/164.gzip/164.gzi     8.5327    8.8114       1.03      +3%
> External/SPEC/CINT2000/186.crafty/186.c     4.1266      4.16       1.01      +1%
> External/SPEC/CINT2000/253.perlbmk/253.     5.6991    5.7309       1.01      +1%
> External/SPEC/CINT2000/256.bzip2/256.bz     6.7917    6.8763       1.01      +1%
> External/SPEC/CINT2006/400.perlbench/40      6.243    6.1464       0.98      -2%
> External/SPEC/CINT2006/401.bzip2/401.bz      2.095    2.0588       0.98      -2%
> External/SPEC/CINT2006/462.libquantum/4        1.2    1.2108       1.01      +1%
> MultiSource/Applications/SIBsim4/SIBsim     2.4547    2.5129       1.02      +2%
> MultiSource/Benchmarks/Bullet/bullet        4.1687    4.0882       0.98      -2%
> MultiSource/Benchmarks/TSVC/LinearDepen     3.0389    3.0566       1.01      +1%
> MultiSource/Benchmarks/TSVC/LinearDepen     2.1298    2.1997       1.03      +3%
> MultiSource/Benchmarks/TSVC/LoopRerolli     2.6458    2.5552       0.97      -3%
> MultiSource/Benchmarks/TSVC/Symbolics-f     1.6243    1.6612       1.02      +2%
> MultiSource/Benchmarks/mafft/pairlocala    23.8979   24.0547       1.01      +1%
> SingleSource/Benchmarks/Misc/oourafft       3.0374    3.1846       1.05      +5%
> SingleSource/Benchmarks/SmallPT/smallpt     6.5533    6.6683       1.02      +2%
> -------------------------------------------------------------------------------
> Min (21)                                         -         -       0.97        -
> -------------------------------------------------------------------------------
> Max (21)                                         -         -       1.05        -
> -------------------------------------------------------------------------------
> Sum (21)                                       108       109       1.01      +1%
> -------------------------------------------------------------------------------
> A.Mean (21)                                      -         -       1.01      +1%
> -------------------------------------------------------------------------------
> G.Mean 2 (21)                                    -         -       1.01      +1%
> -------------------------------------------------------------------------------
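For readers checking the tables, the Expansion and Percent columns appear to be the Test/Reference ratio and its signed percentage change. The following is a minimal Python sketch under that assumption; the function names are hypothetical and are not part of LNT or the LLVM test suite, and the three rows are copied from the Os table.

```python
import math

def expansion(reference, test):
    """Ratio of new-lowering time to default-lowering time (smaller is better)."""
    return test / reference

def percent_change(reference, test):
    """Signed change in percent, rounded to a whole number like the table."""
    return round((expansion(reference, test) - 1.0) * 100)

def geometric_mean(ratios):
    """Geometric mean of a list of positive ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# (reference, test) timings in seconds, copied from the Os table.
rows = {
    "External/Nurbs/nurbs": (2.3302, 2.3122),
    "External/SPEC/CINT2006/462.libquantum/4": (1.2924, 1.347),
    "SingleSource/Benchmarks/Misc/oourafft": (3.0305, 3.1777),
}

for name, (ref, test) in rows.items():
    print(f"{name:<40}{expansion(ref, test):6.2f}{percent_change(ref, test):+5d}%")

ratios = [expansion(ref, test) for ref, test in rows.values()]
print(f"geometric mean of these three rows: {geometric_mean(ratios):.3f}")
```

A ratio above 1 means the new lowering is slower, so its Percent entry should carry a plus sign.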
>
> Thanks,
> -Quentin
>
> On Sep 9, 2014, at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>
> Hi Chandler,
>
> Thanks for fixing the problem with the insertps mask.
>
> Generally the new shuffle lowering looks promising; however, there are
> some cases where the codegen is now worse, causing runtime performance
> regressions in some of our internal codebase.
>
> You have already mentioned that the new shuffle lowering is missing
> some features; for example, you explicitly said that we currently lack
> SSE4.1 blend support. Unfortunately, this seems to be one of the
> main reasons for the slowdown we are seeing.
>
> Here is a list of what we found so far that we think is causing most
> of the slowdown:
> 1) shufps is always emitted in cases where we could emit a single
> blendps; in these cases, blendps is preferable because it has better
> reciprocal throughput (this is true on all modern Intel and AMD cpus).
>
> Things get worse when it comes to lowering shuffles where the shuffle
> mask indices refer to elements from both input vectors in each lane.
> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
> a single blendps; instead it gets lowered into two shufps
> instructions.
>
> Example:
> ;;;
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>  ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>  vblendps  $10, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>  vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>  vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0[0,2,1,3]
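The blend test described in point 1 can be sketched in a few lines of Python. This is an illustration only, not the code in the x86 backend: `blend_immediate` and `decode_shufps` are hypothetical names, and a negative mask index is assumed to stand for an undef lane.

```python
def blend_immediate(mask):
    """Return the blendps immediate for a two-input shuffle mask, or None.

    Indices 0..n-1 select from the first input and n..2n-1 from the second.
    A mask is a blend when lane i holds index i or index i + n.
    """
    n = len(mask)
    imm = 0
    for lane, index in enumerate(mask):
        if index < 0 or index == lane:
            continue              # undef lane, or lane kept from the first input
        if index == lane + n:
            imm |= 1 << lane      # lane taken from the second input
        else:
            return None           # element changes lanes: not a blend
    return imm

def decode_shufps(imm):
    """Split a shufps immediate into its four 2-bit element selectors."""
    imm &= 0xFF                   # the byte may be printed signed, e.g. -40
    return [(imm >> (2 * lane)) & 3 for lane in range(4)]

print(blend_immediate([0, 5, 2, 7]))   # the mask from the example above
print(blend_immediate([1, 0, 2, 3]))   # a lane-crossing mask
print(decode_shufps(-40))
```

For the mask <0,5,2,7> the sketch yields 10, the immediate in the vblendps line above, and decoding $-40 gives the selectors 0,2,1,3 shown in the vshufps comments.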
>
>
> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
> mask identifies a blend. At the moment the new lowering logic is very
> aggressively emitting insertps instead of cheaper blendps.
>
> Example:
> ;;;
> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>  ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>  vblendps  $11, %xmm0, %xmm1, %xmm0   # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>  vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>
>
> 3) When a shuffle performs an insert at index 0 we always generate an
> insertps, while a movss would do a better job.
> ;;;
> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>  ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>  vmovss %xmm1, %xmm0, %xmm0
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
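The preference order suggested in points 2 and 3 (movss for an insert at index 0, blendps when the element stays in its lane, insertps otherwise) can be written down as a small classifier. This is a hedged Python sketch for 4-lane masks only; `classify_single_insert` is a hypothetical helper, not an LLVM function, and the insertps immediate assumes the usual layout of source element in bits 7:6, destination lane in bits 5:4 and a zero mask of 0.

```python
def classify_single_insert(mask):
    """Suggest an instruction for a 4-lane mask that inserts one element.

    Returns (mnemonic, immediate), or None when the mask is not the first
    input with exactly one lane replaced by an element of the second input.
    """
    n = 4
    if len(mask) != n:
        return None
    replaced = [(lane, idx - n) for lane, idx in enumerate(mask) if idx >= n]
    kept_ok = all(idx == lane for lane, idx in enumerate(mask) if idx < n)
    if len(replaced) != 1 or not kept_ok:
        return None
    dst, src = replaced[0]
    if dst == 0 and src == 0:
        return ("movss", None)                    # low element only
    if dst == src:
        return ("blendps", 1 << dst)              # same lane in both inputs
    return ("insertps", (src << 6) | (dst << 4))  # general case, no zeroing

print(classify_single_insert([4, 1, 2, 3]))  # the mask from @baz
print(classify_single_insert([0, 1, 6, 3]))  # one element kept in its lane
print(classify_single_insert([0, 1, 2, 4]))  # element moved to another lane
```

The first call reports movss, matching the default lowering's output for @baz; masks that replace more than one lane fall outside this sketch and return None.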
>
> I hope this is useful. We would be happy to contribute patches to
> improve some of the above cases, but we obviously know that this is
> still a work in progress, so we don't want to introduce conflicts with
> your work. Please let us know what you think.
>
> We will keep looking at this and follow up with any further findings.
>
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Inc.
>
> On Mon, Sep 8, 2014 at 6:08 PM, Quentin Colombet <qcolombet at apple.com>
> wrote:
>
> Hi Chandler,
>
> Forget about what I said.
> It seems I have some weird dependencies in my build system.
> My binaries are out-of-sync.
>
> Let me sort that out; it is likely the problem is already fixed, and I
> can resume the measurements.
>
> Sorry for the noise.
>
> Q.
>
> On Sep 8, 2014, at 9:32 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>
> On Sep 7, 2014, at 8:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Sure,
>
> Here is the command line:
> clang -cc1 -triple x86_64-apple-macosx -S -disable-free
> -disable-llvm-verifier -main-file-name tmp.i -mrelocation-model pic
> -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu
> core-avx-i  -O3  -ferror-limit 19 -fmessage-length 114 -stack-protector 1
> -mstackrealign -fblocks  -fencode-extended-block-signature
> -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics
> -vectorize-loops -vectorize-slp -mllvm
> -x86-experimental-vector-shuffle-lowering=true -o tmp.s -x cpp-output tmp.i
>
> This was with trunk 215249.
>
> I meant, r217281.
>
>
> Thanks,
> -Quentin
>
> <tmp.i>
>
> On Sep 6, 2014, at 4:27 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> I've run the SingleSource test suite for core-avx-i and have no failures
> here so a preprocessed file + commandline would be very useful if this
> reproduces for you still.
>
> On Sat, Sep 6, 2014 at 4:07 PM, Chandler Carruth <chandlerc at gmail.com>
> wrote:
>
>
> I'm having trouble reproducing this. I'm trying to get LNT to actually
> run, but manually compiling the given source file didn't reproduce it for
> me.
>
> It might have been fixed recently (although I'd be surprised if so), but
> it would help to get the actual command line for which compiling this file
> in the test suite failed.
>
> -Chandler
>
> On Fri, Sep 5, 2014 at 4:36 PM, Quentin Colombet <qcolombet at apple.com>
> wrote:
>
>
> Hi Chandler,
>
> While doing the performance measurement on an Ivy Bridge, I ran into
> compile time errors.
>
> I saw a bunch of "cannot select" errors in the LLVM test suite with
> -march=core-avx-i.
> E.g., SingleSource/UnitTests/Vector/SSE/sse.isamax.c is failing at O3
> -march=core-avx-i with:
> fatal error: error in backend: Cannot select: 0x7f91b99a6420: v4i32 = bitcast 0x7f91b99b0e10 [ORD=3] [ID=27]
>  0x7f91b99b0e10: v4i64 = insert_subvector 0x7f91b99a7210, 0x7f91b99a6d68, 0x7f91b99ace70 [ORD=2] [ID=25]
>    0x7f91b99a7210: v4i64 = undef [ID=15]
>    0x7f91b99a6d68: v2i64 = scalar_to_vector 0x7f91b99ab840 [ORD=2] [ID=23]
>      0x7f91b99ab840: i64 = AssertZext 0x7f91b99acc60, 0x7f91b99ac738 [ORD=2] [ID=20]
>        0x7f91b99acc60: i64,ch = CopyFromReg 0x7f91b8d52820, 0x7f91b99a3a10 [ORD=2] [ID=16]
>          0x7f91b99a3a10: i64 = Register %vreg68 [ID=1]
>    0x7f91b99ace70: i64 = Constant<0> [ID=3]
> In function: isamax0
> clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
> clang version 3.6.0 (215249)
> Target: x86_64-apple-darwin14.0.0
>
> For some reason, I cannot reproduce the problem with the test case that
> clang gives me using -emit-llvm. Since the source is public, I guess you
> can try to reproduce on your side.
> Indeed, if you run the test-suite with -march=core-avx-i you’ll likely
> see all those failures.
>
> Let me know if you cannot and I’ll try harder to produce a test case.
>
> Note: This is the same failure all over the place, i.e., cannot select a
> bit cast from various types to v4i32 or v4i64.
>
> Thanks,
> -Quentin
>
>
> On Sep 5, 2014, at 11:09 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>
> Hi Chandler,
>
> On 5 September 2014 17:38, Chandler Carruth <chandlerc at gmail.com> wrote:
>
>
> On Fri, Sep 5, 2014 at 9:32 AM, Robert Lougher <rob.lougher at gmail.com>
> wrote:
>
>
> Unfortunately, another team, while doing internal testing, has seen the
> new path generating illegal insertps masks.  A sample here:
>
>   vinsertps    $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
>   vinsertps    $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
>   vinsertps    $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
>   vinsertps    $416, %xmm1, %xmm4, %xmm14 # xmm14 = xmm4[0,1],xmm1[2],xmm4[3]
>   vinsertps    $416, %xmm13, %xmm6, %xmm13 # xmm13 = xmm6[0,1],xmm13[2],xmm6[3]
>   vinsertps    $416, %xmm0, %xmm7, %xmm0 # xmm0 = xmm7[0,1],xmm0[2],xmm7[3]
>
> We'll continue to look into this and do additional testing.
>
>
>
> Interesting. Let me know if you get a test case. The insertps code path
> was added recently though and has been much less well tested. I'll start
> fuzz testing it and should hopefully uncover the bug.
>
>
> Here are two small test cases.  Hope they are of use.
>
> Thanks,
> Rob.
>
> ------
> define <4 x float> @test(<4 x float> %xyzw, <4 x float> %abcd) {
> %1 = extractelement <4 x float> %xyzw, i32 0
> %2 = insertelement <4 x float> undef, float %1, i32 0
> %3 = insertelement <4 x float> %2, float 0.000000e+00, i32 1
> %4 = shufflevector <4 x float> %3, <4 x float> %xyzw, <4 x i32> <i32 0, i32 1, i32 6, i32 undef>
> %5 = shufflevector <4 x float> %4, <4 x float> %abcd, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
> ret <4 x float> %5
> }
>
> define <4 x float> @test2(<4 x float> %xyzw, <4 x float> %abcd) {
> %1 = shufflevector <4 x float> %xyzw, <4 x float> %abcd, <4 x i32> <i32 0, i32 undef, i32 2, i32 4>
> %2 = shufflevector <4 x float> <float undef, float 0.000000e+00, float undef, float undef>, <4 x float> %1, <4 x i32> <i32 4, i32 1, i32 6, i32 7>
> ret <4 x float> %2
> }
>
>
> llc -march=x86-64 -mattr=+avx test.ll -o -
>
> test:                                   # @test
>   vxorps    %xmm2, %xmm2, %xmm2
>   vmovss    %xmm0, %xmm2, %xmm2
>   vblendps    $4, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>   vinsertps    $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   retl
>
> test2:                                  # @test2
>   vinsertps    $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   vxorps    %xmm1, %xmm1, %xmm1
>   vblendps    $13, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>   retl
>
> llc -march=x86-64 -mattr=+avx -x86-experimental-vector-shuffle-lowering test.ll -o -
>
> test:                                   # @test
>   vinsertps    $270, %xmm0, %xmm0, %xmm2 # xmm2 = xmm0[0],zero,zero,zero
>   vinsertps    $416, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>   vinsertps    $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   retl
>
> test2:                                  # @test2
>   vinsertps    $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   vxorps    %xmm1, %xmm1, %xmm1
>   vinsertps    $336, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>   retl
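The insertps immediate is a single byte: bits 7:6 pick the source element, bits 5:4 the destination lane, and bits 3:0 are a zero mask. The Python sketch below decodes the immediates that appear in this message and rejects those that do not fit in a byte; `decode_insertps` is a hypothetical helper written for illustration and makes no claim about where the backend goes wrong.

```python
def decode_insertps(imm):
    """Decode an insertps immediate into (source, destination, zero_mask).

    Raises ValueError when the value does not fit in a byte, which is what
    makes operands such as $256 or $416 illegal.
    """
    if not -128 <= imm <= 255:
        raise ValueError(f"insertps immediate {imm} does not fit in 8 bits")
    imm &= 0xFF  # accept the signed spelling of a byte as well
    return (imm >> 6) & 3, (imm >> 4) & 3, imm & 0xF

# Immediates taken from the assembly listings in this message.
for imm in (48, 256, 270, 304, 336, 416):
    try:
        src, dst, zmask = decode_insertps(imm)
        print(f"${imm}: src={src} dst={dst} zmask={zmask:04b}")
    except ValueError as err:
        src, dst, zmask = decode_insertps(imm & 0xFF)
        print(f"{err}; low byte gives src={src} dst={dst} zmask={zmask:04b}")
```

In each illegal case the low byte alone decodes to the shuffle printed in the assembly comment; for example, the low byte of $304 is $48, the immediate the default lowering emits for the same insertion.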
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev