[PATCH] D40865: X86 AVX2: Prefer one VPERMV over ShuffleAsRepeatedMaskAndLanePermute

Thu Dec 7 06:00:03 PST 2017

zvi added a comment.

@spatel, this patch is for lowerV8I32VectorShuffle(), which won't be called for AVX1-only targets. It would be nice if we could somehow get AVX covered as well, if profitable.
I did not observe any speedups with this patch, but FWIW IACA reports that (for Intel processors, of course) the throughput can be higher even if the load is not hoisted.
What triggered this patch was a case I discovered while working on the deprecation of llvm.x86.avx2.permd and llvm.x86.avx2.permps. After removing these intrinsics, that case ends up as:

  define <8 x i32> @shuffle_test_vpermd(<8 x i32> %a0) {
    %1 = shufflevector <8 x i32> %a0, <8 x i32> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
    ret <8 x i32> %1
  }

So without this patch, we would be generating regressed code for Intel but faster code for AMD, according to what was posted in the comments above.
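
As a rough illustration of the two lowerings being compared for this reverse mask, here is a scalar C++ model (no AVX2 hardware needed). It assumes the two-instruction ShuffleAsRepeatedMaskAndLanePermute form is a vpshufd followed by a vpermq, with immediates 0x1B and 0x4E. Those immediates and all of the model_* / reverse_* names are hypothetical, chosen for this sketch and not taken from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V8 = std::array<std::uint32_t, 8>;

// Index vector for the reverse mask <7, 6, 5, 4, 3, 2, 1, 0> from the IR above.
const std::array<std::uint8_t, 8> kReverse = {{7, 6, 5, 4, 3, 2, 1, 0}};

// Scalar model of VPERMD: each output element may select any of the 8 inputs,
// so lane-crossing moves are possible in one instruction.
V8 model_vpermd(const V8 &src, const std::array<std::uint8_t, 8> &idx) {
  V8 out{};
  for (int i = 0; i < 8; ++i) {
    assert(idx[i] < 8 && "index must address one of the 8 elements");
    out[i] = src[idx[i]];
  }
  return out;
}

// Scalar model of 256-bit VPSHUFD: the same 4-element shuffle (2 bits per
// output element in imm) is repeated in each 128-bit lane; no lane crossing.
V8 model_vpshufd(const V8 &src, std::uint8_t imm) {
  V8 out{};
  for (int lane = 0; lane < 2; ++lane)
    for (int i = 0; i < 4; ++i)
      out[lane * 4 + i] = src[lane * 4 + ((imm >> (2 * i)) & 3)];
  return out;
}

// Scalar model of VPERMQ: permutes the four 64-bit elements (pairs of i32),
// which is what lets it move data across the 128-bit lane boundary.
V8 model_vpermq(const V8 &src, std::uint8_t imm) {
  V8 out{};
  for (int q = 0; q < 4; ++q) {
    int sel = (imm >> (2 * q)) & 3;
    out[2 * q] = src[2 * sel];
    out[2 * q + 1] = src[2 * sel + 1];
  }
  return out;
}

// One-instruction form: a single VPERMD with a constant index vector
// (which has to be loaded from memory or kept in a register).
V8 reverse_one_insn(const V8 &a) { return model_vpermd(a, kReverse); }

// Assumed two-instruction form: reverse within each lane (imm 0x1B),
// then swap the two 128-bit lanes (imm 0x4E).
V8 reverse_two_insns(const V8 &a) {
  return model_vpermq(model_vpshufd(a, 0x1B), 0x4E);
}
```

Both functions compute the same permutation. The trade-off under discussion is one lane-crossing shuffle plus an index-vector load versus two shuffles with immediate operands.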

@RKSimon, I'm not too familiar with the MachineCombiner. Are there already any shuffle cases that are handled or was that wishful thinking? :)

https://reviews.llvm.org/D40865