<div dir="ltr">clang test fixed to account for this patch in r232857:<br><a href="http://llvm.org/viewvc/llvm-project?view=revision&revision=232857">http://llvm.org/viewvc/llvm-project?view=revision&revision=232857</a><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 20, 2015 at 4:29 PM, Sanjay Patel <span dir="ltr"><<a href="mailto:spatel@rotateright.com" target="_blank">spatel@rotateright.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">This checkin caused a failure in a clang regression test. Working on a testcase fix now.<br></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 20, 2015 at 3:47 PM, Sanjay Patel <span dir="ltr"><<a href="mailto:spatel@rotateright.com" target="_blank">spatel@rotateright.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Author: spatel<br>

Date: Fri Mar 20 16:47:56 2015<br>

New Revision: 232852<br>

<br>

URL: <a href="http://llvm.org/viewvc/llvm-project?rev=232852&view=rev" target="_blank">http://llvm.org/viewvc/llvm-project?rev=232852&view=rev</a><br>

Log:<br>

[X86, AVX] instcombine common cases of vperm2* intrinsics into shuffles<br>

<br>

vperm2* intrinsics are just shuffles.<br>

In a few special cases, they're not even shuffles.<br>

<br>

Optimizing intrinsics in InstCombine is better than<br>

handling this in the front-end for at least two reasons:<br>

<br>

1. Optimizing custom-written SSE intrinsic code at -O0 makes vector coders<br>

   really angry (and so I have regrets about some patches from last week).<br>

<br>

2. Doing mask conversion logic in header files is hard to write and<br>

   subsequently read.<br>

<br>

There are a couple of TODOs in this patch to complete this optimization.<br>

<br>

Differential Revision: <a href="http://reviews.llvm.org/D8486" target="_blank">http://reviews.llvm.org/D8486</a><br>
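<br>
The "simple shuffle" case in the log above comes down to a little index arithmetic on the immediate byte. A minimal standalone sketch, assuming only the standard library (the name vperm2ShuffleMask is hypothetical and not part of the patch; an empty result stands in for "not a simple shuffle"):<br>

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not LLVM API: computes the shufflevector mask that a
// perm2*128 immediate describes when neither zero bit (bit 3, bit 7) is set.
// Indices [0, numElts) refer to the first source, [numElts, 2*numElts) to the
// second. Returns an empty vector when a zero bit is set.
std::vector<int> vperm2ShuffleMask(uint8_t imm, unsigned numElts) {
  if ((imm & 0x88) != 0)
    return {};

  unsigned halfSize = numElts / 2;
  // Bits [1:0] pick the 128-bit half for the low part of the result,
  // bits [5:4] pick the half for the high part. Bits 2 and 6 are ignored.
  unsigned lowBegin = (imm & 0x3) * halfSize;
  unsigned highBegin = ((imm >> 4) & 0x3) * halfSize;

  std::vector<int> mask(numElts);
  for (unsigned i = 0; i != halfSize; ++i) {
    mask[i] = static_cast<int>(lowBegin + i);
    mask[halfSize + i] = static_cast<int>(highBegin + i);
  }
  return mask;
}
```

For example, an immediate of 0x31 on an 8-element vector selects the high half of each source, which is the mask the perm2ps_0x31 test in this patch expects.<br>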

<br>

<br>

Added:<br>

    llvm/trunk/test/Transforms/InstCombine/x86-vperm2.ll<br>

Modified:<br>

    llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp<br>

<br>

Modified: llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp<br>

URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp?rev=232852&r1=232851&r2=232852&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp?rev=232852&r1=232851&r2=232852&view=diff</a><br>

==============================================================================<br>

--- llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp (original)<br>

+++ llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp Fri Mar 20 16:47:56 2015<br>

@@ -197,6 +197,57 @@ Instruction *InstCombiner::SimplifyMemSe<br>

   return nullptr;<br>

 }<br>

<br>

+/// The shuffle mask for a perm2*128 selects any two halves of two 256-bit<br>

+/// source vectors, unless a zero bit is set. If a zero bit is set,<br>

+/// then ignore that half of the mask and clear that half of the vector.<br>

+static Value *SimplifyX86vperm2(const IntrinsicInst &II,<br>

+                                InstCombiner::BuilderTy &Builder) {<br>

+  if (auto CInt = dyn_cast<ConstantInt>(II.getArgOperand(2))) {<br>

+    VectorType *VecTy = cast<VectorType>(II.getType());<br>

+    uint8_t Imm = CInt->getZExtValue();<br>

+<br>

+    // The immediate permute control byte looks like this:<br>

+    //    [1:0] - select 128 bits from sources for low half of destination<br>

+    //    [2]   - ignore<br>

+    //    [3]   - zero low half of destination<br>

+    //    [5:4] - select 128 bits from sources for high half of destination<br>

+    //    [6]   - ignore<br>

+    //    [7]   - zero high half of destination<br>

+<br>

+    if ((Imm & 0x88) == 0x88) {<br>

+      // If both zero mask bits are set, this was just a weird way to<br>

+      // generate a zero vector.<br>

+      return ConstantAggregateZero::get(VecTy);<br>

+    }<br>

+<br>

+    // TODO: If a single zero bit is set, replace one of the source operands<br>

+    // with a zero vector and use the same mask generation logic as below.<br>

+<br>

+    if ((Imm & 0x88) == 0x00) {<br>

+      // If neither zero mask bit is set, this is a simple shuffle.<br>

+      unsigned NumElts = VecTy->getNumElements();<br>

+      unsigned HalfSize = NumElts / 2;<br>

+      unsigned HalfBegin;<br>

+      SmallVector<int, 8> ShuffleMask(NumElts);<br>

+<br>

+      // Permute low half of result.<br>

+      HalfBegin = (Imm & 0x3) * HalfSize;<br>

+      for (unsigned i = 0; i != HalfSize; ++i)<br>

+        ShuffleMask[i] = HalfBegin + i;<br>

+<br>

+      // Permute high half of result.<br>

+      HalfBegin = ((Imm >> 4) & 0x3) * HalfSize;<br>

+      for (unsigned i = HalfSize; i != NumElts; ++i)<br>

+        ShuffleMask[i] = HalfBegin + i - HalfSize;<br>

+<br>

+      Value *Op0 = II.getArgOperand(0);<br>

+      Value *Op1 = II.getArgOperand(1);<br>

+      return Builder.CreateShuffleVector(Op0, Op1, ShuffleMask);<br>

+    }<br>

+  }<br>

+  return nullptr;<br>

+}<br>

+<br>

 /// visitCallInst - CallInst simplification.  This mostly only handles folding<br>

 /// of intrinsic instructions.  For normal calls, it allows visitCallSite to do<br>

 /// the heavy lifting.<br>

@@ -904,6 +955,14 @@ Instruction *InstCombiner::visitCallInst<br>

     return ReplaceInstUsesWith(CI, Shuffle);<br>

   }<br>

<br>

+  case Intrinsic::x86_avx_vperm2f128_pd_256:<br>

+  case Intrinsic::x86_avx_vperm2f128_ps_256:<br>

+  case Intrinsic::x86_avx_vperm2f128_si_256:<br>

+    // TODO: Add the AVX2 version of this instruction.<br>

+    if (Value *V = SimplifyX86vperm2(*II, *Builder))<br>

+      return ReplaceInstUsesWith(*II, V);<br>

+    break;<br>

+<br>

   case Intrinsic::ppc_altivec_vperm:<br>

     // Turn vperm(V1,V2,mask) -> shuffle(V1,V2,mask) if mask is a constant.<br>

     // Note that ppc_altivec_vperm has a big-endian bias, so when creating<br>

<br>

Added: llvm/trunk/test/Transforms/InstCombine/x86-vperm2.ll<br>

URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/InstCombine/x86-vperm2.ll?rev=232852&view=auto" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/InstCombine/x86-vperm2.ll?rev=232852&view=auto</a><br>

==============================================================================<br>

--- llvm/trunk/test/Transforms/InstCombine/x86-vperm2.ll (added)<br>

+++ llvm/trunk/test/Transforms/InstCombine/x86-vperm2.ll Fri Mar 20 16:47:56 2015<br>

@@ -0,0 +1,236 @@<br>

+; RUN: opt < %s -instcombine -S | FileCheck %s<br>

+<br>

+; This should never happen, but make sure we don't crash handling a non-constant immediate byte.<br>

+<br>

+define <4 x double> @perm2pd_non_const_imm(<4 x double> %a0, <4 x double> %a1, i8 %b) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 %b)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_non_const_imm<br>

+; CHECK-NEXT:  call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 %b)<br>

+; CHECK-NEXT:  ret <4 x double><br>

+}<br>

+<br>

+<br>

+; In the following 3 tests, both zero mask bits of the immediate are set.<br>

+<br>

+define <4 x double> @perm2pd_0x88(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 136)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x88<br>

+; CHECK-NEXT:  ret <4 x double> zeroinitializer<br>

+}<br>

+<br>

+define <8 x float> @perm2ps_0x88(<8 x float> %a0, <8 x float> %a1) {<br>

+  %res = call <8 x float> @llvm.x86.avx.vperm2f128.ps.256(<8 x float> %a0, <8 x float> %a1, i8 136)<br>

+  ret <8 x float> %res<br>

+<br>

+; CHECK-LABEL: @perm2ps_0x88<br>

+; CHECK-NEXT:  ret <8 x float> zeroinitializer<br>

+}<br>

+<br>

+define <8 x i32> @perm2si_0x88(<8 x i32> %a0, <8 x i32> %a1) {<br>

+  %res = call <8 x i32> @llvm.x86.avx.vperm2f128.si.256(<8 x i32> %a0, <8 x i32> %a1, i8 136)<br>

+  ret <8 x i32> %res<br>

+<br>

+; CHECK-LABEL: @perm2si_0x88<br>

+; CHECK-NEXT:  ret <8 x i32> zeroinitializer<br>

+}<br>

+<br>

+<br>

+; The other control bits are ignored when zero mask bits of the immediate are set.<br>

+<br>

+define <4 x double> @perm2pd_0xff(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 255)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0xff<br>

+; CHECK-NEXT:  ret <4 x double> zeroinitializer<br>

+}<br>

+<br>

+<br>

+; The following 16 tests are simple shuffles, except for 2 cases where we can just return one of the<br>

+; source vectors. Verify that we generate the right shuffle masks and undef source operand where possible.<br>

+<br>

+define <4 x double> @perm2pd_0x00(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 0)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x00<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x01(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 1)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x01<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x02(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 2)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x02<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 4, i32 5, i32 0, i32 1><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x03(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 3)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x03<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 6, i32 7, i32 0, i32 1><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x10(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 16)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x10<br>

+; CHECK-NEXT:  ret <4 x double> %a0<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x11(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 17)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x11<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 2, i32 3><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x12(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 18)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x12<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 4, i32 5, i32 2, i32 3><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x13(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 19)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x13<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 6, i32 7, i32 2, i32 3><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x20(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 32)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x20<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 0, i32 1, i32 4, i32 5><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x21(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 33)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x21<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 2, i32 3, i32 4, i32 5><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x22(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 34)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x22<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a1, <4 x double> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x23(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 35)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x23<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a1, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x30(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 48)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x30<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 0, i32 1, i32 6, i32 7><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x31(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 49)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x31<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 2, i32 3, i32 6, i32 7><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x32(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 50)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x32<br>

+; CHECK-NEXT:  ret <4 x double> %a1<br>

+}<br>

+<br>

+define <4 x double> @perm2pd_0x33(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 51)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x33<br>

+; CHECK-NEXT:  %1 = shufflevector <4 x double> %a1, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 2, i32 3><br>

+; CHECK-NEXT:  ret <4 x double> %1<br>

+}<br>

+<br>

+; Confirm that a mask for 32-bit elements is also correct.<br>

+<br>

+define <8 x float> @perm2ps_0x31(<8 x float> %a0, <8 x float> %a1) {<br>

+  %res = call <8 x float> @llvm.x86.avx.vperm2f128.ps.256(<8 x float> %a0, <8 x float> %a1, i8 49)<br>

+  ret <8 x float> %res<br>

+<br>

+; CHECK-LABEL: @perm2ps_0x31<br>

+; CHECK-NEXT:  %1 = shufflevector <8 x float> %a0, <8 x float> %a1, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15><br>

+; CHECK-NEXT:  ret <8 x float> %1<br>

+}<br>

+<br>

+<br>

+; Confirm that when a single zero mask bit is set, we do nothing.<br>

+<br>

+define <4 x double> @perm2pd_0x83(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 131)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x83<br>

+; CHECK-NEXT:  call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 -125)<br>

+; CHECK-NEXT:  ret <4 x double><br>

+}<br>

+<br>

+<br>

+; Confirm that when the other zero mask bit is set, we do nothing. Also confirm that an ignored bit has no effect.<br>

+<br>

+define <4 x double> @perm2pd_0x48(<4 x double> %a0, <4 x double> %a1) {<br>

+  %res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 72)<br>

+  ret <4 x double> %res<br>

+<br>

+; CHECK-LABEL: @perm2pd_0x48<br>

+; CHECK-NEXT:  call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 72)<br>

+; CHECK-NEXT:  ret <4 x double><br>

+}<br>

+<br>

+declare <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double>, <4 x double>, i8) nounwind readnone<br>

+declare <8 x float> @llvm.x86.avx.vperm2f128.ps.256(<8 x float>, <8 x float>, i8) nounwind readnone<br>

+declare <8 x i32> @llvm.x86.avx.vperm2f128.si.256(<8 x i32>, <8 x i32>, i8) nounwind readnone<br>

+<br>

<br>
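The first TODO in SimplifyX86vperm2 (exactly one zero bit set) is left open by this patch. One possible way to approach it, sketched with standard-library types only: since the surviving half reads from a single source, that source can be shuffled against a zero vector. The struct and function names below are hypothetical, and this is an illustration of the idea in the TODO, not the committed implementation.<br>

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result type, not LLVM API.
struct ZeroHalfShuffle {
  bool valid = false;        // true only when exactly one zero bit is set
  unsigned keptOperand = 0;  // which intrinsic source (0 or 1) survives
  std::vector<int> mask;     // < numElts: kept source; >= numElts: zero vector
};

// Describes shuffle(keptSource, zeroVector, mask) for an immediate with
// exactly one of bit 3 (zero low half) or bit 7 (zero high half) set.
ZeroHalfShuffle vperm2SingleZeroMask(uint8_t imm, unsigned numElts) {
  ZeroHalfShuffle result;
  bool lowZero = (imm & 0x08) != 0;
  bool highZero = (imm & 0x80) != 0;
  if (lowZero == highZero)
    return result;  // both or neither set: handled by the other cases

  unsigned halfSize = numElts / 2;
  // Selector bits of the half that is not zeroed.
  unsigned sel = highZero ? (imm & 0x3) : ((imm >> 4) & 0x3);
  result.valid = true;
  result.keptOperand = sel >> 1;                // bit 1 picks the source
  unsigned keptBegin = (sel & 0x1) * halfSize;  // bit 0 picks low/high half

  result.mask.resize(numElts);
  for (unsigned i = 0; i != halfSize; ++i) {
    int kept = static_cast<int>(keptBegin + i);
    int zero = static_cast<int>(numElts + i);  // element of the zero vector
    result.mask[i] = lowZero ? zero : kept;
    result.mask[halfSize + i] = highZero ? zero : kept;
  }
  return result;
}
```

Under these assumptions, the immediate 0x83 from the perm2pd_0x83 test would describe a shuffle of %a1 against zeroinitializer with mask <2, 3, 4, 5>.<br>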

<br>


</blockquote></div><br></div>

</div></div></blockquote></div><br></div>