[PATCH] Lower certain build_vectors to insertps instructions

Sun Apr 27 00:16:40 PDT 2014

Let me know if you want these changed.

Thanks,
Filipe

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:5418
@@ +5417,3 @@
+  SDValue V = FirstNonZero.getOperand(0);
+  unsigned CorrectIdx = cast<ConstantSDNode>(FirstNonZero.getOperand(1))
+                            ->getZExtValue() == FirstNonZeroIdx;
----------------
Elena Demikhovsky wrote:
> CorrectIdx here is boolean - 1 or 0. Further you do ++.
CorrectIdx is an unsigned int. It gets initialized to 0 or 1 according to that test.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:5432
@@ +5431,3 @@
+    // ex: Getting one element from a vector, and the rest from another.
+    if (Elem.getOperand(0) != V)
+      return SDValue();
----------------
Elena Demikhovsky wrote:
> Looks like you are looking for a splat vector. But if you want to use INSERTPS, your build-vector should include only one non-zero element.
> 
> 
Not necessarily a splat vector.
Right now I'm only doing the optimization for a build_vector of elements from one single vector. But for an INSERTPS we can have up to 3 (non-zero) elements from one vector (if their destination index is the same as the source index) and one (non-zero) vector from another.
We can also set any position to zero.

This test is here to bail early and serve as a start for further optimizations. But I think if I optimized for every case, this diff would be too big. Optimizing for every case will also take time, since the IR for this can vary wildly (extractelement + insertelement + vectorshuffle, etc).

================
Comment at: test/CodeGen/X86/sse41.ll:331
@@ +330,3 @@
+; CHECK: ret
+  %vecext = extractelement <4 x float> %x, i32 0
+  %vecinit = insertelement <4 x float> undef, float %vecext, i32 0
----------------
Elena Demikhovsky wrote:
> Could you, please, explain what code you expect to see here? Is it only one insertps instruction?
> Usually, such extract-insert chain we have for matrix transpose. But in this case the elements are extracted from different vectors. 
Eventually all these shuf_???? should be reduced to a single insertps instruction.
Right now my patch doesn't accomplish this, since we need to match several other cases (and also match on lowershufflevector).

I can match the exact code that should be emitted, if you prefer.
I can also match more code in the functions that aren't yet fully reduced to an insertps instruction.
Let me know if you want me to do any of these.

I figured splitting this optimization would be easier to review and accept and it still improves our code generation gradually.

http://reviews.llvm.org/D3521