[PATCH] Lower certain build_vectors to insertps instructions

Sun Apr 27 00:25:33 PDT 2014

Let's assume that you need to insert 2 or 3 elements that extracted from vector X.
But you create an INSERPS node that puts one element to UNDEF.

return DAG.getNode(X86ISD::INSERTPS, dl, VT, V, DAG.getUNDEF(VT),
                     InsertpsMask);

-  Elena

-----Original Message-----
From: Filipe Cabecinhas [mailto:filcab+llvm.phabricator at gmail.com] 
Sent: Sunday, April 27, 2014 10:17
To: filcab+llvm.phabricator at gmail.com; nrotem at apple.com; craig.topper at gmail.com; Demikhovsky, Elena
Cc: llvm-commits at cs.uiuc.edu
Subject: Re: [PATCH] Lower certain build_vectors to insertps instructions

Let me know if you want these changed.

Thanks,
Filipe

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:5418
@@ +5417,3 @@
+  SDValue V = FirstNonZero.getOperand(0);  unsigned CorrectIdx = 
+ cast<ConstantSDNode>(FirstNonZero.getOperand(1))
+                            ->getZExtValue() == FirstNonZeroIdx;
----------------
Elena Demikhovsky wrote:
> CorrectIdx here is boolean - 1 or 0. Further you do ++.
CorrectIdx is an unsigned int. It gets initialized to 0 or 1 according to that test.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:5432
@@ +5431,3 @@
+    // ex: Getting one element from a vector, and the rest from another.
+    if (Elem.getOperand(0) != V)
+      return SDValue();
----------------
Elena Demikhovsky wrote:
> Looks like you are looking for a splat vector. But if you want to use INSERTPS, your build-vector should include only one non-zero element.
> 
> 
Not necessarily a splat vector.
Right now I'm only doing the optimization for a build_vector of elements from one single vector. But for an INSERTPS we can have up to 3 (non-zero) elements from one vector (if their destination index is the same as the source index) and one (non-zero) vector from another.
We can also set any position to zero.

This test is here to bail early and serve as a start for further optimizations. But I think if I optimized for every case, this diff would be too big. Optimizing for every case will also take time, since the IR for this can vary wildly (extractelement + insertelement + vectorshuffle, etc).

================
Comment at: test/CodeGen/X86/sse41.ll:331 @@ +330,3 @@
+; CHECK: ret
+  %vecext = extractelement <4 x float> %x, i32 0
+  %vecinit = insertelement <4 x float> undef, float %vecext, i32 0
----------------
Elena Demikhovsky wrote:
> Could you, please, explain what code you expect to see here? Is it only one insertps instruction?
> Usually, such extract-insert chain we have for matrix transpose. But in this case the elements are extracted from different vectors. 
Eventually all these shuf_???? should be reduced to a single insertps instruction.
Right now my patch doesn't accomplish this, since we need to match several other cases (and also match on lowershufflevector).

I can match the exact code that should be emitted, if you prefer.
I can also match more code in the functions that aren't yet fully reduced to an insertps instruction.
Let me know if you want me to do any of these.

I figured splitting this optimization would be easier to review and accept and it still improves our code generation gradually.

http://reviews.llvm.org/D3521

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.