[PATCH] D11439: [X86][SSE] Vectorize i64 ASHR operations
Simon Pilgrim
llvm-dev at redking.me.uk
Tue Jul 28 10:47:22 PDT 2015
RKSimon added a comment.
Thanks Quentin, if it's not too much trouble, please could you check the sibling patch to this one (http://reviews.llvm.org/D11327)?
================
Comment at: lib/Target/X86/X86ISelLowering.cpp:17457
@@ +17456,3 @@
+ // M = SIGN_BIT u>> A
+ // R s>> a === ((R u>> A) ^ M) - M
+ if ((VT == MVT::v2i64 || (VT == MVT::v4i64 && Subtarget->hasInt256())) &&
----------------
qcolombet wrote:
> It wasn’t immediately clear to me that s>> and u>> referred to signed and unsigned shifts.
> Use lshr and ashr instead, as in LLVM IR (or the SD name variants if you prefer).
No problem.
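For reference, here is a minimal scalar sketch of that identity in C++ (my own illustration, not code from the patch):

  #include <cstdint>

  // Illustration only: emulate a 64-bit arithmetic shift right using a
  // logical shift, via M = SIGN_BIT lshr A and R ashr A == ((R lshr A) ^ M) - M.
  // The logical shift fills the top bits with zeros; xor-ing in the shifted
  // sign-bit mask M and then subtracting it sign-extends negative values.
  static int64_t ashr_via_lshr(int64_t r, unsigned a) {
    uint64_t m = (UINT64_C(1) << 63) >> a;      // M = SIGN_BIT lshr A
    uint64_t u = static_cast<uint64_t>(r) >> a; // R lshr A
    return static_cast<int64_t>((u ^ m) - m);   // ((R lshr A) ^ M) - M
  }

The vector lowering applies the same trick per lane, which is what lets it avoid needing a native 64-bit arithmetic shift.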
================
Comment at: test/CodeGen/X86/vector-shift-ashr-128.ll:27
@@ -27,1 +26,3 @@
+; SSE2-NEXT: xorpd %xmm4, %xmm2
+; SSE2-NEXT: psubq %xmm4, %xmm2
; SSE2-NEXT: movdqa %xmm2, %xmm0
----------------
qcolombet wrote:
> Is this sequence actually better?
>
> I guess the GPR-to-vector and vector-to-GPR copies are expensive enough that this is the case.
> Just double checking.
It's very target-dependent - on my old Penryn, avoiding GPR/SSE transfers is by far the best option. Jaguar/Sandy Bridge don't care that much at the 64-bit integer level (it will probably come down to register pressure issues). Haswell has AVX2, so we can do per-lane v2i64 logical shifts, which is where this patch really flies. In 32-bit mode it's always best to avoid trying to do (split) 64-bit shifts on GPRs (see D11327).
Overall, the general per-lane ashr v2i64 lowering is the weakest improvement, but the splat and constant cases gain a lot more.
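To make the SSE2 sequence above concrete, here is a rough C++ intrinsics sketch of the splat-amount case (a hypothetical helper, not the actual codegen; the compiler may well pick xorpd/psubq for the tail, as in the test diff above):

  #include <cstdint>
  #include <emmintrin.h> // SSE2

  // Sketch of the splat-amount v2i64 ashr emulation: SSE2 has 64-bit logical
  // shifts (psrlq) but no 64-bit arithmetic shift, so the xor/sub trick is
  // applied per lane. Everything stays in the vector domain, so there are no
  // GPR<->SSE transfers.
  static __m128i ashr_v2i64_splat(__m128i r, int amt) {
    const __m128i a    = _mm_cvtsi32_si128(amt);
    const __m128i sign = _mm_set1_epi64x(INT64_MIN);
    const __m128i m = _mm_srl_epi64(sign, a);     // M = SIGN_BIT lshr A
    const __m128i u = _mm_srl_epi64(r, a);        // R lshr A
    return _mm_sub_epi64(_mm_xor_si128(u, m), m); // ((R lshr A) ^ M) - M
  }

The variable per-lane case needs a bit more shuffling on plain SSE2, which is why AVX2's per-lane logical shifts help so much.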
Repository:
rL LLVM
http://reviews.llvm.org/D11439