[PATCH] [X86][SSE] Keep 4i32 vector insertions in integer domain on pre-SSE4.1 targets

Thu Dec 4 10:28:02 PST 2014

Added comments - I'll add a new patch using that movq/pshufd pattern shortly.

================
Comment at: test/CodeGen/X86/vector-shuffle-128-v4.ll:663-665
@@ -662,5 +662,5 @@
 ; SSE2:       # BB#0:
-; SSE2-NEXT:    xorps %xmm1, %xmm1
-; SSE2-NEXT:    movss %xmm0, %xmm1
-; SSE2-NEXT:    movaps %xmm1, %xmm0
+; SSE2-NEXT:    pxor %xmm1, %xmm1
+; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; SSE2-NEXT:    movq %xmm0, %xmm0
 ; SSE2-NEXT:    retq
----------------
chandlerc wrote:
> I think an even better pattern is: movq, pshufd 0,2,2,2?
> 
> Also, do we correctly match to movd when the source is a foldable load? I can't remember if there is a test case for that, but it's really important to not do a shuffle when just loading a single i32 from memory into an xmm register.
Yup - that'd be a nicer pattern (single register!) - easy enough to change.

There is an existing movd folded load pattern using VMOVDI2PDIrm - I haven't seen any tests for it but it does seem to work alright.
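
To make the comparison concrete, here is a rough lane-level model of the two sequences. It is only a sketch: registers are modelled as lists of four 32-bit lanes (lane 0 first), the helper names are hypothetical, and "0,2,2,2" is assumed to mean lane indices in the same notation the test checks use. It says nothing about instruction cost, only that both sequences produce the same lanes.

```python
# Lane-level model of the SSE2 instructions discussed above (not real codegen).
# A 128-bit register is a list of four 32-bit lanes, lane 0 first.
# All function names here are illustrative, not LLVM or intrinsic names.

def pxor_self():
    """pxor %xmm1, %xmm1: produce an all-zero register."""
    return [0, 0, 0, 0]

def punpckldq(a, b):
    """Interleave the low two lanes: a[0], b[0], a[1], b[1]."""
    return [a[0], b[0], a[1], b[1]]

def movq(a):
    """Keep the low 64 bits (lanes 0 and 1) and zero the upper 64 bits."""
    return [a[0], a[1], 0, 0]

def pshufd(a, mask):
    """Select lanes of a single source register by index."""
    return [a[i] for i in mask]

def patch_lowering(x):
    """Sequence in this patch: pxor + punpckldq + movq (two registers)."""
    return movq(punpckldq(x, pxor_self()))

def suggested_lowering(x):
    """Suggested sequence: movq + pshufd 0,2,2,2 (single register)."""
    return pshufd(movq(x), [0, 2, 2, 2])

x = [11, 22, 33, 44]
print(patch_lowering(x))
print(suggested_lowering(x))
```

Both calls keep lane 0 and zero the other three lanes. The pshufd only ever reads lane 0 and lane 2 of the movq result, and lane 2 is one of the lanes movq has already zeroed, so no separate zero register is needed.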

================
Comment at: test/CodeGen/X86/vector-shuffle-128-v4.ll:700-703
@@ -699,5 +699,6 @@
 ; SSE2:       # BB#0:
-; SSE2-NEXT:    xorps %xmm1, %xmm1
-; SSE2-NEXT:    movss %xmm0, %xmm1
-; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,0,1,1]
+; SSE2-NEXT:    pxor %xmm1, %xmm1
+; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; SSE2-NEXT:    movq %xmm0, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,0,1,1]
 ; SSE2-NEXT:    retq
----------------
chandlerc wrote:
> This highlights that our lowering for this is completely wrong. movq + pshufd is better even with SSE4.1, and movd + pshufd is better when we can fold the load....
I am seeing lowerVectorShuffleAsElementInsertion interfere with a number of better shuffle candidates for these kinds of patterns.

I'm also finding that we don't do a good job of tracking elements that are known to be zero - X86ISelLowering has computeZeroableShuffleElements but I'm starting to think about providing a better implementation inside the DAGCombiner instead. It'd need to know the difference between known zeros and zeroable, peek inside more ops etc. - but it could help a lot and there is no reason for this to be target specific.
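
As a rough illustration of the known-zero versus zeroable distinction, here is a simplified model. It is not the existing computeZeroableShuffleElements code or a proposed DAGCombiner API: the function name, the per-element "facts", and the UNDEF sentinel are all assumptions made for the sketch. The idea is that a lane is zeroable if it is free to become zero (undef mask entry, or a source element that is zero or undef), but only known zero if the source element really is zero.

```python
# Simplified model of tracking zero lanes through a two-input shuffle.
# Hypothetical names throughout; this is not LLVM's implementation.

UNDEF = -1  # mask sentinel: the value of this result lane does not matter

def classify_lanes(mask, v1, v2):
    """Return (known_zero, zeroable) boolean lists for a shuffle result.

    mask   : one index per result lane, indexing into v1 followed by v2,
             or UNDEF.
    v1, v2 : what is known about each source element, one of
             'zero', 'undef' or 'unknown'.
    """
    inputs = v1 + v2
    known_zero, zeroable = [], []
    for m in mask:
        if m == UNDEF:
            # An undef lane may be treated as zero, but is not known to be.
            known_zero.append(False)
            zeroable.append(True)
            continue
        fact = inputs[m]
        known_zero.append(fact == "zero")
        zeroable.append(fact in ("zero", "undef"))
    return known_zero, zeroable

# Insert element 0 of an unknown vector; other lanes come from a second
# vector whose elements are zero or undef, or are left undef by the mask.
v1 = ["unknown"] * 4
v2 = ["zero", "undef", "zero", "zero"]
known, may_zero = classify_lanes([0, 4, UNDEF, 5], v1, v2)
print(known)
print(may_zero)
```

In this example only lane 1 is known zero, while lanes 1 to 3 are all zeroable. A lowering that needs real zeros in the result can rely on the first list, whereas one that is merely free to write zeros can use the second.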

http://reviews.llvm.org/D6526