[PATCH] [X86][SSE] Keep 4i32 vector insertions in integer domain on pre-SSE4.1 targets
Simon Pilgrim
llvm-dev at redking.me.uk
Thu Dec 4 10:28:02 PST 2014
Added comments - I'll add a new patch using that movq/pshufd pattern shortly.
================
Comment at: test/CodeGen/X86/vector-shuffle-128-v4.ll:663-665
@@ -662,5 +662,5 @@
; SSE2: # BB#0:
-; SSE2-NEXT: xorps %xmm1, %xmm1
-; SSE2-NEXT: movss %xmm0, %xmm1
-; SSE2-NEXT: movaps %xmm1, %xmm0
+; SSE2-NEXT: pxor %xmm1, %xmm1
+; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; SSE2-NEXT: movq %xmm0, %xmm0
; SSE2-NEXT: retq
----------------
chandlerc wrote:
> I think an even better pattern is: movq, pshufd 0,2,2,2?
>
> Also, do we correctly match to movd when the source is a foldable load? I can't remember if there is a test case for that, but it's really important to not do a shuffle when just loading a single i32 from memory into an xmm register.
Yup - that'd be a nicer pattern (single register!) - easy enough to change.
There is an existing movd folded load pattern using VMOVDI2PDIrm - I haven't seen any tests for it but it does seem to work alright.
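For illustration only (this is not part of the patch), the equivalence of the two sequences for this test can be checked with a small lane-level model. A register is modelled as a list of four 32-bit lanes, lane 0 first; the helper names mirror the instruction mnemonics, and the function names patch_lowering and suggested_lowering are hypothetical. The sketch assumes the test wants lane 0 of the input kept and the other three lanes zeroed.

```python
# Lane-level model of the SSE2 instructions discussed above (illustrative only).
# A 128-bit register is a list of four 32-bit lanes, lane 0 first.

def pxor_self():
    """pxor %xmm1, %xmm1: an all-zero register."""
    return [0, 0, 0, 0]

def punpckldq(a, b):
    """Interleave the low two lanes of a and b: a[0], b[0], a[1], b[1]."""
    return [a[0], b[0], a[1], b[1]]

def movq(a):
    """Register-to-register movq: keep the low 64 bits, zero the upper 64."""
    return [a[0], a[1], 0, 0]

def pshufd(a, mask):
    """Result lane i takes source lane mask[i]."""
    return [a[i] for i in mask]

def patch_lowering(x):
    # Sequence in the updated CHECK lines: pxor, punpckldq, movq.
    zero = pxor_self()
    return movq(punpckldq(x, zero))

def suggested_lowering(x):
    # Sequence suggested in review: movq, then pshufd 0,2,2,2.
    return pshufd(movq(x), [0, 2, 2, 2])

x = [11, 22, 33, 44]
print(patch_lowering(x), suggested_lowering(x))
```

Both sequences leave only lane 0 populated, but the second needs no zeroed scratch register, which is the "single register" point above.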
================
Comment at: test/CodeGen/X86/vector-shuffle-128-v4.ll:700-703
@@ -699,5 +699,6 @@
; SSE2: # BB#0:
-; SSE2-NEXT: xorps %xmm1, %xmm1
-; SSE2-NEXT: movss %xmm0, %xmm1
-; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,0,1,1]
+; SSE2-NEXT: pxor %xmm1, %xmm1
+; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; SSE2-NEXT: movq %xmm0, %xmm0
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,0,1,1]
; SSE2-NEXT: retq
----------------
chandlerc wrote:
> This highlights that our lowering for this is completely wrong. movq + pshufd is better even with SSE4.1, and movd + pshufd is better when we can fold the load....
I am seeing lowerVectorShuffleAsElementInsertion interfere with a number of better shuffle candidates for these kinds of patterns.
I'm also finding that we don't do a good job of tracking elements that are known to be zero - X86ISelLowering has computeZeroableShuffleElements but I'm starting to think about providing a better implementation inside the DAGCombiner instead. It'd need to know the difference between known zeros and zeroable, peek inside more ops etc. - but it could help a lot and there is no reason for this to be target specific.
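One possible reading of the known-zero versus zeroable distinction, sketched as a simplified model rather than the actual X86ISelLowering or DAGCombiner code: a result lane is "known zero" only if it reads a source element known to be zero, while it is "zeroable" if it may be treated as zero, which also covers undefined mask entries. The function classify_lanes, the UNDEF sentinel, and the boolean input lists are all hypothetical names introduced for this sketch.

```python
# Simplified model (assumption): classify each result lane of a two-input shuffle.
UNDEF = -1  # hypothetical sentinel for an undefined shuffle mask entry

def classify_lanes(mask, v1_zero, v2_zero):
    """Return (known_zero, zeroable) lists of bools, one entry per result lane.

    mask indexes into the concatenation of the two inputs; v1_zero and
    v2_zero flag which source elements are known to be zero.
    """
    n = len(mask)
    known_zero, zeroable = [], []
    for m in mask:
        if m == UNDEF:
            # An undefined lane may be treated as zero, but is not known zero.
            known_zero.append(False)
            zeroable.append(True)
        else:
            src_zero = v1_zero[m] if m < n else v2_zero[m - n]
            known_zero.append(src_zero)
            zeroable.append(src_zero)
    return known_zero, zeroable

# Lane 0 from a non-zero input, lanes 1 and 3 from an all-zero input, lane 2 undef.
known, zable = classify_lanes([0, 4, UNDEF, 5], [False] * 4, [True] * 4)
print(known, zable)
```

A real implementation would also have to look through more node types to find zeros, as noted above; this sketch only covers direct reads from inputs with per-element zero flags.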
http://reviews.llvm.org/D6526