[PATCH] D41062: [X86] Legalize v2i32 via widening rather than promoting
Craig Topper via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Feb 12 16:59:17 PST 2018
craig.topper added inline comments.
================
Comment at: test/Analysis/CostModel/X86/sitofp.ll:73
- ; SSE2: cost of 20 {{.*}} sitofp <2 x i32>
+ ; SSE2: cost of 40 {{.*}} sitofp <2 x i32>
; AVX1: cost of 4 {{.*}} sitofp <2 x i32>
----------------
RKSimon wrote:
> What happened? This is way out!
I assume it switched between these rows in the cost table.
```
{ ISD::SINT_TO_FP, MVT::v2f64, MVT::v2i64, 2*10 },
{ ISD::SINT_TO_FP, MVT::v2f64, MVT::v4i32, 4*10 },
```
================
Comment at: test/CodeGen/X86/2012-01-18-vbitcast.ll:7
+; CHECK: # %bb.0:
+; CHECK-NEXT: movdqa (%rcx), %xmm0
+; CHECK-NEXT: psubd (%rdx), %xmm0
----------------
RKSimon wrote:
> I think this is OK, but it still makes me nervous. We go from accessing 64-bits to 128-bits per argument.
The 64 bits we get today are just dumb luck. If we compile with sse2 we get:
```
subq $16, %rsp
.seh_stackalloc 16
.seh_endprologue
pshufd $212, (%rdx), %xmm1 # xmm1 = mem[0,1,1,3]
pshufd $212, (%rcx), %xmm0 # xmm0 = mem[0,1,1,3]
psubq %xmm1, %xmm0
addq $16, %rsp
retq
```
================
Comment at: test/CodeGen/X86/avx2-masked-gather.ll:770
+; X86-NEXT: vpxor %xmm2, %xmm2, %xmm2
+; X86-NEXT: vpcmpgtq %xmm0, %xmm2, %xmm0
+; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
----------------
This compare is emulating a v2i64 arithmetic shift right, since we don't have that instruction. We only consider the low bit of each mask element to be valid coming in, so we have to do a sign_extend_inreg operation.
I thought we had a combine that used demanded bits that should have removed the right shift. But I think it's getting tripped up by the concat_vectors that's in front of the gather.
================
Comment at: test/CodeGen/X86/avx2-masked-gather.ll:772
+; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-NEXT: vmovd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; X86-NEXT: vpinsrd $1, 4(%eax), %xmm2, %xmm2
----------------
RKSimon wrote:
> zvi wrote:
> > Any way to easily fix vmovd+vpinsrd -> vmovq?
> Yes - why didn't EltsFromConsecutiveLoads convert this to a i64 VZEXT_LOAD (VMOVQ)?
I'm not sure anything ever saw the VPINSRD as anything more than an insert_vector_elt. We never had it as a shuffle or build_vector where we could detect multiple elements.
I wonder if we shouldn't just custom legalize v2i32 loads to VZEXT_LOAD during type legalization?
================
Comment at: test/CodeGen/X86/avx2-masked-gather.ll:774
+; X86-NEXT: vpinsrd $1, 4(%eax), %xmm2, %xmm2
+; X86-NEXT: vmovdqa %xmm0, %xmm0
+; X86-NEXT: vgatherdpd %ymm0, (,%xmm2), %ymm1
----------------
zvi wrote:
> Is this redundant move a known issue?
It's there to clear bits 255:128 because we don't do a good job of detecting when the producer already zeroed those bits. I think we only whitelist a couple of instructions today.
================
Comment at: test/CodeGen/X86/known-signbits-vector.ll:19
+; X64-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
; X64-NEXT: vcvtdq2pd %xmm0, %xmm0
; X64-NEXT: retq
----------------
RKSimon wrote:
> Regression
After type legalization we have this. We could probably add a combine to catch the truncated build vector with sign-extended inputs and squash them.
```
Type-legalized selection DAG: %bb.0 'signbits_sext_v2i64_sitofp_v2f64:'
SelectionDAG has 21 nodes:
  t0: ch = EntryToken
      t33: i32 = extract_vector_elt t27, Constant:i64<0>
      t35: i32 = extract_vector_elt t27, Constant:i64<1>
    t37: v4i32 = BUILD_VECTOR t33, t35, undef:i32, undef:i32
  t30: v2f64 = X86ISD::CVTSI2P t37
  t17: ch,glue = CopyToReg t0, Register:v2f64 $xmm0, t30
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t5: i64 = sign_extend t2
  t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t6: i64 = sign_extend t4
  t26: v4i64 = BUILD_VECTOR t5, t6, undef:i64, undef:i64
  t27: v4i32 = truncate t26
  t18: ch = X86ISD::RET_FLAG t17, TargetConstant:i32<0>, Register:v2f64 $xmm0, t17:1
```
================
Comment at: test/CodeGen/X86/shuffle-vs-trunc-128.ll:276
+; AVX2-SLOW-NEXT: vmovaps (%rdi), %xmm0
+; AVX2-SLOW-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[0,2,2,3,4,6,6,7]
+; AVX2-SLOW-NEXT: vpermpd {{.*#+}} ymm0 = ymm0[0,2,2,3]
----------------
zvi wrote:
> What about this?
We type legalized the v2i64->v2i32 truncate by widening to v4i64 and then truncating. Maybe we just need to emit a bitcast to v4i32 and a vector shuffle ourselves?
https://reviews.llvm.org/D41062