[PATCH] D41062: [X86] Legalize v2i32 via widening rather than promoting
Craig Topper via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Feb 12 16:59:17 PST 2018
craig.topper added inline comments.
================
Comment at: test/Analysis/CostModel/X86/sitofp.ll:73
- ; SSE2: cost of 20 {{.*}} sitofp <2 x i32>
+ ; SSE2: cost of 40 {{.*}} sitofp <2 x i32>
; AVX1: cost of 4 {{.*}} sitofp <2 x i32>
----------------
RKSimon wrote:
> What happened? This is way out!
I assume it switched between these rows in the cost table.
```
{ ISD::SINT_TO_FP, MVT::v2f64, MVT::v2i64, 2*10 },
{ ISD::SINT_TO_FP, MVT::v2f64, MVT::v4i32, 4*10 },
```
================
Comment at: test/CodeGen/X86/2012-01-18-vbitcast.ll:7
+; CHECK: # %bb.0:
+; CHECK-NEXT: movdqa (%rcx), %xmm0
+; CHECK-NEXT: psubd (%rdx), %xmm0
----------------
RKSimon wrote:
> I think this is OK, but it still makes me nervous. We go from accessing 64-bits to 128-bits per argument.
The 64 bits we get today are just dumb luck. If we compile with sse2 we get:
```
subq $16, %rsp
.seh_stackalloc 16
.seh_endprologue
pshufd $212, (%rdx), %xmm1 # xmm1 = mem[0,1,1,3]
pshufd $212, (%rcx), %xmm0 # xmm0 = mem[0,1,1,3]
psubq %xmm1, %xmm0
addq $16, %rsp
retq
```
================
Comment at: test/CodeGen/X86/avx2-masked-gather.ll:770
+; X86-NEXT: vpxor %xmm2, %xmm2, %xmm2
+; X86-NEXT: vpcmpgtq %xmm0, %xmm2, %xmm0
+; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
----------------
This compare is emulating a v2i64 arithmetic shift right, since we don't have that instruction. We only consider the low bit of each mask element to be valid coming in, so we have to do a sign_extend_inreg operation.
I thought we had a combine that used demanded bits that should have removed the right shift. But I think it's getting tripped up by the concat_vectors that's in front of the gather.
================
Comment at: test/CodeGen/X86/avx2-masked-gather.ll:772
+; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-NEXT: vmovd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; X86-NEXT: vpinsrd $1, 4(%eax), %xmm2, %xmm2
----------------
RKSimon wrote:
> zvi wrote:
> > Any way to easily fix vmovd+vpinsrd -> vmovq?
> Yes - why didn't EltsFromConsecutiveLoads convert this to a i64 VZEXT_LOAD (VMOVQ)?
I'm not sure anything ever saw the VPINSRD as anything more than an insert_vector_elt. We never had it as a shuffle or build_vector where we could detect multiple elements.
I wonder if we shouldn't just custom legalize v2i32 loads to VZEXT_LOAD during type legalization?
================
Comment at: test/CodeGen/X86/avx2-masked-gather.ll:774
+; X86-NEXT: vpinsrd $1, 4(%eax), %xmm2, %xmm2
+; X86-NEXT: vmovdqa %xmm0, %xmm0
+; X86-NEXT: vgatherdpd %ymm0, (,%xmm2), %ymm1
----------------
zvi wrote:
> Is this redundant move a known issue?
It's there to clear bits 255:128 because we don't do a good job of detecting when the producer already zeroed those bits. I think we only whitelist a couple of instructions today.
================
Comment at: test/CodeGen/X86/known-signbits-vector.ll:19
+; X64-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
; X64-NEXT: vcvtdq2pd %xmm0, %xmm0
; X64-NEXT: retq
----------------
RKSimon wrote:
> Regression
After type legalization we have this. We could probably add a combine to catch the truncated build vector with sign-extended inputs and squash them.
```
Type-legalized selection DAG: %bb.0 'signbits_sext_v2i64_sitofp_v2f64:'
SelectionDAG has 21 nodes:
  t0: ch = EntryToken
      t33: i32 = extract_vector_elt t27, Constant:i64<0>
      t35: i32 = extract_vector_elt t27, Constant:i64<1>
    t37: v4i32 = BUILD_VECTOR t33, t35, undef:i32, undef:i32
  t30: v2f64 = X86ISD::CVTSI2P t37
  t17: ch,glue = CopyToReg t0, Register:v2f64 $xmm0, t30
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t5: i64 = sign_extend t2
  t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t6: i64 = sign_extend t4
  t26: v4i64 = BUILD_VECTOR t5, t6, undef:i64, undef:i64
  t27: v4i32 = truncate t26
  t18: ch = X86ISD::RET_FLAG t17, TargetConstant:i32<0>, Register:v2f64 $xmm0, t17:1
```
================
Comment at: test/CodeGen/X86/shuffle-vs-trunc-128.ll:276
+; AVX2-SLOW-NEXT: vmovaps (%rdi), %xmm0
+; AVX2-SLOW-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[0,2,2,3,4,6,6,7]
+; AVX2-SLOW-NEXT: vpermpd {{.*#+}} ymm0 = ymm0[0,2,2,3]
----------------
zvi wrote:
> What about this?
We type legalized the v2i64->v2i32 truncate by widening to v4i64 and then truncating. Maybe we just need to emit a bitcast to v4i32 and a vector shuffle ourselves?
https://reviews.llvm.org/D41062