[PATCH] D55600: [TargetLowering] Add ISD::OR + ISD::XOR handling to SimplifyDemandedVectorElts

Andrea Di Biagio via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Dec 13 05:43:19 PST 2018


andreadb added inline comments.


================
Comment at: test/CodeGen/X86/known-bits-vector.ll:173-177
+; X64-NEXT:    vandps {{.*}}(%rip), %xmm0, %xmm0
+; X64-NEXT:    vorps {{.*}}(%rip), %xmm0, %xmm0
+; X64-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[2,2,3,3]
+; X64-NEXT:    vcvtdq2ps %xmm0, %xmm0
 ; X64-NEXT:    retq
----------------
RKSimon wrote:
> andreadb wrote:
> > We lose the ability to fold this entire computation into a constant.
> > 
> > After the AND+OR sequence, elements 2 and 3 of %xmm0 are known at compile time (i.e. both have value `65535`).
> > The vector permute can therefore be folded away, and we can perform the int2fp conversion at compile time, effectively folding the entire computation into a load from the constant pool.
> > 
> > To be fair, we could even shrink the constant pool entry by using a vbroadcastss instead of a vmovaps (on AVX).
> The problem here is that SimplifyDemandedElts runs first and updates the and/or to:
> ```
> %1 = and <4 x i32> %a0, <i32 undef, i32 undef, i32 255, i32 4085>
> %2 = or <4 x i32> %1, <i32 65535, i32 65535, i32 65535, i32 65535> // broadcasts are preserved
> ```
> Then SimplifyDemandedBits runs; it doesn't know which vector elements are needed, and so can't fold to:
> ```
> uitofp <4 x i32><i32 65535, i32 65535, i32 65535, i32 65535> to <4 x float>
> ```
> for constant folding as it used to do.
> 
> I may be able to constant fold more aggressively in SimplifyDemandedElts, but failing that, the best option going forward for this kind of regression would be to merge SimplifyDemandedBits and SimplifyDemandedElts into a single pass, matching what ComputeNumBits does with scalars/vectors.
> 
> Also, the original purpose of this test wasn't to constant fold but to recognise that the uitofp could be simplified to sitofp (so x86 could use cvtdq2ps).
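To illustrate the fold we are losing, here is roughly the form we should still be able to reach (just a sketch of the known-bits reasoning; the exact constant-pool layout the backend picks may differ):

```
; The AND masks (255 and 4085) only keep bits inside the low 16, and the OR
; then sets all of the low 16 bits, so elements 2 and 3 are exactly 65535.
; The [2,2,3,3] permute only reads those known-constant lanes, so the uitofp
; can be evaluated at compile time:
%folded = uitofp <4 x i32> <i32 65535, i32 65535, i32 65535, i32 65535> to <4 x float>
; i.e. a single load of <float 65535.0, float 65535.0, float 65535.0, float 65535.0>
```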
If fixing this regression is not simple, then we can raise a bug for it and work on it later. What do you think?


================
Comment at: test/CodeGen/X86/packss.ll:271-275
   %1 = shl <4 x i64> %a0, <i64 63, i64 0, i64 63, i64 0>
   %2 = ashr <4 x i64> %1, <i64 63, i64 0, i64 63, i64 0>
   %3 = bitcast <4 x i64> %2 to <8 x i32>
   %4 = shufflevector <8 x i32> %3, <8 x i32> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 4, i32 4, i32 4, i32 4>
   %5 = trunc <8 x i32> %4 to <8 x i16>
----------------
Unrelated to this patch.

This may be hard to catch...

On AVX2 and AVX, we could probably simplify it to this (I didn't verify that the shuffle mask is correct):

```
vpslld       $31, %ymm0, %ymm0
vpsrad       $31, %ymm0, %ymm0
vpshufd      $1, %ymm0, %ymm0
vextractf128 $1, %ymm0, %xmm1
vpackssdw    %xmm1, %xmm0, %xmm0
```

That would require quite a lot of knowledge about both demanded bits and demanded elts. It also requires sinking the bitcast into the shift operands and then shuffling the elements afterwards, roughly like the sketch below.
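A rough sketch of that IR (not verified through llc; only i32 lanes 0 and 4 are demanded by the shuffle, so the other shift amounts are just placeholders):

```
%b = bitcast <4 x i64> %a0 to <8 x i32>
; broadcast bit 0 of i32 lanes 0 and 4 (the low halves of i64 lanes 0 and 2)
%1 = shl <8 x i32> %b, <i32 31, i32 0, i32 0, i32 0, i32 31, i32 0, i32 0, i32 0>
%2 = ashr <8 x i32> %1, <i32 31, i32 0, i32 0, i32 0, i32 31, i32 0, i32 0, i32 0>
%3 = shufflevector <8 x i32> %2, <8 x i32> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 4, i32 4, i32 4, i32 4>
%4 = trunc <8 x i32> %3 to <8 x i16>
```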

This may be something worth investigating in the future.


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55600/new/

https://reviews.llvm.org/D55600




