[PATCH] D62100: [DAGCombine][X86][AMDGPU][AArch64] (srl (shl x, c1), c2) with c1 != c2 handling

Roman Lebedev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Jul 18 08:57:51 PDT 2019


lebedev.ri marked an inline comment as done and an inline comment as not done.
lebedev.ri added inline comments.


================
Comment at: test/CodeGen/AArch64/arm64-bitfield-extract.ll:396
 ; LLC-NEXT:    ldr w8, [x0]
+; LLC-NEXT:    and w8, w8, #0x3ffffff8
 ; LLC-NEXT:    bfxil w8, w1, #16, #3
----------------
lebedev.ri wrote:
> RKSimon wrote:
> > lebedev.ri wrote:
> > > After actually looking at the `-debug` output, this regression happens because of `SimplifyDemandedBits()`,
> > > which ignores the `AArch64TargetLowering::isDesirableToCommuteWithShift()` override.
> > > So when we get to `AArch64TargetLowering::isBitfieldExtractOp()`, we have
> > > ```
> > > t18: i32 = srl t15, Constant:i64<2>
> > >   t15: i32 = or t26, t24
> > >     t26: i32 = and t7, Constant:i32<1073741816>
> > >       t7: i32,ch = load<(load 4 from %ir.y, align 8)> t0, t2, undef:i64
> > >         t2: i64,ch = CopyFromReg t0, Register:i64 %0
> > >           t1: i64 = Register %0
> > >         t6: i64 = undef
> > >       t25: i32 = Constant<1073741816>
> > >     t24: i32 = and t12, Constant:i32<4>
> > >       t12: i32 = srl t4, Constant:i64<16>
> > >         t4: i32,ch = CopyFromReg t0, Register:i32 %1
> > >           t3: i32 = Register %1
> > >         t11: i64 = Constant<16>
> > >       t23: i32 = Constant<4>
> > >   t17: i64 = Constant<2>
> > > ```
> > > I'm not sure how to turn this pattern on its head to produce `ubfx` again.
> > @lebedev.ri Which SimplifyDemandedBits transform is missing this?
> It's the other way around: some SimplifyDemandedBits transform produces this, and the rest of the combines are unable to recover.
I still don't have any ideas on how to approach this.
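For reference, the hook that gets bypassed has roughly this shape (a sketch only; the real body in AArch64ISelLowering.cpp inspects the shift and its operands and is elided here):
```
bool AArch64TargetLowering::isDesirableToCommuteWithShift(
    const SDNode *N, CombineLevel Level) const {
  // Sketch, not the in-tree logic: DAGCombiner asks this hook before
  // commuting a shift with a bitwise op, and the AArch64 override uses it to
  // keep shift pairs that can later match UBFX/BFXIL. SimplifyDemandedBits
  // rewrites the pattern without consulting this hook, which is why the AND
  // shown above survives instead of the original shift pair.
  return true;
}
```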


================
Comment at: test/CodeGen/AMDGPU/llvm.amdgcn.ubfe.ll:686-687
 ; SI-NEXT:    s_waitcnt vmcnt(0)
-; SI-NEXT:    v_lshlrev_b32_e32 v0, 31, v0
-; SI-NEXT:    v_lshrrev_b32_e32 v0, 1, v0
+; SI-NEXT:    v_lshlrev_b32_e32 v0, 30, v0
+; SI-NEXT:    v_and_b32_e32 v0, 2.0, v0
 ; SI-NEXT:    buffer_store_dword v0, off, s[4:7], 0
----------------
arsenm wrote:
> arsenm wrote:
> > lebedev.ri wrote:
> > > @arsenm will AMDGPU prefer 2 shifts or shift+mask here?
> > In this particular case, they're the same. In general, two shifts are probably better, since the mask value is less likely to be an inline immediate. It seems like we have a BFE matching problem, though.
> For 64-bit, shift and mask would be better.
```
bool AMDGPUTargetLowering::shouldFoldConstantShiftPairToMask(
    const SDNode *N, CombineLevel Level) const {
  EVT VT = N->getValueType(0);
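  // Fold (srl (shl x, c1), c2) to shift+mask only for 64-bit scalar integers;
  // keep two shifts otherwise.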
  return VT.isScalarInteger() && VT.getScalarSizeInBits() == 64;
}
```
results in many AMDGPU test regressions. I can submit a patch, but I thought I'd double-check first: is that what you meant?


================
Comment at: test/CodeGen/X86/rotate-extract-vector.ll:168
+; X64:       # %bb.0:
+; X64-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [16383,16383,16383,16383]
+; X64-NEXT:    vpsrlq $39, %ymm0, %ymm2
----------------
RKSimon wrote:
> xbolva00 wrote:
> > @RKSimon is this better?
> Unlikely - a pair of shifts by uniform constants is almost certainly better
Hmm, could you be more specific, please?
For vectors, do we want to keep two shifts even if shift amounts are equal, or only if shift amounts are unequal?
Also, what does "almost certainly better" mean given that we have `-mattr=+fast-vector-shift-masks`?
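In case it helps, this is the kind of X86 gating I had in mind (a sketch only; whether `hasFastVectorShiftMasks()` / `hasFastScalarShiftMasks()` are the right subtarget queries is my assumption):
```
bool X86TargetLowering::shouldFoldConstantShiftPairToMask(
    const SDNode *N, CombineLevel Level) const {
  EVT VT = N->getValueType(0);
  // Only prefer shift+mask on subtargets that advertise fast shift-mask
  // sequences, and only when both shift amounts are equal, i.e. when the
  // pair degenerates into a single AND.
  if ((VT.isVector() && Subtarget.hasFastVectorShiftMasks()) ||
      (!VT.isVector() && Subtarget.hasFastScalarShiftMasks()))
    return N->getOperand(1) == N->getOperand(0).getOperand(1);
  return false;
}
```
With that, shift pairs with unequal amounts would stay as two shifts even under `+fast-vector-shift-masks`, which I think matches what you're suggesting, but please correct me if not.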



Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62100/new/

https://reviews.llvm.org/D62100




