[PATCH] D88783: [InstCombine] matchFunnelShift - fold or(shl(a,x),lshr(b,sub(bw,x))) -> fshl(a,b,x) iff x < bw

Simon Pilgrim via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Oct 8 06:39:58 PDT 2020


RKSimon added inline comments.


================
Comment at: llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp:2090-2091
+    if (match(R, m_OneUse(m_Sub(m_SpecificInt(Width), m_Specific(L))))) {
+      KnownBits KnownL = IC.computeKnownBits(L, /*Depth*/ 0, &Or);
+      return KnownL.getMaxValue().ult(Width) ? L : nullptr;
+    }
----------------
lebedev.ri wrote:
> spatel wrote:
> > lebedev.ri wrote:
> > > Either this needs a motivational comment, or this can be dropped.
> > > ```
> > > ----------------------------------------
> > > define i64 @src(i64 %x, i64 %y, i64 %a) {
> > > %0:
> > >   %mask = and i64 %a, -1
> > >   %shl = shl i64 %x, %mask
> > >   %sub = sub nsw nuw i64 64, %mask
> > >   %shr = lshr i64 %y, %sub
> > >   %r = or i64 %shl, %shr
> > >   ret i64 %r
> > > }
> > > =>
> > > define i64 @tgt(i64 %x, i64 %y, i64 %a) {
> > > %0:
> > >   %r = fshl i64 %x, i64 %y, i64 %a
> > >   ret i64 %r
> > > }
> > > Transformation seems to be correct!
> > > 
> > > ```
> > We can correctly convert to the intrinsic because the intrinsic makes guarantees about the shift amount (modulo bitwidth) that may not be present in the original IR (see the sketch after the llc output below).
> > 
> > But without proper masking/select for UB shift amounts, the expanded code will be worse in codegen. So this is an irreversible transform, and we can't do it universally.
> > 
> > So even with the computeKnownBits restriction in this patch, this transform may not be allowed. For example, can we fix this in codegen:
> > https://alive2.llvm.org/ce/z/IsuRoN
> > 
> > ```
> > $ cat fsh.ll 
> > define i32 @fake_fshl(i32 %x, i32 %y, i32 %a) {
> >   %mask = and i32 %a, 31
> >   %shl = shl i32 %x, %mask
> >   %sub = sub nuw nsw i32 32, %mask
> >   %shr = lshr i32 %y, %sub
> >   %r = or i32 %shl, %shr
> >   ret i32 %r
> > }
> > 
> > define i32 @fshl(i32 %x, i32 %y, i32 %a) {
> >   %r = call i32 @llvm.fshl.i32(i32 %x, i32 %y, i32 %a)
> >   ret i32 %r
> > }
> > 
> > declare i32 @llvm.fshl.i32(i32, i32, i32)
> > 
> > $ llc -o - fsh.ll -mtriple=riscv64
> > fake_fshl:                              # @fake_fshl
> > 	sllw	a0, a0, a2
> > 	addi	a3, zero, 32
> > 	sub	a2, a3, a2
> > 	srlw	a1, a1, a2
> > 	or	a0, a0, a1
> > 	ret
> > fshl:                                   # @fshl
> > 	andi	a2, a2, 31
> > 	sll	a0, a0, a2
> > 	not	a2, a2
> > 	slli	a1, a1, 32
> > 	srli	a1, a1, 1
> > 	srl	a1, a1, a2
> > 	or	a0, a0, a1
> > 	ret
> > 
> > ```
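> > To spell out that modulo guarantee, a minimal sketch with a constant shift amount picked purely for illustration:
> > 
> > ```
> > define i32 @fshl_by_37(i32 %x, i32 %y) {
> >   ; the intrinsic's shift amount is taken modulo the bitwidth,
> >   ; so this is a funnel-shift-left by 37 % 32 == 5
> >   %r = call i32 @llvm.fshl.i32(i32 %x, i32 %y, i32 37)
> >   ret i32 %r
> > }
> > 
> > declare i32 @llvm.fshl.i32(i32, i32, i32)
> > ```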
> > We can correctly convert to the intrinsic because the intrinsic makes guarantees about the shift amount (modulo bitwidth) that may not be present in the original IR (see the sketch after the llc output below).
> 
> Yep, the intrinsic has implicit masking of the shift amount, as we can see from said masking being dropped after the transform.
> 
> > But without proper masking/select for UB shift amounts, the expanded code will be worse in codegen. So this is an irreversible transform, and we can't do it universally.
> 
> Yep, that is why I'm asking for more of a blurb.
> The problematic case I have in mind: what if we were only able to prove that the value is limited because we had range info on a load?
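> E.g. a hypothetical sketch of that case (the !range bound and the names here are made up for illustration, not taken from the patch):
> 
> ```
> define i64 @src_range(i64 %x, i64 %y, i64* %p) {
>   ; the !range metadata proves the loaded shift amount is in [0, 64)
>   %a = load i64, i64* %p, !range !0
>   %shl = shl i64 %x, %a
>   %sub = sub nuw nsw i64 64, %a
>   %shr = lshr i64 %y, %sub
>   %r = or i64 %shl, %shr
>   ret i64 %r
> }
> 
> !0 = !{i64 0, i64 64}
> ```
> 
> computeKnownBits can bound the shift amount from the metadata, so the restriction in this patch would let the fold fire, yet there is no explicit mask instruction in the source pattern.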
At least part of the riscv64 issue is that it decides to promote the fshl to i64 but not the or+shifts pattern (IIRC the sllw/srlw variants are i32 shifts with mod-32 shift amounts that sext the results to i64; see the IR sketch after the DAG dump below).

```
SelectionDAG has 15 nodes:
  t0: ch = EntryToken
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t4: i64,ch = CopyFromReg t0, Register:i64 %1
      t16: i64 = shl t4, Constant:i64<32>
        t6: i64,ch = CopyFromReg t0, Register:i64 %2
      t21: i64 = and t6, Constant:i64<31>
    t18: i64 = fshl t2, t16, t21
  t13: ch,glue = CopyToReg t0, Register:i64 $x10, t18
  t14: ch = RISCVISD::RET_FLAG t13, Register:i64 $x10, t13:1
```
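
As a rough IR sketch of the sllw behaviour I'm describing (my reading of the RV64 semantics; the function name is made up):

```
define i64 @sllw_like(i64 %rs1, i64 %rs2) {
  ; sllw shifts the low 32 bits by the shift amount modulo 32 ...
  %lo = trunc i64 %rs1 to i32
  %amtlo = trunc i64 %rs2 to i32
  %amt = and i32 %amtlo, 31
  %shl = shl i32 %lo, %amt
  ; ... and sign-extends the 32-bit result back to i64
  %res = sext i32 %shl to i64
  ret i64 %res
}
```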


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88783/new/

https://reviews.llvm.org/D88783


