[PATCH] D136736: [LSR][TTI][RISCV] Add isAllowDropLSRSolution into TTI and enable it for RISC-V
Yueh-Ting (eop) Chen via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Nov 2 17:05:31 PDT 2022
eopXD added a comment.
@fhahn Thank you for checking this in the Arm backend.
I think this transformation makes sense for all targets, and any regression should come from an insufficient cost model. The fact that an improvement is observed on RISC-V while Arm produces regressed loops supports the TTI approach.
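For context, a minimal sketch of the gating this patch adds; the hook name is taken from the patch title, and the exact signature and call site may differ from what the patch actually does:

// TargetTransformInfo hook; the default implementation would return false.
bool RISCVTTIImpl::isAllowDropLSRSolution() const { return true; }

// In LoopStrengthReduce, after a solution is chosen (a sketch, not the
// verbatim patch): keep the original IVs when the proposed solution is
// no cheaper than the baseline.
if (BaselineCost.isLess(SolutionCost) && TTI.isAllowDropLSRSolution())
  Solution.clear(); // an empty solution leaves the input IR untouched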
The regressed result of `CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll` shows that the vector contiguous load/store with post-increment instructions, `vldrw.u32` and `vstrw.32`, were not leveraged, which is why there are two extra `add.w` instructions for the pointers. Looking at the IR that produced the regressed loop [0], I would say the lowering does not recognize the pattern of vector load/store instructions whose addresses come from gep instructions indexed by the primary IV. The cost model logs [1] make sense to me, since the address mode CAN be folded and it is the codegen's responsibility to recognize the pattern. The original IR generated after LSR is shown below [2], and I think [0] is capable of producing the same codegen with some additional pattern recognition.
I can create another patch to enable it on Arm so we can get attention from the Arm backend developers, but at the same time I think the regression here should not block landing this particular patch, which only affects RISC-V.
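For reference, dump [0] below can presumably be reproduced along these lines (the triple and attrs are assumptions mirroring the test's RUN line; drop -lsr-drop-solution to get dump [2]):

llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp \
    -print-lsr-output -lsr-drop-solution \
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll -o /dev/null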
[0] LLVM IR after LSR with -lsr-drop-solution enabled for `CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll`
*** Code after LSR ***
define arm_aapcs_vfpcc void @fabs(float* noalias nocapture readonly %pSrcA, float* noalias nocapture %pDst, i32 %blockSize) #0 {
entry:
%cmp3 = icmp eq i32 %blockSize, 0
br i1 %cmp3, label %while.end, label %vector.ph
vector.ph: ; preds = %entry
%n.rnd.up = add i32 %blockSize, 3
%n.vec = and i32 %n.rnd.up, -4
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%next.gep = getelementptr float, float* %pDst, i32 %index
%next.gep13 = getelementptr float, float* %pSrcA, i32 %index
%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %blockSize)
%0 = bitcast float* %next.gep13 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %0, i32 4, <4 x i1> %active.lane.mask, <4 x float> undef)
%1 = call fast <4 x float> @llvm.fabs.v4f32(<4 x float> %wide.masked.load)
%2 = bitcast float* %next.gep to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %1, <4 x float>* %2, i32 4, <4 x i1> %active.lane.mask)
%index.next = add i32 %index, 4
%3 = icmp eq i32 %index.next, %n.vec
br i1 %3, label %while.end.loopexit, label %vector.body
while.end.loopexit: ; preds = %vector.body
br label %while.end
while.end: ; preds = %while.end.loopexit, %entry
ret void
}
[1] LSR debug output for the proposed solution and the baseline solution
LSR is examining the following uses:
LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
reg({(4 * ((3 + %blockSize) /u 4))<nuw>,+,-4}<%vector.body>)
reg((4 * ((3 + %blockSize) /u 4))<nuw>) + -1*reg({0,+,4}<%vector.body>)
LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
reg({0,+,4}<%vector.body>)
LSR Use: Kind=Address of <4 x float> in addrspace(0), Offsets={0}, widest fixup type: <4 x float>*
reg({%pSrcA,+,16}<%vector.body>)
reg(%pSrcA) + 1*reg({0,+,16}<%vector.body>)
LSR Use: Kind=Address of <4 x float> in addrspace(0), Offsets={0}, widest fixup type: <4 x float>*
reg({%pDst,+,16}<%vector.body>)
reg(%pDst) + 1*reg({0,+,16}<%vector.body>)
LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
reg(%blockSize)
New best at 1 instruction 5 regs, with addrec cost 1, plus 8 setup cost.
Regs:
- {(4 * ((3 + %blockSize) /u 4))<nuw>,+,-4}<%vector.body>
- {0,+,4}<%vector.body>
- {%pSrcA,+,16}<%vector.body>
- {%pDst,+,16}<%vector.body>
- %blockSize
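The log above is LSR's debug output; it can presumably be obtained by adding -debug-only=loop-reduce to the llc invocation shown earlier (this requires an assertions-enabled build):

llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -lsr-drop-solution \
    -debug-only=loop-reduce \
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll -o /dev/null 2>&1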
[2] LLVM IR after LSR without `-lsr-drop-solution` enabled for `CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll`
*** Code after LSR ***
define arm_aapcs_vfpcc void @fabs(float* noalias nocapture readonly %pSrcA, float* noalias nocapture %pDst, i32 %blockSize) #0 {
entry:
%cmp3 = icmp eq i32 %blockSize, 0
br i1 %cmp3, label %while.end, label %vector.ph
vector.ph: ; preds = %entry
%n.rnd.up = add i32 %blockSize, 3
%n.vec = and i32 %n.rnd.up, -4
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%lsr.iv3 = phi float* [ %scevgep4, %vector.body ], [ %pDst, %vector.ph ]
%lsr.iv1 = phi float* [ %scevgep, %vector.body ], [ %pSrcA, %vector.ph ]
%lsr.iv = phi i32 [ %lsr.iv.next, %vector.body ], [ %n.vec, %vector.ph ]
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%lsr.iv12 = bitcast float* %lsr.iv1 to <4 x float>*
%lsr.iv35 = bitcast float* %lsr.iv3 to <4 x float>*
%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %blockSize)
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %lsr.iv12, i32 4, <4 x i1> %active.lane.mask, <4 x float> undef)
%0 = call fast <4 x float> @llvm.fabs.v4f32(<4 x float> %wide.masked.load)
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %0, <4 x float>* %lsr.iv35, i32 4, <4 x i1> %active.lane.mask)
%index.next = add i32 %index, 4
%lsr.iv.next = add i32 %lsr.iv, -4
%scevgep = getelementptr float, float* %lsr.iv1, i32 4
%scevgep4 = getelementptr float, float* %lsr.iv3, i32 4
%1 = icmp eq i32 %lsr.iv.next, 0
br i1 %1, label %while.end.loopexit, label %vector.body
while.end.loopexit: ; preds = %vector.body
br label %while.end
while.end: ; preds = %while.end.loopexit, %entry
ret void
}
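For illustration, the pointer-phi pattern in [2] is what folds into the MVE post-increment addressing mode; a hand-written sketch of the expected loop body (assumed registers, not output from an actual build):

vldrw.u32 q0, [r0], #16   @ load through %lsr.iv1, post-increment by 16 bytes
vabs.f32  q0, q0          @ llvm.fabs.v4f32
vstrw.32  q0, [r1], #16   @ store through %lsr.iv3, post-increment by 16 bytes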
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D136736/new/
https://reviews.llvm.org/D136736