[PATCH] D136736: [LSR][TTI][RISCV] Add isAllowDropLSRSolution into TTI and enable it for RISC-V
Yueh-Ting (eop) Chen via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Nov 2 17:05:31 PDT 2022
eopXD added a comment.
@fhahn Thank you for checking this in the Arm backend.
I think this transformation makes sense for all targets, and any regression should come from an insufficient cost model. The fact that an improvement is observed on RISC-V while Arm produces regressed loops supports the TTI approach.
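For context, a minimal sketch of the gating this patch adds; the hook name is taken from the patch title, and the exact signature and call site may differ from what the patch actually does:

// TargetTransformInfo hook; the default implementation would return false.
bool RISCVTTIImpl::isAllowDropLSRSolution() const { return true; }

// In LoopStrengthReduce, after a solution is chosen (a sketch, not the
// verbatim patch): keep the original IVs when the proposed solution is
// no cheaper than the baseline.
if (BaselineCost.isLess(SolutionCost) && TTI.isAllowDropLSRSolution())
  Solution.clear(); // an empty solution leaves the input IR untouched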
The regressed result of `CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll` shows that the vector contiguous load/store with post-increment instructions, `vldrw.u32` and `vstrw.32`, were not leveraged, which is why there are two extra `add.w` instructions for the pointers. Looking at the IR that produced the regressed loop [0], I would say the lowering does not recognize the pattern of vector load/store instructions whose addresses come from gep instructions indexed by the primary IV. The cost model logs [1] make sense to me, since the address mode CAN be folded and it is the codegen's responsibility to recognize the pattern. The original IR generated after LSR is shown below [2], and I think [0] is capable of producing the same codegen with some additional pattern recognition.
I can create another patch to enable it on Arm so we can get attention from the Arm backend developers, but at the same time I think the regression here should not block landing this particular patch, which only affects RISC-V.
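For reference, dump [0] below can presumably be reproduced along these lines (the triple and attrs are assumptions mirroring the test's RUN line; drop -lsr-drop-solution to get dump [2]):

llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp \
    -print-lsr-output -lsr-drop-solution \
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll -o /dev/null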
[0] LLVM IR after LSR with -lsr-drop-solution enabled for `CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll`
*** Code after LSR ***
define arm_aapcs_vfpcc void @fabs(float* noalias nocapture readonly %pSrcA, float* noalias nocapture %pDst, i32 %blockSize) #0 {
entry:
%cmp3 = icmp eq i32 %blockSize, 0
br i1 %cmp3, label %while.end, label %vector.ph
vector.ph: ; preds = %entry
%n.rnd.up = add i32 %blockSize, 3
%n.vec = and i32 %n.rnd.up, -4
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%next.gep = getelementptr float, float* %pDst, i32 %index
%next.gep13 = getelementptr float, float* %pSrcA, i32 %index
%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %blockSize)
%0 = bitcast float* %next.gep13 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %0, i32 4, <4 x i1> %active.lane.mask, <4 x float> undef)
%1 = call fast <4 x float> @llvm.fabs.v4f32(<4 x float> %wide.masked.load)
%2 = bitcast float* %next.gep to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %1, <4 x float>* %2, i32 4, <4 x i1> %active.lane.mask)
%index.next = add i32 %index, 4
%3 = icmp eq i32 %index.next, %n.vec
br i1 %3, label %while.end.loopexit, label %vector.body
while.end.loopexit: ; preds = %vector.body
br label %while.end
while.end: ; preds = %while.end.loopexit, %entry
ret void
}
[1] LSR debug output for the proposed solution and the baseline solution
LSR is examining the following uses:
LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
reg({(4 * ((3 + %blockSize) /u 4))<nuw>,+,-4}<%vector.body>)
reg((4 * ((3 + %blockSize) /u 4))<nuw>) + -1*reg({0,+,4}<%vector.body>)
LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
reg({0,+,4}<%vector.body>)
LSR Use: Kind=Address of <4 x float> in addrspace(0), Offsets={0}, widest fixup type: <4 x float>*
reg({%pSrcA,+,16}<%vector.body>)
reg(%pSrcA) + 1*reg({0,+,16}<%vector.body>)
LSR Use: Kind=Address of <4 x float> in addrspace(0), Offsets={0}, widest fixup type: <4 x float>*
reg({%pDst,+,16}<%vector.body>)
reg(%pDst) + 1*reg({0,+,16}<%vector.body>)
LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
reg(%blockSize)
New best at 1 instruction 5 regs, with addrec cost 1, plus 8 setup cost.
Regs:
- {(4 * ((3 + %blockSize) /u 4))<nuw>,+,-4}<%vector.body>
- {0,+,4}<%vector.body>
- {%pSrcA,+,16}<%vector.body>
- {%pDst,+,16}<%vector.body>
- %blockSize
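The log above is LSR's debug output; it can presumably be obtained by adding -debug-only=loop-reduce to the llc invocation shown earlier (this requires an assertions-enabled build):

llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -lsr-drop-solution \
    -debug-only=loop-reduce \
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll -o /dev/null 2>&1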
[2] LLVM IR after LSR without `-lsr-drop-solution` enabled for `CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll`
*** Code after LSR ***
define arm_aapcs_vfpcc void @fabs(float* noalias nocapture readonly %pSrcA, float* noalias nocapture %pDst, i32 %blockSize) #0 {
entry:
%cmp3 = icmp eq i32 %blockSize, 0
br i1 %cmp3, label %while.end, label %vector.ph
vector.ph: ; preds = %entry
%n.rnd.up = add i32 %blockSize, 3
%n.vec = and i32 %n.rnd.up, -4
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%lsr.iv3 = phi float* [ %scevgep4, %vector.body ], [ %pDst, %vector.ph ]
%lsr.iv1 = phi float* [ %scevgep, %vector.body ], [ %pSrcA, %vector.ph ]
%lsr.iv = phi i32 [ %lsr.iv.next, %vector.body ], [ %n.vec, %vector.ph ]
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%lsr.iv12 = bitcast float* %lsr.iv1 to <4 x float>*
%lsr.iv35 = bitcast float* %lsr.iv3 to <4 x float>*
%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %blockSize)
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %lsr.iv12, i32 4, <4 x i1> %active.lane.mask, <4 x float> undef)
%0 = call fast <4 x float> @llvm.fabs.v4f32(<4 x float> %wide.masked.load)
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %0, <4 x float>* %lsr.iv35, i32 4, <4 x i1> %active.lane.mask)
%index.next = add i32 %index, 4
%lsr.iv.next = add i32 %lsr.iv, -4
%scevgep = getelementptr float, float* %lsr.iv1, i32 4
%scevgep4 = getelementptr float, float* %lsr.iv3, i32 4
%1 = icmp eq i32 %lsr.iv.next, 0
br i1 %1, label %while.end.loopexit, label %vector.body
while.end.loopexit: ; preds = %vector.body
br label %while.end
while.end: ; preds = %while.end.loopexit, %entry
ret void
}
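For illustration, the pointer-phi pattern in [2] is what folds into the MVE post-increment addressing mode; a hand-written sketch of the expected loop body (assumed registers, not output from an actual build):

vldrw.u32 q0, [r0], #16   @ load through %lsr.iv1, post-increment by 16 bytes
vabs.f32  q0, q0          @ llvm.fabs.v4f32
vstrw.32  q0, [r1], #16   @ store through %lsr.iv3, post-increment by 16 bytes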
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D136736/new/
https://reviews.llvm.org/D136736