[PATCH] D157870: [BPF] Replace BPFMIPeepholeTruncElim by custom logic in isZExtFree()
Eduard Zingerman via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Aug 17 16:52:15 PDT 2023
eddyz87 created this revision.
Herald added a subscriber: hiraditya.
Herald added a project: All.
eddyz87 updated this revision to Diff 551319.
eddyz87 edited the summary of this revision.
eddyz87 added a comment.
Herald added subscribers: steven.zhang, kristof.beyls, arichardson.
eddyz87 updated this revision to Diff 551320.
eddyz87 edited the summary of this revision.
eddyz87 edited the summary of this revision.
eddyz87 published this revision for review.
eddyz87 added a reviewer: yonghong-song.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.
Rebase, added detailed commit message.
eddyz87 added a comment.
Commit message fixes.
eddyz87 added a comment.
Coincidentally, this also fixes the following bug:
when LLVM is compiled with `LLVM_ENABLE_EXPENSIVE_CHECKS` some BPF kernel selftests fail to build with the following error:
*** Bad machine code: Illegal virtual register for instruction ***
- function: _dissect
- basic block: %bb.9 if.end10 (0x621000e80d98)
- instruction: %24:gpr32 = MOV_rr %8:gpr32, debug-location !339; progs/bpf_flow.c:120:2
- operand 0: %24:gpr32
Expected a GPR register, but got a GPR32 register
Using llvm-reduce it is possible to isolate the following test case:
define i1 @foo(ptr %p) {
entry:
%short = load i16, ptr %p, align 2
br label %next
next:
%cond = icmp eq i16 %short, 0
ret i1 %cond
}
Here is how this code looks before and after BPFMIPeepholeTruncElim transformation:
Before this revision After this revision
-------------------- -------------------
bb.0.entry: bb.0.entry:
%1:gpr = COPY $r1 %1:gpr = COPY $r1
%0:gpr32 = LDH32 %1:gpr, 0 %0:gpr32 = LDH32 %1:gpr, 0
bb.1.next: bb.1.next:
%2:gpr32 = AND_ri_32 %0:gpr32, 65535 %2:gpr32 = MOV_rr %0:gpr32
^^^^^^^^^ ^^^^^^
%4:gpr32 = MOV_ri_32 1 %4:gpr32 = MOV_ri_32 1
JEQ_ri_32 %2:gpr32, 0, %bb.3 JEQ_ri_32 %2:gpr32, 0, %bb.3
Note the MOV_rr instruction used with 32-bit sub-registers introduced by the transformation.
This happens because of the following code:
bool BPFMIPeepholeTruncElim::eliminateTruncSeq() {
MachineInstr* ToErase = nullptr;
bool Eliminated = false;
for (MachineBasicBlock &MBB : *MF) {
for (MachineInstr &MI : MBB) {
...
BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(BPF::MOV_rr), DstReg)
.addReg(SrcReg);
...
}
}
return Eliminated;
}
The basic fix is to select between `MOV_rr` and `MOV_rr_32` in the above snippet. This fix leads to some size increase in the object files generated for BPF selftests: out of 655 object files 30 have more instructions, in total 107 more instructions is generated.
The increase is caused by a difference in `BPFMIPreEmitPeephole::eliminateRedundantMov()` behavior: it removes operations of form `rA = MOV_rr rA`, but does not remove `wA = MOV_rr_32 wA` (which is unsafe). E.g. for the running example:
Before this revision After this revision
-------------------- -------------------
bb.0.entry: bb.0.entry:
$w1 = LDH32 killed $r1, 0 $w1 = LDH32 killed $r1, 0
$w1 = MOV_rr killed $w1 <--- (1) $w1 = MOV_rr_32 killed $w1 <--- (2)
$w0 = MOV_ri_32 1 $w0 = MOV_ri_32 1
JEQ_ri_32 killed $w1, 0, %bb.2 JEQ_ri_32 killed $w1, 0, %bb.2
bb.1.next: bb.1.next:
$w0 = MOV_ri_32 0 $w0 = MOV_ri_32 0
bb.2.next: bb.2.next:
RET implicit $w0 RET implicit $w0
`eliminateRedundantMov()` will remove (1) but won't remove (2).
So, all-in-all I think that this revision has advantage over the basic fix.
eddyz87 added a comment.
Hi Yonghong, could you please take a look?
As described here <https://reviews.llvm.org/D157870#4597313>, this fixes one of the selftest compilation bugs that occur when `LLVM_ENABLE_EXPENSIVE_CHECKS` is on. Sorry for the long description, I tried to explain why instruction selection would always cover cases that `BPFMIPeepholeTruncElim` covered.
Replace `BPFMIPeepholeTruncElim` by adding an overload for `TargetLowering::isZExtFree()` aware that zero extension is free for `ISD::LOAD`.
Short description
=================
The `BPFMIPeepholeTruncElim` handles two patterns:
Pattern #1:
%1 = LDB %0, ... %1 = LDB %0, ...
%2 = AND_ri %1, 0xff -> %2 = MOV_ri %1 <-- (!)
Pattern #2:
bb.1: bb.1:
%a = LDB %0, ... %a = LDB %0, ...
br %bb3 br %bb3
bb.2: bb.2:
%b = LDB %0, ... -> %b = LDB %0, ...
br %bb3 br %bb3
bb.3: bb.3:
%1 = PHI %a, %b %1 = PHI %a, %b
%2 = AND_ri %1, 0xff %2 = MOV_ri %1 <-- (!)
Plus variations:
- AND_ri_32 instead of AND_ri
- SLL/SLR instead of AND_ri
- LDH, LDW, LDB32, LDH32, LDW32
Both patterns could be handled by built-in transformations at instruction selection phase if suitable `isZExtFree()` implementation is provided. The idea is borrowed from `ARMTargetLowering::isZExtFree`.
Replacing `BPFMIPeepholeTruncElim` has no regressions (no new un-eliminated zero extensions) when evaluating on BPF kernel selftests and remove_truncate_*.ll LLVM test cases.
Commit also adds a few test cases to make sure that patterns in question are handled.
Long description
================
Why this works: Pattern #1
--------------------------
Consider the following example:
define i1 @foo(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
%cond = icmp eq i8 %a, 0
ret i1 %cond
}
Log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command:
...
Type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
t19: i64 = and t16, Constant:i64<255>
t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.1 t19: i64 = and t16, Constant:i64<255>
With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
and 0 other values
...
Optimized type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Note:
- Optimized type-legalized selection DAG:
- `t19 = and t16, 255` had been replaced by `t16` (load).
- Patterns like `(and (load ... i8), 255)` are replaced by `load` in `DAGCombiner::BackwardsPropagateMask` called from `DAGCombiner::visitAND`.
- Similarly patterns like `(shl (srl ..., 56), 56)` are replaced by `(and ..., 255)` in `DAGCombiner::visitSRL` (this function is huge, look for `TLI.shouldFoldConstantShiftPairToMask()` call).
Why this works: Pattern #2
--------------------------
Consider the following example:
define i1 @foo(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
br label %next
next:
%cond = icmp eq i8 %a, 0
ret i1 %cond
}
Consider log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command.
Log for first basic block:
Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 9 nodes:
t0: ch,glue = EntryToken
t3: i64 = Constant<0>
t2: i64,ch = CopyFromReg t0, Register:i64 %1
t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
t6: i64 = zero_extend t5
t8: ch = CopyToReg t0, Register:i64 %0, t6
...
Replacing.1 t6: i64 = zero_extend t5
With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
and 0 other values
...
Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 7 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %1
t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
t8: ch = CopyToReg t0, Register:i64 %0, t9
Note:
- Initial selection DAG:
- `%a = load ...` is lowered as `t6 = (zero_extend (load ...))` w/o special `isZExtFree()` overload added by this commit it is instead lowered as `t6 = (any_extend (load ...))`.
- The decision to generate `zero_extend` or `any_extend` is done in `RegsForValue::getCopyToRegs` called from `SelectionDAGBuilder::CopyValueToVirtualRegister`:
- if `isZExtFree()` for load returns true `zero_extend` is used;
- `any_extend` is used otherwise.
- Optimized lowered selection DAG:
- `t6 = (any_extend (load ...))` is replaced by `t9 = load ..., zext from i8`. This is done by `DagCombiner.cpp:tryToFoldExtOfLoad()` called from `DAGCombiner::visitZERO_EXTEND`.
Log for second basic block:
Initial selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t5: i8 = truncate t4
t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
t9: i64 = any_extend t8
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.2 t18: i64 = and t4, Constant:i64<255>
With: t4: i64 = AssertZext t2, ValueType:ch:i8
...
Type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t18: i64 = and t4, Constant:i64<255>
t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Optimized type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Note:
- Initial selection DAG:
- `t0` is an input value for this basic block, it corresponds load instruction (`t9`) from the first basic block.
- It is accessed within basic block via `t4` (AssertZext (CopyFromReg t0, ...)).
- The `AssertZext` is generated by RegsForValue::getCopyFromRegs called from SelectionDAGBuilder::getCopyFromRegs, it is generated only when `LiveOutInfo` with known number of leading zeros is present for `t0`.
- Known register bits in `LiveOutInfo` are computed by `SelectionDAG::computeKnownBits` called from `SelectionDAGISel::ComputeLiveOutVRegInfo`.
- `computeKnownBits()` generates leading zeros information for `(load ..., zext from ...)` but *does not* generate leading zeros information for `(load ..., anyext from ...)`. This is why `isZExtFree()` added in this commit is important.
- Type-legalized selection DAG:
- `t5 = truncate t4` is replaced by `t18 = and t4, 255`
- Optimized type-legalized selection DAG:
- `t18 = and t4, 255` is replaced by `t4`, this is done by `DAGCombiner::SimplifyDemandedBits` called from `DAGCombiner::visitAND`, which simplifies patterns like `(and (assertzext ...))`.
Impact
------
This change covers all remove_truncate_*.ll test cases:
- for -mcpu=v4 there are no changes in the generated code;
- for -mcpu=v2 code generated for remove_truncate_7 and remove_truncate_8 improved slightly, for other tests it is unchanged.
For remove_truncate_7:
Before this revision After this revision
-------------------- -------------------
r1 <<= 0x20 r1 <<= 0x20
r1 >>= 0x20 r1 >>= 0x20
if r1 == 0x0 goto +0x2 <LBB0_2> if r1 == 0x0 goto +0x2 <LBB0_2>
r1 = *(u32 *)(r2 + 0x0) r0 = *(u32 *)(r2 + 0x0)
goto +0x1 <LBB0_3> goto +0x1 <LBB0_3>
<LBB0_2>: <LBB0_2>:
r1 = *(u32 *)(r2 + 0x4) r0 = *(u32 *)(r2 + 0x4)
<LBB0_3>: <LBB0_3>:
r0 = r1 exit
exit
For remove_truncate_8:
Before this revision After this revision
-------------------- -------------------
r2 = *(u32 *)(r1 + 0x0) r2 = *(u32 *)(r1 + 0x0)
r3 = r2 r3 = r2
r3 <<= 0x20 r3 <<= 0x20
r4 = r3 r3 s>>= 0x20
r4 s>>= 0x20
if r4 s> 0x2 goto +0x5 <LBB0_3> if r3 s> 0x2 goto +0x4 <LBB0_3>
r4 = *(u32 *)(r1 + 0x4) r3 = *(u32 *)(r1 + 0x4)
r3 >>= 0x20
if r3 >= r4 goto +0x2 <LBB0_3> if r2 >= r3 goto +0x2 <LBB0_3>
r2 += 0x2 r2 += 0x2
*(u32 *)(r1 + 0x0) = r2 *(u32 *)(r1 + 0x0) = r2
<LBB0_3>: <LBB0_3>:
r0 = 0x3 r0 = 0x3
exit exit
For kernel BPF selftests statistics is as follows: (-mcpu=v4):
- For -mcpu=v4: 9 out of 655 object files have differences, in all cases total number of instructions marginally decreased (-27 instructions).
- For -mcpu=v2: 9 out of 655 object files have differences:
- For 19 object files number of instruction decreased (-129 instruction in total): some redundant `rX &= 0xffff` and register to register assignments removed;
- For 2 object files number of instructions increased +2 instructions in each file.
Both -mcpu=v2 instruction increases could be reduced to the same example:
define void @foo(ptr %p) {
entry:
%a = load i32, ptr %p, align 4
%b = sext i32 %a to i64
%c = icmp ult i64 1, %b
br i1 %c, label %next, label %end
next:
call void inttoptr (i64 62 to ptr)(i32 %a)
br label %end
end:
ret void
}
Note that this example uses value loaded to `%a` both as a sign extended (`%b`) and as zero extended (`%a` passed as parameter). Here is the difference in final assembly code:
Before this revision After this revision
-------------------- -------------------
r1 = *(u32 *)(r1 + 0) r1 = *(u32 *)(r1 + 0)
r1 <<= 32 r1 <<= 32
r1 s>>= 32 r1 s>>= 32
if r1 < 2 goto <LBB0_2> if r1 < 2 goto <LBB0_2>
r1 <<= 32
r1 >>= 32
call 62 call 62
<LBB0_2>: <LBB0_2>:
exit exit
Before this commit `%a` is passed to call as a sign extended value, after this commit `%a` is passed to call as a zero extended value, both are correct as 32-bit sub-register is the same.
The difference comes from `DAGCombiner` operation on the initial DAG.
Initial selection DAG before this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
t6: i64 = any_extend t5 <--------------------- (1)
t8: ch = CopyToReg t0, Register:i64 %0, t6
t9: i64 = sign_extend t5
t12: i1 = setcc Constant:i64<1>, t9, setult:ch
Initial selection DAG after this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
t6: i64 = zero_extend t5 <--------------------- (2)
t8: ch = CopyToReg t0, Register:i64 %0, t6
t9: i64 = sign_extend t5
t12: i1 = setcc Constant:i64<1>, t9, setult:ch
The node `t9` is processed before node `t6` and `load` instruction is combined to load with sign extension:
Replacing.1 t9: i64 = sign_extend t5
With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
and 0 other values
Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
With: t31: i32 = truncate t30
and 1 other values
This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad` called from `DAGCombiner::visitSIGN_EXTEND`. Note that `t5` is used by `t6` which is `any_extend` in (1) and `zero_extend` in (2). `tryToFoldExtOfLoad()` rewrites such uses of `t5` differently:
- `any_extend` is simply removed
- `zero_extend` is replaced by `and t30, 0xffffffff`, which is later converted to a pair of shifts. This pair of shifts survives till the end of translation.
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D157870
Files:
llvm/lib/Target/BPF/BPF.h
llvm/lib/Target/BPF/BPFISelLowering.cpp
llvm/lib/Target/BPF/BPFISelLowering.h
llvm/lib/Target/BPF/BPFMIPeephole.cpp
llvm/lib/Target/BPF/BPFTargetMachine.cpp
llvm/test/CodeGen/BPF/remove_truncate_9.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D157870.551320.patch
Type: text/x-patch
Size: 10790 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20230817/e987b3ed/attachment.bin>
More information about the llvm-commits
mailing list