[PATCH] D157870: [BPF] Replace BPFMIPeepholeTruncElim by custom logic in isZExtFree()

Thu Aug 17 16:52:15 PDT 2023

eddyz87 created this revision.
Herald added a subscriber: hiraditya.
Herald added a project: All.
eddyz87 updated this revision to Diff 551319.
eddyz87 edited the summary of this revision.
eddyz87 added a comment.
Herald added subscribers: steven.zhang, kristof.beyls, arichardson.
eddyz87 updated this revision to Diff 551320.
eddyz87 edited the summary of this revision.
eddyz87 edited the summary of this revision.
eddyz87 published this revision for review.
eddyz87 added a reviewer: yonghong-song.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

Rebase, added detailed commit message.

eddyz87 added a comment.

Commit message fixes.

eddyz87 added a comment.

Coincidentally, this also fixes the following bug:
when LLVM is compiled with `LLVM_ENABLE_EXPENSIVE_CHECKS` some BPF kernel selftests fail to build with the following error:

  *** Bad machine code: Illegal virtual register for instruction ***
  - function:    _dissect
  - basic block: %bb.9 if.end10 (0x621000e80d98)
  - instruction: %24:gpr32 = MOV_rr %8:gpr32, debug-location !339; progs/bpf_flow.c:120:2
  - operand 0:   %24:gpr32
  Expected a GPR register, but got a GPR32 register

Using llvm-reduce it is possible to isolate the following test case:

  define i1 @foo(ptr %p) {
  entry:
    %short = load i16, ptr %p, align 2
    br label %next

  next:
    %cond = icmp eq i16 %short, 0
    ret i1 %cond
  }

Here is how this code looks before and after BPFMIPeepholeTruncElim transformation:

    Before this revision                      After this revision
    --------------------                      -------------------
  bb.0.entry:                               bb.0.entry:
    %1:gpr = COPY $r1                         %1:gpr = COPY $r1
    %0:gpr32 = LDH32 %1:gpr, 0                %0:gpr32 = LDH32 %1:gpr, 0

  bb.1.next:                                bb.1.next:
    %2:gpr32 = AND_ri_32 %0:gpr32, 65535      %2:gpr32 = MOV_rr %0:gpr32
               ^^^^^^^^^                                 ^^^^^^
    %4:gpr32 = MOV_ri_32 1                    %4:gpr32 = MOV_ri_32 1
    JEQ_ri_32 %2:gpr32, 0, %bb.3              JEQ_ri_32 %2:gpr32, 0, %bb.3

Note the MOV_rr instruction used with 32-bit sub-registers introduced by the transformation.
This happens because of the following code:

  bool BPFMIPeepholeTruncElim::eliminateTruncSeq() {
    MachineInstr* ToErase = nullptr;
    bool Eliminated = false;

    for (MachineBasicBlock &MBB : *MF) {
      for (MachineInstr &MI : MBB) {
        ...
        BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(BPF::MOV_rr), DstReg)
                .addReg(SrcReg);
        ...
      }
    }

    return Eliminated;
  }

The basic fix is to select between `MOV_rr` and `MOV_rr_32` in the above snippet. This fix leads to some size increase in the object files generated for BPF selftests: out of 655 object files 30 have more instructions, in total 107 more instructions is generated.

The increase is caused by a difference in `BPFMIPreEmitPeephole::eliminateRedundantMov()` behavior: it removes operations of form `rA = MOV_rr rA`, but does not remove `wA = MOV_rr_32 wA` (which is unsafe). E.g. for the running example:

    Before this revision                      After this revision
    --------------------                      -------------------
  bb.0.entry:                               bb.0.entry:
    $w1 = LDH32 killed $r1, 0                 $w1 = LDH32 killed $r1, 0
    $w1 = MOV_rr killed $w1   <--- (1)        $w1 = MOV_rr_32 killed $w1   <--- (2)
    $w0 = MOV_ri_32 1                         $w0 = MOV_ri_32 1
    JEQ_ri_32 killed $w1, 0, %bb.2            JEQ_ri_32 killed $w1, 0, %bb.2
  bb.1.next:                                bb.1.next:
    $w0 = MOV_ri_32 0                         $w0 = MOV_ri_32 0
  bb.2.next:                                bb.2.next:
    RET implicit $w0                          RET implicit $w0

`eliminateRedundantMov()` will remove (1) but won't remove (2).

So, all-in-all I think that this revision has advantage over the basic fix.

eddyz87 added a comment.

Hi Yonghong, could you please take a look?
As described here <https://reviews.llvm.org/D157870#4597313>, this fixes one of the selftest compilation bugs that occur when `LLVM_ENABLE_EXPENSIVE_CHECKS` is on.  Sorry for the long description, I tried to explain why instruction selection would always cover cases that `BPFMIPeepholeTruncElim` covered.

Replace `BPFMIPeepholeTruncElim` by adding an overload for `TargetLowering::isZExtFree()` aware that zero extension is free for `ISD::LOAD`.

Short description
=================

The `BPFMIPeepholeTruncElim` handles two patterns:

Pattern #1:

  %1 = LDB %0, ...              %1 = LDB %0, ...
  %2 = AND_ri %1, 0xff      ->  %2 = MOV_ri %1    <-- (!)

Pattern #2:

  bb.1:                         bb.1:
    %a = LDB %0, ...              %a = LDB %0, ...
    br %bb3                       br %bb3
  bb.2:                         bb.2:
    %b = LDB %0, ...        ->    %b = LDB %0, ...
    br %bb3                       br %bb3
  bb.3:                         bb.3:
    %1 = PHI %a, %b               %1 = PHI %a, %b
    %2 = AND_ri %1, 0xff          %2 = MOV_ri %1  <-- (!)

Plus variations:

- AND_ri_32 instead of AND_ri
- SLL/SLR instead of AND_ri
- LDH, LDW, LDB32, LDH32, LDW32

Both patterns could be handled by built-in transformations at instruction selection phase if suitable `isZExtFree()` implementation is provided. The idea is borrowed from `ARMTargetLowering::isZExtFree`.

Replacing `BPFMIPeepholeTruncElim` has no regressions (no new un-eliminated zero extensions) when evaluating on BPF kernel selftests and remove_truncate_*.ll LLVM test cases.

Commit also adds a few test cases to make sure that patterns in question are handled.

Long description
================

Why this works: Pattern #1
--------------------------

Consider the following example:

  define i1 @foo(ptr %p) {
  entry:
    %a = load i8, ptr %p, align 1
    %cond = icmp eq i8 %a, 0
    ret i1 %cond
  }

Log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command:

  ...
  Type-legalized selection DAG: %bb.0 'foo:entry'
  SelectionDAG has 13 nodes:
    t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
        t19: i64 = and t16, Constant:i64<255>
      t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
    t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
    t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
  ...
  Replacing.1 t19: i64 = and t16, Constant:i64<255>
  With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
   and 0 other values
  ...
  Optimized type-legalized selection DAG: %bb.0 'foo:entry'
  SelectionDAG has 11 nodes:
    t0: ch,glue = EntryToken
          t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
      t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
    t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
    t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
  ...

Note:

- Optimized type-legalized selection DAG:
  - `t19 = and t16, 255` had been replaced by `t16` (load).
  - Patterns like `(and (load ... i8), 255)` are replaced by `load` in `DAGCombiner::BackwardsPropagateMask` called from `DAGCombiner::visitAND`.
  - Similarly patterns like `(shl (srl ..., 56), 56)` are replaced by `(and ..., 255)` in `DAGCombiner::visitSRL` (this function is huge, look for `TLI.shouldFoldConstantShiftPairToMask()` call).

Why this works: Pattern #2
--------------------------

Consider the following example:

  define i1 @foo(ptr %p) {
  entry:
    %a = load i8, ptr %p, align 1
    br label %next

  next:
    %cond = icmp eq i8 %a, 0
    ret i1 %cond
  }

Consider log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command.
Log for first basic block:

  Initial selection DAG: %bb.0 'foo:entry'
  SelectionDAG has 9 nodes:
    t0: ch,glue = EntryToken
    t3: i64 = Constant<0>
          t2: i64,ch = CopyFromReg t0, Register:i64 %1
        t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
      t6: i64 = zero_extend t5
    t8: ch = CopyToReg t0, Register:i64 %0, t6
  ...
  Replacing.1 t6: i64 = zero_extend t5
  With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
   and 0 other values
  ...
  Optimized lowered selection DAG: %bb.0 'foo:entry'
  SelectionDAG has 7 nodes:
    t0: ch,glue = EntryToken
        t2: i64,ch = CopyFromReg t0, Register:i64 %1
      t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
    t8: ch = CopyToReg t0, Register:i64 %0, t9

Note:

- Initial selection DAG:
  - `%a = load ...` is lowered as `t6 = (zero_extend (load ...))` w/o special `isZExtFree()` overload added by this commit it is instead lowered as `t6 = (any_extend (load ...))`.
  - The decision to generate `zero_extend` or `any_extend` is done in `RegsForValue::getCopyToRegs` called from `SelectionDAGBuilder::CopyValueToVirtualRegister`:
    - if `isZExtFree()` for load returns true `zero_extend` is used;
    - `any_extend` is used otherwise.
- Optimized lowered selection DAG:
  - `t6 = (any_extend (load ...))` is replaced by `t9 = load ..., zext from i8`. This is done by `DagCombiner.cpp:tryToFoldExtOfLoad()` called from `DAGCombiner::visitZERO_EXTEND`.

Log for second basic block:

  Initial selection DAG: %bb.1 'foo:next'
  SelectionDAG has 13 nodes:
    t0: ch,glue = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t4: i64 = AssertZext t2, ValueType:ch:i8
          t5: i8 = truncate t4
        t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
      t9: i64 = any_extend t8
    t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
    t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
  ...
  Replacing.2 t18: i64 = and t4, Constant:i64<255>
  With: t4: i64 = AssertZext t2, ValueType:ch:i8
  ...
  Type-legalized selection DAG: %bb.1 'foo:next'
  SelectionDAG has 13 nodes:
    t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t4: i64 = AssertZext t2, ValueType:ch:i8
        t18: i64 = and t4, Constant:i64<255>
      t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
    t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
    t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
  ...
  Optimized type-legalized selection DAG: %bb.1 'foo:next'
  SelectionDAG has 11 nodes:
    t0: ch,glue = EntryToken
          t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t4: i64 = AssertZext t2, ValueType:ch:i8
      t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
    t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
    t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
  ...

Note:

- Initial selection DAG:
  - `t0` is an input value for this basic block, it corresponds load instruction (`t9`) from the first basic block.
  - It is accessed within basic block via `t4` (AssertZext (CopyFromReg t0, ...)).
  - The `AssertZext` is generated by RegsForValue::getCopyFromRegs called from SelectionDAGBuilder::getCopyFromRegs, it is generated only when `LiveOutInfo` with known number of leading zeros is present for `t0`.
  - Known register bits in `LiveOutInfo` are computed by `SelectionDAG::computeKnownBits` called from `SelectionDAGISel::ComputeLiveOutVRegInfo`.
  - `computeKnownBits()` generates leading zeros information for `(load ..., zext from ...)` but *does not* generate leading zeros information for `(load ..., anyext from ...)`. This is why `isZExtFree()` added in this commit is important.
- Type-legalized selection DAG:
  - `t5 = truncate t4` is replaced by `t18 = and t4, 255`
- Optimized type-legalized selection DAG:
  - `t18 = and t4, 255` is replaced by `t4`, this is done by `DAGCombiner::SimplifyDemandedBits` called from `DAGCombiner::visitAND`, which simplifies patterns like `(and (assertzext ...))`.

Impact
------

This change covers all remove_truncate_*.ll test cases:

- for -mcpu=v4 there are no changes in the generated code;
- for -mcpu=v2 code generated for remove_truncate_7 and remove_truncate_8 improved slightly, for other tests it is unchanged.

For remove_truncate_7:

  Before this revision                 After this revision
  --------------------                 -------------------
      r1 <<= 0x20                          r1 <<= 0x20
      r1 >>= 0x20                          r1 >>= 0x20
      if r1 == 0x0 goto +0x2 <LBB0_2>      if r1 == 0x0 goto +0x2 <LBB0_2>
      r1 = *(u32 *)(r2 + 0x0)              r0 = *(u32 *)(r2 + 0x0)
      goto +0x1 <LBB0_3>                   goto +0x1 <LBB0_3>
  <LBB0_2>:                            <LBB0_2>:
      r1 = *(u32 *)(r2 + 0x4)              r0 = *(u32 *)(r2 + 0x4)
  <LBB0_3>:                            <LBB0_3>:
      r0 = r1                              exit
      exit

For remove_truncate_8:

  Before this revision                 After this revision
  --------------------                 -------------------
      r2 = *(u32 *)(r1 + 0x0)              r2 = *(u32 *)(r1 + 0x0)
      r3 = r2                              r3 = r2
      r3 <<= 0x20                          r3 <<= 0x20
      r4 = r3                              r3 s>>= 0x20
      r4 s>>= 0x20
      if r4 s> 0x2 goto +0x5 <LBB0_3>      if r3 s> 0x2 goto +0x4 <LBB0_3>
      r4 = *(u32 *)(r1 + 0x4)              r3 = *(u32 *)(r1 + 0x4)
      r3 >>= 0x20
      if r3 >= r4 goto +0x2 <LBB0_3>       if r2 >= r3 goto +0x2 <LBB0_3>
      r2 += 0x2                            r2 += 0x2
      *(u32 *)(r1 + 0x0) = r2              *(u32 *)(r1 + 0x0) = r2
  <LBB0_3>:                            <LBB0_3>:
      r0 = 0x3                             r0 = 0x3
      exit                                 exit

For kernel BPF selftests statistics is as follows: (-mcpu=v4):

- For -mcpu=v4: 9 out of 655 object files have differences, in all cases total number of instructions marginally decreased (-27 instructions).
- For -mcpu=v2: 9 out of 655 object files have differences:
  - For 19 object files number of instruction decreased (-129 instruction in total): some redundant `rX &= 0xffff` and register to register assignments removed;
  - For 2 object files number of instructions increased +2 instructions in each file.

Both -mcpu=v2 instruction increases could be reduced to the same example:

  define void @foo(ptr %p) {
  entry:
    %a = load i32, ptr %p, align 4
    %b = sext i32 %a to i64
    %c = icmp ult i64 1, %b
    br i1 %c, label %next, label %end

  next:
    call void inttoptr (i64 62 to ptr)(i32 %a)
    br label %end

  end:
    ret void
  }

Note that this example uses value loaded to `%a` both as a sign extended (`%b`) and as zero extended (`%a` passed as parameter). Here is the difference in final assembly code:

  Before this revision          After this revision
  --------------------          -------------------
      r1 = *(u32 *)(r1 + 0)         r1 = *(u32 *)(r1 + 0)
      r1 <<= 32                     r1 <<= 32
      r1 s>>= 32                    r1 s>>= 32
      if r1 < 2 goto <LBB0_2>       if r1 < 2 goto <LBB0_2>
                                    r1 <<= 32
                                    r1 >>= 32
      call 62                       call 62
  <LBB0_2>:                     <LBB0_2>:
      exit                          exit

Before this commit `%a` is passed to call as a sign extended value, after this commit `%a` is passed to call as a zero extended value, both are correct as 32-bit sub-register is the same.

The difference comes from `DAGCombiner` operation on the initial DAG.
Initial selection DAG before this commit:

  t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
        t6: i64 = any_extend t5         <--------------------- (1)
      t8: ch = CopyToReg t0, Register:i64 %0, t6
          t9: i64 = sign_extend t5
        t12: i1 = setcc Constant:i64<1>, t9, setult:ch

Initial selection DAG after this commit:

  t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
        t6: i64 = zero_extend t5        <--------------------- (2)
      t8: ch = CopyToReg t0, Register:i64 %0, t6
          t9: i64 = sign_extend t5
        t12: i1 = setcc Constant:i64<1>, t9, setult:ch

The node `t9` is processed before node `t6` and `load` instruction is combined to load with sign extension:

  Replacing.1 t9: i64 = sign_extend t5
  With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
   and 0 other values
  Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
  With: t31: i32 = truncate t30
   and 1 other values

This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad` called from `DAGCombiner::visitSIGN_EXTEND`. Note that `t5` is used by `t6` which is `any_extend` in (1) and `zero_extend` in (2). `tryToFoldExtOfLoad()` rewrites such uses of `t5` differently:

- `any_extend` is simply removed
- `zero_extend` is replaced by `and t30, 0xffffffff`, which is later converted to a pair of shifts. This pair of shifts survives till the end of translation.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D157870

Files:
  llvm/lib/Target/BPF/BPF.h
  llvm/lib/Target/BPF/BPFISelLowering.cpp
  llvm/lib/Target/BPF/BPFISelLowering.h
  llvm/lib/Target/BPF/BPFMIPeephole.cpp
  llvm/lib/Target/BPF/BPFTargetMachine.cpp
  llvm/test/CodeGen/BPF/remove_truncate_9.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D157870.551320.patch
Type: text/x-patch
Size: 10790 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20230817/e987b3ed/attachment.bin>