[PATCH] D123394: [CodeGen] Late cleanup of redundant address/immediate definitions.

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 8 08:21:31 PDT 2022


jonpa created this revision.
jonpa added reviewers: uweigand, eugenis, efriedma, danielkiss, rnk, pengfei, bogner.
Herald added subscribers: hiraditya, kristof.beyls.
Herald added a project: All.
jonpa requested review of this revision.
Herald added a project: LLVM.

On SystemZ, eliminateFrameIndex() may use a scratch register to load part of an offset that is out of range for the using instruction (this is likely also done on other targets - I believe at least AArch64 also does this). Since this is done per instruction (FI operand), several identical addressing "anchor points" may result in the MBB where actually only the first one is needed (provided the GPR is the same). I was expecting this to be cleaned up somehow afterwards, but to my surprise it is not.

I have therefore now made an initial experiment consisting of making PrologEpilogInserter try to clean up redundant definitions after all frame indices have been eliminated, and it would be great to get feedback from other targets to see if this is generally useful. It seems beneficial on SystemZ at least (see below). The patch isn't quite ready yet but I hope it can be the start of a discussion at least. It actually now seems that even though there are "load address" instructions removed, there are in addition many more redundant immediate loads removed (see below). This makes me think that maybe this would better belong in a separate pass run after PEI (or perhaps even later)?

The patch currently looks for simple "candidate" instructions which may be removed if an identical preceding one exists without any intervening clobbering of the value. The CFG is traversed breadth-first so that live-in definitions can be reused.

Given that the the global analysis adds complexity, I tried also doing this just locally (by passing -pei-localonly). I did find however that the reuse from predecessors is what gives the bulk of improvement, so I would say it is worth that extra effort. On SystemZ/SPEC17:

  main <> "local only":
  lghi           :               445967               443571    -2396  // Immediate load
  lay            :                55090                54783     -307  // Load address
  vgbm           :                10974                10848     -126  // Immediate load
  ...
  OPCDIFFS: -2969



  main <> patch:
  lghi           :               445967               432678   -13289  // Immediate load
  lhi            :               219663               216594    -3069  // Immediate load
  la             :               533531               531952    -1579  // Load address
  lay            :                55090                54774     -316  // Load address
  ...
  OPCDIFFS: -19862

I would say that this is a number of instructions removed (~20k) that makes this look interesting to me... (*) Some examples:

LAYs removed in ./f507.cactuBSSN_r/build/ML_BSSN_Advect.s:

          lay     %r1, 4096(%r15)
          std     %f4, 1024(%r1)
  -       lay     %r1, 4096(%r15)
          std     %f1, 1000(%r1)
  -       lay     %r1, 4096(%r15)
          std     %f5, 1016(%r1)
  -       lay     %r1, 4096(%r15)

LGHI removed in ./f507.cactuBSSN_r/build/RadiationBoundary.s:

          lghi    %r3, 0
          clgijl  %r5, 12, .LBB2_23     // Conditional branch
          risbgn  %r4, %r4, 1, 189, 0
          lghi    %r5, 0
  -       lghi    %r3, 0

VGBMs (vector immediate loads) removed in ./f510.parest_r/build/derivative_approximation.s

  ...
          vst     %v0, 344(%r15), 3
  -       vgbm    %v0, 0
          vst     %v0, 360(%r15), 3
  -       vgbm    %v0, 0
          vst     %v0, 312(%r15), 3
  -       vgbm    %v0, 0
          vst     %v0, 328(%r15), 3
  -       vgbm    %v0, 0
          vst     %v0, 280(%r15), 3
  -       vgbm    %v0, 0
          vst     %v0, 296(%r15), 3

(*) I also see a few hundred more lmg/br insructions in total, which I am guessing come from some missed CFG optimization, but not sure if this is important. For example:

  -       j       .LBB50_33
  +       lochie  %r13, 1
  +       llgfr   %r2, %r13
  +       lmg     %r10, %r15, 240(%r15)
  +       br      %r14
   .LBB50_32:
          lghi    %r0, -1
          sllg    %r0, %r0, 0(%r12)
          xihf    %r0, 4294967295
          xilf    %r0, 4294967295
          cg      %r0, 24(%r11)
  -.LBB50_33:
  -       lhi     %r13, 0
          lochie  %r13, 1
          llgfr   %r2, %r13
          lmg     %r10, %r15, 240(%r15)
          br      %r14

Current status is that benchmarks build with -verify-machineinstrs, but there are many regression tests failing (hopefully due to removed redundant defs :-). I am hoping that people working with those targets can help me understand those...


https://reviews.llvm.org/D123394

Files:
  llvm/lib/CodeGen/PrologEpilogInserter.cpp
  llvm/test/CodeGen/SystemZ/frame-28.mir

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D123394.421528.patch
Type: text/x-patch
Size: 15143 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20220408/6b0278b4/attachment.bin>


More information about the llvm-commits mailing list