[PATCH] D98230: [LSR] Add reconciliation of unfoldable offsets

Mon Mar 8 18:29:40 PST 2021

jonpa created this revision.
jonpa added reviewers: uweigand, reames, SjoerdMeijer, greened.
Herald added subscribers: steven.zhang, hiraditya, kristof.beyls.
jonpa requested review of this revision.
Herald added a project: LLVM.

I recently found out that LBM performance can be improved by 5-10% on SystemZ if the addressing in the hot loop (LBM_performStreamCollideTRT) is improved. The loop currently has many unfolded offsets, each of which has to be computed with a register move plus a (two-address) 32-bit immediate addition. I have experimented with LSR and found that this can be handled by doing two things:

- Reconcile unfoldable offsets. Currently, a Fixup with a foldable offset is placed into a pre-existing LSRUse, but each Fixup with an unfoldable offset gets its own LSRUse: they are never grouped together, even when their huge offsets differ by small (foldable) amounts. A new method reconcileUnfoldedAddressOffsets() performs this grouping.

- Limit the number of filtered-out Formulas in NarrowSearchSpaceByFilterFormulaWithSameScaledReg() so that those without unfoldable offsets do not get lost.
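
To illustrate the first point, here is a minimal standalone sketch of the grouping idea. It is not the code from the patch: the names OffsetGroup and groupUnfoldableOffsets, and the single MaxDisp limit standing in for the target's foldable displacement range, are assumptions made for this example only. Offsets are sorted and greedily attached to the most recent base as long as the difference is still foldable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a group of fixups that share one base offset.
struct OffsetGroup {
  int64_t BaseOffset;          // large offset, materialized once in a register
  std::vector<int64_t> Deltas; // small per-fixup displacements relative to BaseOffset
};

// Group large offsets so that every member is within MaxDisp of its group's
// base. MaxDisp is an assumed, simplified model of what the target can fold
// into an addressing mode; real targets have more detailed rules.
std::vector<OffsetGroup> groupUnfoldableOffsets(std::vector<int64_t> Offsets,
                                                int64_t MaxDisp) {
  assert(MaxDisp >= 0 && "foldable displacement limit must be non-negative");
  std::sort(Offsets.begin(), Offsets.end());

  std::vector<OffsetGroup> Groups;
  for (int64_t Off : Offsets) {
    // Start a new group when the distance to the current base is not foldable.
    if (Groups.empty() || Off - Groups.back().BaseOffset > MaxDisp)
      Groups.push_back({Off, {}});
    Groups.back().Deltas.push_back(Off - Groups.back().BaseOffset);
  }
  return Groups;
}
```

With the offsets 1000000, 1000008, 1000016, 2000000 and 2000024 and an assumed MaxDisp of 4095, this yields two groups (bases 1000000 and 2000000), so only two large immediates need to be materialized instead of five.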

Overall, this is an improvement with respect to AGFIs on SPEC, but there are also some rare cases where this gets worse. I think this is because SystemZTTI accepts long displacements in the LSR phase that builds the LSRUses with their Fixups. Then, during Solve(), the Instruction pointer is passed to SystemZTTI::isLSRCostLess(), which then says that those offsets/Fixups are in fact not foldable, so a good solution cannot be found. I experimented with disallowing the long displacements (for vector/fp) in the early phase as well, but this changed a tremendous number of files with mixed benchmark effects, so that also seems to be a matter of tuning. Since the cases that get worse with this patch are rare, and the patch is now relatively simple with a clear benchmark improvement, I would like to return to the other issues after this.

Four tests failed with this, and looking at CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll, it seemed that there were now more spills/reloads. I am not sure why, so I made this optional (for now) with a target hook TTI.LSRUnfOffsetsReconc().

  LLVM :: CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll
  LLVM :: CodeGen/ARM/loop-indexing.ll
  LLVM :: CodeGen/PowerPC/bdzlr.ll
  LLVM :: CodeGen/PowerPC/lsr-profitable-chain.ll
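
The opt-in arrangement can be sketched roughly as follows. Only the hook name LSRUnfOffsetsReconc comes from the patch; the classes TargetHooksBase and SystemZLikeHooks and the function plannedSteps are hypothetical stand-ins and do not reflect how LLVM's TargetTransformInfo is actually structured.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the target-independent defaults.
struct TargetHooksBase {
  virtual ~TargetHooksBase() = default;
  // Off by default, so targets that do not opt in keep their current output.
  virtual bool LSRUnfOffsetsReconc() const { return false; }
};

// Hypothetical target that opts in to the reconciliation step.
struct SystemZLikeHooks : TargetHooksBase {
  bool LSRUnfOffsetsReconc() const override { return true; }
};

// Returns the names of the steps a simplified pass would run for this target.
std::vector<std::string> plannedSteps(const TargetHooksBase &TTI) {
  std::vector<std::string> Steps = {"collect-fixups"};
  if (TTI.LSRUnfOffsetsReconc())
    Steps.push_back("reconcile-unfoldable-offsets");
  Steps.push_back("solve");
  return Steps;
}
```

The point of the default returning false is that targets such as ARM and PowerPC, whose tests regressed, are unaffected until someone investigates and enables the hook for them.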

Is this the right approach to remedy the LBM loop?

https://reviews.llvm.org/D98230

Files:
  llvm/include/llvm/Analysis/TargetTransformInfo.h
  llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
  llvm/lib/Analysis/TargetTransformInfo.cpp
  llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
  llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
  llvm/test/CodeGen/SystemZ/loop-01.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D98230.329183.patch
Type: text/x-patch
Size: 9125 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20210309/a391a41a/attachment.bin>