[PATCH] D152366: [LoopVectorize] Allow inner loop runtime checks to be hoisted above an outer loop

Wed Jun 7 05:11:17 PDT 2023

david-arm created this revision.
david-arm added reviewers: sdesmalen, fhahn, kmclaughlin, dmgreen.
Herald added subscribers: wlei, shiva0217, StephenFan, wenlei, hiraditya.
Herald added a project: All.
david-arm requested review of this revision.
Herald added subscribers: llvm-commits, pcwang-thead.
Herald added a project: LLVM.

Suppose we have a nested loop like this:

  void foo(int32_t *dst, int32_t *src, int m, int n) {
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < n; j++) {
        dst[(i * n) + j] += src[(i * n) + j];
      }
    }
  }

We currently generate runtime memory checks as a precondition for
entering the vectorised version of the inner loop. These checks
are executed on every iteration of the outer loop, so if the
runtime-determined trip count for the inner loop is small their
cost can be significant relative to the work done in the loop.
This patch attempts to mitigate that cost by adding a new option
to expand the memory ranges being checked to cover the outer loop
as well. This leads to runtime checks that can then be hoisted
above the outer loop. For example, rather than looking for a
conflict between the memory ranges:

1. &dst[(i * n)] -> &dst[(i * n) + n]
2. &src[(i * n)] -> &src[(i * n) + n]

we can instead look at the expanded ranges:

1. &dst[0] -> &dst[((m - 1) * n) + n]
2. &src[0] -> &src[((m - 1) * n) + n]

which are outer-loop-invariant. As with many optimisations there
is a trade-off here: the expanded ranges are more likely to
overlap, so there is a danger that we never enter the vectorised
inner loop, whereas with the smaller per-iteration ranges we
might enter it at least once.
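
The difference between the two kinds of check can be sketched in
plain C. This is only an illustration of the idea, not the IR the
vectoriser emits: the helper names (ranges_overlap,
inner_check_passes, hoisted_check_passes) are hypothetical, and
the sketch assumes both pointers refer to the same underlying
buffer or are compared as integer addresses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the half-open ranges
 * [a, a + len) and [b, b + len) share at least one element.
 * Addresses are compared as integers. */
static bool ranges_overlap(const int32_t *a, const int32_t *b, size_t len) {
  uintptr_t a0 = (uintptr_t)a;
  uintptr_t b0 = (uintptr_t)b;
  uintptr_t bytes = (uintptr_t)(len * sizeof(int32_t));
  return a0 < b0 + bytes && b0 < a0 + bytes;
}

/* Per-iteration check: depends on i, so it has to be evaluated
 * on every iteration of the outer loop. */
static bool inner_check_passes(const int32_t *dst, const int32_t *src,
                               size_t i, size_t n) {
  return !ranges_overlap(dst + i * n, src + i * n, n);
}

/* Expanded check: covers all m * n elements touched by the whole
 * loop nest, so it is outer-loop-invariant and can be evaluated
 * once, above the outer loop. */
static bool hoisted_check_passes(const int32_t *dst, const int32_t *src,
                                 size_t m, size_t n) {
  return !ranges_overlap(dst, src, m * n);
}
```

For example, with dst = buf, src = buf + n and m >= 2, every
per-iteration check passes (row i of dst never overlaps row i of
src), but the expanded ranges do overlap, so the hoisted check
fails and the vectorised inner loop is never entered.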

I have added a HoistRuntimeChecks option that is turned off by
default, but can be enabled for workloads where hoisting is known
to be of real benefit. In future, we could also use PGO inner
loop trip count information to determine whether hoisting is
worthwhile.

When enabling this option for SPEC2017 on neoverse-v1 with the
flags "-Ofast -mcpu=native -flto" I see an overall geomean
improvement of ~0.5%:

SPEC2017 results (+ is an improvement, - is a regression):
520.omnetpp: +2%
525.x264: +2%
557.xz: +1.2%
...
GEOMEAN: +0.5%

I suspect the omnetpp and xz differences are noise, but I know the
x264 improvement is real because it has some hot nested loops
with low trip counts where I can see this hoisting is beneficial.

Tests have been added here:

  Transforms/LoopVectorize/runtime-checks-hoist.ll

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D152366

Files:
  llvm/include/llvm/Analysis/LoopAccessAnalysis.h
  llvm/include/llvm/Transforms/Utils/LoopUtils.h
  llvm/lib/Analysis/LoopAccessAnalysis.cpp
  llvm/lib/Transforms/Utils/LoopUtils.cpp
  llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
  llvm/test/Transforms/LoopVectorize/runtime-checks-hoist.ll
