[all-commits] [llvm/llvm-project] c02184: [LoopVectorize] Allow inner loop runtime checks to...

Thu Aug 24 05:14:44 PDT 2023

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: c02184f286d13160460f0b2e38e4fd2a3edff7f0
      https://github.com/llvm/llvm-project/commit/c02184f286d13160460f0b2e38e4fd2a3edff7f0
  Author: David Sherwood <david.sherwood at arm.com>
  Date:   2023-08-24 (Thu, 24 Aug 2023)

  Changed paths:
    M llvm/include/llvm/Analysis/LoopAccessAnalysis.h
    M llvm/include/llvm/Transforms/Utils/LoopUtils.h
    M llvm/lib/Analysis/LoopAccessAnalysis.cpp
    M llvm/lib/Transforms/Utils/LoopUtils.cpp
    M llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
    M llvm/test/Transforms/LoopVectorize/runtime-checks-hoist.ll

  Log Message:
  -----------
  [LoopVectorize] Allow inner loop runtime checks to be hoisted above an outer loop

Suppose we have a nested loop like this:

  void foo(int32_t *dst, int32_t *src, int m, int n) {
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < n; j++) {
        dst[(i * n) + j] += src[(i * n) + j];
      }
    }
  }

We currently generate runtime memory checks as a precondition for
entering the vectorised version of the inner loop. However, if the
runtime-determined trip count for the inner loop is quite small
then the cost of these checks becomes quite expensive. This patch
attempts to mitigate these costs by adding a new option to
expand the memory ranges being checked to include the outer loop
as well. This leads to runtime checks that can then be hoisted
above the outer loop. For example, rather than looking for a
conflict between the memory ranges:

1. &dst[(i * n)] -> &dst[(i * n) + n]
2. &src[(i * n)] -> &src[(i * n) + n]

we can instead look at the expanded ranges:

1. &dst[0] -> &dst[((m - 1) * n) + n]
2. &src[0] -> &src[((m - 1) * n) + n]

which are outer-loop-invariant. As with many optimisations there
is a trade-off here, because there is a danger that using the
expanded ranges we may never enter the vectorised inner loop,
whereas with the smaller ranges we might enter at least once.

I have added a HoistRuntimeChecks option that is turned off by
default, but can be enabled for workloads where we know this is
guaranteed to be of real benefit. In future, we can also use
PGO to determine if this is worthwhile by using the inner loop
trip count information.

When enabling this option for SPEC2017 on neoverse-v1 with the
flags "-Ofast -mcpu=native -flto" I see an overall geomean
improvement of ~0.5%:

SPEC2017 results (+ is an improvement, - is a regression):
520.omnetpp: +2%
525.x264: +2%
557.xz: +1.2%
...
GEOMEAN: +0.5%

I didn't investigate all the differences to see if they are
genuine or noise, but I know the x264 improvement is real because
it has some hot nested loops with low trip counts where I can
see this hoisting is beneficial.

Tests have been added here:

  Transforms/LoopVectorize/runtime-checks-hoist.ll

Differential Revision: https://reviews.llvm.org/D152366