[PATCH] D102748: [LoopUnroll] Don't unroll before vectorisation

Wed May 19 01:19:45 PDT 2021

SjoerdMeijer created this revision.
SjoerdMeijer added reviewers: efriedma, spatel, xbolva00, fhahn, dmgreen, reames, david-arm, RKSimon.
Herald added subscribers: hiraditya, kristof.beyls.
SjoerdMeijer requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

The loop vectoriser is sandwiched between two loop unroll invocations in the optimisation pipeline, and this patch removes the first one. The motivation is that (fully) unrolling loops early removes opportunities for the loop vectoriser. This is often the case for loops with constant loop bounds and relatively small iteration counts, for which vectorisation is still very much profitable. After these sorts of loops have been fully unrolled, the SLP vectoriser is not always able to compensate, or is not (yet) as effective as the loop vectoriser. Therefore, performing loop vectorisation first, then unrolling, then SLP vectorisation seems a better approach.

There are quite a few of these cases in x264 in SPEC. Here is one that GCC loop vectorises and we don't, which is the reason why we are quite a lot behind:

  for( int i = 0; i < 16; i++ )
    if ((dct[i]) > 0 )
      dct[i] = (bias[i] + (dct[i])) * (mf[i]) >> 16;
    else
      dct[i] = - ((bias[i] - (dct[i])) * (mf[i]) >> 16);
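
For anyone who wants to try this outside of x264, the loop can be wrapped in a small standalone function. This is only a sketch: the function name `quant_16` and the element types (`int16_t` for `dct`, `uint16_t` for `mf` and `bias`) are assumptions made for illustration, not taken from the x264 sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical standalone wrapper around the loop quoted above.
 * Name and element types are assumptions for illustration only.
 * Like the original expression, this assumes the inputs are small
 * enough that the product fits in an int. */
void quant_16(int16_t dct[16], const uint16_t mf[16], const uint16_t bias[16])
{
    assert(dct && mf && bias);

    /* Constant trip count of 16: a candidate for early full unrolling,
     * which is what takes the loop away from the loop vectoriser. */
    for (int i = 0; i < 16; i++) {
        if (dct[i] > 0)
            dct[i] = (int16_t)((bias[i] + dct[i]) * mf[i] >> 16);
        else
            dct[i] = (int16_t)-((bias[i] - dct[i]) * mf[i] >> 16);
    }
}
```

Compiling this with something like `clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize` should give a rough idea of whether the loop vectoriser gets to see the loop, although the exact remarks depend on the target and compiler version.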

But this is also a bit of an old problem, and at least the following PRs are related: PR47178, PR47726, PR47554, PR47436, PR31572, PR47553, PR47491.

Some first performance numbers with the patch, where `+` is a performance improvement and `-` is a regression:

     AArch64 (neoverse-n1)
     500.perlbench_r +0.34%
     502.gcc_r       -0.28%
     505.mcf_r       -0.60%
     520.omnetpp_r   +0.58%
     523.xalancbmk_r +1.68%
     525.x264_r      +1.33%
     531.deepsjeng_r +0.29%
     541.leela_r     -0.54%
     557.xz_r         0.00%
  
And:
  
     Thumb2 (Cortex-M55)
     CoreMark -0.21%
     EEMBC    +0.06%
     DSP      +0.02%

These numbers show an improvement where I would like to see it: x264. The uplift in xalancbmk is nice too, but I haven't analysed that one yet. The other numbers show a little bit of up and down behaviour, but the changes are very small and overall cancel each other out. I think these are really encouraging results, because they suggest we get the improvement where we want it without any fallout. This picture was confirmed on a set of embedded benchmarks (where DSP is an Arm DSP library/benchmark).

I am not really a fan of the llvm test suite as a performance benchmark (noisy), but I will get some numbers for that too. While I do that, and fix up a few llvm regression test cases (the ones that check optimisation pipeline order), I wanted to share this already to get some opinions.


https://reviews.llvm.org/D102748

Files:
  llvm/lib/Passes/PassBuilder.cpp


Index: llvm/lib/Passes/PassBuilder.cpp
===================================================================

--- llvm/lib/Passes/PassBuilder.cpp
+++ llvm/lib/Passes/PassBuilder.cpp
@@ -773,17 +773,6 @@
   if (EnableLoopInterchange)
     LPM2.addPass(LoopInterchangePass());
 
-  // Do not enable unrolling in PreLinkThinLTO phase during sample PGO
-  // because it changes IR to makes profile annotation in back compile
-  // inaccurate. The normal unroller doesn't pay attention to forced full unroll
-  // attributes so we need to make sure and allow the full unroll pass to pay
-  // attention to it.
-  if (Phase != ThinOrFullLTOPhase::ThinLTOPreLink || !PGOOpt ||
-      PGOOpt->Action != PGOOptions::SampleUse)
-    LPM2.addPass(LoopFullUnrollPass(Level.getSpeedupLevel(),
-                                    /* OnlyWhenForced= */ !PTO.LoopUnrolling,
-                                    PTO.ForgetAllSCEVInLoopUnroll));
-
   for (auto &C : LoopOptimizerEndEPCallbacks)
     C(LPM2, Level);
 

