[PATCH][LoopVectorizer] Restrict the unroll factor of reductions in loops

Mon Aug 11 09:17:49 PDT 2014

Hi Arnold, Gerolf, Hal,

Good idea about a tuning option. Attached is a patch that implements that
tuning option. Is it OK to commit?

Gerolf, did you want me to add "|| isCyclone()" to
AArch64TargetTransformInfo?

Cheers,

James


On 11 August 2014 04:31, Gerolf Hoflehner <ghoflehner at apple.com> wrote:

> On Typhoon the gain for libquantum is almost 2%, and about one percent on
> hmmer. No regression on CINT2006. This is O3 LTO, ref input.
>
> -Gerolf
>
> On Aug 8, 2014, at 4:57 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
>
> > I second a tuning option at least in the short term. It is usually hard
> to get it right, though. So longer term this is a case for dynamic
> versioning that invokes different versions of the code at run-time
> depending on the trip count.
> >
> > -Gerolf
> >
> > On Aug 8, 2014, at 12:30 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >
> >> ----- Original Message -----
> >>> From: "James Molloy" <james.molloy at arm.com>
> >>> To: "Arnold Schwaighofer" <aschwaighofer at apple.com>
> >>> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>
> >>> Sent: Friday, August 8, 2014 9:37:38 AM
> >>> Subject: [PATCH][LoopVectorizer] Restrict the unroll factor of
> reductions in        loops
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Hi Arnold,
> >>>
> >>>
> >>>
> >>> Attached are two patches. The first ups the maximum unroll factor on
> >>> AArch64 from 2 to 4, for C-A57 only at the moment as that’s all I’ve
> >>> got data for. This gives us significant wins – ~14% on
> >>> 462.libquantum at least.
> >>>
> >>>
> >>>
> >>> However it also causes some regressions. The second patch makes the
> >>> loop vectorizer a bit more conservative with its unroll factor. The
> >>> problem is purely for reductions within loops. The regressions I’ve
> >>> seen are small (but runtime-known) trip count loops within a loop
> >>> nest. A loop unroll factor of 2 is fine, but above 2 the reduction
> >>> variable fixup logic after the loop increases the critical path
> >>> length and resource usage. For most loops this isn’t a problem, but
> >>> small loops in a larger loop nest will execute this fixup code many
> >>> times.
> >>
> >> Can you please add a flag for this? I anticipate needing to tune it.
> >>
> >> Also, it seems to me that this is exactly the kind of thing that would
> benefit from profiling information (so we can determine if the inner loop
> is likely to have a large trip count). Can the current infrastructure do
> this? Also, maybe in cases where the inner loop count is not a function of
> the outer loop, we might 'unswitch' it so that we get the unrolled inner
> loop only when actually profitable.
> >>
> >> Thanks again,
> >> Hal
> >>
> >>>
> >>>
> >>>
> >>> The heuristic is: if this is a (scalar) reduction, and the loop is
> >>> nested, clamp the UF to a maximum of 2. With 2, we still get wins
> >>> but we only add one fadd/fmul to the critical path.
> >>>
> >>>
> >>>
> >>> Please take a look.
> >>>
> >>>
> >>>
> >>> Cheers,
> >>>
> >>>
> >>>
> >>> James
> >>> _______________________________________________
> >>> llvm-commits mailing list
> >>> llvm-commits at cs.uiuc.edu
> >>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >>>
> >>
> >> --
> >> Hal Finkel
> >> Assistant Computational Scientist
> >> Leadership Computing Facility
> >> Argonne National Laboratory
> >>
> >> _______________________________________________
> >> llvm-commits mailing list
> >> llvm-commits at cs.uiuc.edu
> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140811/41d077b1/attachment.html>
-------------- next part --------------
Index: lib/Transforms/Vectorize/LoopVectorize.cpp
===================================================================

--- lib/Transforms/Vectorize/LoopVectorize.cpp	(revision 215117)
+++ lib/Transforms/Vectorize/LoopVectorize.cpp	(working copy)
@@ -204,6 +204,11 @@
     "enable-cond-stores-vec", cl::init(false), cl::Hidden,
     cl::desc("Enable if predication of stores during vectorization."));
 
+static cl::opt<unsigned> MaxNestedScalarReductionUF(
+    "max-nested-scalar-reduction-unroll", cl::init(2), cl::Hidden,
+    cl::desc("The maximum unroll factor to use when unrolling a scalar "
+             "reduction in a nested loop."));
+
 namespace {
 
 // Forward declarations.
@@ -5545,6 +5550,18 @@
     unsigned StoresUF = UF / (Legal->NumStores ? Legal->NumStores : 1);
     unsigned LoadsUF = UF /  (Legal->NumLoads ? Legal->NumLoads : 1);
 
+    // If we have a scalar reduction (vector reductions are already dealt with
+    // by this point), we can increase the critical path length if the loop
+    // we're unrolling is inside another loop. Limit, by default to 2, so the
+    // critical path only gets increased by one reduction operation.
+    if (Legal->getReductionVars()->size() &&
+        TheLoop->getLoopDepth() > 1) {
+      unsigned F = static_cast<unsigned>(MaxNestedScalarReductionUF);
+      SmallUF = std::min(SmallUF, F);
+      StoresUF = std::min(StoresUF, F);
+      LoadsUF = std::min(LoadsUF, F);
+    }
+
     if (EnableLoadStoreRuntimeUnroll && std::max(StoresUF, LoadsUF) > SmallUF) {
       DEBUG(dbgs() << "LV: Unrolling to saturate store or load ports.\n");
       return std::max(StoresUF, LoadsUF);
-------------- next part --------------
Index: lib/Target/AArch64/AArch64TargetTransformInfo.cpp
===================================================================
--- lib/Target/AArch64/AArch64TargetTransformInfo.cpp	(revision 215117)
+++ lib/Target/AArch64/AArch64TargetTransformInfo.cpp	(working copy)
@@ -104,7 +104,7 @@
     return 64;
   }
 
-  unsigned getMaximumUnrollFactor() const override { return 2; }
+  unsigned getMaximumUnrollFactor() const override;
 
   unsigned getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src) const
       override;
@@ -513,3 +513,9 @@
   }
   return Cost;
 }
+
+unsigned AArch64TTI::getMaximumUnrollFactor() const {
+  if (ST->isCortexA57())
+    return 4;
+  return 2;
+}