[llvm-dev] [RFC] Polly Status and Integration

Michael Kruse via llvm-dev llvm-dev at lists.llvm.org
Wed Oct 11 00:07:11 PDT 2017


Dear LLVM community,

When working on larger projects, probably everybody has come up with
ideas on how they would do things differently if they started from
scratch. For instance, in LLVM, modeling PHI nodes as in Swift's
intermediate representation might be such an idea, but it would
require huge changes in the code base. As a contributor to Polly, I am
no exception. I would therefore like to share my thoughts about what
the best possible integration of a polyhedral optimizer/Polly could
look like, and to discuss which end result we might strive for. I am a
proponent of polyhedral compilation and would like to see it used on a
daily basis by just choosing an optimization level.

IMHO we should avoid having multiple independent loop optimizations
in LLVM. If each of them does its own loop versioning, we will get
multiple redundant
runtime checks and code duplication. This is already the case with
LoopVersioningLICM, LoopDistribution, LoopVectorizer and
LoopLoadElimination. I argue for a more integrated approach for
loop-based optimizations.

To clarify, what this is not:
- It has not been shown to work in practice.
- While I would like to contribute to implementing such a loop
optimizer, it is not something I could do alone.
- It is no direct replacement for Polly or VPlan, but rather a
long-term vision which also includes VPlan/Polly.

To have a comparison, I first sketch the approaches taken by the
LoopVectorizer, VPlan and Polly, and then present the "Loop
Optimization Framework": first only the design, and afterwards the
rationales for the design decisions.

To be able to show some examples, I'll refer to this snippet that we
intend to optimize.

for (x = 0; x < sizes->nx; x += 1)
  for (y = 0; y < sizes->ny; y += 1)
     C[x][y] = 0;
for (x = 0; x < sizes->nx; x += 1) {
  tmp = A[x]; /* hoisted load by LICM or GVN */
  int ny = sizes->ny; /* dynamic load of size parameter */
  for (y = 0; y < sizes->ny; y += 1) {
    if (x != y) /* affine conditional */
      C[x][y] += tmp * B[y];
    if (C[x][y] > 0) /* data-dependent conditional */
      C[x][y] = sqrt(C[x][y]);
    if (verbose) /* non-performance-critical case */
      printf("Some output");
  }
}


LoopVectorizer (+VPlan)
-------------------------

This is the part I know the least about; please correct me if I state
something wrong.

Details on VPlan were taken from [1] and its talk, and the code
already committed to LLVM.


A) Preprocessing

LCSSA-form and loop-simplification.


B) Scope of Analysis

Innermost loop, single body BB (or if-convertible to single BB)

VPlan: Acyclic control flow with SESE (single entry/exit block or edge?).

Extensions for outer loop vectorization (cyclic control flow) to be
presented at the dev meeting: [2]


C) Control-Flow Modeling

Not needed, it's always just a single loop.

VPlan: Hierarchical-CFG of VPBasicBlocks and VPRegionBlocks.


D) Access Modeling (LoopAccessInfo/LoopAccessAnalysis/AccessAnalysis)

Uses the AliasSetTracker and analyzes each alias set independently.
No PHI nodes (apart from the induction variable) are allowed (there
seems to be some support for reductions).


E) Dependency Analysis (MemoryDepChecker)

Exposes only the DependenceType and the maximum vectorization factor
to the outside.


F) Assumption Tracking (PredicatedScalarEvolution/LoopVectorizationRequirements)

Assumptions can be added to PredicatedScalarEvolution to make SCEV
modeling possible/simpler/exact.
Used for:
- Assume exact BackedgeTakenCount
- Assume no overflow
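
For illustration, a sketch (not actual vectorizer output) of a loop
where the exact trip count needs such a predicate:

/* Sketch: 'i' is 32 bits but 'n' is 64 bits. SCEV can only give an
   exact backedge-taken count under the assumption that 'i' does not
   wrap (i.e. n <= UINT_MAX); PredicatedScalarEvolution records this
   as a predicate to be checked at runtime instead of bailing out. */
void f(float *A, unsigned long long n) {
  for (unsigned i = 0; i < n; i += 1) /* wraps to 0 if n > UINT_MAX */
    A[i] += 1.0f;
}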

Aliasing is tracked by AccessAnalysis.


G) Transformation

One explicit transformation: either vectorization is possible, or
nothing is done at all.

VPlan: Modify VPlans (widening, etc.) and pick the best according to
the cost model.


H) Code Generation (InnerLoopVectorizer/LoopVersioning)

Create a new vectorized loop (without a prologue, to get aligned
accesses) and use the old loop as the epilogue and as the fallback if
the assumptions are not met.

VPlan: Generate instructions from the recipes in the VPlan.


I) Pass Manager Integration

Depends on analyses: LoopInfo and LoopAccessAnalysis

Otherwise LoopVectorize is self-contained and instantiates objects
itself (eg. LoopVersioning)



Polly Architecture
------------------

A) Preprocessing

There is no considerable preprocessing (anymore). The only
modification done is inserting an entry block containing only allocas,
to which Polly's IR generator can add its own allocas. The split-off
BB can then be analyzed as well.


B) Scope of Analysis (ScopDetection/ScopInfo)

Search for the largest compatible llvm::Regions, called SCoPs.
Compatibility requirements are (among others):
- Single Entry Single Exit: SESE
- Base pointers invariant in SCoP
- Affine loop-bounds
- Only multidimensional affine array accesses (by default)
- No function calls (except side-effect free functions)

Regions are expanded in order to include multiple top-level loops in
one region. In the example, both outermost loops are included in the
same SCoP region.

LoadInst results that are used for loop bounds and array subscripts
are assumed to be invariant during the execution of the SCoP. This
makes it possible to use `sizes->ny` as a loop bound, even though the
LoadInst could return a different value at every iteration of the
outer loop (invariant load hoisting).
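
As a source-level sketch of what this models (names taken from the
running example):

struct Sizes { long nx, ny; };

/* Invariant load hoisting, sketched: the load of sizes->ny is assumed
   to return the same value during the whole SCoP, as if it had been
   hoisted in front of it; a bail-out (if 'sizes' is written to inside
   the SCoP) guards the assumption. */
void zero(struct Sizes *sizes, double **C) {
  long ny = sizes->ny; /* assumed invariant for the whole SCoP */
  for (long x = 0; x < sizes->nx; x += 1)
    for (long y = 0; y < ny; y += 1)
      C[x][y] = 0;
}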

Conditions that lead to function calls are assumed to be false. In
the example, this makes it possible to remove the call to printf by
adding the assumption `!verbose`.



C) Control-Flow Modeling (Schedule Tree)

A schedule tree is basically a tree of band nodes (collections of
perfectly nested loops with a defined order) and sequence nodes (child
nodes are executed in sequence order). Affine conditionals are
implemented as filters on the loop variable values for which they are
executed.

Example (simplified):
Root
 |
 +- Band(Loops: x, y)
 |  |
 |  +- C[x][y] = 0;
 |
 +- Band(Loops: x)
    |
    + tmp = A[x]; /* statement */
    |
    +- Band(Loops: y)
       |
       +- Filter(x - y > 0 || x - y < 0) /* affine conditional */
       |  |
       |  +- C[x][y] += tmp * B[y]; /* statement */
       |
       +- if(C[x][y] > 0) C[x][y] = sqrt(C[x][y]); /* data-dependent
conditional as macro-statement (llvm::Region) */

A statement execution for fixed x and y is called a 'statement instance'.

Schedule trees are part of the Integer Set Library (isl) [3], which
is also used to represent polyhedra in general.


D) Access Modeling (polly::MemoryAccess)

Every load/store has a base pointer and multi-dimensional indices.
Indices are derived from GetElementPtr and/or using ScalarEvolution
and then converted to piecewise-affine functions (isl_pw_aff).

Because in C, multi-dimensional accesses decay to flat subscripts
(C[x*sizes->ny + y]), delinearization tries to recover the
multidimensional subscripts. If this does not succeed, the option
-polly-allow-nonaffine interprets such accesses as touching the entire
array. Assumptions are added to ensure that the accesses are within
the derived dimension sizes.
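
A sketch of what delinearization recovers (assuming an nx-by-ny
array):

/* Delinearization, sketched: the flattened subscript is recovered as
   the two-dimensional access C[x][y] of an nx-by-ny array, under the
   runtime-checked assumption 0 <= y < ny (otherwise the recovered
   rows would overlap). */
void zero(double *C, long nx, long ny) {
  for (long x = 0; x < nx; x += 1)
    for (long y = 0; y < ny; y += 1)
      C[x * ny + y] = 0; /* modeled as C[x][y] */
}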

llvm::Instructions are modeled as zero-dimensional arrays (or
"scalars") if they have a use outside of the statement instance they
are defined in.


E) Dependency Analysis (DependencyInfo)

When passed the accesses for each statement, isl computes the
dependencies (flow, anti and output) as a relational polyhedron. That
is, the exact dependencies between all statement instances.

Example:
{ [x,y-1] -> [x,y]   if x != y;
  [x,y-1] -> [x,y+1] if x == y }
describes the dependency of the statement with the affine conditional
on the next instance of itself.


F) Assumption Tracking (ScopInfo)

Remembers the assumptions mentioned before. There are four kinds of
assumptions:
- Valid value ranges that do not require a runtime check (eg. from
__builtin_assume)
- Value ranges that must be checked at runtime (eg. the range of nx,
ny such that all array subscripts are within bounds)
- Value ranges that are forbidden and must be checked at runtime (eg.
`verbose` must not be true)
- Non-aliasing assumptions, checked at runtime.

Some violated assumptions may lead to a bail-out, eg. if the pointer
of an assumed invariant load is written to.


G) Transformations (ForwardOpTree/DeLICM/Simplify/ScheduleOptimizer)

To improve schedulability, ForwardOpTree and DeLICM modify the
instruction lists of statements (not the IR itself). For instance,
ForwardOpTree propagates the definition of tmp (a LoadInst) into the
statements that use it. In the example this makes the x and y loops
perfectly nested so that they can be tiled, and it removes the false
dependencies of tmp on itself (a "scalar dependency").

isl then computes a new schedule tree from the old schedule tree and
the dependence analysis. By default, it uses a modified PLuTo [4]
algorithm. Its goals are:
- Separate independent statements into separate bands
- Reduce the distance between dependencies to optimize temporal locality
- Find parallel loops

On the transformed schedule tree, Polly walks the tree to find
additional optimization opportunities:
- Tile bands that are 'permutable' and have at least 2 loops
- Optionally vectorize the first innermost 'coincident' loop
- Detect matrix-multiplication access patterns and apply a special
optimization [5]


H) Code Generation (IslAstInfo/CodeGeneration/PPCGCodeGeneration)

Isl can transform a schedule tree representation into an AST
representation, with loops and if-conditions.

CodeGeneration versions the original SCoP region using the tracked
assumptions and emits the AST representation.  For a statement in the
AST, it copies the instructions from the instruction list. For
parallel loops, it can generate OpenMP. PPCGCodeGeneration generates
code for GPU accelerators.


I) Pass Manager Integration

ScopDetection is a function pass, all following passes are ScopPasses
(RegionPasses in disguise):
- ScopInfo (analysis)
- JScopImporter (SCoP transformation)
- Simplify (SCoP transformation)
- ForwardOpTree (SCoP transformation)
- DeLICM (SCoP transformation)
- DependenceInfo (SCoP analysis)
- ScheduleOptimizer (SCoP transformation)
- IslAstInfo (SCoP analysis)
- CodeGeneration (IR transformation)




Proposed Loop Optimization Framework
------------------------------------

Only the architecture is outlined here. The rationale for each design
decision can be found in the "rationales" section below.


A) Preprocessing

Make a copy of the entire function so that transformations can be
applied purely for the sake of analysis. For instance (and not
necessarily in the preprocessing):
- Invariant load hoisting (Move the load of sizes->ny before the loop)
- Move operations closer to the instructions that use them (Move tmp =
A[x] to the instructions that use it)
- Split basic blocks to allow loop distribution

The function will be discarded when no loop transformation is applied.

This step can be made optional, being active only at higher
optimization settings.


B) Scope of Analysis

Always entire functions. Some loops/blocks might be marked as non-transformable.


C) Control-Flow Modeling

An AST-like tree based on the LoopInfo tree. Statements are pointers
to basic blocks in the original IR. Conditionals are modeled as data
dependencies.

Example:

+- Loop 0 <= x < sizes->nx
|  |
|  +- Loop 0 <= y < sizes->ny
|     |
|     +- C[x][y] = 0;             [condition: true]
|
+- Loop 0 <= x < sizes->nx
   |
   +- Loop 0 <= y < sizes->ny
      |
      +- C[x][y] += A[x] * B[y];  [condition: true]
      |
      +- C[x][y] = sqrt(C[x][y]); [condition: C[x][y] > 0]
      |
      +- printf("Some output");   [condition: verbose]

The tree is mutable by transformations, IR will be generated from it at the end.

Loop bounds and affine conditions can optionally also be expressed as
isl_sets (eg. { [x,y] : 0 <= x < nx and 0 <= y < ny }).

Statements have properties:
- The conditions under which they are executed
- Is it if-convertible?
- Can it be executed even if it would not be executed in the original
IR?
- If executed at least once, does executing it multiple times have no
additional effect?
- Hotness information from PGO
- #pragma annotations from clang
- Multi-thread effects (e.g. atomics/fences)
- ...


D) Access Modeling

Each load/store/... is associated with:
- The base pointer
- For each of the array's dimensions, the subscript
  - Either as SCEV
  - Or as affine function (isl_pw_aff)

Frontends might help to recover the multidimensional structure of an
array by using an intrinsic/builtin instead of linearizing array
subscripts (eg. row-major).



E) Dependency Analysis

A collection of dependencies between two statements, say S and T,
where S must be executed before T. Each dependency consists of:
- The type of dependency: Register, Control, Flow, Anti, Output
- Which instances depend on each other. This can be:
  - No instance information: all instances of S must be executed
before any instance of T.
  - A vector of dependence distances, basically MemoryDepChecker for
each surrounding loop (see the sketch after this list).
  - A polyhedral relational map (isl_map) describing which instances
of T depend on which instances of S.
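
For illustration, a sketch of the distance-vector representation:

/* Dependence distances, sketched: every iteration reads the value
   written two iterations earlier, a flow dependence with distance
   vector (2) for the single surrounding loop. */
void shift(double *A, long n) {
  for (long i = 2; i < n; i += 1)
    A[i] = A[i - 2] + 1.0;
}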

The dependency type "register" means that T uses an llvm::Value
defined by S. This can also be modeled using flow and
anti-dependencies, but as shown before this can be destructive to
transformation opportunities. Transformations need to specially
consider register dependencies and avoid requiring expansion.

The dependency type "control" models order induced by control
conditions. Eg. the data-dependent conditional 'C[x][y] > 0' can only
be evaluated after 'C[x][y]' has been written.



F) Assumption Tracking

A public interface that loop analyses and transformations can use to
register their assumptions, to be checked at runtime. Assumptions are
tied to specific loops, so they only need to be checked if the loop is
actually transformed.

Assumptions can also be queried; eg. knowing that a variable always
has a specific value may allow some simplifications. This knowledge
could also be fed from __builtin_assume and profiling data.
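
For instance (a sketch using Clang's __builtin_assume):

/* Sketch: an assumption fed from the source instead of a runtime
   check. Knowing that n is a positive multiple of 4 could allow a
   transformation to drop the epilogue of a 4-way vectorized loop. */
void scale(double *A, int n) {
  __builtin_assume(n > 0 && n % 4 == 0); /* no runtime check needed */
  for (int i = 0; i < n; i += 1)
    A[i] *= 2.0;
}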



G) Transformation

In the style of InstCombine, visit the tree and modify it (and only
the tree, not the IR).  Transformations can include:
- Add assumptions to make parts more transformable (e.g. assume
verbose==false to eliminate the call to printf)
- Combine single loops into bands
- Apply #pragmas to the loop/band they are associated with
- Vectorize loops (VPlan)
- Tiling (see the sketch after this list)
- (Partial) unrolling of loops
- Loop unswitching (specializing loops for specific inputs)
- Localize working sets of inner tiles. This can be a generalization
of Polly's matrix-multiplication optimization.
- Mixed-LP scheduling
- OpenMP-parallelize bands
- Extract accelerator kernels
- Detect loop idioms (memset, memcpy, zgemm, etc.) to call library
functions instead
- ...
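
A source-level sketch of the tiling transformation referenced above,
applied to the initialization loop nest of the running example (tile
size chosen arbitrarily):

#include <algorithm>

/* Tiling, sketched: the band (x, y) is split into tile loops and
   point loops; the flattened subscript stands in for C[x][y]. */
void init(double *C, long nx, long ny) {
  const long T = 32; /* tile size */
  for (long xt = 0; xt < nx; xt += T)
    for (long yt = 0; yt < ny; yt += T)
      for (long x = xt; x < std::min(xt + T, nx); x += 1)
        for (long y = yt; y < std::min(yt + T, ny); y += 1)
          C[x * ny + y] = 0;
}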

Transformations are responsible for not violating dependencies.

The current Polly would become one of these transformations. It finds
the largest loop body where all size parameters are invariant and
translates that subtree into an isl schedule tree. isl would transform
it and generate an isl_ast, which Polly translates back into a new
loop tree. Much of the work done by Polly today would be provided by
the framework.


Example transformation output:

+- memset(C, '\0', sizes->nx*sizes->ny*sizeof(C[0][0]));
|
+- Loop 0 <= x < sizes->nx [#pragma omp parallel for][condition: !verbose]
|  [assumption: C[0..sizes->nx-1][0..sizes->ny-1], A[0..sizes->nx-1],
|   B[0..sizes->ny-1] not aliasing]
|  |
|  +- Loop 0 <= yv < sizes->ny step 4
|     |
|     +- Loop yv <= y < yv+4 [vectorize VF=4]
|     |  |
|     |  +- C[x][y] += A[x] * B[y];
|     |
|     +- Loop 4*(sizes->ny/4) <= y < sizes->ny
|     |  |
|     |  +- C[x][y] += A[x] * B[y];
|     |
|     +- Loop 0 <= y < sizes->ny
|        |
|        +- C[x][y] = sqrt(C[x][y]); [condition: C[x][y] > 0]
|
+- Loop 0 <= x < sizes->nx [condition: verbose]
   |
   +- Loop 0 <= y < sizes->ny
      |
      +- C[x][y] += A[x] * B[y];  [condition: true]
      |
      +- C[x][y] = sqrt(C[x][y]); [condition: C[x][y] > 0]
      |
      +- printf("Some output");   [condition: true]


H) Code Generation

Determine which parts of the tree have been modified. Outermost parts
that have not been transformed do not need to be regenerated. The
transformed part of the tree is generated into a separate function.
Runtime checks for the collected assumptions are generated and,
depending on their result, either the optimized function or the
previously outlined original function is called.
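
A sketch of the emitted structure (all names are illustrative):

struct Args; /* the live-in values of the loop nest */
bool rtc(struct Args *args);            /* all collected assumptions */
void body_optimized(struct Args *args); /* from the transformed tree */
void body_original(struct Args *args);  /* the outlined original code */

void dispatch(struct Args *args) {
  if (rtc(args)) /* single combined runtime check */
    body_optimized(args);
  else
    body_original(args);
}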

The optimized function can be further optimized using LLVM's usual
passes, and the inliner decides according to its heuristics whether it
is worth inlining into the caller.


I) Pass Manager Integration

A monolithic loop transformation pass creates the loop tree and
provides the dependency analysis and the assumption tracker. It then
invokes the separate transformations, InstCombine-style. The final
tree is code-generated before control is passed back to the (old or
new) pass manager.






Design Decision Rationales
--------------------------

A) Preprocessing

Polly's policy at the moment is to not modify the IR until code
generation (so as not to interfere with the current pipeline). For
invariant load hoisting and for avoiding scalar dependencies, it only
virtually moves/copies instructions in its own data structures. This
adds complexity as the two representations get out of sync. For
instance, the DominatorTree does not reflect the new locations of
instructions, the CFG is modified by CodeGeneration for versioning,
PHI node incoming lists do not correspond to the predecessor blocks
anymore, instructions cannot be propagated across loop iterations as
we do not know which iteration they were originally in, etc. If
possible, I'd like to avoid this complexity. It approaches being an
alternative representation of the current IR, so why not use the IR
itself?
VPlan has its own IR-like representation consisting of ingredients and
recipes. For the vectorizer this might be justifiable because it does
not change control flow.

Some transformations cope well with scalar dependencies (eg.
vectorization, where there is a copy of each scalar anyway, and
thread-parallelization, which would have them privatized), but in the
polyhedral model, analysis takes place before transformation
decisions. However, today's polyhedral schedulers (those based on
Feautrier's or PLuTo's algorithms) have no special handling for scalar
dependencies. I am positive that we can have polyhedral schedulers
that handle scalars without getting worse results (by avoiding putting
value definitions into different loops than their uses, and by being
able to execute a definition multiple times), but this is research
where such a framework could be very useful. Most of today's
polyhedral compilers used by researchers are source-to-source, where
the problem is much less pronounced.

We might alternatively change the meaning of IR canonicalization in
LLVM: pre-loop canonicalization would try to create self-contained
blocks, while post-loop transformations do the hoisting and sinking.

To illustrate the issue with scalar dependencies:

for (int i = 0; i < N; i+=1) {
  double s = ...;
  ... = s;
}

Since in Polly s is modeled as a zero-dimensional array, there is just
one memory location for it: the current iteration must have completed
before s is overwritten by the next one. If virtual registers are not
modeled as having a memory location, a polyhedral scheduler might be
inclined to do this (because, eg. splitting the statements into
separate loops makes one of them vectorizable):

double s[N];
for (int i = 0; i < N; i+=1)
  s[i] = ...;
for (int i = 0; i < N; i+=1)
  ... = s[i];

that is, requiring an unbounded amount of additional memory. That's
probably not something we want.



B) Scope of Analysis

Modeling entire functions has the advantage that the analysis does
not have to choose a unit of transformation (Polly: SCoPs,
LoopVectorizer: a loop); the transformations decide themselves what
they are modifying.

I'd also like to unify the unit of modeling. Polly's SCoPs are based
on SESE regions; llvm::Region and llvm::Loop have no common hierarchy,
and regions and loops may partially overlap. It is possible to
represent multiple-exit loops in the polyhedral model.



C) Control-Flow Modeling

LoopInfo, Polly's schedule tree, VPlan's hierarchical CFG all use a
tree-like representation of control flow. I hope to find a common one.

Using execution conditions 'flattens' the loop body. Conditions can be
whether a previous statement has been executed or not (control
dependence) so that the original CFG can be reconstructed.

Affine conditions can be directly represented in the polyhedral
model; for others, 'always executing' is an overapproximation, and
whether a statement instance really executes is checked at runtime.
This allows more flexibility than Polly's SESE-subregion
macro-statement approach. Whether a condition is affine or not also
has no influence on other transformations.

VPlan also uses SESE in its CFG representation, but I think it can
also make use of this model.



D) Access Modeling

There is a common interface for both representations. Modeling as
affine functions is required for the polyhedral model and is more
powerful, but assumes arbitrary-precision integers, so additional
assumptions might need to be added.

The SCEV representation can (when linear/affine) be converted to an
isl_pw_aff when needed, or the isl_pw_aff can be computed directly, as
per Johannes' suggestion. However, I did not observe SCEV being the
limiting part in Polly, and it could be extended to support
conditional operators as well.

The builtin might not be useful for C arrays, where access is defined
as row-major and always linearized. But libraries could make use of
the builtin, for instance a Matrix class with overloaded operator:

double &operator()(int x, int y) {
#if __has_builtin(__builtin_subscript)
  return __builtin_subscript(data, x, y);
#else
  return data[x*n+y];
#endif
}

A special case are jagged arrays (pointers to pointers to arrays),
which are very common in preexisting code bases. To ensure
non-aliasing, we would need a pairwise aliasing check for every
subarray, which by itself has polynomial complexity. A
builtin/intrinsic that guarantees the non-aliasing of subarrays would
allow us to transform these cases as well. Given a common interface
for both cases, transformations work transparently on either
representation.
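
A sketch of the jagged-array case:

/* Jagged arrays, sketched: each row C[x] is itself a loaded pointer,
   so proving that no two rows alias requires a quadratic number of
   pairwise runtime checks; a builtin asserting that all subarrays
   are disjoint would avoid them. */
void scale(double **C, const double *A, long nx, long ny) {
  for (long x = 0; x < nx; x += 1)
    for (long y = 0; y < ny; y += 1)
      C[x][y] *= A[x];
}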



E) Dependency Analysis

Again, there is a common interface for all three kinds of dependency
representations, eg. "does statement instance S(x,y) depend on
instance T(x)?". Instances can, like memory subscripts, be described
as SCEVs or as affine functions over the surrounding loop counters.
Representations can also be converted to either type, possibly
overapproximating the dependencies.
The precision can be selected by compiler switches or the optimization
level. The goal is that any loop transformation (eg. the vectorizer)
can use any dependency precision.
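
A hypothetical sketch of such an interface (all names are
illustrative; this is not an existing LLVM API):

struct StmtInstance {
  /* a statement plus the values of its surrounding loop counters,
     described as SCEVs or as affine functions */
};

class DependenceInfo {
public:
  enum Kind { Register, Control, Flow, Anti, Output };
  /* May conservatively answer true when the chosen precision cannot
     prove independence. */
  virtual bool mayDepend(const StmtInstance &S,
                         const StmtInstance &T) const = 0;
  virtual ~DependenceInfo() {}
};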

For the polyhedral representation, we could introduce an intermediate
layer that allows plugging in other integer-LP solvers and returning
approximations of the true result. Dependencies can be arbitrarily
overapproximated; this inhibits transformations, but stays correct.
Alternative implementations might allow just rectangles or octahedra
[6] instead of general polyhedra, which do not have exponential
worst-case behaviour.

Polly at the moment knows neither register nor control dependencies.
I hope that adding these kinds of dependencies will allow developing
polyhedral schedulers that keep the definition and the uses of a
virtual register in the same loop.



F) Assumption Tracking

The purpose of a shared assumption tracker is that only a single
runtime check is needed after all transformations have been applied.



G) Transformation

Notice that polyhedral scheduling is just one of the possible
transformation passes. Even in Polly, some transformations (tiling,
vectorization, matrix-multiplication) are post-optimizations on the
schedule tree and can be implemented on other AST representations as
well. No knowledge of the polyhedral model is required to implement a
new transformation.

VPlan allows creating multiple plans at the same time and selecting
the best according to a heuristic. However, some transformations might
be easier to implement by modifying the underlying IR in the cloned
function, eg. removing conditionals that lead to exceptional cases
(the non-performance-critical printf in the example). Alternatively,
we could select the most profitable transformed AST according to a
cost function.



H) Code Generation

Generating code into a separate function has some advantages. First,
we control the environment: input base pointers can be annotated using
the noalias qualifier instead of metadata. Code generation is easier
because there are fewer special cases such as multiple exit edges or
llvm.lifetime.begin/end intrinsics. And register pressure is separated
between the presumably hot code and the rest of the function.
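
For instance (hypothetical names), the outlined function's parameters
can carry the aliasing guarantees directly:

/* Sketch: restrict-qualified parameters express the non-aliasing
   that the runtime check established, without metadata. */
void body_optimized(double *__restrict C, const double *__restrict A,
                    const double *__restrict B, long nx, long ny);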



I) Pass Manager Integration

The reason not to use LLVM's pass manager for the individual
transformations is that the loop tree carries state that is not
represented in the LLVM IR. The pass manager would be allowed to
invalidate the pass holding the tree and expect it to be recomputable
in the same way when instantiated again. This is not the case once the
tree has been transformed. Polly does it this way and needs all of its
transformation passes to preserve all analyses, to avoid making the
pass manager drop and recompute ScopInfo and ScopDetection.




[1] http://llvm.org/devmtg/2017-03/assets/slides/introducing_vplan_to_the_loop_vectorizer.pdf
[2] http://llvm.org/devmtg/2017-10/#talk17
[3] http://isl.gforge.inria.fr/
[4] http://pluto-compiler.sourceforge.net/
[5] http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[6] https://tel.archives-ouvertes.fr/tel-00818764

