[llvm-dev] Writing loop transformations on the right representation is more productive

Michael Kruse via llvm-dev llvm-dev at lists.llvm.org
Fri Jan 3 03:26:39 PST 2020


In the 2018 LLVM DevMtg [1], I presented some shortcomings of how LLVM
optimizes loops. In summary, the biggest issues are (a) the complexity
of writing a new loop optimization pass (including needing to deal
with a variety of low-level issues, a significant amount of required
boilerplate, the difficulty of analysis preservation, etc.), (b)
independent optimization heuristics and a fixed pass ordering and (c)
code explosion due to versioning. Currently, different people are
working on improving different loop optimization passes such as
LoopInterchange, LoopFuse and LoopVectorize. Even if LLVM had
'perfect' individual loop passes, they would still suffer from the
aforementioned problems because they remain independent, and the
system as a whole would still be suboptimal.

Instead of each loop pass being a major component, the heavy lifting
could be done by a framework of which each transformation itself is a
small part. In this RFC, I would like to work towards a consensus on
how such a framework could look. I already outlined a possible
solution in the same presentation [1] and a publication [7], which is
partially based on a previous RFC [8]. All feedback is welcome,
including a simple '+1'.

The central idea is to use a modifiable loop tree -- similar to
LoopInfo -- as the primary representation. LLVM-IR is converted to a
loop tree, then optimized and finally LLVM-IR is generated again for
subtrees that are considered profitable. This is not a new concept; it
has already been used in compilers such as IBM XL Fortran (called ASTI
[2]) and Silicon Graphics/Open64 (called LNO [10]), and in research
such as the Value State Dependence Graph [3].

Other features include the following:

1. Candidate selection through cost functions
2. Cheap copies using Red/Green Trees
3. Application of transformations from high-level to low-level
4. Represents loops and predicates instead of CFGs
5. Data and control dependencies in one graph
6. Late fallback versioning at IR regeneration
7. Standardized API for analyses with multiple implementations
8. Abstract representation of statements
9. Expansion of use-def chains to arrays when spanning loops

To elaborate on each of these:

1. Candidate selection through cost function
--------------------------------------------
Instead of needing to know which transformation is profitable before
applying it, create a copy of the data structure, modify it, and
compare it to the original. Moreover, holding multiple differently
optimized copies allows evaluating each variant using a cost function
and selecting the 'best' one when re-generating LLVM-IR (or re-using
the original LLVM-IR).

Otherwise, each transformation will have to implement its own
profitability heuristic but cannot know what other passes will do. For
instance, the motivation for LLVM's LoopDistribution is to enable
vectorization of loops that can be separated from other loops.
However, it does not know whether LoopVectorize will actually
vectorize the loop, or whether two vectorizable loops are better
vectorized together.

Instantiating every possible sequence of transformations of course is
not feasible, so the search space needs to be pruned. This could be
made dependent on the optimization level.


2. Cheap copies using Red/Green Trees
-------------------------------------
A red/green tree [4] is a data structure for cheap copies of DAGs.
Basically, a green tree node references only the node's children but
not its parent. A node in the red tree references its corresponding
green node and its parent (red) node. When modifying the represented
tree through copy-on-write, this allows reusing all green nodes except
the ones on the path from the modified node to the tree's root. The same node can
even be used multiple times in the same copy, which can be useful for
e.g. loop unrolling and code versioning. As a result, storing many
optimization variants of the same code only requires copies of the
changed parts.

By contrast, red nodes are created on-demand. Traversing the entire
tree should rarely be necessary (instead, use the summary information
in the green nodes), and when it is, it should be limited to subtrees
that are considered interesting for optimization.
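
For illustration, a minimal sketch of the two node kinds (the names and
members are hypothetical, not an existing LLVM API):

    // Immutable and shared; knows only its children, never its parent,
    // so it can appear in many trees (or several times in the same tree).
    struct GreenNode {
      std::vector<const GreenNode *> Children;
      // Summary information (side effects, dependencies, ...) is cached here.
    };

    // Created on demand; identifies one occurrence of a green node.
    struct RedNode {
      const GreenNode *Green;
      const RedNode *Parent;
    };

    // Copy-on-write: replacing one child creates new green nodes only on
    // the path from the change to the root; everything else is shared.
    const GreenNode *replaceChild(const GreenNode *N, unsigned Idx,
                                  const GreenNode *NewChild) {
      auto *Copy = new GreenNode(*N); // shallow copy; other children shared
      Copy->Children[Idx] = NewChild;
      return Copy; // the caller repeats this up to the tree's root
    }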


3. Apply transformations from high-level to low-level
-----------------------------------------------------
Optimizations should be applied from very specialized to very general
(potentially after some canonicalization). For instance, the first
step could be detecting common idioms such as gemm and replacing them
with either a BLAS function call or a well-studied optimization scheme
such as BLIS. After such an idiom has been detected, no other
transformation should be applied to it.

Mid-level transformations may try to map entire loop nests to cache-
and compute hierarchies (SIMT threads, multiprocessors, offloading,
etc) by applying transformations such as tiling, loop interchange and
array packing.

Low-level transformations commonly look at only one loop to improve
processor pipelining and predication: Unrolling, software pipelining,
etc.


4. Represents loops and predicates instead of CFGs
-------------------------------------------------
Control flow graphs make loop optimizations difficult and are also not
something a loop optimizer typically cares about. For instance, in the
example below, loop distribution would need to understand the
condition `c` and reconstruct the branch to fission

    for (int i = 0; i < n; i += 1)
      if (c) {
        foo(i);
        bar(i);
      }

into

    for (int i = 0; i < n; i += 1)
      if (c)
        foo(i);
    for (int i = 0; i < n; i += 1)
      if (c)
        bar(i);

In the loop tree, control flow is represented through predicates of
the statements.  In a first step, it is represented as

    for (int i = 0; i < n; i += 1) {
      bool __BB1_executed = 0;
      { if (c) foo(i); __BB1_executed = 1; }
      { if (__BB1_executed) bar(i); }
    }

The predicates (`c` and `__BB1_executed`) can be separated into a
"necessary condition" (`c` must be true before `foo` can be executed)
and a "sufficient condition" (if `c` is true, `foo` must be executed).
These can be different to allow speculative execution of the
statement. For instance, a necessary condition of `true` means that
the statement can always be executed speculatively.
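
As a sketch, a statement node could carry both conditions explicitly
(the names are made up for illustration):

    // Hypothetical predicate pair attached to each statement.
    struct StatementPredicates {
      Expr Necessary;   // the statement must not execute unless this holds
      Expr Sufficient;  // the statement must execute whenever this holds
    };
    // For `foo(i)` above: Necessary = c, Sufficient = c.
    // Relaxing Necessary to `true` permits executing foo(i) speculatively in
    // every iteration; Sufficient = c still guarantees it runs whenever the
    // original program executed it.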

The advantage is that control flow becomes a data-dependency problem.
It can trivially be converted back to its CFG representation because
`__BB1_executed` directly corresponds to the CFG edge to which it was
connected. Data flow optimizations such as value propagation can be
applied as well (here: `__BB1_executed == c`), such that we get

    for (int i = 0; i < n; i += 1) {
      { if (c) foo(i); }
      { if (c) bar(i); }
    }

This can be distributed trivially (assuming foo() and bar() do not
cause dependencies). Whether the copy propagation is beneficial can be
decided by a heuristic (considering whether the original flow can
still be reconstructed), or the variant can simply be discarded if the
cost function does not select a loop tree based on the copy propagation.

As a result of the predicated representation, the loop tree has a
regular pattern: sequence -> loop -> sequence -> loop -> ... . A loop
here is abstract in that it is something that is executed multiple
times. This can be a sequential loop, but also a collection of SPMD
processes, threads, vector lanes, etc.

At IR regeneration, we have multiple choices of how to lower the
predicates. When lowering to an LLVM-IR CFG, they may become
 * Branch instructions, reusing branch conditions from previous statements
 * Select instructions

If lowering to SIMD instructions, we can generate
 * Branch instructions for uniform predicates
 * Execution masks for varying predicates
 * A mix of both, since predicates can be uniform relative to
previously taken conditions

If lowering to NVPTX-style SIMT
 * @p per-instruction execution masks
 * Divergent branches; every statement can be made a re-convergence
point (by re-evaluating the condition), and re-convergence happens, at
the latest, at the end of the loop
 * Uniform predicates can be given the ".uni" suffix

When lowering to AMD-style SIMT
 * S_CBRANCH (explicitly managed execution mask)
 * S_CBRANCH_I/G_FORK (stack of execution masks)


5. Data and control dependencies in one graph
---------------------------------------------
Also as a result of the predicated form, control dependencies and
data dependencies have the same representation during dependency
analysis, like in a program dependence graph. In LLVM-IR, passes
such as AggressiveDeadCodeElimination use DominanceFrontier for this
purpose. With the loop tree, this would not be necessary.

When re-generating IR, control dependencies can be lowered either to
branching or to boolean variables. For the purpose of loop
optimizations, the choice of representation does not (and should not)
make a difference.
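
A sketch of how both kinds of dependencies could share one edge type
(hypothetical names):

    struct Dependence {
      enum Kind {
        RegisterFlow,  // use-def chain via registers
        MemoryFlow,    // read-after-write
        MemoryAnti,    // write-after-read
        MemoryOutput,  // write-after-write
        SideEffect,    // ordering with general side effects
        Control        // the statement's predicate depends on another statement
      } DepKind;
      const GreenStatement *Src;
      const GreenStatement *Dst;
    };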


6. Late fallback versioning at IR regeneration
------------------------------------------
When a transformation is applied, it can attach conditions (no
aliasing, no integer wrap, value restrictions, etc.) under which the
transformation is valid. During LLVM-IR generation, these conditions
are collected and emitted as run-time conditions. If the condition
fails, the original code is executed.

This avoids having every transformation emit its own fallback code.
The fallback should never execute, or only in cases that do not do
significant work (e.g. loop trip count is 0), such that optimizing the
fallback itself is not necessary. In other cases, the transformation
should implement proper loop versioning by adding the original
statement with a `!rtc` condition.
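
A rough sketch of this IR regeneration step, assuming hypothetical
helpers for collecting assumptions and emitting the run-time check:

    void emitWithFallback(const RedNode *Optimized, const RedNode *Original) {
      // Conjunction of all conditions attached by the applied transformations
      // (no aliasing, no integer wrap, value restrictions, ...).
      Assumptions A = collectAssumptions(Optimized);
      Value *RTC = emitRuntimeCheck(A);

      // if (RTC) { optimized version } else { original code }
      emitVersionedCode(RTC, Optimized, Original);
    }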


7. Standardized API for analyses with multiple implementations
--------------------------------------------------------------
Common analyses use an interface that can be fulfilled by different
implementations. Among these are:

 * Closed form expressions
   - None (expressions are always represented through dependencies
     between statements)
   - LLVM ScalarEvolution
   - PolyhedralValueAnalysis [6]

Closed form expressions can be used to remap values (typically array
subscripts, loop lower/upper bounds), simplify expression computations
(including predicates => is a predicate always true?), determine loop
trip counts, etc. Most importantly, they identify a statement
execution (instance) relative to the loop counters.


 * Array accesses / access range / provenance
   - LLVM AliasAnalysis
   - One-dimensional (always linearized)
   - Multi-dimensional, delinearized if necessary
   - From metadata (that a front-end emits for arrays of pointers to
     arrays of data)

Array access analysis determines whether two accesses at specific
indices can alias each other, as well as access ranges and connectivity.


 * Dependencies
   - none (cannot reorder statements)
   - flat (depends on last execution of another statement)
   - LLVM DependenceAnalysis
   - Polyhedral analysis

There is intentionally no "does X depend on Y?" API, as this would make
any analysis based on it quadratic in runtime. Instead, the analysis has
to list all statements another statement depends on, such that a graph
can be built and e.g. sorted topologically (see the interface sketch at
the end of this section).

Dependencies include: use-def chains via registers, memory
flow/anti/output, general side-effects and control dependencies.


 * Control flow
   - Extracted from original CFG
   - Evaluate statement conditions

Control flow analysis answers questions about (post-)dominance and
mutual exclusivity within the acyclic sequence/body of a loop.


 * Cost heuristic
   - TargetTransformInfo

Estimate the execution time of a (sub-)tree. Average trip counts from
profiling [11] could be useful.
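
As an illustration of the "same API, multiple implementations" idea, a
hedged sketch of the dependencies interface (hypothetical names, reusing
the `Dependence` struct sketched in item 5):

    class DependenceAnalysisInterface {
    public:
      virtual ~DependenceAnalysisInterface() = default;

      // Lists every dependence that ends at S. There is deliberately no
      // pairwise "does X depend on Y?" query, so clients build a dependence
      // graph in a linear sweep and e.g. sort it topologically.
      virtual void appendDependencies(const GreenStatement *S,
                                      std::vector<Dependence> &Out) = 0;
    };
    // Possible implementations: "none" (no reordering allowed), "flat",
    // a wrapper around LLVM's DependenceAnalysis, a polyhedral analysis.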


8. Abstract representation of statements
----------------------------------------
There is no direct relationship between LLVM instructions and loop DAG
statements. If possible, loop tree optimizations should only use the
properties of the statement node, not its contents. Typically, a
statement consists of multiple instructions that do not make sense to
split. For instance, assuming that %add is not used a second time, in
the example below

    %add = add i64 %arg, 2
    %mul = shl i64 %add, 1

the two instructions should always be computed together in the same
loop iteration. Combining instructions into statements reduces the
computational complexity of loop optimizations and might be controlled
by the optimization level (at lower optimization levels, more
instructions would be combined). A loop node is itself a statement in
its parent loop node's acyclic body.

Moreover, there is no reason to use only LLVM-IR to carry the
semantics. We could also use e.g. Machine-IR or MLIR operations.
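
A sketch of the statement abstraction itself (hypothetical names; the
contained IR could equally be Machine-IR or MLIR operations):

    struct GreenStatement {
      // Instructions that always execute together in the same loop
      // iteration, e.g. the %add/%mul pair above.
      std::vector<llvm::Instruction *> Insts;

      // Loop optimizations should query summary properties of the node
      // rather than looking at its contents, e.g.:
      bool isSimple() const;       // no atomics, exceptions, convergent calls
      bool hasSideEffects() const;
    };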


9. Expansion of use-def chains to arrays when spanning loops
------------------------------------------------------------
Suppose we would like to fission the following loop:

    for (int i = 0; i < n; i += 1) {
      { if(true) double a = f(i); }
      { if(true) g(a); }
    }

The problem is that the split cuts the use-def dependency over `a`.
However, the fission can still be done and results in:

    double A[n];
    for (int i = 0; i < n; i += 1)
      { if(true) A[i] = f(i); }
    for (int i = 0; i < n; i += 1)
      { if(true) g(A[i]); }

The transforming pass has to account for this in its profitability
model. The big advantage is that, in terms of correctness, use-def
chains do not manifest false dependencies.


Possible Transformations
------------------------
In principle, any transformation could be applied on this
representation, but instruction-centric transformations such as
InstCombine are less suitable. Obviously, loop-centric transformations
are a better fit. Ideas for loop transformations include:

 * Unroll(-and-jam)
 * Peeling
 * Iteration range split, concatenation
 * Interchange
 * Loop fusion and distribution(/fission)
 * Tiling/strip-mining/blocking
 * Code versioning (including loop unswitching)
 * Maybe vectorization
 * Accelerator offloading
 * Parallelization
 * Space-filling curve iteration [12]
 * Array packing (e.g. for scratch pads), transposition
 * Statement reordering
 * SIMT-ization
 * Detect and optimize reductions
 * Loop skewing/wavefronting
 * Loop reversal
 * Solver-driven (polyhedral) optimization
 * Loop idiom recognition


Optimization Pipeline
---------------------
The optimization pipeline for the hierarchy DAG is straightforward:

    void optimizeLoopNest() {
      RedOriginal = generateLoopHierarchy();
      List = createCandidates(RedOriginal);
      sort(List, [](Cand1, Cand2) {
        return estimateExecutionTime(Cand1) < estimateExecutionTime(Cand2);
      });
      if (List[0] != RedOriginal)
        emit(List[0]);
    }

Subtree exploration is done recursively, optimizing inner nodes which
can be (re-)used when optimizing outer nodes.

The current node to optimize is compared against a selection of known
heuristics, and each matching transformation is added to the list of
candidates. That is, an optimization routine resembles the following,
with more specific heuristics checked before more general ones.

void exploreOptimizations(Root, Worklist) {
    Matches = 0;

    // High level: user annotations
    if (Root.hasHint()) {
      // Hint is mandatory: do not explore any other possibility
      Worklist.replace(Root, applyHint(Root));
      return;
    }

    // High level: Idiom recognition
    if (isBLAS(Root)) {
      Optimized = LinkAgainstBLASLibrary ? convertToBlasLibCall(Root)
                                         : applyBLIS(Root);
      Worklist.replace(Root, Optimized);
      return;
    }

    // Mid level: recognize stencils
    if (isStencil(Root))
      ...

    // Mid level: Map to compute hierarchy (here: simd -> SMT -> core
    // -> socket for CPUs)
    if (isParallel(Root) && fitsWorkingSet(Root.getBody(), LLCSize) &&
        isParallel(Root.getPerfectlyNestedLoop(1)) &&
        fitsWorkingSet(Root.getPerfectlyNestedLoop(2).getBody(), L2Size) &&
        isParallel(Root.getPerfectlyNestedLoop(2)) &&
        fitsWorkingSet(Root.getPerfectlyNestedLoop(3).getBody(), L1Size) &&
        isVectorizable(Root.getPerfectlyNestedLoop(3))) {
      Optimized = applyVectorization(Root.getPerfectlyNestedLoop(3));
      Optimized = applyParallel(Optimized.getParent(), /*places=*/SMT);
      Optimized = applyParallel(Optimized.getParent(), /*places=*/Cores);
      Optimized = applyParallel(Optimized.getParent(), /*places=*/Sockets);

      // Still explore CPU pipeline optimizations of the innermost loop
      Worklist.insert(Root, Optimized);
      Matches += 1;
    }

    if (Matches >= Threshold)
      return;

    // Low level: optimize working set for L1 cache size using tiling
    Band = maximumTilableNest(Root);
    WorkingSetSizePerIteration = estimateWorkingSet(Band.back().getBody());
    TileSize = floor(nthroot(L1CacheSize / WorkingSetSizePerIteration,
                             Band.size()));
    if (TileSize > 1) {
      Worklist.insert(applyTiling(Root, TileSize.splat(Band.size())));
      Matches += 1;
    }

    if (Matches >= Threshold)
      return;

    // Low level: vectorization for each SIMD level (potentially
    // carried-out by VPlan)
    foreach (Width : promisingSimdWidths(Root)) {
      Vectorized = applyVectorization(Root, Width);
      if (Vectorized.isSatisfiable()) {
        Worklist.insert(Vectorized);
        Matches += 1;
      }
    }

    if (Matches >= Threshold)
      return;

    // Low level: rely on heuristic to find best unrolling factor
    if (Factor = unrollFactorHeuristic(Root))  {
      Worklist.insert(applyUnroll(Root, Factor).getRoot());
      Matches += 1;
    }

    if (Matches >= Threshold)
      return;

    ...
}


The "levels" encode precedence optimizations. Depending on the
optimization level, we might only apply the most promising
optimization.

Instead of this brute-force implementation, for predefined
optimization levels the search space exploration should be guided:
encode the order of the most useful transformations in the search
space exploration. For instance, find all idioms first, then apply
memory hierarchy optimizations, then compute hierarchy, then CPU
pipeline optimizations. Since this is very transformation-specific, I
do not expect to write a 'pass manager' that transformations can
register themselves with (except for exhaustive searches when
explicitly enabled). Instead, I'd write this in code that classifies
loop nests (e.g. compute bound, bandwidth bound, known idioms, tensor
contractions, stencils, etc.) and chooses appropriate optimization
paths. We can, of course, make this more generic over time.



Writing an optimization pass
-----------------------------

Optimizations themselves are written in the following steps:
1. Check preconditions
2. Quick profitability analysis to skip obviously bad transformations
3. Create new green nodes to represent the changed loop nest; reuse
unchanged green nodes as much as possible
4. Create a red node to insert into the parent
5. Preserve analyses (most analyses should be able to reuse information
stored in reused green nodes and/or lazily re-evaluate new nodes)
6. Check whether the transformation was valid

In particular, being able to analyze correctness after the
transformation has been applied could significantly reduce the amount
of code needed to verify a transformation's correctness.

A transformation like loop fusion could look like:

void applyFuse(Red1, Red2) {
  // Assume Red1 and Red2 are consecutive sibling loops

  // Bail out on atypical things (atomic accesses, exceptions,
  // convergent instructions, ...)
  if (!Red1->isSimple() || !Red2->isSimple())
    return;

  // Consistency checks (can be less strict by moving conditions
  // into the fused body).
  if (Red1->getIterationCount() != Red2->getIterationCount())
    return;
  if (Red1->getPredicate() != Red2->getPredicate())
    return;

  // Create new loop hierarchy with fused loops.
  GreenFused = new GreenLoop({ Red1->getGreen(), Red2->getGreen() },
                             Red1->getIterationCount());
  RedFused = new RedNode(GreenFused, Red1->getParent());
  NewRoot = RedFused->getRoot();

  // Preserve analyses.
  // In this case not really necessary since the green nodes of the
  // loop bodies are reused.
  DependenceAnalysis->transferAnalysis(
      Red1->getBody(), RedFused->getBody()[0],
      ClosedExprAnalysis->makeIdentityMapping(Red1->getIterator(),
                                              RedFused->getIterator()));
  DependenceAnalysis->transferAnalysis(
      Red2->getBody(), RedFused->getBody()[1],
      ClosedExprAnalysis->makeIdentityMapping(Red2->getIterator(),
                                              RedFused->getIterator()));

  ValidityCondition = DependenceAnalysis->isValid(NewRoot);
  if (ValidityCondition.isSatisfiable()) {
    RedFused->addAssumption(ValidityCondition);
    Candidates.push_back(NewRoot);
  }
}



Questions & Answers
-------------------

Q: Is the loop hierarchy a tree or DAG?
A: Both. A green node can be reused multiple times in the same
structure (e.g. due to unrolling, versioning, peeling, etc.), but a
red node is specific to the position in its parent; there can be
multiple red nodes for the same green node.


Q: No SSA?
A: Since the location of a definition depends on the parent in a green
DAG, possibly from different loop hierarchies after transformations,
it is not possible to directly reference the definition from its use.
It is possible in the red tree, but that runs against the goal of
avoiding heavy red nodes. Instead, to find the previous definition,
one has to traverse the green nodes upwards to find at which sibling
level the value is defined. However, this should not be a common
query; loop optimizations should use dependency information instead.


Q: Loops with multiple exits?
A: If every exiting edge assigns a unique number to its __???_executed
variable, it can be handled like a switch.


Q: Unstructured control flow?
A: Unstructured control flow only introduces disjunctions into
statement predicates. That is, the predicate

    a or b

corresponds to block BB2 in the CFG

    BB0:
      br i1 %a, label %BB2, label %BB1

    BB1:
      br i1 %b, label %BB2, label %BB3

    BB2:

In structured control flow, BB0 and/or BB1 would be post-dominated by
BB2 and the disjunction would disappear.


Q: Irreducible loops?
A: Can be folded into the first iteration of the loop by including a
condition for statements not executed during the first iteration.


Q: The unreachable terminator?
A: If due to a non-returning function call, this introduces a loop
exit (which also introduces a control dependency on everything that
follows) to the surrounding loops. If due to an assumption, we
may add the branching condition to the sufficient condition, but not
to the necessary condition.


Q: Infinite loops?
A: The closed-form expressions identifying statement instances and the
loop trip count have no upper bound. Either use arbitrary-range
arithmetic, use relative dependence vectors and/or fall back on
pessimistic assumptions. In addition, as with the unreachable
terminator, this may be handled as another loop exit.


Q: IndirectBr?
A: Since the list of possible targets is part of the instruction, it
can be handled like a switch over the target address.


Q: Loop guards?
A: Loop guards manifest as the predicate of the loop statement. A loop
whose predicate evaluates to false can
be considered a loop with 0 iterations. The IR generator can also emit
the predicate as a loop guard. It might be useful to ensure that if
the predicate evaluates to true, the loop has at least one iteration
and the loop-condition is not evaluated for the first iteration (like
LoopRotation).


Q: Exceptions?
A: Could also be handled with control dependencies, but I would
instead mark the surrounding loops with "contains unpredictable
control flow" to be ignored by most transformations. Similar to
`!isSimple()` of LoadInst and StoreInst.


Q: Relation to the new/legacy pass manager?
A: LLVM's pass managers are unfortunately not designed to apply to
subtrees nor persistent data structures besides the LLVM-IR itself.
Instead, the loop tree optimizer would be its own monolithic pass on
the pass manager level (like MachinePassManager and VPlan). My idea is
to add it somewhere before LoopVectorize, but after the inliner,
potentially replacing most other loop transformations. There should be
just one instance of the pass in the pipeline since creating an
alternative representation of LLVM-IR is not that cheap. It does not
need IR normalization (LoopSimplify, LICM, LoopRotate, LCSSA); these
can be performed on the loop tree and emitted into the LLVM-IR if a
non-modified tree is chosen. Possibly some of these passes can be
removed.
In order to preserve user-directed transformations (such as "#pragma
clang loop distribute"), it may also be interesting to put it earlier
in the pipeline to avoid other passes such as LoopFullUnroll changing
the loop before the transformation can be applied. Alternatively, such
passes can be taught to check for the presence of a user-directed
transformation before applying their own transformation.


Q: Relation to LoopVectorize/VPlan?
A: VPlan has similar design goals [9] but is intended for
vectorization only. It also has a hierarchical data structure, but
uses a CFG for acyclic control flow. It is also intended to have
multiple variants of the same source IR from which the best is
selected via a cost function. However, it lacks cheap copies. Instead
of instructions, it uses recipes/"meta-instructions" that handle what
happens to instructions after vectorization, e.g. do that operation on
each vector lane ("WIDEN"). VPlan is more oriented towards modifying
instructions rather than statements as collections of instructions. With
the loop hierarchy, I do not plan to redo the work of VPlan, but also
do not see a reason why it should not also be able to do
vectorization.


Q: Relation to Polly?
A: Polly uses ISL's schedule tree as the representation of the loop
hierarchy. Its creation may fail if something not representable in a
schedule tree is found. Polly tries to find the maximum representable
region, but once it is chosen, it cannot be changed (e.g. because
optimizing it takes too long). The schedule tree is stored internally
by ISL as a kind of green tree, but is walked over by an iterator that
remembers the path to the root. While this could be used to select a
best schedule tree using a cost function, Polly relies mostly on ISL's
reschedule algorithm to find a profitable transformation.


Q: Relation to MLIR?
A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
instance, it also does not feature cheap copies. An advantage is that
loops and multi-dimensional arrays can be represented in the language
without needing to be rediscovered, but they have to be inserted by a
front-end. That is, if Clang were generating MLIR, loops and arrays
would still have to be rediscovered. However, a loop hierarchy
optimizer could be applied to MLIR just as well as to LLVM-IR.


References
----------
[1] http://llvm.org/devmtg/2018-10/talk-abstracts.html#talk11
[2] https://ieeexplore.ieee.org/abstract/document/5389392/
[3] http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf
[4] https://blogs.msdn.microsoft.com/ericlippert/2012/06/08/persistence-facades-and-roslyns-red-green-trees/
[5] https://github.com/Meinersbur/llvm-project/tree/LOF/llvm/lib/LOF
[6] https://llvm.org/devmtg/2017-10/#src2
[7] https://arxiv.org/abs/1811.00632
[8] https://lists.llvm.org/pipermail/llvm-dev/2017-October/118125.html
[9] https://llvm.org/docs/Proposals/VectorizationPlan.html
[10] https://github.com/open64-compiler/open64/tree/master/osprey/be/lno
[11] https://reviews.llvm.org/D70688
[12] https://arxiv.org/abs/1801.07399

