<div dir="ltr"><div>Hi Michael,</div><div><br></div><div>Although the approach to use a higher order in-memory abstraction like the loop tree will make it easier than what you have today, if you used MLIR for this representation, you already get a round trippable textual format that is *very close* to your form. The affine.for/if, std.for/if in MLIR are nearly isomorphic to the tree representation you want, and as such, this drastically reduces the delta between the in-memory data structures your passes want to operate on and what you see when you print the IR. Normally, there'd be resistance to building a textual form / introducing a first class concept in the IR for what you are proposing, but since this was already done for MLIR, it looks like it would be a big win from a compiler developers' productivity standpoint if you just used MLIR for this loop tree representation. With regard to the FAQ, I can't tell whether you meant something else or missed the representation used in MLIR for the affine dialect or in general for "ops with regions".</div><div><br></div><div>> Q: Relation to MLIR?<br>> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For<br></div><div><br></div><div>It's completely the opposite unless you are looking only at MLIR's std dialect! The affine dialect as well as the std.for/if (currently misnamed as loop.for/loop.if) are actually a loop tree. The affine ops are just an affine loop AST isomorphic to the materialization of polyhedral domains/schedules via code generation. Every IST AST or the output of polyhedral code generation can be represented in the affine dialect and vice versa. MLIR's loop/if ops are a hierarchy rather than a flat form / list of blocks CFG. </div><div><br></div><div>> still have to be rediscovered. However, a loop hierarchy optimizer<br>> could be applied to MLIR just as well as to LLVM-IR.<br></div><div><br></div><div>This is true, but it's easier to apply it to MLIR because the actual IR is close by miles to the in-memory thing your loop hierarchy optimizer would be using. For eg., here's the input IR and the output IR of a simple outer loop vectorization performed in MLIR:</div><div><a href="https://github.com/bondhugula/llvm-project/blob/hop/mlir/test/Transforms/vectorize.mlir#L23">https://github.com/bondhugula/llvm-project/blob/hop/mlir/test/Transforms/vectorize.mlir#L23</a><br></div><div><br></div><div>~ Uday</div><div><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 3 Jan 2020 at 16:57, Michael Kruse via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">In the 2018 LLVM DevMtg [1], I presented some shortcomings of how LLVM<br>
optimizes loops. In summary, the biggest issues are (a) the complexity<br>
of writing a new loop optimization pass (including needing to deal<br>
with a variety of low-level issues, a significant amount of required<br>
boilerplate, the difficulty of analysis preservation, etc.), (b)<br>
independent optimization heuristics and a fixed pass ordering and (c)<br>
code explosion due to versioning. Currently, different people are<br>
working on improving different loop optimization passes such as<br>
LoopInterchange, LoopFuse and LoopVectorize. Even if LLVM had
'perfect' individual loop passes, they would still suffer from the
aforementioned problems because they remain independent, and the
system as a whole would remain suboptimal.
<br>
Instead of each loop pass being a major component, the heavy lifting<br>
could be done by a framework of which each transformation itself is a<br>
small part. In this RFC, I would like to work towards a consensus on<br>
how such a framework could look. I already outlined a possible<br>
solution in the same presentation [1] and a publication [7], which is<br>
partially based on a previous RFC [8]. All feedback is welcome,<br>
including a simple '+1'.<br>
<br>
The central idea is to use a modifiable loop tree -- similar to<br>
LoopInfo -- as the primary representation. LLVM-IR is converted to a<br>
loop tree, then optimized and finally LLVM-IR is generated again for<br>
subtrees that are considered profitable. This is not a new concept; it
has already been used in compilers such as IBM XL Fortran (called ASTI
[2]) and Silicon Graphics/Open64 (called LNO [10]), and in research
such as the Value State Dependence Graph [3].<br>
<br>
Other features include the following:<br>
<br>
1. Candidate selection through cost functions<br>
2. Cheap copies using Red/Green Trees<br>
3. Application of transformations from high-level to low-level<br>
4. Represents loops and predicates instead of CFGs<br>
5. Data and control dependencies in one graph<br>
6. Late fallback versioning at IR regeneration<br>
7. Standardized API for analyses with multiple implementations<br>
8. Abstract representation of statements<br>
9. Expansion of use-def chains to arrays when spanning loops<br>
<br>
To elaborate on each of these:<br>
<br>
1. Candidate selection through cost function<br>
--------------------------------------------<br>
Instead of needing to know which transformation is profitable before<br>
applying it, create a copy of the data structure, modify it, and<br>
compare it to the original. Moreover, holding multiple differently
optimized copies allows evaluating each variant using a cost function
and selecting the 'best' one when re-generating LLVM-IR (or re-using
the original LLVM-IR).
<br>
Otherwise, each transformation will have to implement its own<br>
profitability heuristic but cannot know what other passes will do. For<br>
instance, the motivation for LLVM's LoopDistribution is to enable<br>
vectorization of loops that can be separated from other loops.<br>
However, it does not know whether LoopVectorize will actually<br>
vectorize the loop, or whether two vectorizable loops are better<br>
vectorized together.<br>
<br>
Instantiating every possible sequence of transformations of course is<br>
not feasible, so the search space needs to be pruned. This could be<br>
made dependent on the optimization level.<br>
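
For illustration, pruning could be as simple as keeping only the k
cheapest variants after each exploration round, with k depending on the
optimization level. A minimal sketch (all names hypothetical, not part
of the proposal):

#include <algorithm>
#include <vector>

// Sketch only: keep the K candidates with the lowest estimated
// execution time; K could grow with the optimization level.
template <typename Candidate, typename CostFn>
void pruneCandidates(std::vector<Candidate> &Candidates, CostFn EstimateCost,
                     unsigned K) {
  if (Candidates.size() <= K)
    return;
  std::partial_sort(Candidates.begin(), Candidates.begin() + K,
                    Candidates.end(),
                    [&](const Candidate &A, const Candidate &B) {
                      return EstimateCost(A) < EstimateCost(B);
                    });
  Candidates.resize(K);
}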
<br>
<br>
2. Cheap copies using Red/Green Trees<br>
-------------------------------------<br>
A red/green tree [4] is a data structure for cheap copies of DAGs.
Basically, a green tree node references only the node's children but
not the parent. A node from the red tree references its corresponding
green node and its parent (red) node. When a node is modified through
copy-on-write, this allows reusing all green nodes except the ones on
the path from the modified node to the tree's root. The same node can
even be used multiple times in the same copy, which can be useful for
e.g. loop unrolling and code versioning. As a result, storing many
optimization variants of the same code only requires copies of the
changed parts.
<br>
By contrast, red nodes are created on demand. Traversing the entire
tree should rarely be necessary (instead, use summary information
stored in the green nodes), and when it is, it can be limited to
subtrees that are considered interesting for optimization.
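
As a rough sketch of the two node kinds (names made up for
illustration, not taken from the prototype [5]):

#include <memory>
#include <vector>

// Green nodes are immutable and shared between loop tree variants;
// they know their children but not their parent.
struct GreenNode {
  std::vector<std::shared_ptr<const GreenNode>> Children;
  // ... summary information (side effects, trip counts, ...) cached here
};

// Red nodes are created on demand and add the position context: which
// green node, under which (red) parent.
struct RedNode {
  std::shared_ptr<const GreenNode> Green;
  const RedNode *Parent = nullptr;
  size_t IndexInParent = 0;
};

// Copy-on-write: replacing one child clones only the spine from the
// modified node to the root; all other green subtrees are shared.
std::shared_ptr<const GreenNode>
replaceChild(const std::shared_ptr<const GreenNode> &Node, size_t Idx,
             std::shared_ptr<const GreenNode> NewChild) {
  auto Clone = std::make_shared<GreenNode>(*Node);
  Clone->Children[Idx] = std::move(NewChild);
  return Clone;
}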
<br>
<br>
3. Apply transformations from high-level to low-level<br>
-----------------------------------------------------<br>
Optimizations should be applied from very specialized to very general
(potentially after some canonicalization). For instance, the first
step could be detecting common idioms such as gemm and either
replacing them with a BLAS function call or applying well-studied
optimization schemes such as the BLIS approach to them. After such an
idiom has been detected, no other transformation should be applied to it.
<br>
Mid-level transformations may try to map entire loop nests to cache-<br>
and compute hierarchies (SIMT threads, multiprocessors, offloading,<br>
etc) by applying transformations such as tiling, loop interchange and<br>
array packing.<br>
<br>
Low-level transformations commonly look at only one loop to improve<br>
processor pipelining and predication: Unrolling, software pipelining,<br>
etc.<br>
<br>
<br>
4. Represents loops and predicates instead of CFGs<br>
-------------------------------------------------<br>
Control flow graphs make loop optimizations difficult, but are also
not something a loop optimizer typically cares about. For instance, in
the example below, loop distribution would need to understand the
condition `c` and reconstruct the branch in order to fission

for (int i = 0; i < n; i += 1)
  if (c) {
    foo(i);
    bar(i);
  }
<br>
into<br>
<br>
for (int i = 0; i < n; i += 1)
  if (c)
    foo(i);
for (int i = 0; i < n; i += 1)
  if (c)
    bar(i);
<br>
In the loop tree, control flow is represented through predicates of<br>
the statements. In a first step, it is represented as<br>
<br>
for (int i = 0; i < n; i += 1) {
  bool __BB1_executed = false;
  { if (c) { foo(i); __BB1_executed = true; } }
  { if (__BB1_executed) bar(i); }
}
<br>
The predicates (`c` and `__BB1_executed`) can be separated into a<br>
"necessary condition" (`c` must be true before `foo` can be executed)<br>
and a "sufficient condition" (if `c` is true, `foo` must be executed).<br>
These can be different to allow speculative execution of the<br>
statement. For instance, a necessary condition of `true` means that<br>
the statement can always be executed speculatively.<br>
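
A minimal sketch of how a statement's execution conditions might be
modeled (hypothetical names and a placeholder expression type):

// Placeholder for whatever expression type the loop tree uses for
// predicates.
struct Expr {
  bool IsConstantTrue = false;
};

struct ExecutionPredicate {
  Expr Necessary;  // must hold before the statement may execute
  Expr Sufficient; // if it holds, the statement must execute
  // A necessary condition that is constant `true` means the statement
  // can be executed speculatively in every iteration.
  bool isSpeculatable() const { return Necessary.IsConstantTrue; }
};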
<br>
The advantage is that control flow becomes a data-dependency problem.<br>
It can trivially be converted back to its CFG representation because<br>
`__BB1_executed` directly corresponds to the CFG edge to which it was<br>
connected. Data flow optimizations such as value propagation can be<br>
applied as well (here: `__BB1_executed == c`), such that we get<br>
<br>
for (int i = 0; i < n; i += 1) {
  { if (c) foo(i); }
  { if (c) bar(i); }
}
<br>
This can then be distributed trivially (assuming foo() and bar() do not
cause dependencies). Whether the copy propagation is beneficial can be
decided by a heuristic (considering whether the original flow can
still be reconstructed), or the variant is just thrown away if the
cost function does not select a loop tree based on the copy
propagation.
<br>
As a result of the predicated representation, the loop tree has a
regular pattern: sequence -> loop -> sequence -> loop -> ... A loop
here is abstract in that it is something that is executed multiple<br>
times. This can be a sequential loop, but also a collection of SPMD<br>
processes, threads, vector lanes, etc.<br>
<br>
At IR regeneration, we have multiple choices of how to lower<br>
conditions. If we lower to LLVM-IR CFG, it may lower to<br>
* Branch instructions, reusing branch conditions from previous statements<br>
* Select instructions<br>
<br>
If lowering to SIMD instructions, we can generate<br>
* Branch instructions for uniform predicates<br>
* Execution masks for varying predicates<br>
* A mix of both, since predicates can be uniform relative to
  previously taken conditions
<br>
If lowering to NVPTX-style SIMT<br>
* @p per-instruction execution masks<br>
* Divergent branches; every statement can be made a re-convergence
  point (by re-evaluating the condition), but re-convergence will
  happen, at the latest, at the end of the loop
* Uniform predicates can be given the ".uni" suffix<br>
<br>
When lowering to AMD-style SIMT<br>
* S_CBRANCH (explicitly managed execution mask)<br>
* S_CBRANCH_I/G_FORK (stack of execution masks)<br>
<br>
<br>
5. Data and control dependencies in one graph<br>
---------------------------------------------<br>
Also as a result of the predicated form, control dependencies and data
dependencies have the same representation during dependency analysis,
as in a program dependence graph. In LLVM-IR, passes such as
AggressiveDeadCodeElimination use DominanceFrontier for this purpose.
This would not be necessary here.
<br>
When re-generating IR, control dependencies can be lowered either to
branching or to boolean variables. For the purpose of loop
optimizations, the choice of representation does not (and should not)
make a difference.
<br>
<br>
6. Late fallback versioning at IR regeneration<br>
------------------------------------------<br>
When a transformation is applied, it can attach conditions (no<br>
aliasing, no integer wrap, value restrictions, etc.) under which the<br>
transformation is valid. During LLVM-IR generation, these conditions<br>
are collected and emitted as run-time conditions. If the condition<br>
fails, the original code is executed.<br>
<br>
This avoids having every transformation emit its own fallback code.<br>
The fallback should never execute, or only in cases that do not do<br>
significant work (e.g. loop trip count is 0), such that optimizing the<br>
fallback itself is not necessary. In other cases, the transformation<br>
should implement proper loop versioning by adding the original<br>
statement with a `!rtc` condition.<br>
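
Conceptually, the regenerated IR would then have the following shape
(sketch):

if (rtc) {   // run-time check combining all collected conditions:
             // no aliasing, no integer wrap, value restrictions, ...
  // optimized loop nest
} else {
  // original code, kept as fallback (expected to never or rarely execute)
}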
<br>
<br>
7. Standardized API for analyses with multiple implementations<br>
--------------------------------------------------------------<br>
Common analyses use an interface that can be fulfilled by different<br>
implementations. Among these are:<br>
<br>
* Closed form expressions<br>
- None (expressions are always represented through dependencies<br>
between statements)<br>
- LLVM ScalarEvolution<br>
- PolyhedralValueAnalysis [6]<br>
<br>
Closed form expressions can be used to remap values (typically array<br>
subscripts, loop lower/upper bounds), simplify expression computations<br>
(including predicates => is a predicate always true?), determine loop<br>
trip counts, etc. Most importantly, it identifies a statement<br>
execution (instance) relative to the loop counters.<br>
<br>
<br>
* Array accesses / access range / provenance<br>
- LLVM AliasAnalysis<br>
- One-dimensional (always linearized)<br>
- Multi-dimensional, delinearized if necessary<br>
- From metadata (that a front-end emits for arrays of pointers to<br>
arrays of data)<br>
<br>
Array access analysis determines whether two accesses at a specific<br>
index can alias each other, access ranges and connectivity.<br>
<br>
<br>
* Dependencies<br>
- none (cannot reorder statements)<br>
- flat (depends on last execution of another statement)<br>
- LLVM DependenceAnalysis<br>
- Polyhedral analysis<br>
<br>
There is intentionally no "does X depend on Y?" API as this would make<br>
any analysis based on it quadratic in runtime. Instead, it has to list<br>
all statements another statement depends on such that a graph can be<br>
built and e.g. sorted topologically.<br>
<br>
Dependencies include: use-def chains via registers, memory<br>
flow/anti/output, general side-effects and control dependencies.<br>
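
A sketch of such an interface (hypothetical names); note that it
enumerates the dependencies of a statement instead of offering a
pairwise query:

#include <functional>

struct Statement;

// One dependence edge: Src must execute before Dst.
struct Dependence {
  const Statement *Src;
  const Statement *Dst;
  enum Kind { Register, Flow, Anti, Output, SideEffect, Control } DepKind;
};

// Implemented by the "none", "flat", LLVM DependenceAnalysis and
// polyhedral back-ends; clients build a graph from the listed edges
// and e.g. sort it topologically.
struct DependenceInfo {
  virtual ~DependenceInfo() = default;
  virtual void forEachDependence(
      const Statement &Dst,
      const std::function<void(const Dependence &)> &Callback) const = 0;
};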
<br>
<br>
* Control flow<br>
- Extracted from original CFG<br>
- Evaluate statement conditions<br>
<br>
Control flow analysis answers questions about (post-)dominance and<br>
mutual exclusivity within the acyclic sequence/body of a loop.<br>
<br>
<br>
* Cost heuristic<br>
- TargetTransformInfo<br>
<br>
Estimate the execution time of a (sub-)tree. Average trip counts from<br>
profiling [11] could be useful.<br>
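
For illustration, a possible recursive estimate, assuming (purely as an
illustration) that each green node caches a per-iteration cost (e.g.
derived from TargetTransformInfo) and an expected trip count (e.g. from
profiling):

// Sketch: the cost of a subtree is its own per-iteration cost plus the
// cost of its children, weighted by their expected trip counts.
double estimateExecutionTime(const GreenNode &Node) {
  double Cost = Node.SelfCostPerIteration;
  for (const auto &Child : Node.Children)
    Cost += Child->ExpectedTripCount * estimateExecutionTime(*Child);
  return Cost;
}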
<br>
<br>
8. Abstract representation of statements<br>
----------------------------------------<br>
There is no direct relationship between LLVM instructions and loop DAG<br>
statements. If possible, loop tree optimizations should only use the<br>
properties of the statement node, not its contents. Typically, a<br>
statement consists of multiple instructions that do not make sense to<br>
split. For instance, assuming that %add is not used a second time, in<br>
the example below<br>
<br>
%add = add i64 %arg, 2<br>
%mul = shl i64 %add, 1<br>
<br>
the two instructions should always be computed together in the same<br>
loop iteration. Combining instructions to statements reduces the<br>
computational complexity of loop optimizations and might be controlled<br>
by the optimization level (at lower optimization levels, more
instructions would be combined). A loop node is itself a statement in
its parent loop node's acyclic body.
<br>
Moreover, there is no reason to use only LLVM-IR to carry the<br>
semantics. We could also use e.g. Machine-IR or MLIR operations.<br>
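
A sketch of what the statement abstraction could look like
(hypothetical; the instruction handles could point to LLVM-IR,
Machine-IR or MLIR operations):

#include <vector>

// Loop tree passes should query only the summary properties, not the
// individual instructions bundled in a statement.
struct Statement {
  std::vector<void *> Instructions; // opaque handles to the underlying IR
  bool HasSideEffects = false;
  bool IsSpeculatable = false;
  bool ReadsMemory = false;
  bool WritesMemory = false;
};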
<br>
<br>
9. Expansion of use-def chains to arrays when spanning loops<br>
------------------------------------------------------------<br>
Suppose we would like to fission the following loop:
<br>
for (int i = 0; i < n; i += 1) {
  { if (true) double a = f(i); }
  { if (true) g(a); }
}
<br>
The problem is that the split cuts the use-def dependency over `a`.<br>
However, the fission can still be done and results in:<br>
<br>
double A[n];
for (int i = 0; i < n; i += 1)
  { if (true) A[i] = f(i); }
for (int i = 0; i < n; i += 1)
  { if (true) g(A[i]); }
<br>
The transforming pass has to consider this during its profitability<br>
model. The big advantage is that in terms of correctness, use-def<br>
chains do not manifest false dependencies.<br>
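
In pseudocode (same style as above, all helper names hypothetical), the
scalar expansion a fission transformation would perform could look like:

// Sketch: when the split separates a definition in Loop1 from a use in
// Loop2, expand the scalar into a temporary array indexed by the loop
// counter.
for (Def : definedIn(Loop1))
  if (usedIn(Loop2, Def)) {
    Tmp = createTemporaryArray(Def.getType(), Loop1.getTripCount());
    rewriteDefinition(Def, /*store into*/ Tmp);
    rewriteUsesIn(Loop2, Def, /*load from*/ Tmp);
    // The extra memory footprint must be accounted for in the
    // profitability model.
  }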
<br>
<br>
Possible Transformations<br>
------------------------<br>
In principle, any transformation could be applied on this<br>
representation, but instruction-centric transformations such as<br>
InstCombine are less suitable. Obviously, loop-centric transformations<br>
are a better fit. Ideas for loop transformations include:<br>
<br>
* Unroll(-and-jam)<br>
* Peeling<br>
* Iteration range split, concatenation<br>
* Interchange<br>
* Loop fusion and distribution(/fission)<br>
* Tiling/strip-mining/blocking<br>
* Code versioning (including loop unswitching)<br>
* Maybe vectorization<br>
* Accelerator offloading<br>
* Parallelization<br>
* Space-filling curve iteration [12]<br>
* Array packing (e.g. for scratch pads), transposition<br>
* Statement reordering<br>
* SIMT-ization<br>
* Detect and optimize reductions<br>
* Loop skewing/wavefronting<br>
* Loop reversal<br>
* Solver-driven (polyhedral) optimization<br>
* Loop idiom recognition<br>
<br>
<br>
Optimization Pipeline<br>
---------------------<br>
The optimization pipeline for the hierarchy DAG is straightforward:<br>
<br>
void optimizeLoopNest() {
  RedOriginal = generateLoopHierarchy();
  List = createCandidates(RedOriginal);
  sort(List, [](Cand1, Cand2) {
    return estimateExecutionTime(Cand1) < estimateExecutionTime(Cand2);
  });
  if (List[0] != RedOriginal)
    emit(List[0]);
}
<br>
Subtree exploration is done recursively, optimizing inner nodes which<br>
can be (re-)used when optimizing outer nodes.<br>
<br>
The current node to optimize is compared against a selection of known
heuristics and the resulting transformations are added to the list of
candidates. That is, an optimization routine resembles the following,
with more specific heuristics checked before more general ones.
<br>
void exploreOptimizations(Root, Worklist) {
  Matches = 0;

  // High level: user annotations
  if (Root.hasHint()) {
    // Hint is mandatory: do not explore any other possibility
    Worklist.replace(Root, applyHint(Root));
    return;
  }

  // High level: idiom recognition
  if (isBLAS(Root)) {
    Optimized = LinkAgainstBLASLibrary ? convertToBlasLibCall(Root)
                                       : applyBLIS(Root);
    Worklist.replace(Root, Optimized);
    return;
  }

  // Mid level: recognize stencils
  if (isStencil(Root))
    ...

  // Mid level: map to the compute hierarchy
  // (here: simd -> SMT -> core -> socket for CPUs)
  if (isParallel(Root) && fitsWorkingSet(Root.getBody(), LLCSize) &&
      isParallel(Root.getPerfectlyNestedLoop(1)) &&
      fitsWorkingSet(Root.getPerfectlyNestedLoop(2).getBody(), L2Size) &&
      isParallel(Root.getPerfectlyNestedLoop(2)) &&
      fitsWorkingSet(Root.getPerfectlyNestedLoop(3).getBody(), L1Size) &&
      isVectorizable(Root.getPerfectlyNestedLoop(3))) {
    Optimized = applyVectorization(Root.getPerfectlyNestedLoop(3));
    Optimized = applyParallel(Optimized.getParent(), /*places=*/SMT);
    Optimized = applyParallel(Optimized.getParent(), /*places=*/Cores);
    Optimized = applyParallel(Optimized.getParent(), /*places=*/Sockets);

    // Still explore CPU pipeline optimizations of the innermost loop
    Worklist.insert(Root, Optimized);
    Matches += 1;
  }

  if (Matches >= Threshold)
    return;

  // Low level: optimize working set for the L1 cache size using tiling
  Band = maximumTilableNest(Root);
  WorkingSetSizePerIteration = estimateWorkingSet(Band.back().getBody());
  TileSize = floor(nthroot(L1CacheSize / WorkingSetSizePerIteration,
                           Band.size()));
  if (TileSize > 1) {
    Worklist.insert(applyTiling(Root, TileSize.splat(Band.size())));
    Matches += 1;
  }

  if (Matches >= Threshold)
    return;

  // Low level: vectorization for each SIMD level (potentially
  // carried out by VPlan)
  foreach (Width : promisingSimdWidths(Root)) {
    Vectorized = applyVectorization(Root, Width);
    if (Vectorized.isSatisfiable()) {
      Worklist.insert(Vectorized);
      Matches += 1;
    }
  }

  if (Matches >= Threshold)
    return;

  // Low level: rely on a heuristic to find the best unrolling factor
  if (Factor = unrollFactorHeuristic(Root)) {
    Worklist.insert(applyUnroll(Root, Factor).getRoot());
    Matches += 1;
  }

  if (Matches >= Threshold)
    return;

  ...
}
<br>
<br>
The "levels" encode precedence optimizations. Depending on the<br>
optimization level, we might only apply the most promising<br>
optimization.<br>
<br>
Instead of this brute force implementation, for predefined<br>
optimization levels, the search space exploration should be guided:<br>
Encode the order of the most useful transformations in the search<br>
space exploration. For instance, find all idioms first, then apply<br>
memory hierarchy optimizations, then compute hierarchy, then CPU<br>
pipeline optimizations. Since this is very transformation specific, I<br>
do not expect to write a 'pass manager' where transformations can<br>
register itself to (except for exhaustive searches when explicitly<br>
enabled). Instead, I'd write this in code that classifies loop nests<br>
(e.g. compute bound, bandwidth bound, known idioms, tensor<br>
contractions, stencils, etc.) and choses appropriate optimization<br>
paths. We can, of course, make this more generic over time.<br>
<br>
<br>
<br>
Writing an optimization pass<br>
-----------------------------<br>
<br>
Optimizations themselves are written in the following steps:<br>
1. Check preconditions<br>
2. Quick profitability analysis to skip obviously bad transformations<br>
3. Create new green nodes to represent the changed loop nest; reuse<br>
unchanged green nodes as much as possible<br>
4. Create a red node to insert into the parent<br>
5. Preserve analyses (most analyses should be able to use information
stored in reused green nodes and/or lazily re-evaluate new nodes)
6. Check whether the transformation was valid<br>
<br>
In particular, the possibility of analyzing correctness after the
transformation has been applied could significantly reduce the amount
of code needed to verify its correctness.
<br>
A transformation like loop fusion could look like:<br>
<br>
void applyFuse(Red1, Red2) {
  // Assume Red1 and Red2 are consecutive sibling loops.

  // Bail out on atypical things (atomic accesses, exceptions,
  // convergent instructions, ...)
  if (!Red1->isSimple() || !Red2->isSimple())
    return;

  // Consistency checks (can be less strict by moving conditions into
  // the fused body).
  if (Red1->getIterationCount() != Red2->getIterationCount())
    return;
  if (Red1->getPredicate() != Red2->getPredicate())
    return;

  // Create new loop hierarchy with fused loops.
  GreenFused = new GreenLoop({ Red1->getGreen(), Red2->getGreen() },
                             Red1->getIterationCount());
  RedFused = new RedNode(GreenFused, Red1->getParent());
  NewRoot = RedFused->getRoot();

  // Preserve analyses.
  // In this case not really necessary since the green nodes of the
  // loop bodies are reused.
  DependenceAnalysis->transferAnalysis(Red1->getBody(),
      RedFused->getBody()[0],
      ClosedExprAnalysis->makeIdentityMapping(Red1->getIterator(),
                                              RedFused->getIterator()));
  DependenceAnalysis->transferAnalysis(Red2->getBody(),
      RedFused->getBody()[1],
      ClosedExprAnalysis->makeIdentityMapping(Red2->getIterator(),
                                              RedFused->getIterator()));

  ValidityCondition = DependenceAnalysis->isValid(NewRoot);
  if (ValidityCondition.isSatisfiable()) {
    RedFused->addAssumption(ValidityCondition);
    Candidates.push_back(NewRoot);
  }
}
<br>
<br>
<br>
Questions & Answers<br>
-------------------<br>
<br>
Q: Is the loop hierarchy a tree or DAG?<br>
A: Both. A green node can be reused multiple times in the same<br>
structure (e.g. due to unrolling, versioning, peelings, etc), but a<br>
red node is specific to the position in its parent; there can be<br>
multiple red nodes for the same green node.<br>
<br>
<br>
Q: No SSA?<br>
A: Since the location of a definition depends on the parent in a green<br>
DAG, possibly from different loop hierarchies after transformations,<br>
it is not possible to directly reference the definition with its use.<br>
However, it is possible in the red tree, but runs against the goal of<br>
avoiding heavy red nodes. Instead, to find the previous definition,
one has to traverse the green nodes upwards to find at which sibling
level the value is defined. However, this should not be a common
query; loop optimizations should use dependency information instead.
<br>
<br>
Q: Loops with multiple exits?<br>
A: If every exiting edge assigns a unique number to the
__???_executed variable, it can be handled like a switch.
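
For example (sketch), a loop with two exiting edges could be
represented as:

int __exit_taken = 0;  // 0 means normal termination
for (int i = 0; i < n && !__exit_taken; i += 1) {
  { if (cond1(i)) __exit_taken = 1; }                  // first exiting edge
  { if (!__exit_taken && cond2(i)) __exit_taken = 2; } // second exiting edge
  // remaining statements are predicated on !__exit_taken
}
// Code after the loop is predicated on __exit_taken, like a switch.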
<br>
<br>
Q: Unstructured control flow?<br>
A: Unstructured control flow only introduces disjunctions into
statement predicates. That is, the predicate
<br>
a or b<br>
<br>
corresponds to block BB2 in the CFG
<br>
BB0:
  br i1 %a, label %BB2, label %BB1

BB1:
  br i1 %b, label %BB2, label %BB3

BB2:
<br>
In structured control flow, BB0 and/or BB1 would be post-dominated by
BB2 and the disjunction would disappear.
<br>
<br>
Q: Irreducible loops?<br>
A: Can be folded into the first iteration of the loop by including a<br>
condition for statements not executed during the first iteration.<br>
<br>
<br>
Q: The unreachable terminator?<br>
A: If due to a non-returning function call, this introduces a loop
exit (which will also introduce a control dependency on everything
that follows) to the surrounding loops. If due to an assumption, we
may add the branching condition to the sufficient condition, but not<br>
to the necessary condition.<br>
<br>
<br>
Q: Infinite loops?<br>
A: The closed-form expressions identifying statement instances and the
loop trip count have no upper bound. Either use arbitrary range
arithmetic, use relative dependence vectors and/or fall back on<br>
pessimistic assumptions. In addition, like with the unreachable<br>
terminator, this may be handled as another loop exit.<br>
<br>
<br>
Q: IndirectBr?<br>
A: Since the list of possible targets is part of the instruction, it
can be handled like a switch over the target address.
<br>
<br>
Q: Loop guards?<br>
A: Loop guards manifest as the execution condition of the loop
statement's predicate. A loop whose predicate evaluates to false can<br>
be considered a loop with 0 iterations. The IR generator can also emit<br>
the predicate as a loop guard. It might be useful to ensure that if<br>
the predicate evaluates to true, the loop has at least one iteration<br>
and the loop-condition is not evaluated for the first iteration (like<br>
LoopRotation).<br>
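
For illustration (sketch), emitting the predicate `p` as a guard in a
rotated form:

if (p && n > 0) { // guard: the loop statement's predicate plus at
                  // least one iteration
  int i = 0;
  do {
    body(i);
    i += 1;
  } while (i < n); // loop condition not evaluated for the first iteration
}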
<br>
<br>
Q: Exceptions?<br>
A: Could also be handled with control dependencies, but I would<br>
instead mark the surrounding loops with "contains unpredictable<br>
control flow" to be ignored by most transformations. Similar to<br>
`!isSimple()` of LoadInst and StoreInst.<br>
<br>
<br>
Q: Relation to the new/legacy pass manager?<br>
A: LLVM's pass managers are unfortunately not designed to apply to<br>
subtrees nor persistent data structures besides the LLVM-IR itself.<br>
Instead, the loop tree optimizer would be its own monolithic pass on<br>
the pass manager level (like MachinePassManager and VPlan). My idea is<br>
to add it somewhere before LoopVectorize but after the inliner,
potentially replacing most other loop transformations. There should be
just one instance of the pass in the pipeline since creating an<br>
alternative representation of LLVM-IR is not that cheap. It does not<br>
need IR normalization (LoopSimplify, LICM, LoopRotate, LCSSA); these<br>
can be performed on the loop tree and emitted into the LLVM-IR if a<br>
non-modified tree is chosen. Possibly some of these passes can be<br>
removed.<br>
In order to preserve user-directed transformations (such as "#pragma
clang loop distribute"), it may also be interesting to put it earlier
in the pipeline to avoid other passes such as LoopFullUnroll changing
the loop before the transformation can be applied. Alternatively, such
passes can be taught to check for the presence of a user-directed
transformation before applying their own transformation.
<br>
<br>
Q: Relation to LoopVectorize/VPlan?<br>
A: VPlan has similar design goals [9] but is intended for<br>
vectorization only. It also has a hierarchical data structure, but<br>
uses a CFG for acyclic control flow. It is also intended to have<br>
multiple variants of the same source IR from which the best is<br>
selected via a cost function. However, it lacks cheap copies. Instead<br>
of instructions, it uses recipes/"meta-instructions" that handle what<br>
happens to instructions after vectorization, e.g. do that operation on<br>
each vector lane ("WIDEN"). VPlan is more oriented towards modifying<br>
instructions instead of statements as collection of instructions. With<br>
the loop hierarchy, I do not plan to redo the work of VPlan, but also<br>
do not see a reason why it should not also be able to do<br>
vectorization.<br>
<br>
<br>
Q: Relation to Polly?<br>
A: Polly uses ISL's schedule tree as the representation of the loop<br>
hierarchy. Its creation may fail if something not representable in a
schedule tree is found. Polly tries to find the maximum representable<br>
region, but once it is chosen, it cannot be changed (e.g. because<br>
optimizing it takes too long). The schedule tree is stored internally<br>
by ISL as a kind of green tree, but is walked over by an iterator that<br>
remembers the path to the root. While this could be used to select a<br>
best schedule tree using a cost function, Polly relies mostly on ISL's<br>
scheduling algorithm to find a profitable transformation.
<br>
<br>
Q: Relation to MLIR?<br>
A: MLIR is more similar to LLVM-IR than a loop hierarchy. For<br>
instance, it also does not feature cheap copies. An advantage is that
loops and multi-dimensional arrays can be represented in the language
without needing to be rediscovered, but they have to be inserted by a
front-end. That is, if Clang were generating MLIR, loops and arrays
still have to be rediscovered. However, a loop hierarchy optimizer<br>
could be applied to MLIR just as well as to LLVM-IR.<br>
<br>
<br>
References<br>
----------<br>
[1] http://llvm.org/devmtg/2018-10/talk-abstracts.html#talk11
[2] https://ieeexplore.ieee.org/abstract/document/5389392/
[3] http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf
[4] https://blogs.msdn.microsoft.com/ericlippert/2012/06/08/persistence-facades-and-roslyns-red-green-trees/
[5] https://github.com/Meinersbur/llvm-project/tree/LOF/llvm/lib/LOF
[6] https://llvm.org/devmtg/2017-10/#src2
[7] https://arxiv.org/abs/1811.00632
[8] https://lists.llvm.org/pipermail/llvm-dev/2017-October/118125.html
[9] https://llvm.org/docs/Proposals/VectorizationPlan.html
[10] https://github.com/open64-compiler/open64/tree/master/osprey/be/lno
[11] https://reviews.llvm.org/D70688
[12] https://arxiv.org/abs/1801.07399
--
Founder and Director, PolyMage Labs