[llvm-dev] RFC: Writing loop transformations on the right representation is more productive
Michael Kruse via llvm-dev
llvm-dev at lists.llvm.org
Fri Jan 3 07:57:54 PST 2020
I seem to have dropped the "RFC" in the title accidentally. That is,
this is meant to be a Request For Comments.
Michael
On Fri, Jan 3, 2020 at 12:26 Michael Kruse <llvmdev at meinersbur.de> wrote:
>
> In the 2018 LLVM DevMtg [1], I presented some shortcomings of how LLVM
> optimizes loops. In summary, the biggest issues are (a) the complexity
> of writing a new loop optimization pass (including needing to deal
> with a variety of low-level issues, a significant amount of required
> boilerplate, the difficulty of analysis preservation, etc.), (b)
> independent optimization heuristics and a fixed pass ordering and (c)
> code explosion due to versioning. Currently, different people are
> working on improving different loop optimization passes such as
> LoopInterchange, LoopFuse and LoopVectorize. Even if LLVM had
> 'perfect' individual loop passes, they would still have the
> aforementioned problems because they remain independent, and the
> system as a whole would be suboptimal.
>
> Instead of each loop pass being a major component, the heavy lifting
> could be done by a framework of which each transformation itself is a
> small part. In this RFC, I would like to work towards a consensus on
> how such a framework could look. I already outlined a possible
> solution in the same presentation [1] and a publication [7], which is
> partially based on a previous RFC [8]. All feedback is welcome,
> including a simple '+1'.
>
> The central idea is to use a modifiable loop tree -- similar to
> LoopInfo -- as the primary representation. LLVM-IR is converted to a
> loop tree, then optimized and finally LLVM-IR is generated again for
> subtrees that are considered profitable. This is not a new concept; it
> has already been used in compilers such as IBM XL Fortran (called ASTI
> [2]) and Silicon Graphics/Open64 (called LNO [10]), and in research
> such as the Value State Dependence Graph [3].
>
> Other features include the following:
>
> 1. Candidate selection through cost functions
> 2. Cheap copies using Red/Green Trees
> 3. Application of transformations from high-level to low-level
> 4. Represents loops and predicates instead of CFGs
> 5. Data and control dependencies in one graph
> 6. Late fallback versioning at IR regeneration
> 7. Standardized API for analyses with multiple implementations
> 8. Abstract representation of statements
> 9. Expansion of use-def chains to arrays when spanning loops
>
> To elaborate on each of these:
>
> 1. Candidate selection through cost function
> --------------------------------------------
> Instead of needing to know which transformation is profitable before
> applying it, create a copy of the data structure, modify it, and
> compare it to the original. Moreover, holding multiple, differently
> optimized copies allows evaluating each variant using a cost function
> and selecting the 'best' one when re-generating LLVM-IR (or re-using
> the original LLVM-IR).
>
> Otherwise, each transformation will have to implement its own
> profitability heuristic but cannot know what other passes will do. For
> instance, the motivation for LLVM's LoopDistribution is to enable
> vectorization of loops that can be separated from other loops.
> However, it does not know whether LoopVectorize will actually
> vectorize the loop, or whether two vectorizable loops are better
> vectorized together.
>
> Instantiating every possible sequence of transformations of course is
> not feasible, so the search space needs to be pruned. This could be
> made dependent on the optimization level.
>
>
> 2. Cheap copies using Red/Green Trees
> -------------------------------------
> A red/green tree [4] is a data structure for cheap copies of DAGs.
> Basically, a green tree node references only the node's children but
> not the parent. A node from the red tree references its corresponding
> green node and its parent (red) node. When modifying the represented
> node through copy-on-write, this allows reusing all green nodes except
> the ones from the modified node to the tree's root. The same node can
> even be used multiple times in the same copy, which can be useful for
> e.g. loop unrolling and code versioning. As a result, storing many
> optimization variants of the same code only requires copies of the
> changed parts.
>
> By contrast, red nodes are created on-demand. Traversing the entire
> tree should rarely be necessary (instead, use summary information in
> the green nodes) and, when it is, should be limited to subtrees that
> are considered interesting for optimization.
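>
> To illustrate the idea, here is a minimal C++ sketch (all names are
> hypothetical, not an existing API) of green nodes shared via
> copy-on-write and red nodes created on demand:
>
> #include <memory>
> #include <vector>
>
> // Green nodes are immutable and reference only their children.
> struct GreenNode {
>   std::vector<std::shared_ptr<const GreenNode>> Children;
>   // Summary information (side effects, trip count, ...) would live here.
> };
>
> // Red nodes are created on demand and add the parent link/position.
> struct RedNode {
>   std::shared_ptr<const GreenNode> Green;
>   const RedNode *Parent; // nullptr for the root
> };
>
> // Copy-on-write update: re-creates just this node; applying it along
> // the path to the root yields a new tree that shares all unchanged
> // green nodes with the original variant.
> std::shared_ptr<const GreenNode>
> replaceChild(const GreenNode &Node, size_t Idx,
>              std::shared_ptr<const GreenNode> NewChild) {
>   auto Copy = std::make_shared<GreenNode>(Node);
>   Copy->Children[Idx] = std::move(NewChild);
>   return Copy;
> }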
>
>
> 3. Apply transformations from high-level to low-level
> -----------------------------------------------------
> Optimization should be applied from very specialized to very general
> (potentially after some canonicalization). For instance, the first
> step could be detecting common idioms such as gemm and either
> replacing them with a BLAS function call or applying well-studied
> optimizations like BLIS to them. After such an idiom has been
> detected, no other transformation should be applied to it.
>
> Mid-level transformations may try to map entire loop nests to cache-
> and compute hierarchies (SIMT threads, multiprocessors, offloading,
> etc) by applying transformations such as tiling, loop interchange and
> array packing.
>
> Low-level transformations commonly look at only one loop to improve
> processor pipelining and predication: Unrolling, software pipelining,
> etc.
>
>
> 4. Represents loops and predicates instead of CFGs
> -------------------------------------------------
> Control flow graphs make loop optimizations difficult, but are also
> not something a loop optimizer typically cares about. For instance, in
> the example below, loop distribution would need to understand the
> condition `c` and reconstruct the branch to fission
>
> for (int i = 0; i < n; i += 1)
>   if (c) {
>     foo(i);
>     bar(i);
>   }
>
> into
>
> for (int i = 0; i < n; i += 1)
>   if (c)
>     foo(i);
> for (int i = 0; i < n; i += 1)
>   if (c)
>     bar(i);
>
> In the loop tree, control flow is represented through predicates of
> the statements. In a first step, it is represented as
>
> for (int i = 0; i < n; i += 1) {
>   bool __BB1_executed = 0;
>   { if (c) { foo(i); __BB1_executed = 1; } }
>   { if (__BB1_executed) bar(i); }
> }
>
> The predicates (`c` and `__BB1_executed`) can be separated into a
> "necessary condition" (`c` must be true before `foo` can be executed)
> and a "sufficient condition" (if `c` is true, `foo` must be executed).
> These can be different to allow speculative execution of the
> statement. For instance, a necessary condition of `true` means that
> the statement can always be executed speculatively.
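>
> As a sketch (hypothetical names, not an existing API), each statement
> could carry both conditions explicitly:
>
> struct Expr; // some expression representation (e.g. closed-form based)
>
> bool isAlwaysTrue(const Expr *E); // e.g. answered by expression analysis
>
> struct StmtPredicate {
>   const Expr *Necessary;  // must hold for the statement to execute
>   const Expr *Sufficient; // if it holds, the statement must execute
> };
>
> // A necessary condition of 'true' means the statement may always be
> // executed speculatively.
> inline bool canSpeculate(const StmtPredicate &P) {
>   return isAlwaysTrue(P.Necessary);
> }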
>
> The advantage is that control flow becomes a data-dependency problem.
> It can trivially be converted back to its CFG representation because
> `__BB1_executed` directly corresponds to the CFG edge to which it was
> connected. Data flow optimizations such as value propagation can be
> applied as well (here: `__BB1_executed == c`), such that we get
>
> for (int i = 0; i < n; i += 1) {
>   { if (c) foo(i); }
>   { if (c) bar(i); }
> }
>
> This can be distributed trivially (assuming foo() and bar() do not
> cause dependencies). Whether the copy propagation is beneficial can be
> decided by a heuristic (considering whether the original flow can
> still be reconstructed), or the variant can simply be thrown away if
> the cost function does not select a loop tree based on the copy
> propagation.
>
> As a result of the predicated representation, the loop tree has a
> regular pattern: sequence -> loop -> sequence -> loop -> ... . A loop
> here is abstract in that it is something that is executed multiple
> times. This can be a sequential loop, but also a collection of SPMD
> processes, threads, vector lanes, etc.
>
> At IR regeneration, we have multiple choices of how to lower
> conditions. If we lower to LLVM-IR CFG, it may lower to
> * Branch instructions, reusing branch conditions from previous statements
> * Select instructions
>
> If lowering to SIMD instructions, we can generate
> * Branch instructions for uniform predicates
> * Execution masks for varying predicates
> * A mix of both, since predicates can be uniform relative to
> previously taken conditions
>
> If lowering to NVPTX-style SIMT
> * @p per-instruction execution masks
> * Divergent branches; every statement can be made a re-convergence
> point (by re-evaluating the condition), but re-convergence will
> happen, at the latest, at the end of the loop
> * Uniform predicates can be given the ".uni" suffix
>
> When lowering to AMD-style SIMT
> * S_CBRANCH (explicitly managed execution mask)
> * S_CBRANCH_I/G_FORK (stack of execution masks)
>
>
> 5. Data and control dependencies in one graph
> ---------------------------------------------
> Also as a result of the predicated form, control dependencies and
> data dependencies have the same representation during dependency
> analysis, like in a program dependence graph. In LLVM-IR, passes
> such as AggressiveDeadCodeElimination use DominanceFrontier for this
> purpose. This would not be necessary.
>
> When re-generating IR, control dependencies can either be lowered to
> branching or to boolean variables. For the purpose of loop
> optimizations, the choice of representation does not (and should not)
> make a difference.
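>
> A minimal sketch (hypothetical types) of what such a unified edge
> could look like; a topological sort over these edges then respects
> control and data dependencies alike:
>
> struct GreenStmt;
>
> enum class DepKind { RegisterFlow, MemoryFlow, Anti, Output, Control };
>
> struct Dependence {
>   const GreenStmt *Src; // must be executed before ...
>   const GreenStmt *Dst; // ... this statement
>   DepKind Kind;
> };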
>
>
> 6. Late fallback versioning at IR regeneration
> ------------------------------------------
> When a transformation is applied, it can attach conditions (no
> aliasing, no integer wrap, value restrictions, etc.) under which the
> transformation is valid. During LLVM-IR generation, these conditions
> are collected and emitted as run-time conditions. If the condition
> fails, the original code is executed.
>
> This avoids having every transformation emit its own fallback code.
> The fallback should never execute, or only in cases that do not do
> significant work (e.g. loop trip count is 0), such that optimizing the
> fallback itself is not necessary. In other cases, the transformation
> should implement proper loop versioning by adding the original
> statement with a `!rtc` condition.
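>
> A rough sketch (all names hypothetical) of how the collected
> conditions could be combined into a single runtime check guarding the
> optimized version:
>
> #include <vector>
>
> struct RedNode;       // optimized / original loop tree variants
> struct Assumption {}; // e.g. a no-alias or no-overflow condition
> struct Condition {};  // runtime-evaluable condition
>
> std::vector<Assumption> collectAssumptions(const RedNode *Optimized);
> Condition buildRuntimeCheck(const std::vector<Assumption> &Assumptions);
> void emitGuarded(const Condition &RTC, const RedNode *Optimized,
>                  const RedNode *Original);
>
> // Emit "if (RTC) { optimized version } else { original code }".
> void emitWithFallback(const RedNode *Optimized, const RedNode *Original) {
>   Condition RTC = buildRuntimeCheck(collectAssumptions(Optimized));
>   emitGuarded(RTC, Optimized, Original);
> }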
>
>
> 7. Standardized API for analyses with multiple implementations
> --------------------------------------------------------------
> Common analyses use an interface that can be fulfilled by different
> implementations. Among these are:
>
> * Closed form expressions
> - None (expressions are always represented through dependencies
> between statements)
> - LLVM ScalarEvolution
> - PolyhedralValueAnalysis [6]
>
> Closed form expressions can be used to remap values (typically array
> subscripts, loop lower/upper bounds), simplify expression computations
> (including predicates => is a predicate always true?), determine loop
> trip counts, etc. Most importantly, they identify a statement
> execution (instance) relative to the loop counters.
>
>
> * Array accesses / access range / provenance
> - LLVM AliasAnalysis
> - One-dimensional (always linearized)
> - Multi-dimensional, delinearized if necessary
> - From metadata (that a front-end emits for arrays of pointers to
> arrays of data)
>
> Array access analysis determines whether two accesses at specific
> indices can alias each other, as well as access ranges and connectivity.
>
>
> * Dependencies
> - none (cannot reorder statements)
> - flat (depends on last execution of another statement)
> - LLVM DependenceAnalysis
> - Polyhedral analysis
>
> There is intentionally no "does X depend on Y?" API as this would make
> any analysis based on it quadratic in runtime. Instead, it has to list
> all statements another statement depends on such that a graph can be
> built and e.g. sorted topologically.
>
> Dependencies include: use-def chains via registers, memory
> flow/anti/output, general side-effects and control dependencies.
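>
> The interface could therefore be shaped roughly like this (hypothetical
> names); each implementation enumerates the dependences of a statement
> instead of answering pairwise queries:
>
> #include <vector>
>
> struct GreenStmt;
>
> struct Dependence {
>   const GreenStmt *Dependee; // statement that must execute first
>   // kind, distance/direction vector, ... depending on the implementation
> };
>
> class DependenceInfo {
> public:
>   virtual ~DependenceInfo() = default;
>   // Append all statements that Stmt depends on; used to build the
>   // dependence graph, e.g. for topological sorting.
>   virtual void getDependences(const GreenStmt *Stmt,
>                               std::vector<Dependence> &Result) = 0;
> };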
>
>
> * Control flow
> - Extracted from original CFG
> - Evaluate statement conditions
>
> Control flow analysis answers questions about (post-)dominance and
> mutual exclusivity within the acyclic sequence/body of a loop.
>
>
> * Cost heuristic
> - TargetTransformInfo
>
> Estimate the execution time of a (sub-)tree. Average trip counts from
> profiling [11] could be useful.
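>
> For illustration, a recursive estimate over the tree could look like
> the following sketch (hypothetical node classes; per-statement costs
> would come from TargetTransformInfo, trip counts from profiling or
> closed-form analysis):
>
> #include <vector>
>
> struct GreenNode {
>   virtual ~GreenNode() = default;
>   virtual double estimateTime() const = 0;
> };
>
> struct GreenStmt : GreenNode {
>   double InstrCost = 0; // e.g. summed per-instruction costs
>   double estimateTime() const override { return InstrCost; }
> };
>
> struct GreenLoop : GreenNode {
>   double ExpectedTripCount = 1; // e.g. from profile data [11]
>   std::vector<const GreenNode *> Body;
>   double estimateTime() const override {
>     double BodyTime = 0;
>     for (const GreenNode *Child : Body)
>       BodyTime += Child->estimateTime();
>     return ExpectedTripCount * BodyTime;
>   }
> };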
>
>
> 8. Abstract representation of statements
> ----------------------------------------
> There is no direct relationship between LLVM instructions and loop DAG
> statements. If possible, loop tree optimizations should only use the
> properties of the statement node, not its contents. Typically, a
> statement consists of multiple instructions that do not make sense to
> split. For instance, assuming that %add is not used a second time, in
> the example below
>
> %add = add i64 %arg, 2
> %mul = shl i64 %add, 1
>
> the two instructions should always be computed together in the same
> loop iteration. Combining instructions to statements reduces the
> computational complexity of loop optimizations and might be controlled
> by the optimization level (at lower optimization level more
> instructions would be combined). A loop node is itself a statement in
> its parent loop node's acyclic body.
>
> Moreover, there is no reason to use only LLVM-IR to carry the
> semantics. We could also use e.g. Machine-IR or MLIR operations.
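>
> A statement node could be as simple as the following sketch
> (hypothetical names), bundling the instructions and the summary flags
> that transformations actually query:
>
> #include <vector>
>
> namespace llvm { class Instruction; }
>
> struct GreenStmt {
>   std::vector<llvm::Instruction *> Instrs; // e.g. {%add, %mul} from above
>   bool HasSideEffects = false;
>   bool IsSimple = true; // no atomics, exceptions, convergent calls, ...
> };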
>
>
> 9. Expansion of use-def chains to arrays when spanning loops
> ------------------------------------------------------------
> Consider that we would like to fission the following loop:
>
> for (int i = 0; i < n; i += 1) {
>   { if(true) double a = f(i); }
>   { if(true) g(a); }
> }
>
> The problem is that the split cuts the use-def dependency over `a`.
> However, the fission can still be done and results in:
>
> double A[n];
> for (int i = 0; i < n; i += 1) {
>   { if(true) A[i] = f(i); }
> }
> for (int i = 0; i < n; i += 1) {
>   { if(true) g(A[i]); }
> }
>
> The transforming pass has to account for this in its profitability
> model. The big advantage is that, in terms of correctness, use-def
> chains do not manifest false dependencies.
>
>
> Possible Transformations
> ------------------------
> In principle, any transformation could be applied on this
> representation, but instruction-centric transformations such as
> InstCombine are less suitable. Obviously, loop-centric transformations
> are a better fit. Ideas for loop transformations include:
>
> * Unroll(-and-jam)
> * Peeling
> * Iteration range split, concatenation
> * Interchange
> * Loop fusion and distribution(/fission)
> * Tiling/strip-mining/blocking
> * Code versioning (including loop unswitching)
> * Maybe vectorization
> * Accelerator offloading
> * Parallelization
> * Space-filling curve iteration [12]
> * Array packing (e.g. for scratch pads), transposition
> * Statement reordering
> * SIMT-ization
> * Detect and optimize reductions
> * Loop skewing/wavefronting
> * Loop reversal
> * Solver-driven (polyhedral) optimization
> * Loop idiom recognition
>
>
> Optimization Pipeline
> ---------------------
> The optimization pipeline for the hierarchy DAG is straightforward:
>
> void optimizeLoopNest() {
>   RedOriginal = generateLoopHierarchy();
>   List = createCandidates(RedOriginal);
>   sort(List, [](Cand1, Cand2) {
>     return estimateExecutionTime(Cand1) < estimateExecutionTime(Cand2);
>   });
>   if (List[0] != RedOriginal)
>     emit(List[0]);
> }
>
> Subtree exploration is done recursively, optimizing inner nodes which
> can be (re-)used when optimizing outer nodes.
>
> The current node to optimize is compared against a selection of known
> heuristics and its transformation is added to the list of candidates.
> That is, an optimization routine resembles the following, with more
> specific heuristics checked before more general ones.
>
> void exploreOptimizations(Root, Worklist) {
>   Matches = 0;
>
>   // High level: user annotations
>   if (Root.hasHint()) {
>     // Hint is mandatory: do not explore any other possibility
>     Worklist.replace(Root, applyHint(Root));
>     return;
>   }
>
>   // High level: Idiom recognition
>   if (isBLAS(Root)) {
>     Optimized = LinkAgainstBLASLibrary ? convertToBlasLibCall(Root)
>                                        : applyBLIS(Root);
>     Worklist.replace(Root, Optimized);
>     return;
>   }
>
>   // Mid level: recognize stencils
>   if (isStencil(Root))
>     ...
>
>   // Mid level: Map to compute hierarchy
>   // (here: simd -> SMT -> core -> socket for CPUs)
>   if (isParallel(Root) && fitsWorkingSet(Root.getBody(), LLCSize) &&
>       isParallel(Root.getPerfectlyNestedLoop(1)) &&
>       fitsWorkingSet(Root.getPerfectlyNestedLoop(2).getBody(), L2Size) &&
>       isParallel(Root.getPerfectlyNestedLoop(2)) &&
>       fitsWorkingSet(Root.getPerfectlyNestedLoop(3).getBody(), L1Size) &&
>       isVectorizable(Root.getPerfectlyNestedLoop(3))) {
>     Optimized = applyVectorization(Root.getPerfectlyNestedLoop(3));
>     Optimized = applyParallel(Optimized.getParent(), /*places=*/SMT);
>     Optimized = applyParallel(Optimized.getParent(), /*places=*/Cores);
>     Optimized = applyParallel(Optimized.getParent(), /*places=*/Sockets);
>
>     // Still explore CPU pipeline optimizations of the innermost loop
>     Worklist.insert(Root, Optimized);
>     Matches += 1;
>   }
>
>   if (Matches >= Threshold)
>     return;
>
>   // Low level: optimize working set for L1 cache size using tiling
>   Band = maximumTilableNest(Root);
>   WorkingSetSizePerIteration = estimateWorkingSet(Band.back().getBody());
>   TileSize = floor(nthroot(L1CacheSize / WorkingSetSizePerIteration,
>                            Band.size()));
>   if (TileSize > 1) {
>     Worklist.insert(applyTiling(Root, TileSize.splat(Band.size())));
>     Matches += 1;
>   }
>
>   if (Matches >= Threshold)
>     return;
>
>   // Low level: vectorization for each SIMD level (potentially
>   // carried out by VPlan)
>   foreach (Width : promisingSimdWidths(Root)) {
>     Vectorized = applyVectorization(Root, Width);
>     if (Vectorized.isSatisfiable()) {
>       Worklist.insert(Vectorized);
>       Matches += 1;
>     }
>   }
>
>   if (Matches >= Threshold)
>     return;
>
>   // Low level: rely on heuristic to find the best unrolling factor
>   if (Factor = unrollFactorHeuristic(Root)) {
>     Worklist.insert(applyUnroll(Root, Factor).getRoot());
>     Matches += 1;
>   }
>
>   if (Matches >= Threshold)
>     return;
>
>   ...
> }
>
>
> The "levels" encode precedence optimizations. Depending on the
> optimization level, we might only apply the most promising
> optimization.
>
> Instead of this brute force implementation, for predefined
> optimization levels, the search space exploration should be guided:
> Encode the order of the most useful transformations in the search
> space exploration. For instance, find all idioms first, then apply
> memory hierarchy optimizations, then compute hierarchy, then CPU
> pipeline optimizations. Since this is very transformation specific, I
> do not expect to write a 'pass manager' that transformations can
> register themselves with (except for exhaustive searches when
> explicitly enabled). Instead, I'd write this in code that classifies
> loop nests (e.g. compute bound, bandwidth bound, known idioms, tensor
> contractions, stencils, etc.) and chooses appropriate optimization
> paths. We can, of course, make this more generic over time.
>
>
>
> Writing an optimization pass
> -----------------------------
>
> Optimizations themselves are written in the following steps:
> 1. Check preconditions
> 2. Quick profitability analysis to skip obviously bad transformations
> 3. Create new green nodes to represent the changed loop nest; reuse
> unchanged green nodes as much as possible
> 4. Create a red node to insert into the parent
> 5. Preserve analyses (most analyses should be able to use analysis
> results stored in reused green nodes and/or lazily reevaluate new nodes)
> 6. Check whether the transformation was valid
>
> Especially the possibility to analyze correctness after the
> transformation has been applied could significantly reduce the amount
> of code needed to verify a transformation's validity.
>
> A transformation like loop fusion could look like:
>
> void applyFuse(Red1, Red2) {
>   // Assume Red1 and Red2 are consecutive sibling loops.
>
>   // Bail out on untypical things (atomic accesses, exceptions,
>   // convergent instructions, ...)
>   if (!Red1->isSimple() || !Red2->isSimple())
>     return;
>
>   // Consistency checks (can be less strict by moving conditions
>   // into the fused body).
>   if (Red1->getIterationCount() != Red2->getIterationCount())
>     return;
>   if (Red1->getPredicate() != Red2->getPredicate())
>     return;
>
>   // Create new loop hierarchy with fused loops.
>   GreenFused = new GreenLoop({ Red1->getGreen(), Red2->getGreen() },
>                              Red1->getIterationCount());
>   RedFused = new RedNode(GreenFused, Red1->getParent());
>   NewRoot = RedFused->getRoot();
>
>   // Preserve analysis.
>   // In this case not really necessary since the green nodes of the
>   // loop bodies are reused.
>   DependenceAnalysis->transferAnalysis(Red1->getBody(),
>       RedFused->getBody()[0],
>       ClosedExprAnalysis->makeIdentityMapping(Red1->getIterator(),
>                                               RedFused->getIterator()));
>   DependenceAnalysis->transferAnalysis(Red2->getBody(),
>       RedFused->getBody()[1],
>       ClosedExprAnalysis->makeIdentityMapping(Red2->getIterator(),
>                                               RedFused->getIterator()));
>
>   ValidityCondition = DependenceAnalysis->isValid(NewRoot);
>   if (ValidityCondition.isSatisfiable()) {
>     RedFused->addAssumption(ValidityCondition);
>     Candidates.push_back(NewRoot);
>   }
> }
>
>
>
> Questions & Answers
> -------------------
>
> Q: Is the loop hierarchy a tree or DAG?
> A: Both. A green node can be reused multiple times in the same
> structure (e.g. due to unrolling, versioning, peelings, etc), but a
> red node is specific to the position in its parent; there can be
> multiple red nodes for the same green node.
>
>
> Q: No SSA?
> A: Since the location of a definition depends on the parent in a green
> DAG, possibly from different loop hierarchies after transformations,
> it is not possible to directly reference the definition from its use.
> It would be possible in the red tree, but that runs against the goal
> of avoiding heavy red nodes. Instead, to find the previous definition,
> one has to traverse the green tree upwards to find at which sibling
> level the value is defined. However, this should not be a common
> query; loop optimizations should use dependency information instead.
>
>
> Q: Loops with multiple exits?
> A: If every exiting edge assigns a unique number to its __???_executed
> variable, it can be handled like a switch.
>
>
> Q: Unstructured control flow?
> A: Unstructured control flow only introduces disjunctions to statement
> predicates. That is, the predicate
>
> a or b
>
> corresponds to block BB2 in the CFG
>
> BB0:
>   br i1 %a, label %BB2, label %BB1
>
> BB1:
>   br i1 %b, label %BB2, label %BB3
>
> BB2:
>
> In structured control flow, BB0 and/or BB1 would be post-dominated by
> BB2 and the disjunction would disappear.
>
>
> Q: Irreducible loops?
> A: Can be folded into the first iteration of the loop by including a
> condition for statements not executed during the first iteration.
>
>
> Q: The unreachable terminator?
> A: If due to a non-returning function call, this introduces a loop
> exit (which will also introduce a control dependency for everything
> that follows) to the surrounding loops. If due to an assumption, we
> may add the branching condition to the sufficient condition, but not
> to the necessary condition.
>
>
> Q: Infinite loops?
> A: The closed-form expressions identifying statement instances and
> the loop trip count have no upper bound. Either use arbitrary range
> arithmetic, use relative dependence vectors and/or fall back on
> pessimistic assumptions. In addition, like with the unreachable
> terminator, this may be handled as another loop exit.
>
>
> Q: IndirectBr?
> A: Since the list of possible targets is part of the instruction, it
> can be handled like a switch over the target address.
>
>
> Q: Loop guards?
> A: Loop guards manifest as the execution condition of the loop
> statement's predicate. A loop whose predicate evaluates to false can
> be considered a loop with 0 iterations. The IR generator can also emit
> the predicate as a loop guard. It might be useful to ensure that if
> the predicate evaluates to true, the loop has at least one iteration
> and the loop-condition is not evaluated for the first iteration (like
> LoopRotation).
>
>
> Q: Exceptions?
> A: Could also be handled with control dependencies, but I would
> instead mark the surrounding loops with "contains unpredictable
> control flow" to be ignored by most transformations. Similar to
> `!isSimple()` of LoadInst and StoreInst.
>
>
> Q: Relation to the new/legacy pass manager?
> A: LLVM's pass managers are unfortunately not designed to apply to
> subtrees nor persistent data structures besides the LLVM-IR itself.
> Instead, the loop tree optimizer would be its own monolithic pass on
> the pass manager level (like MachinePassManager and VPlan). My idea is
> to add it somewhere before LoopVectorize, but after the inliner,
> potentially replacing most other loop transformations. There should be
> just one instance of the pass in the pipeline since creating an
> alternative representation of LLVM-IR is not that cheap. It does not
> need IR normalization (LoopSimplify, LICM, LoopRotate, LCSSA); these
> can be performed on the loop tree and emitted into the LLVM-IR if a
> non-modified tree is chosen. Possibly some of these passes can be
> removed.
> In order to preserve user-directed transformations (such as "#pragma
> clang loop distribute"), it may also be interesting to put it earlier
> into the pipeline to avoid that other passes such as LoopFullUnroll
> change the loop before the transformation can be applied.
> Alternatively, such passes can be taught to check for the presence of
> a user-directed transformation before applying their own transformations.
>
>
> Q: Relation to LoopVectorize/VPlan?
> A: VPlan has similar design goals [9] but is intended for
> vectorization only. It also has a hierarchical data structure, but
> uses a CFG for acyclic control flow. It is also intended to have
> multiple variants of the same source IR from which the best is
> selected via a cost function. However, it lacks cheap copies. Instead
> of instructions, it uses recipes/"meta-instructions" that handle what
> happens to instructions after vectorization, e.g. do that operation on
> each vector lane ("WIDEN"). VPlan is more oriented towards modifying
> instructions instead of statements as collection of instructions. With
> the loop hierarchy, I do not plan to redo the work of VPlan, but also
> do not see a reason why it should not also be able to do
> vectorization.
>
>
> Q: Relation to Polly?
> A: Polly uses ISL's schedule tree as the representation of the loop
> hierarchy. Its creation may fail if something not representable in a
> schedule tree is found. Polly tries to find the maximum representable
> region, but once it is chosen, it cannot be changed (e.g. because
> optimizing it takes too long). The schedule tree is stored internally
> by ISL as a kind of green tree, but is walked over by an iterator that
> remembers the path to the root. While this could be used to select a
> best schedule tree using a cost function, Polly relies mostly on ISL's
> reschedule algorithm to find a profitable transformation.
>
>
> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
> instance, it also does not feature cheap copies. An advantage is that
> loops and multi-dimensional arrays can be represented in the language
> without needing to be rediscovered, but they have to be inserted by a
> front-end. That is, if Clang were generating MLIR, loops and arrays
> would still have to be rediscovered. However, a loop hierarchy
> optimizer could be applied to MLIR just as well as to LLVM-IR.
>
>
> References
> ----------
> [1] http://llvm.org/devmtg/2018-10/talk-abstracts.html#talk11
> [2] https://ieeexplore.ieee.org/abstract/document/5389392/
> [3] http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf
> [4] https://blogs.msdn.microsoft.com/ericlippert/2012/06/08/persistence-facades-and-roslyns-red-green-trees/
> [5] https://github.com/Meinersbur/llvm-project/tree/LOF/llvm/lib/LOF
> [6] https://llvm.org/devmtg/2017-10/#src2
> [7] https://arxiv.org/abs/1811.00632
> [8] https://lists.llvm.org/pipermail/llvm-dev/2017-October/118125.html
> [9] https://llvm.org/docs/Proposals/VectorizationPlan.html
> [10] https://github.com/open64-compiler/open64/tree/master/osprey/be/lno
> [11] https://reviews.llvm.org/D70688
> [12] https://arxiv.org/abs/1801.07399