<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 15, 2020 at 20:27, Chris Lattner &lt;<a href="mailto:clattner@nondot.org">clattner@nondot.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div><blockquote type="cite"><div dir="auto"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><div>Once you achieve consensus on data structure, there is the question of what IR to use within it.  I would recommend starting with some combination of “existing LLVM IR operations + high level control flow representation”, e.g. parallel and affine loops.  The key here is that you need to always be able to lower in a simple and predictable way to LLVM IR (this is one of the things that classic polyhedral systems did suboptimally, making it difficult to reason about the cost model of various transformations), and this is a natural incremental starting point anyway.  Over time, more high level concepts can be gradually introduced.  FYI, MLIR already has a reasonable <a href="https://mlir.llvm.org/docs/Dialects/LLVM/" rel="noreferrer" target="_blank">LLVM dialect</a> and can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which should be straightforward to build.</div></div></div></blockquote><div><br></div><div>Adding an LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning just introduce compile-time overhead and what Renato described as brittleness. I fear this hurts adoption.</div></div></div></div></blockquote><div><br></div><div>Isn’t this true of *any* higher level IR?  Unless I’m missing something big, this seems inherent to your proposal.</div></div></div></blockquote><div><br></div><div>No.
A loop hierarchy may be created on demand and can be skipped if, e.g., the function does not contain a loop. For IRs that are translation-unit based, the entire module has to do a round-trip whether it changed or not. To improve the situation, one could, e.g., add a "has been changed" flag to each function, but that flag has to be added somewhere in the MLIR data structure and kept up to date on every modification. In a loop-hierarchical structure, only the nodes that have changed (e.g. an innermost loop) need to be lowered and versioned against the original IR, depending on the assumptions made.</div><div> </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div><blockquote type="cite"><div dir="auto"><div dir="ltr"><div class="gmail_quote"><div>This is definitely a subjective question. I think that MLIR is closer to LLVM-IR in how it is processed. Both have a sequence of passes running over a single source of truth. Both allow walking the entire structure from every instruction/operation/block. Analyses are on function or module level. Both have CFGs (I think for certain kinds of transformations it is an advantage that control flow is handled implicitly).</div></div></div></div></blockquote><div><br></div><div>Right, but a frequent way that MLIR is used is without its CFG: most machine learning kernels use nests of loops and ifs, not CFGs.  CFGs are exposed when those are lowered out.  See some simple examples like:</div><div><a href="https://github.com/llvm/llvm-project/blob/master/mlir/test/Transforms/affine-data-copy.mlir" target="_blank">https://github.com/llvm/llvm-project/blob/master/mlir/test/Transforms/affine-data-copy.mlir</a></div><div><br></div></div></div></blockquote><div><br></div><div>I agree that a loop nest can be represented in MLIR. What is missing IMHO is being able to have multiple versions of the same code.
For instance, raising emitted C++ to such a representation to make it more optimizable may only be possible under preconditions, and the raising by itself may make the code slower. If the raised representation cannot be optimized, we will want to use the original one.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div><br><blockquote type="cite"><div dir="auto"><div dir="ltr"><div class="gmail_quote"><div>The possibility to make local changes speculatively without copying the entire data structure. IMHO this is a central idea that allows applying a transformation speculatively to pass it to a legality check and cost heuristic without committing to apply it. As a consequence, passes do not need to implement these in a transformation-specific manner, drastically reducing the burden of implementation.</div><div><br></div><div>For instance, more loop transformations are feasible if instructions are moved into the innermost loops. With speculative transformations, we can canonicalize the representation to sink computations into loops -- the opposite of what LICM does -- and then see whether a transformation can be applied. If not, the speculative representation is discarded without having an effect on the original representation (and not needing to hoist those computations again).</div><div><br></div><div>Because the MLIR classes have many references to related objects (such as pointers to parents and use-def chains), I don't think it is feasible to implement on top of MLIR. </div></div></div></div></blockquote><div><br></div><div>Ah yes, I see what you mean.  One way to do that is to represent multiple options as an op with a region for each option.  This means you only fork the part of the IR that you’re producing variants of.
I think this is the red/green tree technique you mentioned, but I’m not sure.</div><br></div></div></blockquote><div><br></div><div>The red-green tree technique even allows re-inserting entire unchanged subtrees (e.g. loop bodies after an interchange). If an op takes multiple regions, each region must still be a deep copy.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div><br><blockquote type="cite"><div dir="auto"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><blockquote type="cite"><div><div>An advantage is that<br>loops and multi-dimensional arrays can be represented in the language<br>without the need of being rediscovered, but have to be inserted by a<br>front-end. </div></div></blockquote><div><br></div><div>This is correct, but I don’t see how this helps if your focus is raising code that has already been lowered to LLVM IR, e.g. by Clang or some other frontend that generates LLVM IR today.</div></div></div></blockquote><div><br></div><div>Indeed, I would hope that LLVM-IR can preserve multi-dimensional array accesses in some fashion as well (<a href="https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html" rel="noreferrer" target="_blank">https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html</a>). However, currently MLIR has the advantage of being able to represent it.</div></div></div></div></blockquote><div><br></div><div>I don’t think LLVM IR will ever get there without a massive design change.  It is possible that it will support static shaped accesses in limited ways though.</div><br></div></div></blockquote><div><br></div><div>Statically sized rectangular multi-dimensional arrays are already possible using a standard GetElementPtr and its inrange qualifier.
For dynamically sized multi-dimensional arrays, what is needed is to convey the dimensions of the array in the form of an llvm::Value. In the RFC we discussed an intrinsic and operand bundles; neither looks like a massive design change to me.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div><blockquote type="cite"><div dir="auto"><div dir="ltr"><div class="gmail_quote"><div>On the other side, natural loop detection on CFGs is quite mature (with a remaining issue of irreducible loops that might appear, but can also be eliminated again). As a plus, optimization depends less on how the source code is written.<br></div></div></div></div></blockquote><br></div><div>Yep totally.  The question is whether you lose semantic information from lowering to a CFG and reconstructing back up.  This can affect you when you have higher level language semantics (e.g. Fortran parallel loops, OpenMP or other concurrency constructs etc).  This is where MLIR excels of course.</div><div><br></div></div></blockquote><div><br></div><div>Indeed, it is easier not to lower these constructs, but doing so is not impossible (as shown in <a href="https://reviews.llvm.org/D69930">https://reviews.llvm.org/D69930</a>). I think the relevant difference is that these constructs come with additional guarantees (e.g. Single-Entry-Single-Exit regions) and optimization hurdles (e.g. thread synchronization, where programmers do not expect the compiler to do a lot of things) compared to C++ loop constructs.</div><div><br></div><div><br></div><div>Michael</div><div><br></div></div></div>