[Mlir-commits] [mlir] 975284a - [mlir][bufferization] Update public MLIR documentation

Matthias Springer llvmlistbot at llvm.org
Mon Mar 14 06:14:00 PDT 2022

Author: Matthias Springer
Date: 2022-03-14T22:13:41+09:00
New Revision: 975284ab4b18232fc10dccd14527981ebcb8663e

URL: https://github.com/llvm/llvm-project/commit/975284ab4b18232fc10dccd14527981ebcb8663e
DIFF: https://github.com/llvm/llvm-project/commit/975284ab4b18232fc10dccd14527981ebcb8663e.diff

LOG: [mlir][bufferization] Update public MLIR documentation

Differential Revision: https://reviews.llvm.org/D121071




diff  --git a/mlir/docs/Bufferization.md b/mlir/docs/Bufferization.md
index dbfc7c52c12e9..fef8a968c5787 100644
--- a/mlir/docs/Bufferization.md
+++ b/mlir/docs/Bufferization.md
@@ -4,16 +4,405 @@
 ## Overview
-Bufferization in MLIR is the process of converting the `tensor` type to the
-`memref` type. MLIR provides a composable system that allows dialects to
-systematically bufferize a program. This system is a simple application of
-MLIR's [dialect conversion](DialectConversion.md) infrastructure. The bulk of
-the code related to bufferization is a set of ordinary `ConversionPattern`'s
-that dialect authors write for converting ops that operate on `tensor`'s to ops
-that operate on `memref`'s. A set of conventions and best practices are followed
-that allow these patterns to be run across multiple independent passes (rather
-than requiring a single huge atomic conversion pass), which makes the
-compilation pipelines scalable, robust, and easy to debug.
+Bufferization in MLIR is the process of converting ops with `tensor` semantics
+to ops with `memref` semantics. MLIR provides an infrastructure that bufferizes
+an entire program in a single pass (*One-Shot Bufferize*). This infrastructure
+bufferizes all ops that implement `BufferizableOpInterface`.
+MLIR has an older bufferization infrastructure built around
+[dialect conversion](DialectConversion.md). Most dialect conversion
+bufferization patterns have been migrated to One-Shot Bufferize, but some
+functionality such as function boundary bufferization still depends on dialect
+conversion and its type converter. New projects should use One-Shot Bufferize,
+as the dialect conversion-based bufferization will eventually be deprecated.
+Moreover, One-Shot Bufferize results in better bufferization with fewer memory
+allocations and buffer copies. This documentation is mostly about One-Shot
+Bufferize, but also describes how to gradually migrate a project from dialect
+conversion-based bufferization to One-Shot Bufferize.
+## What is One-Shot Bufferize?
+One-Shot Bufferize is a new tensor bufferization pass designed for IR in
+[destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf),
+and with aggressive in-place bufferization.
+One-Shot Bufferize is:
+* **Monolithic**: A single MLIR pass does the entire
+work, whereas the previous bufferization in MLIR was split across multiple
+passes residing in different dialects. In One-Shot Bufferize,
+`BufferizableOpInterface` implementations are spread across different dialects.
+* A **whole-function at a time analysis**. In-place bufferization decisions are
+made by analyzing SSA use-def chains on tensors. Op interface implementations
+not only provide the rewrite logic from tensor ops to memref ops, but also
+helper methods for One-Shot Bufferize's analysis to query information about an
+op's bufferization/memory semantics.
+* **Extensible** via an op interface: All
+ops that implement `BufferizableOpInterface` can be bufferized.
+* **2-Pass**:
+Bufferization is internally broken down into 2 steps: First, analyze the entire
+IR and make bufferization decisions. Then, bufferize (rewrite) the IR. The
+analysis has access to exact SSA use-def information. It incrementally builds
+alias and equivalence sets and does not rely on a-posteriori alias analysis from
+preallocated memory.
+* **Greedy**: Operations are analyzed one-by-one and it is
+decided on the spot whether a tensor OpOperand must be copied or not. Heuristics
+determine the order of analysis.
+* **Modular**: The current One-Shot Analysis
+can be replaced with a different analysis. The results of the analysis are
+queried by the bufferization via `BufferizationState`, in particular
+`BufferizationState::isInPlace`. Any derived class of `BufferizationState` that
+implements a small number of virtual functions can serve as a custom analysis. It
+is even possible to run One-Shot Bufferize without any analysis
+(`AlwaysCopyBufferizationState`), in which case One-Shot Bufferize behaves
+exactly like the old dialect conversion-based bufferization (i.e., copy every
+buffer before writing to it).
+To reduce complexity, One-Shot Bufferize should be
+[run after other transformations](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373),
+typically as one of the last steps right before lowering memref ops. Many
+transformations are easier in tensor land; e.g., tile/fuse/… on tensors first,
+then bufferize the remaining IR.
+From an architecture perspective, One-Shot Bufferize consists of
+`BufferizableOpInterface` (and its implementations) and an analysis
+of tensor SSA values that decides if a buffer can be used directly or must be
+copied. The `bufferize` method of the op interface inspects analysis results and
+rewrites tensor ops into memref ops.
+## Goals of Bufferization
+The high-level goal of every bufferization technique is to:
+1. Use as little memory as possible.
+2. Copy as little memory as possible.
+This implies reusing already allocated buffers when possible, turning
+bufferization into an algorithmically complex problem with similarities to
+register allocation.
+Depending on the concrete use case, there may be additional bufferization
+requirements. If the contents of a buffer are expensive to compute, there could
+be a tradeoff between *recomputation* and *compute once and copy*. On the
+contrary, it may not even be possible to allocate new buffers at runtime on some
+## Destination-Passing Style
+Bufferization is an algorithmically complex problem. Given an op with a tensor
+result, bufferization has to choose a memref buffer in which the result can be
+stored. It is always safe to allocate a brand new buffer, but such a
+bufferization strategy would be unacceptable for high-performance codegen. When
+choosing an already existing buffer, we must be careful not to accidentally
+overwrite data that is still needed later in the program.
+To simplify this problem, One-Shot Bufferize was designed for ops that are in
+*destination-passing style*. For every tensor result, such ops have a tensor
+operand, whose buffer could be used for storing the result of the op in the absence
+of other conflicts. We call such tensor operands the *destination*.
+As an example, consider the following op: `%0 = tensor.insert %cst into
+%t[%idx] : tensor<?xf32>`
+`%t` is the destination in this example. When choosing a buffer for the result
+`%0`, One-Shot Bufferize considers only two options:
+1.  buffer(`%0`) = buffer(`%t`).
+2.  buffer(`%0`) is a newly allocated buffer.
+There may be other buffers in the same function that could potentially be used
+for buffer(`%0`), but those are not considered by One-Shot Bufferize to keep the
+bufferization simple. One-Shot Bufferize could be extended to consider such
+buffers in the future to achieve a better quality of bufferization.
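+As a minimal sketch (not output captured from the pass), the two options could
+look as follows after bufferization. `%t_m` is a hypothetical name for
+buffer(`%t`), and the exact ops that One-Shot Bufferize emits may differ.
+```mlir
+// Option 1: in-place. The store writes directly into buffer(%t).
+memref.store %cst, %t_m[%idx] : memref<?xf32>
+
+// Option 2: out-of-place. Allocate a new buffer, copy buffer(%t), then store.
+%c0 = arith.constant 0 : index
+%dim = memref.dim %t_m, %c0 : memref<?xf32>
+%alloc = memref.alloc(%dim) : memref<?xf32>
+memref.copy %t_m, %alloc : memref<?xf32> to memref<?xf32>
+memref.store %cst, %alloc[%idx] : memref<?xf32>
+```
+Option 2 costs an allocation and a copy, which is why the analysis prefers
+option 1 whenever it is safe.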
+Tensor ops that are not in destination-passing style always bufferize to a
+memory allocation. E.g.:
+%0 = tensor.generate %sz {
+^bb0(%i : index):
+  %cst = arith.constant 0.0 : f32
+  tensor.yield %cst : f32
+} : tensor<?xf32>
+The result of `tensor.generate` does not have a "destination", so bufferization
+allocates a new buffer. This could be avoided by choosing an op such as
+`linalg.generic`, which can express the same computation with a destination
+("out") tensor:
+#map = affine_map<(i) -> (i)>
+%0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]}
+                    outs(%t : tensor<?xf32>) {
+  ^bb0(%arg0 : f32):
+    %cst = arith.constant 0.0 : f32
+    linalg.yield %cst : f32
+} -> tensor<?xf32>
+At first glance, the above `linalg.generic` op may not seem very useful because
+the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an
+operand in the first place? As an example, this can be useful for overwriting a
+slice of a tensor:
+%t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32>
+%0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32>
+%1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1]
+    : tensor<?xf32> into tensor<?xf32>
+The above example bufferizes to a `memref.subview`, followed by a
+"`linalg.generic` on memrefs" that overwrites the memory of the subview. The
+`tensor.insert_slice` bufferizes to a no-op (in the absence of RaW conflicts
+such as a subsequent read of `%s`).
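+A rough sketch of what the bufferized IR for the slice example might look like,
+assuming no RaW conflicts. `%s_m` is a hypothetical name for buffer(`%s`), and
+the elided parts (`...`) are carried over from the example above.
+```mlir
+#offset = affine_map<(d0)[s0] -> (d0 + s0)>
+// View into buffer(%s); no data is copied.
+%t_m = memref.subview %s_m[%idx] [%sz] [1]
+    : memref<?xf32> to memref<?xf32, #offset>
+// "linalg.generic on memrefs" has no result; it writes into the subview.
+linalg.generic ... outs(%t_m : memref<?xf32, #offset>) { ... }
+// No op is emitted for tensor.insert_slice: the data is already in buffer(%s).
+```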
+RaW conflicts are detected with an analysis of SSA use-def chains (details
+later). One-Shot Bufferize works best if there is a single SSA use-def chain,
+where the result of a tensor op is the "destination" operand of the next tensor
+ops, e.g.:
+%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
+%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
+%2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>)
+Buffer copies are likely inserted if the SSA use-def chain splits at some point,
+e.g.:
+%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
+%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
+%2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
+One-Shot Bufferize has debug flags (`test-analysis-only print-conflicts`) that
+print the results of the analysis and explain to the user why buffer copies were
+inserted.
+## Using One-Shot Bufferize
+MLIR provides a pass
+that performs an analysis and bufferizes all ops with tensor semantics that
+implement `BufferizableOpInterface`. For modularity reasons, these op interface
+implementations are typically external models that live in a dialect's
+"Transforms" build unit. (External models are a mechanism for implementing an op
+interface in a different build unit.) It is the user's responsibility to ensure
+that all needed external models are registered before running One-Shot
+Bufferize.
+By default, One-Shot Bufferize fails when it encounters an op with tensor
+semantics (i.e., tensor result or tensor operand) that is not bufferizable
+(i.e., does not implement `BufferizableOpInterface`). This can be avoided with
+`allow-unknown-ops`. In that case, One-Shot Bufferize inserts
+`to_memref`/`to_tensor` ops around the bufferization boundary. These ops are
+named versions of `unrealized_conversion_cast`. Note that One-Shot Bufferize's
+analysis currently cannot analyze these ops, so input IR with such ops may fail
+bufferization. Therefore, running One-Shot Bufferize multiple times in a
+sequence is also not supported at the moment.
+One-Shot Bufferize can be configured to bufferize only ops from a set of
+dialects with `dialect-filter`. This can be useful for gradually migrating from
+dialect conversion-based bufferization to One-Shot Bufferize. One-Shot Bufferize
+must run first in such a case, because dialect conversion-based bufferization
+generates `to_tensor`/`to_memref` ops which One-Shot Bufferize cannot analyze.
+One-Shot Bufferize can also be called programmatically. One programmatic entry
+point skips the analysis and inserts a copy on every buffer write, just like the
+dialect conversion-based bufferization.
+## Buffer Deallocation
+One-Shot Bufferize deallocates all buffers that it allocates. This is in
+contrast to the dialect conversion-based bufferization that delegates this job
+to the `-buffer-deallocation` pass. One-Shot Bufferize cannot handle IR where a
+newly allocated buffer is returned from a block. Such IR will fail bufferization.
+A new buffer allocation is returned from a block when the result of an op that
+is not in destination-passing style is returned. E.g.:
+%0 = scf.if %c -> (tensor<?xf32>) {
+  %1 = tensor.generate ... -> tensor<?xf32>
+  scf.yield %1 : tensor<?xf32>
+} else {
+  scf.yield %another_tensor : tensor<?xf32>
+}
+The `scf.yield` in the "else" branch is OK, but the `scf.yield` in the "then"
+branch will be rejected.
+Another case in which a buffer allocation may be returned is when a buffer copy
+must be inserted due to a RaW conflict. E.g.:
+%0 = scf.if %c -> (tensor<?xf32>) {
+  %1 = tensor.insert %cst into %another_tensor[%idx] : tensor<?xf32>
+  "my_dialect.reading_tensor_op"(%another_tensor) : (tensor<?xf32>) -> ()
+  ...
+  scf.yield %1 : tensor<?xf32>
+} else {
+  scf.yield %yet_another_tensor : tensor<?xf32>
+}
+In the above example, a buffer copy of buffer(`%another_tensor`) (with `%cst`
+inserted) is yielded from the "then" branch.
+In both examples, a buffer is allocated inside of a block and then yielded from
+the block. This is not supported in One-Shot Bufferize. Alternatively, One-Shot
+Bufferize can be configured to leak all memory and not generate any buffer
+deallocations with `create-deallocs=0 allowReturnMemref`. The buffers can then
+be deallocated by running `-buffer-deallocation` after One-Shot Bufferize.
+## Memory Layouts
+One-Shot Bufferize bufferizes ops from top to bottom. This works well when all
+ops are bufferizable. However, when encountering a non-bufferizable op with
+`allow-unknown-ops`, One-Shot Bufferize must insert `to_memref` ops at the
+bufferization boundary and decide on a memref type. By default, One-Shot
+Bufferize chooses the most dynamic memref type wrt. layout maps. E.g.:
+%0 = "my_dialect.unbufferizable_op"(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
+%1 = tensor.extract %0[%idx1, %idx2] : tensor<?x?xf32>
+When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` op with
+dynamic offset and strides:
+#map = affine_map<(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)>
+%0 = "my_dialect.unbufferizable_op"(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
+%0_m = bufferization.to_memref %0 : memref<?x?xf32, #map>
+%1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, #map>
+All users of `%0` have fully dynamic layout maps. This ensures that the
+bufferized IR composes well with future bufferizations of `unbufferizable_op`
+(maybe bufferized by another pass), regardless of the exact memref type of the
+future bufferization. If the op turns out to be bufferized to an op with a
+simpler memref type (e.g., identity layout map), we expect that canonicalization
+patterns would clean up unnecessarily dynamic layout maps. (Some of these
+canonicalization patterns may not be implemented yet.)
+Note that One-Shot Bufferize always generates the most specific memref type when
+the entire IR is bufferizable. In that case, we do not have to rely on
+canonicalization patterns to clean up the bufferized IR.
+One-Shot Bufferize can be configured to always generate memref types with
+identity layout when the exact target memref type is not known via
+`fully-dynamic-layout-maps=0`. This can be useful for legacy code that cannot
+handle memref types with layout maps. Note that this leads to additional buffer
+copies when folding a `to_tensor`/`to_memref` pair with memref types that are
+not cast-compatible.
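+For comparison, a sketch of how the `unbufferizable_op` example above would
+presumably bufferize with `fully-dynamic-layout-maps=0`. This is an illustration
+based on the description, not captured pass output.
+```mlir
+%0 = "my_dialect.unbufferizable_op"(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
+// No layout map: the memref type has an identity layout.
+%0_m = bufferization.to_memref %0 : memref<?x?xf32>
+%1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32>
+```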
+## Extending One-Shot Bufferize
+Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users
+must at least implement the following interface methods.
+*   `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor
+    OpOperand is read.
+*   `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor
+    OpOperand is written (if bufferizing in-place).
+*   `getAliasingOpResult`: Return the OpResults that may share the same buffer
+    as the given OpOperand. This interface method describes the
+    OpOperand-to-OpResult mapping wrt. destination-passing style.
+*   `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult
+    is the exact same memref as the aliasing OpOperand after bufferization (in
+    case of in-place bufferization). Otherwise, (e.g., they overlap but are not
+    necessarily the exact same memrefs), `BufferRelation::None` should be
+    returned. Additional buffer relations will be added in the future, but
+    `BufferRelation::None` is always safe.
+*   `bufferize`: Rewrite the op with the given rewriter. Ops should be replaced
+    with `bufferization::replaceOpWithBufferizedValues`.
+To get a better intuition of the interface methods, we invite users to take a
+look at existing implementations in MLIR, e.g., the implementation of
+`tensor.insert` or `tensor.extract`.
+## Debugging Buffer Copies
+To get a better understanding of why One-Shot Bufferize introduced a buffer
+copy, users can run the pass with `test-analysis-only print-conflicts`. Every
+tensor op is then annotated with an attribute that has a boolean value for each
+tensor OpOperand. `true` means that the OpOperand bufferizes in-place. `false`
+means that the OpOperand bufferizes out-of-place and a buffer copy will be
+inserted.
+There are two reasons why a buffer copy may be inserted.
+1.  Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the
+    overwritten data is still needed.
+2.  The buffer is not writable. E.g., `memref.global` buffers that are the
+    result of `arith.constant` ops are never modified.
+In the first case, `print-conflicts` illustrates the conflict in the form of a
+("read", "conflicting write", "last write") tuple.
+## Understanding the SSA Use-Def Chain Analysis
+To get a better understanding of the SSA Use-Def Chain Analysis and the RaW
+conflict detection algorithm, we invite interested users to read the
+[design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf)
+and watch the corresponding [ODM talk](https://youtu.be/TXEo59CYS9A).
+One-Shot Bufferize can be used to bufferize a program in a single pass, as long
+as each op implements `BufferizableOpInterface`.
+## Migrating from Dialect Conversion-based Bufferization
+Both dialect conversion-based bufferization and One-Shot Bufferize generate
+`to_tensor`/`to_memref` ops at the bufferization boundary (when run with
+`allow-unknown-ops`). They can be combined and run in sequence. However,
+One-Shot Bufferize must run first because it cannot analyze those boundary ops.
+To update existing code step-by-step, it may be useful to specify a dialect
+filter for One-Shot Bufferize, so that dialects can be switched over one-by-one.
+## Bufferizing Function Graphs
+One-Shot Bufferize does not currently support function graph bufferization.
+I.e., `CallOp`, `ReturnOp` and function bbArgs are not bufferizable. Users can
+run the existing `--func-bufferize` bufferization pass after One-Shot Bufferize.
+Alternatively, users can try an extension of One-Shot Bufferize that also
+bufferizes function graphs. It is still under development and does not support
+arbitrary IR. In essence, returning a tensor from a function is not supported,
+unless it is equivalent to a function bbArg. In that case, the corresponding
+return value can simply be dropped during bufferization.
+## Dialect Conversion-based Bufferization
+Disclaimer: Most dialect conversion-based bufferization has been migrated to
+One-Shot Bufferize. New users should use One-Shot Bufferize (with or without
+analysis). The following documentation is only for existing users of dialect
+conversion-based bufferization.
+This system is a simple application of MLIR's dialect conversion infrastructure.
+The bulk of the code related to bufferization is a set of ordinary
+`ConversionPattern`'s that dialect authors write for converting ops that operate
+on `tensor`'s to ops that operate on `memref`'s. A set of conventions and best
+practices are followed that allow these patterns to be run across multiple
+independent passes (rather than requiring a single huge atomic conversion pass),
+which makes the compilation pipelines scalable, robust, and easy to debug.
 This document is targeted at people looking to utilize MLIR's bufferization
 functionality, along with people who want to extend it to cover their own ops.
@@ -27,7 +416,7 @@ That talk gives a high-level overview of the bufferization infrastructure and
 important conceptual details related to using the MLIR dialect conversion
-## Bufferization's place in a compilation pipeline
+### Bufferization's place in a compilation pipeline
 Bufferization itself does not free any of the buffers that have been allocated,
 nor does it do anything particularly intelligent with the placement of buffers
@@ -45,7 +434,7 @@ After buffer deallocation has been completed, the program will be quite
 difficult to transform due to the presence of the deallocation ops. Thus, other
 optimizations such as linalg fusion on memrefs should be done before that stage.
-## General structure of the bufferization process
+### General structure of the bufferization process
 Bufferization consists of running multiple *partial* bufferization passes,
 followed by one *finalizing* bufferization pass.
@@ -114,7 +503,7 @@ bufferization pass. This gives excellent diagnostics when something goes wrong
 with the bufferization process, such as due to an op that wasn't handled by any
-## How to write a partial bufferization pass
+### How to write a partial bufferization pass
 The contract of a partial bufferization pass is that a subset of ops (or kinds
 of ops, customizable by a ConversionTarget) get bufferized.
@@ -127,7 +516,9 @@ To describe how to write such a pass, we will walk through an example, the
 `tensor-bufferize` pass
-that bufferizes the `tensor` dialect.
+that bufferizes the `tensor` dialect. Note that these passes have been replaced
+with a `BufferizableOpInterface`-based implementation in the meantime, so we
+have to take a look at an older version of the code.
 The bulk of the code in the pass will be a set of conversion patterns, with a
 simple example being
@@ -198,17 +589,6 @@ which helps with this in general.
 ### Other partial bufferization examples
--   `linalg-bufferize`
-    ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Linalg/Transforms/Bufferize.cpp#L1),
-    [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/Linalg/bufferize.mlir#L1))
-    -   Bufferizes the `linalg` dialect.
-    -   This is an example of how to simultaneously bufferize all the ops that
-        satisfy a certain OpInterface with a single pattern. Specifically,
-        `BufferizeAnyLinalgOp`
-        ([code](https://github.com/llvm/llvm-project/blob/daaaed6bb89044ac58a23f1bb1ccdd12342a5a58/mlir/lib/Dialect/Linalg/Transforms/Bufferize.cpp#L170))
-        bufferizes any ops that implements the `LinalgOp` interface.
 -   `scf-bufferize`
@@ -233,17 +613,7 @@ which helps with this in general.
     -   This is an example of a pass that is not split along dialect
--   `arith-bufferize`
-    ([code](https://github.com/llvm/llvm-project/blob/446425f89871aa7849c5615e6b695ebd10c9b34a/mlir/lib/Dialect/Arithmetic/Transforms/Bufferize.cpp),
-    [test](https://github.com/llvm/llvm-project/blob/d1aed486efc6d35a81ca4acbabb4203c4b91cda9/mlir/test/Dialect/Arithmetic/bufferize.mlir))
-    -   Bufferizes only `arith` ops of `tensor` type.
-    -   This is an example of setting up the legality so that only a subset of
-        `arith.constant` ops get bufferized.
-    -   This is an example of a pass that is not split along dialect
-        subdivisions.
-## How to write a finalizing bufferization pass
+### How to write a finalizing bufferization pass
 The contract of a finalizing bufferization pass is that all tensors are gone
 from the program.
@@ -272,8 +642,11 @@ new code. A helper, `populateEliminateBufferizeMaterializationsPatterns`
 is available for such passes to provide patterns that eliminate
 `bufferization.to_tensor` and `bufferization.to_memref`.
-## Changes since [the talk](#the-talk)
+### Changes since [the talk](#the-talk)
 -   `func-bufferize` was changed to be a partial conversion pass, and there is a
     new `finalizing-bufferize` which serves as a general finalizing
     bufferization pass.
+-   Most partial bufferization passes have been reimplemented in terms of
+    `BufferizableOpInterface`. New users should use One-Shot Bufferize instead
+    of dialect conversion-based bufferization.

