[Mlir-commits] [mlir] 1842fd5 - [mlir] Fix multiple titles
Jacques Pienaar
llvmlistbot at llvm.org
Mon Feb 17 13:55:56 PST 2020
Author: Jacques Pienaar
Date: 2020-02-17T13:55:46-08:00
New Revision: 1842fd50d2ff66209cee11268ddf4bee28c482f7
URL: https://github.com/llvm/llvm-project/commit/1842fd50d2ff66209cee11268ddf4bee28c482f7
DIFF: https://github.com/llvm/llvm-project/commit/1842fd50d2ff66209cee11268ddf4bee28c482f7.diff
LOG: [mlir] Fix multiple titles
Every doc should have exactly one title, which corresponds to `#`; in
some docs there are multiple, and those extra titles end up as h1
headers (visual elements rather than organizational ones). Indent every
heading nesting level by one in all of the docs with multiple titles.
Also fixing trailing whitespace.
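For illustration, a minimal before/after sketch of the heading demotion
this change applies, using headings from the Linalg.md hunks below (the
single document title at `#` is unchanged and therefore not shown in
the diff):

```
Before: multiple `#` headings compete with the document title
# Rationale
## Payload-Carrying Ops
### Property 1: Input and Output Operands Define The Iteration Space

After: every nesting level is indented by one
## Rationale
### Payload-Carrying Ops
#### Property 1: Input and Output Operands Define The Iteration Space
```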
Added:
Modified:
mlir/docs/Dialects/Linalg.md
mlir/docs/Dialects/Vector.md
mlir/docs/RationaleLinalgDialect.md
mlir/docs/Tutorials/Toy/Ch-3.md
mlir/docs/Tutorials/Toy/Ch-6.md
Removed:
################################################################################
diff --git a/mlir/docs/Dialects/Linalg.md b/mlir/docs/Dialects/Linalg.md
index ddc0cd7f2dbd..6255f23ae4b8 100644
--- a/mlir/docs/Dialects/Linalg.md
+++ b/mlir/docs/Dialects/Linalg.md
@@ -2,18 +2,18 @@
[TOC]
-# Rationale
+## Rationale
<img width="90" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">
Linalg is designed to solve the High-level Hierarchical Optimization
(HHO box) in MLIR and to interoperate nicely within a
-*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
+*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
The [Rationale Document](https://mlir.llvm.org/docs/RationaleLinalgDialect)
goes into significantly more design and architectural decision details.
-# Set of Key Transformations<a name="key_transformations"></a>
+## Set of Key Transformations<a name="key_transformations"></a>
The following key transformations have been central to driving the design of
Linalg. They are all implemented in terms of the properties of the
@@ -33,7 +33,7 @@ performed on the Linalg IR and that have influenced its design:
1. Lower to Library Calls or Special Instructions, Intrinsics or ISA.
1. Partially Lower to Iterations Over a Finer-Grained Linalg Op.
-# High-Level Description of Linalg Ops<a name="linalg_ops"></a>
+## High-Level Description of Linalg Ops<a name="linalg_ops"></a>
Linalg takes at least some inspiration from all previously [listed prior
art](#prior_art). The design enables the definition of ***CustomOps*** with
generic properties that enable [key transformations](#key_transformations),
@@ -42,7 +42,7 @@ library calls and intrinsics.
These ops can have ***either tensor or buffer operands***.
-## Payload-Carrying Ops<a name="payload_ops"></a>
+### Payload-Carrying Ops<a name="payload_ops"></a>
Linalg defines two payload carrying operations that implement the [structured ops](
https://docs.google.com/presentation/d/1P-j1GrH6Q5gLBjao0afQ-GfvcAeF-QU4GXXeSy0eJ9I/edit#slide=id.p
) abstraction on tensors and buffers. This is architected as two generic operations
@@ -52,7 +52,7 @@ The properties of these generic ops are the result of applying the
guiding principles described in the [Rationale Document](https://mlir.llvm.org/docs/RationaleLinalgDialect).
They are listed next, with a brief example and discussion for each.
-### Property 1: Input and Output Operands Define The Iteration Space<a name="prop1"></a>
+#### Property 1: Input and Output Operands Define The Iteration Space<a name="prop1"></a>
A `linalg.generic` op fully *derives* the specification of its iteration space
from its operands.
The property enforces that a localized IR element (the op) *has* all the information
@@ -63,7 +63,7 @@ to [URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf).
Consider the following, partially specified, `linalg.generic` example:
```
#attrs = {args_in: 1, args_out: 1}
-func @example(%A: memref<?xf32, layout1>,
+func @example(%A: memref<?xf32, layout1>,
%B: memref<?xvector<4xf32, layout2>>) {
linalg.generic #attrs (%2, %3): memref<?xf32, layout1>,
memref<?xvector<4xf32, layout2>>
@@ -74,7 +74,7 @@ func @example(%A: memref<?xf32, layout1>,
The property "*Input and Output Operands Define The Iteration Space*" is
materialized by a lowering into a form that will resemble:
```
-func @example(%A: memref<?xf32, layout1>,
+func @example(%A: memref<?xf32, layout1>,
%B: memref<?xvector<4xf32, layout2>>) {
%M = "dim" %A, 0: index
%N = "dim" %B, 0: index
@@ -119,18 +119,18 @@ and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf),
as well as [TACO](http://tensor-compiler.org/), has shown.
-### Property 2: Reversible Mappings Between Control and Data Structures<a name="prop2"></a>
+#### Property 2: Reversible Mappings Between Control and Data Structures<a name="prop2"></a>
A `linalg.generic` *defines* the mapping between the iteration space (i.e. the
-loops) and the data.
+loops) and the data.
Consider the following, partially specified, `linalg.generic` example:
```
-#indexing_maps = {
- (i, j) -> (j, i),
- (i, j) -> (j)
+#indexing_maps = {
+ (i, j) -> (j, i),
+ (i, j) -> (j)
}
#attrs = {args_in: 1, args_out: 1, indexings: indexing_maps}
-func @example(%A: memref<?xf32, layout1>,
+func @example(%A: memref<?xf32, layout1>,
%B: memref<?xvector<4xf32, layout2>>) {
linalg.generic #attrs (%A, %B): memref<?xf32, layout1>,
memref<?xvector<4xf32, layout2>>
@@ -142,13 +142,13 @@ The property "*Reversible Mappings Between Control and Data Structures*" is
materialized by a lowering into a form that will resemble:
```
#attrs = {args_in: 1, args_out: 1, indexings: indexing_maps}
-func @example(%A: memref<?xf32, layout1>,
+func @example(%A: memref<?xf32, layout1>,
%B: memref<?xvector<4xf32, layout2>>) {
// loop bounds determined from data sizes by “inverting the map”
%J = "dim" %2, 0: index
%I = "dim" %2, 1: index
%J2 = "dim" %3, 0: index
- // iteration space is consistent with data + mapping inference
+ // iteration space is consistent with data + mapping inference
%eq = "eq" %J, %J2: i1
"assert" %eq: (i1) -> ()
for %i = 0 to %I { // loop order is fully defined by indexing maps
@@ -170,7 +170,7 @@ write?
- Given a subset of data read or written, what subset of the iteration space
is responsible for this read or write?
-Answering these `2` questions is one of the main analyses that Linalg uses to
+Answering these `2` questions is one of the main analyses that Linalg uses to
implement transformations such as tiling, tiled producer-consumer fusion, and
promotion to temporary buffers in fast memory.
@@ -179,7 +179,7 @@ This is a pragmatic short-term solution, but in the longer term note that
this property could be even evaluated dynamically, similarly to
inspector-executor algorithms.
-### Property 3: The Type Of Iterators is Defined Explicitly<a name="prop3"></a>
+#### Property 3: The Type Of Iterators is Defined Explicitly<a name="prop3"></a>
A `linalg.generic` op fully *declares* the type of its iterators. This
information is used in transformations.
@@ -192,21 +192,21 @@ preserved***.
This can be better captured directly at the loop level thanks to specific
iterator types, among which:
-*parallel*, *reduction*, *partition*, *permutable/monotonic*, *sequential*,
+*parallel*, *reduction*, *partition*, *permutable/monotonic*, *sequential*,
*dependence distance*, ...
These types are traditionally the result of complex dependence analyses and
have been referred to as "*bands*" in the polyhedral community (e.g. *parallel
bands*, *permutable bands*, etc, in
[ISL](https://en.wikipedia.org/wiki/Integer_set_library) schedule tree
-parlance).
+parlance).
Specifying the information declaratively in a `linalg.generic` allows
conveying properties that may be hard (or even impossible) to derive from
lower-level information. These properties can be brought all the way to the
moment when they are useful for transformations, used and then discarded.
-Additionally, these properties may also be viewed as a contract that the
+Additionally, these properties may also be viewed as a contract that the
frontend/user guarantees and that the compiler may take advantage of. The
common example is the use of data-dependent reduction semantics for
specifying histogram computations. If the frontend has additional knowledge
@@ -216,8 +216,8 @@ parallel semantics and use the special atomic in the computation region.
At this time, Linalg only has an explicit use for *parallel* and *reduction*
loops but previous experience shows that the abstraction generalizes.
-### Property 4: The Compute Payload is Specified With a Region<a name="prop4"></a>
-A `linalg.generic` op has a compute payload that is fully generic thanks to
+#### Property 4: The Compute Payload is Specified With a Region<a name="prop4"></a>
+A `linalg.generic` op has a compute payload that is fully generic thanks to
the use of
[Regions](https://github.com/llvm/llvm-project/blob/58265ad42a90ae8905be6a447cb42e53529a54a0/mlir/docs/LangRef.md#regions).
@@ -230,16 +230,16 @@ At this time there are no additional restrictions to the region
semantics. This is meant to allow the exploration of various design tradeoffs
at the intersection of regions and iterator types.
In particular, the frontend is responsible for the semantics of iterator types
-to correspond to the operations inside the region: the region can capture
+to correspond to the operations inside the region: the region can capture
buffers arbitrarily and write into them. If this conflicts with some parallel
iterator requirement, this is undefined behavior.
Concretely, consider the following, partially specified, `linalg.generic`
example:
```
-#indexing_maps = {
- (i, j) -> (i, j),
- (i, j) -> (i, j)
+#indexing_maps = {
+ (i, j) -> (i, j),
+ (i, j) -> (i, j)
}
#attrs = {args_in: 1, args_out: 1, indexings: #indexing_maps}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
@@ -276,24 +276,24 @@ proposal](https://llvm.discourse.group/t/introduce-std-inlined-call-op-proposal/
We expect to be able to reuse the common lower-level infrastructure provided
it evolves to support both region arguments and captures.
-### Property 5: May Map To an External Library Call<a name="prop5"></a>
+#### Property 5: May Map To an External Library Call<a name="prop5"></a>
A `linalg.generic` op may map to an external library call by specifying a
-`SymbolAttr`. At this level of abstraction, the important glue is the ability
+`SymbolAttr`. At this level of abstraction, the important glue is the ability
to perform transformations that preserve the structure necessary to ***call
the external library after different transformations have been applied***.
This involves considerations related to preservation of op semantics
and integration at the ABI level. Regardless of whether one wants to use
-external library calls or a custom ISA, the problem for codegen is similar:
+external library calls or a custom ISA, the problem for codegen is similar:
preservation of a fixed granularity.
Consider the following, partially specified, `linalg.generic`
example:
```
#fun_attr = "pointwise_add"
-#indexing_maps = {
- (i, j) -> (i, j),
- (i, j) -> (i, j)
+#indexing_maps = {
+ (i, j) -> (i, j),
+ (i, j) -> (i, j)
}
#attrs = {args_in: 1, args_out: 1, indexings: #indexing_maps, fun: #fun_attr}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
@@ -313,7 +313,7 @@ materialized by a lowering into a form that will resemble:
func @pointwise_add_sxsxf32_sxsxf32(memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
- call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
+ call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
(memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
return
}
@@ -321,20 +321,20 @@ func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
Which, after lowering to LLVM resembles:
```
-func @pointwise_add_sxsxf32_sxsxf32(!llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
- !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
+func @pointwise_add_sxsxf32_sxsxf32(!llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
+ !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
!llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">) -> ()
-func @example(%A: !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
- %B: !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
+func @example(%A: !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
+ %B: !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
%C: !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">) {
- llvm.call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
+ llvm.call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
(!llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">...) -> ()
return
}
```
-#### Convention For External Library Interoperability
+##### Convention For External Library Interoperability
The `linalg` dialect adopts a convention that is similar to `BLAS` when
offloading operations to fast library implementations: pass a non-owning
pointer to input and output data with additional metadata. This convention
@@ -349,7 +349,7 @@ There is an [ongoing
discussion](https://llvm.discourse.group/t/lowering-optional-attributes-in-linalg-structuredops-to-standard-dialect/333/3)
on the topic of extending interoperability in the presence of key attributes.
-### Property 6: Perfectly Nested Writes To The Whole Output Operands<a name="prop6"></a>
+#### Property 6: Perfectly Nested Writes To The Whole Output Operands<a name="prop6"></a>
Perfectly nested loops form a particularly important class of structure that
enables key loop transformations such as tiling and mapping to library calls.
Unfortunately, this type of structure is easily broken by transformations such
@@ -363,12 +363,12 @@ entire memory region. This is a structural constraint across regions and
loops that has proven to be key in simplifying transformations.
One particular point to mention is that converting imperfectly nested code
-into perfectly nested code can often be done with enough loop distribution
+into perfectly nested code can often be done with enough loop distribution
and embedding of conditionals down to the innermost loop level.
Previous experience with Tensor Comprehensions gave us the intuition that
forcing innermost control-flow nesting is a lot like writing data-parallel
-code with arrays of boolean values and predication.
+code with arrays of boolean values and predication.
This type of trick has also been used before in polyhedral compilers to
convert non-affine control into affine compute dependencies.
@@ -376,7 +376,7 @@ While it may be possible to automate such rewrites from generic IR,
`linalg.generic` just forces the semantics for now.
The key implication is that this conversion to deep predication needs to be
-undone once we are done with Linalg transformations.
+undone once we are done with Linalg transformations.
After iterators and induction variables are materialized (i.e. after lowering
out of `linalg.generic` occurred), the overall performance will be greatly
influenced by the quality of canonicalizations, foldings and *Loop Independent
@@ -384,10 +384,10 @@ Code Motion* (LICM).
In the grander scheme, the reliance on late LICM was deemed a necessary risk.
-### Putting it Together<a name="summary"></a>
+#### Putting it Together<a name="summary"></a>
As it stands, the six properties above define the semantics of a
`linalg.generic` op. It is an open question whether all of these semantics are
-strictly necessary in practice and whether some should or could be derived
+strictly necessary in practice and whether some should or could be derived
automatically while still maintaining the [core guiding
principles](#guiding_principles).
@@ -396,7 +396,7 @@ because of empirical evidence building and working on multiple high-level
compilers. As we lay those down and engage more with the community, we expect
multiple rounds of discussions and design changes to the original architecture.
-## Data Representation: Views<a name="views"></a>
+### Data Representation: Views<a name="views"></a>
The current implementation uses the [Strided MemRef (a.k.a View)](
https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/MaL8m2nXuio)
abstraction. The name *View* is used interchangeably in `linalg` to signify
@@ -408,7 +408,7 @@ experience from existing LIFT abstractions for
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).
-## Metadata Ops<a name="metadata_ops"></a>
+### Metadata Ops<a name="metadata_ops"></a>
A set of ops that manipulate metadata but do not move memory. These ops take
`view` operands + extra attributes and return new `view`s. The returned
`view`s generally alias the operand `view`. At the moment the existing ops
@@ -435,7 +435,7 @@ In a longer-term future, the abstractions from [Legion data-centric
programming model](https://legion.stanford.edu/overview/) seem generally
appealing.
-## Named Payload-Carrying Ops<a name="named_ops"></a>
+### Named Payload-Carrying Ops<a name="named_ops"></a>
Additionally, `linalg` provides a small subset of commonly named operations:
* `linalg.copy`,
@@ -446,12 +446,12 @@ Additionally, `linalg` provides a small subset of commonly named operations:
These named operations adhere to the `linalg.generic` op interface. Work is in
progress to define declarative mechanisms to automatically generate named ops
-from a description in terms of only the generic op interface.
+from a description in terms of only the generic op interface.
This is the main reason there are only a small number of ops today: we expect
them to be auto-generated from Tablegen soon.
-# Open Issues and Design Alternatives<a name="open_issues"></a>
+## Open Issues and Design Alternatives<a name="open_issues"></a>
Multiple open issues and design alternatives are in flight and it is time to
lay them out for the community to discuss and pick apart:
1. Should `linalg.generic` support nesting?
@@ -466,6 +466,6 @@ extended, if at all?
...
These key questions (and much more) should be really thought of in the general
-context of MLIR in which different levels of IR interoperate seamlessly. In
-practice, it is not necessary (or beneficial) to try and solve all problems in the
+context of MLIR in which different levels of IR interoperate seamlessly. In
+practice, it is not necessary (or beneficial) to try and solve all problems in the
same IR.
diff --git a/mlir/docs/Dialects/Vector.md b/mlir/docs/Dialects/Vector.md
index 0ce393a435d0..1c00269c615a 100644
--- a/mlir/docs/Dialects/Vector.md
+++ b/mlir/docs/Dialects/Vector.md
@@ -11,7 +11,7 @@ targets. These abstractions serve to separate concerns between operations on
new proposal but rather a textual documentation of existing MLIR components
along with a rationale.
-# Positioning in the Codegen Infrastructure
+## Positioning in the Codegen Infrastructure
The following diagram, recently presented with the [StructuredOps
abstractions](https://drive.google.com/corp/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
captures the current codegen paths implemented in MLIR in the various existing
@@ -34,7 +34,7 @@ that the higher level of `vector`-level primitives we build and target from
codegen (or some user/language level), the simpler our task will be, the more
complex patterns can be expressed and the better performance will be.
-# Components of a Generic Retargetable Vector-Level Dialect
+## Components of a Generic Retargetable Vector-Level Dialect
The existing MLIR `vector`-level dialects are related to the following
bottom-up abstractions:
@@ -65,9 +65,9 @@ lowering patterns that are specified manually for now.
manually at the moment and that should be automated, following the `LLVM ->
Hardware Vector` ops generation as closely as possible.
-# Short Description of the Existing Infrastructure
+## Short Description of the Existing Infrastructure
-## LLVM level
+### LLVM level
On CPU, the `n-D` `vector` type currently lowers to
`!llvm<array<vector>>`. More concretely, `vector<4x8x128xf32>` lowers to
`!llvm<[4 x [ 8 x [ 128 x float ]]]>`.
@@ -76,7 +76,7 @@ one uses `llvm.extractelement`, `llvm.insertelement` and
`llvm.shufflevector`. A [deeper dive section](#DeeperDive) discusses the
current lowering choices and tradeoffs.
-## Hardware Vector Ops
+### Hardware Vector Ops
Hardware Vector Ops are implemented as one dialect per target.
For internal hardware, we are auto-generating the specific HW dialects.
For `GPU`, the `NVVM` dialect adds operations such as `mma.sync`, `shfl` and
@@ -90,7 +90,7 @@ Specialized `CPU` dialects that would capture specific features not well
captured by LLVM peephole optimizations of on different types that core MLIR
supports (e.g. Scalable Vectors) are welcome future extensions.
-## Virtual Vector Ops
+### Virtual Vector Ops
Some existing Standard and VectorOps Dialect on `n-D` `vector` types comprise:
```
%2 = std.addf %0, %1 : vector<3x7x8xf32> // -> vector<3x7x8xf32>
@@ -101,14 +101,14 @@ Some existing Standard and VectorOps Dialect on `n-D` `vector` types comprise:
%1 = vector.extract %0[1, 5]: vector<3x7x8xf32> // -> vector<8xf32>
%2 = vector.outerproduct %0, %1: vector<4xf32>, vector<8xf32> // -> vector<4x8xf32>
%3 = vector.outerproduct %0, %1, %2: vector<4xf32>, vector<8xf32> // fma when adding %2
-%3 = vector.strided_slice %0 {offsets = [2, 2], sizes = [2, 2], strides = [1, 1]}:
+%3 = vector.strided_slice %0 {offsets = [2, 2], sizes = [2, 2], strides = [1, 1]}:
vector<4x8x16xf32> // Returns a slice of type vector<2x2x16xf32>
-%2 = vector.transfer_read %A[%0, %1]
+%2 = vector.transfer_read %A[%0, %1]
{permutation_map = (d0, d1) -> (d0)}: memref<7x?xf32>, vector<4xf32>
-vector.transfer_write %f1, %A[%i0, %i1, %i2, %i3]
- {permutation_map = (d0, d1, d2, d3) -> (d3, d1, d0)} :
+vector.transfer_write %f1, %A[%i0, %i1, %i2, %i3]
+ {permutation_map = (d0, d1, d2, d3) -> (d3, d1, d0)} :
vector<5x4x3xf32>, memref<?x?x?x?xf32>
```
@@ -120,38 +120,38 @@ building and populates the [Vector
doc](https://github.com/llvm/llvm-project/blob/master/mlir/docs/Dialects/Vector.md)). Recent
extensions are driven by concrete use cases of interest. A notable such use
case is the `vector.contract` op which applies principles of the StructuredOps
-abstraction to `vector` types.
+abstraction to `vector` types.
-## Virtual Vector Rewrite Patterns
+### Virtual Vector Rewrite Patterns
The following rewrite patterns exist at the `VV->VV` level:
1. The now retired `MaterializeVector` pass used to legalize ops on a
coarse-grained virtual `vector` to a finer-grained virtual `vector` by
unrolling. This has been rewritten as a retargetable unroll-and-jam pattern on
-`vector` ops and `vector` types.
+`vector` ops and `vector` types.
2. The lowering of `vector_transfer` ops legalizes `vector` load/store ops to
permuted loops over scalar load/stores. This should evolve to loops over
`vector` load/stores + `mask` operations as they become available `vector` ops
-at the `VV` level.
+at the `VV` level.
The general direction is to add more Virtual Vector level ops and implement
more useful `VV -> VV` rewrites as composable patterns that the PatternRewrite
-infrastructure can apply iteratively.
+infrastructure can apply iteratively.
-## Virtual Vector to Hardware Vector Lowering
+### Virtual Vector to Hardware Vector Lowering
For now, `VV -> HWV` are specified in C++ (see for instance the
[SplatOpLowering for n-D
vectors](https://github.com/tensorflow/mlir/commit/0a0c4867c6a6fcb0a2f17ef26a791c1d551fe33d)
or the [VectorOuterProductOp
-lowering](https://github.com/tensorflow/mlir/commit/957b1ca9680b4aacabb3a480fbc4ebd2506334b8)).
+lowering](https://github.com/tensorflow/mlir/commit/957b1ca9680b4aacabb3a480fbc4ebd2506334b8)).
Simple [conversion
tests](https://github.com/llvm/llvm-project/blob/master/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir)
-are available for the `LLVM` target starting from the Virtual Vector Level.
+are available for the `LLVM` target starting from the Virtual Vector Level.
-# Rationale
-## Hardware as `vector` Machines of Minimum Granularity
+## Rationale
+### Hardware as `vector` Machines of Minimum Granularity
Higher-dimensional `vector`s are ubiquitous in modern HPC hardware. One way to
think about Generic Retargetable `vector`-Level Dialect is that it operates on
@@ -163,25 +163,25 @@ Some notable `vector` sizes of interest include:
1. CPU: `vector<HW_vector_size * k>`, `vector<core_count * k’ x
HW_vector_size * k>` and `vector<socket_count x core_count * k’ x
-HW_vector_size * k>`
+HW_vector_size * k>`
2. GPU: `vector<warp_size * k>`, `vector<warp_size * k x float4>` and
-`vector<warp_size * k x 4 x 4 x 4>` for tensor_core sizes,
-3. Other accelerators: n-D `vector` as first-class citizens in the HW.
+`vector<warp_size * k x 4 x 4 x 4>` for tensor_core sizes,
+3. Other accelerators: n-D `vector` as first-class citizens in the HW.
Depending on the target, ops on sizes that are not multiples of the HW
`vector` size may either produce slow code (e.g. by going through `LLVM`
legalization) or may not legalize at all (e.g. some unsupported accelerator X
-combination of ops and types).
+combination of ops and types).
-## Transformations Problems Avoided
+### Transformations Problems Avoided
A `vector<16x32x64xf32>` virtual `vector` is a coarse-grained type that can be
“unrolled” to HW-specific sizes. The multi-dimensional unrolling factors are
carried in the IR by the `vector` type. After unrolling, traditional
-instruction-level scheduling can be run.
+instruction-level scheduling can be run.
The following key transformations (along with the supporting analyses and
structural constraints) are completely avoided by operating on a ``vector``
-`ssa-value` abstraction:
+`ssa-value` abstraction:
1. Loop unroll and unroll-and-jam.
2. Loop and load-store restructuring for register reuse.
@@ -191,9 +191,9 @@ structural constraints) are completely avoided by operating on a ``vector``
Note that “unrolling” in the context of `vector`s corresponds to partial loop
unroll-and-jam and not full unrolling. As a consequence this is expected to
compose with SW pipelining where applicable and does not result in ICache blow
-up.
+up.
-## The Big Out-Of-Scope Piece: Automatic Vectorization
+### The Big Out-Of-Scope Piece: Automatic Vectorization
One important piece not discussed here is automatic vectorization
(automatically raising from scalar to n-D `vector` ops and types). The TL;DR
is that when the first "super-vectorization" prototype was implemented, MLIR
@@ -226,7 +226,7 @@ with expressing `vector`s in the IR directly and simple
pattern-rewrites. [EDSC](https://github.com/llvm/llvm-project/blob/master/mlir/docs/EDSC.md)s
provide a simple way of driving such a notional language directly in C++.
-# Bikeshed Naming Discussion
+## Bikeshed Naming Discussion
There are arguments against naming an n-D level of abstraction `vector`
because most people associate it with 1-D `vector`s. On the other hand,
`vector`s are first-class n-D values in MLIR.
@@ -236,7 +236,7 @@ hardware.
For now, we generally use the `n-D` `vector` name and are open to better
suggestions.
-# DeeperDive
+## DeeperDive
This section describes the tradeoffs involved in lowering the MLIR n-D vector
type and operations on it to LLVM-IR. Putting aside the [LLVM
@@ -257,7 +257,7 @@ MLIR operations are prefixed by the `vector.` dialect prefix
(e.g. `vector.insertelement`). Such ops operate exclusively on MLIR `n-D`
`vector` types.
-## Alternatives For Lowering an n-D Vector Type to LLVM
+### Alternatives For Lowering an n-D Vector Type to LLVM
Consider a vector of rank n with static sizes `{s_0, ... s_{n-1}}` (i.e. an
MLIR `vector<s_0x...s_{n-1}xf32>`). Lowering such an `n-D` MLIR vector type to
an LLVM descriptor can be done by either:
@@ -274,7 +274,7 @@ discuss. It is important to note that “a mix of both” immediately reduces to
vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that flattens the most
"k" minor dimensions.
-## Constraints Inherited from LLVM (see LangRef)
+### Constraints Inherited from LLVM (see LangRef)
The first constraint was already mentioned: LLVM only supports `1-D` `vector`
types natively.
Additional constraints are related to the difference in LLVM between vector
@@ -295,7 +295,7 @@ types.
The next sentence illustrates a recurrent tradeoff, also found in MLIR,
between “value types” (subject to SSA use-def chains) and “memory types”
-(subject to aliasing and side-effects):
+(subject to aliasing and side-effects):
```
“Structures in memory are accessed using ‘load’ and ‘store’ by getting a
pointer to a field with the llvm.getelementptr instruction. Structures in
@@ -306,62 +306,62 @@ instructions.”
When transposing this to MLIR, `llvm.getelementptr` works on pointers to `n-D`
vectors in memory. For `n-D`, vectors values that live in registers we can use
`vector.extract` and `vector.insert` which do not accept dynamic indices. Note
-that this is consistent with hardware considerations as discussed below.
+that this is consistent with hardware considerations as discussed below.
An alternative is to use an LLVM `1-D` `vector` type for which one can use
`llvm.extractelement`, `llvm.insertelement` and `llvm.shufflevector`. These
operations accept dynamic indices. The implication is that one has to use a
-flattened lowering of an MLIR n-D vector to an LLVM 1-D vector.
+flattened lowering of an MLIR n-D vector to an LLVM 1-D vector.
There are multiple tradeoffs involved that mix implications on the programming
model, execution on actual HW and what is visible or hidden from codegen. They
-are discussed in the following sections.
+are discussed in the following sections.
-## Nested Aggregate
+### Nested Aggregate
Pros:
-1. Natural encoding n-D vector -> (n-1)-D aggregate over 1-D vector.
-2. No need for linearization / delinearization logic inserted everywhere.
-3. `llvm.insertvalue`, `llvm.extractvalue` of `(n-k)-D` aggregate is natural.
+1. Natural encoding n-D vector -> (n-1)-D aggregate over 1-D vector.
+2. No need for linearization / delinearization logic inserted everywhere.
+3. `llvm.insertvalue`, `llvm.extractvalue` of `(n-k)-D` aggregate is natural.
4. `llvm.insertelement`, `llvm.extractelement`, `llvm.shufflevector` over
-`1-D` vector type is natural.
+`1-D` vector type is natural.
Cons:
1. `llvm.insertvalue` / `llvm.extractvalue` does not accept dynamic indices
-but only static ones.
+but only static ones.
2. Dynamic indexing on the non-most-minor dimension requires roundtrips to
-memory.
+memory.
3. Special intrinsics and native instructions in LLVM operate on `1-D`
vectors. This is not expected to be a practical limitation thanks to a
`vector.cast %0: vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that
flattens the most minor dimensions (see the bigger picture in implications on
-codegen).
+codegen).
-## Flattened 1-D Vector Type
+### Flattened 1-D Vector Type
Pros:
1. `insertelement` / `extractelement` / `shufflevector` with dynamic indexing
-is possible over the whole lowered `n-D` vector type.
+is possible over the whole lowered `n-D` vector type.
2. Supports special intrinsics and native operations.
Cons:
1. Requires linearization/delinearization logic everywhere, translations are
-complex.
+complex.
2. Hides away the real HW structure behind dynamic indexing: at the end of the
day, HW vector sizes are generally fixed and multiple vectors will be needed
-to hold a vector that is larger than the HW.
+to hold a vector that is larger than the HW.
3. Unlikely peephole optimizations will result in good code: arbitrary dynamic
accesses, especially at HW vector boundaries unlikely to result in regular
-patterns.
+patterns.
-## Discussion
-### HW Vectors and Implications on the SW and the Programming Model
+### Discussion
+#### HW Vectors and Implications on the SW and the Programming Model
As of today, the LLVM model only support `1-D` vector types. This is
unsurprising because historically, the vast majority of HW only supports `1-D`
vector registers. We note that multiple HW vendors are in the process of
-evolving to higher-dimensional physical vectors.
+evolving to higher-dimensional physical vectors.
In the following discussion, let's assume the HW vector size is `1-D and the
SW vector size is `n-D`, with `n >= 1`. The same discussion would apply with
@@ -370,7 +370,7 @@ register file. The number of such vectors is fixed.
Depending on the rank and sizes of the SW vector abstraction and the HW vector
sizes and number of registers, an `n-D` SW vector type may be materialized by
a mix of multiple `1-D` HW vector registers + memory locations at a given
-point in time.
+point in time.
The implication of the physical HW constraints on the programming model are
that one cannot index dynamically across hardware registers: a register file
@@ -381,28 +381,28 @@ programmers: when declaring a `private float a[4]`; and subsequently indexing
with a *dynamic* value results in so-called **local memory** usage
(i.e. roundtripping to memory).
-### Implication on codegen
+#### Implication on codegen
MLIR `n-D` vector types are currently represented as `(n-1)-D` arrays of `1-D`
-vectors when lowered to LLVM.
+vectors when lowered to LLVM.
This introduces the consequences on static vs dynamic indexing discussed
previously: `extractelement`, `insertelement` and `shufflevector` on `n-D`
vectors in MLIR only support static indices. Dynamic indices are only
supported on the most minor `1-D` vector but not the outer `(n-1)-D`.
-For other cases, explicit load / stores are required.
+For other cases, explicit load / stores are required.
The implications on codegen are as follows:
1. Loops around `vector` values are indirect addressing of vector values, they
-must operate on explicit load / store operations over `n-D` vector types.
+must operate on explicit load / store operations over `n-D` vector types.
2. Once an `n-D` `vector` type is loaded into an SSA value (that may or may
not live in `n` registers, with or without spilling, when eventually lowered),
it may be unrolled to smaller `k-D` `vector` types and operations that
correspond to the HW. This level of MLIR codegen is related to register
-allocation and spilling that occur much later in the LLVM pipeline.
+allocation and spilling that occur much later in the LLVM pipeline.
3. HW may support >1-D vectors with intrinsics for indirect addressing within
these vectors. These can be targeted thanks to explicit `vector_cast`
operations from MLIR `k-D` vector types and operations to LLVM `1-D` vectors +
-intrinsics.
+intrinsics.
Alternatively, we argue that directly lowering to a linearized abstraction
hides away the codegen complexities related to memory accesses by giving a
@@ -410,7 +410,7 @@ false impression of magical dynamic indexing across registers. Instead we
prefer to make those very explicit in MLIR and allow codegen to explore
tradeoffs.
Different HW will require different tradeoffs in the sizes involved in steps
-1., 2. and 3.
+1., 2. and 3.
Decisions made at the MLIR level will have implications at a much later stage
in LLVM (after register allocation). We do not envision to expose concerns
@@ -421,7 +421,7 @@ MLIR level will be able to target. Such costs at the MLIR level will be
abstract and used for ranking, not for accurate performance modeling. In the
future such costs will be learned.
-### Implication on Lowering to Accelerators
+#### Implication on Lowering to Accelerators
To target accelerators that support higher dimensional vectors natively, we
can start from either `1-D` or `n-D` vectors in MLIR and use `vector.cast` to
flatten the most minor dimensions to `1-D` `vector<Kxf32>` where `K` is an
@@ -442,17 +442,17 @@ and intra-vector shuffling that may not be worthwhile or even feasible,
i.e. infinite cost.
However `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>` when `K =
-K1 * … * Kn` should be close to a noop.
+K1 * … * Kn` should be close to a noop.
As we start building accelerator-specific abstractions, we hope to achieve
retargetable codegen: the same infra is used for CPU, GPU and accelerators
-with extra MLIR patterns and costs.
+with extra MLIR patterns and costs.
-### Implication on calling external functions that operate on vectors
+#### Implication on calling external functions that operate on vectors
It is possible (likely) that we additionally need to linearize when calling an
-external function.
+external function.
-## Relationship to LLVM matrix type proposal.
+### Relationship to LLVM matrix type proposal.
The LLVM matrix proposal was formulated 1 year ago but seemed to be somewhat
stalled until recently. In its current form, it is limited to 2-D matrix types
and operations are implemented with LLVM intrinsics.
@@ -467,7 +467,7 @@ document could become the unifying abstraction that people should target for
>1-D vectors and the LLVM matrix proposal can be viewed as a subset of this
work.
-## Conclusion
+### Conclusion
The flattened 1-D vector design in the LLVM matrix proposal is good in a
HW-specific world with special intrinsics. This is a good abstraction for
register allocation, Instruction-Level-Parallelism and
@@ -481,7 +481,7 @@ This makes “nested aggregate type of 1-D vector” an appealing abstraction fo
lowering from MLIR because:
1. it does not hide complexity related to the buffer vs value semantics and
-the memory subsystem and
+the memory subsystem and
2. it does not rely on LLVM to magically make all the things work from a too
low-level abstraction.
@@ -489,5 +489,5 @@ The use of special intrinsics in a `1-D` LLVM world is still available thanks
to an explicit `vector.cast` op.
-## Operations
+### Operations
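As a side note on the accelerator-lowering discussion in the Vector.md
hunks above, here is a minimal sketch of the flattening `vector.cast`
described there, reusing the notation from the document (the `%src` name
and the concrete shapes are illustrative only, not taken from a
verified build):

```
// Flatten the two most-minor dimensions so that a 1-D-only consumer
// (special intrinsic or HW op) can use the value; since 512 = 16 * 32,
// this cast is expected to be close to a no-op.
%flat = vector.cast %src : vector<4x8x16x32xf32> to vector<4x8x512xf32>
```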
diff --git a/mlir/docs/RationaleLinalgDialect.md b/mlir/docs/RationaleLinalgDialect.md
index 39a675d9be5f..2c07502a9fb3 100644
--- a/mlir/docs/RationaleLinalgDialect.md
+++ b/mlir/docs/RationaleLinalgDialect.md
@@ -2,33 +2,32 @@
[TOC]
-# Introduction<a name="introduction"></a>
+## Introduction<a name="introduction"></a>
-## Positioning
+### Positioning
<img width="180" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">
-This document describes the key design principles
+This document describes the key design principles
that led to the existing implementation of Linalg and aims at exposing
the tradeoffs involved when building higher-level Intermediate
Representations (IR) and Dialects to facilitate code
generation. Consider the simplified schema describing codegen in MLIR.
Linalg is designed to solve the High-level Hierarchical Optimization
(HHO box) and to interoperate nicely within a
-*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
+*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
This work is inspired by a wealth of [prior art](#prior_art) in
the field, from which it seeks to learn key lessons. This documentation
and introspection effort also comes in the context of the proposal for a
working group for discussing the [Development of high-level Tensor Compute
Primitives dialect(s) and
-transformations](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/3).
+transformations](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/3).
We hope that the lessons from prior art, the design principles outlined in
-this doc and the architecture of Linalg can help inform the community on a
+this doc and the architecture of Linalg can help inform the community on a
path to defining these High-Level Tensor Compute Primitives.
+### Inception
-## Inception
-
Linalg started as a pragmatic dialect to bootstrap code generation in MLIR, by
*defining away* complex code generation problems like precise dependence
analysis or polyhedral code generation and by introducing the ability to call
@@ -41,31 +40,31 @@ so as not to miss out on simple performance benefits. For example, if
one's favorite HPC library or ISA has a `matmul` primitive running at 95% of
the achievable peak performance, for operands stored in some memory, one should
be able to **use the primitive** when possible *and* generate code otherwise.
-
+
However, as the design of Linalg co-evolved with the design of MLIR, it became
apparent that it could extend to larger application domains than just machine
learning on dense tensors.
-
+
The design and evolution of Linalg follows a *codegen-friendly* approach where
the IR and the transformations evolve hand-in-hand.
The key idea is that op semantics *declare* and transport information that is
-traditionally obtained by compiler analyses.
+traditionally obtained by compiler analyses.
This information captures the legality and applicability of transformations and
is **not lost by lowering prematurely to loop or CFG form**. The key
transformations are designed so as to **preserve this information** as long as
necessary. For example, `linalg.matmul` remains `linalg.matmul` after tiling
and fusion.
-
+
Furthermore, Linalg decouples transformation validity from profitability
considerations and voluntarily leaves the latter aside in the first iteration
(see the [suitability for search](#suitability_for_search) guiding principle).
-
+
The first incarnation of these ideas was presented as an example at the
EuroLLVM 2019 developer's meeting as part of the
[Linalg section](https://llvm.org/devmtg/2019-04/slides/Tutorial-AminiVasilacheZinenko-MLIR.pdf)
of the first [MLIR Tutorial](https://www.youtube.com/watch?v=cyICUIZ56wQ).
-
-## Evolution
+
+### Evolution
Since the initial implementation, the design has evolved with, and partially
driven the evolution of the core MLIR infrastructure to use
[Regions](https://mlir.llvm.org/docs/LangRef/#regions),
@@ -82,14 +81,14 @@ which define structured operations on vectors, following the same rationale and
design principles as Linalg. (Vector dialect includes the higher-level
operations on multi-dimensional vectors and abstracts away the lowering to
single-dimensional vectors).
-
+
The Linalg dialect itself grew beyond linear algebra-like operations to become
more expressive, in particular by providing an abstraction of a loop nest
supporting parallelism, reductions and sliding windows around arbitrary MLIR
[regions](https://mlir.llvm.org/docs/LangRef/#regions). It also has the
potential of growing beyond *dense* linear-algebra to support richer data
types, such as sparse and ragged tensors and buffers.
-
+
Linalg design remains open to evolution and cross-pollination with other
dialects and approaches. It has been successfully used as the staging ground
for code generation-related abstractions, spinning off the generalization of
@@ -102,22 +101,22 @@ unsurprising ABI conventions;
to the *structured control flow* dialect (named `LoopOps`).
More components can be extracted, redesigned and generalized when new uses or
requirements arise.
-
+
Several [design questions](#open_issues) remain open in Linalg, which does not
claim to be a general solution to all compilation problems.
It does aim at driving thinking and implementations of domain-specific
abstractions where programmer's intent can be captured at a very high level,
directly in the IR.
-
+
Given the evolution of the scope, it becomes apparent that a better name than
"Linalg" could remove some of the confusions related to the dialect (and the
underlying approach), its goals and limitations.
-# Prior Art<a name=""></a>
+## Prior Art<a name=""></a>
Linalg draws inspiration from decades of prior art to design a modern a
pragmatic solution. The following non-exhaustive list refers to some of the
projects that influenced Linalg design:
-
+
- [ONNX](https://onnx.ai/),
- [LIFT](https://www.lift-project.org/),
- [XLA](https://www.tensorflow.org/xla/architecture),
@@ -132,17 +131,17 @@ projects that influenced Linalg design:
[Optimizing Compilers for Modern Architectures](
https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9))
- Traditional compiler CFGs with SSA forms.
-
+
Additionally, experience with the following tools proved very valuable when
thinking holistically about how all these components interplay all the way
up to the user and down to the hardware:
-
+
- the [Torch](http://torch.ch/) machine-learning framework,
- the LLVM compiler, specifically in JIT mode,
- high-performance libraries (MKL, CUBLAS, FBFFT)
- the [PeachPy](https://www.cs.utexas.edu/users/flame/BLISRetreat/BLISRetreatTalks/PeachPy.pdf) assembler
- current and potentially upcoming hardware ISAs.
-
+
The novelty of MLIR's code base and its unprecedented support for defining and
mixing abstractions, enabling one to reflect on and integrate the key elements
of the prior art success as well as avoid the common pitfalls in the area of
@@ -150,28 +149,28 @@ code generation. Thus, instead of diverging into a discussion about the
implications of adopting any of the existing solutions, Linalg had the
possibility to build on all of them and learn from their experience while
leveraging the benefit of hindsight.
-
+
The following reflections on prior art have influenced the design of Linalg.
The discussion is by no means exhaustive but should capture the key motivations
behind Linalg.
-
-## Lessons from ONNX<a name="lessonsonnx"></a>
+
+### Lessons from ONNX<a name="lessonsonnx"></a>
ONNX is a specification of operations that appear in Machine Learning
workloads. As such, it is predominantly driven by the expressiveness requirements
of ML, and less by the considerations of IR design for HPC code generation.
-
+
Similarly to ONNX, Linalg defines *"semantically charged" named ops*.
But it also considers *transformations on these ops* as a key component and
defines the IR to support the transformations, preferring transformations over
expressiveness if necessary.
-
+
Linalg hopes to additionally address the following:
- facilitate frontend-compiler co-design by taking into account compiler
transformations and lowerings in op definition;
- minimize the set of available ops by making them non-overlapping with each
other, thus simplifying the intermediate representation.
-
-## Lessons from LIFT<a name="lessonslift"></a>
+
+### Lessons from LIFT<a name="lessonslift"></a>
[LIFT](https://www.lift-project.org/) is a system to write computational
kernels based on functional abstractions. Transformations are
represented by additional nodes in the IR, whose semantics are at the
@@ -179,11 +178,11 @@ level of the algorithm (e.g. `partialReduce`).
LIFT applies and composes transformations by using [local rewrite
rules](https://www.lift-project.org/presentations/2015/ICFP-2015.pdf) that
embed these additional nodes directly in the functional abstraction.
-
+
Similarly to LIFT, Linalg uses local rewrite rules implemented with the MLIR
[Declarative Rewrite Rules](https://mlir.llvm.org/docs/DeclarativeRewrites/)
mechanisms.
-
+
Linalg builds on, and helps separate concerns in the LIFT approach as follows:
- transformations are either separated from the representation or expressed as
composable attributes that are independent of the actual computation,
@@ -191,7 +190,7 @@ Linalg builds on, and helps separate concerns in the LIFT approach as follows:
- abstractions are split into smaller components (e.g., control flow and data
structure abstractions) potentially reusable across different dialects in the
MLIR's open ecosystem.
-
+
LIFT is expected to further influence the design of Linalg as it evolve. In
particular, extending the data structure abstractions to support non-dense
tensors can use the experience of LIFT abstractions for
@@ -199,10 +198,10 @@ tensors can use the experience of LIFT abstractions for
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).
-## Lessons from XLA<a name="lessonsxla"></a>
+### Lessons from XLA<a name="lessonsxla"></a>
[XLA](https://www.tensorflow.org/xla/architecture) is one of the first
post-Theano ML compilers that was introduced as a pragmatic compilation
-solution for TensorFlow. It shines on Google's xPU
+solution for TensorFlow. It shines on Google's xPU
hardware and is an important piece of the puzzle. It is particularly good at
(1) transforming code back and forth between the scalar and the vector
worlds, (2) passing function boundaries for handling both host and device
@@ -231,7 +230,7 @@ Since the transformations themselves are not simple local rewrite patterns
independently. This monolithic design makes the system not portable: xPU passes
and GPU passes do not share much code.
-## Lessons from Halide and TVM<a name="lessonshalide"></a>
+### Lessons from Halide and TVM<a name="lessonshalide"></a>
[Halide](https://halide-lang.org/) is a DSL embedded in C++ that provides a
way of metaprogramming the HalideIR and applying transformations declaratively
to let the expert user transform and optimize the program in tailored ways.
@@ -253,42 +252,42 @@ canonicalization rules that are also very prevalent in MLIR.
Linalg hopes to additionally address the following:
- Halide scheduling is powerful and explores a large swath of possible
-transformations. But it's still too hard for newcomers to use or extend. The
+transformations. But it's still too hard for newcomers to use or extend. The
level of performance you get from Halide is very different depending on
whether one is a seasoned veteran or a newcomer. This is especially true as
the number of transformations grow.
-- Halide raises rather than lowers in two ways, going counter-current to the
+- Halide raises rather than lowers in two ways, going counter-current to the
design goals we set for high-level codegen abstractions in in MLIR. First,
-canonical Halide front-end code uses explicit indexing and math on scalar
+canonical Halide front-end code uses explicit indexing and math on scalar
values, so to target BLAS/DNN libraries one needs to add pattern matching
-which is similarly brittle as in the affine case. While Halide's performance
-is on par with the libraries on programmable targets (CPU/GPU), that
+which is similarly brittle as in the affine case. While Halide's performance
+is on par with the libraries on programmable targets (CPU/GPU), that
approach doesn't work on mobile accelerators or on xPUs, where the framework
-ingests whole-tensor operations.
-Second, reductions and scans are expressed using serial iteration, again
-requiring pattern matching before they can be transformed (e.g. to do a
-reduction using atomics, or hierarchically). The lesson to draw is that we
+ingests whole-tensor operations.
+Second, reductions and scans are expressed using serial iteration, again
+requiring pattern matching before they can be transformed (e.g. to do a
+reduction using atomics, or hierarchically). The lesson to draw is that we
should start with higher-level primitives than Halide.
-## Lessons from Tensor Comprehensions<a name="lessonstc"></a>
+### Lessons from Tensor Comprehensions<a name="lessonstc"></a>
[Tensor Comprehensions](https://arxiv.org/abs/1802.04730) is a
high-level language to express tensor computations with a syntax
generalizing the Einstein notation, coupled to an end-to-end
compilation flow capable of lowering to efficient GPU code. It was
-integrated with 2 ML frameworks: Caffe2 and PyTorch.
+integrated with 2 ML frameworks: Caffe2 and PyTorch.
<img width="600" alt="MLIR Codegen Flow"
src="https://user-images.githubusercontent.com/10148468/73613272-df904480-45c1-11ea-88f9-214dee7464cf.png">
The compilation flow combines [Halide](#lessonshalide) and a Polyhedral Compiler
derived from [ISL](https://en.wikipedia.org/wiki/Integer_set_library)
-and uses both HalideIR and the ISL *schedule-tree* IR.
+and uses both HalideIR and the ISL *schedule-tree* IR.
The compiler provides a collection of polyhedral compilation
algorithms to perform fusion and favor multi-level parallelism and
promotion to deeper levels of the memory hierarchy.
Tensor Comprehensions showed that, fixing a few predefined strategies
with parametric transformations and tuning knobs, can already provide
-great results. In that previous work, simple
+great results. In that previous work, simple
genetic search combined with an autotining framework was sufficient
to find good implementations in the ***non-compute bound regime***.
This requires code versions obtainable by the
@@ -320,7 +319,7 @@ are Stuck in a Rut](https://dl.acm.org/doi/10.1145/3317550.3321441).
Many of those issues are naturally addressed by implementing these ideas
in the MLIR infrastructure.
-## Lessons from Polyhedral compilers<a name="lessonspolyhedral"></a>
+### Lessons from Polyhedral compilers<a name="lessonspolyhedral"></a>
The polyhedral model has been on the cutting edge of loop-level optimization for
decades, with several incarnations in production compilers such as
[GRAPHITE](https://gcc.gnu.org/wiki/Graphite) for GCC and
@@ -333,7 +332,7 @@ analysis of the role of polyhedral transformations is provided in the
[simplified polyhedral
form](https://mlir.llvm.org/docs/RationaleSimplifiedPolyhedralForm/) document
dating back to the inception of MLIR.
-
+
In particular, polyhedral abstractions have proved challenging to integrate with
a more conventional compiler due to the following.
- The transformed code (or IR) quickly gets complex and thus hard to analyze and
@@ -345,7 +344,7 @@ a more conventional compiler due to the following.
- Expressiveness limitations, although addressed in the scientific literature
through, e.g., summary functions, often remain present in actual
implementations.
-
+
The Affine dialect in MLIR was specifically designed to address the integration
problems mention above. In particular, it maintains the IR in the same form
(loops with additional constraints on how the bounds are expressed) throughout
@@ -353,13 +352,13 @@ the transformation, decreasing the need for one-shot conversion between
drastically different representations. It also embeds the polyhedral
representation into the SSA form by using MLIR regions and thus allows one to
combine polyhedral and SSA-based transformations.
-
-## Lessons from the Affine dialect<a name="lessonsaffine"></a>
+
+### Lessons from the Affine dialect<a name="lessonsaffine"></a>
The Affine dialect in MLIR brings the polyhedral abstraction closer to the
conventional SSA representation. It addresses several long-standing integration
challenges as described above and is likely to be more suitable when compiling
from a C language-level abstraction.
-
+
MLIR makes it possible to start from a higher-level abstraction than C, for
example in machine learning workloads. In such cases, it may be possible to
avoid complex analyses (data-flow analysis across loop iterations is
@@ -372,7 +371,7 @@ the loop nest computing a matrix multiplication, no need to additionally rely on
affine dependence analysis to check this). This information is not readily
available in the Affine dialect, and can only be derived using potentially
expensive pattern-matching algorithms.
-
+
Informed by the practical experience in polyhedral compilation and with the
Affine dialects in particular, Linalg takes the following decisions.
- **Discourage loop skewing**: the loop skewing transformation, that is
@@ -403,7 +402,7 @@ Affine dialects in particular, Linalg takes the following decisions.
result of tiling a matrix multiplication is loops around a smaller matrix
multiplication. Even with pattern-matching on top of the Affine dialect, this
would have required another step of pattern-matching after the transformation.
-
+
Given these choices, Linalg intends to be a better fit for **high-level
compilation** were significantly more information is readily available in the
input representation and should be leveraged before lowering to other
@@ -412,9 +411,9 @@ and is used as a lowering target for Linalg, enabling further transformations
and combination of semantically-loaded and lower-level inputs. As such, Linalg
is intended to complement Affine rather than replace it.
-# Core Guiding Principles<a name="guiding_principles"></a>
+## Core Guiding Principles<a name="guiding_principles"></a>
-## Transformations and Simplicity First<a name="transformations_first"></a>
+### Transformations and Simplicity First<a name="transformations_first"></a>
The purpose of the Linalg IR and its operations is primarily to:
- develop a set of key transformations, and
- make them correct by construction by carefully curating the set of
@@ -447,7 +446,7 @@ abstractions.
This is not merely a reimplementation of idea X in system Y: simplicity
**must be the outcome** of this introspection effort.
-## Preservation of Information<a name="information_preservation"></a>
+### Preservation of Information<a name="information_preservation"></a>
The last two decades have seen a proliferation of Domain-Specific Languages
(DSLs) that have been very successful in limited application domains.
The main commonality between these systems is their use of a significantly
@@ -460,7 +459,7 @@ when it is not irremediably lost.
These remarks, coupled with MLIR's suitability for defining IR at multiple
levels of abstraction led to the following 2 principles.
-### Declarative Specification: Avoid Raising<a name="declarative_specification"></a>
+#### Declarative Specification: Avoid Raising<a name="declarative_specification"></a>
Compiler transformations need static structural information (e.g. loop-nests,
graphs of basic blocks, pure functions etc). When that structural information
@@ -480,7 +479,7 @@ MLIR makes it easy to define op semantics declaratively thanks to the use of
regions and attributes. This is an ideal opportunity to define new abstractions
to convey user-intent directly into the proper abstraction.
-### Progressive Lowering: Don't Lose Information too Quickly<a name="#progressive_lowering"></a>
+#### Progressive Lowering: Don't Lose Information too Quickly<a name="#progressive_lowering"></a>
Lowering too quickly to affine, generic loops or CFG form reduces the
amount of structure available to derive transformations from. While
@@ -500,19 +499,19 @@ MLIR is designed as an infrastructure for ***progressive lowering***.
Linalg fully embraces this notion and thinks of codegen in terms of
*reducing a potential function*. That potential function is loosely
defined in terms of number of low-level instructions in a particular
-Linalg ops (i.e. how heavy or lightweight the Linalg op is).
+Linalg ops (i.e. how heavy or lightweight the Linalg op is).
Linalg-based codegen and transformations start from higher-level IR
ops and dialects. Then each transformation application reduces the
potential by introducing lower-level IR ops and *smaller* Linalg ops.
This gradually reduces the potential, all the way to Loops + VectorOps
and LLVMIR.
-## Composable and Declarative Transformations<a name="declarative_transformations"></a>
+### Composable and Declarative Transformations<a name="declarative_transformations"></a>
Complex and impactful transformations need not be hard to manipulate, write or
maintain. Mixing XLA-style high-level op semantics knowledge with generic
properties to describe these semantics, directly in MLIR, is a promising way to:
- Design transformations that are correct by construction, easy to
-write, easy to verify and easy to maintain.
+write, easy to verify and easy to maintain.
- Provide a way to specify transformations and the units of IR they manipulate
declaratively. In turn this allows using local pattern rewrite rules in MLIR
(i.e. [DRR](https://mlir.llvm.org/docs/DeclarativeRewrites/)).
@@ -522,10 +521,10 @@ and other enabling rewrites in a single pass. The result is a system where pass
fusion is very simple to obtain and gives hope to solving certain
[phase ordering issues](https://dl.acm.org/doi/10.1145/201059.201061).
-## Suitability for Search and Machine Learning<a name="ml"></a>
+### Suitability for Search and Machine Learning<a name="ml"></a>
Compiler heuristics are hand-crafted, human-engineered features: they are
ripe for disruption by machine-learning techniques.
-To enable search, compiler transformations should be fine-grained,
+To enable search, compiler transformations should be fine-grained,
[composable](#declarative_transformations) and expose tuning parameters that
can modify their behavior, guided by lessons from previous experience
with [Tensor Comprehensions](#lessonstc).
@@ -540,9 +539,9 @@ building cost models.
Still, this does not mean Linalg refuses cost models: instead we
prefer to invest in infrastructure that will enable [ML-based
techniques to automatically build cost
-models](http://homepages.inf.ed.ac.uk/hleather/publications/2009_autofeatures_cgo.pdf).
+models](http://homepages.inf.ed.ac.uk/hleather/publications/2009_autofeatures_cgo.pdf).
-## Extensibility and Future-Proofness<a name="future"></a>
+### Extensibility and Future-Proofness<a name="future"></a>
MLIR allows defining IR for structured control flow and structured
data types. We choose to take advantage of these properties for the
reasons described above.
@@ -561,12 +560,12 @@ While there is no concrete solution
today to solve these problems in MLIR, it is pretty clear that perfect
static knowledge and analyses will not be serious contenders for these problems.
-# Key Observations<a name="keyobservation"></a>
+## Key Observations<a name="keyobservation"></a>
The following key observations have influenced the design of Linalg and helped
reconcile [core guiding principles](#guiding_principles) with real-world
requirements when producing an implementation based on MLIR.
-## Algorithms + Data Structures = Programs<a name="data_and_compute"></a>
+### Algorithms + Data Structures = Programs<a name="data_and_compute"></a>
This is a twist on Niklaus Wirth's formulation but captures the essence of the
design of Linalg: control-flow does not exist in a vacuum, independently of
data.
@@ -580,21 +579,21 @@ certain transformations are better done:
occurred,
- as extensions to the Linalg dialect in terms of new ops or attributes.
-## The Dialect Need not be Closed Under Transformations<a name="dialect_not_closed"></a>
+### The Dialect Need not be Closed Under Transformations<a name="dialect_not_closed"></a>
This is probably the most surprising and counter-intuitive
observation. When one designs IR for transformations, closed-ness is
often a nonnegotiable property.
This is a key design principle of polyhedral IRs such as
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
-and
+and
[ISL-based IRs](https://en.wikipedia.org/wiki/Integer_set_library):
they are closed under affine transformations.
-In MLIR, multiple dialects coexist and form a coherent whole. After
+In MLIR, multiple dialects coexist and form a coherent whole. After
experimenting with
different alternatives, it became clear that strict
dialect closed-ness wasn't necessary and could be relaxed. Previous
systems did not have simple and principled means of building new IR
and probably suffered from this limitation. We conjecture this is a
-key reason they required the IR to be closed under transformations.
+key reason they required the IR to be closed under transformations.
Despite the fact that Linalg ops only allow perfectly nested
semantics, once tiling and fusion kick in, imperfectly nested loops
@@ -608,7 +607,7 @@ transformation would dictate that the potential remains constant.
In contrast, Linalg advocates for ***monotonicity*** under
transformations.
-## Summary of Existing Alternatives a Picture<a name="observationssummary"></a>
+### Summary of Existing Alternatives a Picture<a name="observationssummary"></a>
Lastly, we summarize our observations of lessons from [Prior
Art](#prior_art)---when viewed under the lens of our [Core Guiding
Principles](#guiding_principles)---with the following picture.
diff --git a/mlir/docs/Tutorials/Toy/Ch-3.md b/mlir/docs/Tutorials/Toy/Ch-3.md
index a535d1c95c65..fee947ff5fda 100644
--- a/mlir/docs/Tutorials/Toy/Ch-3.md
+++ b/mlir/docs/Tutorials/Toy/Ch-3.md
@@ -22,7 +22,7 @@ rule-based pattern-match and rewrite using table-driven
use of DRR requires that the operations be defined using ODS, as described in
[Chapter 2](Ch-2.md).
-# Optimize Transpose using C++ style pattern-match and rewrite
+## Optimize Transpose using C++ style pattern-match and rewrite
Let's start with a simple pattern and try to eliminate a sequence of two
transposes that cancel out: `transpose(transpose(X)) -> X`. Here is the
@@ -163,7 +163,7 @@ Perfect! No `transpose` operation is left - the code is optimal.
In the next section, we use DRR for pattern match optimizations associated with
the Reshape op.
-# Optimize Reshapes using DRR
+## Optimize Reshapes using DRR
Declarative, rule-based pattern-match and rewrite (DRR) is an operation
DAG-based declarative rewriter that provides a table-based syntax for
diff --git a/mlir/docs/Tutorials/Toy/Ch-6.md b/mlir/docs/Tutorials/Toy/Ch-6.md
index faa5bdc9b009..34b76008f163 100644
--- a/mlir/docs/Tutorials/Toy/Ch-6.md
+++ b/mlir/docs/Tutorials/Toy/Ch-6.md
@@ -7,7 +7,7 @@ In the [previous chapter](Ch-5.md), we introduced the
many of the `Toy` operations to affine loop nests for optimization. In this
chapter, we will finally lower to LLVM for code generation.
-# Lowering to LLVM
+## Lowering to LLVM
For this lowering, we will again use the dialect conversion framework to perform
the heavy lifting. However, this time, we will be performing a full conversion
@@ -54,7 +54,7 @@ Now that the lowering for the printf operation has been defined, we can specify
the components necessary for the lowering. These are largely the same as the
components defined in the [previous chapter](Ch-5.md).
-## Conversion Target
+### Conversion Target
For this conversion, aside from the top-level module, we will be lowering
everything to the LLVM dialect.
@@ -65,7 +65,7 @@ everything to the LLVM dialect.
target.addLegalOp<mlir::ModuleOp, mlir::ModuleTerminatorOp>();
```
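
The hunk above only carries the tail of the conversion-target snippet. As a reminder of what this section sets up, here is a minimal sketch of the target definition, assuming the `LLVMConversionTarget` helper and a pass-level `getContext()` as used elsewhere in this chapter:

```c++
// A sketch of the conversion target discussed above (names assumed from the
// MLIR APIs of roughly this revision). LLVMConversionTarget marks the LLVM
// dialect as legal; the module and its terminator are additionally allowed
// to survive the full conversion.
mlir::LLVMConversionTarget target(getContext());
target.addLegalOp<mlir::ModuleOp, mlir::ModuleTerminatorOp>();
```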
-## Type Converter
+### Type Converter
This lowering will also transform the MemRef types which are currently being
operated on into a representation in LLVM. To perform this conversion, we use a
@@ -79,7 +79,7 @@ enough for our use case.
LLVMTypeConverter typeConverter(&getContext());
```
-## Conversion Patterns
+### Conversion Patterns
Now that the conversion target has been defined, we need to provide the patterns
used for lowering. At this point in the compilation process, we have a
@@ -99,7 +99,7 @@ by relying on [transitive lowering](../../../getting_started/Glossary.md#transit
patterns.insert<PrintOpLowering>(&getContext());
```
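
To make the transitive-lowering point concrete, a sketch of how the full pattern set is typically assembled; the `populate*` helper names are assumed from roughly this revision and have since been renamed:

```c++
// A sketch of the pattern list used for the full lowering: reuse the existing
// affine->std, loop->std and std->LLVM pattern sets, then add the tutorial's
// own lowering for the printf-style operation.
mlir::OwningRewritePatternList patterns;
mlir::populateAffineToStdConversionPatterns(patterns, &getContext());
mlir::populateLoopToStdConversionPatterns(patterns, &getContext());
mlir::populateStdToLLVMConversionPatterns(typeConverter, patterns);
patterns.insert<PrintOpLowering>(&getContext());
```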
-## Full Lowering
+### Full Lowering
We want to completely lower to LLVM, so we use a `FullConversion`. This ensures
that only legal operations will remain after the conversion.
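
A hedged sketch of how the full conversion is then driven from a module-level lowering pass; the exact `applyFullConversion` signature has varied across MLIR revisions, so treat this as illustrative only:

```c++
// A sketch: apply the full conversion to the module. Any operation that is
// not legal per `target` after rewriting causes the conversion, and hence
// the pass, to fail.
mlir::ModuleOp module = getOperation();
if (mlir::failed(
        mlir::applyFullConversion(module, target, patterns, &typeConverter)))
  signalPassFailure();
```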
@@ -169,13 +169,13 @@ llvm.func @main() {
See [Conversion to the LLVM IR Dialect](../../ConversionToLLVMDialect.md) for
more in-depth details on lowering to the LLVM dialect.
-# CodeGen: Getting Out of MLIR
+## CodeGen: Getting Out of MLIR
At this point we are right at the cusp of code generation. We can generate code
in the LLVM dialect, so now we just need to export to LLVM IR and set up a JIT to
run it.
-## Emitting LLVM IR
+### Emitting LLVM IR
Now that our module consists only of operations in the LLVM dialect, we can
export to LLVM IR. To do this programmatically, we can invoke the following
@@ -270,7 +270,7 @@ int dumpLLVMIR(mlir::ModuleOp module) {
}
```
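
Only the closing brace of `dumpLLVMIR` is visible in this hunk. The core of that routine is a translation call; a sketch, assuming the single-argument `translateModuleToLLVMIR` entry point available around this revision (later versions also take an `llvm::LLVMContext`):

```c++
// A sketch of the body of dumpLLVMIR: translate the all-LLVM-dialect MLIR
// module into an llvm::Module and print it.
std::unique_ptr<llvm::Module> llvmModule = mlir::translateModuleToLLVMIR(module);
if (!llvmModule) {
  llvm::errs() << "Failed to emit LLVM IR\n";
  return -1;
}
llvm::errs() << *llvmModule << "\n";
return 0;
```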
-## Setting up a JIT
+### Setting up a JIT
Setting up a JIT to run the module containing the LLVM dialect can be done using
the `mlir::ExecutionEngine` infrastructure. This is a utility wrapper around LLVM's JIT.
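
A sketch of the JIT setup this section describes, assuming the `mlir::ExecutionEngine::create` and `mlir::makeOptimizingTransformer` helpers of roughly this revision:

```c++
// A sketch of running the lowered module through the MLIR ExecutionEngine
// (helper names assumed from roughly this revision).
llvm::InitializeNativeTarget();
llvm::InitializeNativeTargetAsmPrinter();

// An optional LLVM optimization pipeline applied to the translated module.
auto optPipeline = mlir::makeOptimizingTransformer(/*optLevel=*/0,
                                                   /*sizeLevel=*/0,
                                                   /*targetMachine=*/nullptr);
auto maybeEngine = mlir::ExecutionEngine::create(module, optPipeline);
if (!maybeEngine) {
  llvm::errs() << "Failed to construct an ExecutionEngine\n";
  return -1;
}
auto &engine = maybeEngine.get();

// Invoke the JIT-compiled `main` of the Toy module.
auto invocationResult = engine->invoke("main");
if (invocationResult) {
  llvm::errs() << "JIT invocation failed\n";
  return -1;
}
return 0;
```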