[Mlir-commits] [mlir] 1842fd5 - [mlir] Fix multiple titles
Jacques Pienaar
llvmlistbot at llvm.org
Mon Feb 17 13:55:56 PST 2020
Author: Jacques Pienaar
Date: 2020-02-17T13:55:46-08:00
New Revision: 1842fd50d2ff66209cee11268ddf4bee28c482f7
URL: https://github.com/llvm/llvm-project/commit/1842fd50d2ff66209cee11268ddf4bee28c482f7
DIFF: https://github.com/llvm/llvm-project/commit/1842fd50d2ff66209cee11268ddf4bee28c482f7.diff
LOG: [mlir] Fix multiple titles
Every doc should have exactly one title, which corresponds to `#`; in
some docs there are multiple, and those extra titles end up as h1
headers (visual elements rather than organizational ones). Indent every
heading nesting level by one in all of the docs with multiple titles.
Also fixing trailing whitespace.
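For illustration, a minimal before/after sketch of the heading demotion
this change applies, using headings from the Linalg.md hunks below (the
single document title at `#` is unchanged and therefore not shown in
the diff):

```
Before: multiple `#` headings compete with the document title
# Rationale
## Payload-Carrying Ops
### Property 1: Input and Output Operands Define The Iteration Space

After: every nesting level is indented by one
## Rationale
### Payload-Carrying Ops
#### Property 1: Input and Output Operands Define The Iteration Space
```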
Added:
Modified:
mlir/docs/Dialects/Linalg.md
mlir/docs/Dialects/Vector.md
mlir/docs/RationaleLinalgDialect.md
mlir/docs/Tutorials/Toy/Ch-3.md
mlir/docs/Tutorials/Toy/Ch-6.md
Removed:
################################################################################
diff --git a/mlir/docs/Dialects/Linalg.md b/mlir/docs/Dialects/Linalg.md
index ddc0cd7f2dbd..6255f23ae4b8 100644
--- a/mlir/docs/Dialects/Linalg.md
+++ b/mlir/docs/Dialects/Linalg.md
@@ -2,18 +2,18 @@
[TOC]
-# Rationale
+## Rationale
<img width="90" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">
Linalg is designed to solve the High-level Hierarchical Optimization
(HHO box) in MLIR and to interoperate nicely within a
-*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
+*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
The [Rationale Document](https://mlir.llvm.org/docs/RationaleLinalgDialect)
goes into significantly more design and architectural decision details.
-# Set of Key Transformations<a name="key_transformations"></a>
+## Set of Key Transformations<a name="key_transformations"></a>
The following key transformations have been central to driving the design of
Linalg. They are all implemented in terms of the properties of the
@@ -33,7 +33,7 @@ performed on the Linalg IR and that have influenced its design:
1. Lower to Library Calls or Special Instructions, Intrinsics or ISA.
1. Partially Lower to Iterations Over a Finer-Grained Linalg Op.
-# High-Level Description of Linalg Ops<a name="linalg_ops"></a>
+## High-Level Description of Linalg Ops<a name="linalg_ops"></a>
Linalg takes at least some inspiration from all previously [listed prior
art](#prior_art). The design enables the definition of ***CustomOps*** with
generic properties that enable [key transformations](#key_transformations),
@@ -42,7 +42,7 @@ library calls and intrinsics.
These ops can have ***either tensor or buffer operands***.
-## Payload-Carrying Ops<a name="payload_ops"></a>
+### Payload-Carrying Ops<a name="payload_ops"></a>
Linalg defines two payload carrying operations that implement the [structured ops](
https://docs.google.com/presentation/d/1P-j1GrH6Q5gLBjao0afQ-GfvcAeF-QU4GXXeSy0eJ9I/edit#slide=id.p
) abstraction on tensors and buffers. This is architected as two generic operations
@@ -52,7 +52,7 @@ The properties of these generic ops are the result of applying the
guiding principles described in the [Rationale Document](https://mlir.llvm.org/docs/RationaleLinalgDialect).
They are listed next, with a brief example and discussion for each.
-### Property 1: Input and Output Operands Define The Iteration Space<a name="prop1"></a>
+#### Property 1: Input and Output Operands Define The Iteration Space<a name="prop1"></a>
A `linalg.generic` op fully *derives* the specification of its iteration space
from its operands.
The property enforces that a localized IR element (the op) *has* all the information
@@ -63,7 +63,7 @@ to [URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf).
Consider the following, partially specified, `linalg.generic` example:
```
#attrs = {args_in: 1, args_out: 1}
-func @example(%A: memref<?xf32, layout1>,
+func @example(%A: memref<?xf32, layout1>,
%B: memref<?xvector<4xf32, layout2>>) {
linalg.generic #attrs (%2, %3): memref<?xf32, layout1>,
memref<?xvector<4xf32, layout2>>
@@ -74,7 +74,7 @@ func @example(%A: memref<?xf32, layout1>,
The property "*Input and Output Operands Define The Iteration Space*" is
materialized by a lowering into a form that will resemble:
```
-func @example(%A: memref<?xf32, layout1>,
+func @example(%A: memref<?xf32, layout1>,
%B: memref<?xvector<4xf32, layout2>>) {
%M = "dim" %A, 0: index
%N = "dim" %B, 0: index
@@ -119,18 +119,18 @@ and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf),
as well as [TACO](http://tensor-compiler.org/), has shown.
-### Property 2: Reversible Mappings Between Control and Data Structures<a name="prop2"></a>
+#### Property 2: Reversible Mappings Between Control and Data Structures<a name="prop2"></a>
A `linalg.generic` *defines* the mapping between the iteration space (i.e. the
-loops) and the data.
+loops) and the data.
Consider the following, partially specified, `linalg.generic` example:
```
-#indexing_maps = {
- (i, j) -> (j, i),
- (i, j) -> (j)
+#indexing_maps = {
+ (i, j) -> (j, i),
+ (i, j) -> (j)
}
#attrs = {args_in: 1, args_out: 1, indexings: indexing_maps}
-func @example(%A: memref<?xf32, layout1>,
+func @example(%A: memref<?xf32, layout1>,
%B: memref<?xvector<4xf32, layout2>>) {
linalg.generic #attrs (%A, %B): memref<?xf32, layout1>,
memref<?xvector<4xf32, layout2>>
@@ -142,13 +142,13 @@ The property "*Reversible Mappings Between Control and Data Structures*" is
materialized by a lowering into a form that will resemble:
```
#attrs = {args_in: 1, args_out: 1, indexings: indexing_maps}
-func @example(%A: memref<?xf32, layout1>,
+func @example(%A: memref<?xf32, layout1>,
%B: memref<?xvector<4xf32, layout2>>) {
// loop bounds determined from data sizes by “inverting the map”
%J = "dim" %2, 0: index
%I = "dim" %2, 1: index
%J2 = "dim" %3, 0: index
- // iteration space is consistent with data + mapping inference
+ // iteration space is consistent with data + mapping inference
%eq = "eq" %J, %J2: i1
"assert" %eq: (i1) -> ()
for %i = 0 to %I { // loop order is fully defined by indexing maps
@@ -170,7 +170,7 @@ write?
- Given a subset of data read or written, what subset of the iteration space
is responsible for this read or write?
-Answering these `2` questions is one of the main analyses that Linalg uses to
+Answering these `2` questions is one of the main analyses that Linalg uses to
implement transformations such as tiling, tiled producer-consumer fusion, and
promotion to temporary buffers in fast memory.
@@ -179,7 +179,7 @@ This is a pragmatic short-term solution, but in the longer term note that
this property could be even evaluated dynamically, similarly to
inspector-executor algorithms.
-### Property 3: The Type Of Iterators is Defined Explicitly<a name="prop3"></a>
+#### Property 3: The Type Of Iterators is Defined Explicitly<a name="prop3"></a>
A `linalg.generic` op fully *declares* the type of its iterators. This
information is used in transformations.
@@ -192,21 +192,21 @@ preserved***.
This can be better captured directly at the loop level thanks to specific
iterator types, among which:
-*parallel*, *reduction*, *partition*, *permutable/monotonic*, *sequential*,
+*parallel*, *reduction*, *partition*, *permutable/monotonic*, *sequential*,
*dependence distance*, ...
These types are traditionally the result of complex dependence analyses and
have been referred to as "*bands*" in the polyhedral community (e.g. *parallel
bands*, *permutable bands*, etc, in
[ISL](https://en.wikipedia.org/wiki/Integer_set_library) schedule tree
-parlance).
+parlance).
Specifying the information declaratively in a `linalg.generic` allows
conveying properties that may be hard (or even impossible) to derive from
lower-level information. These properties can be brought all the way to the
moment when they are useful for transformations, used and then discarded.
-Additionally, these properties may also be viewed as a contract that the
+Additionally, these properties may also be viewed as a contract that the
frontend/user guarantees and that the compiler may take advantage of. The
common example is the use of data-dependent reduction semantics for
specifying histogram computations. If the frontend has additional knowledge
@@ -216,8 +216,8 @@ parallel semantics and use the special atomic in the computation region.
At this time, Linalg only has an explicit use for *parallel* and *reduction*
loops but previous experience shows that the abstraction generalizes.
-### Property 4: The Compute Payload is Specified With a Region<a name="prop4"></a>
-A `linalg.generic` op has a compute payload that is fully generic thanks to
+#### Property 4: The Compute Payload is Specified With a Region<a name="prop4"></a>
+A `linalg.generic` op has a compute payload that is fully generic thanks to
the use of
[Regions](https://github.com/llvm/llvm-project/blob/58265ad42a90ae8905be6a447cb42e53529a54a0/mlir/docs/LangRef.md#regions).
@@ -230,16 +230,16 @@ At this time there are no additional restrictions to the region
semantics. This is meant to allow the exploration of various design tradeoffs
at the intersection of regions and iterator types.
In particular, the frontend is responsible for the semantics of iterator types
-to correspond to the operations inside the region: the region can capture
+to correspond to the operations inside the region: the region can capture
buffers arbitrarily and write into them. If this conflicts with some parallel
iterator requirement, this is undefined behavior.
Concretely, consider the following, partially specified, `linalg.generic`
example:
```
-#indexing_maps = {
- (i, j) -> (i, j),
- (i, j) -> (i, j)
+#indexing_maps = {
+ (i, j) -> (i, j),
+ (i, j) -> (i, j)
}
#attrs = {args_in: 1, args_out: 1, indexings: #indexing_maps}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
@@ -276,24 +276,24 @@ proposal](https://llvm.discourse.group/t/introduce-std-inlined-call-op-proposal/
We expect to be able to reuse the common lower-level infrastructure provided
it evolves to support both region arguments and captures.
-### Property 5: May Map To an External Library Call<a name="prop5"></a>
+#### Property 5: May Map To an External Library Call<a name="prop5"></a>
A `linalg.generic` op may map to an external library call by specifying a
-`SymbolAttr`. At this level of abstraction, the important glue is the ability
+`SymbolAttr`. At this level of abstraction, the important glue is the ability
to perform transformations that preserve the structure necessary to ***call
the external library after different transformations have been applied***.
This involves considerations related to preservation of op semantics
and integration at the ABI level. Regardless of whether one wants to use
-external library calls or a custom ISA, the problem for codegen is similar:
+external library calls or a custom ISA, the problem for codegen is similar:
preservation of a fixed granularity.
Consider the following, partially specified, `linalg.generic`
example:
```
#fun_attr = "pointwise_add"
-#indexing_maps = {
- (i, j) -> (i, j),
- (i, j) -> (i, j)
+#indexing_maps = {
+ (i, j) -> (i, j),
+ (i, j) -> (i, j)
}
#attrs = {args_in: 1, args_out: 1, indexings: #indexing_maps, fun: #fun_attr}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
@@ -313,7 +313,7 @@ materialized by a lowering into a form that will resemble:
func @pointwise_add_sxsxf32_sxsxf32(memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
- call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
+ call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
(memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
return
}
@@ -321,20 +321,20 @@ func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
Which, after lowering to LLVM resembles:
```
-func @pointwise_add_sxsxf32_sxsxf32(!llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
- !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
+func @pointwise_add_sxsxf32_sxsxf32(!llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
+ !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
!llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">) -> ()
-func @example(%A: !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
- %B: !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
+func @example(%A: !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
+ %B: !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">,
%C: !llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">) {
- llvm.call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
+ llvm.call @pointwise_add_sxsxf32_sxsxf32 (%A, %B, %C):
(!llvm<"{ float*, i64, [2 x i64], [3 x i64] }*">...) -> ()
return
}
```
-#### Convention For External Library Interoperability
+##### Convention For External Library Interoperability
The `linalg` dialect adopts a convention that is similar to `BLAS` when
offloading operations to fast library implementations: pass a non-owning
pointer to input and output data with additional metadata. This convention
@@ -349,7 +349,7 @@ There is an [ongoing
discussion](https://llvm.discourse.group/t/lowering-optional-attributes-in-linalg-structuredops-to-standard-dialect/333/3)
on the topic of extending interoperability in the presence of key attributes.
-### Property 6: Perfectly Nested Writes To The Whole Output Operands<a name="prop6"></a>
+#### Property 6: Perfectly Nested Writes To The Whole Output Operands<a name="prop6"></a>
Perfectly nested loops form a particularly important class of structure that
enables key loop transformations such as tiling and mapping to library calls.
Unfortunately, this type of structure is easily broken by transformations such
@@ -363,12 +363,12 @@ entire memory region. This is a structural constraint across regions and
loops that has proven to be key in simplifying transformations.
One particular point to mention is that converting imperfectly nested code
-into perfectly nested code can often be done with enough loop distribution
+into perfectly nested code can often be done with enough loop distribution
and embedding of conditionals down to the innermost loop level.
Previous experience with Tensor Comprehensions gave us the intuition that
forcing innermost control-flow nesting is a lot like writing data-parallel
-code with arrays of boolean values and predication.
+code with arrays of boolean values and predication.
This type of trick has also been used before in polyhedral compilers to
convert non-affine control into affine compute dependencies.
@@ -376,7 +376,7 @@ While it may be possible to automate such rewrites from generic IR,
`linalg.generic` just forces the semantics for now.
The key implication is that this conversion to deep predication needs to be
-undone once we are done with Linalg transformations.
+undone once we are done with Linalg transformations.
After iterators and induction variables are materialized (i.e. after lowering
out of `linalg.generic` occurred), the overall performance will be greatly
influenced by the quality of canonicalizations, foldings and *Loop Independent
@@ -384,10 +384,10 @@ Code Motion* (LICM).
In the grander scheme, the reliance on late LICM was deemed a necessary risk.
-### Putting it Together<a name="summary"></a>
+#### Putting it Together<a name="summary"></a>
As it stands, the six properties above define the semantics of a
`linalg.generic` op. It is an open question whether all of these semantics are
-strictly necessary in practice and whether some should or could be derived
+strictly necessary in practice and whether some should or could be derived
automatically while still maintaining the [core guiding
principles](#guiding_principles).
@@ -396,7 +396,7 @@ because of empirical evidence building and working on multiple high-level
compilers. As we lay those down and engage more with the community, we expect
multiple rounds of discussions and design changes to the original architecture.
-## Data Representation: Views<a name="views"></a>
+### Data Representation: Views<a name="views"></a>
The current implementation uses the [Strided MemRef (a.k.a View)](
https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/MaL8m2nXuio)
abstraction. The name *View* is used interchangeably in `linalg` to signify
@@ -408,7 +408,7 @@ experience from existing LIFT abstractions for
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).
-## Metadata Ops<a name="metadata_ops"></a>
+### Metadata Ops<a name="metadata_ops"></a>
A set of ops that manipulate metadata but do not move memory. These ops take
`view` operands + extra attributes and return new `view`s. The returned
`view`s generally alias the operand `view`. At the moment the existing ops
@@ -435,7 +435,7 @@ In a longer-term future, the abstractions from [Legion data-centric
programming model](https://legion.stanford.edu/overview/) seem generally
appealing.
-## Named Payload-Carrying Ops<a name="named_ops"></a>
+### Named Payload-Carrying Ops<a name="named_ops"></a>
Additionally, `linalg` provides a small subset of commonly named operations:
* `linalg.copy`,
@@ -446,12 +446,12 @@ Additionally, `linalg` provides a small subset of commonly named operations:
These named operations adhere to the `linalg.generic` op interface. Work is in
progress to define declarative mechanisms to automatically generate named ops
-from a description in terms of only the generic op interface.
+from a description in terms of only the generic op interface.
This is the main reason there are only a small number of ops today: we expect
them to be auto-generated from Tablegen soon.
-# Open Issues and Design Alternatives<a name="open_issues"></a>
+## Open Issues and Design Alternatives<a name="open_issues"></a>
Multiple open issues and design alternatives are in flight and it is time to
lay them out for the community to discuss and pick apart:
1. Should `linalg.generic` support nesting?
@@ -466,6 +466,6 @@ extended, if at all?
...
These key questions (and much more) should be really thought of in the general
-context of MLIR in which different levels of IR interoperate seamlessly. In
-practice, it is not necessary (or beneficial) to try and solve all problems in the
+context of MLIR in which different levels of IR interoperate seamlessly. In
+practice, it is not necessary (or beneficial) to try and solve all problems in the
same IR.
diff --git a/mlir/docs/Dialects/Vector.md b/mlir/docs/Dialects/Vector.md
index 0ce393a435d0..1c00269c615a 100644
--- a/mlir/docs/Dialects/Vector.md
+++ b/mlir/docs/Dialects/Vector.md
@@ -11,7 +11,7 @@ targets. These abstractions serve to separate concerns between operations on
new proposal but rather a textual documentation of existing MLIR components
along with a rationale.
-# Positioning in the Codegen Infrastructure
+## Positioning in the Codegen Infrastructure
The following diagram, recently presented with the [StructuredOps
abstractions](https://drive.google.com/corp/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
captures the current codegen paths implemented in MLIR in the various existing
@@ -34,7 +34,7 @@ that the higher level of `vector`-level primitives we build and target from
codegen (or some user/language level), the simpler our task will be, the more
complex patterns can be expressed and the better performance will be.
-# Components of a Generic Retargetable Vector-Level Dialect
+## Components of a Generic Retargetable Vector-Level Dialect
The existing MLIR `vector`-level dialects are related to the following
bottom-up abstractions:
@@ -65,9 +65,9 @@ lowering patterns that are specified manually for now.
manually at the moment and that should be automated, following the `LLVM ->
Hardware Vector` ops generation as closely as possible.
-# Short Description of the Existing Infrastructure
+## Short Description of the Existing Infrastructure
-## LLVM level
+### LLVM level
On CPU, the `n-D` `vector` type currently lowers to
`!llvm<array<vector>>`. More concretely, `vector<4x8x128xf32>` lowers to
`!llvm<[4 x [ 8 x [ 128 x float ]]]>`.
@@ -76,7 +76,7 @@ one uses `llvm.extractelement`, `llvm.insertelement` and
`llvm.shufflevector`. A [deeper dive section](#DeeperDive) discusses the
current lowering choices and tradeoffs.
-## Hardware Vector Ops
+### Hardware Vector Ops
Hardware Vector Ops are implemented as one dialect per target.
For internal hardware, we are auto-generating the specific HW dialects.
For `GPU`, the `NVVM` dialect adds operations such as `mma.sync`, `shfl` and
@@ -90,7 +90,7 @@ Specialized `CPU` dialects that would capture specific features not well
captured by LLVM peephole optimizations of on different types that core MLIR
supports (e.g. Scalable Vectors) are welcome future extensions.
-## Virtual Vector Ops
+### Virtual Vector Ops
Some existing Standard and VectorOps Dialect on `n-D` `vector` types comprise:
```
%2 = std.addf %0, %1 : vector<3x7x8xf32> // -> vector<3x7x8xf32>
@@ -101,14 +101,14 @@ Some existing Standard and VectorOps Dialect on `n-D` `vector` types comprise:
%1 = vector.extract %0[1, 5]: vector<3x7x8xf32> // -> vector<8xf32>
%2 = vector.outerproduct %0, %1: vector<4xf32>, vector<8xf32> // -> vector<4x8xf32>
%3 = vector.outerproduct %0, %1, %2: vector<4xf32>, vector<8xf32> // fma when adding %2
-%3 = vector.strided_slice %0 {offsets = [2, 2], sizes = [2, 2], strides = [1, 1]}:
+%3 = vector.strided_slice %0 {offsets = [2, 2], sizes = [2, 2], strides = [1, 1]}:
vector<4x8x16xf32> // Returns a slice of type vector<2x2x16xf32>
-%2 = vector.transfer_read %A[%0, %1]
+%2 = vector.transfer_read %A[%0, %1]
{permutation_map = (d0, d1) -> (d0)}: memref<7x?xf32>, vector<4xf32>
-vector.transfer_write %f1, %A[%i0, %i1, %i2, %i3]
- {permutation_map = (d0, d1, d2, d3) -> (d3, d1, d0)} :
+vector.transfer_write %f1, %A[%i0, %i1, %i2, %i3]
+ {permutation_map = (d0, d1, d2, d3) -> (d3, d1, d0)} :
vector<5x4x3xf32>, memref<?x?x?x?xf32>
```
@@ -120,38 +120,38 @@ building and populates the [Vector
doc](https://github.com/llvm/llvm-project/blob/master/mlir/docs/Dialects/Vector.md)). Recent
extensions are driven by concrete use cases of interest. A notable such use
case is the `vector.contract` op which applies principles of the StructuredOps
-abstraction to `vector` types.
+abstraction to `vector` types.
-## Virtual Vector Rewrite Patterns
+### Virtual Vector Rewrite Patterns
The following rewrite patterns exist at the `VV->VV` level:
1. The now retired `MaterializeVector` pass used to legalize ops on a
coarse-grained virtual `vector` to a finer-grained virtual `vector` by
unrolling. This has been rewritten as a retargetable unroll-and-jam pattern on
-`vector` ops and `vector` types.
+`vector` ops and `vector` types.
2. The lowering of `vector_transfer` ops legalizes `vector` load/store ops to
permuted loops over scalar load/stores. This should evolve to loops over
`vector` load/stores + `mask` operations as they become available `vector` ops
-at the `VV` level.
+at the `VV` level.
The general direction is to add more Virtual Vector level ops and implement
more useful `VV -> VV` rewrites as composable patterns that the PatternRewrite
-infrastructure can apply iteratively.
+infrastructure can apply iteratively.
-## Virtual Vector to Hardware Vector Lowering
+### Virtual Vector to Hardware Vector Lowering
For now, `VV -> HWV` are specified in C++ (see for instance the
[SplatOpLowering for n-D
vectors](https://github.com/tensorflow/mlir/commit/0a0c4867c6a6fcb0a2f17ef26a791c1d551fe33d)
or the [VectorOuterProductOp
-lowering](https://github.com/tensorflow/mlir/commit/957b1ca9680b4aacabb3a480fbc4ebd2506334b8)).
+lowering](https://github.com/tensorflow/mlir/commit/957b1ca9680b4aacabb3a480fbc4ebd2506334b8)).
Simple [conversion
tests](https://github.com/llvm/llvm-project/blob/master/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir)
-are available for the `LLVM` target starting from the Virtual Vector Level.
+are available for the `LLVM` target starting from the Virtual Vector Level.
-# Rationale
-## Hardware as `vector` Machines of Minimum Granularity
+## Rationale
+### Hardware as `vector` Machines of Minimum Granularity
Higher-dimensional `vector`s are ubiquitous in modern HPC hardware. One way to
think about Generic Retargetable `vector`-Level Dialect is that it operates on
@@ -163,25 +163,25 @@ Some notable `vector` sizes of interest include:
1. CPU: `vector<HW_vector_size * k>`, `vector<core_count * k’ x
HW_vector_size * k>` and `vector<socket_count x core_count * k’ x
-HW_vector_size * k>`
+HW_vector_size * k>`
2. GPU: `vector<warp_size * k>`, `vector<warp_size * k x float4>` and
-`vector<warp_size * k x 4 x 4 x 4>` for tensor_core sizes,
-3. Other accelerators: n-D `vector` as first-class citizens in the HW.
+`vector<warp_size * k x 4 x 4 x 4>` for tensor_core sizes,
+3. Other accelerators: n-D `vector` as first-class citizens in the HW.
Depending on the target, ops on sizes that are not multiples of the HW
`vector` size may either produce slow code (e.g. by going through `LLVM`
legalization) or may not legalize at all (e.g. some unsupported accelerator X
-combination of ops and types).
+combination of ops and types).
-## Transformations Problems Avoided
+### Transformations Problems Avoided
A `vector<16x32x64xf32>` virtual `vector` is a coarse-grained type that can be
“unrolled” to HW-specific sizes. The multi-dimensional unrolling factors are
carried in the IR by the `vector` type. After unrolling, traditional
-instruction-level scheduling can be run.
+instruction-level scheduling can be run.
The following key transformations (along with the supporting analyses and
structural constraints) are completely avoided by operating on a ``vector``
-`ssa-value` abstraction:
+`ssa-value` abstraction:
1. Loop unroll and unroll-and-jam.
2. Loop and load-store restructuring for register reuse.
@@ -191,9 +191,9 @@ structural constraints) are completely avoided by operating on a ``vector``
Note that “unrolling” in the context of `vector`s corresponds to partial loop
unroll-and-jam and not full unrolling. As a consequence this is expected to
compose with SW pipelining where applicable and does not result in ICache blow
-up.
+up.
-## The Big Out-Of-Scope Piece: Automatic Vectorization
+### The Big Out-Of-Scope Piece: Automatic Vectorization
One important piece not discussed here is automatic vectorization
(automatically raising from scalar to n-D `vector` ops and types). The TL;DR
is that when the first "super-vectorization" prototype was implemented, MLIR
@@ -226,7 +226,7 @@ with expressing `vector`s in the IR directly and simple
pattern-rewrites. [EDSC](https://github.com/llvm/llvm-project/blob/master/mlir/docs/EDSC.md)s
provide a simple way of driving such a notional language directly in C++.
-# Bikeshed Naming Discussion
+## Bikeshed Naming Discussion
There are arguments against naming an n-D level of abstraction `vector`
because most people associate it with 1-D `vector`s. On the other hand,
`vector`s are first-class n-D values in MLIR.
@@ -236,7 +236,7 @@ hardware.
For now, we generally use the `n-D` `vector` name and are open to better
suggestions.
-# DeeperDive
+## DeeperDive
This section describes the tradeoffs involved in lowering the MLIR n-D vector
type and operations on it to LLVM-IR. Putting aside the [LLVM
@@ -257,7 +257,7 @@ MLIR operations are prefixed by the `vector.` dialect prefix
(e.g. `vector.insertelement`). Such ops operate exclusively on MLIR `n-D`
`vector` types.
-## Alternatives For Lowering an n-D Vector Type to LLVM
+### Alternatives For Lowering an n-D Vector Type to LLVM
Consider a vector of rank n with static sizes `{s_0, ... s_{n-1}}` (i.e. an
MLIR `vector<s_0x...s_{n-1}xf32>`). Lowering such an `n-D` MLIR vector type to
an LLVM descriptor can be done by either:
@@ -274,7 +274,7 @@ discuss. It is important to note that “a mix of both” immediately reduces to
vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that flattens the most
"k" minor dimensions.
-## Constraints Inherited from LLVM (see LangRef)
+### Constraints Inherited from LLVM (see LangRef)
The first constraint was already mentioned: LLVM only supports `1-D` `vector`
types natively.
Additional constraints are related to the difference in LLVM between vector
@@ -295,7 +295,7 @@ types.
The next sentence illustrates a recurrent tradeoff, also found in MLIR,
between “value types” (subject to SSA use-def chains) and “memory types”
-(subject to aliasing and side-effects):
+(subject to aliasing and side-effects):
```
“Structures in memory are accessed using ‘load’ and ‘store’ by getting a
pointer to a field with the llvm.getelementptr instruction. Structures in
@@ -306,62 +306,62 @@ instructions.”
When transposing this to MLIR, `llvm.getelementptr` works on pointers to `n-D`
vectors in memory. For `n-D`, vectors values that live in registers we can use
`vector.extract` and `vector.insert` which do not accept dynamic indices. Note
-that this is consistent with hardware considerations as discussed below.
+that this is consistent with hardware considerations as discussed below.
An alternative is to use an LLVM `1-D` `vector` type for which one can use
`llvm.extractelement`, `llvm.insertelement` and `llvm.shufflevector`. These
operations accept dynamic indices. The implication is that one has to use a
-flattened lowering of an MLIR n-D vector to an LLVM 1-D vector.
+flattened lowering of an MLIR n-D vector to an LLVM 1-D vector.
There are multiple tradeoffs involved that mix implications on the programming
model, execution on actual HW and what is visible or hidden from codegen. They
-are discussed in the following sections.
+are discussed in the following sections.
-## Nested Aggregate
+### Nested Aggregate
Pros:
-1. Natural encoding n-D vector -> (n-1)-D aggregate over 1-D vector.
-2. No need for linearization / delinearization logic inserted everywhere.
-3. `llvm.insertvalue`, `llvm.extractvalue` of `(n-k)-D` aggregate is natural.
+1. Natural encoding n-D vector -> (n-1)-D aggregate over 1-D vector.
+2. No need for linearization / delinearization logic inserted everywhere.
+3. `llvm.insertvalue`, `llvm.extractvalue` of `(n-k)-D` aggregate is natural.
4. `llvm.insertelement`, `llvm.extractelement`, `llvm.shufflevector` over
-`1-D` vector type is natural.
+`1-D` vector type is natural.
Cons:
1. `llvm.insertvalue` / `llvm.extractvalue` does not accept dynamic indices
-but only static ones.
+but only static ones.
2. Dynamic indexing on the non-most-minor dimension requires roundtrips to
-memory.
+memory.
3. Special intrinsics and native instructions in LLVM operate on `1-D`
vectors. This is not expected to be a practical limitation thanks to a
`vector.cast %0: vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that
flattens the most minor dimensions (see the bigger picture in implications on
-codegen).
+codegen).
-## Flattened 1-D Vector Type
+### Flattened 1-D Vector Type
Pros:
1. `insertelement` / `extractelement` / `shufflevector` with dynamic indexing
-is possible over the whole lowered `n-D` vector type.
+is possible over the whole lowered `n-D` vector type.
2. Supports special intrinsics and native operations.
Cons:
1. Requires linearization/delinearization logic everywhere, translations are
-complex.
+complex.
2. Hides away the real HW structure behind dynamic indexing: at the end of the
day, HW vector sizes are generally fixed and multiple vectors will be needed
-to hold a vector that is larger than the HW.
+to hold a vector that is larger than the HW.
3. Unlikely peephole optimizations will result in good code: arbitrary dynamic
accesses, especially at HW vector boundaries unlikely to result in regular
-patterns.
+patterns.
-## Discussion
-### HW Vectors and Implications on the SW and the Programming Model
+### Discussion
+#### HW Vectors and Implications on the SW and the Programming Model
As of today, the LLVM model only support `1-D` vector types. This is
unsurprising because historically, the vast majority of HW only supports `1-D`
vector registers. We note that multiple HW vendors are in the process of
-evolving to higher-dimensional physical vectors.
+evolving to higher-dimensional physical vectors.
In the following discussion, let's assume the HW vector size is `1-D and the
SW vector size is `n-D`, with `n >= 1`. The same discussion would apply with
@@ -370,7 +370,7 @@ register file. The number of such vectors is fixed.
Depending on the rank and sizes of the SW vector abstraction and the HW vector
sizes and number of registers, an `n-D` SW vector type may be materialized by
a mix of multiple `1-D` HW vector registers + memory locations at a given
-point in time.
+point in time.
The implication of the physical HW constraints on the programming model are
that one cannot index dynamically across hardware registers: a register file
@@ -381,28 +381,28 @@ programmers: when declaring a `private float a[4]`; and subsequently indexing
with a *dynamic* value results in so-called **local memory** usage
(i.e. roundtripping to memory).
-### Implication on codegen
+#### Implication on codegen
MLIR `n-D` vector types are currently represented as `(n-1)-D` arrays of `1-D`
-vectors when lowered to LLVM.
+vectors when lowered to LLVM.
This introduces the consequences on static vs dynamic indexing discussed
previously: `extractelement`, `insertelement` and `shufflevector` on `n-D`
vectors in MLIR only support static indices. Dynamic indices are only
supported on the most minor `1-D` vector but not the outer `(n-1)-D`.
-For other cases, explicit load / stores are required.
+For other cases, explicit load / stores are required.
The implications on codegen are as follows:
1. Loops around `vector` values are indirect addressing of vector values, they
-must operate on explicit load / store operations over `n-D` vector types.
+must operate on explicit load / store operations over `n-D` vector types.
2. Once an `n-D` `vector` type is loaded into an SSA value (that may or may
not live in `n` registers, with or without spilling, when eventually lowered),
it may be unrolled to smaller `k-D` `vector` types and operations that
correspond to the HW. This level of MLIR codegen is related to register
-allocation and spilling that occur much later in the LLVM pipeline.
+allocation and spilling that occur much later in the LLVM pipeline.
3. HW may support >1-D vectors with intrinsics for indirect addressing within
these vectors. These can be targeted thanks to explicit `vector_cast`
operations from MLIR `k-D` vector types and operations to LLVM `1-D` vectors +
-intrinsics.
+intrinsics.
Alternatively, we argue that directly lowering to a linearized abstraction
hides away the codegen complexities related to memory accesses by giving a
@@ -410,7 +410,7 @@ false impression of magical dynamic indexing across registers. Instead we
prefer to make those very explicit in MLIR and allow codegen to explore
tradeoffs.
Different HW will require different tradeoffs in the sizes involved in steps
-1., 2. and 3.
+1., 2. and 3.
Decisions made at the MLIR level will have implications at a much later stage
in LLVM (after register allocation). We do not envision to expose concerns
@@ -421,7 +421,7 @@ MLIR level will be able to target. Such costs at the MLIR level will be
abstract and used for ranking, not for accurate performance modeling. In the
future such costs will be learned.
-### Implication on Lowering to Accelerators
+#### Implication on Lowering to Accelerators
To target accelerators that support higher dimensional vectors natively, we
can start from either `1-D` or `n-D` vectors in MLIR and use `vector.cast` to
flatten the most minor dimensions to `1-D` `vector<Kxf32>` where `K` is an
@@ -442,17 +442,17 @@ and intra-vector shuffling that may not be worthwhile or even feasible,
i.e. infinite cost.
However `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>` when `K =
-K1 * … * Kn` should be close to a noop.
+K1 * … * Kn` should be close to a noop.
As we start building accelerator-specific abstractions, we hope to achieve
retargetable codegen: the same infra is used for CPU, GPU and accelerators
-with extra MLIR patterns and costs.
+with extra MLIR patterns and costs.
-### Implication on calling external functions that operate on vectors
+#### Implication on calling external functions that operate on vectors
It is possible (likely) that we additionally need to linearize when calling an
-external function.
+external function.
-## Relationship to LLVM matrix type proposal.
+### Relationship to LLVM matrix type proposal.
The LLVM matrix proposal was formulated 1 year ago but seemed to be somewhat
stalled until recently. In its current form, it is limited to 2-D matrix types
and operations are implemented with LLVM intrinsics.
@@ -467,7 +467,7 @@ document could become the unifying abstraction that people should target for
>1-D vectors and the LLVM matrix proposal can be viewed as a subset of this
work.
-## Conclusion
+### Conclusion
The flattened 1-D vector design in the LLVM matrix proposal is good in a
HW-specific world with special intrinsics. This is a good abstraction for
register allocation, Instruction-Level-Parallelism and
@@ -481,7 +481,7 @@ This makes “nested aggregate type of 1-D vector” an appealing abstraction fo
lowering from MLIR because:
1. it does not hide complexity related to the buffer vs value semantics and
-the memory subsystem and
+the memory subsystem and
2. it does not rely on LLVM to magically make all the things work from a too
low-level abstraction.
@@ -489,5 +489,5 @@ The use of special intrinsics in a `1-D` LLVM world is still available thanks
to an explicit `vector.cast` op.
-## Operations
+### Operations
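As a side note on the accelerator-lowering discussion in the Vector.md
hunks above, here is a minimal sketch of the flattening `vector.cast`
described there, reusing the notation from the document (the `%src` name
and the concrete shapes are illustrative only, not taken from a
verified build):

```
// Flatten the two most-minor dimensions so that a 1-D-only consumer
// (special intrinsic or HW op) can use the value; since 512 = 16 * 32,
// this cast is expected to be close to a no-op.
%flat = vector.cast %src : vector<4x8x16x32xf32> to vector<4x8x512xf32>
```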
diff --git a/mlir/docs/RationaleLinalgDialect.md b/mlir/docs/RationaleLinalgDialect.md
index 39a675d9be5f..2c07502a9fb3 100644
--- a/mlir/docs/RationaleLinalgDialect.md
+++ b/mlir/docs/RationaleLinalgDialect.md
@@ -2,33 +2,32 @@
[TOC]
-# Introduction<a name="introduction"></a>
+## Introduction<a name="introduction"></a>
-## Positioning
+### Positioning
<img width="180" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">
-This document describes the key design principles
+This document describes the key design principles
that led to the existing implementation of Linalg and aims at exposing
the tradeoffs involved when building higher-level Intermediate
Representations (IR) and Dialects to facilitate code
generation. Consider the simplified schema describing codegen in MLIR.
Linalg is designed to solve the High-level Hierarchical Optimization
(HHO box) and to interoperate nicely within a
-*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
+*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
This work is inspired by a wealth of [prior art](#prior_art) in
the field, from which it seeks to learn key lessons. This documentation
and introspection effort also comes in the context of the proposal for a
working group for discussing the [Development of high-level Tensor Compute
Primitives dialect(s) and
-transformations](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/3).
+transformations](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/3).
We hope that the lessons from prior art, the design principles outlined in
-this doc and the architecture of Linalg can help inform the community on a
+this doc and the architecture of Linalg can help inform the community on a
path to defining these High-Level Tensor Compute Primitives.
+### Inception
-## Inception
-
Linalg started as a pragmatic dialect to bootstrap code generation in MLIR, by
*defining away* complex code generation problems like precise dependence
analysis or polyhedral code generation and by introducing the ability to call
@@ -41,31 +40,31 @@ so as not to miss out on simple performance benefits. For example, if
one's favorite HPC library or ISA has a `matmul` primitive running at 95% of
the achievable peak performance, for operands stored in some memory, one should
be able to **use the primitive** when possible *and* generate code otherwise.
-
+
However, as the design of Linalg co-evolved with the design of MLIR, it became
apparent that it could extend to larger application domains than just machine
learning on dense tensors.
-
+
The design and evolution of Linalg follows a *codegen-friendly* approach where
the IR and the transformations evolve hand-in-hand.
The key idea is that op semantics *declare* and transport information that is
-traditionally obtained by compiler analyses.
+traditionally obtained by compiler analyses.
This information captures the legality and applicability of transformations and
is **not lost by lowering prematurely to loop or CFG form**. The key
transformations are designed so as to **preserve this information** as long as
necessary. For example, `linalg.matmul` remains `linalg.matmul` after tiling
and fusion.
-
+
Furthermore, Linalg decouples transformation validity from profitability
considerations and voluntarily leaves the latter aside in the first iteration
(see the [suitability for search](#suitability_for_search) guiding principle).
-
+
The first incarnation of these ideas was presented as an example at the
EuroLLVM 2019 developer's meeting as part of the
[Linalg section](https://llvm.org/devmtg/2019-04/slides/Tutorial-AminiVasilacheZinenko-MLIR.pdf)
of the first [MLIR Tutorial](https://www.youtube.com/watch?v=cyICUIZ56wQ).
-
-## Evolution
+
+### Evolution
Since the initial implementation, the design has evolved with, and partially
driven the evolution of the core MLIR infrastructure to use
[Regions](https://mlir.llvm.org/docs/LangRef/#regions),
@@ -82,14 +81,14 @@ which define structured operations on vectors, following the same rationale and
design principles as Linalg. (Vector dialect includes the higher-level
operations on multi-dimensional vectors and abstracts away the lowering to
single-dimensional vectors).
-
+
The Linalg dialect itself grew beyond linear algebra-like operations to become
more expressive, in particular by providing an abstraction of a loop nest
supporting parallelism, reductions and sliding windows around arbitrary MLIR
[regions](https://mlir.llvm.org/docs/LangRef/#regions). It also has the
potential of growing beyond *dense* linear-algebra to support richer data
types, such as sparse and ragged tensors and buffers.
-
+
Linalg design remains open to evolution and cross-pollination with other
dialects and approaches. It has been successfully used as the staging ground
for code generation-related abstractions, spinning off the generalization of
@@ -102,22 +101,22 @@ unsurprising ABI conventions;
to the *structured control flow* dialect (named `LoopOps`).
More components can be extracted, redesigned and generalized when new uses or
requirements arise.
-
+
Several [design questions](#open_issues) remain open in Linalg, which does not
claim to be a general solution to all compilation problems.
It does aim at driving thinking and implementations of domain-specific
abstractions where programmer's intent can be captured at a very high level,
directly in the IR.
-
+
Given the evolution of the scope, it becomes apparent that a better name than
"Linalg" could remove some of the confusions related to the dialect (and the
underlying approach), its goals and limitations.
-# Prior Art<a name=""></a>
+## Prior Art<a name=""></a>
Linalg draws inspiration from decades of prior art to design a modern a
pragmatic solution. The following non-exhaustive list refers to some of the
projects that influenced Linalg design:
-
+
- [ONNX](https://onnx.ai/),
- [LIFT](https://www.lift-project.org/),
- [XLA](https://www.tensorflow.org/xla/architecture),
@@ -132,17 +131,17 @@ projects that influenced Linalg design:
[Optimizing Compilers for Modern Architectures](
https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9))
- Traditional compiler CFGs with SSA forms.
-
+
Additionally, experience with the following tools proved very valuable when
thinking holistically about how all these components interplay all the way
up to the user and down to the hardware:
-
+
- the [Torch](http://torch.ch/) machine-learning framework,
- the LLVM compiler, specifically in JIT mode,
- high-performance libraries (MKL, CUBLAS, FBFFT)
- the [PeachPy](https://www.cs.utexas.edu/users/flame/BLISRetreat/BLISRetreatTalks/PeachPy.pdf) assembler
- current and potentially upcoming hardware ISAs.
-
+
The novelty of MLIR's code base and its unprecedented support for defining and
mixing abstractions, enabling one to reflect on and integrate the key elements
of the prior art success as well as avoid the common pitfalls in the area of
@@ -150,28 +149,28 @@ code generation. Thus, instead of diverging into a discussion about the
implications of adopting any of the existing solutions, Linalg had the
possibility to build on all of them and learn from their experience while
leveraging the benefit of hindsight.
-
+
The following reflections on prior art have influenced the design of Linalg.
The discussion is by no means exhaustive but should capture the key motivations
behind Linalg.
-
-## Lessons from ONNX<a name="lessonsonnx"></a>
+
+### Lessons from ONNX<a name="lessonsonnx"></a>
ONNX is a specification of operations that appear in Machine Learning
workloads. As such, it is predominantly driven by the expressiveness requirements
of ML, and less by the considerations of IR design for HPC code generation.
-
+
Similarly to ONNX, Linalg defines *"semantically charged" named ops*.
But it also considers *transformations on these ops* as a key component and
defines the IR to support the transformations, preferring transformations over
expressiveness if necessary.
-
+
Linalg hopes to additionally address the following:
- facilitate frontend-compiler co-design by taking into account compiler
transformations and lowerings in op definition;
- minimize the set of available ops by making them non-overlapping with each
other, thus simplifying the intermediate representation.
-
-## Lessons from LIFT<a name="lessonslift"></a>
+
+### Lessons from LIFT<a name="lessonslift"></a>
[LIFT](https://www.lift-project.org/) is a system to write computational
kernels based on functional abstractions. Transformations are
represented by additional nodes in the IR, whose semantics are at the
@@ -179,11 +178,11 @@ level of the algorithm (e.g. `partialReduce`).
LIFT applies and composes transformations by using [local rewrite
rules](https://www.lift-project.org/presentations/2015/ICFP-2015.pdf) that
embed these additional nodes directly in the functional abstraction.
-
+
Similarly to LIFT, Linalg uses local rewrite rules implemented with the MLIR
[Declarative Rewrite Rules](https://mlir.llvm.org/docs/DeclarativeRewrites/)
mechanisms.
-
+
Linalg builds on, and helps separate concerns in the LIFT approach as follows:
- transformations are either separated from the representation or expressed as
composable attributes that are independent of the actual computation,
@@ -191,7 +190,7 @@ Linalg builds on, and helps separate concerns in the LIFT approach as follows:
- abstractions are split into smaller components (e.g., control flow and data
structure abstractions) potentially reusable across different dialects in the
MLIR's open ecosystem.
-
+
LIFT is expected to further influence the design of Linalg as it evolve. In
particular, extending the data structure abstractions to support non-dense
tensors can use the experience of LIFT abstractions for
@@ -199,10 +198,10 @@ tensors can use the experience of LIFT abstractions for
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).
-## Lessons from XLA<a name="lessonsxla"></a>
+### Lessons from XLA<a name="lessonsxla"></a>
[XLA](https://www.tensorflow.org/xla/architecture) is one of the first
post-Theano ML compilers that was introduced as a pragmatic compilation
-solution for TensorFlow. It shines on Google's xPU
+solution for TensorFlow. It shines on Google's xPU
hardware and is an important piece of the puzzle. It is particularly good at
(1) transforming code back and forth between the scalar and the vector
worlds, (2) passing function boundaries for handling both host and device
@@ -231,7 +230,7 @@ Since the transformations themselves are not simple local rewrite patterns
independently. This monolithic design makes the system not portable: xPU passes
and GPU passes do not share much code.
-## Lessons from Halide and TVM<a name="lessonshalide"></a>
+### Lessons from Halide and TVM<a name="lessonshalide"></a>
[Halide](https://halide-lang.org/) is a DSL embedded in C++ that provides a
way of metaprogramming the HalideIR and applying transformations declaratively
to let the expert user transform and optimize the program in tailored ways.
@@ -253,42 +252,42 @@ canonicalization rules that are also very prevalent in MLIR.
Linalg hopes to additionally address the following:
- Halide scheduling is powerful and explores a large swath of possible
-transformations. But it's still too hard for newcomers to use or extend. The
+transformations. But it's still too hard for newcomers to use or extend. The
level of performance you get from Halide is very different depending on
whether one is a seasoned veteran or a newcomer. This is especially true as
the number of transformations grow.
-- Halide raises rather than lowers in two ways, going counter-current to the
+- Halide raises rather than lowers in two ways, going counter-current to the
design goals we set for high-level codegen abstractions in in MLIR. First,
-canonical Halide front-end code uses explicit indexing and math on scalar
+canonical Halide front-end code uses explicit indexing and math on scalar
values, so to target BLAS/DNN libraries one needs to add pattern matching
-which is similarly brittle as in the affine case. While Halide's performance
-is on par with the libraries on programmable targets (CPU/GPU), that
+which is similarly brittle as in the affine case. While Halide's performance
+is on par with the libraries on programmable targets (CPU/GPU), that
approach doesn't work on mobile accelerators or on xPUs, where the framework
-ingests whole-tensor operations.
-Second, reductions and scans are expressed using serial iteration, again
-requiring pattern matching before they can be transformed (e.g. to do a
-reduction using atomics, or hierarchically). The lesson to draw is that we
+ingests whole-tensor operations.
+Second, reductions and scans are expressed using serial iteration, again
+requiring pattern matching before they can be transformed (e.g. to do a
+reduction using atomics, or hierarchically). The lesson to draw is that we
should start with higher-level primitives than Halide.
-## Lessons from Tensor Comprehensions<a name="lessonstc"></a>
+### Lessons from Tensor Comprehensions<a name="lessonstc"></a>
[Tensor Comprehensions](https://arxiv.org/abs/1802.04730) is a
high-level language to express tensor computations with a syntax
generalizing the Einstein notation, coupled to an end-to-end
compilation flow capable of lowering to efficient GPU code. It was
-integrated with 2 ML frameworks: Caffe2 and PyTorch.
+integrated with 2 ML frameworks: Caffe2 and PyTorch.
<img width="600" alt="MLIR Codegen Flow"
src="https://user-images.githubusercontent.com/10148468/73613272-df904480-45c1-11ea-88f9-214dee7464cf.png">
The compilation flow combines [Halide](#lessonshalide) and a Polyhedral Compiler
derived from [ISL](https://en.wikipedia.org/wiki/Integer_set_library)
-and uses both HalideIR and the ISL *schedule-tree* IR.
+and uses both HalideIR and the ISL *schedule-tree* IR.
The compiler provides a collection of polyhedral compilation
algorithms to perform fusion and favor multi-level parallelism and
promotion to deeper levels of the memory hierarchy.
Tensor Comprehensions showed that, fixing a few predefined strategies
with parametric transformations and tuning knobs, can already provide
-great results. In that previous work, simple
+great results. In that previous work, simple
genetic search combined with an autotining framework was sufficient
to find good implementations in the ***non-compute bound regime***.
This requires code versions obtainable by the
@@ -320,7 +319,7 @@ are Stuck in a Rut](https://dl.acm.org/doi/10.1145/3317550.3321441).
Many of those issues are naturally addressed by implementing these ideas
in the MLIR infrastructure.
-## Lessons from Polyhedral compilers<a name="lessonspolyhedral"></a>
+### Lessons from Polyhedral compilers<a name="lessonspolyhedral"></a>
The polyhedral model has been on the cutting edge of loop-level optimization for
decades, with several incarnations in production compilers such as
[GRAPHITE](https://gcc.gnu.org/wiki/Graphite) for GCC and
@@ -333,7 +332,7 @@ analysis of the role of polyhedral transformations is provided in the
[simplified polyhedral
form](https://mlir.llvm.org/docs/RationaleSimplifiedPolyhedralForm/) document
dating back to the inception of MLIR.
-
+
In particular, polyhedral abstractions have proved challenging to integrate with
a more conventional compiler due to the following.
- The transformed code (or IR) quickly gets complex and thus hard to analyze and
@@ -345,7 +344,7 @@ a more conventional compiler due to the following.
- Expressiveness limitations, although addressed in the scientific literature
through, e.g., summary functions, often remain present in actual
implementations.
-
+
The Affine dialect in MLIR was specifically designed to address the integration
problems mention above. In particular, it maintains the IR in the same form
(loops with additional constraints on how the bounds are expressed) throughout
@@ -353,13 +352,13 @@ the transformation, decreasing the need for one-shot conversion between
drastically different representations. It also embeds the polyhedral
representation into the SSA form by using MLIR regions and thus allows one to
combine polyhedral and SSA-based transformations.
-
-## Lessons from the Affine dialect<a name="lessonsaffine"></a>
+
+### Lessons from the Affine dialect<a name="lessonsaffine"></a>
The Affine dialect in MLIR brings the polyhedral abstraction closer to the
conventional SSA representation. It addresses several long-standing integration
challenges as described above and is likely to be more suitable when compiling
from a C language-level abstraction.
-
+
MLIR makes it possible to start from a higher-level abstraction than C, for
example in machine learning workloads. In such cases, it may be possible to
avoid complex analyses (data-flow analysis across loop iterations is
@@ -372,7 +371,7 @@ the loop nest computing a matrix multiplication, no need to additionally rely on
affine dependence analysis to check this). This information is not readily
available in the Affine dialect, and can only be derived using potentially
expensive pattern-matching algorithms.
-
+
Informed by the practical experience in polyhedral compilation and with the
Affine dialects in particular, Linalg takes the following decisions.
- **Discourage loop skewing**: the loop skewing transformation, that is
@@ -403,7 +402,7 @@ Affine dialects in particular, Linalg takes the following decisions.
result of tiling a matrix multiplication is loops around a smaller matrix
multiplication. Even with pattern-matching on top of the Affine dialect, this
would have required another step of pattern-matching after the transformation.
-
+
Given these choices, Linalg intends to be a better fit for **high-level
compilation** were significantly more information is readily available in the
input representation and should be leveraged before lowering to other
@@ -412,9 +411,9 @@ and is used as a lowering target for Linalg, enabling further transformations
and combination of semantically-loaded and lower-level inputs. As such, Linalg
is intended to complement Affine rather than replace it.
-# Core Guiding Principles<a name="guiding_principles"></a>
+## Core Guiding Principles<a name="guiding_principles"></a>
-## Transformations and Simplicity First<a name="transformations_first"></a>
+### Transformations and Simplicity First<a name="transformations_first"></a>
The purpose of the Linalg IR and its operations is primarily to:
- develop a set of key transformations, and
- make them correct by construction by carefully curating the set of
@@ -447,7 +446,7 @@ abstractions.
This is not merely a reimplementation of idea X in system Y: simplicity
**must be the outcome** of this introspection effort.
-## Preservation of Information<a name="information_preservation"></a>
+### Preservation of Information<a name="information_preservation"></a>
The last two decades have seen a proliferation of Domain-Specific Languages
(DSLs) that have been very successful in limited application domains.
The main commonality between these systems is their use of a significantly
@@ -460,7 +459,7 @@ when it is not irremediably lost.
These remarks, coupled with MLIR's suitability for defining IR at multiple
levels of abstraction led to the following 2 principles.
-### Declarative Specification: Avoid Raising<a name="declarative_specification"></a>
+#### Declarative Specification: Avoid Raising<a name="declarative_specification"></a>
Compiler transformations need static structural information (e.g. loop-nests,
graphs of basic blocks, pure functions etc). When that structural information
@@ -480,7 +479,7 @@ MLIR makes it easy to define op semantics declaratively thanks to the use of
regions and attributes. This is an ideal opportunity to define new abstractions
to convey user-intent directly into the proper abstraction.
-### Progressive Lowering: Don't Lose Information too Quickly<a name="#progressive_lowering"></a>
+#### Progressive Lowering: Don't Lose Information too Quickly<a name="#progressive_lowering"></a>
Lowering too quickly to affine, generic loops or CFG form reduces the
amount of structure available to derive transformations from. While
@@ -500,19 +499,19 @@ MLIR is designed as an infrastructure for ***progressive lowering***.
Linalg fully embraces this notion and thinks of codegen in terms of
*reducing a potential function*. That potential function is loosely
defined in terms of number of low-level instructions in a particular
-Linalg ops (i.e. how heavy or lightweight the Linalg op is).
+Linalg ops (i.e. how heavy or lightweight the Linalg op is).
Linalg-based codegen and transformations start from higher-level IR
ops and dialects. Then each transformation application reduces the
potential by introducing lower-level IR ops and *smaller* Linalg ops.
This gradually reduces the potential, all the way to Loops + VectorOps
and LLVMIR.
-## Composable and Declarative Transformations<a name="declarative_transformations"></a>
+### Composable and Declarative Transformations<a name="declarative_transformations"></a>
Complex and impactful transformations need not be hard to manipulate, write or
maintain. Mixing XLA-style high-level op semantics knowledge with generic
properties to describe these semantics, directly in MLIR, is a promising way to:
- Design transformations that are correct by construction, easy to
-write, easy to verify and easy to maintain.
+write, easy to verify and easy to maintain.
- Provide a way to specify transformations and the units of IR they manipulate
declaratively. In turn this allows using local pattern rewrite rules in MLIR
(i.e. [DRR](https://mlir.llvm.org/docs/DeclarativeRewrites/)).
@@ -522,10 +521,10 @@ and other enabling rewrites in a single pass. The result is a system where pass
fusion is very simple to obtain and gives hope to solving certain
[phase ordering issues](https://dl.acm.org/doi/10.1145/201059.201061).
-## Suitability for Search and Machine Learning<a name="ml"></a>
+### Suitability for Search and Machine Learning<a name="ml"></a>
Compiler heuristics are hand-crafted, human-engineered features: they are
ripe for disruption by machine-learning techniques.
-To enable search, compiler transformations should be fine-grained,
+To enable search, compiler transformations should be fine-grained,
[composable](#declarative_transformations) and expose tuning parameters that
can modify their behavior, guided by lessons from previous experience
with [Tensor Comprehensions](#lessonstc).
@@ -540,9 +539,9 @@ building cost models.
Still, this does not mean Linalg refuses cost models: instead we
prefer to invest in infrastructure that will enable [ML-based
techniques to automatically build cost
-models](http://homepages.inf.ed.ac.uk/hleather/publications/2009_autofeatures_cgo.pdf).
+models](http://homepages.inf.ed.ac.uk/hleather/publications/2009_autofeatures_cgo.pdf).
-## Extensibility and Future-Proofness<a name="future"></a>
+### Extensibility and Future-Proofness<a name="future"></a>
MLIR allows defining IR for structured control flow and structured
data types. We choose to take advantage of these properties for the
reasons described above.
@@ -561,12 +560,12 @@ While there is no concrete solution
today to solve these problems in MLIR, it is pretty clear that perfect
static knowledge and analyses will not be serious contenders for these problems.
-# Key Observations<a name="keyobservation"></a>
+## Key Observations<a name="keyobservation"></a>
The following key observations have influenced the design of Linalg and helped
reconcile [core guiding principles](#guiding_principles) with real-world
requirements when producing an implementation based on MLIR.
-## Algorithms + Data Structures = Programs<a name="data_and_compute"></a>
+### Algorithms + Data Structures = Programs<a name="data_and_compute"></a>
This is a twist on Niklaus Wirth's formulation but captures the essence of the
design of Linalg: control-flow does not exist in a vacuum, independently of
data.
@@ -580,21 +579,21 @@ certain transformations are better done:
occurred,
- as extensions to the Linalg dialect in terms of new ops or attributes.
-## The Dialect Need not be Closed Under Transformations<a name="dialect_not_closed"></a>
+### The Dialect Need not be Closed Under Transformations<a name="dialect_not_closed"></a>
This is probably the most surprising and counter-intuitive
observation. When one designs IR for transformations, closed-ness is
often a nonnegotiable property.
This is a key design principle of polyhedral IRs such as
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
-and
+and
[ISL-based IRs](https://en.wikipedia.org/wiki/Integer_set_library):
they are closed under affine transformations.
-In MLIR, multiple dialects coexist and form a coherent whole. After
+In MLIR, multiple dialects coexist and form a coherent whole. After
experimenting with
different alternatives, it became clear that strict
dialect closed-ness wasn't necessary and could be relaxed. Previous
systems did not have simple and principled means of building new IR
and probably suffered from this limitation. We conjecture this is a
-key reason they required the IR to be closed under transformations.
+key reason they required the IR to be closed under transformations.
Despite the fact that Linalg ops only allow perfectly nested
semantics, once tiling and fusion kick in, imperfectly nested loops
@@ -608,7 +607,7 @@ transformation would dictate that the potential remains constant.
In contrast, Linalg advocates for ***monotonicity*** under
transformations.
-## Summary of Existing Alternatives a Picture<a name="observationssummary"></a>
+### Summary of Existing Alternatives a Picture<a name="observationssummary"></a>
Lastly, we summarize our observations of lessons from [Prior
Art](#prior_art)---when viewed under the lens of our [Core Guiding
Principles](#guiding_principles)---with the following picture.
diff --git a/mlir/docs/Tutorials/Toy/Ch-3.md b/mlir/docs/Tutorials/Toy/Ch-3.md
index a535d1c95c65..fee947ff5fda 100644
--- a/mlir/docs/Tutorials/Toy/Ch-3.md
+++ b/mlir/docs/Tutorials/Toy/Ch-3.md
@@ -22,7 +22,7 @@ rule-based pattern-match and rewrite using table-driven
use of DRR requires that the operations be defined using ODS, as described in
[Chapter 2](Ch-2.md).
-# Optimize Transpose using C++ style pattern-match and rewrite
+## Optimize Transpose using C++ style pattern-match and rewrite
Let's start with a simple pattern and try to eliminate a sequence of two
transposes that cancel out: `transpose(transpose(X)) -> X`. Here is the
@@ -163,7 +163,7 @@ Perfect! No `transpose` operation is left - the code is optimal.
In the next section, we use DRR for pattern match optimizations associated with
the Reshape op.
-# Optimize Reshapes using DRR
+## Optimize Reshapes using DRR
Declarative, rule-based pattern-match and rewrite (DRR) is an operation
DAG-based declarative rewriter that provides a table-based syntax for
diff --git a/mlir/docs/Tutorials/Toy/Ch-6.md b/mlir/docs/Tutorials/Toy/Ch-6.md
index faa5bdc9b009..34b76008f163 100644
--- a/mlir/docs/Tutorials/Toy/Ch-6.md
+++ b/mlir/docs/Tutorials/Toy/Ch-6.md
@@ -7,7 +7,7 @@ In the [previous chapter](Ch-5.md), we introduced the
many of the `Toy` operations to affine loop nests for optimization. In this
chapter, we will finally lower to LLVM for code generation.
-# Lowering to LLVM
+## Lowering to LLVM
For this lowering, we will again use the dialect conversion framework to perform
the heavy lifting. However, this time, we will be performing a full conversion
@@ -54,7 +54,7 @@ Now that the lowering for the printf operation has been defined, we can specify
the components necessary for the lowering. These are largely the same as the
components defined in the [previous chapter](Ch-5.md).
-## Conversion Target
+### Conversion Target
For this conversion, aside from the top-level module, we will be lowering
everything to the LLVM dialect.
@@ -65,7 +65,7 @@ everything to the LLVM dialect.
target.addLegalOp<mlir::ModuleOp, mlir::ModuleTerminatorOp>();
```
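
The hunk above only carries the tail of the conversion-target snippet. As a reminder of what this section sets up, here is a minimal sketch of the target definition, assuming the `LLVMConversionTarget` helper and a pass-level `getContext()` as used elsewhere in this chapter:

```c++
// A sketch of the conversion target discussed above (names assumed from the
// MLIR APIs of roughly this revision). LLVMConversionTarget marks the LLVM
// dialect as legal; the module and its terminator are additionally allowed
// to survive the full conversion.
mlir::LLVMConversionTarget target(getContext());
target.addLegalOp<mlir::ModuleOp, mlir::ModuleTerminatorOp>();
```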
-## Type Converter
+### Type Converter
This lowering will also transform the MemRef types which are currently being
operated on into a representation in LLVM. To perform this conversion, we use a
@@ -79,7 +79,7 @@ enough for our use case.
LLVMTypeConverter typeConverter(&getContext());
```
-## Conversion Patterns
+### Conversion Patterns
Now that the conversion target has been defined, we need to provide the patterns
used for lowering. At this point in the compilation process, we have a
@@ -99,7 +99,7 @@ by relying on [transitive lowering](../../../getting_started/Glossary.md#transit
patterns.insert<PrintOpLowering>(&getContext());
```
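
To make the transitive-lowering point concrete, a sketch of how the full pattern set is typically assembled; the `populate*` helper names are assumed from roughly this revision and have since been renamed:

```c++
// A sketch of the pattern list used for the full lowering: reuse the existing
// affine->std, loop->std and std->LLVM pattern sets, then add the tutorial's
// own lowering for the printf-style operation.
mlir::OwningRewritePatternList patterns;
mlir::populateAffineToStdConversionPatterns(patterns, &getContext());
mlir::populateLoopToStdConversionPatterns(patterns, &getContext());
mlir::populateStdToLLVMConversionPatterns(typeConverter, patterns);
patterns.insert<PrintOpLowering>(&getContext());
```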
-## Full Lowering
+### Full Lowering
We want to completely lower to LLVM, so we use a `FullConversion`. This ensures
that only legal operations will remain after the conversion.
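
A hedged sketch of how the full conversion is then driven from a module-level lowering pass; the exact `applyFullConversion` signature has varied across MLIR revisions, so treat this as illustrative only:

```c++
// A sketch: apply the full conversion to the module. Any operation that is
// not legal per `target` after rewriting causes the conversion, and hence
// the pass, to fail.
mlir::ModuleOp module = getOperation();
if (mlir::failed(
        mlir::applyFullConversion(module, target, patterns, &typeConverter)))
  signalPassFailure();
```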
@@ -169,13 +169,13 @@ llvm.func @main() {
See [Conversion to the LLVM IR Dialect](../../ConversionToLLVMDialect.md) for
more in-depth details on lowering to the LLVM dialect.
-# CodeGen: Getting Out of MLIR
+## CodeGen: Getting Out of MLIR
At this point we are right at the cusp of code generation. We can generate code
in the LLVM dialect, so now we just need to export to LLVM IR and set up a JIT to
run it.
-## Emitting LLVM IR
+### Emitting LLVM IR
Now that our module consists only of operations in the LLVM dialect, we can
export to LLVM IR. To do this programmatically, we can invoke the following
@@ -270,7 +270,7 @@ int dumpLLVMIR(mlir::ModuleOp module) {
}
```
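
Only the closing brace of `dumpLLVMIR` is visible in this hunk. The core of that routine is a translation call; a sketch, assuming the single-argument `translateModuleToLLVMIR` entry point available around this revision (later versions also take an `llvm::LLVMContext`):

```c++
// A sketch of the body of dumpLLVMIR: translate the all-LLVM-dialect MLIR
// module into an llvm::Module and print it.
std::unique_ptr<llvm::Module> llvmModule = mlir::translateModuleToLLVMIR(module);
if (!llvmModule) {
  llvm::errs() << "Failed to emit LLVM IR\n";
  return -1;
}
llvm::errs() << *llvmModule << "\n";
return 0;
```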
-## Setting up a JIT
+### Setting up a JIT
Setting up a JIT to run the module containing the LLVM dialect can be done using
the `mlir::ExecutionEngine` infrastructure. This is a utility wrapper around LLVM's JIT.
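
A sketch of the JIT setup this section describes, assuming the `mlir::ExecutionEngine::create` and `mlir::makeOptimizingTransformer` helpers of roughly this revision:

```c++
// A sketch of running the lowered module through the MLIR ExecutionEngine
// (helper names assumed from roughly this revision).
llvm::InitializeNativeTarget();
llvm::InitializeNativeTargetAsmPrinter();

// An optional LLVM optimization pipeline applied to the translated module.
auto optPipeline = mlir::makeOptimizingTransformer(/*optLevel=*/0,
                                                   /*sizeLevel=*/0,
                                                   /*targetMachine=*/nullptr);
auto maybeEngine = mlir::ExecutionEngine::create(module, optPipeline);
if (!maybeEngine) {
  llvm::errs() << "Failed to construct an ExecutionEngine\n";
  return -1;
}
auto &engine = maybeEngine.get();

// Invoke the JIT-compiled `main` of the Toy module.
auto invocationResult = engine->invoke("main");
if (invocationResult) {
  llvm::errs() << "JIT invocation failed\n";
  return -1;
}
return 0;
```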