[PATCH] D73893: [MLIR][GPU] Implement initial mapping from loop.parallel to gpu.launch.

Stephan Herhut via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Feb 11 06:19:28 PST 2020


herhut added a comment.

In D73893#1865719 <https://reviews.llvm.org/D73893#1865719>, @mravishankar wrote:

> (Sorry for jumping on this late)
>
> This is now adding three different lowerings of loops to GPU:
>
> 1. From perfectly nested loops to GPU
> 2. From imperfectly nested loops to GPU
> 3. Parallel loops to GPU
>
> I confess I contributed to one of them (2). We should probably have a plan to consolidate these. This is too much duplication of functionality. We should probably deprecate (2) in favor of this change and adapt all uses accordingly.


All but 3 should probably go away unless someone invests in the dependency analysis needed to make 1 and 2 safe.



================
Comment at: mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp:504
+std::tuple<unsigned, AffineMap, AffineMap>
+extractMapAndOperand(Attribute attribute) {
+  DictionaryAttr dict = attribute.dyn_cast<DictionaryAttr>();
----------------
mravishankar wrote:
> If this is really a dictionary_attribute, this lookup might be simplified using a StructAttr
I had a struct attribute before but I did not want to specify the attribute on the `loop.parallel` operation, as it is GPU dialect specific. So I went with an unspecified optional attribute with custom accessor.
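
For illustration, such a per-dimension dictionary entry might look like the following; the key names (`processor`, `map`, `bound`) are my guesses inferred from the accessor's return type, not necessarily what the patch uses:

```mlir
// Hypothetical per-dimension mapping entry (key names are illustrative):
// processor selects the hardware dimension, map transforms the induction
// variable, bound transforms the loop bound into a launch bound.
{processor = 0,
 map = affine_map<(d0) -> (d0)>,
 bound = affine_map<(d0) -> (d0)>}
```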


================
Comment at: mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp:521
+  // TODO(herhut): Verify that this is a valid GPU mapping.
+  // processor ids: 0-2 -> block [x/y/z], 3-5 -> thread [x/y/z], 6 -> sequential
+  ArrayAttr mapping = parallelOp.getAttrOfType<ArrayAttr>("mapping");
----------------
ftynse wrote:
> Should we use string attributes instead? E.g. having `mapping = ["thread-0", "thread-1", "thread-2", "block-0", "block-1", "block-2", "seq"]` ?
I wanted to stay with processor ids for now, as that is also what @nicolasvasilache was using. 


================
Comment at: mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp:522
+  // processor ids: 0-2 -> block [x/y/z], 3-5 -> thread [x/y/z], 6 -> sequential
+  ArrayAttr mapping = parallelOp.getAttrOfType<ArrayAttr>("mapping");
+  // TODO(herhut): Support reductions.
----------------
mravishankar wrote:
> I am just coming up to speed on the implementation of loop.parallel, but in the ODS I don't see any attribute for "mapping". When is this added?
This is added by producers of mappings that want to lower to the GPU. I have a `GreedyMapper` (I will send that out shortly) that essentially implements what the existing helper functions did.
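
As a sketch of what such a producer might emit (attribute layout and key names are illustrative, not taken from the patch): the first induction variable is mapped to block x (id 0), the second to thread x (id 3).

```mlir
// Hypothetical annotated loop.parallel; a mapping producer attaches the
// "mapping" array attribute, one entry per induction variable.
loop.parallel (%i, %j) = (%c0, %c0) to (%n, %m) step (%c1, %c1) {
  // ... body ...
} {mapping = [
     {processor = 0, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
     {processor = 3, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}]}
```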


================
Comment at: mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp:552
+
+    if (processor < gpu::LaunchOp::kNumConfigOperands) {
+      // Use the corresponding thread/grid index as replacement for the loop iv.
----------------
rriddle wrote:
> This function is quite large. Can you split it up a bit?
A lot of it is comments, but I have moved some things out.


================
Comment at: mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp:582
+            !isa<ConstantOp>(upperBound.getDefiningOp())) {
+          // TODO(herhut): Is there a helper in affine for this?
+          if (AffineMinOp minOp =
----------------
ftynse wrote:
> We should have patterns to fold constants into affine maps (in canonicalization), eg
> 
> %0 = constant 42 : index
> %1 = affine.min affine_map<(d0, d1) -> (d0, d1)>(%0, %arg0)
> 
> should fold into
> 
> %1 = affine.min affine_map<(d0) -> (42, d0)>(%arg0)
> 
> LMK if it's not the case.
So you are saying that by running with -canonicalize I should see the affine.min change? I could not reproduce this.


================
Comment at: mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp:671
+    // Now walk over the body and clone it.
+    // TODO: This is only correct if there either is no further loop.parallel
+    //       nested or this code is side-effect free. Otherwise we might need
----------------
mravishankar wrote:
> This comment seems outdated now.
No, it is unfortunately still true. If there are nested `loop.parallel` ops, the code leading up to them has to be side-effect free, as we are essentially replicating it per thread. I will add some checking in a later version.
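
A minimal sketch of the hazard (loop structure and ops are illustrative): after mapping, everything before the inner `loop.parallel` is re-executed by each inner thread.

```mlir
loop.parallel (%i) = (%c0) to (%n) step (%c1) {
  // This store precedes the nested parallel loop, so after the mapping
  // every thread of the inner loop replicates it. Because it has a side
  // effect, the transformation would be incorrect here.
  store %v, %buf[%i] : memref<?xf32>
  loop.parallel (%j) = (%c0) to (%m) step (%c1) {
    // ... inner body ...
  }
}
```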


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D73893/new/

https://reviews.llvm.org/D73893




