[PATCH] D73893: [MLIR][GPU] Implement initial mapping from loop.parallel to gpu.launch.

Stephan Herhut via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Feb 4 04:52:16 PST 2020


herhut added a comment.

In D73893#1855737 <https://reviews.llvm.org/D73893#1855737>, @ftynse wrote:

> I suppose you want high-level feedback on this, so I'm not nitpicking in the code.


Indeed. Thanks for the feedback.

> I think this goes into the right direction, but you are right that this should not be the final design. Using pattern rewrites looks like the right thing to do. I would consider having a pattern and/or a utility function that a future driver can call and have a parallel loop mapped to GPU.

I have not written code yet to produce the mapping attributes, but a greedy mapper will probably be the first thing I'll do.

> We can have test passes where the application is controlled by attributes, but we can also have more automated approaches where we would, e.g., greedily map the outermost parallel loop to GPUs. My thinking is that we should eventually have a tree-like loop structure on which we can decide where to map to blocks/threads/do promotions.

I agree. To do any meaningful mapping one needs to see the whole tree. Also, the heuristic that does the tiling might already know how it wants things to be mapped. Likewise, if you make loads into shared memory explicit, for example, you would know that these have to be mapped to thread groups (like warps). So maybe one wants to specify the mapping at that point already, which is why I thought attributes are a good way to model this.

> That being said, I am not 100% sure we actually want to materialize the decisions as attributes on the operations, or at least as attributes that operations know about.

I am strongly of the opinion that the mapping should be attributes so that the producer and consumer of the mapping decisions are decoupled. I agree that they should be invisible to the op; I just added them to the `.td` file temporarily for documentation. The attribute is already optional.
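
As a rough sketch of what I mean (the attribute name and its encoding here are placeholders, not necessarily what the patch ends up using), a mapped loop could look like:

  // Hypothetical sketch: a loop.parallel annotated by some producer with a
  // mapping attribute that a later rewrite consumes; the op itself does not
  // need to know about the attribute.
  loop.parallel (%i, %j) = (%c0, %c0) to (%n, %m) step (%c1, %c1) {
    %v = load %in[%i, %j] : memref<?x?xf32>
    store %v, %out[%i, %j] : memref<?x?xf32>
  } {mapping = ["block_x", "thread_x"]}  // placeholder encoding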

> One thing to keep in mind, nested parallel loops may be tricky, in particular you may need synchronizations inside the outermost parallel loop and make sure they are called by all threads in a "convergent" fashion. For the first take, I'd map one parallel loop construct or perfectly nested constructs and extend from there.

There is a TODO in there that nesting is only supported if the instructions up to the innermost nest are side-effect free. Otherwise we will need predication so that only a single hardware thread materializes the side effects and, as you state, a barrier if other iterations of nested loops depend on that code. Let's start with the simple case, though.
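
To make the unsupported case concrete, a hypothetical example (names and types made up) would be a store sitting between the two nests:

  loop.parallel (%i) = (%c0) to (%n) step (%c1) {
    // Side effect between the nests: after mapping, this store would have to
    // be predicated so that only one hardware thread executes it ...
    store %v, %buf[%i] : memref<?xf32>
    // ... and a barrier would be needed here if the inner iterations read %buf.
    loop.parallel (%j) = (%c0) to (%m) step (%c1) {
      %x = load %buf[%i] : memref<?xf32>
      store %x, %out[%i, %j] : memref<?x?xf32>
    }
  }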



================
Comment at: mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir:164
+// CHECK:           gpu.launch blocks([[VAL_96:%.*]], [[VAL_97:%.*]], [[VAL_98:%.*]]) in ([[VAL_99:%.*]] = [[VAL_93]], [[VAL_100:%.*]] = [[VAL_94]], [[VAL_101:%.*]] = [[VAL_93]]) threads([[VAL_102:%.*]], [[VAL_103:%.*]], [[VAL_104:%.*]]) in ([[VAL_105:%.*]] = [[VAL_93]], [[VAL_106:%.*]] = [[VAL_95]], [[VAL_107:%.*]] = [[VAL_93]]) {
+// CHECK:             [[VAL_108:%.*]] = addi [[VAL_97]], [[VAL_84]] : index
+// CHECK:             loop.for [[VAL_109:%.*]] = [[VAL_85]] to [[VAL_87]] step [[VAL_92]] {
----------------
nicolasvasilache wrote:
> Could we use capture names such as `TIDX`, `BIDY` or something similar that would help "see" the mapping better? 
> 
Will do once this has stabilized a bit. For now, the tests are fully auto-generated.
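
For reference, a hand-written version of the capture above could look like this (the names are only illustrative):

  // CHECK: gpu.launch blocks([[BIDX:%.*]], [[BIDY:%.*]], [[BIDZ:%.*]]) in ({{.*}}) threads([[TIDX:%.*]], [[TIDY:%.*]], [[TIDZ:%.*]]) in ({{.*}})
  // CHECK:   [[IDX:%.*]] = addi [[BIDY]], {{.*}} : index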


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D73893/new/

https://reviews.llvm.org/D73893




