[PATCH] D75287: [mlir][GPU] Expose the functionality to create a gpu.GPUFuncOp from a gpu.GPULaunchOp

Mehdi AMINI via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 2 20:13:15 PST 2020


mehdi_amini added inline comments.


================
Comment at: mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp:77
+    Operation *operandOp = operand.getDefiningOp();
+    if (!operandOp || !isInliningBeneficiary(operandOp))
+      continue;
----------------
mravishankar wrote:
> mravishankar wrote:
> > mehdi_amini wrote:
> > > mravishankar wrote:
> > > > mehdi_amini wrote:
> > > > > herhut wrote:
> > > > > > ftynse wrote:
> > > > > > > mehdi_amini wrote:
> > > > > > > > This whole sinking transformation does not seem safe in general: this should check *legality* rather than "benefit". 
> > > > > > > > Also it isn't clear to me why this is done during the outlining and not as a pre-pass. The launch operation with the region abstraction seems perfectly suited to model this. I'd rather have this exposed in a separate API / as a separate step.
> > > > > > > > This whole sinking transformation does not seem safe in general: this should check *legality* rather than "benefit".
> > > > > > > 
> > > > > > > The function just seems misnamed; it should be something like `shouldSink`, because it mixes validity and benefit. In practice, it only returns `true` for `constant` and `dim` operations that don't have side effects.
> > > > > > > This whole sinking transformation does not seem safe in general: this should check *legality* rather than "benefit".
> > > > > > 
> > > > > > Well, it should check both. You do not want to move all legal operations either :)
> > > > > > 
> > > > > > > Also it isn't clear to me why this is done during the outlining and not as a pre-pass. The launch operation with the region abstraction seems perfectly suited to model this. I'd rather have this exposed in a separate API / as a separate step
> > > > > > 
> > > > > > This is purely for historical reasons. Not long ago, `gpu.launch` was closed from above, so this transformation was done when moving to the function form. I have a separate pass for this in a local client, which I can send out next week. It just needs tests.
> > > > > > 
> > > > > > It was implemented as a "post transformation" to the outlining and I would prefer if we do not mix it into the outlining transformation itself. When written separately, the transformations are trivial.
> > > > > > It was implemented as a "post transformation" to the outlining and
> > > > > 
> > > > > Pre-outlining seems easier to manage because it stays within a region instead of being inter-procedural (and it can also be kept a function pass).
> > > > > 
> > > > > > I would prefer if we do not mix it into the outlining transformation itself. When written separately, the transformations are trivial.
> > > > > 
> > > > > Seems like we're in agreement :)
> > > > A pre-pass is fine, but I think it would be better to leave it here. Eventually, it would be good if all transformations could be expressed as a pattern match and rewrite. This "outlining" is essentially converting a gpu.launchOp to a gpu.launchFuncOp. If you need a separate pass to sink the instructions, then it breaks the ability to go from loops -> GPU -> NVVM/SPIR-V. I am not saying anybody does this today (we are not doing this in IREE), but in general it seems like it would be beneficial to have transformations as patterns, and passes as just light-weight wrappers around patterns.
> > > > 
> > > > Re: being able to keep it as a function pass, this is related to where the gpu.module is created. As set up right now, it is put outside of the function that the gpu.launch operation lives in. That's a very specific choice, and it would be very useful to allow "clients" of the outlining to decide where to put the gpu.module.
> > > > If you need a separate pass to sink the instructions, then it breaks the ability to go from loops -> GPU -> NVVM/SPIR-V.
> > > 
> > > I don't understand what you mean; can you elaborate?
> > > 
> > > > but in general it seems like it would be beneficial to have transformations as patterns, and passes as just a light-weight wrapper around patterns.
> > > 
> > > This is mixing an optimization into an unrelated transformation: this just does not belong here, IMO.
> > > 
> > > > Re: being able to keep it as a function pass, this is related to where the gpu.module is created.
> > > 
> > > I don't know what you mean, or how it answers the point about the function pass right now.
> > Re: separate pass to sink instructions.
> > 
> > The Dialect conversion framework is designed to go from A -> B -> C. If I want to target SPIR-V/NVVM from the Linalg dialect via the Loop and GPU dialects (i.e. Linalg -> Loops -> GPU -> SPIR-V/NVVM), I can add all the patterns for the conversion into the dialect conversion framework. Currently, Loops to GPU is not exposed as a conversion pattern; GPU to SPIR-V is. Adding extra steps as a "pre-condition" will limit the ability to do the entire conversion using the dialect conversion framework (which is what it is built for). You could add the "sinking" as a canonicalization pattern, but it seems to me this sinking is useful only when the gpu.launch region is outlined to create a gpu.func operation. So doing the sinking during the conversion makes sense to me.
> > 
> > Re: function pass vs module pass
> > 
> > The current setup of the gpu.launch to gpu.launch_func conversion creates a gpu.module that is inserted just after the function the gpu.launch is in. This makes it a module pass, and this behavior is only relevant for the CUDA/NVVM side of things. For IREE, we are only interested in the device side for now. So we could make this a function pass if we could control where the gpu.module is inserted. See this [[ https://github.com/google/iree/pull/863/commits/5af3e8abe9ad628ca76998aab28fcba22b82baf7 | dependent PR ]] in IREE that uses this change and makes the conversion of gpu.launch to gpu.func a function pass.
> @mehdi_amini : Update on the function pass vs module pass. You were right that the outlining can only be a module pass, since the gpu.module also has a symbol and therefore needs to be added to a module (or an op with a symbol table). So I was wrong about that. I updated the PR shown above to be a module pass as well, but FYI there was no assert when I did it as a function pass.
> 
> Just filling in some details about the offline discussion. It is true that the sinking could be done as a pre-pass; if so, then it is a separate "pre-processing" pass. It is unclear if sinking can be expressed as a pattern.
(typed the following this morning before your last comment, but didn't click send)

> The Dialect conversion framework is designed to go from A -> B -> C. 

Yes, but if you just say this, you can shoehorn anything in there: you could use this mental model to go from Swift SIL to X86 assembly in a single "legalize()" call, and I don't think that is a good use of the framework.
We should use the lowering framework where it makes sense and where it is *the* way to solve a problem. If your problem fits into the pass pipeline, then why not start there? This is the most natural way of thinking about a chain of transformations.

> the ability to do the entire conversion using the dialect conversion framework (which is what it is built for).

I disagree that this is what it is built for. I think this is a misconception of what the framework is intended to solve.
If you can express a pipeline where you want to go A->C as a logical sequence of A->B and then B->C, where B is an "interesting level of abstraction", I believe these should be separate passes.

If we take, for example, the `HLO -> SPIR-V` pipeline, we can likely identify logical stages like `HLO -> Loops`, `Loops -> GPU Kernel`, and `GPU Kernel -> SPIR-V`. These stages are fully disjoint as far as I can tell, and there is no immediate benefit to combining them in a single lowering. On the contrary: if these stages are well separated, this provides the opportunity for passes to run on each intermediate level of abstraction (including some generic things like canonicalization or CSE), and it also allows more reusable building blocks (`Loops -> GPU Kernel` can be reused even when you don't come from HLO). It also forces testing at every level, and it helps compiler engineers keep a mental model where we can reason about these stages and how they compose independently.
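
To make that concrete, here is a rough sketch of what such a staged pipeline could look like. The three `create*Pass()` constructors for the conversion stages are placeholder names I'm using for illustration (only the canonicalizer and CSE passes are existing generic MLIR passes):

```
#include <memory>

#include "mlir/Pass/Pass.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

// Placeholder constructors for the three conversion stages; the actual
// names and factory functions are assumptions for the sake of the example.
std::unique_ptr<mlir::Pass> createLowerHLOToLoopsPass();
std::unique_ptr<mlir::Pass> createConvertLoopsToGPUPass();
std::unique_ptr<mlir::Pass> createConvertGPUToSPIRVPass();

// Each stage is its own pass, so generic passes (canonicalization, CSE)
// can run on every intermediate level of abstraction.
void buildHLOToSPIRVPipeline(mlir::PassManager &pm) {
  pm.addPass(createLowerHLOToLoopsPass());      // HLO -> Loops
  pm.addPass(mlir::createCanonicalizerPass());  // cleanup at the Loops level
  pm.addPass(createConvertLoopsToGPUPass());    // Loops -> GPU kernel
  pm.addPass(mlir::createCSEPass());            // cleanup at the GPU level
  pm.addPass(createConvertGPUToSPIRVPass());    // GPU kernel -> SPIR-V
}
```

With a setup like this, the `Loops -> GPU Kernel` block can be reused in a pipeline that does not start from HLO, and each stage can be tested in isolation with its own FileCheck tests.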




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75287/new/

https://reviews.llvm.org/D75287




