[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules

Thu Jul 30 05:57:05 PDT 2020

[off topic] I'm not a fan of the "reply-to-list" default.

Thanks for the feedback! More below.

On 7/30/20 6:01 AM, David Chisnall via llvm-dev wrote:
> On 28/07/2020 07:00, Johannes Doerfert via llvm-dev wrote:
>> TL;DR
>> -----
>>
>> Let's allow merging two LLVM-IR modules for different targets (with
>> compatible data layouts) into a single LLVM-IR module to facilitate
>> host-device code optimizations.
>
> I think it's worth taking a step back here and thinking through the 
> problem.  The proposed solution makes me nervous because it is quite a 
> significant change to the compiler flow that comes from thinking of 
> heterogeneous optimisation as a fat LTO problem, when to me it feels 
> more like a thin LTO problem.
>
> At the moment, there's an implicit assumption that everything in a 
> Module will flow to the same CodeGen back end.  It can make global 
> assumptions about cost models, can inline everything, and so on.
>
FWIW, I would expect that we split the module *before* the codegen stage 
such that the back end doesn't have to deal with heterogeneous modules 
(right now).
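
To illustrate what I mean by splitting before codegen, here is a toy 
sketch in Python. It is not LLVM API: the `Function` class, its per-function 
`target` field, and `split_by_target` are all made up for illustration, 
and how the target would actually be recorded in the IR is an open 
question.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Function:
    name: str
    target: str  # hypothetical: target triple attached to each function


def split_by_target(functions):
    """Group the functions of one mixed module into per-target modules."""
    modules = defaultdict(list)
    for fn in functions:
        modules[fn.target].append(fn.name)
    return dict(modules)


mixed = [
    Function("main", "x86_64-unknown-linux-gnu"),
    Function("launch_kernel", "x86_64-unknown-linux-gnu"),
    Function("kernel", "nvptx64-nvidia-cuda"),
]
per_target = split_by_target(mixed)
```

Each resulting group would then go to its own back end, so the back 
ends keep seeing single-target input.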

I'm not sure about cost models and such though. As far as I know, we 
don't make global decisions anywhere, but I might be wrong. Put 
differently, I hope we don't make global decisions, as it seems quite 
easy to disturb the result with unrelated code changes.

> It sounds as if we have a couple of use cases:
>
>  - Analysis flow between modules
>  - Transforms that modify two modules
>
Yes! Notably the first bullet is bi-directional and cyclic ;)

> The first case is the motivating example of constant propagation. 
> Here the right approach feels like something like ThinLTO, where you 
> can collect in one module the fact that a kernel is invoked only with 
> specific constant arguments in the host module and consume that result 
> in the target module.
>
Except that you can have cyclic dependencies, which makes this 
problematic again. You might not propagate constants from the device 
module to the host one, but whether memory is only read/written on the 
device is very interesting on the host side: you can avoid memory 
copies, remove globals, etc. That is just what comes to mind right away. 
The proposed heterogeneous modules should not limit you to "monolithic 
LTO", or "thin LTO" for that matter.
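
A toy sketch of the two directions, in Python. Nothing here is LLVM API: 
the call-site tuples, buffer names, and both helper functions are 
hypothetical and only model the kind of facts that would flow each way.

```python
def propagate_constants(call_sites):
    """Host -> device: a kernel parameter is constant if all calls agree."""
    seen = {}
    for kernel, args in call_sites:
        for idx, value in enumerate(args):
            seen.setdefault((kernel, idx), set()).add(value)
    return {key: next(iter(vals)) for key, vals in seen.items() if len(vals) == 1}


def removable_copy_backs(copy_backs, device_writes):
    """Device -> host: copying a buffer back is dead if no kernel writes it."""
    return [buf for buf in copy_backs if buf not in device_writes]


# Facts gathered on the host side (assumed shape: kernel name, argument tuple).
host_call_sites = [("kernel", (128, 1)), ("kernel", (128, 2))]
host_copy_backs = ["a", "b"]
# Fact gathered on the device side: the set of buffers some kernel writes.
device_writes = {"b"}

constant_params = propagate_constants(host_call_sites)
dead_copies = removable_copy_backs(host_copy_backs, device_writes)
```

In this model the device learns that parameter 0 of `kernel` is always 
128, and the host learns that the copy back of `a` can go away. Either 
result can enable more of the other, which is where the cycle comes 
from.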

> The second example is what you'd need for things like kernel fusion, 
> where you need to both combine two kernels in the target module and 
> also modify the callers to invoke the single kernel and skip some data 
> flow. For this, you need a kind of pass that can work over things that 
> begin in two modules.
>
Right. Splitting, fusing, moving code, etc. all require you to modify 
both modules at the same time. Even if you only modify one module, you 
want information from both, either direction.

> It seems that a less invasive change would be:
>
>  - Use ThinLTO metadata for the first case, extend it as required.
>  - Add a new kind of ModuleSetPass that takes a set of Modules and is 
> allowed to modify both.
>
> This avoids any modifications for the common (single-target) case, but 
> should give you the required functionality.  Am I missing something?
>
This is similar to what Renato suggested early on. In addition to the 
"ThinLTO metadata" inefficiencies outlined above, the problem I have 
with the second part is that it requires writing completely new passes 
in a different style than anything we have. It is certainly a 
possibility but we can probably do it without any changes to the 
infrastructure.
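
As a rough illustration of the "existing pass style" argument, here is a 
toy Python model, not LLVM API. A module is just a dict from function 
name to callees, kernel launches are modeled as direct calls (an 
assumption made for brevity), and `remove_dead_functions` is a 
hypothetical ordinary single-module pass.

```python
def remove_dead_functions(module, roots):
    """An ordinary single-module pass: keep functions reachable from roots."""
    live, work = set(), list(roots)
    while work:
        fn = work.pop()
        if fn in live or fn not in module:
            continue
        live.add(fn)
        work.extend(module[fn])
    return {fn: callees for fn, callees in module.items() if fn in live}


host = {"main": ["kernel_a"], "unused_host": ["kernel_b"]}
device = {"kernel_a": [], "kernel_b": []}

# The merged module is still "one module" as far as the pass is concerned.
merged = {**host, **device}
pruned = remove_dead_functions(merged, roots=["main"])
```

The pass was written without any notion of multiple targets, yet on the 
merged module it drops `kernel_b`, which is only dead because of 
information from the host side.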

In addition to the analysis/optimization infrastructure reasons, I would 
like to point out that this would make our toolchains a lot simpler. We 
have some embedding of device code in host code right now (on every 
level) and things like LTO for all offloading models would become much 
easier if we distribute the heterogeneous modules instead of yet another 
embedding. I might be biased by the way "clang offload bundler" is used 
right now for OpenMP, HIP, etc. but I would very much like to replace 
that with a "clean" toolchain that performs as much LTO as possible, at 
least for the accelerator code.

I hope this makes some sense, feel free to ask questions :)

~ Johannes

> David