[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules

Tue Jul 28 20:26:51 PDT 2020

On 7/28/20 6:13 PM, Renato Golin wrote:
> On Tue, 28 Jul 2020 at 21:52, Johannes Doerfert
> <johannesdoerfert at gmail.com> wrote:
>> Let's take OpenMP.
>> The compiler cannot know what your memory actually is because types are,
>> you know, just hints for the most part. So we need the devices to match
>> the host data layout wrt. padding, alignment, etc. or we could not copy
>> an array of structs from one to the other and expect it to work. CUDA,
>> HIP, SYCL, ... should all be the same. I hope someone corrects me if I
>> have some misconceptions here :)
> All those programming models have already been made to inter-work with
> CPUs like that. So, if we take the conscious decision that
> accelerators' drivers must implement that transparent layer in order
> to benefit from LLVM IR's multi-DL, fine.
>
> I have no stakes in any particular accelerator, but we should make it
> clear that they must implement that level of transparency to use this
> feature of LLVM IR.

Yes. Whatever we do, it should be clear what requirements there are
for you to create a multi-target module. We can probably even verify
some of them, like the direct call edge thing.
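
As a rough illustration of what verifying the "no direct call edge"
requirement could look like, here is a toy sketch in Python. It does not
use any LLVM API. The function-to-triple map and the call list are
hypothetical stand-ins for what a real verifier would read out of a
multi-target module.

```python
from typing import Dict, List, Tuple


def cross_target_calls(
    target_of: Dict[str, str],
    calls: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return every direct call whose caller and callee have different targets."""
    return [
        (caller, callee)
        for caller, callee in calls
        if target_of[caller] != target_of[callee]
    ]


# Hypothetical module contents, for illustration only.
target_of = {
    "main": "x86_64-unknown-linux-gnu",
    "helper": "x86_64-unknown-linux-gnu",
    "kernel": "nvptx64-nvidia-cuda",
}
calls = [("main", "helper"), ("main", "kernel")]

violations = cross_target_calls(target_of, calls)
print(violations)  # [('main', 'kernel')]
```

A verifier along these lines would reject the module whenever the
returned list is non-empty.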

>> The "important" part is there is no direct call edge between the two
>> modules.
> Right! This makes it a lot simpler. We just need to annotate each
> global symbol with the right DL and trust that the lowering was done
> properly.
>
> What about optimisation passes? GPU code skips most of the CPU
> pipeline not to break codegen later on, but AFAIK, this is done by
> registering a new pass manager.

That is an interesting point. We could arguably teach the (new) PM to run
different pipelines for the different devices. FWIW, I'm not even sure
we do that right now, e.g., for CUDA compilation. [long live uniformity!]
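
To make the "different pipelines for the different devices" idea a bit
more concrete, here is a toy sketch in Python. It is not how the LLVM
pass manager works and uses no LLVM API. The pass names, the pipeline
table, and the string-based "function state" are all assumptions made
up for illustration.

```python
from typing import Callable, Dict, List

# A "pass" here is just a function that transforms a string standing in
# for a function body. This is a hypothetical model, not an LLVM pass.
Pass = Callable[[str], str]


def host_opt(body: str) -> str:
    return body + "+host-opt"


def device_opt(body: str) -> str:
    return body + "+device-opt"


def run_per_target(
    target_of: Dict[str, str],
    pipelines: Dict[str, List[Pass]],
) -> Dict[str, str]:
    """Run each function only through the pipeline registered for its triple."""
    results = {}
    for fn, triple in target_of.items():
        state = fn
        for p in pipelines[triple]:
            state = p(state)
        results[fn] = state
    return results


# Hypothetical module contents and pipeline table, for illustration only.
target_of = {
    "main": "x86_64-unknown-linux-gnu",
    "kernel": "nvptx64-nvidia-cuda",
}
pipelines = {
    "x86_64-unknown-linux-gnu": [host_opt],
    "nvptx64-nvidia-cuda": [device_opt],
}

results = run_per_target(target_of, pipelines)
print(results)  # {'main': 'main+host-opt', 'kernel': 'kernel+device-opt'}
```

The point of keying the table on the triple is that device code never
enters the host pipeline, and vice versa.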

> We'd need to teach passes (or the pass manager) to not throw
> accelerator code into the CPU pipeline and vice-versa.

What do you mean by accelerator code? Intrinsics, vector length,
etc. should be controlled by the triple, so that should be handled.

~ Johannes

>
> --renato