[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Thu Mar 3 08:20:09 PST 2016

On Thu, Mar 3, 2016 at 10:19 PM, Ronan Keryell <ronan at keryell.fr> wrote:
>>>>>> On Thu, 3 Mar 2016 18:19:43 +0700, C Bergström via cfe-dev <cfe-dev at lists.llvm.org> said:
>
>     C> On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via cfe-dev
>     C> <cfe-dev at lists.llvm.org> wrote:
>
>     >> Just to be sure to understand: you are thinking about being able
>     >> to outline several "languages" at once, such as CUDA *and*
>     >> OpenMP, right ?
>     >>
>     >> I think it is required for serious applications. For example, in
>     >> the HPC world, it is common to have hybrid multi-node
>     >> heterogeneous applications that use MPI+OpenMP+OpenCL for
>     >> example. Since MPI and OpenCL are just libraries, there is only
>     >> OpenMP to off-load here. But if we move to OpenCL SYCL instead
>     >> with MPI+OpenMP+SYCL then both OpenMP and SYCL have to be managed
>     >> by the Clang off-loading infrastructure at the same time and be
>     >> sure they combine gracefully...
>     >>
>     >> I think your second proposal about (un)bundling can already
>     >> manage this.
>     >>
>     >> Otherwise, what about the code outlining itself used in the
>     >> off-loading process? The code generation itself requires to
>     >> outline the kernel code to some external functions to be compiled
>     >> by the kernel compiler. Do you think it is up to the programmer
>     >> to re-use the recipes used by OpenMP and CUDA for example or it
>     >> would be interesting to have a third proposal to abstract more
>     >> the outliner to be configurable to handle globally OpenMP, CUDA,
>     >> SYCL...?
>
>     C> Some very good points above and back to my broken record..
>
>     C> If all offloading is done in a single unified library -
>     C> a. Lowering in LLVM is greatly simplified since there's ***1***
>     C> offload API to be supported. A region that's outlined for SYCL,
>     C> CUDA or something else is essentially the same thing. (I do
>     C> realize that some transformation may be highly target specific,
>     C> but to me that's more target hw driven than programming model
>     C> driven)
>
>     C> b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since the
>     C> same runtime will handle them all. (With the limitation that if
>     C> you want CUDA to *talk to* OMP or something else there needs to
>     C> be some glue.  I'm merely saying that 1 application can use
>     C> multiple models in a way that won't conflict)
>
>     C> c. The driver doesn't need to figure out do I link against some
>     C> or a multitude of combining/conflicting libcuda, libomp,
>     C> libsomething - it's liboffload - done
>
> Yes, a unified target library would help.
>
>     C> The driver proposal and the liboffload proposal should imnsho be
>     C> tightly coupled and work together as *1*. The goals are
>     C> significantly overlapping and relevant. If you get the liboffload
>     C> OMP people to make that more agnostic - I think it simplifies the
>     C> driver work.
>
> So basically it is about introducing a fourth unification: liboffload.
>
> A great unification sounds great.
> My only concern is that if we tie everything together, it would increase
> the entry cost: all the different components should be ready in
> lock-step.
> If there is already a runtime available, it would be easier to start
> with and develop the other part in the meantime.
> So from a pragmatic agile point-of-view, I would prefer not to impose a
> strong unification.

I think I may not be explaining this clearly - let me elaborate with an
example below.

> In the proposal of Samuel, all the parts seem independent.
>
>     C>   ------ More specific to this proposal - device
>     C> linker vs host linker. What do you do for IPA/LTO or whole
>     C> program optimizations? (Outside the scope of this project.. ?)
>
> Ouch. I did not think about it. It sounds like science-fiction for
> now. :-) Probably outside the scope of this project..

It should certainly not be science fiction or an afterthought. I
won't go into shameless self-promotion, but there are certainly useful
things you can do when you have a "whole device kernel" perspective.

To digress into the liboffload component of this (sorry): what we have
today is basically liboffload/src/ with all the source files mucked
together.

What I'm proposing would look more like this:

liboffload/src/common_middle_layer_glue # to start this may be "best effort"
liboffload/src/omp     # this code should exist today, but ideally should
                       # build on top of the middle layer
liboffload/src/ptx     # this may exist today - not sure
liboffload/src/amd_gpu # probably doesn't exist, but wouldn't/shouldn't
                       # block anything
liboffload/src/phi     # may exist in some form
liboffload/src/cuda    # may exist in some form outside of the OMP work

The end result would be liboffload.

The code above the common middle-layer API is programming-model
specific; the code below it is hardware specific. To add a new hw
backend you just implement the things the middle layer needs. To add a
new programming model you build on top of the common layer. I'm not
trying to force anyone/everyone to switch to this now - I'm hoping that
by being a squeaky wheel this isolation of design and layers is there
from the start - even if not perfect. I actually think it's sloppy not
to consider this. LLVM's code generation is clean and has a nice
separation per target (for the most part) - why should the offload
library have a bad design which just needs to be refactored later? I've
seen others in the community beat up Intel to force them to have higher
quality code before inclusion... some of this may actually be just minor
refactoring to come close to the target. (No pun intended)
-------------
If others become open to this design - I'm happy to contribute more
tangible details on the actual middle API.
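
As a rough illustration of the layering described above, here is a
minimal C++ sketch. It is not an existing liboffload API: every name in
it (DeviceBackend, OffloadRuntime, HostFallbackBackend, ompTargetRegion,
cudaStyleLaunch) is hypothetical and only shows the shape of the idea.
Hardware backends implement one small interface below the middle layer,
and each programming model is a thin front end that calls into it from
above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <new>
#include <string>
#include <utility>
#include <vector>

// Hypothetical middle-layer interface. One implementation per hardware
// target (ptx, amd_gpu, phi, ...) would live below this line.
class DeviceBackend {
public:
  virtual ~DeviceBackend() = default;
  virtual std::string name() const = 0;
  virtual void *alloc(std::size_t bytes) = 0;
  virtual void release(void *ptr) = 0;
  virtual bool launch(const std::string &kernel, void *args) = 0;
};

// Hypothetical common glue: a registry of backends keyed by name.
class OffloadRuntime {
public:
  void registerBackend(std::unique_ptr<DeviceBackend> backend) {
    assert(backend && "backend must not be null");
    std::string key = backend->name();
    backends_[key] = std::move(backend);
  }

  // Returns nullptr when no backend with that name was registered.
  DeviceBackend *find(const std::string &name) const {
    auto it = backends_.find(name);
    return it == backends_.end() ? nullptr : it->second.get();
  }

private:
  std::map<std::string, std::unique_ptr<DeviceBackend>> backends_;
};

// Stand-in backend that only records launches on the host, so the
// sketch runs without any real device.
class HostFallbackBackend : public DeviceBackend {
public:
  std::string name() const override { return "host"; }
  void *alloc(std::size_t bytes) override { return ::operator new(bytes); }
  void release(void *ptr) override { ::operator delete(ptr); }
  bool launch(const std::string &kernel, void *) override {
    launched.push_back(kernel);
    return true;
  }
  std::vector<std::string> launched;
};

// OpenMP-style front end: maps a buffer, runs the outlined region,
// then unmaps. Fails if the requested device has no backend.
bool ompTargetRegion(OffloadRuntime &rt, const std::string &device,
                     const std::string &outlinedFn) {
  DeviceBackend *backend = rt.find(device);
  if (!backend)
    return false;
  void *mapped = backend->alloc(64);
  bool ok = backend->launch(outlinedFn, mapped);
  backend->release(mapped);
  return ok;
}

// CUDA-style front end: the caller manages memory, so this only
// forwards the launch through the same middle layer.
bool cudaStyleLaunch(OffloadRuntime &rt, const std::string &device,
                     const std::string &kernel, void *args = nullptr) {
  DeviceBackend *backend = rt.find(device);
  return backend ? backend->launch(kernel, args) : false;
}
```

In this sketch, adding a new hardware target means writing one more
DeviceBackend subclass and registering it, and adding a new programming
model means writing one more front-end function; neither touches the
other side of the middle layer.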

The objects which the driver has to deal with may, and probably do,
overlap to some extent with the objects that liboffload has to load or
deal with. Is there an API the driver can hook into to magically
handle that, or is it all per-device and one-off?