[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Thu Mar 3 12:03:37 PST 2016

Hi Chris,

I agree with Andrey when he says this should be a separate discussion.

I think that aiming for a library that supports every possible
programming model would take a long time, as it requires broad consensus,
particularly from those maintaining the programming models already in clang
(e.g. CUDA). We should aim for something incremental.

I'm happy to discuss and learn more about the design and code you would like
to contribute to this, but I think you should post it in a separate thread.

Thanks,
Samuel

2016-03-03 11:20 GMT-05:00 C Bergström <cfe-dev at lists.llvm.org>:

> On Thu, Mar 3, 2016 at 10:19 PM, Ronan Keryell <ronan at keryell.fr> wrote:
> >>>>>> On Thu, 3 Mar 2016 18:19:43 +0700, C Bergström via cfe-dev <
> cfe-dev at lists.llvm.org> said:
> >
> >     C> On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via cfe-dev
> >     C> <cfe-dev at lists.llvm.org> wrote:
> >
> >     >> Just to be sure to understand: you are thinking about being able
> >     >> to outline several "languages" at once, such as CUDA *and*
> >     >> OpenMP, right ?
> >     >>
> >     >> I think it is required for serious applications. For example, in
> >     >> the HPC world, it is common to have hybrid multi-node
> >     >> heterogeneous applications that use MPI+OpenMP+OpenCL for
> >     >> example. Since MPI and OpenCL are just libraries, there is only
> >     >> OpenMP to off-load here. But if we move to OpenCL SYCL instead
> >     >> with MPI+OpenMP+SYCL then both OpenMP and SYCL have to be managed
> >     >> by the Clang off-loading infrastructure at the same time and be
> >     >> sure they combine gracefully...
> >     >>
> >     >> I think your second proposal about (un)bundling can already
> >     >> manage this.
> >     >>
> >     >> Otherwise, what about the code outlining itself used in the
> >     >> off-loading process? The code generation itself requires
> >     >> outlining the kernel code into external functions to be compiled
> >     >> by the kernel compiler. Do you think it is up to the programmer
> >     >> to re-use the recipes used by OpenMP and CUDA for example or it
> >     >> would be interesting to have a third proposal to abstract more
> >     >> the outliner to be configurable to handle globally OpenMP, CUDA,
> >     >> SYCL...?
> >
> >     C> Some very good points above and back to my broken record..
> >
> >     C> If all offloading is done in a single unified library -
> >     C> a. Lowering in LLVM is greatly simplified since there's ***1***
> >     C> offload API to be supported. A region that's outlined for SYCL,
> >     C> CUDA or something else is essentially the same thing. (I do
> >     C> realize that some transformation may be highly target specific,
> >     C> but to me that's more target hw driven than programming model
> >     C> driven)
> >
> >     C> b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since the
> >     C> same runtime will handle them all. (With the limitation that if
> >     C> you want CUDA to *talk to* OMP or something else there needs to
> >     C> be some glue. I'm merely saying that 1 application can use
> >     C> multiple models in a way that won't conflict.)
> >
> >     C> c. The driver doesn't need to figure out do I link against some
> >     C> or a multitude of combining/conflicting libcuda, libomp,
> >     C> libsomething - it's liboffload - done
> >
> > Yes, a unified target library would help.
> >
> >     C> The driver proposal and the liboffload proposal should imnsho be
> >     C> tightly coupled and work together as *1*. The goals are
> >     C> significantly overlapping and relevant. If you get the liboffload
> >     C> OMP people to make that more agnostic - I think it simplifies the
> >     C> driver work.
> >
> > So basically it is about introducing a fourth unification: liboffload.
> >
> > A great unification sounds great.
> > My only concern is that if we tie everything together, it would increase
> > the entry cost: all the different components should be ready in
> > lock-step.
> > If there is already a runtime available, it would be easier to start
> > with and develop the other part in the meantime.
> > So from a pragmatic agile point-of-view, I would prefer not to impose a
> > strong unification.
>
> I think I may not be explaining this clearly - let me elaborate by
> example a bit below
>
> > In the proposal of Samuel, all the parts seem independent.
> >
> >     C>   ------ More specific to this proposal - device
> >     C> linker vs host linker. What do you do for IPA/LTO or whole
> >     C> program optimizations? (Outside the scope of this project.. ?)
> >
> > Ouch. I did not think about it. It sounds like science-fiction for
> > now. :-) Probably outside the scope of this project..
>
> It should certainly not be science fiction or an after-thought. I
> won't go into shameless self promotion, but there are certainly useful
> things you can do when you have a "whole device kernel" perspective.
>
> To digress into the liboffload component of this (sorry)
> what we have today is basically liboffload/src/all source files mucked
> together
>
> What I'm proposing would look more like this
>
> liboffload/src/common_middle_layer_glue # to start this may be "best effort"
> liboffload/src/omp # This code should exist today, but ideally should build on top of the middle layer
> liboffload/src/ptx # this may exist today - not sure
> liboffload/src/amd_gpu # probably doesn't exist, but wouldn't/shouldn't block anything
> liboffload/src/phi # may exist in some form
> liboffload/src/cuda # may exist in some form outside of the OMP work
>
> The end result would be liboffload.
>
> The layers above the common middle layer API are programming-model
> specific; the layers below it are hardware specific. To add a new hw
> backend you just implement the things the middle layer needs. To add a
> new programming model you build on top of the common layer. I'm not
> trying to force
> anyone/everyone to switch to this now - I'm hoping that by being a
> squeaky wheel this isolation of design and layers is there from the
> start - even if not perfect. I think it's sloppy to not consider this
> actually. LLVM's code generation is clean and has a nice separation
> per target (for the most part) - why should the offload library have
> a bad design that just needs to be refactored later? I've seen others
> in the community beat up Intel to force them to have higher quality
> code before inclusion... some of this may actually be just minor
> refactoring to come close to the target. (No pun intended)
> -------------
> If others become open to this design - I'm happy to contribute more
> tangible details on the actual middle API.
>
> The objects which the driver has to deal with may, and probably do,
> overlap to some extent with the objects that liboffload has to load or
> deal with. Is there an API the driver can hook into to magically
> handle that, or is it all per-device and one-off?
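
To make the layering described in the quoted message a bit more concrete,
here is a minimal C++ sketch of how a common middle layer could sit between
programming-model front ends and hardware backends. It is only an
illustration under stated assumptions: every name in it (DeviceBackend,
OffloadRuntime, FakePtxBackend, ompTargetRegion, cudaKernelLaunch) is
hypothetical and does not correspond to any existing liboffload, libomptarget
or CUDA API, and "launching" a region is reduced to returning a string.

```cpp
// Hypothetical sketch only: none of these names exist in LLVM or liboffload.
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Below the middle layer: what each hardware backend (ptx, amd_gpu, phi, ...)
// would have to implement.
class DeviceBackend {
public:
  virtual ~DeviceBackend() = default;
  virtual std::string name() const = 0;
  // "Run" an already-outlined region; here it just returns a description.
  virtual std::string launch(const std::string &region) = 0;
};

// The common middle layer: the single API that lowering would target.
class OffloadRuntime {
public:
  void registerBackend(std::unique_ptr<DeviceBackend> backend) {
    assert(backend && "backend must not be null");
    std::string key = backend->name();
    backends_[key] = std::move(backend);
  }

  bool hasBackend(const std::string &device) const {
    return backends_.count(device) != 0;
  }

  // Dispatch a region to the named device, whatever model produced it.
  std::string launch(const std::string &device, const std::string &region) {
    auto it = backends_.find(device);
    if (it == backends_.end())
      return "no backend for " + device;
    return it->second->launch(region);
  }

private:
  std::map<std::string, std::unique_ptr<DeviceBackend>> backends_;
};

// A stand-in hardware backend used only for this illustration.
class FakePtxBackend : public DeviceBackend {
public:
  std::string name() const override { return "ptx"; }
  std::string launch(const std::string &region) override {
    return "ptx ran " + region;
  }
};

// Above the middle layer: programming-model entry points that share the
// same runtime instead of each bringing their own.
std::string ompTargetRegion(OffloadRuntime &rt, const std::string &region) {
  return rt.launch("ptx", "omp:" + region);
}

std::string cudaKernelLaunch(OffloadRuntime &rt, const std::string &kernel) {
  return rt.launch("ptx", "cuda:" + kernel);
}
```

In this sketch, adding a new hardware target means adding one more
DeviceBackend subclass, and adding a new programming model means adding one
more thin entry point on top of OffloadRuntime. Whether a real middle layer
could stay this narrow is exactly the open design question in this thread.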