[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Thu Mar 3 07:19:07 PST 2016

>>>>> On Thu, 3 Mar 2016 18:19:43 +0700, C Bergström via cfe-dev <cfe-dev at lists.llvm.org> said:

    C> On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via cfe-dev
    C> <cfe-dev at lists.llvm.org> wrote:

    >> Just to be sure to understand: you are thinking about being able
    >> to outline several "languages" at once, such as CUDA *and*
    >> OpenMP, right?
    >> 
    >> I think it is required for serious applications. For example, in
    >> the HPC world, it is common to have hybrid multi-node
    >> heterogeneous applications that use MPI+OpenMP+OpenCL for
    >> example. Since MPI and OpenCL are just libraries, there is only
    >> OpenMP to off-load here. But if we move to OpenCL SYCL instead
    >> with MPI+OpenMP+SYCL then both OpenMP and SYCL have to be managed
    >> by the Clang off-loading infrastructure at the same time and be
    >> sure they combine gracefully...
    >> 
    >> I think your second proposal about (un)bundling can already
    >> manage this.
    >> 
    >> Otherwise, what about the code outlining itself used in the
    >> off-loading process? The code generation itself requires
    >> outlining the kernel code into external functions to be compiled
    >> by the kernel compiler. Do you think it is up to the programmer
    >> to re-use the recipes used by OpenMP and CUDA, for example, or
    >> would it be interesting to have a third proposal that abstracts
    >> the outliner further, so it can be configured to handle OpenMP,
    >> CUDA, SYCL... globally?

    C> Some very good points above and back to my broken record..

    C> If all offloading is done in a single unified library -
    C> a. Lowering in LLVM is greatly simplified since there's ***1***
    C> offload API to be supported. A region that's outlined for SYCL,
    C> CUDA or something else is essentially the same thing. (I do
    C> realize that some transformation may be highly target specific,
    C> but to me that's more target hw driven than programming model
    C> driven)

    C> b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since the
    C> same runtime will handle them all. (With the limitation that if
    C> you want CUDA to *talk to* OMP or something else there needs to
    C> be some glue.  I'm merely saying that 1 application could use
    C> multiple models in a way that won't conflict)

    C> c. The driver doesn't need to figure out whether to link against
    C> one or a multitude of combining/conflicting libcuda, libomp,
    C> libsomething - it's liboffload - done

Yes, a unified target library would help.
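
To make point (a) concrete, here is a minimal sketch of what a single
entry point could look like. Every name in it (Model, OutlinedRegion,
offload_launch, saxpy_region) is hypothetical and is not the API of any
existing runtime. It only assumes that the outliner passes captured
variables by address, and it always falls back to running the region on
the host.

```cpp
#include <cassert>
#include <cstddef>

// All names below are hypothetical, for illustration only.
enum class Model { CUDA, OpenMP, SYCL };

// What an outliner could hand to the runtime, whatever the source model was.
struct OutlinedRegion {
  const char *name;                 // symbol name of the outlined function
  Model origin;                     // kept only for diagnostics
  void (*host_entry)(void **args);  // host version of the region
};

// Single entry point shared by all models. Returns 0 on success, -1 if the
// region has no entry. This sketch has no device path: it runs on the host.
inline int offload_launch(const OutlinedRegion &region, void **args) {
  assert(args != nullptr);
  if (region.host_entry == nullptr)
    return -1;
  region.host_entry(args);
  return 0;
}

// A region body as an outliner might emit it: captures arrive by address.
inline void saxpy_region(void **args) {
  float a = *static_cast<float *>(args[0]);
  float *x = static_cast<float *>(args[1]);
  float *y = static_cast<float *>(args[2]);
  std::size_t n = *static_cast<std::size_t *>(args[3]);
  for (std::size_t i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```

In this sketch the launch path never looks at the origin field: a region
tagged Model::OpenMP and one tagged Model::CUDA go through exactly the
same call, which is the simplification being argued for.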

    C> The driver proposal and the liboffload proposal should imnsho be
    C> tightly coupled and work together as *1*. The goals are
    C> significantly overlapping and relevant. If you get the liboffload
    C> OMP people to make that more agnostic - I think it simplifies the
    C> driver work.

So basically it is about introducing a fourth unification: liboffload.

A grand unification sounds great.
My only concern is that if we tie everything together, it would raise
the entry cost: all the different components would have to be ready in
lock-step.
If a runtime is already available, it would be easier to start with it
and develop the other parts in the meantime.
So from a pragmatic, agile point of view, I would prefer not to impose a
strong unification.
In Samuel's proposal, all the parts seem independent.

    C>   ------ More specific to this proposal - device
    C> linker vs host linker. What do you do for IPA/LTO or whole
    C> program optimizations? (Outside the scope of this project.. ?)

Ouch. I did not think about it. It sounds like science-fiction for
now. :-) Probably outside the scope of this project..

Are you thinking of having LTO on each side independently,
host + target? Of course having LTO on host and target at the same time
seems trickier... :-) But I can see here a use case for "constant
specialization" available in SPIR-V, if we can have some simple host
LTO knowledge about constant values flowing down into device IR.

For non link-time IPA, I think it is simpler since I guess the
programming models envisioned here are all single source, so we can
apply most of the IPA *before* outlining I hope. But perhaps wild
preprocessor differences for host and device may cause havoc here?
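
A small sketch of the kind of preprocessor divergence I have in mind.
The helper names (tile_size, padded_length) are made up for the example.
__CUDA_ARCH__ is the macro defined only while compiling CUDA device
code; other programming models would use their own macro.

```cpp
// Single-source helper seen by both the host pass and the device pass.
#if defined(__CUDA_ARCH__)
constexpr int tile_size() { return 32; }  // device view
#else
constexpr int tile_size() { return 8; }   // host view
#endif

// Code that could end up inside an outlined region: rounds n up to a
// multiple of the tile size.
constexpr int padded_length(int n) {
  return ((n + tile_size() - 1) / tile_size()) * tile_size();
}
```

If an interprocedural pass folds tile_size() into padded_length() using
the host view and the result is outlined afterwards, the device code
would inherit the host constant instead of its own. So IPA before
outlining seems safe only for code whose preprocessed form is the same
on both sides.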

-- 
  Ronan KERYELL
  Xilinx Research Labs, Dublin, Ireland