[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Thu Mar 3 03:19:43 PST 2016

On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via cfe-dev
<cfe-dev at lists.llvm.org> wrote:
>>>>>> On Wed, 24 Feb 2016 19:01:31 -0500, Samuel F Antao via cfe-dev <cfe-dev at lists.llvm.org> said:
>
>     Samuel> Hi all,
>
> Hi Samuel!
>
>     Samuel>  I’d like to propose a change in the Driver implementation
>     Samuel> to support programming models that require offloading with a
>     Samuel> unified infrastructure.  The goal is to have a design that
>     Samuel> is general enough to cover different programming models with
>     Samuel> as little programming-model-specific customization as
>     Samuel> possible. Some of this discussion already
>     Samuel> took place in http://reviews.llvm.org/D9888 but I would like
>     Samuel> to continue that here in the mailing list and try to collect
>     Samuel> as much feedback as possible.
>
>     Samuel> Currently, there are two programming models supported by
>     Samuel> clang that require offloading - CUDA and OpenMP. Examples of
>     Samuel> other offloading models that could benefit from a unified
>     Samuel> driver design as they become supported in clang are
>     Samuel> SYCL (https://www.khronos.org/sycl) and OpenACC
>     Samuel> (http://www.openacc.org/).
>
> Great proposal!
>
> Very à propos since I am just thinking about implementing it with Clang
> in my SYCL implementation (see
> https://github.com/amd/triSYCL#possible-futures for a possible way I am
> thinking of).
>
>     Samuel> OpenMP (Host IR has to be read by the device to determine
>     Samuel> which declarations have to be emitted and the device binary
>     Samuel> is embedded in the host binary at link phase through a
>     Samuel> proper linker script):
>
>     Samuel> Src -> Host PP -> A
>
>     Samuel> A -> HostCompile -> B
>
>     Samuel> A,B -> DeviceCompile -> C
>
>     Samuel> C -> DeviceAssembler -> D
>
>     Samuel> E -> DeviceLinker -> F
>
>     Samuel> B -> HostAssembler -> G
>
>     Samuel> G,F -> HostLinker -> Out
>
> In SYCL it would be pretty close. Something like:
>
> Src -> Host PP -> A
>
> A -> HostCompile -> B
>
> B -> HostAssembler -> C
>
> Src -> Device PP -> D
>
> D -> DeviceCompile -> E
>
> E -> DeviceAssembler -> F
>
> F -> DeviceLinker -> G
>
> C,G -> HostLinker -> Out
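
To make the difference between the two flows concrete, here is a minimal Python sketch. It is not how the Clang driver represents its actions; the table layout and the `schedule` and `independent_of` helpers are made up for illustration. It transcribes the SYCL flow quoted above as a dependency table and checks that the device branch never consumes a host artifact, whereas in the OpenMP flow DeviceCompile takes B as an input.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Output artifact -> (tool, input artifacts), transcribed from the SYCL flow
# quoted above. The dictionary layout itself is an illustrative assumption.
SYCL_PIPELINE = {
    "A": ("Host PP", ["Src"]),
    "B": ("HostCompile", ["A"]),
    "C": ("HostAssembler", ["B"]),
    "D": ("Device PP", ["Src"]),
    "E": ("DeviceCompile", ["D"]),
    "F": ("DeviceAssembler", ["E"]),
    "G": ("DeviceLinker", ["F"]),
    "Out": ("HostLinker", ["C", "G"]),
}


def schedule(pipeline):
    """Return (tool, artifact) steps in an order that respects dependencies."""
    graph = {out: set(inputs) for out, (_tool, inputs) in pipeline.items()}
    order = TopologicalSorter(graph).static_order()
    # "Src" is an input file with no producing tool, so it is skipped.
    return [(pipeline[name][0], name) for name in order if name in pipeline]


def independent_of(pipeline, artifact, other):
    """True if `artifact` can be built without `other` being built first."""
    seen, stack = set(), [artifact]
    while stack:
        current = stack.pop()
        for dep in pipeline.get(current, ("", []))[1]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return other not in seen


steps = schedule(SYCL_PIPELINE)
for tool, artifact in steps:
    print(f"{tool} -> {artifact}")
```

Since G depends only on Src through the device chain, `independent_of(SYCL_PIPELINE, "G", "B")` holds, so a driver could in principle run the host and device branches in parallel for this flow.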
>
>     Samuel> As a hypothetical example, let's assume we wanted to compile
>     Samuel> code that uses both CUDA for a nvptx64 device, OpenMP for an
>     Samuel> x86_64 device, and a powerpc64le host, one could invoke the
>     Samuel> driver as:
>
>     Samuel> clang -target powerpc64le-ibm-linux-gnu <more host options>
>
>     Samuel> -target-offload=nvptx64-nvidia-cuda -fcuda -mcpu sm_35 <more
>     Samuel> options for the nvptx toolchain>
>
>     Samuel> -target-offload=x86_64-pc-linux-gnu -fopenmp <more options
>     Samuel> for the x86_64 toolchain>
>
> Just to be sure I understand: you are thinking about being able to
> outline several "languages" at once, such as CUDA *and* OpenMP, right?
>
> I think it is required for serious applications. For example, in the HPC
> world, it is common to have hybrid multi-node heterogeneous applications
> that use MPI+OpenMP+OpenCL for example. Since MPI and OpenCL are just
> libraries, there is only OpenMP to off-load here. But if we move to
> OpenCL SYCL instead with MPI+OpenMP+SYCL then both OpenMP and SYCL have
> to be managed by the Clang off-loading infrastructure at the same time
> and be sure they combine gracefully...
>
> I think your second proposal about (un)bundling can already manage this.
>
> Otherwise, what about the code outlining itself used in the off-loading
> process? The code generation itself requires outlining the kernel code
> into external functions to be compiled by the kernel compiler. Do you
> think it is up to the programmer to re-use the recipes used by OpenMP
> and CUDA, for example, or would it be interesting to have a third
> proposal to abstract the outliner further, so it can be configured to
> handle OpenMP, CUDA, SYCL... globally?

Some very good points above, and back to my broken record...

If all offloading is done in a single unified library:
a. Lowering in LLVM is greatly simplified, since there's ***1***
offload API to be supported. A region that's outlined for SYCL, CUDA
or something else is essentially the same thing. (I do realize that
some transformations may be highly target specific, but to me that's
driven more by the target hw than by the programming model.)

b. Mixing CUDA/OMP/ACC/Foo in theory may "just work", since the same
runtime will handle them all. (With the limitation that if you want
CUDA to *talk to* OMP or something else, there needs to be some glue.
I'm merely saying that 1 application can use multiple models in a way
that won't conflict.)

c. The driver doesn't need to figure out whether to link against one
or a multitude of combining/conflicting libcuda, libomp, libsomething.
It's liboffload, done.
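
As a rough illustration of points a-c, here is a minimal Python sketch of what "one runtime, many models" could look like. Everything in it is hypothetical: `OffloadRuntime`, `register_device`, `launch` and the host stand-in executor are invented for this example and are not the API of any existing liboffload, libomptarget or CUDA runtime. The only point is that regions outlined by different programming models go through the same entry point.

```python
class OffloadRuntime:
    """Hypothetical single offload runtime shared by all programming models."""

    def __init__(self):
        self._devices = {}  # target triple -> callable that executes a region
        self.log = []       # (model, triple, region name) for each launch

    def register_device(self, triple, executor):
        self._devices[triple] = executor

    def launch(self, model, triple, region, *args):
        # The same entry point regardless of which model outlined the region.
        if triple not in self._devices:
            raise LookupError(f"no device registered for {triple}")
        self.log.append((model, triple, region.__name__))
        return self._devices[triple](region, *args)


def run_on_host(region, *args):
    # Stand-in executor: a real one would hand the region to a device.
    return region(*args)


def saxpy(a, xs, ys):
    # Pretend this region was outlined from CUDA code.
    return [a * x + y for x, y in zip(xs, ys)]


def dot(xs, ys):
    # Pretend this region was outlined from an OpenMP target construct.
    return sum(x * y for x, y in zip(xs, ys))


runtime = OffloadRuntime()
runtime.register_device("nvptx64-nvidia-cuda", run_on_host)
runtime.register_device("x86_64-pc-linux-gnu", run_on_host)

cuda_result = runtime.launch("cuda", "nvptx64-nvidia-cuda",
                             saxpy, 2, [1, 2, 3], [10, 20, 30])
omp_result = runtime.launch("openmp", "x86_64-pc-linux-gnu",
                            dot, [1, 2, 3], [4, 5, 6])
print(cuda_result, omp_result)
```

With a shape like this the driver would only ever need to link one runtime library, and the glue needed for CUDA to *talk to* OMP would live behind `launch` rather than in each model's own library.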

The driver proposal and the liboffload proposal should imnsho be
tightly coupled and work together as *1*. The goals overlap
significantly. If you get the liboffload OMP people to make it more
model-agnostic, I think it simplifies the driver work.
------
More specific to this proposal: device linker vs host linker. What do
you do for IPA/LTO or whole-program optimizations? (Outside the scope
of this project?)