[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Thu Mar 3 07:48:06 PST 2016

Chris,

A unified offload library, as good as it might be to have one, is
completely orthogonal to Samuel's proposal.

He proposed unified driver support; it doesn't matter which offload
library the individual compiler components invoked by the driver target.

Yours,
Andrey
=====
Software Engineer
Intel Compiler Team

On Thu, Mar 3, 2016 at 2:19 PM, C Bergström <cfe-dev at lists.llvm.org> wrote:
> On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via cfe-dev
> <cfe-dev at lists.llvm.org> wrote:
>>>>>>> On Wed, 24 Feb 2016 19:01:31 -0500, Samuel F Antao via cfe-dev <cfe-dev at lists.llvm.org> said:
>>
>>     Samuel> Hi all,
>>
>> Hi Samuel!
>>
>>     Samuel>  I’d like to propose a change in the Driver implementation
>>     Samuel> to support programming models that require offloading with a
>>     Samuel> unified infrastructure.  The goal is to have a design that
>>     Samuel> is general enough to cover different programming models with
>>     Samuel> as little programming-model-specific customization as
>>     Samuel> possible. Some of this discussion already took place in
>>     Samuel> http://reviews.llvm.org/D9888, but I would like to continue
>>     Samuel> it here on the mailing list and try to collect as much
>>     Samuel> feedback as possible.
>>
>>     Samuel> Currently, there are two programming models supported by
>>     Samuel> clang that require offloading: CUDA and OpenMP. Examples of
>>     Samuel> other offloading models that could benefit from a unified
>>     Samuel> driver design as they become supported in clang are
>>     Samuel> SYCL (https://www.khronos.org/sycl) and OpenACC
>>     Samuel> (http://www.openacc.org/).
>>
>> Great proposal!
>>
>> Very à propos, since I am currently thinking about implementing it with
>> Clang in my SYCL implementation (see
>> https://github.com/amd/triSYCL#possible-futures for a possible approach
>> I am considering).
>>
>>     Samuel> OpenMP (Host IR has to be read by the device to determine
>>     Samuel> which declarations have to be emitted and the device binary
>>     Samuel> is embedded in the host binary at link phase through a
>>     Samuel> proper linker script):
>>
>>     Samuel> Src -> Host PP -> A
>>
>>     Samuel> A -> HostCompile -> B
>>
>>     Samuel> A,B -> DeviceCompile -> C
>>
>>     Samuel> C -> DeviceAssembler -> D
>>
>>     Samuel> D -> DeviceLinker -> F
>>
>>     Samuel> B -> HostAssembler -> G
>>
>>     Samuel> G,F -> HostLinker -> Out
>>
>> In SYCL it would be pretty close. Something like:
>>
>> Src -> Host PP -> A
>>
>> A -> HostCompile -> B
>>
>> B -> HostAssembler -> C
>>
>> Src -> Device PP -> D
>>
>> D -> DeviceCompile -> E
>>
>> E -> DeviceAssembler -> F
>>
>> F -> DeviceLinker -> G
>>
>> C,G -> HostLinker -> Out
>>
>>     Samuel> As a hypothetical example, let's assume we wanted to compile
>>     Samuel> code that uses CUDA for an nvptx64 device and OpenMP for an
>>     Samuel> x86_64 device, with a powerpc64le host; one could invoke the
>>     Samuel> driver as:
>>
>>     Samuel> clang -target powerpc64le-ibm-linux-gnu <more host options>
>>
>>     Samuel> -target-offload=nvptx64-nvidia-cuda -fcuda -mcpu sm_35 <more
>>     Samuel> options for the nvptx toolchain>
>>
>>     Samuel> -target-offload=x86_64-pc-linux-gnu -fopenmp <more options
>>     Samuel> for the x86_64 toolchain>
>>
>> Just to be sure I understand: you are thinking about being able to
>> outline several "languages" at once, such as CUDA *and* OpenMP, right?
>>
>> I think it is required for serious applications. For example, in the HPC
>> world it is common to have hybrid multi-node heterogeneous applications
>> that use MPI+OpenMP+OpenCL. Since MPI and OpenCL are just libraries,
>> there is only OpenMP to off-load here. But if we move to OpenCL SYCL
>> instead, with MPI+OpenMP+SYCL, then both OpenMP and SYCL have to be
>> managed by the Clang off-loading infrastructure at the same time, and we
>> have to make sure they combine gracefully...
>>
>> I think your second proposal about (un)bundling can already manage this.
>>
>> Otherwise, what about the code outlining used in the off-loading
>> process itself? Code generation requires outlining the kernel code into
>> external functions to be compiled by the kernel compiler. Do you think
>> it is up to the programmer to reuse the recipes used by OpenMP and CUDA,
>> for example, or would it be interesting to have a third proposal that
>> abstracts the outliner further, so that it can be configured to handle
>> OpenMP, CUDA, SYCL... globally?
>
> Some very good points above and back to my broken record..
>
> If all offloading is done in a single unified library -
> a. Lowering in LLVM is greatly simplified since there's ***1***
> offload API to be supported.
> A region that's outlined for SYCL, CUDA or something else is
> essentially the same thing. (I do realize that some transformations may
> be highly target-specific, but to me that's driven more by the target
> hardware than by the programming model.)
>
> b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since the same
> runtime will handle them all. (The limitation is that if you want
> CUDA to *talk to* OMP or something else, there needs to be some glue.
> I'm merely saying that one application could use multiple models in a
> way that won't conflict.)
>
> c. The driver doesn't need to figure out whether to link against one or
> several, possibly conflicting, libraries (libcuda, libomp,
> libsomething); it's liboffload, done.
>
> The driver proposal and the liboffload proposal should IMNSHO be
> tightly coupled and work together as *1*. Their goals overlap
> significantly and are relevant to each other. If you get the liboffload
> OMP people to make it more agnostic, I think it simplifies the driver work.
> ------
> More specific to this proposal: device linker vs. host linker. What do
> you do for IPA/LTO or whole-program optimizations? (Or is that outside
> the scope of this project?)