[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver
Andrey Bokhanko via cfe-dev
cfe-dev at lists.llvm.org
Thu Mar 3 07:48:06 PST 2016
A unified offload library, as good as it might be to have one, is
completely orthogonal to Samuel's proposal.
He proposed unified driver support; it doesn't matter which offload
library the individual compiler components called by the driver are targeting.
Intel Compiler Team
On Thu, Mar 3, 2016 at 2:19 PM, C Bergström <cfe-dev at lists.llvm.org> wrote:
> On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via cfe-dev
> <cfe-dev at lists.llvm.org> wrote:
>>>>>>> On Wed, 24 Feb 2016 19:01:31 -0500, Samuel F Antao via cfe-dev <cfe-dev at lists.llvm.org> said:
>> Samuel> Hi all,
>> Hi Samuel!
>> Samuel> I’d like to propose a change in the Driver implementation
>> Samuel> to support programming models that require offloading with a
>> Samuel> unified infrastructure. The goal is to have a design that
>> Samuel> is general enough to cover different programming models with
>> Samuel> as little programming-model-specific customization as
>> Samuel> possible. Some of this discussion already
>> Samuel> took place in http://reviews.llvm.org/D9888 but I would like
>> Samuel> to continue that here on the mailing list and try to collect
>> Samuel> as much feedback as possible.
>> Samuel> Currently, there are two programming models supported by
>> Samuel> clang that require offloading - CUDA and OpenMP. Examples of
>> Samuel> other offloading models that could benefit from a unified
>> Samuel> driver design as they become supported in clang are
>> Samuel> SYCL (https://www.khronos.org/sycl) and OpenACC
>> Samuel> (http://www.openacc.org/).
>> Great proposal!
>> Very à propos, since I am just now thinking about implementing it with
>> Clang in my SYCL implementation (see
>> https://github.com/amd/triSYCL#possible-futures for a possible way I am
>> thinking of).
>> Samuel> OpenMP (Host IR has to be read by the device to determine
>> Samuel> which declarations have to be emitted and the device binary
>> Samuel> is embedded in the host binary at link phase through a
>> Samuel> proper linker script):
>> Samuel> Src -> Host PP -> A
>> Samuel> A -> HostCompile -> B
>> Samuel> A,B -> DeviceCompile -> C
>> Samuel> C -> DeviceAssembler -> D
>> Samuel> D -> DeviceLinker -> F
>> Samuel> B -> HostAssembler -> G
>> Samuel> G,F -> HostLinker -> Out
>> In SYCL it would be pretty close. Something like:
>> Src -> Host PP -> A
>> A -> HostCompile -> B
>> B -> HostAssembler -> C
>> Src -> Device PP -> D
>> D -> DeviceCompile -> E
>> E -> DeviceAssembler -> F
>> F -> DeviceLinker -> G
>> C,G -> HostLinker -> Out
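
The SYCL chain quoted just above can be written down as data and checked mechanically. What follows is only a minimal sketch, not anything from the Clang driver: the `Step` tuple layout and the `schedule` helper are hypothetical names introduced for illustration. It orders the tools so that every step runs after its inputs exist, and it reports any input that no step produces.

```python
from typing import List, Sequence, Tuple

# (inputs, tool, output), mirroring the "X -> Tool -> Y" notation above.
# This tuple layout is an assumption made for this sketch.
Step = Tuple[Tuple[str, ...], str, str]

SYCL_STEPS: List[Step] = [
    (("Src",), "Host PP", "A"),
    (("A",), "HostCompile", "B"),
    (("B",), "HostAssembler", "C"),
    (("Src",), "Device PP", "D"),
    (("D",), "DeviceCompile", "E"),
    (("E",), "DeviceAssembler", "F"),
    (("F",), "DeviceLinker", "G"),
    (("C", "G"), "HostLinker", "Out"),
]


def schedule(steps: Sequence[Step], sources: Sequence[str] = ("Src",)) -> List[str]:
    """Return tool names in an order where each step's inputs already exist."""
    available = set(sources)
    pending = list(steps)
    order: List[str] = []
    while pending:
        # Steps whose inputs have all been produced (or are sources).
        ready = [s for s in pending if all(i in available for i in s[0])]
        if not ready:
            produced_later = {out for _, _, out in pending}
            missing = sorted(
                {i for ins, _, _ in pending for i in ins} - available - produced_later
            )
            raise ValueError(
                "cannot schedule; no step produces: "
                + (", ".join(missing) or "(cyclic dependency)")
            )
        for _, tool, out in ready:
            available.add(out)
            order.append(tool)
        pending = [s for s in pending if s not in ready]
    return order
```

Calling `schedule(SYCL_STEPS)` yields an order in which the device linker runs before the host linker, which is the constraint that matters when the device image has to be embedded in the host binary.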
>> Samuel> As a hypothetical example, let's assume we wanted to compile
>> Samuel> code that uses both CUDA for a nvptx64 device, OpenMP for an
>> Samuel> x86_64 device, and a powerpc64le host, one could invoke the
>> Samuel> driver as:
>> Samuel> clang -target powerpc64le-ibm-linux-gnu <more host options>
>> Samuel> -target-offload=nvptx64-nvidia-cuda -fcuda -mcpu sm_35 <more
>> Samuel> options for the nvptx toolchain>
>> Samuel> -target-offload=x86_64-pc-linux-gnu -fopenmp <more options
>> Samuel> for the x86_64 toolchain>
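
One way to read the command line quoted above is that each option belongs to the toolchain named most recently. The sketch below illustrates that reading only. It assumes this binding rule, which the quoted example suggests but does not state, and `group_by_toolchain` is a hypothetical helper, not a Clang API. `-target-offload=` is the flag proposed in the RFC, not an existing driver option.

```python
from typing import Dict, List


def group_by_toolchain(argv: List[str]) -> List[Dict[str, object]]:
    """Split driver arguments into one host group plus one group per offload target.

    Assumption for this sketch: an option applies to the toolchain that was
    named most recently on the command line.
    """
    groups: List[Dict[str, object]] = [
        {"role": "host", "triple": None, "options": []}
    ]
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "-target":
            if i + 1 >= len(argv):
                raise ValueError("-target requires a triple")
            groups[0]["triple"] = argv[i + 1]
            i += 2
            continue
        if arg.startswith("-target-offload="):
            # Each offload target opens a new device group.
            triple = arg.split("=", 1)[1]
            groups.append({"role": "device", "triple": triple, "options": []})
        else:
            groups[-1]["options"].append(arg)
        i += 1
    return groups


if __name__ == "__main__":
    example = [
        "-target", "powerpc64le-ibm-linux-gnu",
        "-target-offload=nvptx64-nvidia-cuda", "-fcuda", "-mcpu", "sm_35",
        "-target-offload=x86_64-pc-linux-gnu", "-fopenmp",
    ]
    for group in group_by_toolchain(example):
        print(group["role"], group["triple"], group["options"])
```

Under this reading `-fcuda -mcpu sm_35` land in the nvptx64 group and `-fopenmp` in the x86_64 group. A real driver would also need a rule for options that apply to every toolchain, which this sketch does not attempt.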
>> Just to be sure I understand: you are thinking about being able to
>> outline several "languages" at once, such as CUDA *and* OpenMP, right?
>> I think it is required for serious applications. For example, in the HPC
>> world, it is common to have hybrid multi-node heterogeneous applications
>> that use MPI+OpenMP+OpenCL for example. Since MPI and OpenCL are just
>> libraries, there is only OpenMP to off-load here. But if we move to
>> OpenCL SYCL instead, with MPI+OpenMP+SYCL, then both OpenMP and SYCL
>> have to be managed by the Clang off-loading infrastructure at the same
>> time, and we have to be sure they combine gracefully...
>> I think your second proposal about (un)bundling can already manage this.
>> Otherwise, what about the code outlining itself used in the off-loading
>> process? Code generation requires outlining the kernel code into
>> external functions to be compiled by the kernel compiler. Do you
>> think it is up to the programmer to re-use the recipes used by OpenMP
>> and CUDA, for example, or would it be interesting to have a third
>> proposal that abstracts the outliner further, so that it can be
>> configured to handle OpenMP, CUDA, SYCL... globally?
> Some very good points above and back to my broken record..
> If all offloading is done in a single unified library -
> a. Lowering in LLVM is greatly simplified since there's ***1***
> offload API to be supported
> A region that's outlined for SYCL, CUDA or something else is
> essentially the same thing. (I do realize that some transformation may
> be highly target specific, but to me that's more target hw driven than
> programming model driven)
> b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since the same
> runtime will handle them all. (With the limitation that if you want
> CUDA to *talk to* OMP or something else there needs to be some glue.
> I'm merely saying that 1 application could use multiple models in a
> way that won't conflict.)
> c. The driver doesn't need to figure out whether to link against one
> or a multitude of combined/conflicting libcuda, libomp, libsomething -
> it's liboffload - done
> The driver proposal and the liboffload proposal should imnsho be
> tightly coupled and work together as *1*. The goals are significantly
> overlapping and relevant. If you get the liboffload OMP people to make
> that more agnostic - I think it simplifies the driver work.
> More specific to this proposal - device linker vs host linker. What do
> you do for IPA/LTO or whole program optimizations? (Outside the scope
> of this project.. ?)