[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver
Ronan Keryell via cfe-dev
cfe-dev at lists.llvm.org
Thu Mar 3 07:19:07 PST 2016
>>>>> On Thu, 3 Mar 2016 18:19:43 +0700, C Bergström via cfe-dev <cfe-dev at lists.llvm.org> said:
C> On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via cfe-dev
C> <cfe-dev at lists.llvm.org> wrote:
>> Just to be sure to understand: you are thinking about being able
>> to outline several "languages" at once, such as CUDA *and*
>> OpenMP, right?
>> I think it is required for serious applications. For example, in
>> the HPC world, it is common to have hybrid multi-node
>> heterogeneous applications that use MPI+OpenMP+OpenCL for
>> example. Since MPI and OpenCL are just libraries, there is only
>> OpenMP to off-load here. But if we move to OpenCL SYCL instead
>> with MPI+OpenMP+SYCL then both OpenMP and SYCL have to be managed
>> by the Clang off-loading infrastructure at the same time, and we
>> must be sure they combine gracefully...
>> I think your second proposal about (un)bundling can already
>> manage this.
>> Otherwise, what about the code outlining itself used in the
>> off-loading process? The code generation itself requires
>> outlining the kernel code into external functions to be compiled
>> by the kernel compiler. Do you think it is up to the programmer
>> to re-use the recipes used by OpenMP and CUDA, for example, or
>> would it be interesting to have a third proposal that abstracts
>> the outliner further, making it configurable to handle OpenMP, CUDA,
C> Some very good points above and back to my broken record..
C> If all offloading is done in a single unified library -
C> a. Lowering in LLVM is greatly simplified since there's ***1***
C> offload API to be supported. A region that's outlined for SYCL,
C> CUDA or something else is essentially the same thing. (I do
C> realize that some transformation may be highly target specific,
C> but to me that's more target hw driven than programming model.)
C> b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since the
C> same runtime will handle them all. (With the limitation that if
C> you want CUDA to *talk to* OMP or something else there needs to
C> be some glue. I'm merely saying that 1 application can use
C> multiple models in a way that won't conflict.)
C> c. The driver doesn't need to figure out do I link against some
C> or a multitude of combining/conflicting libcuda, libomp,
C> libsomething - it's liboffload - done
Yes, a unified target library would help.
C> The driver proposal and the liboffload proposal should imnsho be
C> tightly coupled and work together as *1*. The goals are
C> significantly overlapping and relevant. If you get the liboffload
C> OMP people to make that more agnostic - I think it simplifies the
C> driver work.
So basically it is about introducing a fourth unification: liboffload.
A grand unification sounds great.
My only concern is that if we tie everything together, it would increase
the entry cost: all the different components would have to be ready at
the same time.
If there is already a runtime available, it would be easier to start
with and develop the other parts in the meantime.
So from a pragmatic agile point-of-view, I would prefer not to impose a
tight coupling between the components.
In the proposal of Samuel, all the parts seem independent.
C> ------ More specific to this proposal - device
C> linker vs host linker. What do you do for IPA/LTO or whole
C> program optimizations? (Outside the scope of this project.. ?)
Ouch. I did not think about it. It sounds like science-fiction for
now. :-) Probably outside the scope of this project..
Are you thinking of having LTO separately on each side, host +
target? Of course having LTO on host and target at the same time
seems trickier... :-) But I can see here a use case for "constant
specialization" available in SPIR-V, if we can have some simple host
LTO knowledge about constant values flowing down into device IR.
For non-link-time IPA, I think it is simpler, since I guess the
programming models envisioned here are all single-source, so we can,
I hope, apply most of the IPA *before* outlining. But perhaps wild
preprocessor differences between host and device may cause havoc here?
Xilinx Research Labs, Dublin, Ireland