[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Thu Mar 3 02:50:04 PST 2016

>>>>> On Wed, 24 Feb 2016 19:01:31 -0500, Samuel F Antao via cfe-dev <cfe-dev at lists.llvm.org> said:

    Samuel> Hi all,

Hi Samuel!

    Samuel>  I’d like to propose a change in the Driver implementation
    Samuel> to support programming models that require offloading with a
    Samuel> unified infrastructure.  The goal is to have a design that
    Samuel> is general enough to cover different programming models with
    Samuel> as little as possible customization that is
    Samuel> programming-model specific. Some of this discussion already
    Samuel> took place in http://reviews.llvm.org/D9888 but I would like
    Samuel> to continue that here on the mailing list and try to collect
    Samuel> as much feedback as possible.

    Samuel> Currently, there are two programming models supported by
    Samuel> clang that require offloading - CUDA and OpenMP. Examples of
    Samuel> other offloading models that could benefit from a unified
    Samuel> driver design as they become supported in clang are
    Samuel> SYCL (https://www.khronos.org/sycl) and OpenACC
    Samuel> (http://www.openacc.org/).

Great proposal!

Very à propos, since I am currently thinking about implementing this
kind of offloading with Clang in my SYCL implementation (see
https://github.com/amd/triSYCL#possible-futures for one approach I am
considering).

    Samuel> OpenMP (Host IR has to be read by the device to determine
    Samuel> which declarations have to be emitted and the device binary
    Samuel> is embedded in the host binary at link phase through a
    Samuel> proper linker script):

    Samuel> Src -> Host PP -> A

    Samuel> A -> HostCompile -> B

    Samuel> A,B -> DeviceCompile -> C

    Samuel> C -> DeviceAssembler -> D

    Samuel> D -> DeviceLinker -> F

    Samuel> B -> HostAssembler -> G

    Samuel> G,F -> HostLinker -> Out

In SYCL it would be pretty close. Something like:

Src -> Host PP -> A

A -> HostCompile -> B

B -> HostAssembler -> C

Src -> Device PP -> D

D -> DeviceCompile -> E

E -> DeviceAssembler -> F

F -> DeviceLinker -> G

C,G -> HostLinker -> Out
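
To make the dependencies in this SYCL flow explicit, here is a minimal
Python sketch that encodes the steps above as a graph and derives one
valid execution order. It is only an illustration of the flow, not how
the Clang driver represents its actions; SYCL_STEPS and schedule are
hypothetical names, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each step is (inputs, tool, output), copied from the SYCL flow above.
SYCL_STEPS = [
    (["Src"], "Host PP", "A"),
    (["A"], "HostCompile", "B"),
    (["B"], "HostAssembler", "C"),
    (["Src"], "Device PP", "D"),
    (["D"], "DeviceCompile", "E"),
    (["E"], "DeviceAssembler", "F"),
    (["F"], "DeviceLinker", "G"),
    (["C", "G"], "HostLinker", "Out"),
]


def schedule(steps):
    """Return (tool, output) pairs in an order that respects all inputs."""
    producer = {out: tool for _, tool, out in steps}
    # TopologicalSorter expects a mapping: node -> set of predecessors.
    graph = {out: set(ins) for ins, _, out in steps}
    order = TopologicalSorter(graph).static_order()
    # "Src" appears only as an input, so it has no producing tool: skip it.
    return [(producer[node], node) for node in order if node in producer]


if __name__ == "__main__":
    for tool, output in schedule(SYCL_STEPS):
        print(f"{tool} -> {output}")
```

The host chain (A, B, C) and the device chain (D, E, F, G) only meet at
HostLinker, so in this flow the two chains could be scheduled
independently of each other.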

    Samuel> As a hypothetical example, let's assume we wanted to compile
    Samuel> code that uses both CUDA for a nvptx64 device, OpenMP for an
    Samuel> x86_64 device, and a powerpc64le host, one could invoke the
    Samuel> driver as:

    Samuel> clang -target powerpc64le-ibm-linux-gnu <more host options>

    Samuel> -target-offload=nvptx64-nvidia-cuda -fcuda -mcpu sm_35 <more
    Samuel> options for the nvptx toolchain>

    Samuel> -target-offload=x86_64-pc-linux-gnu -fopenmp <more options
    Samuel> for the x86_64 toolchain>
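
To check my reading of this command line, here is a minimal Python
sketch of how the options could be split per toolchain. It assumes that
every option following a -target-offload=<triple> belongs to that
offload toolchain until the next -target-offload appears, and that
everything before the first one belongs to the host. That rule is my
interpretation of the example, not existing driver behaviour;
-target-offload is the flag proposed in this RFC, and
group_offload_options is a hypothetical helper name. The <more ...>
placeholders are left out.

```python
OFFLOAD_PREFIX = "-target-offload="  # flag proposed in the RFC, not an existing one


def group_offload_options(argv):
    """Split driver arguments into a host group plus one group per offload triple."""
    groups = {"host": []}
    current = "host"
    for arg in argv[1:]:  # argv[0] is the program name
        if arg.startswith(OFFLOAD_PREFIX):
            # Assumption: a new -target-offload opens a new option group.
            current = arg[len(OFFLOAD_PREFIX):]
            groups.setdefault(current, [])
        else:
            groups[current].append(arg)
    return groups


EXAMPLE_ARGV = [
    "clang", "-target", "powerpc64le-ibm-linux-gnu",
    "-target-offload=nvptx64-nvidia-cuda", "-fcuda", "-mcpu", "sm_35",
    "-target-offload=x86_64-pc-linux-gnu", "-fopenmp",
]

if __name__ == "__main__":
    for toolchain, options in group_offload_options(EXAMPLE_ARGV).items():
        print(toolchain, options)
```

With this reading, -fcuda and -mcpu sm_35 end up attached to the nvptx64
toolchain only, and -fopenmp to the x86_64 offload toolchain only.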

Just to be sure I understand: you want to be able to outline several
"languages" at once, such as CUDA *and* OpenMP, right?

I think this is required for serious applications. In the HPC world, it
is common to have hybrid multi-node heterogeneous applications that use,
for example, MPI+OpenMP+OpenCL. Since MPI and OpenCL are just libraries,
only OpenMP needs to be off-loaded there. But if we move to OpenCL SYCL
instead, with MPI+OpenMP+SYCL, then both OpenMP and SYCL have to be
managed by the Clang off-loading infrastructure at the same time, and
they have to combine gracefully.

I think your second proposal about (un)bundling can already manage this.

Otherwise, what about the code outlining used in the off-loading
process? Code generation requires outlining the kernel code into
separate functions to be compiled by the kernel compiler. Do you think
it is up to the programmer to re-use the recipes used by OpenMP and
CUDA, for example, or would it be interesting to have a third proposal
that abstracts the outliner further, so that it can be configured to
handle OpenMP, CUDA, SYCL... in a common way?

Thanks a lot,
-- 
  Ronan KERYELL
  Xilinx Research Labs, Dublin, Ireland