[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Wed Feb 24 16:01:31 PST 2016

Hi all,

I’d like to propose a change in the Driver implementation to support
programming models that require offloading with a unified infrastructure.
The goal is to have a design that is general enough to cover different
programming models with as little programming-model-specific customization
as possible. Some of this discussion already took place in
http://reviews.llvm.org/D9888, but I would like to continue it here on the
mailing list and collect as much feedback as possible.

Currently, there are two programming models supported by clang that require
offloading: CUDA and OpenMP. Other offloading models that could benefit
from a unified driver design as they become supported in clang include
SYCL (https://www.khronos.org/sycl) and OpenACC (http://www.openacc.org/).
Therefore, I’ll try to keep the discussion as general as possible, but will
occasionally provide examples of how it applies to CUDA and OpenMP, given
that this is what people may care about more immediately.

I hope I have covered all the possible implications of a general offloading
implementation. Let me know if you think something is missing that should
also be covered, and please share your suggestions and concerns. Any
feedback is very much welcome!

Thanks!

Samuel

================

Proposal Description

================

a) Create toolchains for host and offload devices before creating the
actions.

The driver has to detect the programming models in use from the provided
options (e.g. -fcuda or -fopenmp) or from file extensions. For each host
and offloading device and programming model, it should create a toolchain.
In general, the same target can be used as both host and offloading device,
so toolchain creation should be given a “kind” that unequivocally specifies
what that toolchain is used for. These kinds (e.g. CudaHostKind,
CudaDeviceKind, OpenMPHost, etc.) would be kept in ToolChain and could be
accessed through a public method so they can be used to drive the creation
of commands by Tools.
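
To make the “kind” idea concrete, here is a minimal Python sketch (not
Clang code). The names OffloadKind, ToolChain and create_toolchains are
hypothetical and exist only for illustration. It shows how the same triple
can yield two distinct toolchains once the kind is part of the identity.

```python
from dataclasses import dataclass
from enum import Enum, auto


class OffloadKind(Enum):
    # Hypothetical kinds, modeled on the examples named in the proposal.
    CUDA_HOST = auto()
    CUDA_DEVICE = auto()
    OPENMP_HOST = auto()
    OPENMP_DEVICE = auto()


@dataclass(frozen=True)
class ToolChain:
    triple: str
    kind: OffloadKind  # what this toolchain is used for


HOST_KIND = {"cuda": OffloadKind.CUDA_HOST, "openmp": OffloadKind.OPENMP_HOST}
DEVICE_KIND = {"cuda": OffloadKind.CUDA_DEVICE, "openmp": OffloadKind.OPENMP_DEVICE}


def create_toolchains(host_triple, models):
    """Create one toolchain per (target, programming model, role).

    models maps a programming-model name to its list of device triples.
    """
    toolchains = []
    for model, device_triples in models.items():
        toolchains.append(ToolChain(host_triple, HOST_KIND[model]))
        for triple in device_triples:
            toolchains.append(ToolChain(triple, DEVICE_KIND[model]))
    return toolchains


# The same x86_64 target acts as OpenMP host and OpenMP device.
tcs = create_toolchains("x86_64-pc-linux-gnu",
                        {"openmp": ["x86_64-pc-linux-gnu"]})
```

Because the kind is stored alongside the triple, the two entries in tcs
compare as different toolchains even though their triples are identical.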

b) Keep the generation of Actions independent of the programming model.

In my view, the Actions should only depend on the compile phases requested
by the user and the file extensions of the input files. Only the way those
actions are interpreted to create jobs should be dependent on the
programming model. This would avoid complicating the actions creation with
dependencies that only make sense to some programming models, which would
make the implementation hard to scale when new programming models are to be
adopted.

c) Use unbundling and bundling tools agnostic of the programming model.

I propose a single change in action creation: the addition of “unbundling”
and “bundling” actions whose goal is to spare the user from dealing with
multiple files generated by multiple toolchains (the host toolchain and the
offloading devices’ toolchains) when the build system uses separate
compilation. This would save users from redesigning their build system
when they adopt a programming model with offloading. These actions would be
introduced if offloading is required, i.e. there are toolchains that refer
to offloading devices (regardless of the programming model being
supported). Unbundling would be inserted if the initial action is not a
source input action, and bundling would be introduced if the last phase is
not a linking phase.

Unbundling and Bundling could be supported by a tool specifically
implemented for that purpose. I’ll post a separate RFC for this tool.
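
The insertion rules above can be sketched in a few lines of Python. This is
only an illustration under simplifying assumptions: phases are plain
strings, and plan_actions is a hypothetical helper, not a driver function.

```python
def plan_actions(phases, input_is_source, has_offload_toolchains):
    """Return the phase list with unbundle/bundle steps added where needed.

    phases: ordered phase names, e.g. ["preprocess", "compile", "assemble"].
    """
    actions = list(phases)
    if not has_offload_toolchains:
        # No offloading device toolchains: nothing to bundle or unbundle.
        return actions
    if not input_is_source:
        # A non-source input is assumed to be a bundle of host and device
        # parts, so it has to be split first.
        actions.insert(0, "unbundle")
    if not phases or phases[-1] != "link":
        # Without a final link, the per-toolchain outputs are packed into
        # a single file for the user.
        actions.append("bundle")
    return actions


compile_only = plan_actions(["preprocess", "compile", "assemble"], True, True)
link_objects = plan_actions(["link"], False, True)
```

With these rules, a 'clang -c'-style build of a source file ends in a
bundle step, and a later link of the resulting object starts with an
unbundle step.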

d) Allow the target toolchain to request the host toolchain to be used for
a given action.

In some cases the definitions in the host toolchain are the correct ones to
use. E.g. a preprocessing phase may fail because the header files expect
host macros on a given system. Such a request can be implemented as a query
in the relevant ToolChain that takes into account the device target and the
offloading kinds associated with it.
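
As a rough Python sketch of such a query: the function name and the policy
it encodes (OpenMP device preprocessing is delegated to the host) are
assumptions made for illustration, not something this proposal fixes.

```python
def toolchain_for_action(action, device_triple, host_triple, offload_kinds):
    """Pick the toolchain whose definitions should be used for an action.

    offload_kinds: set of programming-model names associated with the
    device toolchain, e.g. {"openmp"}.
    """
    # Assumed policy: for OpenMP, preprocessing must see host macros.
    if action == "preprocess" and "openmp" in offload_kinds:
        return host_triple
    return device_triple


host = "powerpc64le-ibm-linux-gnu"
device = "nvptx64-nvidia-cuda"
pp_toolchain = toolchain_for_action("preprocess", device, host, {"openmp"})
cc_toolchain = toolchain_for_action("compile", device, host, {"openmp"})
```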

e) Use a job results cache to enable sharing results between device and
host toolchains.

At some point, an offloading device object has to be integrated into the
host object. Other intermediate job results may also have to be shared
between host and device, in either direction. As an example, these are the
dependencies that are required for CUDA and OpenMP:

CUDA (the device object is injected at the host compile phase):

Src -> Device PP -> A

A    -> DeviceCompile -> B

B    -> DeviceAssembler -> C

C    -> DeviceLinker -> D

Src -> Host PP -> E

E,D -> HostCompile -> F

F    -> HostAssembler -> G

G    -> HostLinker -> Out

OpenMP (Host IR has to be read by the device to determine which
declarations have to be emitted and the device binary is embedded in the
host binary at link phase through a proper linker script):

Src -> Host PP -> A

A    -> HostCompile -> B

A,B -> DeviceCompile -> C

C    -> DeviceAssembler -> D

D    -> DeviceLinker -> F

B    -> HostAssembler -> G

G,F -> HostLinker -> Out
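
To show how such dependencies constrain the order in which jobs can be
built, here is a small Python sketch that encodes the CUDA chain above as a
graph and asks the standard library (graphlib, Python 3.9+) for a valid
order. The job names are simply the labels from the diagram.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs whose results it consumes.
cuda_deps = {
    "DevicePP": set(),
    "DeviceCompile": {"DevicePP"},
    "DeviceAssembler": {"DeviceCompile"},
    "DeviceLinker": {"DeviceAssembler"},
    "HostPP": set(),
    # The device object (D) is injected at the host compile phase.
    "HostCompile": {"HostPP", "DeviceLinker"},
    "HostAssembler": {"HostCompile"},
    "HostLinker": {"HostAssembler"},
}

# Any order produced here runs every job after the jobs it depends on.
order = list(TopologicalSorter(cuda_deps).static_order())
```

Whatever order is produced, the whole device pipeline precedes HostCompile,
which is why the host job needs a way to reach the device results.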

This can be done by generating the jobs and storing their results in a
cache so they can be referred to later on. This was proposed in
http://reviews.llvm.org/D9888, and a very similar mechanism, introduced by
the CUDA implementation, is in use today. It is possible this cache has to
be extended to support more queries and to have the results sorted by
Action, ToolChain and Offloading Kind.
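
A minimal Python sketch of a cache keyed this way follows. JobResultCache
and its methods are hypothetical names, and results are represented as file
names purely for illustration.

```python
class JobResultCache:
    """Job results indexed by (action, toolchain triple, offloading kind)."""

    def __init__(self):
        self._results = {}

    def store(self, action, triple, kind, result):
        self._results[(action, triple, kind)] = result

    def lookup(self, action, triple, kind):
        # Returns None when the job has not been generated yet.
        return self._results.get((action, triple, kind))

    def for_action(self, action):
        # All results of one action, sorted by toolchain and kind.
        return sorted(
            (triple, kind, result)
            for (act, triple, kind), result in self._results.items()
            if act == action
        )


cache = JobResultCache()
# OpenMP: the host IR is produced first ...
cache.store("compile", "powerpc64le-ibm-linux-gnu", "openmp-host", "a.bc")
# ... and the device compile job later retrieves it as an extra input.
host_ir = cache.lookup("compile", "powerpc64le-ibm-linux-gnu", "openmp-host")
device_compile_inputs = ["a.i", host_ir]
```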

f) Intercept the jobs creation before the emission of the command.

In my view this is the only change required in the driver (apart from the
obvious toolchain changes) that would be dependent on the programming
model. A job result post-processing function could check whether there are
offloading toolchains to be used, spawn the job creation for those
toolchains, and append results from one toolchain to the results of
another, according to the needs of the programming model implementation.

E.g. for the CUDA programming model, the linker action would be recovered by
the host compile post-processing call, which would spawn the creation of
all the device jobs and append the result to the host compile phase inputs.

For the OpenMP programming model, the post-processing call for the host
linker action would spawn the creation of the device jobs and append their
results to the list of host linker inputs, and the post-processing call of
the device compile action would retrieve the host compile phase result.
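
The CUDA case could look roughly like the Python sketch below. It is a toy
model: job results are strings, and device_pipeline and cuda_post_process
are hypothetical names rather than driver entry points.

```python
def device_pipeline(src, device):
    """Spawn the device jobs for one source and return the final result."""
    result = src
    for phase in ("preprocess", "compile", "assemble", "link"):
        result = f"{phase}[{device}]({result})"
    return result


def cuda_post_process(action, inputs, src, devices):
    """Hook run on a host job before its command is emitted."""
    if action != "compile":
        # Only the host compile phase consumes device results in CUDA.
        return inputs
    # Create all device jobs and append their results to the host inputs.
    return inputs + [device_pipeline(src, d) for d in devices]


host_compile_inputs = cuda_post_process(
    "compile", ["a.i"], "a.cu", ["nvptx64-nvidia-cuda"])
```

An OpenMP hook would have the same shape but would trigger on the host link
action instead, which is the only programming-model-specific part.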

g) Reflect the offloading programming model in the naming of the save-temps
files.

Given that the same action is interpreted by different toolchains, when
-save-temps is used the resulting file name could have the programming
model name and the target triple appended so that files don’t get
overwritten.

E.g. for OpenMP one would get a.bc and a-openmp-<triple>.bc if the driver
is invoked with ‘clang -c -save-temps a.c’.
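
The naming scheme can be sketched in Python as follows. save_temps_name is
a hypothetical helper, and the exact separator is taken from the example
above rather than from any existing implementation.

```python
def save_temps_name(base, ext, model=None, triple=None):
    """Build a save-temps file name, tagging offload device outputs."""
    if model is None:
        # Host output keeps the usual name.
        return f"{base}.{ext}"
    # Device output carries the programming model and the target triple.
    return f"{base}-{model}-{triple}.{ext}"


host_name = save_temps_name("a", "bc")
device_name = save_temps_name("a", "bc", "openmp", "nvptx64-nvidia-cuda")
```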

h) Use a special option -target-offload=<triple> to specify offloading
targets and delimit options meant for a toolchain.

To avoid the proliferation of driver (and possibly frontend) options that
are specific to a programming model, I propose a new option that would
specify an offloading device and have all the options following it
processed for its toolchain. This would allow using already existing
options like -mcpu or -L/-l to tune the implementation for a given machine
or to provide linking commands that only make sense for the device.

As a hypothetical example, let’s assume we wanted to compile code that uses
CUDA for an nvptx64 device and OpenMP for an x86_64 device, with a
powerpc64le host. One could invoke the driver as:

clang -target powerpc64le-ibm-linux-gnu <more host options>

-target-offload=nvptx64-nvidia-cuda -fcuda -mcpu sm_35 <more options for
the nvptx toolchain>

-target-offload=x86_64-pc-linux-gnu -fopenmp <more options for the x86_64
toolchain>

-target-offload=host <more options for the host>

-target-offload=all <options for all toolchains>

-fcuda or -fopenmp (or any other flag specifying a programming model)
associated with an offload target would specify the programming model to be
used for that target, and an error would be emitted if no programming model
flag is found. I am also proposing “host” and “all” as special
target-offload devices, to give the user a convenient way to pass options
to the host or to all toolchains.
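
The option grouping could work roughly as in this Python sketch. It assumes
that options appearing before the first -target-offload belong to the host;
split_offload_args and check_models are hypothetical helper names.

```python
PREFIX = "-target-offload="
MODEL_FLAGS = ("-fcuda", "-fopenmp")


def split_offload_args(argv):
    """Group options by the -target-offload= that precedes them."""
    groups = {"host": [], "all": []}
    current = "host"  # assumption: leading options are host options
    for arg in argv:
        if arg.startswith(PREFIX):
            current = arg[len(PREFIX):]
            groups.setdefault(current, [])
        else:
            groups[current].append(arg)
    return groups


def check_models(groups):
    """Reject offload targets that carry no programming-model flag."""
    for target, args in groups.items():
        if target in ("host", "all"):
            continue
        if not any(flag in args for flag in MODEL_FLAGS):
            raise ValueError(f"no programming model given for {target}")


groups = split_offload_args([
    "-target", "powerpc64le-ibm-linux-gnu",
    "-target-offload=nvptx64-nvidia-cuda", "-fcuda", "-mcpu", "sm_35",
    "-target-offload=x86_64-pc-linux-gnu", "-fopenmp",
    "-target-offload=host", "-O2",
    "-target-offload=all", "-g",
])
check_models(groups)  # passes: every device target names a model
```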

i) Use the offload kinds in the toolchain to drive command generation by
Tools.

The offloading kinds in the target toolchain can be used during the
creation of commands to distinguish between different programming models
that use the same toolchain and create options that would make sense only
for a given programming model.

============

Call For Action

============

Please review this proposal (especially if you are concerned with CUDA and
OpenMP support!) and provide your feedback. Our goal is to reach an
agreement in the community and proceed with implementation.

=================

Implementation Plan

=================

1. Upon reaching agreement on the proposal, we (the IBM compiler team) will
start to submit patches implementing the required functionality in the
clang driver. Code review would be much appreciated!

2. After implementing the general functionality, the IBM compiler team will
submit patches that implement the OpenMP-specific parts of the proposal.

3. We are willing to help with implementation of CUDA-specific parts when
they overlap with the common infrastructure; though we expect that effort
to be driven also by other contributors specifically interested in CUDA
support that have the necessary know-how (both on CUDA itself and how it is
supported in Clang / LLVM).

Thanks!

Samuel