<div dir="ltr">


<p class=""><span class="">Hi all,</span></p>

<p class=""><span class="">I’d like to propose a change in the Driver implementation to support programming models that require offloading with a unified infrastructure. The goal is a design that is general enough to cover different programming models with as little programming-model-specific customization as possible. Some of this discussion already took place in <a href="http://reviews.llvm.org/D9888"><span class="">http://reviews.llvm.org/D9888</span></a>, but I would like to continue it here on the mailing list and collect as much feedback as possible.</span></p><p class=""><span class=""><br></span></p>

<p class=""><span class="">Currently, there are two programming models supported by clang that require offloading: CUDA and OpenMP. Other offloading models that could benefit from a unified driver design as they become supported in clang include SYCL (<a href="https://www.khronos.org/sycl"><span class="">https://www.khronos.org/sycl</span></a>) and OpenACC (<a href="http://www.openacc.org/"><span class="">http://www.openacc.org/</span></a>). Therefore, I’ll try to make the discussion as general as possible, but will occasionally provide examples of how it applies to CUDA and OpenMP, given that this is what people may care about more immediately. </span></p><p class=""><span class=""><br></span></p>

<p class=""><span class="">I hope I covered all the possible implications of a general offloading implementation. Let me know if you think something is missing that should also be covered, and please share your suggestions and concerns. Any feedback is very much welcome!</span></p><p class=""><span class=""><br></span></p><p class=""><span class="">Thanks!</span></p><p class=""><span class="">Samuel</span></p>

<p class=""><span class="">================</span></p>

<p class=""><span class="">Proposal Description</span></p>

<p class=""><span class="">================</span></p>

<p class=""><span class="">a) Create toolchains for host and offload devices before creating the actions.</span></p>

<p class=""><span class="">The driver has to detect the employed programming models through the provided options (e.g. -fcuda or -fopenmp) or file extensions. For each host and offloading device and programming model, it should create a toolchain. In general, the same target can be used as both host and offloading device, so toolchain creation should be given a “kind” that unequivocally specifies what that toolchain is used for. These kinds (e.g. CudaHostKind, CudaDeviceKind, OpenMPHost, etc.) would be kept in ToolChain and could be accessed through a public method so they can be used to drive the creation of commands by Tools.</span></p>
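<p class=""><span class="">To make the idea concrete, here is a minimal Python sketch of toolchain creation keyed by triple and kind. It is only a model of the proposal, not clang's C++ code: OffloadKind, ToolChain and create_toolchains are hypothetical names chosen for illustration, loosely following the kinds mentioned above.</span></p>

```python
from dataclasses import dataclass
from enum import Enum


class OffloadKind(Enum):
    # Hypothetical kinds, modelled on the examples given in the text.
    CUDA_HOST = "cuda-host"
    CUDA_DEVICE = "cuda-device"
    OPENMP_HOST = "openmp-host"
    OPENMP_DEVICE = "openmp-device"


@dataclass(frozen=True)
class ToolChain:
    triple: str
    kind: OffloadKind


def create_toolchains(host_triple, device_triples_by_model):
    """Create one toolchain per (triple, kind) pair before any action exists.

    device_triples_by_model maps a programming model name ("cuda" or
    "openmp") to the list of device triples requested for that model.
    """
    kinds = {
        "cuda": (OffloadKind.CUDA_HOST, OffloadKind.CUDA_DEVICE),
        "openmp": (OffloadKind.OPENMP_HOST, OffloadKind.OPENMP_DEVICE),
    }
    toolchains = []
    for model, device_triples in device_triples_by_model.items():
        host_kind, device_kind = kinds[model]
        toolchains.append(ToolChain(host_triple, host_kind))
        for triple in device_triples:
            toolchains.append(ToolChain(triple, device_kind))
    return toolchains


# The same x86_64 target acts as host and as OpenMP device; the kind
# is what tells the two toolchains apart.
toolchains = create_toolchains(
    "x86_64-pc-linux-gnu",
    {"openmp": ["x86_64-pc-linux-gnu"], "cuda": ["nvptx64-nvidia-cuda"]},
)
```

<p class=""><span class="">Because the kind is part of the toolchain's identity, two toolchains for the same triple remain distinct, which is the property the proposal relies on.</span></p>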

<p class=""><span class=""></span><br></p>

<p class=""><span class="">b) Keep the generation of Actions independent of the programming model. </span></p>

<p class=""><span class="">In my view, the Actions should only depend on the compile phases requested by the user and the file extensions of the input files. Only the way those actions are interpreted to create jobs should depend on the programming model. This would avoid complicating action creation with dependencies that only make sense for some programming models, which would make the implementation hard to scale as new programming models are adopted.</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">c) Use unbundling and bundling tools agnostic of the programming model.</span></p>

<p class=""><span class="">I propose a single change in the action creation: the addition of “unbundling” and “bundling” actions, whose goal is to spare users who rely on separate compilation in their build system from having to deal with multiple files generated by multiple toolchains (the host toolchain and the offloading devices’ toolchains). Users would then not need to redesign their build system in order to adopt a programming model with offloading. These actions would be introduced if offloading is required, i.e. there are toolchains that refer to offloading devices (regardless of the programming model being supported). Unbundling would be inserted if the initial action is not a source input action, and bundling would be introduced if the last phase is not a linking phase. </span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">Unbundling and Bundling could be supported by a tool specifically implemented for that purpose. I’ll post a separate RFC for this tool. </span></p>
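<p class=""><span class="">The insertion rule for these two actions can be written down as a small decision function. The Python sketch below only restates the rule from the previous paragraphs; plan_bundling and its parameter names are hypothetical and do not correspond to any existing driver function.</span></p>

```python
def plan_bundling(has_offload_toolchains, first_input_is_source, last_phase_is_link):
    """Return (needs_unbundling, needs_bundling) for one driver invocation."""
    if not has_offload_toolchains:
        # No offloading device toolchains: neither action is introduced.
        return (False, False)
    # A non-source input (e.g. an object file) may carry bundled device
    # code, so it has to be unbundled first.
    needs_unbundling = not first_input_is_source
    # If we stop before linking, the per-toolchain results are bundled
    # into a single output file.
    needs_bundling = not last_phase_is_link
    return (needs_unbundling, needs_bundling)


# Compile-only from source: the outputs of all toolchains get bundled.
compile_only = plan_bundling(True, True, False)
# Link step starting from previously bundled objects: unbundle first.
link_objects = plan_bundling(True, False, True)
```

<p class=""><span class="">With separate compilation, the compile-only step bundles and the later link step unbundles, so the build system only ever sees one file per translation unit.</span></p>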

<p class=""><span class=""></span><br></p>

<p class=""><span class="">d) Allow the target toolchain to request the host toolchain to be used for a given action.</span></p>

<p class=""><span class="">In some cases the definitions in the host toolchain are the correct ones to use. E.g. a preprocessing phase may fail because the header files expect host macros on a given system. This can be done by implementing a query in the proper ToolChain that takes into account the device target and the offloading kinds associated with it.</span></p>

<p class=""><span class=""> </span></p>

<p class=""><span class="">e)  Use a job results cache to enable sharing results between device and host toolchains. </span></p>

<p class=""><span class="">At some point, an offloading device object has to be integrated into the host object. Other intermediate job results may also have to be shared between host and device, in either direction. As an example, these are the dependencies that are required for CUDA and OpenMP:</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">CUDA (the device object is injected at the host compile phase):</span></p>

<p class=""><span class="">Src -> Device PP -> A</span></p>

<p class=""><span class="">A    -> DeviceCompile -> B</span></p>

<p class=""><span class="">B    -> DeviceAssembler -> C</span></p>

<p class=""><span class="">C    -> DeviceLinker -> D</span></p>

<p class=""><span class="">Src -> Host PP -> E</span></p>

<p class=""><span class="">E,D -> HostCompile -> F</span></p>

<p class=""><span class="">F    -> HostAssembler -> G</span></p>

<p class=""><span class="">G    -> HostLinker -> Out</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">OpenMP (Host IR has to be read by the device to determine which declarations have to be emitted and the device binary is embedded in the host binary at link phase through a proper linker script):</span></p>

<p class=""><span class="">Src -> Host PP -> A</span></p>

<p class=""><span class="">A    -> HostCompile -> B</span></p>

<p class=""><span class="">A,B -> DeviceCompile -> C</span></p>

<p class=""><span class="">C    -> DeviceAssembler -> D</span></p>

<p class=""><span class="">D    -> DeviceLinker -> F</span></p>

<p class=""><span class="">B    -> HostAssembler -> G</span></p>

<p class=""><span class="">G,F -> HostLinker -> Out</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">This can be done by generating the jobs and storing them in a cache so they can be referred to later on. This was proposed in <a href="http://reviews.llvm.org/D9888"><span class="">http://reviews.llvm.org/D9888</span></a>, and a very similar mechanism, introduced by the CUDA implementation, is already in use today. It is possible that this cache has to be extended to support more queries and to have the results sorted by Action, ToolChain and Offloading Kind.</span></p>
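<p class=""><span class="">As a rough illustration of such a cache, the Python sketch below keys job results by action, toolchain triple and offloading kind. JobResultsCache and its methods are hypothetical names, the file names are placeholders, and the sketch makes no claim about how the existing CUDA mechanism or D9888 is implemented.</span></p>

```python
class JobResultsCache:
    """Maps (action, toolchain triple, offload kind) to a job result."""

    def __init__(self):
        self._results = {}

    def store(self, action, triple, kind, result):
        self._results[(action, triple, kind)] = result

    def lookup(self, action, triple, kind):
        # Returns None when the job has not been generated yet.
        return self._results.get((action, triple, kind))

    def results_for_kind(self, kind):
        """All results recorded for one offload kind, in insertion order."""
        return [r for (_, _, k), r in self._results.items() if k == kind]


cache = JobResultsCache()
# Placeholder results for the host and device compile jobs.
cache.store("compile", "powerpc64le-ibm-linux-gnu", "openmp-host", "a.bc")
cache.store("compile", "x86_64-pc-linux-gnu", "openmp-device", "a-device.bc")

# OpenMP case: the device compile job retrieves the host IR produced earlier.
host_ir = cache.lookup("compile", "powerpc64le-ibm-linux-gnu", "openmp-host")
```

<p class=""><span class="">Keying on all three components lets the same action be cached once per toolchain and per programming model without the entries colliding.</span></p>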

<p class=""><span class=""></span><br></p>

<p class=""><span class="">f) Intercept the jobs creation before the emission of the command.</span></p>

<p class=""><span class="">In my view this is the only change required in the driver (apart from the obvious toolchain changes) that would depend on the programming model. A job result post-processing function could check whether there are offloading toolchains to be used, spawn the job creation for those toolchains, and append results from one toolchain to the results of another, according to the needs of the programming model implementation. </span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">E.g. for the CUDA programming model, the linker action would be recovered by the host compile post-processing call, which would spawn the creation of all the device jobs and append the result to the host compile phase inputs.</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">For the OpenMP programming model, the post-processing call for the host linker action would spawn the creation of the device jobs and append their results to the list of host linker inputs, and the post-processing call of the device compile action would retrieve the host compile phase result. </span></p>
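<p class=""><span class="">The CUDA case can be sketched as a hook that runs on a host job before its command is emitted. This is a simplified Python model under several assumptions: jobs are plain dictionaries, build_job and cuda_post_process are hypothetical helpers, and job outputs are represented by made-up placeholder strings rather than real file names.</span></p>

```python
def build_job(action, toolchain, inputs):
    """Stand-in for job construction: returns a description of the job."""
    return {"action": action, "toolchain": toolchain, "inputs": list(inputs)}


def cuda_post_process(job, source, device_toolchains):
    """Hook run on a host job before its command is emitted.

    For the host compile job, spawn the device jobs for every device
    toolchain and append the device linker result to the host compile
    inputs. Any other job is left unchanged.
    """
    spawned = []
    if job["action"] != "compile":
        return spawned
    for toolchain in device_toolchains:
        result = source
        for action in ("preprocess", "compile", "assemble", "link"):
            spawned.append(build_job(action, toolchain, [result]))
            # Placeholder name standing for this device job's output.
            result = f"{action}@{toolchain}"
        job["inputs"].append(result)
    return spawned


# Host compile job whose only input so far is the preprocessed host source.
host_compile = build_job("compile", "host", ["a.i"])
device_jobs = cuda_post_process(host_compile, "a.cu", ["nvptx64-nvidia-cuda"])
```

<p class=""><span class="">An OpenMP hook would have the same shape but attach to the host linker job instead, and its device compile job would additionally pull the host compile result from the cache.</span></p>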

<p class=""><span class=""></span><br></p>

<p class=""><span class="">g) Reflect the offloading programming model in the naming of the save-temps files.</span></p>

<p class=""><span class="">Given that the same action is interpreted by different toolchains, when using save-temps the resulting file name could have the programming model name and the target triple appended, so that files don’t get overwritten.</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">E.g. for OpenMP one would get a.bc and a-openmp-&lt;triple&gt;.bc if the driver is invoked with 'clang -c -save-temps a.c'.</span></p>
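<p class=""><span class="">A possible naming helper, written as a short Python sketch, could look as follows. The function save_temps_name is hypothetical; it simply reproduces the naming pattern from the example above and is not the driver's actual logic.</span></p>

```python
import os


def save_temps_name(input_path, extension, model=None, triple=None):
    """Name an intermediate file; device outputs get a model and triple suffix."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    if model is None:
        # Host result: keep the usual save-temps name.
        return f"{stem}.{extension}"
    return f"{stem}-{model}-{triple}.{extension}"


host_name = save_temps_name("a.c", "bc")
device_name = save_temps_name("a.c", "bc", "openmp", "x86_64-pc-linux-gnu")
```

<p class=""><span class="">Since both the model and the triple are part of the name, two programming models targeting the same device triple would still produce distinct files.</span></p>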

<p class=""><span class=""></span><br></p>

<p class=""><span class="">h) Use a special option -target-offload=&lt;triple&gt; to specify offloading targets and delimit options meant for a toolchain.</span></p>

<p class=""><span class="">To avoid the proliferation of driver (and possibly frontend) options that are specific to a programming model, I propose a new option that would specify an offloading device and have all the options following it processed for its toolchain. This would allow using already existing options like -mcpu or -L/-l to tune the implementation for a given machine or to provide linking commands that only make sense for the device.</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">As a hypothetical example, let’s assume we want to compile code that uses CUDA for an nvptx64 device and OpenMP for an x86_64 device, with a powerpc64le host. One could invoke the driver as:</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">clang -target powerpc64le-ibm-linux-gnu &lt;more host options&gt;</span></p>

<p class=""><span class="">-target-offload=nvptx64-nvidia-cuda -fcuda -mcpu sm_35 &lt;more options for the nvptx toolchain&gt;</span></p>

<p class=""><span class="">-target-offload=x86_64-pc-linux-gnu -fopenmp &lt;more options for the x86_64 toolchain&gt;</span></p>

<p class=""><span class="">-target-offload=host &lt;more options for the host&gt;</span></p>

<p class=""><span class="">-target-offload=all &lt;options for all toolchains&gt;</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">-fcuda or -fopenmp (or any other flag specifying a programming model) associated with an offload target would specify the programming model to be used for that target, and an error would be emitted if no programming model flag is found. I am also proposing “host” and “all” as special target-offload devices, to provide a convenient way for the user to pass options to the host or to all toolchains.</span></p>
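<p class=""><span class="">The option grouping described above can be modelled with a short Python sketch. The functions group_offload_options and check_programming_models are hypothetical, and the -O2 and -g options in the example are arbitrary placeholders standing in for the elided options; the sketch also assumes each device triple is used by a single programming model.</span></p>

```python
MODEL_FLAGS = ("-fcuda", "-fopenmp")
PREFIX = "-target-offload="


def group_offload_options(args):
    """Split driver arguments into per-toolchain option lists.

    Options before the first -target-offload= belong to the host. The
    special devices "host" and "all" are handled as proposed in the text.
    """
    groups = {"host": []}
    for_all = []  # options following -target-offload=all
    current = "host"
    for arg in args:
        if arg.startswith(PREFIX):
            current = arg[len(PREFIX):]
            if current != "all":
                groups.setdefault(current, [])
        elif current == "all":
            for_all.append(arg)
        else:
            groups[current].append(arg)
    # Options given for "all" apply to every toolchain, host included.
    for options in groups.values():
        options.extend(for_all)
    return groups


def check_programming_models(groups):
    """Raise if an offload target has no programming model flag."""
    for target, options in groups.items():
        if target != "host" and not any(f in options for f in MODEL_FLAGS):
            raise ValueError(f"no programming model flag for offload target {target}")


args = [
    "-target", "powerpc64le-ibm-linux-gnu",
    "-target-offload=nvptx64-nvidia-cuda", "-fcuda", "-mcpu", "sm_35",
    "-target-offload=x86_64-pc-linux-gnu", "-fopenmp",
    "-target-offload=host", "-O2",
    "-target-offload=all", "-g",
]
groups = group_offload_options(args)
check_programming_models(groups)
```

<p class=""><span class="">Each resulting list can then be handed to the corresponding toolchain, so existing options such as -mcpu keep their usual meaning within their own group.</span></p>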

<p class=""><span class=""></span><br></p>

<p class=""><span class="">i) Use the offload kinds in the toolchain to drive the commands generation by Tools.</span></p>

<p class=""><span class="">The offloading kinds in the target toolchain can be used during the creation of commands to distinguish between different programming models that use the same toolchain and create options that would make sense only for a given programming model. </span></p>

<p class=""><br><span class=""></span></p>

<p class=""><span class="">============</span></p>

<p class=""><span class="">Call For Action</span></p>

<p class=""><span class="">============</span></p>

<p class=""><span class="">Please review this proposal (especially if you are concerned with CUDA and OpenMP support!) and provide your feedback. Our goal is to reach an agreement in the community and proceed with implementation.</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">================= </span></p>

<p class=""><span class="">Implementation Plan</span></p>

<p class=""><span class="">=================</span></p>

<p class=""><span class="">1. Upon reaching agreement on the proposal, we (the IBM compiler team) will start to submit patches implementing the required functionality in the clang driver. Code review would be much appreciated!</span></p>

<p class=""><span class="">2. After implementing the general functionality, the IBM compiler team will submit patches that implement the OpenMP-specific parts of the proposal.</span></p>

<p class=""><span class="">3. We are willing to help with implementation of CUDA-specific parts when they overlap with the common infrastructure; though we expect that effort to be driven also by other contributors specifically interested in CUDA support that have the necessary know-how (both on CUDA itself and how it is supported in Clang / LLVM).</span></p>

<p class=""><span class=""></span><br></p>

<p class=""><span class="">Thanks!</span></p>

<p class=""><span class="">Samuel</span></p></div>