[PATCH] D9888: [OPENMP] Driver support for OpenMP offloading

Samuel Antao via cfe-commits cfe-commits at lists.llvm.org
Wed Oct 7 18:56:57 PDT 2015


sfantao added a comment.

In http://reviews.llvm.org/D9888#262325, @tra wrote:

> In http://reviews.llvm.org/D9888#257904, @sfantao wrote:
>
> > This diff refactors the original patch and is rebased on top of the latest offloading changes introduced for CUDA.
> >
> > Here I don't touch the CUDA support. I tried, however, to have the implementation modular enough so that it could eventually be combined with the CUDA implementation. In my view, OpenMP offloading is more general in the sense that it does not refer to a given toolchain; instead, it uses existing toolchains to generate code for offloading devices. So, I believe that a toolchain (which I did not include in this patch) targeting NVPTX will be able to handle both the CUDA and OpenMP offloading models.
>
>
> What do you mean by "does not refer to a given toolchain"? Do you have the toolchain patch available?


I mean not having to create a toolchain for a specific offloading model. OpenMP offloading is meant for any target, and possibly many different targets simultaneously, so having a toolchain for each combination would be overwhelming.

I don't have a patch for the toolchain out for review yet. I'm planning to port what we have in clang-omp for the NVPTX toolchain once I have the host functionality in place. In there (https://github.com/clang-omp/clang_trunk/tree/master/lib/Driver) the Driver is implemented in a different way; I guess the version I'm proposing here is much cleaner. However, the ToolChains shouldn't be that different. All the tweaking is moved to the `Tool` itself, and I imagine I can drive that using the `ToolChain` offloading kind I'm proposing here. In https://github.com/clang-omp/clang_trunk/blob/master/lib/Driver/Tools.cpp I basically pick some arguments to forward to the tool and do some tricks to include libdevice in the compilation when required. Do you think something like that could also work for CUDA?
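
To give an idea of the shape of that tweaking, here is a simplified sketch (not the actual clang-omp Tools.cpp code; the names and the bitcode-linking flag are illustrative): the device-side job forwards only a subset of the host arguments and pulls in libdevice when a path for it is known.

  // Simplified sketch of the argument tweaking done in the tool, not the
  // actual clang-omp code: forward a whitelisted subset of the host
  // arguments and add libdevice when the device toolchain provides it.
  #include <string>
  #include <vector>

  std::vector<std::string>
  buildDeviceFrontendArgs(const std::vector<std::string> &HostArgs,
                          const std::string &LibDevicePath /*may be empty*/) {
    std::vector<std::string> Args;
    for (const std::string &A : HostArgs)
      if (A.rfind("-O", 0) == 0 || A.rfind("-I", 0) == 0 || A.rfind("-D", 0) == 0)
        Args.push_back(A); // options that are safe to share with the device
    if (!LibDevicePath.empty()) {
      // Link the libdevice bitcode into the device compilation
      // (the flag name here is illustrative).
      Args.push_back("-mlink-bitcode-file");
      Args.push_back(LibDevicePath);
    }
    return Args;
  }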

> Creating a separate toolchain for CUDA was a crutch that was available to craft an appropriate cc1 command line for device-side compilation using an existing toolchain. It works, but it's a rather rigid arrangement. Creating an NVPTX toolchain which can be parameterized to produce CUDA or OpenMP would be an improvement.
>
> Ideally, toolchain tweaking should probably be done outside of the toolchain itself so that it can be used with any combination of {CUDA or OpenMP target tweaks} x {toolchains capable of generating target code}.


I agree. I decided to move all the offloading tweaking to the tools, given that this is what the clang tool already does: it customizes the arguments based on the `ToolChain` that is passed to it.
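
As a simplified illustration of that direction (this is not the code in the patch; the enum and the OpenMP device flag name are made up, only -fcuda-is-device is an existing cc1 flag), the toolchain would carry an offloading kind and the tool would branch on it when building the cc1 command line:

  // Simplified sketch: existing toolchains carry an offloading kind and the
  // tool consults it, instead of creating one toolchain per offloading model.
  #include <string>
  #include <vector>

  enum class OffloadKind { None, HostOpenMP, DeviceOpenMP, DeviceCUDA };

  struct ToolChainInfo {
    OffloadKind Kind = OffloadKind::None;
    std::string Triple; // e.g. "nvptx64-nvidia-cuda" for a device toolchain
  };

  void addOffloadingArgs(const ToolChainInfo &TC,
                         std::vector<std::string> &CmdArgs) {
    switch (TC.Kind) {
    case OffloadKind::HostOpenMP:
      CmdArgs.push_back("-fopenmp");
      break;
    case OffloadKind::DeviceOpenMP:
      // Mark this as an OpenMP device-side compilation for TC.Triple
      // (illustrative flag name).
      CmdArgs.push_back("-fopenmp-is-device");
      break;
    case OffloadKind::DeviceCUDA:
      CmdArgs.push_back("-fcuda-is-device"); // existing cc1 flag
      break;
    case OffloadKind::None:
      break;
    }
  }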

> > b) The building of the driver actions is unchanged.
> >
> > I don't create device-specific actions. Instead, only the bundling/unbundling actions are inserted as the first or last action if the file type requires that.
>
> Could you elaborate on that? The way I read it, the driver sees a linear chain of compilation steps plus bundling/unbundling at the beginning/end, and each action would result in multiple compiler invocations, presumably per target.
>
> If that's the case, then it may present a bit of a challenge in case one part of compilation depends on the results of another. That's the case for CUDA, where results of device-side compilation must be present for host-side compilation so we can generate additional code to initialize it at runtime.


That's right. I try to tackle the challenge of passing host/device results to device/host jobs by using a cache of results, as I described in d). The goal here is to add the flexibility required to accommodate different offloading models: in OpenMP we use host compile results in device compile jobs and device link results in host link jobs, whereas in CUDA the assemble result is used in the compile job. I believe that cache can include whatever information is required to suit all needs.
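
To make the idea concrete, here is a rough sketch of the kind of results cache I have in mind (the names and data structures are illustrative, not the ones in the patch):

  // Sketch of a job-results cache: results produced while building jobs are
  // recorded per action and per device, so that a later device (or host) job
  // can pick up the result it depends on as an extra input.
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  struct Action;        // driver action (compile, assemble, link, ...)
  struct InputInfo {    // result of a job: a filename plus its type
    std::string Filename;
    std::string Type;
  };

  // Key: the action plus the device number it was generated for
  // (-1 meaning the host, as an illustrative convention).
  using ResultKey = std::pair<const Action *, int>;

  class OffloadResultCache {
    std::map<ResultKey, std::vector<InputInfo>> Results;

  public:
    void record(const Action *A, int DeviceID, InputInfo Result) {
      Results[{A, DeviceID}].push_back(std::move(Result));
    }

    // E.g. an OpenMP device compile job asking for the matching host compile
    // result, or a CUDA host compile job asking for the device assemble result.
    const std::vector<InputInfo> *lookup(const Action *A, int DeviceID) const {
      auto It = Results.find({A, DeviceID});
      return It == Results.end() ? nullptr : &It->second;
    }
  };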

> > c) Add offloading kind to `ToolChain`
> >
> > Offloading does not require a new toolchain to be created. Existing toolchains are used, and the offloading kind is used to drive specific behavior in each toolchain so that valid device code is generated.
> >
> > This is a major difference from what is currently done for CUDA. But I guess the CUDA implementation easily fits this design and the NVIDIA GPU toolchain could be reused for both CUDA and OpenMP offloading.
>
> Sounds good. I'd be happy to make the necessary changes to have CUDA support use it.


Great! Thanks.

> > d) Use a job results cache to easily use host results in device actions and vice-versa.
> >
> > An array of the results for each job is kept so that a device job can take the result previously generated for the host and use it as input, or vice-versa.
>
> Nice. That's something that will be handy for CUDA and may help to avoid passing bits of info about other jobs explicitly throughout the driver.
>
> > The result cache can also be updated to keep the information the CUDA implementation requires to decide how host/device binaries are combined (injection is the term used in the code). I don't have a concrete proposal for that, however, given that it is not clear to me what the plans are for CUDA to support separate compilation; I understand that the CUDA binary is inserted directly into the host IR (Art, can you shed some light on this?).
>
> Currently CUDA depends on libcudart, which assumes that GPU code and its initialization are done the way nvcc does it. Currently we do include the PTX assembly (as in readable text) generated on the device side into the host-side IR *and* generate some host data structures and init code to register GPU binaries with libcudart. I haven't figured out a way to compile the host/device sides of CUDA without host-side compilation depending on device results.
>
> Long-term, we're considering implementing CUDA runtime support based on the plain driver interface, which would give us more control over where we keep GPU code and how we initialize it. Then we could simplify things and, for example, incorporate GPU code via a linker script. Alas, for the time being we're stuck with libcudart and sequential device and host compilation phases.
>
> As for separate compilation -- the compilation part is doable. It's using the results of such compilation that becomes tricky. CUDA's triple-bracket kernel launch syntax depends on libcudart and will not work, because we would not generate init code. You can still launch kernels manually using the raw driver API, but it's quite a bit more convoluted.


Ok, I see. I am not aware of what exactly libcudart does, but I can elaborate on what the OpenMP offloading implementation we have in place does:

We have a descriptor that is registered with the runtime library (we generate a function for that, which is called before any global initializers are executed). This descriptor has (among other things) fields that are initialized with the symbols defined by the linker script (so that the runtime library can immediately get the CUDA module), and also the names of the kernels (in OpenMP we don't have user-defined names for these kernels, so we generate some mangling to make sure they are unique). When launching a kernel, the runtime gets a pointer from which it can easily retrieve the name, and the CUDA driver API is used to get the CUDA function to be launched. We have been successfully generating a CUDA module that works well with separate compilation using ptxas and nvlink.
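
To make that flow a bit more concrete, here is a much-simplified sketch. The descriptor layout, the linker script symbol names, and the registration entry point are made up for the example; only the CUDA driver API calls (cuModuleLoadData, cuModuleGetFunction, cuLaunchKernel) are the real interface, and error checking is omitted.

  #include <cuda.h>

  // Symbols assumed to be provided by a linker script that wraps the device
  // image (the names are made up for this sketch).
  extern "C" char __omp_device_image_start[];
  extern "C" char __omp_device_image_end[];

  // Illustrative descriptor layout; the real one carries more information.
  struct OffloadDescriptor {
    const char *ImageStart;
    const char *ImageEnd;
    const char *const *KernelNames; // compiler-mangled, unique kernel names
    unsigned NumKernels;
  };

  // Hypothetical registration entry point of the runtime library; the
  // compiler emits a constructor that calls it before global initializers.
  static const OffloadDescriptor *RegisteredDesc = nullptr;
  extern "C" void __offload_register_descriptor(const OffloadDescriptor *D) {
    RegisteredDesc = D;
  }

  // Roughly what the compiler-generated registration looks like.
  static const char *KernelNames[] = {"__omp_offloading_kernel0"};
  static OffloadDescriptor Desc = {__omp_device_image_start,
                                   __omp_device_image_end, KernelNames, 1};
  __attribute__((constructor)) static void RegisterLib() {
    __offload_register_descriptor(&Desc);
  }

  // At launch time the runtime maps a host-side pointer to a kernel index,
  // loads the module straight from the linker-script-delimited image, and
  // asks the CUDA driver for the function to launch.
  CUfunction lookupKernel(unsigned KernelIdx) {
    CUmodule Mod;
    cuModuleLoadData(&Mod, RegisteredDesc->ImageStart);
    CUfunction Fn;
    cuModuleGetFunction(&Fn, Mod, RegisteredDesc->KernelNames[KernelIdx]);
    return Fn;
  }

  void launchKernel(CUfunction Fn, void **Args, CUstream Stream) {
    // Placeholder launch configuration: one block of 128 threads.
    cuLaunchKernel(Fn, /*grid*/ 1, 1, 1, /*block*/ 128, 1, 1,
                   /*sharedMemBytes*/ 0, Stream, Args, /*extra*/ nullptr);
  }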

Part of my work is also to port the runtime library in clang-omp to the LLVM OpenMP project. I see CUDA as a simplified version of what OpenMP does, given that the user controls the data mappings explicitly, so I am sure we can find some synergies in the runtime library too, and you may be able to use something that we already have in there.

Thanks!
Samuel

> --Artem



http://reviews.llvm.org/D9888




