[PATCH] D9888: [OPENMP] Driver support for OpenMP offloading

Artem Belevich via cfe-commits cfe-commits at lists.llvm.org
Wed Oct 7 16:50:02 PDT 2015


tra added a comment.

In http://reviews.llvm.org/D9888#257904, @sfantao wrote:

> This diff refactors the original patch and is rebased on top of the latest offloading changes added for CUDA.
>
> Here I don't touch the CUDA support. I tried, however, to make the implementation modular enough so that it could eventually be combined with the CUDA implementation. In my view OpenMP offloading is more general in the sense that it does not refer to a given toolchain; instead, it uses existing toolchains to generate code for offloading devices. So, I believe that a toolchain (which I did not include in this patch) targeting NVPTX will be able to handle both the CUDA and OpenMP offloading models.


What do you mean by "does not refer to a given toolchain"? Do you have the toolchain patch available?

Creating a separate toolchain for CUDA was a crutch that let us craft the appropriate cc1 command line for device-side compilation using an existing toolchain. It works, but it's a rather rigid arrangement. Creating an NVPTX toolchain which can be parameterized to produce either CUDA or OpenMP would be an improvement.

Ideally, the toolchain tweaking should probably be done outside of the toolchain itself, so that it can be used with any combination of {CUDA or OpenMP target tweaks} x {toolchains capable of generating target code}.
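
To make that concrete, something along these lines is what I have in mind -- a rough sketch only, with illustrative names that are not in this patch or in the tree: the offload-model-specific cc1 tweaks are keyed by offload kind and applied on top of whatever toolchain actually generates the target code.

  // Hypothetical sketch only -- these names are illustrative, not from
  // this patch or the current tree.
  #include <string>
  #include <vector>

  enum class OffloadKind { None, CUDA, OpenMP };

  // Offload-model-specific cc1 tweaks live outside any particular toolchain,
  // so any toolchain capable of generating code for the target can be
  // combined with any offload model.
  void addDeviceOffloadArgs(OffloadKind Kind,
                            std::vector<std::string> &CC1Args) {
    switch (Kind) {
    case OffloadKind::CUDA:
      CC1Args.push_back("-fcuda-is-device");   // existing CUDA device-side flag
      break;
    case OffloadKind::OpenMP:
      // Whatever cc1 spelling the OpenMP patches settle on; placeholder here.
      CC1Args.push_back("-fopenmp-is-device");
      break;
    case OffloadKind::None:
      break;
    }
  }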

> b) The building of the driver actions is unchanged.

> 

> I don't create device-specific actions. Instead, only the bundling/unbundling actions are inserted as the first or last action if the file type requires it.


Could you elaborate on that? The way I read it, the driver sees a linear chain of compilation steps plus bundling/unbundling at the beginning/end, and each action would result in multiple compiler invocations, presumably one per target.

If that's the case, then it may present a bit of a challenge when one part of the compilation depends on the results of another. That's the case for CUDA, where the results of device-side compilation must be available during host-side compilation so that we can generate the additional code that initializes them at runtime.

> c) Add offloading kind to `ToolChain`

> 

> Offloading does not require a new toolchain to be created. Existing toolchains are used, and the offloading kind is used to drive specific behavior in each toolchain so that valid device code is generated.

> 

> This is a major difference from what is currently done for CUDA. But I guess the CUDA implementation easily fits this design and the Nvidia GPU toolchain could be reused for both CUDA and OpenMP offloading.


Sounds good. I'd be happy to make the necessary changes so that CUDA support uses it.

> d) Use Job results cache to easily use host results in device actions and vice-versa.

> 

> An array of the results for each job is kept so that the device job can take the result previously generated for the host and use it as input, or vice versa.


Nice. That's something that will be handy for CUDA and may help us avoid passing bits of info about other jobs explicitly throughout the driver.
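
Roughly the shape I picture it taking -- a hypothetical sketch, not the patch's actual OffloadingHostResults structure:

  // Hypothetical sketch of a per-action result cache; the patch's actual
  // OffloadingHostResults structure may look quite different.
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  struct JobResult {
    std::string Filename;   // e.g. the host output fed to a device job
  };

  using ActionID = unsigned;

  class OffloadResultCache {
    std::map<ActionID, std::vector<JobResult>> Results;

  public:
    void record(ActionID A, JobResult R) {
      Results[A].push_back(std::move(R));
    }

    // A device job looks up the host results for the corresponding action
    // (or vice versa) instead of having them threaded through every call.
    const std::vector<JobResult> *lookup(ActionID A) const {
      auto It = Results.find(A);
      return It == Results.end() ? nullptr : &It->second;
    }
  };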

> The result cache can also be updated to keep the information the CUDA implementation needs to decide how host/device binaries are combined (injection is the term used in the code). I don't have a concrete proposal for that, however, given that it is not clear to me what the plans are for CUDA to support separate compilation; I understand that the CUDA binary is inserted directly into the host IR (Art, can you shed some light on this?).


Currently, CUDA depends on libcudart, which assumes that GPU code and its initialization are handled the way nvcc does it. We include the PTX assembly (as in readable text) generated by the device-side compilation into the host-side IR *and* generate some host data structures and init code to register the GPU binaries with libcudart. I haven't figured out a way to compile the host/device sides of CUDA without host-side compilation depending on the device results.
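
Very roughly, what the host side ends up carrying looks like this -- a heavily simplified sketch; the real generated code wraps the PTX in a fatbin wrapper struct and uses the full __cudaRegister* entry points from libcudart:

  // Heavily simplified sketch of the host-side init code clang generates;
  // the real signatures and the fatbin wrapper struct are omitted here.
  extern "C" void **__cudaRegisterFatBinary(void *FatbinWrapper);
  extern "C" void __cudaUnregisterFatBinary(void **Handle);

  // PTX text produced by the device-side compilation, embedded as host data.
  static const char DevicePTX[] = "// ...PTX assembly text...";

  static void **FatbinHandle;

  // A static constructor registers the GPU code with libcudart before main()
  // runs, so that kernel launches can find it.  (Real code passes a wrapper
  // struct around the PTX, not the raw text.)
  __attribute__((constructor)) static void RegisterGPUBinary() {
    FatbinHandle = __cudaRegisterFatBinary((void *)DevicePTX);
  }

  __attribute__((destructor)) static void UnregisterGPUBinary() {
    __cudaUnregisterFatBinary(FatbinHandle);
  }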

Long-term, we're considering implementing CUDA runtime support based on the plain driver interface, which would give us more control over where we keep GPU code and how we initialize it. Then we could simplify things and, for example, incorporate the GPU code via a linker script. Alas, for the time being we're stuck with libcudart and sequential device and host compilation phases.

As for separate compilation -- the compilation part is doable. It's using the results of such a compilation that becomes tricky. CUDA's triple-bracket kernel launch syntax depends on libcudart and will not work, because we would not generate the init code. You can still launch kernels manually using the raw driver API, but it's quite a bit more convoluted.
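
For comparison, here's roughly what a manual launch through the driver API looks like (standard CUDA driver API calls; error checking omitted; the kernel name and argument are made up):

  #include <cuda.h>

  // Manually launch a kernel from a separately produced PTX/cubin image --
  // the raw-driver-API equivalent of kernel<<<1, 128>>>(Arg).
  void launchManually(const void *DeviceImage, CUdeviceptr Arg) {
    cuInit(0);
    CUdevice Dev;
    cuDeviceGet(&Dev, 0);
    CUcontext Ctx;
    cuCtxCreate(&Ctx, 0, Dev);
    CUmodule Mod;
    cuModuleLoadData(&Mod, DeviceImage);          // PTX or cubin image
    CUfunction Fn;
    cuModuleGetFunction(&Fn, Mod, "kernel");      // made-up kernel name
    void *Params[] = {&Arg};
    cuLaunchKernel(Fn, /*grid*/ 1, 1, 1, /*block*/ 128, 1, 1,
                   /*sharedMemBytes*/ 0, /*stream*/ nullptr, Params, nullptr);
    cuCtxSynchronize();
  }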

--Artem


================
Comment at: include/clang/Driver/Driver.h:208
@@ +207,3 @@
+  /// CreateUnbundledOffloadingResult - Create a command to unbundle the input
+  /// and use the resulting input info. If there re inputs already cached in
+  /// OffloadingHostResults for that action use them instead. If no offloading
----------------
re -> are

================
Comment at: include/clang/Driver/Driver.h:210
@@ +209,3 @@
+  /// OffloadingHostResults for that action use them instead. If no offloading
+  /// is being support just return the provided input info.
+  InputInfo CreateUnbundledOffloadingResult(
----------------
"If offloading is not supported" perhaps?

================
Comment at: lib/Driver/Driver.cpp:2090
@@ +2089,3 @@
+          dyn_cast<OffloadUnbundlingJobAction>(A)) {
+    // The input of the unbundling job has to a single input non-source file,
+    // so we do not consider it having multiple architectures. We just use the
----------------
"has to be"


http://reviews.llvm.org/D9888




