[PATCH] D9888: [OPENMP] Driver support for OpenMP offloading

Thu Oct 1 11:22:54 PDT 2015

sfantao updated this revision to Diff 36263.
sfantao added a comment.

This diff refactors the original patch and is rebased on top of the latests offloading changes inserted for CUDA.

Here I don't touch the CUDA support. I tried, however, to have the implementation modular enough so that it could eventually be combined with the CUDA implementation. In my view OpenMP offloading is more general in the sense that it does not refer to a given tool chain, instead it uses existing toolchains to generate code for offloading devices. So, I believe that a tool chain (which I did not include in this patch) targeting NVPTX will be able to handle both CUDA and OpenMP offloading models.

Chris, Art, I understand you have worked out the latest CUDA changes so any feedback from you is greatly appreciated!

Here are few more details about this diff:

Add tool to bundle and unbundle corresponding host and device files into a single one.

One of the goals of OpenMP offloading is to enable users to offload with little effort, by annotating the code with a few pragmas. I'd also like to save users the trouble of changing their existent applications' build system. So having the compiler always return a single file instead of one for the host and each target even if the user is doing separate compilation is desirable.

This diff includes a tool named clang-offload-bundled (happy to change the name or even include it in the driver if someone thinks it is the best direction to go) that is used on all input files that are not source files to unbundle them, and on top level jobs that are not linking jobs to bundle the results obtained for host and each target.

The format of the bundled files is currently very simple: text formats are concatenated with comments that have a magic string and target identifying triple in between, and binary formats have a header that contains the triple and the offset and size of the code for host and each target.

This tool still has to be improved in the future to deal with archive files so that each individual file in the archive is properly dealt with. We see that archives are very commonly used in current application to combine separate compilation results. So I'm convinced users would enjoy this feature.

The building of the driver actions is unchanged.

I don't create device specific actions. Instead only the bundling/unbundling are inserted as first or last action if the file type requires that.

Add offloading kind to `ToolChain`

Offloading does not require a new toolchain to be created. Existent toolchains are used and the offloading kind is used to drive specific behavior in each toolchain so that valid device code is generated.

This is a major difference from what is currently done for CUDA. But I guess the CUDA implementation easily fits this design and the Nvidia GPU toolchain could be reused for both CUDA and OpenMP offloading.

Use Job results cache to easily use host results in device actions and vice-versa.

An array of the results for each job is kept so that the device job can use the result previously generated for the host and used it as input or vice-versa.

In OpenMP the device declarations have be communicated from the host frontend to the device frontend. So this is used to conveniently pass that information. Unlike CUDA, OpenMP doesn't have already outline functions with "device" attributes that the frontend can rely on to make the decision on what to be emitted or not.

The result cache can also be updated to keep the required information for the CUDA implementation to decide host/device binaries combining  (injection is the term used in the code). I don't have a concrete proposal for that however, given that is not clear to me what are the plans for CUDA to support separate compilation, I understand that the CUDA binary is inserted directly in host IR (Art, can you shed some light on this?).

Use compiler generated linker script to do the device/host code combining and correctly support separate compilation.

Currently the OpenMP support  in the toolchains is only implemented for Generic GCC targets and a linker script is used to embed the resulting device images into the host binary ELF sections. Also, the linker script defines the symbols that are emitted during code generation so that the address of the images can be easily retrieved.

Minor refactoring of the existing code to enable reusing.

I've outlined some of the exiting code into static function so that it could be reused by the new offloading related hooks.

Any comments/remarks are very welcome!

Thanks!
Samuel

http://reviews.llvm.org/D9888

Files:
  include/clang/Basic/DiagnosticDriverKinds.td
  include/clang/Driver/Action.h
  include/clang/Driver/CC1Options.td
  include/clang/Driver/Driver.h
  include/clang/Driver/Options.td
  include/clang/Driver/ToolChain.h
  include/clang/Driver/Types.h
  lib/Driver/Action.cpp
  lib/Driver/Compilation.cpp
  lib/Driver/Driver.cpp
  lib/Driver/ToolChain.cpp
  lib/Driver/ToolChains.cpp
  lib/Driver/ToolChains.h
  lib/Driver/Tools.cpp
  lib/Driver/Tools.h
  lib/Driver/Types.cpp
  test/OpenMP/target_driver.c
  tools/CMakeLists.txt
  tools/Makefile
  tools/clang-offload-bundler/CMakeLists.txt
  tools/clang-offload-bundler/ClangOffloadBundler.cpp
  tools/clang-offload-bundler/Makefile

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D9888.36263.patch
Type: text/x-patch
Size: 91618 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20151001/049878c2/attachment-0001.bin>