[PATCH] D9888: [OPENMP] Driver support for OpenMP offloading

Samuel Antao via cfe-commits cfe-commits at lists.llvm.org
Wed Apr 6 18:53:31 PDT 2016


sfantao marked 8 inline comments as done.
sfantao added a comment.

Hi Eric,

Thanks for the review!

As you are probably aware, I started partitioning this patch following your initial concern about its size and the feedback I got from http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html. I am keeping this patch as it shows the big picture of what I am trying to accomplish, so if you prefer to add other higher-level suggestions here, that's perfectly fine. Let me know if there is a better way to link patches.

So, I am incorporating your suggestions here in the partitioned patches as specified in the inline comments. The partitioned patches are http://reviews.llvm.org/D18170, http://reviews.llvm.org/D18171 and http://reviews.llvm.org/D18172.

> One meta comment: offload appears to be something that could be used for CUDA and OpenMP (and OpenACC etc) as a term. I think we should either merge these concepts or pick a different name :)


Yes, I agree. I am now using `offloading`. I only refer to the programming model name if the code relates to something specific to that programming model.

Thanks again,
Samuel


================
Comment at: include/clang/Driver/Driver.h:210-213
@@ +209,6 @@
+  /// owns all the ToolChain objects stored in it, and will clean them up when
+  /// torn down. We use a different cache for offloading as it is possible to
+  /// have offloading toolchains with the same triple the host has, and the
+  /// implementation has to differentiate the two in order to adjust the
+  /// commands for offloading.
+  mutable llvm::StringMap<ToolChain *> OffloadToolChains;
----------------
echristo wrote:
> Example?
I got rid of this extra toolchain cache and I am organizing it in a multimap by offload kind, as Art suggested in http://reviews.llvm.org/D18170. That avoids having multiple containers for the offloading toolchains (this one and the ordered one).

================
Comment at: include/clang/Driver/Driver.h:216-217
@@ +215,4 @@
+
+  /// \brief Array of the toolchains of offloading targets in the order they
+  /// were requested by the user.
+  SmallVector<const ToolChain *, 4> OrderedOffloadingToolchains;
----------------
echristo wrote:
> Any reason?
Currently in OpenMP, any directive that relates to offloading supports a `device()` clause that specifies which device to use for that region or data transfer. E.g.:

```
void foo() {
 ...
}

void bar(int i) {
  #pragma omp target device(i)
   foo();
}
```
... here `foo` is going to be executed on device `i`. The problem is that the device is an integer - it does not say which device type it is - so it is up to the implementation to decide how `i` is interpreted. So, if we have a system with two GPUs and two DSP devices, we may bind 0-1 to the GPUs and 2-3 to the DSPs.

My goal in preserving the order of the toolchains was to allow codegen to leverage that information and make a better decision on how to bind devices to integers. Maybe, if the user requests the GPU toolchain first, he is interested in prioritizing its use, so the first IDs would map to the GPUs. Making a long story short, this is only about preserving information so that codegen can use it.

In any case, this is going to change in the future, as the OpenMP language committee is working on having a device identifier to use instead of an integer. So, if you prefer to remove the `Ordered` from the name, I am not opposed to that.
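To make the intent concrete, here is a minimal sketch of the kind of mapping I have in mind. The names (`OffloadToolChain`, `resolveDevice`, `NumDevices`) are illustrative only and not part of the patch; the point is just that walking the toolchains in user-requested order lets the implementation bind the low device IDs to the first-requested toolchain:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: toolchains kept in the order the user requested
// them, each owning some number of physical devices.
struct OffloadToolChain {
  std::string Triple;
  int NumDevices; // devices managed by this toolchain's runtime
};

// Resolve an OpenMP device() integer to the toolchain that owns it,
// walking the toolchains in user-requested order.
const OffloadToolChain *
resolveDevice(const std::vector<OffloadToolChain> &Ordered, int DeviceID) {
  for (const OffloadToolChain &TC : Ordered) {
    if (DeviceID < TC.NumDevices)
      return &TC;
    DeviceID -= TC.NumDevices; // skip past this toolchain's devices
  }
  return nullptr; // no such device
}
```

With two GPUs requested first and two DSPs second, IDs 0-1 resolve to the GPU toolchain and 2-3 to the DSP toolchain, matching the example above.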


================
Comment at: include/clang/Driver/Driver.h:427-435
@@ -383,10 +426,11 @@
   /// action \p A.
   void BuildJobsForAction(Compilation &C,
                           const Action *A,
                           const ToolChain *TC,
                           const char *BoundArch,
                           bool AtTopLevel,
                           bool MultipleArchs,
                           const char *LinkingOutput,
-                          InputInfo &Result) const;
+                          InputInfo &Result,
+                          OffloadingHostResultsTy &OffloadingHostResults) const;
 
----------------
echristo wrote:
> This function is starting to get a little silly. Perhaps we should look into refactoring such that this doesn't need to be "the one function that rules them all". Perhaps a different ownership model for the things that are arguments here?
This has changed a little in recent CUDA work. In the version http://reviews.llvm.org/D18171 is based on, `Result` is returned instead of being passed by reference, and we have a string/action-result map. I'll have to add the offloading kind to that string eventually, but in the partitioned patches I didn't touch that yet.

Do you suggest having that cache owned by the driver instead of passing it along?
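To illustrate the cache-key change I mean, here is a hedged sketch (the function name and key layout are hypothetical, and the real map is keyed on richer types than strings): folding the offloading kind into the key keeps host and device results apart even when both use the same triple.

```cpp
#include <cstdio>
#include <string>

// Illustrative only: build a memoization key for a (action, toolchain,
// bound arch, offload kind) tuple. Including the offload kind means a
// host result and a device result with the same triple never collide.
std::string makeActionKey(const void *Action, const std::string &Triple,
                          const std::string &BoundArch,
                          const std::string &OffloadKind) {
  char Buf[32];
  std::snprintf(Buf, sizeof(Buf), "%p", Action); // identity of the action
  return std::string(Buf) + "/" + Triple + "/" + BoundArch + "/" + OffloadKind;
}
```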

================
Comment at: lib/Driver/Compilation.cpp:66-67
@@ +65,4 @@
+
+    // Check if there is any offloading specific translation to do.
+    DerivedArgList *OffloadArgs = TC->TranslateOffloadArgs(*Entry, BoundArch);
+    if (OffloadArgs) {
----------------
echristo wrote:
> Hmm?
This relates to some extent to your other question: how do we pass device-specific options?

So, right now we are relying on the host options to derive the device-specific options. This hook was meant to tune the host options so that things that do not make sense on the device are filtered out. Also, the resulting device image is usually a shared library so that it can be easily loaded, so this hook is also used to specify the options that result in a shared library, even if the host options don't ask for one.

Can you think of a better way to abstract this?
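For reference, the translation I describe amounts to something like the following sketch (working on plain strings for illustration; the real hook works on a `DerivedArgList`, and which options get filtered is an assumption here, not what the patch hard-codes):

```cpp
#include <string>
#include <vector>

// Illustrative sketch of a TranslateOffloadArgs-style hook: start from
// the host command line, drop options that make no sense for the device,
// and force the flags that produce a loadable shared library.
std::vector<std::string>
translateOffloadArgs(const std::vector<std::string> &HostArgs) {
  std::vector<std::string> DeviceArgs;
  for (const std::string &A : HostArgs) {
    // Filter host-only options, e.g. static linking or host linker flags.
    if (A == "-static" || A.rfind("-Wl,", 0) == 0)
      continue;
    DeviceArgs.push_back(A);
  }
  // The device image is loaded at runtime, so build it as a shared
  // library even if the host link is not one.
  DeviceArgs.push_back("-fPIC");
  DeviceArgs.push_back("-shared");
  return DeviceArgs;
}
```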

================
Comment at: lib/Driver/Driver.cpp:224-225
@@ +223,4 @@
+
+/// \brief Dump the job bindings for a given action.
+static void DumpJobBindings(ArrayRef<const ToolChain *> TCs, StringRef ToolName,
+                            ArrayRef<InputInfo> Inputs,
----------------
echristo wrote:
> This can probably be done separately? Can you split this out and make it generally useful?
Given the feedback I got in http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html, I ended up moving most of the functionality I had in job creation to the creation of actions. Having an action graph that shows the offloading specifics was a desired feature. As a result, what gets more complex is the dump of the actions.

In http://reviews.llvm.org/D18171 I have an example of how that dump looks. That patch also proposes a unified offloading action that should be reused by the different offloading programming models. Does this address your concern?

================
Comment at: lib/Driver/Driver.cpp:2045-2051
@@ -1739,11 +2044,9 @@
     // checking the backend tool, check if the tool for the CompileJob
-    // has an integrated assembler.
-    const ActionList *BackendInputs = &(*Inputs)[0]->getInputs();
-    // Compile job may be wrapped in CudaHostAction, extract it if
-    // that's the case and update CollapsedCHA if we combine phases.
-    CudaHostAction *CHA = dyn_cast<CudaHostAction>(*BackendInputs->begin());
-    JobAction *CompileJA =
-        cast<CompileJobAction>(CHA ? *CHA->begin() : *BackendInputs->begin());
-    assert(CompileJA && "Backend job is not preceeded by compile job.");
-    const Tool *Compiler = TC->SelectTool(*CompileJA);
-    if (!Compiler)
+    // has an integrated assembler. However, if OpenMP offloading is required
+    // the backend and compile jobs have to be kept separate and an integrated
+    // assembler of the backend job will be queried instead.
+    JobAction *CurJA = cast<BackendJobAction>(*Inputs->begin());
+    const ActionList *BackendInputs = &CurJA->getInputs();
+    CudaHostAction *CHA = nullptr;
+    if (!RequiresOpenMPOffloading(TC)) {
+      // Compile job may be wrapped in CudaHostAction, extract it if
----------------
echristo wrote:
> Might be time to make some specialized versions of this function. This may take it from "ridiculously confusing" to "code no one should ever look at" :)
I agree. This function is really messy... :S

In http://reviews.llvm.org/D18171 I am proposing `collapseOffloadingAction`, which drives the collapsing of offload actions and abstracts some of the complexity in `selectToolForJob`. Do you think that goes in the right direction, or should I do something else?

================
Comment at: lib/Driver/Tools.cpp:6032
@@ +6031,3 @@
+  // The (un)bundling command looks like this:
+  // clang-offload-bundler -type=bc
+  //   -omptargets=host-triple,device-triple1,device-triple2
----------------
echristo wrote:
> Should we get the offload bundler in first so that the interface is there and testable? (Honest question, no particular opinion here). Though the command lines there will affect how this code is written. 
Yes, sure. I proposed an implementation of the bundler, using a generic format, in http://reviews.llvm.org/D13909. Let me know any comments you have about that specific component.

I still need to add testing specific to http://reviews.llvm.org/D13909, which I haven't done yet because I didn't know where it was supposed to live - maybe in the Driver? Do you have an opinion about that?

Also, in http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html, the general opinion was that the bundler should use the host object format to bundle whenever possible. So, I also have to add a default behavior for the binary bundler when the input is an object file. For the other input types, I don't think there were any strong opinions. Do you happen to have one?

In any case, I was planning to add the object-file-specific bundling in a separate patch, which seems to me a natural way to partition the bundler functionality. Does that sound like a good plan?

================
Comment at: test/OpenMP/target_driver.c:41-47
@@ +40,9 @@
+
+// CHK-PHASES-LIB-DAG: {{.*}}: linker, {[[L0:[0-9]+]], [[A0:[0-9]+]]}, image
+// CHK-PHASES-LIB-DAG: [[A0]]: assembler, {[[A1:[0-9]+]]}, object
+// CHK-PHASES-LIB-DAG: [[A1]]: backend, {[[A2:[0-9]+]]}, assembler
+// CHK-PHASES-LIB-DAG: [[A2]]: compiler, {[[A3:[0-9]+]]}, ir
+// CHK-PHASES-LIB-DAG: [[A3]]: preprocessor, {[[I:[0-9]+]]}, cpp-output
+// CHK-PHASES-LIB-DAG: [[I]]: input, {{.*}}, c
+// CHK-PHASES-LIB-DAG: [[L0]]: input, "m", object
+
----------------
echristo wrote:
> Do we really think the phases should be a DAG check?
Using a DAG seemed to me a robust way to test that. I'd have to double-check, but several map containers are used for the inputs and actions, so the order may depend on the implementation of the container. I was just trying to test in a way that is safe against that.

Do you prefer to change this to the exact sequence I am getting?

================
Comment at: test/OpenMP/target_driver.c:54
@@ +53,3 @@
+// RUN:   echo 'bla' > %t.o
+// RUN:   %clang -ccc-print-phases -lm -fopenmp=libomp -target powerpc64-ibm-linux-gnu -omptargets=x86_64-pc-linux-gnu,powerpc64-ibm-linux-gnu %s %t.o 2>&1 \
+// RUN:   | FileCheck -check-prefix=CHK-PHASES-OBJ %s
----------------
echristo wrote:
> How do you pass options to individual omptargets? e.g. -mvsx or -mavx2?
Well, currently I don't. In http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html I was proposing something to tackle that, but the opinion was that it was somewhat secondary and the driver design should be settled first.

What I was proposing was some sort of group option associated with the device triple. The idea was to avoid the proliferation of device-specific options and reuse what we already have, just organized in groups so that each group could be forwarded to the right toolchain. The goal was to make things like this possible:
```
clang -mcpu=pwr8 -target-offload=nvptx64-nvidia-cuda -fopenmp -mcpu=sm_35 -target-offload=nvptx64-nvidia-cuda -fcuda -mcpu=sm_32 a.c 
```
... where `-mcpu` is used to specify the CPU/GPU for the different toolchains and programming models. This would also be useful to specify include and library paths that only make sense for the device.

Do you have any opinion about that?


http://reviews.llvm.org/D9888




