[PATCH] D100609: [Offload][OpenMP][CUDA] Allow fembed-bitcode for device offload

Fri Apr 16 14:36:10 PDT 2021

tra added inline comments.

================
Comment at: clang/test/Driver/embed-bitcode-nvptx.cu:1
+// RUN: %clang -Xclang -triple -Xclang nvptx64 -S -Xclang -target-feature -Xclang +ptx70 -fembed-bitcode=all --cuda-device-only -nocudalib -nocudainc %s -o - | FileCheck %s
+// REQUIRES: nvptx-registered-target
----------------
jdoerfert wrote:
> tra wrote:
> > jdoerfert wrote:
> > > tra wrote:
> > > > This command line looks extremely odd to me.
> > > > If you are compiling with `--cuda-device-only`, then clang should've already set the right triple and the features.
> > > > 
> > > > Could you tell me more about what is the intent of the compilation and why you use this particular set of options?
> > > > I.e. why not just do `clang -x cuda --offload-arch=sm_70 --cuda-device-only -nocudalib -nocudainc`?
> > > > 
> > > > Could you tell me more about what is the intent of the compilation and why you use this particular set of options?
> > > 
> > > because I never compiled cuda really ;)
> > > 
> > > I'll go with your options.
> > Something still does not add up. 
> > 
> > AFAICT, the real problem is not that we're not adding `-target-cpu`, but rather that `-fembed-bitcode=all` splits `-S` compilation into two phases: source-to-bitcode (this part gets all the right command line options and compiles fine) and `IR -> PTX` compilation, which ends up with only a subset of the options and fails because the intrinsics are not enabled.
> > 
> > I think what we want to do in this case is to prevent splitting GPU-side compilation. Adding a `-target-cpu` to the `IR->PTX` subcompilation may make things work in this case, but it does not really fix the root cause. E.g. we should also pass through the features set by the driver and, possibly, other options to keep both source->IR and IR->PTX compilations in sync.
> > 
> > I think what we want to do in this case is to prevent splitting GPU-side compilation.
> 
> I doubt that is as easy as it sounds. Where do we take the IR from then? (I want the GPU IR embedded after all)
> 
> > E.g. we should also pass through the features set by the driver and ..
> 
> I agree, what if I move the embedding handling to the end, keep the "blacklist" that removes arguments we don't want, and see where that leads us?
Ah, so you do grab the intermediate IR. I assume that the PTX does get used, too. 

Another way to deal with this may be to do two independent compilations -- source-to-IR and source-to-PTX. Each would use the standard compilation flags. The downside is that parsing and optimization time will double, so split compilation combined with filtering args is probably more practical.
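
To make the "split compilation plus filtering args" option concrete, here is a minimal sketch. It assumes a hypothetical helper, `filterArgsForBackend`, and a deny list chosen purely for illustration; neither is the driver's actual code. The idea is that the IR->PTX step inherits every argument of the source->IR step except the ones explicitly denied, so target options such as `-target-cpu` and `-target-feature` stay in sync between the two steps.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration only, not clang's driver code.
// A flag to drop, plus the number of separate value arguments that follow it.
struct DeniedFlag {
  std::string Name;
  unsigned NumValues;
};

// Build the argument list for the second (IR -> PTX) step from the
// arguments of the first (source -> IR) step. Everything is passed
// through unless it appears on the deny list.
std::vector<std::string>
filterArgsForBackend(const std::vector<std::string> &Args,
                     const std::vector<DeniedFlag> &DenyList) {
  std::vector<std::string> Out;
  for (std::size_t I = 0; I < Args.size(); ++I) {
    bool Denied = false;
    for (const DeniedFlag &D : DenyList) {
      if (Args[I] == D.Name) {
        I += D.NumValues; // also skip the values that belong to the flag
        Denied = true;
        break;
      }
    }
    if (!Denied)
      Out.push_back(Args[I]);
  }
  return Out;
}
```

With a deny list that only contains source-level options (say, include paths and the input language), the target triple, CPU, and features set by the driver reach the second step unchanged, which is the property the two steps need.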

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100609/new/

https://reviews.llvm.org/D100609