[PATCH] D127901: [LinkerWrapper] Add PTX output to CUDA fatbinary in LTO-mode

Joseph Huber via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Jun 16 14:54:27 PDT 2022


jhuber6 added a comment.

In D127901#3590402 <https://reviews.llvm.org/D127901#3590402>, @tra wrote:

> Playing devil's advocate, I've got to ask -- do we even want to support JIT?
>
> JIT brings more trouble than benefits.
>
> - substantial start-up time on nontrivial apps. Last time I tried launching a tensorflow app and needed to JIT its kernels, it took about half an hour until JIT was done.
> - substantial increase in the size of the executable. Statically linked tensorflow apps are already pushing the limits of the executables that use small memory model (-mcmodel=small is the default for clang and gcc, AFAICT).
> - very easy to make a mistake, compile for a wrong GPU and not notice it, because JIT will try to keep it running using PTX.
> - makes executables and tests non-hermetic -- the code that will run on the GPU (and thus the behavior) will depend on the particular driver version the app uses at runtime.
>
> Benefits: It *may* allow us to run a miscompiled/outdated CUDA app. Whether it's actually a benefit is questionable. To me it looks like a way to paper over a problem.
>
> We (google) have experienced all of the above and ended up disabling PTX JIT'ting altogether.
>
> That said, we do embed PTX by default at the moment, so this patch does not really change the status quo, so I'm not opposed to it, as long as we can disable PTX embedding if we need/want to.

I guess it's one of those situations where I figured that, since we already have the PTX when we do LTO, I may as well add it. I don't know much about how it's used w.r.t. performance, but I saw this as a shortcoming of Clang's RDC-mode support, considering that NVIDIA can JIT RDC-mode compilations. We could definitely add an argument that disables this. I'm assuming Clang already has an argument for that which we could overload to pass something to the linker wrapper. Or we could decide which behaviour we want to be the default.

The problem with LTO, however, is that many "compile-only" flags suddenly become relevant during linking. So let's say someone builds with `clang foo.cu -c -no-embed-ptx -foffload-lto` and then links with `clang foo.o`: the link step won't have the argument. I think regular LTO can embed the command line in the bitcode or something similar. We also have the option to embed the arguments in the binary format I made.

Another problem with the RDC-mode support here is that we don't error gracefully if something is wrong with the image, so the following is really unhelpful:

  clang app.cu --offload-arch=sm_<not correct> -fgpu-rdc --offload-new-driver
  ./a.out // Gives no output, kernel simply never executes.
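
For what it's worth, here is a minimal sketch of how an application could surface this kind of failure itself for now. It assumes the failed launch is reported through the CUDA runtime's error state, which has not been verified against the new driver path. The kernel `kernel` and the helper `check` are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only for illustration.
__global__ void kernel(int *out) { *out = 42; }

// Print a message for a failed CUDA runtime call; returns true on success.
static bool check(cudaError_t err, const char *what) {
  if (err == cudaSuccess)
    return true;
  std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
  return false;
}

int main() {
  int *dev = nullptr;
  if (!check(cudaMalloc(&dev, sizeof(int)), "cudaMalloc"))
    return 1;

  kernel<<<1, 1>>>(dev);
  // The launch syntax returns no status, so the error state has to be
  // queried explicitly. Assumption: a missing or incompatible device image
  // shows up here or at the synchronization below.
  bool ok = check(cudaGetLastError(), "kernel launch") &&
            check(cudaDeviceSynchronize(), "kernel execution");

  int host = 0;
  ok = ok && check(cudaMemcpy(&host, dev, sizeof(int), cudaMemcpyDeviceToHost),
                   "cudaMemcpy");
  if (ok)
    std::printf("result: %d\n", host);

  cudaFree(dev);
  return ok ? 0 : 1;
}
```

Without checks like these the program exits normally even when the kernel never ran, which matches the silent behaviour above.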


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D127901/new/

https://reviews.llvm.org/D127901


