[PATCH] D127901: [LinkerWrapper] Add PTX output to CUDA fatbinary in LTO-mode

Wed Jun 22 16:46:29 PDT 2022

jhuber6 added a comment.

In D127901#3603467 <https://reviews.llvm.org/D127901#3603467>, @tra wrote:

> I'm not sure I follow. WDYM by "go inside the binary itself"? I assume you mean the per-GPU offload binaries inside the per-TU .o, so that it could be used when that GPU object gets linked into the GPU executable?
>
> What if different TUs that we're linking were compiled using different/contradictory options?
>
> The problem is that conceptually the "--cuda-include-ptx" option ultimately affects the final GPU executable. If we're in RDC mode, then PTX is probably useless for JIT-ing purposes, as you can't link PTX and create the final executable. Well, I guess it might sort of be possible by concatenating the .s files and adding a bunch of forward declarations for the functions, and merging debug info, and removing duplicate weak functions, ... Well, basically by writing a linker for a new "PTX" architecture. Doable, but so not worth it, IMO.
>
> TUs are compiled to IR, then PTX generation shifts to the final link phase. I think we may need to rely on the user to supply PTX controls there explicitly. Or, at the very least, check that `cuda-include-ptx` propagated from TUs is used consistently in all TUs.

I just mean that right now `--[no-]cuda-include-ptx` is handled at the compilation phase, whereas this is in LTO, so we'd need to make sure those arguments are available at link time. It's true that we could just require the user to pass it to the linker instead, but conceptually PTX generation happens in the "compiler" and not the linker.
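
To make the consistency check suggested above concrete, here is a rough sketch of what it could look like. It is not LinkerWrapper code: `TUOptions`, `resolveIncludePTX`, and the idea that each TU records its `--[no-]cuda-include-ptx` setting are all assumptions for illustration. It needs C++17 for `std::optional`.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical record, one per translation unit taking part in the link.
struct TUOptions {
  std::string Name; // e.g. "a.o" (made-up name)
  bool IncludePTX;  // --[no-]cuda-include-ptx as seen when compiling this TU
};

// Returns the shared setting if every TU agrees, or std::nullopt if the TUs
// contradict each other. With no TUs at all, falls back to DefaultValue.
std::optional<bool> resolveIncludePTX(const std::vector<TUOptions> &TUs,
                                      bool DefaultValue) {
  if (TUs.empty())
    return DefaultValue;
  const bool First = TUs.front().IncludePTX;
  for (const TUOptions &TU : TUs)
    if (TU.IncludePTX != First)
      return std::nullopt; // contradictory options across TUs
  return First;
}

// Small usage example.
void includePTXExample() {
  // All TUs agree, so the shared value is used.
  assert(resolveIncludePTX({{"a.o", true}, {"b.o", true}}, true) == true);
  // TUs disagree, so the caller has to diagnose or ask the user.
  assert(!resolveIncludePTX({{"a.o", true}, {"b.o", false}}, true).has_value());
}
```

Returning "no answer" on a mismatch leaves the policy (error, warning, or an explicit linker flag that overrides the per-TU values) to the caller.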

>> We'll probably just use the same default as that flag (which is on I think).
>>
>>> This brings another question -- which GPU variant will we generate PTX for? One? All (if more than one is specified)? The ones specified by `--[no-]cuda-include-ptx=`?
>>
>> Right now, it'll be the one that's attached to the LTO job. So if the user specified `sm_70` they'll get PTX for `sm_70`.
>
> I mean, when the user specifies more than one GPU variant to target. 
> E.g. both `sm_70` and `sm_50`. 
> PTX for the former would probably provide better performance if we run on a newer GPU (e.g. sm_80). 
> On the other hand, it will likely fail if we were to attempt running from PTX on sm_60. 
> Both would probably fail if we were to run on sm_35. Including all PTX variants is wasteful (Tensorflow-using applications are already pushing the limits on small memory model and sometimes fail to link due to the executable being too large).
>
> The point is that there's no "one true choice" for the PTX architecture (as there's no safe/sensible choice for the offload target). Only the end user would know their intent. We do need explicit controls and a documented policy on what we produce by default.

This is a good point that I hadn't thought of. Right now this is basically just a by-product of the LTO pass: we run LTO for the target, and since we get PTX output we might as well include it. This may be what we do in Clang as well; I think we just include the PTX output alongside the Cubin for each offload job. Even if we went to LLVM-IR we'd still be restricted by some features, I think. As it stands, this patch just makes `clang++ cuda.cu --offload-new-driver -fgpu-rdc --offload-arch=sm_60 -foffload-lto` give a fatbinary with sm_60 PTX / Cubins. I think that is controlled by the user, as it's only going to generate PTX for the architecture they specified via `--offload-arch` (or the default).
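
For reference, the sm_70 / sm_50 trade-off quoted above can be written down as a small sketch. It assumes the simple rule that PTX built for one compute capability can only be JIT-compiled on a device of the same or a newer capability, and that the newest usable PTX is preferred. `pickPTX` is a hypothetical helper, not an existing API, and it ignores architecture-specific variants. It needs C++17.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// EmbeddedPTX holds the compute capabilities that PTX was embedded for
// (50 for sm_50, 70 for sm_70, ...). DeviceSM is the capability of the GPU
// found at run time. Returns the newest embedded PTX that is not newer than
// the device, or std::nullopt if none of the embedded PTX can be used.
std::optional<int> pickPTX(const std::vector<int> &EmbeddedPTX, int DeviceSM) {
  std::optional<int> Best;
  for (int SM : EmbeddedPTX)
    if (SM <= DeviceSM && (!Best || SM > *Best))
      Best = SM;
  return Best;
}

// The cases from the discussion, with PTX embedded for sm_50 and sm_70.
void pickPTXExample() {
  assert(pickPTX({50, 70}, 80) == 70);            // sm_80 can use the sm_70 PTX
  assert(pickPTX({50, 70}, 60) == 50);            // sm_60 has to fall back to sm_50
  assert(!pickPTX({50, 70}, 35).has_value());     // sm_35 can use neither
}
```

With only the single `--offload-arch=sm_60` from the command above, the same rule gives usable PTX on sm_60 and newer devices and nothing on older ones.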

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D127901