[PATCH] D127901: [LinkerWrapper] Add PTX output to CUDA fatbinary in LTO-mode

Wed Jun 22 16:39:03 PDT 2022

tra added a comment.

In D127901#3603118 <https://reviews.llvm.org/D127901#3603118>, @jhuber6 wrote:

> In D127901#3603006 <https://reviews.llvm.org/D127901#3603006>, @tra wrote:
>
>> Then we do need a knob controlling whether we do want to embed PTX or not. The default should be "off" IMO.
>> We currently have `--[no-]cuda-include-ptx=` we may reuse for that purpose.
>
> We could definitely re-use that. It's another option that probably needs to go inside the binary itself since normally those options aren't passed to the linker.

I'm not sure I follow. WDYM by "go inside the binary itself"? I assume you mean the per-GPU offload binaries inside the per-TU .o, so that the option could be used when that GPU object gets linked into the GPU executable?

What if different TUs that we're linking were compiled using different/contradictory options?

The problem is that conceptually the "--cuda-include-ptx" option ultimately affects the final GPU executable. If we're in RDC mode, then per-TU PTX is probably useless for JIT-ing purposes, as you can't link PTX and create the final executable. Well, I guess it might sort of be possible by concatenating the .s files, adding a bunch of forward declarations for the functions, merging debug info, removing duplicate weak functions, ... Well, basically by writing a linker for a new "PTX" architecture. Doable, but so not worth it, IMO.

If TUs are compiled to IR, then PTX generation shifts to the final link phase. I think we may need to rely on the user to supply PTX controls there explicitly. Or, at the very least, check that the `cuda-include-ptx` setting propagated from the TUs is consistent across all of them.
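
A minimal sketch of that consistency check, in Python for brevity. The function name `resolve_include_ptx`, the per-TU mapping, and the `linker_override` parameter are all hypothetical; nothing here reflects how the linker wrapper actually stores or reads the setting, and the "off" fallback is only my preferred default, not a decided one.

```python
from typing import Mapping, Optional


def resolve_include_ptx(per_tu: Mapping[str, bool],
                        linker_override: Optional[bool] = None) -> bool:
    """Decide whether to embed PTX at the final link step.

    per_tu maps a TU name to the include-PTX setting it was compiled
    with (hypothetical metadata). linker_override models an explicit
    setting given at link time; it wins over anything from the TUs.
    """
    if linker_override is not None:
        return linker_override

    values = set(per_tu.values())
    if len(values) > 1:
        # Contradictory per-TU settings: refuse to guess.
        on = sorted(tu for tu, v in per_tu.items() if v)
        off = sorted(tu for tu, v in per_tu.items() if not v)
        raise ValueError(
            f"inconsistent include-PTX settings: on in {on}, off in {off}")

    # Assumed default when no TU carries a setting: do not embed PTX.
    return values.pop() if values else False
```

With this shape, an explicit link-time setting silences the conflict, while mixed per-TU settings without one become a hard error instead of a silent pick.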

> We'll probably just use the same default as that flag (which is on I think).
>
>> This brings another question -- which GPU variant will we generate PTX for? One? All (if more than one is specified)? The ones specified by `--[no-]cuda-include-ptx=` ?
>
> Right now, it'll be the one that's attached to the LTO job. So if the user specified `sm_70` they'll get PTX for `sm_70`.

I mean, when the user specifies more than one GPU variant to target, e.g. both `sm_70` and `sm_50`.
PTX for the former would probably provide better performance if we run on a newer GPU (e.g. sm_80).
On the other hand, the `sm_70` PTX will likely fail if we were to attempt running from it on sm_60.
Both would probably fail if we were to run on sm_35. Including all PTX variants is wasteful (TensorFlow-using applications are already pushing the limits of the small memory model and sometimes fail to link due to the executable being too large).
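
The compatibility argument above can be sketched as follows, assuming the usual rule that PTX generated for compute capability X can be JIT-compiled only for GPUs with capability X or higher. The helper names are hypothetical, capabilities are written as plain integers (70 for `sm_70`), and architecture-specific targets are ignored.

```python
from typing import Iterable, Optional


def ptx_can_jit(ptx_cc: int, gpu_cc: int) -> bool:
    """PTX for capability ptx_cc is usable on a GPU of capability gpu_cc
    only if the GPU is at least as new (assumed forward-compat rule)."""
    return gpu_cc >= ptx_cc


def best_ptx_for(gpu_cc: int, embedded: Iterable[int]) -> Optional[int]:
    """Pick the newest embedded PTX the GPU can still JIT, or None."""
    usable = [cc for cc in embedded if ptx_can_jit(cc, gpu_cc)]
    return max(usable, default=None)
```

With PTX for both 50 and 70 embedded, `best_ptx_for(80, [50, 70])` picks 70, `best_ptx_for(60, [50, 70])` falls back to 50, and `best_ptx_for(35, [50, 70])` finds nothing. With only the 70 variant embedded, the sm_60 case also finds nothing, which is the trade-off in choosing a single PTX target.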

The point is that there's no "one true choice" for the PTX architecture (as there's no safe/sensible choice for the offload target). Only the end user would know their intent. We do need explicit controls and a documented policy on what we produce by default.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D127901/new/

https://reviews.llvm.org/D127901