[PATCH] D44435: Add the module name to __cuda_module_ctor and __cuda_module_dtor for unique function names

Wed Mar 14 11:19:47 PDT 2018

tra added inline comments.

================
Comment at: unittests/CodeGen/IncrementalProcessingTest.cpp:176-178
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for 
+// every statement if a fatbinary file exists.
----------------
SimeonEhrig wrote:
> tra wrote:
> > I don't understand the comment. What is 'CUDA incremental processing' and what exactly is meant by 'statement' here? I'd appreciate if you could give me more details. My understanding is that ctor/dtor are generated once per TU. I suspect "incremental processing" may change that, but I have no idea what exactly does it do.
> A CUDA ctor/dtor will generates for every llvm::module. The TU can also composed of many modules. In our interpreter, we add new code to our AST with new modules at runtime. 
> The ctor/dtor generation is depend on the fatbinary code. The CodeGen checks, if a path to a fatbinary file is set. If it is, it generates an ctor with at least a __cudaRegisterFatBinary() function call. So, the generation is independent of the source code in the module and we can use every statement. A statement can be an expression, a declaration, a definition and so one.   
I still don't understand how it's going to work. Do you have some sort of design document outlining how the interpreter is going to work with CUDA?

The purpose of the ctor/dtor is to stitch together host-side kernel launch with the GPU-side kernel binary which resides in the GPU binary created by device-side compilation. 

So, the question #1 -- if you pass GPU-side binary to the compiler, where did you get it? Normally it's the result of device-side compilation of the same TU. In your case it's not quite clear what exactly would that be, if you feed the source to the compiler incrementally. I.e. do you somehow recompile everything we've seen on device side so far for each new chunk of host-side source you feed to the compiler? 

Next question is -- assuming that device side does have correct GPU-side binary, when do you call those ctors/dtors? JIT model does not quite fit the assumptions that drive regular CUDA compilation.

Let's consider this:
```
__global__ void foo();
__global__ void bar();

// If that's all we've  fed to compiler so far, we have no GPU code yet, so there 
// should be no fatbin file. If we do have it, what's in it?

void launch() {
  foo<<<1,1>>>();
  bar<<<1,1>>>();
}
// If you've generated ctors/dtors at this point they would be 
// useless as no GPU code exists in the preceding code.

__global__ void foo() {}
// Now we'd have some GPU code, but how can we need to retrofit it into 
// all the ctors/dtors we've generated before. 
__global__ void bar() {}
// Does bar end up in its own fatbinary? Or is it combined into a new 
// fatbin which contains both boo and bar?
// If it's a new fatbin, you somehow need to update existing ctors/dtors, 
// unless you want to leak CUDA resources fast.
// If it's a separate fatbin, then you will need to at the very least change the way 
// ctors/dtors are generated by the 'launch' function, because now they need to 
// tie each kernel launch to a different fatbin.

```

It looks to me that if you want to JIT CUDA code you will need to take over GPU-side kernel management.
ctors/dtors do that for full-TU compilation, but they rely on device-side code being compiled and available during host-side compilation. For JIT, the interpreter should be in charge of registering new kernels with the CUDA runtime and unregistering/unloading them when a kernel goes away. This makes ctors/dtors completely irrelevant.

Repository:
  rC Clang

https://reviews.llvm.org/D44435