[PATCH] D44435: Add the module name to __cuda_module_ctor and __cuda_module_dtor for unique function names

Thu Mar 15 03:22:26 PDT 2018

SimeonEhrig added inline comments.

================
Comment at: unittests/CodeGen/IncrementalProcessingTest.cpp:176-178
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for 
+// every statement if a fatbinary file exists.
----------------
tra wrote:
> SimeonEhrig wrote:
> > tra wrote:
> > > I don't understand the comment. What is 'CUDA incremental processing' and what exactly is meant by 'statement' here? I'd appreciate if you could give me more details. My understanding is that ctor/dtor are generated once per TU. I suspect "incremental processing" may change that, but I have no idea what exactly does it do.
> > A CUDA ctor/dtor is generated for every llvm::Module. A TU can also be composed of many modules. In our interpreter, we add new code to our AST with new modules at runtime.
> > The ctor/dtor generation depends on the fatbinary code. CodeGen checks whether a path to a fatbinary file is set. If it is, it generates a ctor with at least a __cudaRegisterFatBinary() function call. So the generation is independent of the source code in the module, and we can use any statement. A statement can be an expression, a declaration, a definition, and so on.
> I still don't understand how it's going to work. Do you have some sort of design document outlining how the interpreter is going to work with CUDA?
> 
> The purpose of the ctor/dtor is to stitch together host-side kernel launch with the GPU-side kernel binary which resides in the GPU binary created by device-side compilation. 
> 
> So, the question #1 -- if you pass GPU-side binary to the compiler, where did you get it? Normally it's the result of device-side compilation of the same TU. In your case it's not quite clear what exactly would that be, if you feed the source to the compiler incrementally. I.e. do you somehow recompile everything we've seen on device side so far for each new chunk of host-side source you feed to the compiler? 
> 
> Next question is -- assuming that device side does have correct GPU-side binary, when do you call those ctors/dtors? JIT model does not quite fit the assumptions that drive regular CUDA compilation.
> 
> Let's consider this:
> ```
> __global__ void foo();
> __global__ void bar();
> 
> // If that's all we've  fed to compiler so far, we have no GPU code yet, so there 
> // should be no fatbin file. If we do have it, what's in it?
> 
> void launch() {
>   foo<<<1,1>>>();
>   bar<<<1,1>>>();
> }
> // If you've generated ctors/dtors at this point they would be 
> // useless as no GPU code exists in the preceding code.
> 
> __global__ void foo() {}
> // Now we'd have some GPU code, but how can we retrofit it into 
> // all the ctors/dtors we've generated before?
> __global__ void bar() {}
> // Does bar end up in its own fatbinary? Or is it combined into a new 
> // fatbin which contains both foo and bar?
> // If it's a new fatbin, you somehow need to update existing ctors/dtors, 
> // unless you want to leak CUDA resources fast.
> // If it's a separate fatbin, then you will need to at the very least change the way 
> // ctors/dtors are generated by the 'launch' function, because now they need to 
> // tie each kernel launch to a different fatbin.
> 
> ```
> 
> It looks to me that if you want to JIT CUDA code you will need to take over GPU-side kernel management.
> ctors/dtors do that for full-TU compilation, but they rely on device-side code being compiled and available during host-side compilation. For JIT, the interpreter should be in charge of registering new kernels with the CUDA runtime and unregistering/unloading them when a kernel goes away. This makes ctors/dtors completely irrelevant.
At the moment there is no documentation, because we are still developing the feature. I'll try to describe how it works.

The device-side compilation uses a second compiler (a normal clang), which we start via a syscall. In the interpreter, we check whether the input line is a kernel definition or a kernel launch. If it is, we write the source code to a file and compile it with clang to a PCH file. The PCH file is then compiled to PTX and then to a fatbin. When we add a new kernel, we send its source code together with the existing PCH file to the clang compiler. That way we can easily extend the AST and generate a PTX file containing all kernels defined so far.
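
As a rough illustration of the check described above, here is a minimal C++ sketch. It is not taken from cling or from the prototype: `InputKind`, `classify` and `DeviceSideState` are hypothetical names, the textual matching is a simplification (a real implementation would more likely look at the AST), and the actual clang invocations for PCH, PTX and fatbin are replaced by a counter.

```cpp
#include <string>
#include <vector>

// Hypothetical classification of one line of interpreter input.
enum class InputKind { KernelDefinition, KernelLaunch, HostOnly };

// Naive textual check, assumed for illustration only.
InputKind classify(const std::string &line) {
  if (line.find("<<<") != std::string::npos)
    return InputKind::KernelLaunch;
  if (line.find("__global__") != std::string::npos &&
      line.find('{') != std::string::npos)
    return InputKind::KernelDefinition;
  return InputKind::HostOnly;
}

// Hypothetical bookkeeping for the device side: every kernel definition
// or launch is forwarded to the device compiler, which rebuilds the fatbin.
struct DeviceSideState {
  std::vector<std::string> forwardedSource; // code sent to the second clang
  unsigned fatbinGeneration = 0;            // stands in for PCH -> PTX -> fatbin

  // Returns true if the line was forwarded and the fatbin was rebuilt.
  bool feed(const std::string &line) {
    if (classify(line) == InputKind::HostOnly)
      return false;
    forwardedSource.push_back(line);
    ++fatbinGeneration;
    return true;
  }
};
```

In this sketch a pure declaration such as `__global__ void foo();` counts as host-only input, which matches the point that no fatbin exists until a kernel is actually defined.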

An implementation of this feature can be seen in my prototype: <https://github.com/SimeonEhrig/CUDA-Runtime-Interpreter>

Running the ctor/dtor isn't hard. I look up the JITSymbol and create a function pointer from it. Then I can simply call it. This can also be seen in my prototype. So we can run the ctor whenever new fatbin code has been generated, and run the dtor first if code was already registered. The CUDA runtime also allows the (un)register functions to be called multiple times.

  __global__ void foo();
  __global__ void bar();

  // At this point there is no fatbin file, and none will be generated.

  void launch() {
    foo<<<1,1>>>();
    bar<<<1,1>>>();
  }

  // Defining launch() here is not possible in cling's direct input mode (typing line by line),
  // because at this point we need definitions of foo() and bar(). But there is an exception:
  // we have a function that reads a piece of code from a file. That piece of code is translated as a single module.

  __global__ void foo() {}
  __global__ void bar() {}

  // In our case, we compile these 8 lines of code as a single module in cling and also send them to the CUDA device JIT.

  // We have one fatbinary file, which is extended with new kernels. The file has to be unregistered and re-registered every time it changes.
  // The interpreter manages which ctor/dtor has to run, and when.
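
The unregister/re-register sequence can be sketched roughly as follows. This is only an illustration under assumptions: `SymbolTable` is a hypothetical stand-in for the JIT symbol lookup (it is not the LLVM API), `FatbinLifecycle` is an invented name, and the symbol names used with it are placeholders for the per-module ctor/dtor names.

```cpp
#include <map>
#include <string>

using ModuleFn = void (*)();

// Hypothetical stand-in for "look up the JIT symbol and turn its address
// into a function pointer". A real interpreter would ask its JIT instead.
struct SymbolTable {
  std::map<std::string, ModuleFn> symbols;

  ModuleFn lookup(const std::string &name) const {
    auto it = symbols.find(name);
    return it == symbols.end() ? nullptr : it->second;
  }
};

// Keeps track of the dtor that belongs to the currently registered fatbin.
class FatbinLifecycle {
public:
  explicit FatbinLifecycle(const SymbolTable &table) : table_(table) {}

  // Call after the fatbin file has been regenerated. Runs the previously
  // stored dtor (if any), then the new ctor, and remembers the new dtor.
  // Returns false and changes nothing if the ctor symbol is missing.
  bool onFatbinChanged(const std::string &ctorName,
                       const std::string &dtorName) {
    ModuleFn ctor = table_.lookup(ctorName);
    if (!ctor)
      return false;
    if (registeredDtor_)
      registeredDtor_(); // unregister the old fatbin first
    ctor();              // register the new fatbin
    registeredDtor_ = table_.lookup(dtorName);
    return true;
  }

  bool isRegistered() const { return registeredDtor_ != nullptr; }

private:
  const SymbolTable &table_;
  ModuleFn registeredDtor_ = nullptr;
};
```

With unique per-module names, the interpreter would pass something like the newest module's ctor and dtor names to `onFatbinChanged` each time the fatbin is rebuilt; the exact name format is whatever the patch ends up generating.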

I'm not sure I understood you correctly. Do you mean that we should implement the content of the ctor/dtor directly in the cling source code? For example, we would call `__cudaRegisterFatBinary()` directly from cling's source code after generating a new fatbin file, as opposed to calling `__cuda_module_ctor`, which we generate with the JIT backend of our interpreter?

Repository:
  rC Clang

https://reviews.llvm.org/D44435