[PATCH] D44435: Add the module name to __cuda_module_ctor and __cuda_module_dtor for unique function names

Fri Mar 16 04:39:03 PDT 2018

SimeonEhrig added inline comments.

================
Comment at: unittests/CodeGen/IncrementalProcessingTest.cpp:176-178
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for 
+// every statement if a fatbinary file exists.
----------------
tra wrote:
> SimeonEhrig wrote:
> > tra wrote:
> > > SimeonEhrig wrote:
> > > > tra wrote:
> > > > > I don't understand the comment. What is 'CUDA incremental processing' and what exactly is meant by 'statement' here? I'd appreciate if you could give me more details. My understanding is that ctor/dtor are generated once per TU. I suspect "incremental processing" may change that, but I have no idea what exactly does it do.
> > > > A CUDA ctor/dtor will generates for every llvm::module. The TU can also composed of many modules. In our interpreter, we add new code to our AST with new modules at runtime. 
> > > > The ctor/dtor generation is depend on the fatbinary code. The CodeGen checks, if a path to a fatbinary file is set. If it is, it generates an ctor with at least a __cudaRegisterFatBinary() function call. So, the generation is independent of the source code in the module and we can use every statement. A statement can be an expression, a declaration, a definition and so one.   
> > > I still don't understand how it's going to work. Do you have some sort of design document outlining how the interpreter is going to work with CUDA?
> > > 
> > > The purpose of the ctor/dtor is to stitch together host-side kernel launch with the GPU-side kernel binary which resides in the GPU binary created by device-side compilation. 
> > > 
> > > So, the question #1 -- if you pass GPU-side binary to the compiler, where did you get it? Normally it's the result of device-side compilation of the same TU. In your case it's not quite clear what exactly would that be, if you feed the source to the compiler incrementally. I.e. do you somehow recompile everything we've seen on device side so far for each new chunk of host-side source you feed to the compiler? 
> > > 
> > > Next question is -- assuming that device side does have correct GPU-side binary, when do you call those ctors/dtors? JIT model does not quite fit the assumptions that drive regular CUDA compilation.
> > > 
> > > Let's consider this:
> > > ```
> > > __global__ void foo();
> > > __global__ void bar();
> > > 
> > > // If that's all we've  fed to compiler so far, we have no GPU code yet, so there 
> > > // should be no fatbin file. If we do have it, what's in it?
> > > 
> > > void launch() {
> > >   foo<<<1,1>>>();
> > >   bar<<<1,1>>>();
> > > }
> > > // If you've generated ctors/dtors at this point they would be 
> > > // useless as no GPU code exists in the preceding code.
> > > 
> > > __global__ void foo() {}
> > > // Now we'd have some GPU code, but how can we need to retrofit it into 
> > > // all the ctors/dtors we've generated before. 
> > > __global__ void bar() {}
> > > // Does bar end up in its own fatbinary? Or is it combined into a new 
> > > // fatbin which contains both boo and bar?
> > > // If it's a new fatbin, you somehow need to update existing ctors/dtors, 
> > > // unless you want to leak CUDA resources fast.
> > > // If it's a separate fatbin, then you will need to at the very least change the way 
> > > // ctors/dtors are generated by the 'launch' function, because now they need to 
> > > // tie each kernel launch to a different fatbin.
> > > 
> > > ```
> > > 
> > > It looks to me that if you want to JIT CUDA code you will need to take over GPU-side kernel management.
> > > ctors/dtors do that for full-TU compilation, but they rely on device-side code being compiled and available during host-side compilation. For JIT, the interpreter should be in charge of registering new kernels with the CUDA runtime and unregistering/unloading them when a kernel goes away. This makes ctors/dtors completely irrelevant.
> > At the moment, there is no documentation, because we still develop the feature. I try to describe how it works.
> > 
> > The device side compilation works with a second compiler (a normal clang), which we start via syscall. In the interpreter, we check if the input line is a kernel definition or a kernel launch. Then we write the source code to a file and compile it with the clang to a PCH-file.  Then the PCH-file will be compiled to PTX and then to a fatbin. If we add a new kernel, we will send the source code with the existing PCH-file to clang compiler. So we easy extend the AST and generate a PTX-file with all defined kernels. 
> > 
> > An implementation of this feature can you see at my prototype: <https://github.com/SimeonEhrig/CUDA-Runtime-Interpreter>
> > 
> > Running the ctor/dtor isn't hard. I search after the JITSymbol and generate an function pointer. Than I can simply run it. This feature can you also see in my prototype. So, we can run the ctor, if new fatbin code is generated and the dtor before, if code was already registered. The CUDA runtime also provide the possibility to run the (un)register functions many times.
> > 
> >   __global__ void foo();
> >   __global__ void bar();
> > 
> >   //At this point, there is no fatbin file and it will no generated. 
> > 
> >   void launch() {
> >     foo<<<1,1>>>();
> >     bar<<<1,1>>>();
> >   }
> > 
> >   // The definition of launch() is not possible at the direct input mode (type in line by line) in cling. 
> >   // At this point, we need a definition of foo() and bar(). But there is a exception. 
> >   // We have a function to read in a piece of code from file. This piece of code will translate in a single module. 
> >   
> > 
> >   __global__ void foo() {}
> >   __global__ void bar() {}
> > 
> >   // In our case, we will compile this 8 lines of code in a single module in cling and send it  to the CUDA device JIT, too. 
> > 
> >   // We have on file fatbinary file, which will extend with new kernels. The file have to unregistered and registered every time, if it will changed.
> >   // When and which ctor/dtor have to run is managed by the interpreter. 
> > 
> > I don't know, if I understand it right. Do you mean, we should implement the content of the ctor/dtor direct in our cling source code? For example, we call direct the `__cudaRegisterFatBinary()` function in the source code of cling after the generating of a new fatbin-file as opposed of calling `__cuda_module_ctor`, which we generated with JIT-backend of our interpreter.
> Do I understand it correctly that every time you see new kernel definition, you'll recompile everything you've seen until this point? In the example above you'll do compilation twice first time after foo() and then again after bar(). Compilation after bar() will have GPU code for both foo() and bar(), at which point you'll call dtor, which will unregister foo() from the first compilation and will then call ctor from the new compilation, which will register both foo() and bar() from the new fatbin. If that's the case, it may work OK with a few caveats.
> 
> * it's not clear whether that would leak resources. CUDA runtime API is undocumented, so I can't tell whether unregistering old kernels will release everything (e.g. I have no idea whether it unloads old kernel). Similarly with ctor, I don't know whether it allocates some resources every time it's called. All of that is not an issue when ctor/dtor is called once during app runtime, but I'd be very cautious about using that repeatedly.
> * recompiling/loading/unloading everything every time you parse new kernel is ~quadratically expensive. You probably not going to get too many kernels during any given session, so it may be OK for an interpreter. Still, it could be avoided. All you need (in theory) is to compile and register one new kernel. That's why I suggested for it to be done by the interpreter. 
> 
> Just my $.02, mostly beyond the scope of this review.
> BTW, XLA/GPU in TensorFlow, does have an implementation of jit-for-GPU. While it's not directly comparable with your project, it may have some useful ideas on dealing with GPU-side code.
> Most of the interesting JIT bits are in [[ https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/cuda/cuda_driver.cc | cuda_driver.cc ]] and [[ https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc | gpu_compiler.cc ]]
Thats the way, we do it with one exception. We do not recompile everything. We hold the AST of the device code in PCH-files. That should significant speed up the compilation of a new PTX file. We have to look, if this way is fast enough for our use cases. I think, we can also implement a lazy compilation for the device code, without great effort. But at the moment, we use the way, which you describe.

* That's a good point. It's really a problem, that I've to solve it. At the moment, I've two Ideas. The first is the simple version. We run example programs on cling and look if they worked right and check the resources. The second idea is to reproduce the behavior of cling with static compiled programs, which are generated with the nvcc. For example:

  // func.hpp
  #ifndef FUNC_H
  #define FUNC_H

  void launch_func();

  #endif

  // main.cu
  #include <iostream>
  #include "func.hpp"

  __global__ void kernel_main(){}

  int main(int argc, char const *argv[])
  {
        kernel<<<1,1>>>();
        std::cout << "main: " << cudaGetLastError() << std::endl;
        kernel_main();
        return 0;
  }

  #include <iostream>
  #include "func.hpp"

  __global__ void kernel_func(){}

  void launch_func(){
         kernel_func<<<1,1>>>();
         std::cout << "func: " << cudaGetLastError() << std::endl;
  }

If compile each .cu file to a own object file and then link it together, you can see, that `__cudaRegisterFatBinary`  will runs twice at gdb. Also, nvcc use unique ctor names (use a hash function).
But, I have to discuss it with my colleagues, if it is a good solution. 

* Can you explain, why just one new kernel is enough?

  // In this case, I understand it. We can put foo() and bar() together in 
  // one compiling process.
  __global__ void foo();
  __global__ void bar();
  foo<<<1,1>>>();
  bar<<1,1>>>();

  // But in case, I see no solution, to use just one new kernel.
  // The process is interactive, so we have no knowledge of the follow lines.
  __global__ void foo();
  foo<<<1,1>>>();
  __global__ void bar();
  bar<<1,1>>>();

I will have a look at the TensorFlow but this needs some time.

Repository:
  rC Clang

https://reviews.llvm.org/D44435