[cfe-dev] Cuda Dynamic Parallelism

Andre Reichenbach via cfe-dev cfe-dev at lists.llvm.org
Thu Oct 19 11:53:28 PDT 2017


Hi everyone,

I have been planning to implement CUDA dynamic parallelism in clang for 
a while now. I recently found some time to dig into clang's code in 
order to understand how clang handles calls to kernels (thanks to Justin 
Lebar and Artem Belevich for pointing me to the most important 
locations!). I was wondering whether any expert on the CUDA code on the 
dev list would be willing to help me get a better understanding and 
clear up any misconceptions that I might have.


Allow me to summarize how it works right now. It appears that host-side 
kernel calls work something like this:

During parsing of function calls, when the <<< >>> syntax is 
encountered, a node is built that calls cudaConfigureCall using the 
execution configuration parameters. Then, a CUDAKernelCallExpr is built 
using the remaining function parameters, and the CallExpr for 
cudaConfigureCall is passed into that node.
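
For concreteness, the kind of launch I am talking about is the 
following (hypothetical vecAdd kernel, purely for illustration):

   #include <cuda_runtime.h>

   __global__ void vecAdd( float * out, const float * in, int n )
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if ( i < n )
           out[i] += in[i];
   }

   void launchFromHost( float * out, const float * in, int n,
                        cudaStream_t stream )
   {
       dim3 grid( (n + 255) / 256 ), block( 256 );
       // The <<< >>> part turns into the CallExpr for cudaConfigureCall,
       // the rest into the CUDAKernelCallExpr that wraps it.
       vecAdd<<< grid, block, 0, stream >>>( out, in, n );
   }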

During CodeGen, the CUDAKernelCallExpr is translated into something like 
this:

   if ( cudaConfigureCall( .... ) == cudaSuccess )
   {
       <call to the stub for the specific kernel referenced by the call expr>
   }
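
For the hypothetical vecAdd launch above, my understanding is that the 
emitted host code is roughly equivalent to the following (hand-written 
approximation, not actual clang output; the stub name is made up):

   // vecAdd<<< grid, block, 0, stream >>>( out, in, n );  becomes:
   if ( cudaConfigureCall( grid, block, 0, stream ) == cudaSuccess )
       __device_stub_vecAdd( out, in, n );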

The stub for the kernel looks something like this:

   __host__ <something> kernel( arg0, arg1, ..., argn ) {

       if ( cudaSetupArgument( <stuff for arg0> ) == cudaSuccess ) {
           ...
           if ( cudaSetupArgument( <stuff for argn> ) == cudaSuccess ) {
               cudaLaunch( <kernelName> );
           }
       }
   }
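
Again for the hypothetical vecAdd kernel, a hand-written approximation 
of that stub could look like the following (offsets and error handling 
simplified; as far as I can tell clang registers the stub itself with 
the runtime, so its own address is what gets passed to cudaLaunch as 
the handle):

   #include <cuda_runtime.h>

   void __device_stub_vecAdd( float * out, const float * in, int n )
   {
       // Each argument is copied into the pending launch configuration
       // at its natural offset; the launch only happens if every setup
       // call succeeds.
       if ( cudaSetupArgument( &out, sizeof( out ), 0 ) == cudaSuccess &&
            cudaSetupArgument( &in, sizeof( in ), 8 ) == cudaSuccess &&
            cudaSetupArgument( &n, sizeof( n ), 16 ) == cudaSuccess )
       {
           cudaLaunch( (const void *)__device_stub_vecAdd );
       }
   }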

Did I miss anything important?

On a side note, it would also seem that each kernel call is parsed twice 
(once with CUDAIsDevice set to true and once with it set to false, i.e. 
once for the device compilation and once for the host compilation).



For a device-side kernel call, I would have to construct something like 
this instead (see e.g. the CUDA Dynamic Parallelism Programming Guide 
[1]), with the first four parameters being the execution configuration:

   __device__ <something> kernel( gridDim, blockDim, sharedMem, stream,
                                  arg0, arg1, ..., argn )
   {
       <compute overall size and alignment for storing args in a buffer>

       void * buf = cudaGetParameterBuffer( <alignment>, <size> );

       <copy args into the buffer at the proper offsets and alignment>

       cudaLaunchDevice( <kernel name>, buf, gridDim, blockDim,
                         sharedMem, stream );
   }
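
Concretely, for the hypothetical vecAdd kernel a hand-written 
device-side stub following that pattern might look like this (requires 
sm_35 or newer and relocatable device code; names, offsets and error 
handling are illustrative only):

   #include <cuda_runtime.h>

   __global__ void vecAdd( float * out, const float * in, int n );

   __device__ void __device_stub_vecAdd( dim3 gridDimension,
                                          dim3 blockDimension,
                                          unsigned int sharedMem,
                                          cudaStream_t stream,
                                          float * out, const float * in,
                                          int n )
   {
       // Two 8-byte pointers followed by a 4-byte int: 8-byte alignment
       // and 20 bytes of payload (rounded up here for simplicity).
       void * buf = cudaGetParameterBuffer( 8, 24 );
       if ( buf == NULL )
           return; // no parameter buffer available, give up

       // Copy each argument to its natural offset within the buffer.
       *(float **)( (char *)buf + 0 ) = out;
       *(const float **)( (char *)buf + 8 ) = in;
       *(int *)( (char *)buf + 16 ) = n;

       cudaLaunchDevice( (void *)vecAdd, buf, gridDimension,
                         blockDimension, sharedMem, stream );
   }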


It would appear that I cannot set up the execution configuration before 
the call to the stub, so I need to pass it into the stub instead. Thus, 
I would have to store the expressions for evaluating the execution 
configuration in the CUDAKernelCallExpr itself, instead of in the call 
to cudaConfigureCall; the latter would have to move into the host stub. 
The "__device__" attribute would also prevent us from calling that 
overload of the stub from __host__ __device__ functions (the host 
version would be changed to also take the execution configuration as 
parameters). I do not know how to mark the stubs as __host__ or 
__device__, though.
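
To make that concrete, the shape I have in mind for the two stub 
overloads would be roughly the following (hypothetical names and 
signatures, not what clang emits today):

   #include <cuda_runtime.h>

   __host__ void __stub_vecAdd( dim3 grid, dim3 block, size_t sharedMem,
                                cudaStream_t stream,
                                float * out, const float * in, int n );

   __device__ void __stub_vecAdd( dim3 grid, dim3 block,
                                  size_t sharedMem, cudaStream_t stream,
                                  float * out, const float * in, int n );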

Would that be a reasonable approach? Would the current implementation 
then be able to correctly infer which stub to call based on the 
__host__ and __device__ attributes? Is it possible to have the 
device-side stub inlined? Function calls tend to cost a lot of 
registers in my experience, so actually calling a function like that 
might be quite expensive. Also, I would have to check whether we are 
compiling for at least sm_35 before emitting the device stub at all.

I would appreciate any feedback!

Regards,

  Andre Reichenbach


[1] 
https://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf



