[cfe-dev] Cuda Dynamic Parallelism
Andre Reichenbach via cfe-dev
cfe-dev at lists.llvm.org
Thu Oct 19 11:53:28 PDT 2017
Hi everyone,
I have been planning to implement CUDA dynamic parallelism in clang for
a while now. I found some time to dig into clang's code recently in
order to grasp how clang handles calls to kernels (thanks to Justin
Lebar and Artem Belevich for pointing me to the most important
locations!). I was wondering if there is an expert on the CUDA code on
this list who would be willing to help me get a better understanding and
clear up any misconceptions I might have.
Allow me to summarize how it works right now. It appears that host-side
kernel calls work something like this:
During parsing of a function call, when the <<< >>> syntax is
encountered, a node is built that calls cudaConfigureCall with the
execution configuration parameters. Then a CUDAKernelCallExpr is built
from the remaining function arguments, and the CallExpr for
cudaConfigureCall is stored in that node.
During CodeGen, the CUDAKernelCallExpr is translated into something like
this:

    if (cudaConfigureCall(...)) {
        <call to the stub for the specific kernel referenced by the call expr>
    }
The stub for the kernel looks something like this:

    __host__ <something> kernel(arg0, arg1, ..., argn) {
        if (cudaSetupArgument(<stuff for arg0>)) {
            ...
            if (cudaSetupArgument(<stuff for argn>)) {
                cudaLaunch(<kernel name>);
            }
        }
    }
Did I miss anything important?
On a side note, it would also seem that each kernel call is parsed twice
(once with CUDAIsDevice set to false and once with it set to true).
For a device-side kernel call, I would instead have to construct
something like this (see e.g. the CUDA Dynamic Parallelism Programming
Guide [1]), with the first four parameters being the execution
configuration:
    __device__ <something> kernel(gridDim, blockDim, sharedMem, stream,
                                  arg0, arg1, ..., argn) {
        <compute overall size and alignment for storing the args in a buffer>
        void *buf = cudaGetParameterBuffer(<alignment>, <size>);
        <copy the args into the buffer at their proper offsets>
        cudaLaunchDevice(<kernel name>, buf, gridDim, blockDim, sharedMem,
                         stream);
    }
It would appear that I cannot set up the execution configuration before
the call to the stub, so I need to pass it into the stub instead. Thus I
would have to store the expressions that evaluate the configuration in
the CUDAKernelCallExpr instead of in the call to cudaConfigureCall; the
latter would have to move into the host stub. The __device__ attribute
would also prevent us from calling that overload of the stub from
__host__ __device__ functions (the host version would be changed to also
take the execution configuration as parameters). I do not know how to
mark the stubs as __host__ or __device__, though.
Would that be a reasonable approach? Would the current implementation
then be able to correctly infer which stub to call based on the __host__
and __device__ attributes? Is it possible to have the device-side stub
inlined? Function calls tend to cost a lot of registers in my
experience, so actually calling a function like that might be quite
expensive. Also, I would have to check that we are compiling for at
least sm_35 before emitting the device stub at all.
I would appreciate any feedback!
Regards,
Andre Reichenbach
[1]
https://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf