[cfe-dev] Cuda Dynamic Parallelism

Fri Oct 20 15:32:09 PDT 2017

> On a side note, it would also seem that each kernel call is parsed twice (once
with CUDAIsDevice set to true and false, respectively).

Yes, in two separate invocations of clang -cc1 -- one for the host, one for
the device.

> The "__device__" would also prevent us from calling that overload of the
stub from __host__ __device__ functions (the host version would be changed
to also take the exec config in as params).

Specifically, it's OK to call __device__ functions from __host__ __device__
functions, but only if the __host__ __device__ function is never codegen'ed
for the host.  Infrastructure for doing this already exists, so you
shouldn't have to do anything.

> I do not know how to mark the stubs as __host__ or __device__ though.

It's __attribute__((device)), aka CUDADeviceAttr.  See e.g. SemaDecl.cpp.

> Would the current implementation then be able to correctly infer which
stub to call based on __host__ and __device__ attributes?

Yes.  These attributes are part of the function signature, and clang
chooses which function to call based on (somewhat convoluted) overloading
rules.  It should DTRT.

> Is it possible to have the device side stub be inlined?

Yes, LLVM will do that for you automatically if it's profitable.  If it
turns out it's not inlining the function when it should, I'd recommend we
look into fixing the heuristics rather than slapping an always_inline on
the function.

> Also, I would have to check whether we are compiling with at least sm_35
before emitting the device stub at all.

Right.

> Would that be a reasonable approach?

SGTM!

On Thu, Oct 19, 2017 at 11:53 AM Andre Reichenbach <A.Reiba at gmx.net> wrote:

> Hi everyone,
>
> I have been planning to implement CUDA dynamic parallelism in clang for
> a while now. I found some time to dig into clang's code recently in
> order to grasp how clang handles calls to kernels (thanks to Justin
> Lebar and Artem Belevich for pointing me to the most important
> locations!). I was wondering if there would be any expert on the CUDA
> code on the dev list that was willing to help me get a better
> understanding and clear out any misconceptions that I might have.
>
>
> Allow me to summarize how it works right now. It appears that the host
> side kernel calls work something like this:
>
> During parsing of function calls, when the <<< >>> syntax is
> encountered, a node is build that calls cudaConfigureCall using the
> execution configuration parameters. Then, a CUDAKernelCallExpr is build
> using the remaining function parameters, and the CallExpr for
> cudaConfigureCall is passed into that node.
>
> During CodeGen, the CUDAKernelCallExpr is translated into something like
> this:
>
>    if ( cudaConfigureCall( .... ) )
>    {
>        <call to sub for the specific kernel referenced by the call expr>
>    }
>
> The stub for the kernel looks something like this:
>
>    __host__ <something> kernel( arg1, arg2, ... argn ) {
>
>    if( cudaSetupArgument( <stuff for arg0> ) ) {
>       :
>    if( cudaSetupArgument( <stuff for argn> ) ) {
>        cudaLaunch( <kernelName> );
>    } ... }
>
>    }
>
> Did I miss anything important?
>
> On a side note, it would also seem that each kernel call is parsed twice
> (once with CUDAIsDevice set to true and false, respectively).
>
>
>
> For a device side kernel call, I would have to construct something like
> this instead (see e.g. CUDA dynamic parallel programming guide [1]), the
> first four parameters being the execution configuration:
>
>    __device__ <something> kernel( gridDim, blockDim, sharedMem, stream,
> arg0, arg1, ..., argn )
>    {
>      <compute overall size and alignment for storing args to a buffer>
>
>      void * buf = cudaGetParameterBuffer( <alignment>, <size> );
>
>      <copy args into buffer at proper offset and alignment>
>
>      cudaLaunchDevice( <kernel name>, buf, gridDim, blockDim, sharedMem,
> stream );
>    }
>
>
> It would appear that I cannot setup the execution configuration before
> the call to the stub, so I need to pass it into the stub instead. Thus,
> I would have to store the expressions for evaluating these in the
> CUDAKernelCallExpr instead of the call to cudaConfigureCall. The latter
> has to be moved to the host stub. The "__device__" would also prevent us
> from calling that overload of the stub from __host__ __device__
> functions (the host version would be changed to also take the exec
> config in as params). I do not know how to mark the stubs as __host__ or
> __device__ though.
>
> Would that be a reasonable approach? Would the current implementation
> then be able to correctly infer which stub to call based on __host__ and
> __device__ attributes? Is it possible to have the device side stub be
> inlined? Function calls tend to cost a lot of registers in my
> experience, so actually calling a function like that might be quite
> expensive. Also, I would have to check whether we are compiling with at
> least sm_35 before emitting the device stub at all.
>
> I would appreciate any feedback!
>
> Regards,
>
>   Andre Reichenbach
>
>
> [1]
>
> https://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20171020/225f1f5c/attachment.html>