[cfe-dev] [CUDA] Why "Disallow 'extern __shared__' variables"?
Mehdi Amini via cfe-dev
cfe-dev at lists.llvm.org
Fri Oct 28 22:50:02 PDT 2016
> On Oct 28, 2016, at 9:58 AM, Arpith C Jacob via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>
> Hi Justin,
>
> Thanks for your response.
>
> I am using a mix of our OpenMP nvptx toolchain for OpenMP-gpu programs and Clang-Cuda for the OpenMP runtime that we've written in Cuda. This may be the source of some of your surprises.
>
> I translate the Cuda code to LLVM IR and pull it into the user's GPU program (with -mlink-cuda-bitcode, similar to how you pull in libdevice.compute.bc). We then use our toolchain to build relocatable objects with ptxas. I'll be happy to talk more about our use case and how we can make the improvements you suggest.
>
> > Given that "extern __shared__" means "get me a pointer to the
> > dynamically-allocated shared memory for this kernel," using a
> > non-array / non-pointer type would be...odd?
> >
>
> I believe the difference is whether the cuda code is being compiled in whole-program or separate compilation modes. The following section covers the case I described for separate compilation mode, which is what I'm doing:
> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-qualifiers <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-qualifiers>
>
> "When compiling in the separate compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, and __constant__ variables can be defined as external using the extern keyword. nvlink will generate an error when it cannot find a definition for an external variable (unless it is a dynamically allocated __shared__ variable)."
>
> Can we add a flag in Clang-Cuda to indicate separate compilation mode?
>
> Could you point me to patches/code that I can look at to understand the implications of separate compilation?
>
> What LLVM optimizations benefit from whole-program compilation mode?
>
The main impact is that the optimizer in general knows it sees all the uses of every variables and function. It means the ABI/calling convention can be changed, arguments can be eliminated, there is less tradeoff inlining a function when there is a single use, global variable can be turned into local variable sometimes, alias analysis is a lot better for global variables, etc.
—
Mehdi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20161028/4b02355c/attachment.html>
More information about the cfe-dev
mailing list