[cfe-dev] [CUDA] Why "Disallow 'extern __shared__' variables"?

Justin Lebar via cfe-dev cfe-dev at lists.llvm.org
Fri Oct 28 10:15:26 PDT 2016


> Can we add a flag in Clang-Cuda to indicate separate compilation mode?

Yes, I would be happy to take such a patch.

> Could you point me to patches/code that I can look at to understand the implications of separate compilation?

There's a TODO in NVPTXAsmPrinter, but that's an ABI compatibility
issue, which isn't a problem if you're compiling everything with
clang.

The "mark everything as internal" code is in
CodeGenModule::getLLVMLinkageForDeclarator -- that's the big one, off
the top of my head.

> What LLVM optimizations benefit from whole-program compilation mode?

Many interprocedural optimizations will not fire on externally-visible
ODR functions -- basically, anything declared "inline", or any
template, that isn't "static" or in an anonymous namespace.
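
To illustrate (a sketch with hypothetical function names, not code from
clang itself) -- in whole-program mode clang can internalize
definitions like the first two below, which separate compilation would
have to keep externally visible:

```cuda
// Externally-visible ODR function: every TU that uses it emits a
// copy, and the linker deduplicates. Under whole-program CUDA
// compilation clang can give it internal linkage instead, making it
// a candidate for inlining and other interprocedural optimizations.
inline __device__ float scale(float x) { return x * 2.0f; }

// Template instantiations are likewise externally-visible ODR
// definitions by default.
template <typename T>
__device__ T clamp01(T x) {
  return x < T(0) ? T(0) : (x > T(1) ? T(1) : x);
}

// These two are internal regardless of compilation mode:
static __device__ float bias(float x) { return x + 1.0f; }
namespace { __device__ float offset = 0.5f; }
```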

I believe, for CUDA specifically, there's an optimization for const
__restrict pointers that lets us translate reads into __ldg
instructions, but we can't do this when the function is not internal.
I am not sure, though.
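
For reference, the pattern I have in mind looks something like this (a
sketch; the exact conditions under which the compiler proves the loads
read-only are an assumption on my part):

```cuda
// With const __restrict__ pointer parameters, the compiler can assume
// the pointed-to data is read-only and unaliased for the lifetime of
// the kernel, so plain loads of x[i] may be lowered to read-only-cache
// loads (ld.global.nc, i.e. what __ldg() emits). If the kernel is
// externally visible, as under separate compilation, that analysis is
// more constrained.
__global__ void saxpy(float a,
                      const float* __restrict__ x,
                      float* __restrict__ y,
                      int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];  // load of x[i] is eligible for __ldg
}
```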

> What, if anything, breaks if I use it for separate compilation?

See above, and my previous email.
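
For completeness, the "extern __shared__" idiom in question is the
usual dynamic shared-memory pattern (a minimal sketch):

```cuda
// "extern __shared__" declares a handle to the dynamically-allocated
// shared memory whose size is supplied at kernel launch time -- which
// is why an unsized array (or pointer-like) type is the expected form.
extern __shared__ float smem[];

__global__ void reverse(float* data, int n) {
  int i = threadIdx.x;
  if (i < n) {
    smem[i] = data[i];
    __syncthreads();
    data[i] = smem[n - 1 - i];
  }
}

// Launched with the dynamic shared-memory size as the third
// execution-configuration parameter:
//   reverse<<<1, n, n * sizeof(float)>>>(d_data, n);
```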

Regards,

-Justin

On Fri, Oct 28, 2016 at 9:58 AM, Arpith C Jacob <acjacob at us.ibm.com> wrote:
> Hi Justin,
>
> Thanks for your response.
>
> I am using a mix of our OpenMP nvptx toolchain for OpenMP-gpu programs and
> Clang-Cuda for the OpenMP runtime that we've written in Cuda. This may be
> the source of some of your surprises.
>
> I translate the Cuda code to LLVM IR and pull it into the user's GPU program
> (with -mlink-cuda-bitcode, similar to how you pull in libdevice.compute.bc).
> We then use our toolchain to build relocatable objects with ptxas. I'll be
> happy to talk more about our use case and how we can make the improvements
> you suggest.
>
>> Given that "extern __shared__" means "get me a pointer to the
>> dynamically-allocated shared memory for this kernel," using a
>> non-array / non-pointer type would be...odd?
>>
>
> I believe the difference is whether the cuda code is being compiled in
> whole-program or separate compilation modes.  The following section covers
> the case I described for separate compilation mode, which is what I'm doing:
> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-qualifiers
>
> "When compiling in the separate compilation mode (see the nvcc user manual
> for a description of this mode), __device__, __shared__, and __constant__
> variables can be defined as external using the extern keyword. nvlink will
> generate an error when it cannot find a definition for an external variable
> (unless it is a dynamically allocated __shared__ variable)."
>
> Can we add a flag in Clang-Cuda to indicate separate compilation mode?
>
> Could you point me to patches/code that I can look at to understand the
> implications of separate compilation?  What LLVM optimizations benefit from
> whole-program compilation mode?  What, if anything, breaks if I use it for
> separate compilation?
>
> Thanks in advance for your time.
>
> Arpith


