<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Oct 28, 2016, at 9:58 AM, Arpith C Jacob via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" class="">cfe-dev@lists.llvm.org</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class=""><p class="">Hi Justin,<br class=""><br class="">Thanks for your response.<br class=""><br class="">I am using a mix of our OpenMP nvptx toolchain for OpenMP GPU programs and Clang-Cuda for the OpenMP runtime that we've written in CUDA. This may be the source of some of your surprises.<br class=""><br class="">I translate the CUDA code to LLVM IR and pull it into the user's GPU program (with -<font face="Menlo-Regular" class="">mlink-cuda-bitcode</font>, similar to how you pull in libdevice.compute.bc). We then use our toolchain to build relocatable objects with ptxas. I'll be happy to talk more about our use case and how we can make the improvements you suggest.<br class=""><br class=""><tt class="">> Given that "extern __shared__" means "get me a pointer to the<br class="">> dynamically-allocated shared memory for this kernel," using a<br class="">> non-array / non-pointer type would be...odd?<br class="">> </tt><br class=""><br class=""><tt class="">I believe the difference is whether the CUDA code is being compiled in whole-program or separate compilation mode. 
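Concretely, a minimal sketch of the two interpretations of extern __shared__ (kernel and variable names are hypothetical):

```cuda
// Whole-program mode: "extern __shared__" with an unsized array type is
// the idiom for dynamically allocated shared memory; the region's size
// comes from the third launch parameter.
extern __shared__ float dynBuf[];

__global__ void useDyn(float *out) {          // hypothetical kernel
    dynBuf[threadIdx.x] = (float)threadIdx.x; // stage in shared memory
    __syncthreads();
    out[threadIdx.x] = dynBuf[threadIdx.x];
}
// Launch with the dynamic shared-memory size as the third parameter:
// useDyn<<<1, 256, 256 * sizeof(float)>>>(out);

// Separate compilation mode additionally allows an extern declaration of
// a statically sized __shared__ variable whose definition lives in
// another translation unit; nvlink resolves it at device link time:
//   TU1: __shared__ float tile[256];
//   TU2: extern __shared__ float tile[256];
```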
The following section covers the case I described for separate compilation mode, which is what I'm doing:</tt><br class=""><a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-qualifiers" class=""><tt class="">https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-qualifiers</tt></a><br class=""><br class=""><tt class="">"When compiling in the separate compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, and __constant__ variables can be defined as external using the extern keyword. nvlink will generate an error when it cannot find a definition for an external variable (unless it is a dynamically allocated __shared__ variable)."</tt><br class=""><br class=""><tt class="">Can we add a flag in Clang-Cuda to indicate separate compilation mode?</tt><br class=""><tt class=""><br class="">Could you point me to patches/code that I can look at to understand the implications of separate compilation? </tt></p></div></div></blockquote><div><br class=""></div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><p class=""><tt class="">What LLVM optimizations benefit from whole-program compilation mode?</tt><br class=""></p></div></div></blockquote><div><br class=""></div><div>The main impact is that the optimizer knows it sees all the uses of every variable and function. This means the ABI/calling convention can be changed, unused arguments can be eliminated, inlining a function with a single use involves less of a tradeoff, global variables can sometimes be turned into local variables, alias analysis is much better for global variables, etc.</div><div><br class=""></div><div>— </div><div>Mehdi</div><div><br class=""></div></div></body></html>