<div dir="ltr">Sure. <div><br></div><div>Artem, you said you have some more flags you'd like to explain. Would you mind adding them altogether? </div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jan 26, 2016 at 3:49 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Jingyue,<br>

<br>

I don't know what caveats exist (and, thus, don't feel I'm the best person to write the patch), but a colleague of mine just tried the CUDA support for the first time, and he needed to affect how __CUDA_ARCH__ was defined to compile his code. I found --cuda-gpu-arch=sm_35 from the regression tests, which seemed to work, and we should probably add something about this flag to our documentation.<br>

<br>

Thanks again,<br>

Hal<br>

<div class="HOEnZb"><div class="h5"><br>

----- Original Message -----<br>

> From: "Jingyue Wu via llvm-commits" <<a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a>><br>

> To: <a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a><br>

> Sent: Tuesday, November 10, 2015 4:35:47 PM<br>

> Subject: [llvm] r252660 - [doc] Compile CUDA with LLVM<br>

><br>

> Author: jingyue<br>

> Date: Tue Nov 10 16:35:47 2015<br>

> New Revision: 252660<br>

><br>

> URL: <a href="http://llvm.org/viewvc/llvm-project?rev=252660&view=rev" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project?rev=252660&view=rev</a><br>

> Log:<br>

> [doc] Compile CUDA with LLVM<br>

><br>

> Summary:<br>

> This patch adds documentation on compiling CUDA with LLVM as<br>

> requested by many<br>

> engineers and researchers. It includes not only user guides but also<br>

> some<br>

> internals (mostly optimizations) so that early adopters can start<br>

> hacking and<br>

> contributing.<br>

><br>

> Quite a few researchers who contacted us haven't used LLVM before,<br>

> which is<br>

> unsurprising as it hasn't been long since LLVM picked up CUDA. So I<br>

> added a<br>

> short summary to help these folks get started with LLVM.<br>

><br>

> I expect this document to evolve substantially down the road. The<br>

> user guides<br>

> will be much simplified after the Clang integration is done. However,<br>

> the<br>

> internals should continue growing to include for example performance<br>

> debugging<br>

> and key areas to improve.<br>

><br>

> Reviewers: chandlerc, meheff, broune, tra<br>

><br>

> Subscribers: silvas, jingyue, llvm-commits, eliben<br>

><br>

> Differential Revision: <a href="http://reviews.llvm.org/D14370" rel="noreferrer" target="_blank">http://reviews.llvm.org/D14370</a><br>

><br>

> Added:<br>

>     llvm/trunk/docs/CompileCudaWithLLVM.rst<br>

> Modified:<br>

>     llvm/trunk/docs/index.rst<br>

><br>

> Added: llvm/trunk/docs/CompileCudaWithLLVM.rst<br>

> URL:<br>

> <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CompileCudaWithLLVM.rst?rev=252660&view=auto" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CompileCudaWithLLVM.rst?rev=252660&view=auto</a><br>

> ==============================================================================<br>

> --- llvm/trunk/docs/CompileCudaWithLLVM.rst (added)<br>

> +++ llvm/trunk/docs/CompileCudaWithLLVM.rst Tue Nov 10 16:35:47 2015<br>

> @@ -0,0 +1,192 @@<br>

> +===================================<br>

> +Compiling CUDA C/C++ with LLVM<br>

> +===================================<br>

> +<br>

> +.. contents::<br>

> +   :local:<br>

> +<br>

> +Introduction<br>

> +============<br>

> +<br>

> +This document contains the user guides and the internals of<br>

> compiling CUDA<br>

> +C/C++ with LLVM. It is aimed at both users who want to compile CUDA<br>

> with LLVM<br>

> +and developers who want to improve LLVM for GPUs. This document<br>

> assumes a basic<br>

> +familiarity with CUDA. Information about CUDA programming can be<br>

> found in the<br>

> +`CUDA programming guide<br>

> +<<a href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html" rel="noreferrer" target="_blank">http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html</a>>`_.<br>

> +<br>

> +How to Build LLVM with CUDA Support<br>

> +===================================<br>

> +<br>

> +The support for CUDA is still in progress and temporarily relies on<br>

> `this patch<br>

> +<<a href="http://reviews.llvm.org/D14452" rel="noreferrer" target="_blank">http://reviews.llvm.org/D14452</a>>`_. Below is a quick summary of<br>

> downloading and<br>

> +building LLVM with CUDA support. Consult the `Getting Started<br>

> +<<a href="http://llvm.org/docs/GettingStarted.html" rel="noreferrer" target="_blank">http://llvm.org/docs/GettingStarted.html</a>>`_ page for more details<br>

> on setting<br>

> +up LLVM.<br>

> +<br>

> +#. Checkout LLVM<br>

> +<br>

> +   .. code-block:: console<br>

> +<br>

> +     $ cd where-you-want-llvm-to-live<br>

> +     $ svn co <a href="http://llvm.org/svn/llvm-project/llvm/trunk" rel="noreferrer" target="_blank">http://llvm.org/svn/llvm-project/llvm/trunk</a> llvm<br>

> +<br>

> +#. Checkout Clang<br>

> +<br>

> +   .. code-block:: console<br>

> +<br>

> +     $ cd where-you-want-llvm-to-live<br>

> +     $ cd llvm/tools<br>

> +     $ svn co <a href="http://llvm.org/svn/llvm-project/cfe/trunk" rel="noreferrer" target="_blank">http://llvm.org/svn/llvm-project/cfe/trunk</a> clang<br>

> +<br>

> +#. Apply the temporary patch for CUDA support.<br>

> +<br>

> +   If you have installed `Arcanist<br>

> +<br>

>   <<a href="http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-command-line" rel="noreferrer" target="_blank">http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-command-line</a>>`_,<br>

> +   you can apply this patch using<br>

> +<br>

> +   .. code-block:: console<br>

> +<br>

> +     $ cd where-you-want-llvm-to-live<br>

> +     $ cd llvm/tools/clang<br>

> +     $ arc patch D14452<br>

> +<br>

> +   Otherwise, go to `its review page<br>

> <<a href="http://reviews.llvm.org/D14452" rel="noreferrer" target="_blank">http://reviews.llvm.org/D14452</a>>`_,<br>

> +   download the raw diff, and apply it manually using<br>

> +<br>

> +   .. code-block:: console<br>

> +<br>

> +     $ cd where-you-want-llvm-to-live<br>

> +     $ cd llvm/tools/clang<br>

> +     $ patch -p0 < D14452.diff<br>

> +<br>

> +#. Configure and build LLVM and Clang<br>

> +<br>

> +   .. code-block:: console<br>

> +<br>

> +     $ cd where-you-want-llvm-to-live<br>

> +     $ mkdir build<br>

> +     $ cd build<br>

> +     $ cmake [options] ..<br>

> +     $ make<br>

> +<br>

> +How to Compile CUDA C/C++ with LLVM<br>

> +===================================<br>

> +<br>

> +We assume you have installed the CUDA driver and runtime. Consult<br>

> the `NVIDIA<br>

> +CUDA installation Guide<br>

> +<<a href="https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html" rel="noreferrer" target="_blank">https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html</a>>`_<br>

> if<br>

> +you have not.<br>

> +<br>

> +Suppose you want to compile and run the following CUDA program<br>

> (``<a href="http://axpy.cu" rel="noreferrer" target="_blank">axpy.cu</a>``)<br>

> +which multiplies a ``float`` array by a ``float`` scalar (AXPY).<br>

> +<br>

> +.. code-block:: c++<br>

> +<br>

> +  #include <helper_cuda.h> // for checkCudaErrors<br>

> +<br>

> +  #include <iostream><br>

> +<br>

> +  __global__ void axpy(float a, float* x, float* y) {<br>

> +    y[threadIdx.x] = a * x[threadIdx.x];<br>

> +  }<br>

> +<br>

> +  int main(int argc, char* argv[]) {<br>

> +    const int kDataLen = 4;<br>

> +<br>

> +    float a = 2.0f;<br>

> +    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};<br>

> +    float host_y[kDataLen];<br>

> +<br>

> +    // Copy input data to device.<br>

> +    float* device_x;<br>

> +    float* device_y;<br>

> +    checkCudaErrors(cudaMalloc(&device_x, kDataLen *<br>

> sizeof(float)));<br>

> +    checkCudaErrors(cudaMalloc(&device_y, kDataLen *<br>

> sizeof(float)));<br>

> +    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen *<br>

> sizeof(float),<br>

> +                               cudaMemcpyHostToDevice));<br>

> +<br>

> +    // Launch the kernel.<br>

> +    axpy<<<1, kDataLen>>>(a, device_x, device_y);<br>

> +<br>

> +    // Copy output data to host.<br>

> +    checkCudaErrors(cudaDeviceSynchronize());<br>

> +    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen *<br>

> sizeof(float),<br>

> +                               cudaMemcpyDeviceToHost));<br>

> +<br>

> +    // Print the results.<br>

> +    for (int i = 0; i < kDataLen; ++i) {<br>

> +      std::cout << "y[" << i << "] = " << host_y[i] << "\n";<br>

> +    }<br>

> +<br>

> +    checkCudaErrors(cudaDeviceReset());<br>

> +    return 0;<br>

> +  }<br>

> +<br>

> +The command line for compilation is similar to what you would use<br>

> for C++.<br>

> +<br>

> +.. code-block:: console<br>

> +<br>

> +  $ clang++ -o axpy -I<CUDA install path>/samples/common/inc -L<CUDA<br>

> install path>/<lib64 or lib> <a href="http://axpy.cu" rel="noreferrer" target="_blank">axpy.cu</a> -lcudart_static -lcuda -ldl<br>

> -lrt -pthread<br>

> +  $ ./axpy<br>

> +  y[0] = 2<br>

> +  y[1] = 4<br>

> +  y[2] = 6<br>

> +  y[3] = 8<br>

> +<br>

> +Note that ``helper_cuda.h`` comes from the CUDA samples, so you need<br>

> the<br>

> +samples installed for this example. ``<CUDA install path>`` is the<br>

> root<br>

> +directory where you installed CUDA SDK, typically<br>

> ``/usr/local/cuda``.<br>

> +<br>

> +Optimizations<br>

> +=============<br>

> +<br>

> +CPU and GPU have different design philosophies and architectures.<br>

> For example, a<br>

> +typical CPU has branch prediction, out-of-order execution, and is<br>

> superscalar,<br>

> +whereas a typical GPU has none of these. Due to such differences, an<br>

> +optimization pipeline well-tuned for CPUs may be not suitable for<br>

> GPUs.<br>

> +<br>

> +LLVM performs several general and CUDA-specific optimizations for<br>

> GPUs. The<br>

> +list below shows some of the more important optimizations for GPUs.<br>

> Most of<br>

> +them have been upstreamed to ``lib/Transforms/Scalar`` and<br>

> +``lib/Target/NVPTX``. A few of them have not been upstreamed due to<br>

> lack of a<br>

> +customizable target-independent optimization pipeline.<br>

> +<br>

> +* **Straight-line scalar optimizations**. These optimizations reduce<br>

> redundancy<br>

> +  in straight-line code. Details can be found in the `design<br>

> document for<br>

> +  straight-line scalar optimizations <<a href="https://goo.gl/4Rb9As" rel="noreferrer" target="_blank">https://goo.gl/4Rb9As</a>>`_.<br>

> +<br>

> +* **Inferring memory spaces**. `This optimization<br>

> +<br>

>  <<a href="http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html" rel="noreferrer" target="_blank">http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html</a>>`_<br>

> +  infers the memory space of an address so that the backend can emit<br>

> faster<br>

> +  special loads and stores from it. Details can be found in the<br>

> `design<br>

> +  document for memory space inference <<a href="https://goo.gl/5wH2Ct" rel="noreferrer" target="_blank">https://goo.gl/5wH2Ct</a>>`_.<br>

> +<br>

> +* **Aggressive loop unrooling and function inlining**. Loop<br>

> unrolling and<br>

> +  function inlining need to be more aggressive for GPUs than for<br>

> CPUs because<br>

> +  control flow transfer in GPU is more expensive. They also promote<br>

> other<br>

> +  optimizations such as constant propagation and SROA which<br>

> sometimes speed up<br>

> +  code by over 10x. An empirical inline threshold for GPUs is 1100.<br>

> This<br>

> +  configuration has yet to be upstreamed with a target-specific<br>

> optimization<br>

> +  pipeline. LLVM also provides `loop unrolling pragmas<br>

> +<br>

>  <<a href="http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll" rel="noreferrer" target="_blank">http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll</a>>`_<br>

> +  and ``__attribute__((always_inline))`` for programmers to force<br>

> unrolling and<br>

> +  inling.<br>

> +<br>

> +* **Aggressive speculative execution**. `This transformation<br>

> +<br>

>  <<a href="http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html" rel="noreferrer" target="_blank">http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html</a>>`_<br>

> is<br>

> +  mainly for promoting straight-line scalar optimizations which are<br>

> most<br>

> +  effective on code along dominator paths.<br>

> +<br>

> +* **Memory-space alias analysis**. `This alias analysis<br>

> +  <<a href="http://llvm.org/docs/NVPTXUsage.html" rel="noreferrer" target="_blank">http://llvm.org/docs/NVPTXUsage.html</a>>`_ infers that two pointers<br>

> in different<br>

> +  special memory spaces do not alias. It has yet to be integrated to<br>

> the new<br>

> +  alias analysis infrastructure; the new infrastructure does not run<br>

> +  target-specific alias analysis.<br>

> +<br>

> +* **Bypassing 64-bit divides**. `An existing optimization<br>

> +<br>

>  <<a href="http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html" rel="noreferrer" target="_blank">http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html</a>>`_<br>

> +  enabled in the NVPTX backend. 64-bit integer divides are much<br>

> slower than<br>

> +  32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of<br>

> the 64-bit<br>

> +  divides in our benchmarks have a divisor and dividend which fit in<br>

> 32-bits at<br>

> +  runtime. This optimization provides a fast path for this common<br>

> case.<br>

><br>

> Modified: llvm/trunk/docs/index.rst<br>

> URL:<br>

> <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/index.rst?rev=252660&r1=252659&r2=252660&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/index.rst?rev=252660&r1=252659&r2=252660&view=diff</a><br>

> ==============================================================================<br>

> --- llvm/trunk/docs/index.rst (original)<br>

> +++ llvm/trunk/docs/index.rst Tue Nov 10 16:35:47 2015<br>

> @@ -86,6 +86,7 @@ representation.<br>

>     GetElementPtr<br>

>     Frontend/PerformanceTips<br>

>     MCJITDesignAndImplementation<br>

> +   CompileCudaWithLLVM<br>

><br>

>  :doc:`GettingStarted`<br>

>     Discusses how to get up and running quickly with the LLVM<br>

>     infrastructure.<br>

> @@ -371,6 +372,9 @@ For API clients and LLVM developers.<br>

>  :doc:`FaultMaps`<br>

>    LLVM support for folding control flow into faulting machine<br>

>    instructions.<br>

><br>

> +:doc:`CompileCudaWithLLVM`<br>

> +  LLVM support for CUDA.<br>

> +<br>

>  Development Process Documentation<br>

>  =================================<br>

><br>

><br>

><br>

> _______________________________________________<br>

> llvm-commits mailing list<br>

> <a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a><br>

> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits</a><br>

><br>

<br>

</div></div><span class="HOEnZb"><font color="#888888">--<br>

Hal Finkel<br>

Assistant Computational Scientist<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

</font></span></blockquote></div><br></div>