[llvm] r252660 - [doc] Compile CUDA with LLVM
Hal Finkel via llvm-commits
llvm-commits at lists.llvm.org
Tue Jan 26 15:49:54 PST 2016
Hi Jingyue,
I don't know what caveats exist (and thus don't feel I'm the best person to write the patch), but a colleague of mine just tried the CUDA support for the first time, and he needed to control how __CUDA_ARCH__ was defined in order to compile his code. I found --cuda-gpu-arch=sm_35 in the regression tests, which seemed to work; we should probably add something about this flag to our documentation.
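For concreteness, the invocation I have in mind looks something like this (a sketch, not his exact command line; the paths are placeholders):

  $ clang++ --cuda-gpu-arch=sm_35 -o axpy axpy.cu \
      -I<CUDA install path>/samples/common/inc \
      -L<CUDA install path>/lib64 -lcudart_static -lcuda -ldl -lrt -pthread

If I understand correctly, sm_35 corresponds to compute capability 3.5, so __CUDA_ARCH__ ends up defined as 350 during device-side compilation.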
Thanks again,
Hal
----- Original Message -----
> From: "Jingyue Wu via llvm-commits" <llvm-commits at lists.llvm.org>
> To: llvm-commits at lists.llvm.org
> Sent: Tuesday, November 10, 2015 4:35:47 PM
> Subject: [llvm] r252660 - [doc] Compile CUDA with LLVM
>
> Author: jingyue
> Date: Tue Nov 10 16:35:47 2015
> New Revision: 252660
>
> URL: http://llvm.org/viewvc/llvm-project?rev=252660&view=rev
> Log:
> [doc] Compile CUDA with LLVM
>
> Summary:
> This patch adds documentation on compiling CUDA with LLVM, as requested by
> many engineers and researchers. It includes not only user guides but also
> some internals (mostly optimizations) so that early adopters can start
> hacking and contributing.
>
> Quite a few researchers who contacted us haven't used LLVM before, which is
> unsurprising, as it hasn't been long since LLVM picked up CUDA. So I added
> a short summary to help these folks get started with LLVM.
>
> I expect this document to evolve substantially down the road. The user
> guides will be much simplified after the Clang integration is done.
> However, the internals should continue growing to include, for example,
> performance debugging and key areas to improve.
>
> Reviewers: chandlerc, meheff, broune, tra
>
> Subscribers: silvas, jingyue, llvm-commits, eliben
>
> Differential Revision: http://reviews.llvm.org/D14370
>
> Added:
> llvm/trunk/docs/CompileCudaWithLLVM.rst
> Modified:
> llvm/trunk/docs/index.rst
>
> Added: llvm/trunk/docs/CompileCudaWithLLVM.rst
> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CompileCudaWithLLVM.rst?rev=252660&view=auto
> ==============================================================================
> --- llvm/trunk/docs/CompileCudaWithLLVM.rst (added)
> +++ llvm/trunk/docs/CompileCudaWithLLVM.rst Tue Nov 10 16:35:47 2015
> @@ -0,0 +1,192 @@
> +===================================
> +Compiling CUDA C/C++ with LLVM
> +===================================
> +
> +.. contents::
> + :local:
> +
> +Introduction
> +============
> +
> +This document contains the user guides and the internals of compiling CUDA
> +C/C++ with LLVM. It is aimed at both users who want to compile CUDA with
> +LLVM and developers who want to improve LLVM for GPUs. This document
> +assumes a basic familiarity with CUDA. Information about CUDA programming
> +can be found in the `CUDA programming guide
> +<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.
> +
> +How to Build LLVM with CUDA Support
> +===================================
> +
> +The support for CUDA is still in progress and temporarily relies on `this
> +patch <http://reviews.llvm.org/D14452>`_. Below is a quick summary of
> +downloading and building LLVM with CUDA support. Consult the `Getting
> +Started <http://llvm.org/docs/GettingStarted.html>`_ page for more details
> +on setting up LLVM.
> +
> +#. Check out LLVM
> +
> + .. code-block:: console
> +
> + $ cd where-you-want-llvm-to-live
> + $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
> +
> +#. Check out Clang
> +
> + .. code-block:: console
> +
> + $ cd where-you-want-llvm-to-live
> + $ cd llvm/tools
> + $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
> +
> +#. Apply the temporary patch for CUDA support.
> +
> + If you have installed `Arcanist
> + <http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-command-line>`_,
> + you can apply this patch using
> +
> + .. code-block:: console
> +
> + $ cd where-you-want-llvm-to-live
> + $ cd llvm/tools/clang
> + $ arc patch D14452
> +
> + Otherwise, go to `its review page <http://reviews.llvm.org/D14452>`_,
> + download the raw diff, and apply it manually using
> +
> + .. code-block:: console
> +
> + $ cd where-you-want-llvm-to-live
> + $ cd llvm/tools/clang
> + $ patch -p0 < D14452.diff
> +
> +#. Configure and build LLVM and Clang
> +
> + .. code-block:: console
> +
> + $ cd where-you-want-llvm-to-live
> + $ mkdir build
> + $ cd build
> + $ cmake [options] ..
> + $ make
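> +
> + The ``[options]`` are up to you. As one plausible starting point (these
> + are standard CMake/LLVM options, shown here only as an example, not a
> + requirement), a release build might be configured as:
> +
> + .. code-block:: console
> +
> + $ cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release \
> +     -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" ..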
> +
> +How to Compile CUDA C/C++ with LLVM
> +===================================
> +
> +We assume you have installed the CUDA driver and runtime. Consult the
> +`NVIDIA CUDA Installation Guide
> +<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_
> +if you have not.
> +
> +Suppose you want to compile and run the following CUDA program
> +(``axpy.cu``), which multiplies a ``float`` array by a ``float`` scalar
> +(AXPY).
> +
> +.. code-block:: c++
> +
> +  #include <helper_cuda.h> // for checkCudaErrors
> +
> +  #include <iostream>
> +
> +  __global__ void axpy(float a, float* x, float* y) {
> +    y[threadIdx.x] = a * x[threadIdx.x];
> +  }
> +
> +  int main(int argc, char* argv[]) {
> +    const int kDataLen = 4;
> +
> +    float a = 2.0f;
> +    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
> +    float host_y[kDataLen];
> +
> +    // Allocate device buffers and copy the input data to the device.
> +    float* device_x;
> +    float* device_y;
> +    checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
> +    checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
> +    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
> +                               cudaMemcpyHostToDevice));
> +
> +    // Launch the kernel: one block of kDataLen threads.
> +    axpy<<<1, kDataLen>>>(a, device_x, device_y);
> +
> +    // Wait for the kernel to finish, then copy the output back to the host.
> +    checkCudaErrors(cudaDeviceSynchronize());
> +    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
> +                               cudaMemcpyDeviceToHost));
> +
> +    // Print the results.
> +    for (int i = 0; i < kDataLen; ++i) {
> +      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
> +    }
> +
> +    checkCudaErrors(cudaDeviceReset());
> +    return 0;
> +  }
> +
> +The command line for compilation is similar to what you would use for C++.
> +
> +.. code-block:: console
> +
> +  $ clang++ -o axpy -I<CUDA install path>/samples/common/inc \
> +      -L<CUDA install path>/<lib64 or lib> axpy.cu \
> +      -lcudart_static -lcuda -ldl -lrt -pthread
> +  $ ./axpy
> +  y[0] = 2
> +  y[1] = 4
> +  y[2] = 6
> +  y[3] = 8
> +
> +Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the
> +samples installed for this example. ``<CUDA install path>`` is the root
> +directory where you installed the CUDA SDK, typically ``/usr/local/cuda``.
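> +
> +If you do not have the samples installed, a minimal stand-in for
> +``checkCudaErrors`` is easy to write. The sketch below is not the samples'
> +implementation, just one plausible definition that makes the example
> +self-contained:
> +
> +.. code-block:: c++
> +
> +  #include <cstdlib>
> +  #include <iostream>
> +
> +  #include <cuda_runtime.h>
> +
> +  // Abort with a readable message if a CUDA runtime call fails.
> +  #define checkCudaErrors(expr)                                        \
> +    do {                                                               \
> +      cudaError_t err_ = (expr);                                       \
> +      if (err_ != cudaSuccess) {                                       \
> +        std::cerr << "CUDA error at " << __FILE__ << ":" << __LINE__   \
> +                  << ": " << cudaGetErrorString(err_) << "\n";         \
> +        std::exit(EXIT_FAILURE);                                       \
> +      }                                                                \
> +    } while (0)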
> +
> +Optimizations
> +=============
> +
> +CPUs and GPUs have different design philosophies and architectures. For
> +example, a typical CPU has branch prediction, out-of-order execution, and
> +is superscalar, whereas a typical GPU has none of these. Due to such
> +differences, an optimization pipeline well-tuned for CPUs may not be
> +suitable for GPUs.
> +
> +LLVM performs several general and CUDA-specific optimizations for GPUs.
> +The list below shows some of the more important ones. Most of them have
> +been upstreamed to ``lib/Transforms/Scalar`` and ``lib/Target/NVPTX``. A
> +few of them have not been upstreamed due to the lack of a customizable
> +target-independent optimization pipeline.
> +
> +* **Straight-line scalar optimizations**. These optimizations reduce
> +  redundancy in straight-line code; a small illustrative kernel follows
> +  this list. Details can be found in the `design document for
> +  straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.
> +
> +* **Inferring memory spaces**. `This optimization
> +  <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
> +  infers the memory space of an address so that the backend can emit
> +  faster special loads and stores from it. Details can be found in the
> +  `design document for memory space inference <https://goo.gl/5wH2Ct>`_.
> +
> +* **Aggressive loop unrolling and function inlining**. Loop unrolling and
> +  function inlining need to be more aggressive for GPUs than for CPUs
> +  because control flow transfer on GPUs is more expensive. They also
> +  promote other optimizations, such as constant propagation and SROA,
> +  which sometimes speed up code by over 10x. An empirical inline threshold
> +  for GPUs is 1100. This configuration has yet to be upstreamed with a
> +  target-specific optimization pipeline. LLVM also provides `loop
> +  unrolling pragmas
> +  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
> +  and ``__attribute__((always_inline))`` for programmers to force
> +  unrolling and inlining.
> +
> +* **Aggressive speculative execution**. `This transformation
> +  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
> +  is mainly for promoting straight-line scalar optimizations, which are
> +  most effective on code along dominator paths.
> +
> +* **Memory-space alias analysis**. `This alias analysis
> +  <http://llvm.org/docs/NVPTXUsage.html>`_ infers that two pointers in
> +  different special memory spaces do not alias. It has yet to be
> +  integrated into the new alias analysis infrastructure; the new
> +  infrastructure does not run target-specific alias analysis.
> +
> +* **Bypassing 64-bit divides**. `An existing optimization
> +  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
> +  is enabled in the NVPTX backend. 64-bit integer divides are much slower
> +  than 32-bit ones on NVIDIA GPUs due to the lack of a divide unit. Many
> +  of the 64-bit divides in our benchmarks have a divisor and dividend
> +  which fit in 32 bits at runtime. This optimization provides a fast path
> +  for this common case.
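> +
> +To make the first bullet concrete, here is a small, hypothetical kernel
> +(not taken from any real benchmark) showing the kind of redundancy that
> +straight-line scalar optimizations target: the four address computations
> +share the same partial sum and differ only by a constant offset.
> +
> +.. code-block:: c++
> +
> +  __global__ void copy4(const float* in, float* out) {
> +    // base is recomputed, in effect, by each of the four address
> +    // expressions below; the optimizer can compute it once and reuse it.
> +    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
> +    out[base]     = in[base];
> +    out[base + 1] = in[base + 1]; // &in[base] plus a constant offset
> +    out[base + 2] = in[base + 2];
> +    out[base + 3] = in[base + 3];
> +  }
> +
> +Similarly, placing ``#pragma unroll`` immediately before a loop asks Clang
> +to unroll it, per the pragma documentation linked above.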
>
> Modified: llvm/trunk/docs/index.rst
> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/index.rst?rev=252660&r1=252659&r2=252660&view=diff
> ==============================================================================
> --- llvm/trunk/docs/index.rst (original)
> +++ llvm/trunk/docs/index.rst Tue Nov 10 16:35:47 2015
> @@ -86,6 +86,7 @@ representation.
> GetElementPtr
> Frontend/PerformanceTips
> MCJITDesignAndImplementation
> + CompileCudaWithLLVM
>
> :doc:`GettingStarted`
> Discusses how to get up and running quickly with the LLVM infrastructure.
> @@ -371,6 +372,9 @@ For API clients and LLVM developers.
> :doc:`FaultMaps`
> LLVM support for folding control flow into faulting machine instructions.
>
> +:doc:`CompileCudaWithLLVM`
> + LLVM support for CUDA.
> +
> Development Process Documentation
> =================================
>
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory