[llvm] r252660 - [doc] Compile CUDA with LLVM
Jingyue Wu via llvm-commits
llvm-commits at lists.llvm.org
Tue Jan 26 17:27:25 PST 2016
Sure.
Artem, you said you have some more flags you'd like to explain. Would you
mind adding them all together?
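For anyone finding this in the archives: the flag Hal mentions below is passed
directly to clang. A rough, untested sketch (sm_35 is only an example, and the
include/library flags from the document still apply):
  $ clang++ --cuda-gpu-arch=sm_35 axpy.cu -o axpy ...
This should define __CUDA_ARCH__ to 350 for the device-side compilation.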
On Tue, Jan 26, 2016 at 3:49 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> Hi Jingyue,
>
> I don't know what caveats exist (and, thus, don't feel I'm the best person
> to write the patch), but a colleague of mine just tried the CUDA support
> for the first time, and he needed to affect how __CUDA_ARCH__ was defined
> to compile his code. I found --cuda-gpu-arch=sm_35 from the regression
> tests, which seemed to work, and we should probably add something about
> this flag to our documentation.
>
> Thanks again,
> Hal
>
> ----- Original Message -----
> > From: "Jingyue Wu via llvm-commits" <llvm-commits at lists.llvm.org>
> > To: llvm-commits at lists.llvm.org
> > Sent: Tuesday, November 10, 2015 4:35:47 PM
> > Subject: [llvm] r252660 - [doc] Compile CUDA with LLVM
> >
> > Author: jingyue
> > Date: Tue Nov 10 16:35:47 2015
> > New Revision: 252660
> >
> > URL: http://llvm.org/viewvc/llvm-project?rev=252660&view=rev
> > Log:
> > [doc] Compile CUDA with LLVM
> >
> > Summary:
> > This patch adds documentation on compiling CUDA with LLVM as
> > requested by many
> > engineers and researchers. It includes not only user guides but also
> > some
> > internals (mostly optimizations) so that early adopters can start
> > hacking and
> > contributing.
> >
> > Quite a few researchers who contacted us haven't used LLVM before,
> > which is
> > unsurprising as it hasn't been long since LLVM picked up CUDA. So I
> > added a
> > short summary to help these folks get started with LLVM.
> >
> > I expect this document to evolve substantially down the road. The
> > user guides
> > will be much simplified after the Clang integration is done. However,
> > the
> > internals should continue growing to include for example performance
> > debugging
> > and key areas to improve.
> >
> > Reviewers: chandlerc, meheff, broune, tra
> >
> > Subscribers: silvas, jingyue, llvm-commits, eliben
> >
> > Differential Revision: http://reviews.llvm.org/D14370
> >
> > Added:
> > llvm/trunk/docs/CompileCudaWithLLVM.rst
> > Modified:
> > llvm/trunk/docs/index.rst
> >
> > Added: llvm/trunk/docs/CompileCudaWithLLVM.rst
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CompileCudaWithLLVM.rst?rev=252660&view=auto
> > ==============================================================================
> > --- llvm/trunk/docs/CompileCudaWithLLVM.rst (added)
> > +++ llvm/trunk/docs/CompileCudaWithLLVM.rst Tue Nov 10 16:35:47 2015
> > @@ -0,0 +1,192 @@
> > +===================================
> > +Compiling CUDA C/C++ with LLVM
> > +===================================
> > +
> > +.. contents::
> > + :local:
> > +
> > +Introduction
> > +============
> > +
> > +This document contains the user guides and the internals of
> > compiling CUDA
> > +C/C++ with LLVM. It is aimed at both users who want to compile CUDA
> > with LLVM
> > +and developers who want to improve LLVM for GPUs. This document
> > assumes a basic
> > +familiarity with CUDA. Information about CUDA programming can be
> > found in the
> > +`CUDA programming guide
> > +<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.
> > +
> > +How to Build LLVM with CUDA Support
> > +===================================
> > +
> > +The support for CUDA is still in progress and temporarily relies on
> > `this patch
> > +<http://reviews.llvm.org/D14452>`_. Below is a quick summary of
> > downloading and
> > +building LLVM with CUDA support. Consult the `Getting Started
> > +<http://llvm.org/docs/GettingStarted.html>`_ page for more details
> > on setting
> > +up LLVM.
> > +
> > +#. Check out LLVM
> > +
> > + .. code-block:: console
> > +
> > + $ cd where-you-want-llvm-to-live
> > + $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
> > +
> > +#. Check out Clang
> > +
> > + .. code-block:: console
> > +
> > + $ cd where-you-want-llvm-to-live
> > + $ cd llvm/tools
> > + $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
> > +
> > +#. Apply the temporary patch for CUDA support.
> > +
> > + If you have installed `Arcanist
> > + <http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-command-line>`_,
> > + you can apply this patch using
> > +
> > + .. code-block:: console
> > +
> > + $ cd where-you-want-llvm-to-live
> > + $ cd llvm/tools/clang
> > + $ arc patch D14452
> > +
> > + Otherwise, go to `its review page <http://reviews.llvm.org/D14452>`_,
> > + download the raw diff, and apply it manually using
> > +
> > + .. code-block:: console
> > +
> > + $ cd where-you-want-llvm-to-live
> > + $ cd llvm/tools/clang
> > + $ patch -p0 < D14452.diff
> > +
> > +#. Configure and build LLVM and Clang
> > +
> > + .. code-block:: console
> > +
> > + $ cd where-you-want-llvm-to-live
> > + $ mkdir build
> > + $ cd build
> > + $ cmake [options] ..
> > + $ make
> > +
> > +How to Compile CUDA C/C++ with LLVM
> > +===================================
> > +
> > +We assume you have installed the CUDA driver and runtime. Consult
> > the `NVIDIA
> > +CUDA installation Guide
> > +<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
> > +you have not.
> > +
> > +Suppose you want to compile and run the following CUDA program
> > (``axpy.cu``)
> > +which multiplies a ``float`` array by a ``float`` scalar (AXPY).
> > +
> > +.. code-block:: c++
> > +
> > + #include <helper_cuda.h> // for checkCudaErrors
> > +
> > + #include <iostream>
> > +
> > + __global__ void axpy(float a, float* x, float* y) {
> > + y[threadIdx.x] = a * x[threadIdx.x];
> > + }
> > +
> > + int main(int argc, char* argv[]) {
> > + const int kDataLen = 4;
> > +
> > + float a = 2.0f;
> > + float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
> > + float host_y[kDataLen];
> > +
> > + // Copy input data to device.
> > + float* device_x;
> > + float* device_y;
> > + checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
> > + checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
> > + checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
> > + cudaMemcpyHostToDevice));
> > +
> > + // Launch the kernel.
> > + axpy<<<1, kDataLen>>>(a, device_x, device_y);
> > +
> > + // Copy output data to host.
> > + checkCudaErrors(cudaDeviceSynchronize());
> > + checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
> > + cudaMemcpyDeviceToHost));
> > +
> > + // Print the results.
> > + for (int i = 0; i < kDataLen; ++i) {
> > + std::cout << "y[" << i << "] = " << host_y[i] << "\n";
> > + }
> > +
> > + checkCudaErrors(cudaDeviceReset());
> > + return 0;
> > + }
> > +
> > +The command line for compilation is similar to what you would use
> > for C++.
> > +
> > +.. code-block:: console
> > +
> > + $ clang++ -o axpy -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread
> > + $ ./axpy
> > + y[0] = 2
> > + y[1] = 4
> > + y[2] = 6
> > + y[3] = 8
> > +
> > +Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the
> > +samples installed for this example. ``<CUDA install path>`` is the root
> > +directory where you installed the CUDA SDK, typically ``/usr/local/cuda``.
> > +
> > +Optimizations
> > +=============
> > +
> > +CPUs and GPUs have different design philosophies and architectures. For
> > +example, a typical CPU employs branch prediction, out-of-order execution, and
> > +superscalar execution, whereas a typical GPU has none of these. Due to such
> > +differences, an optimization pipeline well-tuned for CPUs may not be suitable
> > +for GPUs.
> > +
> > +LLVM performs several general and CUDA-specific optimizations for GPUs. The
> > +list below shows some of the more important optimizations for GPUs. Most of
> > +them have been upstreamed to ``lib/Transforms/Scalar`` and
> > +``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
> > +customizable target-independent optimization pipeline.
> > +
> > +* **Straight-line scalar optimizations**. These optimizations reduce redundancy
> > + in straight-line code. Details can be found in the `design document for
> > + straight-line scalar optimizations <https://goo.gl/4Rb9As>`_. A small
> > + example appears after this list.
> > +
> > +* **Inferring memory spaces**. `This optimization
> > + <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
> > + infers the memory space of an address so that the backend can emit faster
> > + special loads and stores from it. Details can be found in the `design
> > + document for memory space inference <https://goo.gl/5wH2Ct>`_; see the
> > + sketch after this list for the kind of code it targets.
> > +
> > +* **Aggressive loop unrolling and function inlining**. Loop unrolling and
> > + function inlining need to be more aggressive for GPUs than for CPUs because
> > + control flow transfer on a GPU is more expensive. They also promote other
> > + optimizations, such as constant propagation and SROA, which sometimes speed
> > + up code by over 10x. An empirical inline threshold for GPUs is 1100. This
> > + configuration has yet to be upstreamed with a target-specific optimization
> > + pipeline. LLVM also provides `loop unrolling pragmas
> > + <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
> > + and ``__attribute__((always_inline))`` for programmers to force unrolling
> > + and inlining; a short example appears after this list.
> > +
> > +* **Aggressive speculative execution**. `This transformation
> > + <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
> > + mainly for promoting straight-line scalar optimizations which are most
> > + effective on code along dominator paths.
> > +
> > +* **Memory-space alias analysis**. `This alias analysis
> > + <http://llvm.org/docs/NVPTXUsage.html>`_ infers that two pointers in different
> > + special memory spaces do not alias. It has yet to be integrated into the new
> > + alias analysis infrastructure; the new infrastructure does not run
> > + target-specific alias analysis.
> > +
> > +* **Bypassing 64-bit divides**. `An existing optimization
> > + <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ is
> > + enabled in the NVPTX backend. 64-bit integer divides are much slower than
> > + 32-bit ones on NVIDIA GPUs due to the lack of a divide unit. Many of the
> > + 64-bit divides in our benchmarks have a divisor and dividend which fit in
> > + 32 bits at runtime. This optimization provides a fast path for this common
> > + case; a sketch of the pattern appears after this list.
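> > +
> > +The short examples below are illustrative sketches only; they are not part of
> > +this patch or of any benchmark, and all names in them are made up.
> > +
> > +As an illustration of the straight-line scalar optimizations, the four stores
> > +below compute addresses that share the common term ``out + base``. Passes such
> > +as straight-line strength reduction can rewrite the later address computations
> > +as small offsets from the earlier ones instead of recomputing them.
> > +
> > +.. code-block:: c++
> > +
> > +  __global__ void store4(float* out, int base, float v) {
> > +    // Each store computes an address of the form out + (base + i); the
> > +    // straight-line optimizations reuse the shared part of that computation.
> > +    out[base + 0] = v;
> > +    out[base + 1] = v;
> > +    out[base + 2] = v;
> > +    out[base + 3] = v;
> > +  }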
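> > +
> > +For memory-space inference, consider the following sketch (assuming the kernel
> > +is launched with 64 threads per block). ``buffer`` is ``__shared__``, but
> > +``sum`` only sees a generic pointer. After ``sum`` is inlined, the optimization
> > +can prove that its loads read shared memory, so the backend may emit the
> > +cheaper shared-memory loads instead of generic ones.
> > +
> > +.. code-block:: c++
> > +
> > +  __device__ float sum(const float* data, int n) {
> > +    float s = 0.0f;
> > +    for (int i = 0; i < n; ++i)
> > +      s += data[i];  // loads through a generic pointer
> > +    return s;
> > +  }
> > +
> > +  __global__ void reduce_shared(float* out) {
> > +    __shared__ float buffer[64];
> > +    buffer[threadIdx.x] = static_cast<float>(threadIdx.x);
> > +    __syncthreads();
> > +    // Once sum is inlined, its loads can be proven to access shared memory
> > +    // rather than the generic address space.
> > +    if (threadIdx.x == 0)
> > +      *out = sum(buffer, 64);
> > +  }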
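> > +
> > +The unrolling pragma and inlining attribute mentioned above can be applied as
> > +in this sketch (the trip count of 4 is arbitrary):
> > +
> > +.. code-block:: c++
> > +
> > +  __device__ inline __attribute__((always_inline)) float squared(float v) {
> > +    return v * v;
> > +  }
> > +
> > +  __global__ void square4(float* x) {
> > +    int base = threadIdx.x * 4;
> > +    #pragma unroll
> > +    for (int i = 0; i < 4; ++i)  // known trip count; the pragma requests full unrolling
> > +      x[base + i] = squared(x[base + i]);
> > +  }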
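> > +
> > +Finally, a sketch of the pattern the 64-bit divide bypass targets: the operands
> > +are declared 64-bit, but their values usually fit in 32 bits at run time, so
> > +the transformation adds a run-time check and a fast 32-bit divide alongside the
> > +slow 64-bit path.
> > +
> > +.. code-block:: c++
> > +
> > +  __global__ void chunk_index(const long long* offsets, long long chunk_size,
> > +                              long long* result) {
> > +    // offsets[i] and chunk_size are 64-bit, but commonly hold small values,
> > +    // so the bypassed 32-bit divide handles the frequent case.
> > +    result[threadIdx.x] = offsets[threadIdx.x] / chunk_size;
> > +  }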
> >
> > Modified: llvm/trunk/docs/index.rst
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/index.rst?rev=252660&r1=252659&r2=252660&view=diff
> > ==============================================================================
> > --- llvm/trunk/docs/index.rst (original)
> > +++ llvm/trunk/docs/index.rst Tue Nov 10 16:35:47 2015
> > @@ -86,6 +86,7 @@ representation.
> > GetElementPtr
> > Frontend/PerformanceTips
> > MCJITDesignAndImplementation
> > + CompileCudaWithLLVM
> >
> > :doc:`GettingStarted`
> > Discusses how to get up and running quickly with the LLVM
> > infrastructure.
> > @@ -371,6 +372,9 @@ For API clients and LLVM developers.
> > :doc:`FaultMaps`
> > LLVM support for folding control flow into faulting machine
> > instructions.
> >
> > +:doc:`CompileCudaWithLLVM`
> > + LLVM support for CUDA.
> > +
> > Development Process Documentation
> > =================================
> >
> >
> >
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>