[llvm] r280857 - [CUDA] Move AXPY example into gist.
Justin Lebar via llvm-commits
llvm-commits at lists.llvm.org
Wed Sep 7 13:37:41 PDT 2016
Author: jlebar
Date: Wed Sep 7 15:37:41 2016
New Revision: 280857
URL: http://llvm.org/viewvc/llvm-project?rev=280857&view=rev
Log:
[CUDA] Move AXPY example into gist.
No need to have a long inline code snippet in this doc.
Also move "flags that control numerical code" underneath the "invoking
clang" section, and reformat things a bit.
Modified:
llvm/trunk/docs/CompileCudaWithLLVM.rst
Modified: llvm/trunk/docs/CompileCudaWithLLVM.rst
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CompileCudaWithLLVM.rst?rev=280857&r1=280856&r2=280857&view=diff
==============================================================================
--- llvm/trunk/docs/CompileCudaWithLLVM.rst (original)
+++ llvm/trunk/docs/CompileCudaWithLLVM.rst Wed Sep 7 15:37:41 2016
@@ -1,6 +1,6 @@
-===================================
+=========================
Compiling CUDA with clang
-===================================
+=========================
.. contents::
:local:
@@ -36,58 +36,20 @@ by many Linux package managers; you prob
You will need CUDA 7.0 or 7.5 to compile with clang. CUDA 8 support is in the
works.
-Building AXPY
--------------
-
-Suppose you want to compile and run the following CUDA program (``axpy.cu``),
-which multiplies a ``float`` array by a ``float`` scalar.
-
-.. code-block:: c++
+Invoking clang
+--------------
- #include <iostream>
+Invoking clang for CUDA compilation works similarly to compiling regular C++.
+You just need to be aware of a few additional flags.
- __global__ void axpy(float a, float* x, float* y) {
- y[threadIdx.x] = a * x[threadIdx.x];
- }
-
- int main(int argc, char* argv[]) {
- const int kDataLen = 4;
-
- float a = 2.0f;
- float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
- float host_y[kDataLen];
-
- // Copy input data to device.
- float* device_x;
- float* device_y;
- cudaMalloc(&device_x, kDataLen * sizeof(float));
- cudaMalloc(&device_y, kDataLen * sizeof(float));
- cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
- cudaMemcpyHostToDevice);
-
- // Launch the kernel.
- axpy<<<1, kDataLen>>>(a, device_x, device_y);
-
- // Copy output data to host.
- cudaDeviceSynchronize();
- cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
- cudaMemcpyDeviceToHost);
-
- // Print the results.
- for (int i = 0; i < kDataLen; ++i) {
- std::cout << "y[" << i << "] = " << host_y[i] << "\n";
- }
-
- cudaDeviceReset();
- return 0;
- }
-
-The command line for compilation is similar to what you would use for C++.
+You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
+program as a toy example. Save it as ``axpy.cu``. To build and run it, use
+the following commands:
.. code-block:: console
- $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
- -L<CUDA install path>/<lib64 or lib> \
+ $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
+ -L<CUDA install path>/<lib64 or lib> \
-lcudart_static -ldl -lrt -pthread
$ ./axpy
y[0] = 2
@@ -95,50 +57,32 @@ The command line for compilation is simi
y[2] = 6
y[3] = 8
-``<CUDA install path>`` is the root directory where you installed CUDA SDK,
-typically ``/usr/local/cuda``. ``<GPU arch>`` is `the compute capability of
-your GPU <https://developer.nvidia.com/cuda-gpus>`_. For example, if you want
-to run your program on a GPU with compute capability of 3.5, you should specify
-``--cuda-gpu-arch=sm_35``.
-
-Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
-only ``sm_XX`` is currently supported. However, clang always includes PTX in
-its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be
-forwards-compatible with e.g. ``sm_35`` GPUs.
-
-You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.
-
-Detecting clang vs NVCC
-=======================
-
-Although clang's CUDA implementation is largely compatible with NVCC's, you may
-still want to detect when you're compiling CUDA code specifically with clang.
-
-This is tricky, because NVCC may invoke clang as part of its own compilation
-process! For example, NVCC uses the host compiler's preprocessor when
-compiling for device code, and that host compiler may in fact be clang.
-
-When clang is actually compiling CUDA code -- rather than being used as a
-subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
-defined only in device mode (but will be defined if NVCC is using clang as a
-preprocessor). So you can use the following incantations to detect clang CUDA
-compilation, in host and device modes:
+* clang detects that you're compiling CUDA by noticing that your source file ends
+ with ``.cu``. (Alternatively, you can pass ``-x cuda``.)
-.. code-block:: c++
+* ``<CUDA install path>`` is the root directory where you installed the CUDA
+  SDK, typically ``/usr/local/cuda``.
- #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
- // clang compiling CUDA code, host mode.
- #endif
+ Pass e.g. ``/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
+ pass ``/usr/local/cuda/lib``. (In CUDA, the device code and host code always
+ have the same pointer widths, so if you're compiling 64-bit code for the
+ host, you're also compiling 64-bit code for the device.)
+
+* ``<GPU arch>`` is `the compute capability of your GPU
+ <https://developer.nvidia.com/cuda-gpus>`_. For example, if you want to run
+ your program on a GPU with compute capability of 3.5, you should specify
+ ``--cuda-gpu-arch=sm_35``.
+
+ Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
+ only ``sm_XX`` is currently supported. However, clang always includes PTX in
+ its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be
+ forwards-compatible with e.g. ``sm_35`` GPUs.
- #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
- // clang compiling CUDA code, device mode.
- #endif
-
-Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
-detect NVCC specifically by looking for ``__NVCC__``.
+  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple
+  archs; see the combined example below.
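+
+Putting these flags together, a command line like the following (a sketch
+only: it assumes a hypothetical ``axpy.cc``, a 64-bit build, and a CUDA
+install under ``/usr/local/cuda``) compiles a CUDA file that lacks the
+``.cu`` extension for two different archs:
+
+.. code-block:: console
+
+    $ clang++ -x cuda axpy.cc -o axpy \
+        --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_50 \
+        -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
+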
Flags that control numerical code
-=================================
+---------------------------------
If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
@@ -177,6 +121,35 @@ Flags you may wish to tweak include:
This is implied by ``-ffast-math``.
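+
+For example (an illustrative command, reusing the ``axpy.cu`` example from
+above), these flags are passed like any other clang flags:
+
+.. code-block:: console
+
+    $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 -ffast-math \
+        -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
+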
+Detecting clang vs NVCC from code
+=================================
+
+Although clang's CUDA implementation is largely compatible with NVCC's, you may
+still want to detect when you're compiling CUDA code specifically with clang.
+
+This is tricky, because NVCC may invoke clang as part of its own compilation
+process! For example, NVCC uses the host compiler's preprocessor when
+compiling for device code, and that host compiler may in fact be clang.
+
+When clang is actually compiling CUDA code -- rather than being used as a
+subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
+defined only in device mode (but will be defined if NVCC is using clang as a
+preprocessor). So you can use the following incantations to detect clang CUDA
+compilation, in host and device modes:
+
+.. code-block:: c++
+
+ #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
+ // clang compiling CUDA code, host mode.
+ #endif
+
+ #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
+ // clang compiling CUDA code, device mode.
+ #endif
+
+Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
+detect NVCC specifically by looking for ``__NVCC__``.
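+
+For instance, a minimal sketch of an NVCC-specific check:
+
+.. code-block:: c++
+
+  #if defined(__NVCC__)
+  // NVCC is compiling this code; clang does not define __NVCC__.
+  #endif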
+
Optimizations
=============