[PATCH] D14370: [doc] Compile CUDA with LLVM

Bjarke Hammersholt Roune via llvm-commits <llvm-commits at lists.llvm.org>
Thu Nov 5 11:09:01 PST 2015


broune accepted this revision.
This revision is now accepted and ready to land.

================
Comment at: docs/CompileCudaWithLLVM.rst:13
@@ +12,3 @@
+with LLVM. It is written for not only users who want to compile CUDA with LLVM
+but also developers who want to improve LLVM for GPUs. This
+document assumes a basic familiarity with CUDA. Information about CUDA
----------------
Could be:

It is aimed at both users who want to compile CUDA with LLVM and developers who want to improve LLVM for GPUs.

================
Comment at: docs/CompileCudaWithLLVM.rst:69
@@ +68,3 @@
+The CUDA driver compiles PTX at runtime to the low-level machine instruction set
+called *SASS* that executes naively on GPU.
+
----------------
naively -> natively

================
Comment at: docs/CompileCudaWithLLVM.rst:81
@@ +80,3 @@
+early adopter, one would have to manually extract device code to a separate
+file, compile it to PTX, and have the host code load and launch the kernel.
+
----------------
Could be:

Therefore, for early adopters using CUDA with LLVM now, it is necessary to manually ...


================
Comment at: docs/CompileCudaWithLLVM.rst:84
@@ +83,3 @@
+For example, suppose you want to compile the following mixed-mode CUDA program
+that multiplies a ``float`` vector by a ``float`` scalar (AXPY).
+
----------------
"vector" is correct here, though it could suggest a std::vector. "array" wouldn't have that connotation.

Also, maybe "(this operation is sometimes referred to as AXPY)", as just "(AXPY)" would likely seem rather cryptic to someone who doesn't know what AXPY is.
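
For readers to whom the name is new: AXPY is the classic BLAS level-1
operation y = a*x + y ("a times x plus y"). A plain C sketch of the
computation (not necessarily the patch's exact example, which may only do the
scaling step):

    // AXPY: scale x by a and accumulate into y, element by element.
    void axpy(int n, float a, const float* x, float* y) {
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }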

================
Comment at: docs/CompileCudaWithLLVM.rst:131
@@ +130,3 @@
+
+#. Extract the kernel to a separate file (supposingly ``axpy.cu``)
+
----------------
to a separate file (supposingly axpy.cu) -> to a separate file axpy.cu.
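
A hedged sketch of what the extracted device file might contain (the
extern "C" is my assumption, to keep the symbol unmangled for the driver-API
lookup in the later host-code step):

    // axpy.cu -- device code only, one thread per element.
    extern "C" __global__ void axpy(float a, float* x, float* y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      y[i] = a * x[i] + y[i];
    }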

================
Comment at: docs/CompileCudaWithLLVM.rst:141
@@ +140,3 @@
+
+#. Compile the device code ``axpy.cu`` to PTX (supposingly ``axpy.ptx``)
+
----------------
to PTX (supposingly axpy.ptx) -> to a PTX file axpy.ptx
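
For illustration, with a sufficiently recent clang this step can look roughly
like the following (the flags are an assumption on my part; the patch text
documents the authoritative command):

    # Emit PTX for the device side only; sm_35 matches the Tesla K40
    # example discussed in the next hunk.
    clang++ -S --cuda-device-only --cuda-gpu-arch=sm_35 axpy.cu -o axpy.ptx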

================
Comment at: docs/CompileCudaWithLLVM.rst:163
@@ +162,3 @@
+     looked up at the `CUDA GPUs <https://developer.nvidia.com/cuda-gpus>`_
+     page. For example, if your GPU is Tesla K40, the compute capabitliy should
+     be ``sm_35``.
----------------
capabitliy -> capability

================
Comment at: docs/CompileCudaWithLLVM.rst:169
@@ +168,3 @@
+
+#. Modify the host code (supposingly ``axpy.cc``) to load and launch the kernel
+   in the PTX
----------------
host code (supposingly axpy.cc) -> host code in axpy.cc
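
As a rough sketch of what the modified host file involves (CUDA Driver API;
error checking omitted, and the unmangled "axpy" symbol and "axpy.ptx" file
name are carried over from the earlier steps as assumptions):

    // axpy.cc -- load axpy.ptx at runtime and launch the kernel.
    #include <cuda.h>

    int main() {
      const int kN = 4;
      float a = 2.0f, hx[kN] = {1, 2, 3, 4}, hy[kN] = {5, 6, 7, 8};

      CUdevice dev;
      CUcontext ctx;
      CUmodule mod;
      CUfunction kernel;
      cuInit(0);
      cuDeviceGet(&dev, 0);
      cuCtxCreate(&ctx, 0, dev);
      cuModuleLoad(&mod, "axpy.ptx");            // driver JITs the PTX to SASS
      cuModuleGetFunction(&kernel, mod, "axpy");

      CUdeviceptr dx, dy;
      cuMemAlloc(&dx, sizeof(hx));
      cuMemAlloc(&dy, sizeof(hy));
      cuMemcpyHtoD(dx, hx, sizeof(hx));
      cuMemcpyHtoD(dy, hy, sizeof(hy));

      void* args[] = {&a, &dx, &dy};
      cuLaunchKernel(kernel, 1, 1, 1, kN, 1, 1, 0, 0, args, 0);
      cuMemcpyDtoH(hy, dy, sizeof(hy));          // hy now holds a*x + y
      return 0;
    }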

================
Comment at: docs/CompileCudaWithLLVM.rst:223
@@ +222,3 @@
+CPU and GPU have different design philosophies and architectures. For example, a
+typical CPU has branch prediction, out-of-order execution, and superscalar,
+whereas a typical GPU has none of these. Due to these differences, an
----------------
and superscalar -> and is superscalar

================
Comment at: docs/CompileCudaWithLLVM.rst:224
@@ +223,3 @@
+typical CPU has branch prediction, out-of-order execution, and superscalar,
+whereas a typical GPU has none of these. Due to these differences, an
+optimization pipeline well-tuned for CPUs may be not suitable for GPUs.
----------------
these differences -> such differences

(the list is not exhaustive)

================
Comment at: docs/CompileCudaWithLLVM.rst:228
@@ +227,3 @@
+LLVM performs several general and CUDA-specific optimizations for GPUs. Below is
+a list of major ones. Most of them have been upstreamed to
+``lib/Transforms/Scalar`` and ``lib/Target/NVPTX``. Some are punted due to lack
----------------
This suggests that these are the only major ones. "The list below shows some of the more important optimizations for GPUs."

================
Comment at: docs/CompileCudaWithLLVM.rst:229
@@ +228,3 @@
+a list of major ones. Most of them have been upstreamed to
+``lib/Transforms/Scalar`` and ``lib/Target/NVPTX``. Some are punted due to lack
+of a customizable target-independent optimization pipeline.
----------------
I had difficulty understanding this sentence. If I understood it correctly, this could be:

"A few of the optimizations have not been upstreamed due to ..."

================
Comment at: docs/CompileCudaWithLLVM.rst:238
@@ +237,3 @@
+  <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
+  infers the memory space of an address so that emits fast special loads and
+  stores from it. Details can be found in the `design document for memory space
----------------
so that emits fast special loads -> so that the backend can emit faster specialized loads
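
As a concrete illustration of the payoff (my sketch, not from the patch):
when the pass proves that a generic pointer in fact points into shared
memory, the backend can select ld.shared/st.shared rather than the slower
generic forms:

    __global__ void kernel(float* out) {
      __shared__ float tile[256];
      float* p = tile;                   // generic pointer, provably shared
      p[threadIdx.x] += 1.0f;            // after inference: ld.shared/st.shared
      out[threadIdx.x] = p[threadIdx.x]; // instead of generic ld/st
    }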

================
Comment at: docs/CompileCudaWithLLVM.rst:243
@@ +242,3 @@
+* **Aggressive loop unrooling and function inlining**. Loop unrolling and
+  function inlining are more encouraged for GPUs than for CPUs because control
+  flow transfer in GPU is more expensive. They also promote other optimizations
----------------
more encouraged -> needs to be more aggressive
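
For instance (a sketch of why unrolling promotes the other optimizations):
fully unrolling a fixed-trip-count loop turns the indices into constants,
letting SROA and constant propagation promote small arrays into registers
once the function is inlined:

    __device__ __forceinline__ float dot4(const float* a, const float* b) {
      float s = 0.0f;
    #pragma unroll                 // fully unrolled: i becomes 0,1,2,3, so
      for (int i = 0; i < 4; ++i)  // each a[i]/b[i] is a constant-index access
        s += a[i] * b[i];          // that SROA can promote after inlining
      return s;
    }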

================
Comment at: docs/CompileCudaWithLLVM.rst:246
@@ +245,3 @@
+  such as constant propagation and SROA which sometimes speed up code by over
+  10x. An empirical inline threshold for GPUs is 1100. This configuration is yet
+  to be upstreamed with a target-specific optimization pipeline. LLVM also
----------------
is yet -> has yet
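
For experimentation, the threshold can already be overridden from the command
line even without an upstreamed pipeline (a sketch; -mllvm forwards the
option to LLVM's inliner):

    clang++ -O3 -mllvm -inline-threshold=1100 ...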

================
Comment at: docs/CompileCudaWithLLVM.rst:259
@@ +258,3 @@
+* **Memory-space alias analysis**. `This alias analysis
+  <http://llvm.org/docs/NVPTXUsage.html>`_ knows that two pointers in different
+  special memory spaces do not alias. It is yet to be integrated to the new
----------------
knows -> infers
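
A small example of the kind of disambiguation this enables (my sketch,
assuming the kernel argument is known to point to global memory, as kernel
pointer arguments are):

    __global__ void kernel(float* out, int i) {
      __shared__ float s[256];
      float v = s[i];  // load from shared memory
      out[0] = v;      // store through a pointer to global memory: it cannot
      out[1] = s[i];   // alias s, so this reload of s[i] can reuse v
    }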

================
Comment at: docs/CompileCudaWithLLVM.rst:260
@@ +259,3 @@
+  <http://llvm.org/docs/NVPTXUsage.html>`_ knows that two pointers in different
+  special memory spaces do not alias. It is yet to be integrated to the new
+  alias analysis infrastructure; the new infrastructure does not run
----------------
is yet -> has yet


http://reviews.llvm.org/D14370
