[llvm] r280869 - [CUDA] Rework "optimizations" and "publication" section in CompileCudaWithLLVM.rst.
Justin Lebar via llvm-commits
llvm-commits at lists.llvm.org
Wed Sep 7 14:46:53 PDT 2016
Author: jlebar
Date: Wed Sep 7 16:46:53 2016
New Revision: 280869
URL: http://llvm.org/viewvc/llvm-project?rev=280869&view=rev
Log:
[CUDA] Rework "optimizations" and "publication" section in CompileCudaWithLLVM.rst.
Modified:
llvm/trunk/docs/CompileCudaWithLLVM.rst
Modified: llvm/trunk/docs/CompileCudaWithLLVM.rst
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CompileCudaWithLLVM.rst?rev=280869&r1=280868&r2=280869&view=diff
==============================================================================
--- llvm/trunk/docs/CompileCudaWithLLVM.rst (original)
+++ llvm/trunk/docs/CompileCudaWithLLVM.rst Wed Sep 7 16:46:53 2016
@@ -158,67 +158,60 @@ detect NVCC specifically by looking for
Optimizations
=============
-CPU and GPU have different design philosophies and architectures. For example, a
-typical CPU has branch prediction, out-of-order execution, and is superscalar,
-whereas a typical GPU has none of these. Due to such differences, an
-optimization pipeline well-tuned for CPUs may be not suitable for GPUs.
-
-LLVM performs several general and CUDA-specific optimizations for GPUs. The
-list below shows some of the more important optimizations for GPUs. Most of
-them have been upstreamed to ``lib/Transforms/Scalar`` and
-``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
-customizable target-independent optimization pipeline.
-
-* **Straight-line scalar optimizations**. These optimizations reduce redundancy
- in straight-line code. Details can be found in the `design document for
- straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.
-
-* **Inferring memory spaces**. `This optimization
- <https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXInferAddressSpaces.cpp>`_
- infers the memory space of an address so that the backend can emit faster
- special loads and stores from it.
+Modern CPUs and GPUs are architecturally quite different, so code that's fast
+on a CPU isn't necessarily fast on a GPU. We've made a number of changes to
+LLVM to make it generate good GPU code. Among these changes are:
+
+* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
+ reduce redundancy within straight-line code.
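As a hypothetical illustration (this kernel is not from the docs), the three address computations below share most of their arithmetic, which straight-line strength reduction and related passes can factor out:

```cuda
// The addresses of a[i], a[i + 1], and a[i + 2] differ only by a constant
// offset. Rather than computing (base + i * 4) three times, these passes
// compute the first address once and derive the other two from it.
__global__ void sum3(const float *a, float *out, int i) {
  *out = a[i] + a[i + 1] + a[i + 2];
}
```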
+
+* `Aggressive speculative execution
+ <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
+ -- This is mainly for promoting straight-line scalar optimizations, which are
+ most effective on code along dominator paths.
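As a hedged sketch (example invented for illustration), the arithmetic inside the branch below is side-effect-free and cheap, making it a candidate for this pass:

```cuda
// The multiply in the "then" block has no side effects, so it can be
// hoisted above the branch, merging it into the dominating straight-line
// region where the scalar optimizations above can then simplify it.
__global__ void absScale(float *x, float s) {
  float v = x[threadIdx.x];
  float r = v;
  if (v < 0.0f)
    r = v * -s;      // pure arithmetic: safe to execute speculatively
  x[threadIdx.x] = r;
}
```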
+
+* `Memory space inference
+ <http://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
+ (global, shared, constant, or local), or we can operate on pointers in the
+ "generic" address space, which can point to anything. Operations in a
+ non-generic address space are faster, but pointers in CUDA are not explicitly
+ annotated with their address space, so it's up to LLVM to infer it where
+ possible.
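For instance, in a kernel like this hypothetical one (not from the docs), LLVM can prove that `p` always points into shared memory, letting the backend emit the faster `ld.shared`/`st.shared` PTX instructions instead of generic loads and stores:

```cuda
__global__ void scale(float *out, float factor) {
  __shared__ float buf[256];
  float *p = &buf[threadIdx.x];  // p provably points into shared memory
  *p = factor * threadIdx.x;     // can lower to st.shared, not st.generic
  __syncthreads();
  out[threadIdx.x] = *p;         // can lower to ld.shared, not ld.generic
}
```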
+
+* `Bypassing 64-bit divides
+ <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
+ This was an existing optimization that we enabled for the PTX backend.
+
+ 64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
+ Many of the 64-bit divides in our benchmarks have a divisor and dividend
+ which fit in 32-bits at runtime. This optimization provides a fast path for
+ this common case.
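Written out in source form, the transform is roughly equivalent to the following hand-rolled fast path (a hypothetical sketch; the real pass works on IR, and the function name here is invented):

```cuda
__device__ unsigned long long div64(unsigned long long a,
                                    unsigned long long b) {
  if (((a | b) >> 32) == 0)                    // both operands fit in 32 bits
    return (unsigned int)a / (unsigned int)b;  // fast 32-bit divide
  return a / b;                                // slow full-width 64-bit divide
}
```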
-* **Aggressive loop unrooling and function inlining**. Loop unrolling and
+* Aggressive loop unrolling and function inlining -- Loop unrolling and
function inlining need to be more aggressive for GPUs than for CPUs because
- control flow transfer in GPU is more expensive. They also promote other
- optimizations such as constant propagation and SROA which sometimes speed up
- code by over 10x. An empirical inline threshold for GPUs is 1100. This
- configuration has yet to be upstreamed with a target-specific optimization
- pipeline. LLVM also provides `loop unrolling pragmas
- <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
- and ``__attribute__((always_inline))`` for programmers to force unrolling and
- inling.
+ control flow transfer on a GPU is more expensive. More aggressive unrolling and
+ inlining also promote other optimizations, such as constant propagation and
+ SROA, which sometimes speed up code by over 10x.
-* **Aggressive speculative execution**. `This transformation
- <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
- mainly for promoting straight-line scalar optimizations which are most
- effective on code along dominator paths.
-
-* **Memory-space alias analysis**. `This alias analysis
- <http://reviews.llvm.org/D12414>`_ infers that two pointers in different
- special memory spaces do not alias. It has yet to be integrated to the new
- alias analysis infrastructure; the new infrastructure does not run
- target-specific alias analysis.
-
-* **Bypassing 64-bit divides**. `An existing optimization
- <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
- enabled in the NVPTX backend. 64-bit integer divides are much slower than
- 32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit
- divides in our benchmarks have a divisor and dividend which fit in 32-bits at
- runtime. This optimization provides a fast path for this common case.
+ (Programmers can force unrolling and inlining using clang's `loop unrolling pragmas
+ <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
+ and ``__attribute__((always_inline))``.)
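As a hypothetical example of the pragma and attribute mentioned above (kernel invented for illustration):

```cuda
// sq is forcibly inlined into the loop body; the pragma asks clang to
// unroll the loop by a factor of 4.
__device__ __attribute__((always_inline)) float sq(float v) { return v * v; }

__global__ void sumSquares(int n, const float *x, float *out) {
  float acc = 0.0f;
  #pragma unroll 4
  for (int i = 0; i < n; ++i)
    acc += sq(x[i]);
  out[0] = acc;
}
```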
Publication
===========
+The team at Google published a paper in CGO 2016 detailing the optimizations
+they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name:
+The relevant tools are now just vanilla clang/LLVM.
+
| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
-| `Slides for the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_
-
-Tutorial
-========
-
-`CGO 2016 gpucc tutorial <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_
+|
+| `Slides from the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_
+|
+| `Tutorial given at CGO <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_
Obtaining Help
==============