[cfe-dev] Proposal: pragma for branch divergence
Jingyue Wu
jingyue at google.com
Fri Jan 23 22:29:53 PST 2015
*Hi, I am considering a language extension to Clang for optimizing GPU
programs. This extension will allow the compiler to use different
optimization strategies for divergent and non-divergent branches (to be
explained below). We have observed significant performance gain by
leveraging this proposed extension, so I want to discuss it here to see how
the community likes/dislikes the idea. I will focus on the CUDA language
and the PTX backend for now, but I believe this addition will benefit
OpenCL and its backends too. Background on branch divergenceCUDA programs
have a very different execution model for code with branches. A CUDA
program is executed by an array of threads broken into groups called warps.
A warp typically contains 32 threads, and all the threads in a warp execute
instructions in lock-step, i.e., executing the same instruction at any
given time. Therefore, if the code contains divergent branches (i.e.,
threads in a warp do not agree on which path of the branch to take), the
warp has to execute all the paths from that branch with different subsets
of threads enabled until they converge at a post-dominating BB of the
paths. For example, // threadIdx.x returns the index of a thread in the
warpif (threadIdx.x == 0) { foo();} else { bar();}The warp that contains
thread 0-31 needs to execute foo() with only thread 0 enabled and then
bar() with the other 31 threads enabled. Therefore, the run time of the
above code will be the run time of foo() + the run time of bar(). More
details about branch divergence can be found in the CUDA C programming
guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture>How
branch divergence affects compiler optimizationsDue to CUDA's different
execution model, some optimizations in LLVM, such as jump threading, can be
unfortunately harmful. The above figure illustrates jump threading. In the
original CFG (on the left), the first condition “if foo == bar” implies the
second condition “if foo <= bar”. Therefore, jump threading redirects BB1
directly to BB2 so that the transformed code needn’t compute the second
condition when the first condition is true.One important complication here
is that BB1 does not directly point to the second condition. Instead, the
code needs to call baz() before computing “if foo <= bar”. Therefore, jump
threading has to duplicate the function call to baz() to match the
semantics that the code runs baz() regardless of “if foo == bar”. For CPU
programs, jump threading likely increases execution speed, because it makes
some paths shorter. However, for CUDA programs, jump threading on divergent
branches is almost certainly a bad idea. Suppose both conditions in the
above example are divergent within a warp. To synchronize execution of all
threads in a warp, the warp has to sequentially execute all basic blocks in
the jump-threaded CFG. With baz() duplicated in the jump-threaded CFG, the
warp needs to execute more code than for the original CFG. We have observed
that jump threading incurs ~50% slowdown for some benchmarks. Note that
jump threading is not the only optimization that can hurt the performance
of CUDA programs due to branch divergence. Loop unswitching on divergent
branches can also hurt performance because it may duplicate code too.
Annotations for branch divergenceIdeally, we want the compiler to
automatically figure out which branches are divergent or not. However,
doing that precisely is extremely hard and can be expensive. Therefore, I
am proposing a compromise to have programmers provide some optimization
hints. The annotation in my mind is in the format of “#pragma clang branch
non_divergence”. Programmers can add this annotation right before a control
statement (such as if, for, and while), indicating the branch derived from
the control statement is not divergent.For example, #pragma clang branch
non_divergenceif (a > 0) { …}indicates the condition (a > 0) is uniform
across all threads in a warp. The optimizer can then enable certain
optimizations such as jump threading and loop unswitching only on
non-divergent branches. In longer term, the optimizer can even adopt some
cheap data-flow analysis to conservatively compute whether a branch is
non-divergent. For example, if a condition is not derived from blockIdx or
threadIdx, it is guaranteed to hold the same value for all threads in a
warp. How the compiler can leverage these annotationsSimilar to the
annotations for loop optimizations
(http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations
<http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations>),
clang can attach metadata to the branch instructions following "#pragma
clang non_divergent". For example, the source code snippet in the previous
section will be translated to: %cond = icmp sgt i32 %a, 0 br i1 %cond,
label %then, label %else, !llvm.branch !0...!0 = !{!0, !1}!1 =
!{!"llvm.branch.non_divergent"}The llvm.branch metadata indicates %cond
computes the same value for all threads in a warp. This metadata can be
leveraged by the IR optimizer and the NVPTX backend for better
optimization. Besides the opportunities of tuning certain IR optimizations
aforementioned, I also noticed the NVPTX backend could emit more efficient
PTX instructions (such as bra.uni and ret.uni) for non-divergent
branches.Thanks for reading! Any feedbacks are welcomed. *Jingyue
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20150123/0df9b2d3/attachment.html>
More information about the cfe-dev
mailing list