[LLVMdev] [cfe-dev] Proposal: pragma for branch divergence
Vinod Grover
vgrover528 at gmail.com
Sat Jan 24 09:06:45 PST 2015
In our experience, as Owen also suggests, a pragma or a language extension
can be avoided by a combination of static and dynamic analysis. We prefer
this approach in our compiler ;)
Regards,
Vinod
On Sat, Jan 24, 2015 at 12:09 AM, Owen Anderson <resistor at mac.com> wrote:
> Hi Jingyue,
>
> Have you considered using dynamic uniformity checks? In my experience,
> you can obtain most of the benefit you describe, without the need for
> static information, simply by inserting branch-if-none instructions
> that jump over the bodies of conditional regions.
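>
> In CUDA source terms, the idea looks roughly like this (a sketch using
> the __any() warp-vote intrinsic; the kernel itself is made up). The
> hardware form uses a branch-if-none instruction, but the effect is the
> same:
>
>   __global__ void sketch(int *data) {
>     bool cond = data[threadIdx.x] > 0;
>     // Warp-level vote: if no lane in this warp needs the body, all
>     // lanes skip it together, so the region costs nothing when the
>     // condition is uniformly false.
>     if (__any(cond)) {
>       if (cond) {
>         data[threadIdx.x] *= 2;   // possibly-divergent body
>       }
>     }
>   }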
>
> This technique is described under "Runtime Branch Uniformity
> Optimization" in this paper, though I'm pretty confident it has been in
> use for much longer than that:
> http://www.eecs.berkeley.edu/~yunsup/papers/predication-micro2014.pdf
>
> -Owen
>
> On Jan 23, 2015, at 10:29 PM, Jingyue Wu <jingyue at google.com> wrote:
>
> Hi,
>
> I am considering a language extension to Clang for optimizing GPU
> programs. This extension will allow the compiler to use different
> optimization strategies for divergent and non-divergent branches (to
> be explained below). We have observed significant performance gains by
> leveraging this proposed extension, so I want to discuss it here to
> see how the community likes/dislikes the idea. I will focus on the
> CUDA language and the PTX backend for now, but I believe this addition
> will benefit OpenCL and its backends too.
>
> Background on branch divergence
>
> CUDA programs have a very different execution model for code with
> branches. A CUDA program is executed by an array of threads broken
> into groups called warps. A warp typically contains 32 threads, and
> all the threads in a warp execute instructions in lock-step, i.e.,
> they execute the same instruction at any given time. Therefore, if the
> code contains divergent branches (i.e., threads in a warp do not agree
> on which path of the branch to take), the warp has to execute all the
> paths from that branch, with different subsets of threads enabled,
> until the paths converge at a post-dominating basic block. For
> example:
>
>   // threadIdx.x is the index of the thread within its block
>   if (threadIdx.x == 0) {
>     foo();
>   } else {
>     bar();
>   }
>
> The warp that contains threads 0-31 needs to execute foo() with only
> thread 0 enabled, and then bar() with the other 31 threads enabled.
> Therefore, the run time of the above code is the run time of foo()
> plus the run time of bar(). More details about branch divergence can
> be found in the CUDA C programming guide:
> http://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture
>
> How branch divergence affects compiler optimizations
>
> [Figure: the original CFG (left) and the jump-threaded CFG (right)]
>
> Due to CUDA's different execution model, some optimizations in LLVM,
> such as jump threading, can unfortunately be harmful. The above figure
> illustrates jump threading. In the original CFG (on the left), the
> first condition "if foo == bar" implies the second condition "if foo
> <= bar". Therefore, jump threading redirects BB1 directly to BB2, so
> the transformed code needn't compute the second condition when the
> first condition is true.
>
> One important complication here is that BB1 does not branch directly
> to the second condition. Instead, the code must call baz() before
> computing "if foo <= bar". Therefore, jump threading has to duplicate
> the call to baz() to preserve the semantics that baz() runs regardless
> of "if foo == bar".
>
> For CPU programs, jump threading likely increases execution speed,
> because it makes some paths shorter. However, for CUDA programs, jump
> threading on divergent branches is almost certainly a bad idea.
> Suppose both conditions in the above example are divergent within a
> warp. To synchronize the execution of all threads in the warp, the
> warp has to sequentially execute all basic blocks in the jump-threaded
> CFG. With baz() duplicated, the warp needs to execute more code than
> for the original CFG. We have observed that jump threading incurs a
> ~50% slowdown on some benchmarks.
>
> Note that jump threading is not the only optimization that can hurt
> the performance of CUDA programs due to branch divergence. Loop
> unswitching on divergent branches can also hurt performance, because
> it may duplicate code too.
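>
> To make the duplication concrete, here is a source-level sketch of the
> transformation on this example (illustrative only; the pass actually
> operates on the CFG, not on source code):
>
>   // Before: baz() appears once.
>   if (foo == bar) { /* BB1 */ }
>   baz();
>   if (foo <= bar) { /* BB2 */ }
>
>   // After jump threading (sketch): on the path where foo == bar is
>   // known to hold, the second test is skipped, but baz() must be
>   // duplicated onto that path.
>   if (foo == bar) {
>     /* BB1 */
>     baz();                      // duplicated copy
>     /* BB2: foo <= bar is implied */
>   } else {
>     baz();
>     if (foo <= bar) { /* BB2 */ }
>   }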
>
> Annotations for branch divergence
>
> Ideally, we would like the compiler to automatically figure out which
> branches are divergent and which are not. However, doing that
> precisely is extremely hard and can be expensive. Therefore, I am
> proposing a compromise: have programmers provide optimization hints.
> The annotation I have in mind has the form "#pragma clang branch
> non_divergence". Programmers can add this annotation right before a
> control statement (such as if, for, or while) to indicate that the
> branch derived from the control statement is not divergent. For
> example,
>
>   #pragma clang branch non_divergence
>   if (a > 0) {
>     ...
>   }
>
> indicates that the condition (a > 0) is uniform across all threads in
> a warp. The optimizer can then enable certain optimizations, such as
> jump threading and loop unswitching, only on non-divergent branches.
> In the longer term, the optimizer could even adopt a cheap data-flow
> analysis to conservatively compute whether a branch is non-divergent.
> For example, if a condition is not derived from blockIdx or threadIdx,
> it is guaranteed to hold the same value for all threads in a warp.
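>
> As a further usage sketch (hypothetical, since the pragma is only
> proposed here), the same hint could mark a loop whose trip count is
> uniform across the warp:
>
>   __global__ void scale(float *v, int n, float s) {
>     // n is a kernel parameter, so every thread in the warp sees the
>     // same trip count and the loop branch never diverges.
>     #pragma clang branch non_divergence
>     for (int i = 0; i < n; ++i)
>       v[i * blockDim.x + threadIdx.x] *= s;
>   }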
> annotations for loop optimizations
> (http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations
> <http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations>),
> clang can attach metadata to the branch instructions following "#pragma
> clang non_divergent". For example, the source code snippet in the previous
> section will be translated to: %cond = icmp sgt i32 %a, 0 br i1 %cond,
> label %then, label %else, !llvm.branch !0...!0 = !{!0, !1}!1 =
> !{!"llvm.branch.non_divergent"}The llvm.branch metadata indicates %cond
> computes the same value for all threads in a warp. This metadata can be
> leveraged by the IR optimizer and the NVPTX backend for better
> optimization. Besides the opportunities of tuning certain IR optimizations
> aforementioned, I also noticed the NVPTX backend could emit more efficient
> PTX instructions (such as bra.uni and ret.uni) for non-divergent
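>
> For illustration, an optimization pass could test for the hint roughly
> as follows (a minimal sketch assuming the metadata layout shown above;
> isNonDivergent is a hypothetical helper, not an existing LLVM API):
>
>   #include "llvm/IR/Instructions.h"
>   #include "llvm/IR/Metadata.h"
>   #include "llvm/Support/Casting.h"
>
>   // Returns true if the branch carries the proposed hint.
>   static bool isNonDivergent(const llvm::BranchInst *BI) {
>     llvm::MDNode *MD = BI->getMetadata("llvm.branch");
>     if (!MD)
>       return false;
>     // Operand 0 is the self-reference; the rest are hint nodes.
>     for (unsigned i = 1, e = MD->getNumOperands(); i != e; ++i)
>       if (auto *Hint = llvm::dyn_cast<llvm::MDNode>(MD->getOperand(i)))
>         if (Hint->getNumOperands() == 1)
>           if (auto *S =
>                   llvm::dyn_cast<llvm::MDString>(Hint->getOperand(0)))
>             if (S->getString() == "llvm.branch.non_divergent")
>               return true;
>     return false;
>   }
>
> A pass such as jump threading would then transform a branch only when
> isNonDivergent() returns true (or when targeting a CPU).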
>
> Thanks for reading! Any feedback is welcome.
>
> Jingyue
>