<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div>Hi Jingyue,</div><div><br></div><div>Have you considered using dynamic uniformity checks? In my experience you can obtain most of the benefit you describe without the need for static information simply by inserting branch-if-none instructions that jump over the bodies of conditional regions.</div><div><br></div><div>This technique is described under Runtime Branch Uniformity Optimization in this paper, though I'm pretty confident it had been in use much longer than that: <a href="http://www.eecs.berkeley.edu/~yunsup/papers/predication-micro2014.pdf">http://www.eecs.berkeley.edu/~yunsup/papers/predication-micro2014.pdf</a></div><div><br></div><div>-Owen<br></div><div><br>On Jan 23, 2015, at 10:29 PM, Jingyue Wu <<a href="mailto:jingyue@google.com">jingyue@google.com</a>> wrote:<br><br></div><blockquote type="cite"><div><div dir="ltr"><b style="font-weight:normal" id="docs-internal-guid-0a74336d-1a95-f187-3c7e-9b80da7c5db1"><p style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><font face="Arial"><span style="white-space:pre-wrap">Hi, </span></font></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><br></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">I am considering a language extension to Clang for optimizing GPU programs. This extension will allow the compiler to use different optimization strategies for divergent and non-divergent branches (to be explained below). We have observed significant performance gain by leveraging this proposed extension, so I want to discuss it here to see how the community likes/dislikes the idea. I will focus on the CUDA language and the PTX backend for now, but I believe this addition will benefit OpenCL and its backends too. </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:bold;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">Background on branch divergence</span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">CUDA programs have a very different execution model for code with branches. A CUDA program is executed by an array of threads broken into groups called </span><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:italic;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">warps</span><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">. A warp typically contains 32 threads, and all the threads in a warp execute instructions in lock-step, i.e., executing the same instruction at any given time. Therefore, if the code contains </span><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:italic;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">divergent</span><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"> branches (i.e., threads in a warp do not agree on which path of the branch to take), the warp has to execute all the paths from that branch with different subsets of threads enabled until they converge at a post-dominating BB of the paths. For example, </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace">// threadIdx.x returns the index of a thread in the warp</font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace">if (threadIdx.x == 0) {</font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace"> foo();</font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace">} else {</font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace"> bar();</font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace">}</font></span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">The warp that contains thread 0-31 needs to execute foo() with only thread 0 enabled and then bar() with the other 31 threads enabled. Therefore, the run time of the above code will be the run time of foo() + the run time of bar(). </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">More details about branch divergence can be found in the CUDA C programming guide: <a href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture">http://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture</a></span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:bold;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">How branch divergence affects compiler optimizations</span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">Due to CUDA's different execution model, some optimizations in LLVM, such as jump threading, can be unfortunately harmful. </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><img src="https://lh5.googleusercontent.com/_QmMdnOgloSVe_t4BK5a5NHdPFyyhRjmHN6dp9xNHbXTcuFYCcke_Z6uWahC13vWxUw4hLTduLo52PMy9Ci-pB9fimFrMvXRxPdUTHlGwrL2FjlgBqEd4GIUTwTVia8ITw" width="447px;" height="204px;" style="border: none;"></span></p><br><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">The above figure illustrates jump threading. In the original CFG (on the left), the first condition “if foo == bar” implies the second condition “if foo <= bar”. Therefore, jump threading redirects BB1 directly to BB2 so that the transformed code needn’t compute the second condition when the first condition is true.</span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">One important complication here is that BB1 does not directly point to the second condition. Instead, the code needs to call baz() before computing “if foo <= bar”. Therefore, jump threading has to duplicate the function call to baz() to match the semantics that the code runs baz() regardless of “if foo == bar”. </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">For CPU programs, jump threading likely increases execution speed, because it makes some paths shorter. However, for CUDA programs, jump threading on divergent branches is almost certainly a bad idea. Suppose both conditions in the above example are divergent within a warp. To synchronize execution of all threads in a warp, the warp has to sequentially execute all basic blocks in the jump-threaded CFG. With baz() duplicated in the jump-threaded CFG, the warp needs to execute more code than for the original CFG. We have observed that jump threading incurs ~50% slowdown for some benchmarks. </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">Note that jump threading is not the only optimization that can hurt the performance of CUDA programs due to branch divergence. Loop unswitching on divergent branches can also hurt performance because it may duplicate code too. </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:bold;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">Annotations for branch divergence</span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">Ideally, we want the compiler to automatically figure out which branches are divergent or not. However, doing that precisely is extremely hard and can be expensive. Therefore, I am proposing a compromise to have programmers provide some optimization hints. </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">The annotation in my mind is in the format of “#pragma clang branch non_divergence”. Programmers can add this annotation right before a control statement (such as if, for, and while), indicating the branch derived from the control statement is not divergent.</span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">For example, </span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace"><br></font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace">#pragma clang branch non_divergence</font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace">if (a > 0) {</font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace"> …</font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace">}</font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><br></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">indicates the condition (a > 0) is uniform across all threads in a warp. </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">The optimizer can then enable certain optimizations such as jump threading and loop unswitching only on non-divergent branches. In longer term, the optimizer can even adopt some cheap data-flow analysis to conservatively compute whether a branch is non-divergent. For example, if a condition is not derived from blockIdx or threadIdx, it is guaranteed to hold the same value for all threads in a warp. </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:bold;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">How the compiler can leverage these annotations</span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">Similar to the annotations for loop optimizations (</span><a href="http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations" style="text-decoration:none"><span style="font-size:13px;font-family:Arial;color:rgb(17,85,204);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:underline;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations</span></a><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">), clang can attach metadata to the branch instructions following "#pragma clang non_divergent". For example, the source code snippet in the previous section will be translated to:</span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace"><br></font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><font face="monospace, monospace"> %cond = icmp sgt i32 %a, 0</font></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><font face="monospace, monospace"><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"> br i1 %cond, label %then, label %else, !llvm.branch !0</span><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><br class=""></span><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">...</span><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><br class=""></span><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">!0 = !{!0, !1}</span><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><br class=""></span><span style="font-size:13px;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">!1 = !{!"llvm.branch.non_divergent"}</span></font></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)"><br></span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">The llvm.branch metadata indicates %cond computes the same value for all threads in a warp. </span></p><br><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-size:13px;font-family:Arial;color:rgb(34,34,34);font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;background-color:rgb(255,255,255)">This metadata can be leveraged by the IR optimizer and the NVPTX backend for better optimization. Besides the opportunities of tuning certain IR optimizations aforementioned, I also noticed the NVPTX backend could emit more efficient PTX instructions (such as bra.uni and ret.uni) for non-divergent branches.</span></p><div><b style="font-weight:normal"><br></b></div><div><b style="font-weight:normal">Thanks for reading! Any feedbacks are welcomed. </b></div><br></b>Jingyue</div>
</div></blockquote><blockquote type="cite"><div><span>_______________________________________________</span><br><span>LLVM Developers mailing list</span><br><span><a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a> <a href="http://llvm.cs.uiuc.edu">http://llvm.cs.uiuc.edu</a></span><br><span><a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a></span><br></div></blockquote></body></html>