<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Oct 27, 2016, at 2:26 PM, Xinliang David Li <<a href="mailto:davidxl@google.com" class="">davidxl@google.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><br class=""><div class="gmail_extra"><br class=""><div class="gmail_quote">On Thu, Oct 27, 2016 at 1:41 PM, Gerolf Hoflehner <span dir="ltr" class=""><<a href="mailto:ghoflehner@apple.com" target="_blank" class="">ghoflehner@apple.com</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><br class=""><div class=""><span class=""><blockquote type="cite" class=""><div class="">On Oct 26, 2016, at 12:38 PM, Xinliang David Li <<a href="mailto:davidxl@google.com" target="_blank" class="">davidxl@google.com</a>> wrote:</div><br class="m_1267252449969621802Apple-interchange-newline"><div class=""><br class="m_1267252449969621802Apple-interchange-newline"><br style="font-family:Helvetica;font-size:18px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px" class=""><div class="gmail_quote" style="font-family:Helvetica;font-size:18px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">On Wed, Oct 26, 2016 at 11:00 AM, Michael Kuperstein<span class="m_1267252449969621802Apple-converted-space"> </span><span dir="ltr" class=""><<a href="mailto:mkuper@google.com" target="_blank" class="">mkuper@google.com</a>></span><span class="m_1267252449969621802Apple-converted-space"><wbr class=""> </span>wrote:<br class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">mkuper added a comment.<br class=""><br class="">Hi Gerolf,<br class=""><span class=""><br class="">In<span class="m_1267252449969621802Apple-converted-space"> </span><a href="https://reviews.llvm.org/D25963#579918" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D2<wbr class="">5963#579918</a>, @Gerolf wrote:<br class=""><br class="">> Hi,<br class="">><br class="">> Could you provide more background on this idea?<br class=""><br class=""><br class=""></span>I can't take credit for the idea - this is something GCC already does.<br class=""><span class=""><br class="">> What is your motivational use case? When the trip count is low why optimize?<br class=""></span></blockquote></div></div></blockquote><div class=""><br class=""></div></span>Ok. I was wondering about specific apps...</div></div></blockquote><div class=""><br class=""></div><div class=""><br class=""></div><div class="">We have internal apps that show the benefit. </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><span class=""><br class=""><blockquote type="cite" class=""><div class=""><div class="gmail_quote" style="font-family:Helvetica;font-size:18px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class=""><br class=""></span>The motivational use case is a loop with a low trip count that is nested inside a loop with a high trip count.<br class="">Peeling the inner loop allows further passes in the optimization pipeline simplify the code for the iterations that actually run, making the outer loop's body better.<br class="">Consider something like this:<br class=""><br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>for (int i = 0; i < bignum; ++i) {<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>int ret = 0;<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>for (int j = 0; j < param; ++j)<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>ret += arr[i][j];<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>out[i] = ret;<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>}<br class=""><br class="">Imagine param is usually 1.<br class="">We can then peel this into:<br class=""><br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>for (int i = 0; i < bignum; ++i) {<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>int ret = 0;<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>if (param == 0)<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>continue;<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>ret += arr[i][0]<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>for (int j = 1; j < param; ++j)<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>ret += arr[i][j];<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>out[i] = ret;<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>}<br class=""></blockquote></div></div></blockquote><div class=""><br class=""></div></span>That makes a lot of sense. I think of peeling as a client/enabler optimization. </div></div></blockquote><div class=""><br class=""></div><div class=""><br class=""></div><div class="">yes, in most of the cases, it is an enabler for other optimizations as shown in the example. There are also cases where it is considered optimization by itself. For instance</div><div class=""><br class=""></div><div class="">for (i = 0; i < N; i++) {</div><div class=""> if (i == 1) {</div><div class=""> do_extra();</div><div class=""> }</div><div class=""> do_somthing();</div><div class="">}</div><div class=""><br class=""></div><div class="">peeling two iterations has the effect of loop splitting -- the resulting loop is more efficient:</div><div class=""> </div><div class=""> do_somthing();</div><div class=""> if (N < 2) goto exit;</div><div class=""> do_extra();</div><div class=""> do_somthing();</div><div class=""> if (N < 3) goto exit;</div><div class=""> </div><div class=""> for (i = 2; i < N; i++) {</div><div class=""> do_something();</div><div class=""> }</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class="">So some pass(es) - like unrolller, vectorizer, ... needs to have the cost model proper and invoke it when appropriate. Intuitively turning peeling on only based on PGO is like rolling the dice - some wins, some losses. </div></div></blockquote><div class=""><br class=""></div><div class=""><br class=""></div><div class="">The information to look at is not limited to just PGO and is tunable. Other things to look at including loop size, number of branches within the loop, or specific code patterns. All these tunings are independent of this patch and can be done as follow ups when more cases are seen.</div></div></div></div></div></blockquote><div><br class=""></div>Sure, but none of this I see in the code yet. <br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote"><div class=""><br class=""></div><div class="">David</div><div class=""><br class=""></div><div class=""> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class="">And I”m worried that my intuition turns out to be right.</div><span class="HOEnZb"><font color="#888888" class=""><div class=""><br class=""></div><div class="">-Gerolf</div></font></span><div class=""><div class="h5"><div class=""><br class=""><blockquote type="cite" class=""><div class=""><div class="gmail_quote" style="font-family:Helvetica;font-size:18px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br class="">Which then becomes something morally equivalent to:<br class=""><br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>for (int i = 0; i < bignum; ++i) {<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>if (param == 0)<br class=""> continue;<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>if (param == 1) {<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>out[i] = arr[i][0];<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>continue;<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>}<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>ret = arr[i][0];<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>for (int j = 1; j < param; ++j)<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>ret += arr[i][j];<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>out[i] = ret;<br class=""> <span class="m_1267252449969621802Apple-converted-space"> </span>}<br class=""><br class="">So, we've improved the common case (param == 1) - we no longer have to initialize ret, we don't have to deal with the inner loop's IV, there's no add, just a direct assignment.<br class=""></blockquote><div class=""><br class=""></div><div class="">Beyond that, loop unswitching should also kick in which further improves.</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Another example of loop peeling helping eliminating initialization code:</div><div class=""><br class=""></div><div class="">int res[4] = { 0, 0, 0, 0};</div><div class=""><br class=""></div><div class=""> for (....) {</div><div class=""> res[0] += ... ;</div><div class=""> res[1] += ... ;</div><div class=""> ...</div><div class=""> }</div><div class=""><br class=""></div><div class="">Peeling one iteration off allows res[] to be directly initialized with first iteration's value.</div><div class=""><br class=""></div><div class="">Other benefits come from </div><div class=""><br class=""></div><div class="">1) more cross iteration optimizations cross peeled iterations</div><div class="">2) better scheduling freedom </div><div class="">3) possibly improved branch prediction</div><div class="">4) other special cases where first couple of iterations of the loop do special things (allowing remaining loop body to be cleaner).</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">David</div><div class=""><br class=""></div><div class="">David</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class=""><br class="">> If the profile is wrong and it actually is a hot loop for a regular/different input set peeling could hurt.<br class=""><br class=""></span>Sure, but this is true for any profile-guided optimization. PGO is only good with representative inputs. If an optimization was good regardless of input, we'd be doing it for non-PGO builds.<br class=""><span class=""><br class="">> There are also side effects on code size, register pressure etc. that could hurt performance.<br class=""><br class=""></span>Right. But that's not really different from any other kind of loop unrolling. Hence the thresholds, etc.<br class=""><br class="">> Thanks<br class="">> Gerolf<br class=""><br class="">Thanks,<br class="">Michael<br class=""><span class=""><br class=""><br class=""><br class="">================<br class="">Comment at: lib/Transforms/Utils/LoopUnrol<wbr class="">lPeel.cpp:87<br class="">+ VMap[*BB] = NewBB;<br class="">+ if (Header == *BB)<br class="">+ InsertTop->getTerminator()->se<wbr class="">tSuccessor(0, NewBB);<br class="">----------------<br class=""></span><span class="">davidxl wrote:<br class="">> Can this be moved outside the loop?<br class="">><br class="">> assert(VMap[Header]);<br class="">> InsertTop->getTerminator()->se<wbr class="">tSuccessor(0, VMap[Header]);<br class=""></span>Right, both this and the Latch handling should be moved outside the loop, thanks.<br class=""><span class=""><br class=""><br class="">================<br class="">Comment at: lib/Transforms/Utils/LoopUnrol<wbr class="">lPeel.cpp:90<br class="">+<br class="">+ if (Latch == *BB) {<br class="">+ VMap.erase((*BB)->getTerminato<wbr class="">r());<br class="">----------------<br class=""></span><span class="">davidxl wrote:<br class="">> This can be handled outside the loop too.<br class=""></span>Right, thanks.<br class=""><span class=""><br class=""><br class="">================<br class="">Comment at: lib/Transforms/Utils/LoopUnrol<wbr class="">lPeel.cpp:91<br class="">+ if (Latch == *BB) {<br class="">+ VMap.erase((*BB)->getTerminato<wbr class="">r());<br class="">+ BranchInst *LatchBR = cast<BranchInst>(NewBB->getTer<wbr class="">minator());<br class="">----------------<br class=""></span><span class="">davidxl wrote:<br class="">> What this this erase do?<br class=""></span>Nothing, nice catch!<br class="">(It's stale - it's needed when you replace LatchBR instead of modifying it in-place.)<br class=""><span class=""><br class=""><br class="">================<br class="">Comment at: lib/Transforms/Utils/LoopUnrol<wbr class="">lPeel.cpp:95<br class="">+ // If this is the last iteration, branch past the loop, otherwise<br class="">+ // branch to the next iteration (which may itself either be peeled off,<br class="">+ // or the loop's preheader)<br class="">----------------<br class=""></span><span class="">davidxl wrote:<br class="">> Is this a stale comment?<br class=""></span>No, but I guess it's not clear.<br class=""><br class="">Let's say we're peeling off K iterations.<br class=""><br class="">For iteration J in 1..K-1, we want the branch that terminates the copy of the latch to be:<br class="">if (cond) goto header(J+1) else goto exit<br class=""><br class="">For iteration K, we want to set this branch to be:<br class="">if (cond) goto new-ph else goto exit.<br class=""><br class="">Here, new-ph is the preheader of the new loop (that is, the loop that handles iterations >= K+1). Technically, this preheader should be empty, and only contains a branch to the loop header - the only reason it exists is to keep the loop in canonical form.<br class="">Does this make sense? If it does, I'll try to improve the comment.<br class=""><span class=""><br class=""><br class="">================<br class="">Comment at: lib/Transforms/Utils/LoopUnrol<wbr class="">lPeel.cpp:101<br class="">+ // We no longer know anything about the branch probability.<br class="">+ LatchBR->setMetadata(LLVMConte<wbr class="">xt::MD_prof, nullptr);<br class="">+ }<br class="">----------------<br class=""></span><span class="">davidxl wrote:<br class="">> Why? I think we should update the branch probability here -- it depends on the what iteration of the peeled clone. If peel count < average/estimated trip count, then each peeled iteration should be more biased towards fall through. If peel_count == est trip_count, then the last peel iteration should be biased toward exit.<br class=""></span>You're right, it's not that we don't know anything - but we don't know enough. I'm not sure how to attach a reasonable number to this, without knowing the distribution.<br class="">Do you have any suggestions? The trivial option would be to assume an extremely narrow distribution (the loop always exits after exactly K iterations), but that would mean having an extreme bias for all of the branches, and I'm not sure that's wise.<br class=""><span class=""><br class=""><br class="">================<br class="">Comment at: lib/Transforms/Utils/LoopUnrol<wbr class="">lPeel.cpp:220<br class="">+ }<br class="">+<br class="">+ // Now adjust the phi nodes in the loop header to get their initial values<br class="">----------------<br class=""></span><span class="">davidxl wrote:<br class="">> The profile meta data of the original loop's back branch should be adjusted too.<br class=""></span>Right, I missed that, thanks.<br class="">But, as above - I'm not sure by how much to adjust it.<br class=""><br class=""><br class=""><a href="https://reviews.llvm.org/D25963" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D2596<wbr class="">3</a></blockquote></div></div></blockquote></div><br class=""></div></div></div></blockquote></div><br class=""></div></div>
</div></blockquote></div><br class=""></div></body></html>