<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Feb 6, 2015, at 11:56 AM, Hal Finkel <<a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">----- Original Message -----</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><blockquote type="cite" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">From: "Michael Zolotukhin" <<a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a>><br class="">To: "Hal J. Finkel" <<a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a>><br class="">Cc: "Commit Messages and Patches for LLVM" <<a href="mailto:llvm-commits@cs.uiuc.edu" class="">llvm-commits@cs.uiuc.edu</a>><br class="">Sent: Thursday, February 5, 2015 6:47:43 PM<br class="">Subject: Re: [RFC] Heuristic for complete loop unrolling<br class=""><br class=""><br class=""><br class="">Hi Hal and others!<br class=""><br class=""><br class="">This is the second part of the patch, in which we start to use the<br class="">obtained estimate of complete loop unrolling effect more<br class="">aggressively.<br class=""><br class=""><br class="">The current heuristic is following:<br class="">- We have three parameters: Threshold (the existing one),<br class="">MinPercentOfOptimizedInstructions (new), and AbsoluteThreshold(new).<br class="">- If a loop is so small, that unrolling it won’t exceed Threshold,<br class="">we’ll unroll it.<br class="">- If a loop is so big, that unrolling it will exceed<br class="">AbsoluteThreshold, don’t do it in any case.<br class="">- If a loop is something in the middle, unroll it if the unrolling<br class="">would lead to elimination of the specified part of instructions (in<br class="">percent). As I explained in previous messages, that could happen if<br class="">some loads become constants, and these constants help inst-simplify<br class="">(e.g. it’s 0 or 1, and it’s used in add or mul).<br class=""><br class=""><br class="">The parameters’ values haven’t been tuned yet, though I don’t see any<br class="">regressions on SPEC2006 with them (x86, O3, PGO, LTO, ref data), and<br class="">anyway I plan to do some experiments to tune them in future.<br class=""><br class=""><br class="">I also added InstSimplify passes right after loop-unroller, because<br class="">we expect this pass to clean up after complete unrolling if we<br class="">succeeded. This change can be discussed separately though, and if<br class="">needed we can skip it in this patch.<br class=""></blockquote><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">I was pretty sure that Chandler just refactored things so that you could use the InstCombine logic directly without scheduling a run over everything again. Looks like this is not quite true, but we're pretty close. I'd much rather than the unroller just simplified the unrolled loop (InstCombine's worklist will take care of then simplifying users outside the loop too). Looks like we just need a little more refactoring of InstSimplify to make this possible.</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><blockquote type="cite" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br class=""><br class="">Does the patch look good?<br class=""><br class=""><br class=""></blockquote><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Yes, I think that the unroller changes look quite reasonable.</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""></div></blockquote>Thanks, Hal! I committed the patch without adding the InstSimplify passes in r228434.</div><div><br class=""></div><div>Now I’m going to do some performance testing with different values of the parameters and if I see any gains with different values, I’ll share the data. Also, I think we need testcases for these new features, I’ll prepare them soon.</div><div><br class=""></div><div>Michael<br class=""><blockquote type="cite" class=""><div class=""><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><blockquote type="cite" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br class=""><br class=""><br class=""><br class=""><br class="">Thanks,<br class="">Michael<br class=""><br class=""><br class=""><br class=""><br class=""><br class=""><br class="">On Feb 4, 2015, at 6:37 PM, Michael Zolotukhin <<br class=""><a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a> > wrote:<br class=""><br class=""><br class=""><br class=""><br class=""><br class=""><br class="">On Feb 4, 2015, at 6:17 PM, Hal Finkel < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> > wrote:<br class=""><br class="">----- Original Message -----<br class=""><br class=""><br class="">From: "Michael Zolotukhin" < <a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a> ><br class="">To: "Hal J. Finkel" < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> ><br class="">Cc: "Commit Messages and Patches for LLVM" < <a href="mailto:llvm-commits@cs.uiuc.edu" class="">llvm-commits@cs.uiuc.edu</a><br class=""><blockquote type="cite" class=""><br class=""></blockquote>Sent: Wednesday, February 4, 2015 8:10:02 PM<br class="">Subject: Re: [RFC] Heuristic for complete loop unrolling<br class=""><br class=""><br class="">On Feb 4, 2015, at 5:42 PM, Hal Finkel < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> > wrote:<br class=""><br class="">----- Original Message -----<br class=""><br class=""><br class="">From: "Michael Zolotukhin" < <a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a> ><br class="">To: "Hal J. Finkel" < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> ><br class="">Cc: "Commit Messages and Patches for LLVM" < <a href="mailto:llvm-commits@cs.uiuc.edu" class="">llvm-commits@cs.uiuc.edu</a><br class=""><br class=""><br class=""><br class="">Sent: Wednesday, February 4, 2015 1:29:34 PM<br class="">Subject: Re: [RFC] Heuristic for complete loop unrolling<br class=""><br class="">On Feb 3, 2015, at 7:28 PM, Hal Finkel < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> > wrote:<br class=""><br class="">----- Original Message -----<br class=""><br class=""><br class="">From: "Michael Zolotukhin" < <a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a> ><br class="">To: "Hal J. Finkel" < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> ><br class="">Cc: "Commit Messages and Patches for LLVM" < <a href="mailto:llvm-commits@cs.uiuc.edu" class="">llvm-commits@cs.uiuc.edu</a><br class=""><br class=""><br class=""><br class="">Sent: Tuesday, February 3, 2015 7:28:48 PM<br class="">Subject: Re: [RFC] Heuristic for complete loop unrolling<br class=""><br class=""><br class="">Hi Hal,<br class=""><br class=""><br class="">These are updated versions of the patches:<br class=""><br class="">I kept visiting only binary operators for now, but added a ‘TODO’<br class="">there. I’ll address it in later patches, since it looks like 1) it<br class="">can be done incrementally if we just add insn-visitors one by one,<br class="">2) if we decide to reuse code from inlining (I prefer this option),<br class="">then It’ll be an independent and probably big effort for refactoring<br class="">this code first.<br class=""><br class=""><br class="">As for the grammar fixes, everything should be fixed now, thank you<br class="">for pointing me to them! Also, I feel like variables and options<br class="">names that I used are also not ideal - if you have better ideas for<br class="">them, please let me know.<br class=""><br class=""><br class="">Is it ok to commit the first patch?<br class=""><br class="">Yes, but a few things to address first:<br class=""><br class="">1. There are a couple of places where you have dyn_cast that should<br class="">always succeed:<br class=""><br class="">+ NumberOfOptimizedInstructions +=<br class=""><br class="">+ TTI.getUserCost(dyn_cast<Instruction>(&I));<br class=""><br class="">+ LoadInst *LI = LoadDescr.first;<br class=""><br class="">...<br class="">+ NumberOfOptimizedInstructions +=<br class=""><br class="">+ TTI.getUserCost(dyn_cast<Instruction>(LI));<br class=""><br class=""><br class="">+ NumberOfOptimizedInstructions +=<br class=""><br class="">+ TTI.getUserCost(dyn_cast<Instruction>(I));<br class=""><br class="">If you need a cast in these place at all (you might for the<br class="">iterators, but for the LoadInst * you shouldn't), use a cast<>, not<br class="">a dyn_cast<>.<br class="">Thanks, fixed! Updated patch attached.<br class=""><br class=""><br class=""><br class=""><br class="">2. In EstimateNumberOfSimplifiedInsns(unsigned Iteration) we have:<br class=""><br class="">+ while (!Worklist.empty()) {<br class=""><br class="">+ Instruction *I = Worklist.pop_back_val();<br class=""><br class="">+ if (!visit(I))<br class=""><br class="">+ continue;<br class=""><br class="">+ for (auto U : I->users()) {<br class=""><br class="">+ Instruction *UI = dyn_cast<Instruction>(U);<br class=""><br class="">+ if (!UI)<br class=""><br class="">+ continue;<br class=""><br class="">+ if (!L->contains(UI))<br class=""><br class="">+ continue;<br class=""><br class="">+ Worklist.push_back(UI);<br class=""><br class="">+ }<br class=""><br class="">+ }<br class=""><br class="">Worklist is a SmallVector, and so I think you might potentially visit<br class="">the same users more than once. For example, if you have two loads<br class="">that can be turned into constants, and then you add the results<br class="">together, the add will be visited twice. One possible solution is to<br class="">keep a Visited set, and explicitly avoid visiting the same user<br class="">twice. To do this, you'll need to make sure that you visit the user<br class="">graph breadth first (by which I mean such that you know you've<br class="">visited all relevant operands of an instructions before visiting the<br class="">instruction itself), not depth first as you currently do.<br class="">That’s true, we can visit some instructions twice, but I don't think<br class="">we can easily avoid that. Breadth first search wouldn’t solve this<br class="">problem, e.g. in the following case:<br class="">(1) %a = load 1<br class="">(2) %b = load 2<br class="">(3) %use_a1 = add %a, 5<br class="">(4) %use_a2 = sub %use_a1, 6<br class="">(5) %common_use = mul %use_a2, %b<br class="">Starting from loads (1) and (2), BFS will visit (3) and (5) first,<br class="">and only then (4). Thus, it will visit (5) before one of its<br class="">operands - (4).<br class=""><br class=""><br class="">We can try to use topological order here, but in this case we would<br class="">visit all users - currently we only visit those that have at least<br class="">one operand simplified. I’m not sure what’s the right call here. I<br class="">think that in real cases the current algorithm should be fine,<br class="">though asymptotically it’s worse, than using topological order. To<br class="">play safe, we can add a limitation, so that we’ll give up if number<br class="">of performed iterations is too high.<br class=""><br class=""><br class="">What do you think?<br class=""><br class=""><br class="">I think there is a simpler solution: Keep as Visited set, but don't<br class="">use it to directly prune the search. Since we only really care about<br class="">the number of simplified instructions, add simplified instructions<br class="">to the Visited set, and if the instruction is already in the set,<br class="">don't increment NumberOfOptimizedInstructions. So we'll visit<br class="">multiple times (potentially), but we won't over-count. Do you think<br class="">that will work?<br class="">Yes, I think that would work. Here is a new patch:<br class=""><br class=""><br class=""><br class="">+ if (SimpleV && !CountedInsns.count(&I)) {<br class=""><br class="">+ NumberOfOptimizedInstructions += TTI.getUserCost(&I);<br class=""><br class="">+ CountedInsns.insert(&I);<br class=""><br class="">+ }<br class=""><br class=""><br class="">I think this can be:<br class=""><br class="">if (SimpleV && CountedInsns.insert(&I).second)<br class="">NumberOfOptimizedInstructions += TTI.getUserCost(&I);<br class=""><br class="">Fixed and committed in r228265. Thank you for your comments and<br class="">remarks! I’ll proceed with the second patch shortly.<br class=""><br class=""><br class="">Michael<br class=""><br class=""><br class=""><br class=""><br class="">Otherwise, LGTM.<br class=""><br class="">Thanks!<br class=""><br class=""><br class=""><br class=""><br class="">-Hal<br class=""><br class=""><br class=""><br class=""><br class=""><br class=""><br class="">Thanks,<br class="">Michael<br class=""><br class=""><br class=""><br class=""><br class="">-Hal<br class=""><br class=""><br class=""><br class=""><br class="">Michael<br class=""><br class=""><br class=""><br class=""><br class=""><br class=""><br class="">Otherwise, LGTM. Feel free to fixup those things and commit (or I'll<br class="">look at it again; that's up to you).<br class=""><br class=""><br class=""><br class=""><br class=""><br class="">As for the second patch - I implemented a new metrics as you<br class="">suggested (minimal percent of removed instructions instead of<br class="">‘bonus’ for each removed instruction). However, I think we need to<br class="">have a hign (absolute) threshold as well here: even if we optimize<br class="">50% of instructions, we don’t want to unroll a loop with 10^6<br class="">iterations - we’ll never finish compiling the current routine<br class="">otherwise.<br class=""><br class="">Agreed.<br class=""><br class=""><br class=""><br class="">I don’t plan to commit the second part in the current<br class="">state, but still post it to get a feedback.<br class=""><br class="">Thanks, I think that the idea is reasonable.<br class=""><br class="">-Hal<br class=""><br class=""><br class=""><br class=""><br class=""><br class="">Thanks,<br class="">Michael<br class=""><br class=""><br class=""><br class=""><br class=""><br class="">On Feb 2, 2015, at 9:38 PM, Hal Finkel < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> > wrote:<br class=""><br class="">----- Original Message -----<br class=""><br class=""><br class="">From: "Michael Zolotukhin" < <a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a> ><br class="">To: "Hal J. Finkel" < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> ><br class="">Cc: "Commit Messages and Patches for LLVM" < <a href="mailto:llvm-commits@cs.uiuc.edu" class="">llvm-commits@cs.uiuc.edu</a><br class=""><br class=""><br class=""><br class="">Sent: Monday, February 2, 2015 7:25:30 PM<br class="">Subject: Re: [RFC] Heuristic for complete loop unrolling<br class=""><br class=""><br class="">Hi Hal,<br class=""><br class="">Please find a new version attached. I broke it into two parts: the<br class="">first one is the implementation of new heuristic, and the second is<br class="">for adjusting cost model.<br class=""><br class="">1. 0001-Implement-new-heuristic-for-complete-loop-unrolling.patch<br class=""><br class="">I added a new class UnrollAnalyzer (like InlineAnalyzer), that<br class="">estimates possible optimization effects of complete-unrolling. Now<br class="">we simulate inst-simplify by visiting all users of loads that might<br class="">become constant, and then we simulate DCE, which also might perform<br class="">significant clean-ups here. The counted number of optimized<br class="">instruction then returned for further consideration (in this patch<br class="">we just add it to threshold - that’s a conservative way of treating<br class="">it, since after the mentioned optimizations we still should be under<br class="">the threshold).<br class=""><br class="">2. 0002-Use-estimated-number-of-optimized-insns-in-unroll-th.patch<br class=""><br class="">In this patch we use more aggressive strategy in choosing threshold.<br class="">My idea here is that we can go beyond the default threshold if we<br class="">can significantly optimize the unrolled body, but we still should be<br class="">in reasonable limits. Let’s look at examples to illustrate it<br class="">better:<br class="">a) DefaultThreshold=150, LoopSize=50, IterationsNumber=5,<br class="">UnrolledLoopSize=50*5=250,<br class="">NumberOfPotentiallyOptimizedInstructions=90<br class="">In this case after unroll we would get 250 instructions, and after<br class="">inst-simplify+DCE, we’ll go down to 160 instructions. Though it’s<br class="">still bigger than DefaultThreshold, I think we do want to unroll<br class="">here, since it would speed up this part by ~90/250=36%.<br class=""><br class=""><br class="">b) DefaultThreshold=150, LoopSize=1000, IterationsNumber=1000,<br class="">UnrolledLoopSize=1000*1000,<br class="">NumberOfPotentiallyOptimizedInstructions=500. The absolute number of<br class="">optimized instructions in this example is bigger than in the<br class="">previous one, but we don’t want to unroll here, because the<br class="">resultant code would be huge, and we’d only save<br class="">500/(1000*1000)=0.05% instructions.<br class=""><br class=""><br class="">To handle both situations, I suggest only unroll if we can optimize<br class="">significant portion of the loop body, e.g. if after unrolling we can<br class="">remove N% of code. We might want to have an absolute upper limit for<br class="">final code size too (i.e. even if we optimize 50% of instructions,<br class="">don’t unroll loop with 10^8 iterations), but for now I decided to<br class="">add only one new parameter to avoid over-complicating things.<br class=""><br class=""><br class="">In the patch I use weight for optimized instruction (yep, like in the<br class="">previous patch, but here it’s more meaningful), which is essentially<br class="">equivalent to a ratio between original and removed code size.<br class=""><br class="">I'm not particularly keen on this bonus parameter, but do like the<br class="">percentage reduction test you've outlined. Can you please implement<br class="">that instead?<br class=""><br class=""><br class=""><br class=""><br class=""><br class="">My testing is still in progress and probably I’ll do some tuning<br class="">later, the current default value (10) was taken almost at random.<br class=""><br class="">Do these patches look better?<br class=""><br class=""><br class="">Yes, much better. A few comments on the first:<br class=""><br class="">+ // variable. Now it's time if it corresponds to a global constant<br class="">global<br class="">+ // (in which case we can eliminate the load), or not.<br class=""><br class="">... time to see if ...<br class=""><br class="">+// This class is used to get an estimate of optimization effect that<br class="">we could<br class=""><br class="">optimization effect -> the optimization effects<br class=""><br class="">+// get from complete loop unrolling. It comes from the fact that<br class="">some loads<br class="">+// might be replaced with a concrete constant values and that could<br class="">trigger a<br class="">+// chain of instruction simplifications.<br class=""><br class="">a concrete -> concrete [remove 'a']<br class=""><br class="">+ bool visitInstruction(Instruction &I) { return false; };<br class="">+ bool visitBinaryOperator(BinaryOperator &I) {<br class=""><br class="">Visiting binary operators is good, but we should also visit ICmp,<br class="">FCmp, GetElementPtr, Trunc, ZExt, SExt, FPTrunc, FPExt, FPToUI,<br class="">FPToSI, UIToFP, SIToFP, BitCast, Select, ExtractElement,<br class="">InsertElement, ShuffleVector, ExtractValue, InsertValue. It might be<br class="">easier to just do this from the visitInstruction callback.<br class=""><br class="">+ unsigned ElemSize = CDS->getElementType()->getPrimitiveSizeInBits()<br class="">/ 8U;<br class="">+ unsigned Start = StartC.getLimitedValue();<br class="">+ unsigned Step = StepC.getLimitedValue();<br class="">+<br class="">+ unsigned Index = (Start + Step * Iteration) / ElemSize;<br class="">+<br class="">+ Constant *CV = CDS->getElementAsConstant(Index);<br class=""><br class="">I think you need to guard here against out-of-bounds accesses (we<br class="">should simply return nullptr and not assert or segfault from the OOB<br class="">access to the CDS.<br class=""><br class="">+ // Visit all loads the loop L, and for those that after complete<br class="">loop<br class="">+ // unrolling would have a constant address and it will point to<br class="">from a known<br class=""><br class="">... that, after complete loop unrolling, would ... [add commas]<br class=""><br class="">also<br class=""><br class="">it will point to from a known -> it will point to a known [remove<br class="">'from']<br class=""><br class="">+// This routine estimates this optimization effect and return number<br class="">of<br class="">+// instructions, that potentially might be optimized away.<br class=""><br class="">return number -> returns the number<br class=""><br class="">+ // iterations here. To limit ourselves here, we check only first<br class="">1000<br class="">+ // iterations, and then scale the found number, if necessary.<br class="">+ unsigned IterationsNumberForEstimate = std::min(1000u, TripCount);<br class=""><br class="">Don't embed 1000 here; make this a cl::opt.<br class=""><br class="">With these changes, the first patch LGTM.<br class=""><br class="">Thanks again,<br class="">Hal<br class=""><br class=""><br class=""><br class=""><br class="">Thanks,<br class="">Michael<br class=""><br class=""><br class=""><br class=""><br class=""><br class=""><br class=""><br class="">On Jan 26, 2015, at 11:50 AM, Michael Zolotukhin <<br class=""><a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a> > wrote:<br class=""><br class=""><br class=""><br class=""><br class=""><br class=""><br class="">On Jan 25, 2015, at 6:06 AM, Hal Finkel < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> > wrote:<br class=""><br class="">----- Original Message -----<br class=""><br class=""><br class="">From: "Michael Zolotukhin" < <a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a> ><br class="">To: "Hal Finkel" < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> ><br class="">Cc: "Arnold Schwaighofer" < <a href="mailto:aschwaighofer@apple.com" class="">aschwaighofer@apple.com</a> >, "Commit<br class="">Messages and Patches for LLVM"<br class="">< <a href="mailto:llvm-commits@cs.uiuc.edu" class="">llvm-commits@cs.uiuc.edu</a> ><br class="">Sent: Sunday, January 25, 2015 1:20:18 AM<br class="">Subject: Re: [RFC] Heuristic for complete loop unrolling<br class=""><br class=""><br class="">On Jan 24, 2015, at 12:38 PM, Hal Finkel < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> > wrote:<br class=""><br class="">----- Original Message -----<br class=""><br class=""><br class="">From: "Michael Zolotukhin" < <a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a> ><br class="">To: "Hal Finkel" < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> ><br class="">Cc: "Arnold Schwaighofer" < <a href="mailto:aschwaighofer@apple.com" class="">aschwaighofer@apple.com</a> >, "Commit<br class="">Messages and Patches for LLVM"<br class="">< <a href="mailto:llvm-commits@cs.uiuc.edu" class="">llvm-commits@cs.uiuc.edu</a> ><br class="">Sent: Saturday, January 24, 2015 2:26:03 PM<br class="">Subject: Re: [RFC] Heuristic for complete loop unrolling<br class=""><br class=""><br class="">Hi Hal,<br class=""><br class=""><br class="">Thanks for the review! Please see my comments inline.<br class=""><br class=""><br class=""><br class=""><br class="">On Jan 24, 2015, at 6:44 AM, Hal Finkel < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> > wrote:<br class=""><br class="">[moving patch review to llvm-commits]<br class=""><br class="">+static bool CanEliminateLoadFrom(Value *V) {<br class=""><br class="">+ if (Constant *C = dyn_cast<Constant>(V)) {<br class=""><br class="">+ if (GlobalVariable *GV = dyn_cast<GlobalVariable>(C))<br class=""><br class="">+ if (GV->isConstant() && GV->hasDefinitiveInitializer())<br class=""><br class=""><br class="">Why are you casting to a Constant, and then to a GlobalVariable, and<br class="">then checking GV->isConstant()? There seems to be unnecessary<br class="">redundancy here ;)<br class="">Indeed:)<br class=""><br class=""><br class=""><br class=""><br class="">+ return GV->getInitializer();<br class=""><br class="">+ }<br class=""><br class="">+ return false;<br class=""><br class="">+}<br class=""><br class="">+static unsigned ApproximateNumberOfEliminatedInstruction(const Loop<br class="">*L,<br class=""><br class="">+ ScalarEvolution &SE) {<br class=""><br class="">This function seems mis-named. For one thing, it is not counting the<br class="">number of instructions potentially eliminated, but the number of<br class="">loads. Eliminating the loads might have lead to further constant<br class="">propagation, and really calculating the number of eliminated<br class="">instructions would require estimating that effect, right?<br class="">That’s right. But I’d rather change the function name, than add such<br class="">calculations, since it’ll look very-very narrow targeted, and I<br class="">worry that we might start to lose some cases as well. How about<br class="">‘NumberOfConstantFoldedLoads’?<br class=""><br class=""><br class="">Sounds good to me.<br class=""><br class=""><br class=""><br class=""><br class=""><br class=""><br class=""><br class="">if (TripCount && Count == TripCount) {<br class=""><br class="">- if (Threshold != NoThreshold && UnrolledSize > Threshold) {<br class=""><br class="">+ if (Threshold != NoThreshold && UnrolledSize > Threshold + 20 *<br class="">ElimInsns) {<br class=""><br class=""><br class="">20, huh? Is this a heuristic for constant propagation? It feels like<br class="">we should be able to do better than this.<br class="">Yep, that’s a parameter of the heuristic. Counting each ‘constant'<br class="">load as 1 is too conservative and doesn’t give much here,<br class=""><br class="">Understood.<br class=""><br class=""><br class=""><br class="">but since<br class="">we don’t actually count number of eliminated instructions we need<br class="">some estimate for it. This estimate is really rough, since in some<br class="">cases we can eliminate the entire loop body, while in the others we<br class="">can’t eliminate anything.<br class=""><br class="">I think that we might as well estimate it. Once we know the loads<br class="">that unrolling will constant fold, put them into a SmallPtrSet S.<br class="">Then, use a worklist-based iteration of the uses of instructions in<br class="">S. For each use that has operands that are all either constants, or<br class="">in S, queue them and keep going. Make sure you walk breadth first<br class="">(not depth first) so that you capture things will multiple feeder<br class="">loads properly. As you do this, count the number of instructions<br class="">that will be constant folded (keeping a Visited set in the usual way<br class="">so you don't over-count). This will be an estimate, but should be a<br class="">very accurate one, and can be done without any heuristic parameters<br class="">(and you still only visit a linear number of instructions, so should<br class="">not be too expensive).<br class="">The biggest gain comes not from expressions buf[0]*buf[3], which can<br class="">be constant folded when buf[0] and buf[3] are substituted with<br class="">constants, but from x*buf[0], when buf[0] turns out to be 0, or 1.<br class="">I.e. we don’t constant-fold it, but we simplify the expression based<br class="">on the particular value. I.e. the original convolution test is in<br class="">some sense equivalent to the following code:<br class="">const int a[] = [0, 1, 5, 1, 0];<br class="">int *b;<br class="">for(i = 0; i < 100; i ++) {<br class="">for(j = 0; j < 5; j++)<br class="">b[i] += b[i+j]*a[j];<br class="">}<br class=""><br class=""><br class="">Complete unrolling will give us:<br class=""><br class="">for(i = 0; i < 100; i ++) {<br class="">b[i] += b[i]*a[0];<br class=""><br class="">b[i] += b[i+1]*a[1];<br class=""><br class="">b[i] += b[i+2]*a[2];<br class=""><br class="">b[i] += b[i+3]*a[3];<br class=""><br class="">b[i] += b[i+4]*a[4];<br class="">}<br class=""><br class=""><br class="">After const-prop we’ll get:<br class=""><br class="">for(i = 0; i < 100; i ++) {<br class="">b[i] += b[i]*0;<br class=""><br class="">b[i] += b[i+1]*1;<br class=""><br class="">b[i] += b[i+2]*5;<br class=""><br class="">b[i] += b[i+3]*1;<br class=""><br class="">b[i] += b[i+4]*0;<br class="">}<br class="">And after simplification:<br class=""><br class="">for(i = 0; i < 100; i ++) {<br class="">b[i] += b[i+1];<br class=""><br class=""><br class="">b[i] += b[i+2]*5;<br class=""><br class="">b[i] += b[i+3];<br class=""><br class="">}<br class=""><br class=""><br class="">As you can see, we actually folded nothing here, but rather we<br class="">simplified more than a half of instructions. But it’s hard to<br class="">predict, which optimizations will be enabled by exact values<br class="">replaced loads<br class=""><br class="">Agreed.<br class=""><br class="">- i.e. of course we can check it, but it would be too<br class=""><br class=""><br class="">optimization-specific in my opinion.<br class=""><br class="">I understand your point, but I think using one magic number is<br class="">swinging the pendulum too far in the other direction.<br class=""><br class=""><br class=""><br class="">Thus, I decided to use some<br class="">number (20 in the current patch) which represents some average<br class="">profitability of replacing a load with constant. I think I should<br class="">get rid of this magic number though, and replace it with some<br class="">target-specific parameter (that’ll help to address Owen’s<br class="">suggestions).<br class=""><br class="">Using a target-specific cost (or costs) is a good idea, but having<br class="">target-specific magic numbers is going to be a mess. Different loops<br class="">will unroll, or not, for fairly arbitrary reasons, on different<br class="">targets. This is a heuristic, that I understand, and you won't be<br class="">able to exactly predict all possible later simplifications enabled<br class="">by the unrolling. However, the fact that you currently need a large<br class="">per-instruction boost factor, like 20, I think, means that the model<br class="">is too coarse.<br class=""><br class="">Maybe it would not be unreasonable to start from the other side: If<br class="">we first located loads that could be constant folded, and determined<br class="">the values those loads would actually take, and then simplified the<br class="">loop instructions based on those values, we'd get a pretty good<br class="">estimate. This is essentially what the code in<br class="">lib/Analysis/IPA/InlineCost.cpp does when computing the inlining<br class="">cost savings, and I think we could use the same technique here.<br class="">Admittedly, the inline cost analysis is relatively expensive, but<br class="">I'd think we could impose a reasonable size cutoff to limit the<br class="">overall expense for the case of full unrolling -- for this, maybe<br class="">the boost factor of 20 is appropriate -- and would also address<br class="">Owen's point, as far as I can tell, because like the inline cost<br class="">analysis, we can use TTI to compute the target-specific cost of<br class="">GEPs, etc. Ideally, we could refactor the ICA a little bit, and<br class="">re-use the code in there directly for this purpose. This way we can<br class="">limit the number of places in which we compute similar kinds of<br class="">heuristic simplification costs.<br class="">Hi Hal,<br class=""><br class=""><br class="">Thanks for the comments. I think that’s a good idea, and I’ll try<br class="">that out. If that doesn’t lead to over-complicated implementation,<br class="">I’d like it much better than having a magic number. I’ll prepare a<br class="">new patch soon.<br class=""><br class=""><br class="">Thanks,<br class="">Michael<br class=""><br class=""><br class=""><br class=""><br class=""><br class="">-Hal<br class=""><br class=""><br class=""><br class=""><br class=""><br class="">Thanks,<br class="">Michael<br class=""><br class=""><br class=""><br class=""><br class=""><br class="">-Hal<br class=""><br class=""><br class=""><br class=""><br class=""><br class="">I’d really appreciate a feedback on how to model this in a better<br class="">way.<br class=""><br class=""><br class="">Thanks,<br class="">Michael<br class=""><br class=""><br class=""><br class=""><br class="">Thanks again,<br class="">Hal<br class=""><br class="">----- Original Message -----<br class=""><br class=""><br class="">From: "Michael Zolotukhin" < <a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a> ><br class="">To: "LLVM Developers Mailing List ( <a href="mailto:llvmdev@cs.uiuc.edu" class="">llvmdev@cs.uiuc.edu</a> )" <<br class=""><a href="mailto:llvmdev@cs.uiuc.edu" class="">llvmdev@cs.uiuc.edu</a> ><br class="">Cc: "Hal J. Finkel" < <a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a> >, "Arnold Schwaighofer" <<br class=""><a href="mailto:aschwaighofer@apple.com" class="">aschwaighofer@apple.com</a> ><br class="">Sent: Friday, January 23, 2015 2:05:11 PM<br class="">Subject: [RFC] Heuristic for complete loop unrolling<br class=""><br class=""><br class=""><br class="">Hi devs,<br class=""><br class="">Recently I came across an interesting testcase that LLVM failed to<br class="">optimize well. The test does some image processing, and as a part of<br class="">it, it traverses all the pixels and computes some value basing on<br class="">the adjacent pixels. So, the hot part looks like this:<br class=""><br class="">for(y = 0..height) {<br class="">for (x = 0..width) {<br class="">val = 0<br class="">for (j = 0..5) {<br class="">for (i = 0..5) {<br class="">val += img[x+i,y+j] * weight[i,j]<br class="">}<br class="">}<br class="">}<br class="">}<br class=""><br class="">And ‘weight' is just a constant matrix with some coefficients.<br class=""><br class="">If we unroll the two internal loops (with tripcount 5), then we can<br class="">replace weight[i,j] with concrete constant values. In this<br class="">particular case, many of the coefficients are actually 0 or 1, which<br class="">enables huge code simplifications later on. But currently we unroll<br class="">only the innermost one, because unrolling both of them will exceed<br class="">the threshold.<br class=""><br class="">When deciding whether to unroll or not, we currently look only at the<br class="">instruction count of the loop. My proposal is to, on top of that,<br class="">check if we can enable any later optimizations by unrolling - in<br class="">this case by replacing a load with a constant. Similar to what we do<br class="">in inlining heuristics, we can estimate how many instructions would<br class="">be potentially eliminated after unrolling and adjust our threshold<br class="">with this value.<br class=""><br class="">I can imagine that it might be also useful for computations,<br class="">involving sparse constant matrixes (like identity matrix).<br class=""><br class="">The attached patch implements this feature, and with it we handle the<br class="">original testcase well.<br class=""><br class=""><br class=""><br class=""><br class=""><br class="">Does it look good? Of course, any ideas, suggestions and other<br class="">feedback are welcome!<br class=""><br class=""><br class="">Thanks,<br class="">Michael<br class=""><br class="">--<br class="">Hal Finkel<br class="">Assistant Computational Scientist<br class="">Leadership Computing Facility<br class="">Argonne National Laboratory<br class=""><br class=""><br class="">--<br class="">Hal Finkel<br class="">Assistant Computational Scientist<br class="">Leadership Computing Facility<br class="">Argonne National Laboratory<br class=""><br class=""><br class="">--<br class="">Hal Finkel<br class="">Assistant Computational Scientist<br class="">Leadership Computing Facility<br class="">Argonne National Laboratory<br class="">_______________________________________________<br class="">llvm-commits mailing list<br class=""><a href="mailto:llvm-commits@cs.uiuc.edu" class="">llvm-commits@cs.uiuc.edu</a><br class="">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits<br class=""><br class=""><br class="">--<br class="">Hal Finkel<br class="">Assistant Computational Scientist<br class="">Leadership Computing Facility<br class="">Argonne National Laboratory<br class=""><br class=""><br class="">--<br class="">Hal Finkel<br class="">Assistant Computational Scientist<br class="">Leadership Computing Facility<br class="">Argonne National Laboratory<br class=""><br class=""><br class="">--<br class="">Hal Finkel<br class="">Assistant Computational Scientist<br class="">Leadership Computing Facility<br class="">Argonne National Laboratory<br class=""><br class=""><br class="">--<br class="">Hal Finkel<br class="">Assistant Computational Scientist<br class="">Leadership Computing Facility<br class="">Argonne National Laboratory<br class="">_______________________________________________<br class="">llvm-commits mailing list<br class="">llvm-commits@cs.uiuc.edu<br class="">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits<br class=""><br class=""></blockquote><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">--<span class="Apple-converted-space"> </span></span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Hal Finkel</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Assistant Computational Scientist</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Leadership Computing Facility</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Argonne National Laboratory</span></div></blockquote></div><br class=""></body></html>