<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 10, 2015 at 8:43 AM, Diego Novillo <span dir="ltr"><<a href="mailto:dnovillo@google.com" target="_blank">dnovillo@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><span class="">On Tue, Jun 9, 2015 at 9:34 PM, Duncan P. N. Exon Smith <<a href="mailto:dexonsmith@apple.com" target="_blank">dexonsmith@apple.com</a>> wrote:<br><br>> Firstly, you didn't update the documentation, which currently has these<br>> two "TODOs" (in a list of four):<br>> - Backedge frequencies are assumed to be evenly split between the<br>> headers of a given irreducible SCC. Instead, we could track the<br>> backedge mass separately for each header, and adjust their relative<br>> frequencies.<br>> - Entry frequencies are assumed to be evenly split between the headers<br>> of a given irreducible SCC, which is the only option if we need to<br>> compute mass in the SCC before its parent loop. Instead, we could<br>> partially compute mass in the parent loop, and stop when we get to<br></span>> the SCC. Here, we have the correct ratio of entry masses, which we0<span class=""><br>> can use to adjust their relative frequencies. Compute mass in the<br>> SCC, and then continue propagation in the parent.<br>><br>> Since you've implemented the backedge frequency point, please remove it,<br>> and update the rest of the doxygen to describe the current state.<br><br></span>Done.<br><br>As a follow up to this patch, I want to address the incoming mass distribution into an irregular loop. Divvying up a fixed mass loses precision.<span class=""><br><br>> Secondly, I wonder whether this will regress cases we care about if you<br>> don't incorporate the entry frequency point. E.g., your patch would<br>> make the frequencies *way* off for the following contrived example:<br>><br>> assert(n > 0);<br>> int i = n - 2;<br>> if (n % 2 == 0)<br>> goto SeventyFive;<br>><br>> for (; i < n; ++i) {<br>> if (i % n == 75) {<br>> SeventyFive:<br>> foo(i);<br>> } else {<br>> bar(i);<br>> }<br>> }<br>><br>> whereas (I think) the current in-tree code would get it roughly correct<br>> (by fluke?).<br><br></span>Actually, my patch addresses this problem in particular. This code is almost exactly the test case in PR 23525. The in-tree code gets it wrong the same way it gets PR 23525 wrong. The call to foo(i) is almost never taken vs the call to bar(i).<br><br>We get the following frequencies for this code snippet (I wrapped it into a function that I called 100 times). The left column is current trunk. The right column is trunk plus my patch. 
>      block-frequency-info: foo                    | block-frequency-info: foo
>      - entry: float = 1.0, int = 8388608          | - entry: float = 1.0, int = 8388608
>      - cond.true: float = 0.99, int = 8388600     | - cond.true: float = 0.99, int = 8388600
>      - cond.false: float = 0.00, int = 8          | - cond.false: float = 0.00, int = 8
>      - : float = 0.0, int = 0                     | - : float = 0.0, int = 0
>      - cond.end: float = 0.99, int = 8388600      | - cond.end: float = 0.99, int = 8388600
>      - if.then: float = 0.49, int = 4152772       | - if.then: float = 0.49, int = 4152772
>      - if.end: float = 0.50, int = 4235827        | - if.end: float = 0.50, int = 4235827
>   => - for.cond: float = 2.49, int = 20971499     | - for.cond: float = 2.88, int = 24197884
>      - for.body: float = 1.49, int = 12582899     | - for.body: float = 1.49, int = 12582899
>   => - if.then.5: float = 0.02, int = 249991      | - if.then.5: float = 0.02, int = 249991
>   => - SeventyFive: float = 2.49, int = 20971499  | - SeventyFive: float = 0.03, int = 288451
>      - if.else: float = 1.47, int = 12332908      | - if.else: float = 1.47, int = 12332908
>      - if.end.7: float = 2.49, int = 20971499     | - if.end.7: float = 4.58, int = 38428163
>   => - for.inc: float = 2.49, int = 20971499      | - for.inc: float = 2.49, int = 20971499
>      - for.end: float = 0.99, int = 8388600       | - for.end: float = 0.99, int = 8388600
>
> I highlighted (=>) the key blocks, in particular if.then.5 and
> SeventyFive. In current trunk, if.then.5 is correctly given a very low
> frequency (i % n == 75 almost never holds), but SeventyFive is given the
> same frequency as for.cond and for.inc. That's wrong.
>
> With this patch, we correctly compute a fairly low frequency for
> SeventyFive.
>
>> Has your patch improved the workload where the previous approximation
>> was causing you problems? Has it regressed any others? What has it
>> done for the LNT test-suite?
>
> I'm in the process of running various benchmarks, and I'll run LNT too.
> The patch fixed the problem in our internal code, but I have yet to run
> the full benchmark suite.
>
>> Thirdly, what's your intuition for the kind of approximation we really
>> need? Is this the right step forward?
>
> Short of a full re-implementation, I think the current model is fine
> with a couple of additional tweaks; I don't think we need to go all out
> in trying to model irreducible flow. The additional tweaks I'm planning
> address your TODO notes about iterating a couple of times.
>
>> As an historical note, irreducible edges were ignored entirely before I
>> rewrote BFI, so the bar for improving/maintaining the irreducible
>> control flow approximation was pretty low. But I wonder whether
>> shoe-horning irreducible control flow into loop data structures (as I've
>> done, and this patch would continue) is the right approach. Maybe they
>> should be more first-class somehow?
>
> Perhaps there is some coalescing of code we can do by treating them like
> regular loops that happen to have more than one header node.
>
>> For example, another approach for an irreducible subgraph: distribute
>> mass from each "header" separately (in its own subgraph), thereby
>> calculating the probability that we transition from each header to each
>> other, giving us a probability matrix for a Markov Chain. Then,
>> *waves-hands-furiously*, out pop accurate block masses for the headers
>> (and thusly the other nodes).
>
> If we take the initial mass as inherited from the mass flowing into each
> header, we could add a fixed number of iterations and call it a day. I
> *think* that should be enough.
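A minimal sketch of that fixed-iteration idea, with stand-in types rather
than anything from BFI: treat the header-to-header probabilities as a
Markov chain's transition matrix, seed it with the entry-mass ratios, and
run a few steps of power iteration.

    #include <cstddef>
    #include <vector>

    // Approximate relative header masses of an irreducible SCC.
    // P[I][J] is the probability that, leaving header I, the next header
    // reached is J (each row sums to 1, so total mass is conserved).
    // Mass holds the initial entry-mass ratios, one entry per header.
    std::vector<double>
    approxHeaderMasses(const std::vector<std::vector<double>> &P,
                       std::vector<double> Mass, unsigned Iterations) {
      const std::size_t N = P.size();
      for (unsigned It = 0; It != Iterations; ++It) {
        std::vector<double> Next(N, 0.0);
        for (std::size_t I = 0; I != N; ++I)
          for (std::size_t J = 0; J != N; ++J)
            Next[J] += Mass[I] * P[I][J];
        Mass = Next; // one power-iteration step: Mass' = Mass * P
      }
      return Mass;
    }

A small, fixed iteration count keeps the cost bounded, and when the chain
mixes quickly the result is already close to the stationary distribution
Duncan is alluding to.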
>> Given that you've already done the work, I'm happy for this to go in
>> basically as is, provided that it doesn't cause compile-time or run-time
>> regressions. But I'd like to get a sense of where this is headed, and
>> whether incrementally improving the approximation for irreducible
>> control flow will address the use cases you care about.
>
> Sure. I think that the current scheme will serve us well for now. As we
> find more cases, we can re-evaluate. I don't really want to
> over-engineer an approach to irreducible flow, since it's not all that
> common (famous last words ;).
>
>>> @@ -411,8 +414,10 @@
>>>
>>>      // Check for a backedge.
>>>      if (W.Type == Weight::Backedge) {
>>> -      OuterLoop->BackedgeMass += Taken;
>>> -      DEBUG(debugAssign(BlockNode(), Taken, "back"));
>>> +      if (W.TargetNode.Index >= OuterLoop->BackedgeMass.size())
>>> +        OuterLoop->BackedgeMass.resize(W.TargetNode.Index + 1);
>>
>> This doesn't make sense to me. IIUC, you're using the index of the loop
>> header in the RPOT of the function. Even without irreducible control
>> flow, this adds quadratic memory usage: it scales with the number of
>> blocks times the number of loops, and there can be as many loops as
>> there are blocks in the worst case.
>
> Ah, yes.
>
>> I think you have two options:
>>
>>   - Force the size of the `BackedgeMass` vector to be exactly
>>     `NumHeaders`. In this code, do a binary search through the range
>>     between `NodeList.begin()` and `NodeList.begin() + NumHeaders` to
>>     find the index of the header in `NodeList`, and use the same index
>>     for `BackedgeMass`. This has logarithmic runtime cost in the
>>     number of loop headers (locally), and linear memory requirements
>>     overall.
>>
>>   - Use a `SmallDenseMap<BlockNode, BlockMass, 2>`.
>>
>> I prefer the former approach, unless you have some reason against it?
>
> No particular preference. The map approach seems a tad more convenient
> to implement, though, and it seems to me that it would also make the
> code more readable.
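For the first option, the lookup might be a sketch like the following. The
types here are minimal stand-ins for the ones named in the review (the real
BlockNode/LoopData differ), and getHeaderIndex is a made-up helper name:

    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct BlockNode {
      uint32_t Index; // position in the function's RPOT
    };

    struct LoopData {
      std::vector<BlockNode> NodeList; // headers first, sorted by Index
      std::size_t NumHeaders;
    };

    // Binary-search the sorted header prefix of NodeList, yielding a dense
    // index in [0, NumHeaders) that can address a BackedgeMass vector sized
    // to exactly NumHeaders.
    std::size_t getHeaderIndex(const LoopData &OuterLoop, const BlockNode &B) {
      auto Begin = OuterLoop.NodeList.begin();
      auto End = Begin + OuterLoop.NumHeaders;
      auto I = std::lower_bound(Begin, End, B,
                                [](const BlockNode &L, const BlockNode &R) {
                                  return L.Index < R.Index;
                                });
      assert(I != End && I->Index == B.Index && "target is not a loop header");
      return I - Begin;
    }

With BackedgeMass resized once to NumHeaders, the update in the backedge
branch becomes BackedgeMass[getHeaderIndex(*OuterLoop, W.TargetNode)] +=
Taken: logarithmic lookup, linear memory.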
>>> +#ifndef NDEBUG
>>> +  auto debugAssign = [&](const BlockNode &T, const BlockMass &M,
>>> +                         const char *Desc) {
>>> +    dbgs() << " => assign " << M << " (" << D.RemMass << ")";
>>> +    if (Desc)
>>> +      dbgs() << " [" << Desc << "]";
>>> +    if (T.isValid())
>>> +      dbgs() << " to " << getBlockName(T);
>>> +    dbgs() << "\n";
>>> +  };
>>> +  (void)debugAssign;
>>
>> Looks like this has been duplicated. Can it be moved to a `static`
>> function?
>
> Yeah, I was going to do this and then forgot.
>
>>> +#endif
>>> +
>>> +  DEBUG(dbgs() << " Distribute loop mass " << LoopMass
>>> +               << " to headers using above weights\n");
>>> +  for (const Weight &W : Dist.Weights) {
>>> +    BlockMass Taken = D.takeMass(W.Amount);
>>> +    assert(W.Type == Weight::Local && "all weights should be local");
>>> +    Working[W.TargetNode.Index].getMass() = Taken;
>>> +    DEBUG(debugAssign(W.TargetNode, Taken, nullptr));
>>> +  }
>>
>> The way `Distribution` and `DitheringDistributer` work together is
>> pretty heavy for what you're doing here (e.g., with the local vs.
>> backedge vs. exit distinction). Moreover, I think the simple behaviour
>> needed here would be useful outside of BFI. Do you have any ideas for
>> how to restructure those classes and/or factor out the common logic?
>
> Well, all I really wanted originally was to compute the ratios I should
> use to alter the initial mass for each header, but there is no easy way
> to scale a block mass by a floating-point fraction. The mass distributor
> seemed like the best approach.
>
> In a follow-up patch I could move the mass distributor out. But where
> else are you thinking this would be useful?
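For intuition, the dithering part is what sidesteps the floating-point
problem Diego mentions. A minimal sketch of the idea with stand-in types
(not the real Distribution/DitheringDistributer classes):

    #include <cassert>
    #include <cstdint>
    #include <vector>

    struct Target {
      uint32_t Weight;   // relative branch weight (nonzero)
      uint64_t Mass = 0; // integer mass assigned so far
    };

    // Distribute an integer Mass across Targets in proportion to Weight.
    // Taking RemMass * Weight / RemWeight at each step, rather than
    // Mass * Weight / TotalWeight, carries the rounding error forward, so
    // the assigned masses sum to exactly the original Mass.
    void distribute(uint64_t Mass, std::vector<Target> &Targets) {
      uint64_t RemWeight = 0;
      for (const Target &T : Targets)
        RemWeight += T.Weight;
      assert(RemWeight != 0 && "expected at least one nonzero weight");
      uint64_t RemMass = Mass;
      for (Target &T : Targets) {
        // Assumes RemMass * Weight fits in 64 bits; the real BlockMass
        // arithmetic uses wider intermediates instead.
        uint64_t Taken = RemMass * T.Weight / RemWeight;
        T.Mass += Taken;
        RemMass -= Taken;
        RemWeight -= T.Weight;
      }
    }

Factored out, something of this shape would let any client split an integer
quantity by ratio without ever touching floating point, which is presumably
what would make it useful outside of BFI.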
>
> Thanks. Diego.