[LLVMdev] Separating loop nests based on profile information?
Xinliang David Li
xinliangli at gmail.com
Mon Jan 12 20:28:39 PST 2015
On Thu, Jan 8, 2015 at 12:33 PM, Philip Reames <listmail at philipreames.com>
wrote:
>
> On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>
>> How does this compare with classical approaches of loop peeling,
>> partitioning, fission, or whatever you might call it?
>>
> I'm still developing a good sense for this, but let me lay out some
> observations. Fair warning, this is going to be long and rambling.
>
> Let's start with a toy example:
> while (c) {
>   x = this->x;
>   y = this->y;
>   if (x == y) {
>     rare();
>   }
> }
>
>
With profile info, speculative PRE can be performed for memory accesses that
are not downsafe:
temp_c = c;
if (temp_c) {
  x = this->x;
  y = this->y;
}
while (temp_c) {
  if (x == y) {
    rare();
    x = this->x;
    y = this->y;
  }
  temp_c = c;
}
If you can prove this->x etc. are safe to control-speculate (e.g., you have
seen a dominating memory access with the same base and offset), it can be:
x = this->x;
y = this->y;
while (c) {
  if (x == y) {
    rare();
    x = this->x;
    y = this->y;
  }
}
> If we could tell x and y were loop invariant, we could unswitch this
> loop. However, the rare call clobbers our view of memory, so LICM fails,
> and thus unswitch fails.
>
> We'd like to apply PRE here and push the reload into the loop preheader
> and the rare case. This currently fails for two reasons: 1) we don't allow
> code duplication during PRE, and 2) the canonical loop-simplify form, with
> its single latch, means that PRE doesn't actually see the rare block at
> all; it sees only the preheader and the join point after the if block.
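>
> For reference, here is a rough sketch of the shape PRE actually gets to
> look at after rotation and loop-simplify (my own rendering, not the output
> of any pass):
>
> preheader:
>   if (!c) goto exit;
> header:                 // preds: preheader, latch
>   x = this->x;          // PRE only queries the preheader and the latch
>   y = this->y;
>   if (x != y) goto latch;
> rare_block:
>   rare();               // the clobber is hidden behind the merge below
> latch:                  // preds: header, rare_block
>   if (c) goto header;
> exit:
>   ;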
>
> I think both of the previous are fixable:
> 1) This calls for profile guided duplication. If we can trade one hot
> load for two cold ones, that's great. Also, there might be static
> heuristics we could construct. (e.g. if it's not available in the
> preheader and one of many latches...)
> 2) We need PRE to look back through merge blocks. This should possibly
> even live in MemoryDependenceAnalysis. Alternatively, we could create a
> "deconstruct loop simplify form" pass to run right before GVN to sidestep
> this issue; a sketch of the resulting shape follows below.
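>
> For (2), a rough sketch of the "de-simplified" shape, where the rare block
> becomes a second latch and is therefore directly visible as a predecessor
> of the header (again just an illustration, not pass output):
>
> preheader:
>   if (!c) goto exit;
> header:                 // preds: preheader, not_rare, rare_block
>   x = this->x;          // PRE now sees the rare path explicitly and can
>   y = this->y;          // reload there while leaving the hot path alone
>   if (x == y) goto rare_block;
> not_rare:
>   if (c) goto header;
>   goto exit;
> rare_block:
>   rare();
>   if (c) goto header;
> exit:
>   ;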
>
> Neither is entirely simple to implement, but both are doable. Probably.
>
> In theory, we've now solved the same problem my loop nest transform does.
> There's two snags: one minor(ish), one major.
>
> The minor one is that many of my rare conditions involve volatile loads.
> MemDepAnalysis currently completely gives up when faced with volatile
> loads. This is mostly fixable, but given the structure of the code the fix
> may be fairly invasive. Oddly, LICM actually performs better in this case.
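>
> As a made-up illustration of that case (not one of my real examples),
> imagine the rare path is guarded by a volatile flag read:
>
> while (c) {
>   x = this->x;
>   y = this->y;
>   if (*pending) {       // 'pending' is a hypothetical volatile int *
>     rare();
>   }
> }
>
> MemDepAnalysis stops at the volatile load, so the load-PRE story above
> never even gets started on this shape.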
>
> For the major one, let's tweak our example slightly:
> while (c) {
>   x = this->x;
>   if (x == null) return;
>   y = this->y;
>   if (y == null) return;
>   if (x == y) {
>     rare();
>   }
> }
>
> Let's assume that PRE worked as designed for 'x'. We now need to perform
> a transform analogous to loop unswitch, but a bit different. We need to
> move the conditional loop exits into each of the predecessors where the
> answer is not available. I don't believe we have anything like this today,
> and it involves a non-trivial bit of code. Even if we get that done, we're
> left with a pass ordering problem where we need to run PRE, then
> branch-PRE, then PRE, then branch-PRE, etc. (In practice, these chains
> can be quite long due to null checks and range checks.)
>
> If we'd split this into a loop nest structure, we'd still have the chain,
> but a) that's easier to control with a custom pass order since LICM and
> LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do a bit of
> trivial loop unswitch for the terminator of the header appears quite
> tractable.
>
if (c) {
  x = this->x;
  if (!x) return;
  y = this->y;
  if (!y) return;
}
while (c) {
  if (x == y) {
    rare();
    x = this->x;   // reload after rare(), which may clobber this->x/this->y
    if (!x) return;
    y = this->y;
    if (!y) return;
  }
}
The 'branch PRE' (hoisting, sinking, assertion prop and branch elimination)
part is a little tricky to do though.
David
>
> One other point worth mentioning is that LLVM does not currently have a
> loop peeling pass. LoopRotate seems to get us most of the way there in
> common cases, but not always. If we did peel the above loop once, we'd
> have the same merge block problem I mentioned previously (this time in the
> preheader *and* the latch), but the profitability becomes slightly
> clearer without the need for profile information or loop-specific heuristics.
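>
> For concreteness, a rough sketch of the toy loop peeled once (again my own
> rendering):
>
> if (c) {                // peeled first iteration
>   x = this->x;
>   y = this->y;
>   if (x == y) {
>     rare();
>   }
>   while (c) {           // remaining iterations
>     x = this->x;        // redundant unless rare() ran on the incoming path
>     y = this->y;
>     if (x == y) {
>       rare();
>     }
>   }
> }
>
> The loads in the loop body are now visibly redundant along the frequent
> paths, which is what makes the profitability easier to argue statically,
> but the loop header is still fed by merge points in both the peeled code
> and the latch.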
>
> To be clear, I'm not opposed to fixing some of the issues above. I think
> that's a great idea; I'm just trying to find my way through to practical
> performance improvements with minimal invested effort. I probably will in
> fact fix some of these since they show up in a number of other cases as
> well.
>
>
> Ok, so that's that. Let's move on to why my proposed loop nest separation
> change is a bit scary.
>
> While the loop nest does allow things to be lifted from the inner loop to
> the outer loop - without any of the complex PRE or PGO reasoning required
> above - it suffers from all of the limitations of the existing loop
> optimization passes. In particular, LICM may not actually be able to do
> much if the inner loop is still complicated enough to defeat either its
> fairly simple alias analysis or its fault analysis. GVN/PRE is strictly more
> powerful here - in theory, if not always in practice - than what we have in
> LICM.
>
> We've also potentially created a loop nest which is harder to analyse than
> the original loop with two latches. Consider this example:
> for (int i = 0; i < 20000;) {
>   i++;
>   if (i % 20 != 0) continue;
>   a[i] = 0;
> }
> (This is not a real example of anything useful. The important point is
> that the early continue is the common case.)
>
> This loop clearly runs exactly 20000 times. ScalarEvolution almost
> certainly gets this right. (I haven't checked.)
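>
> (One quick way to check - API spelling from memory, so treat it as an
> assumption - is to ask SE for the backedge-taken count from a pass that
> already has ScalarEvolution &SE and the Loop *L:
>
>   const SCEV *BTC = SE.getBackedgeTakenCount(L);
>   if (!isa<SCEVCouldNotCompute>(BTC))
>     BTC->dump();        // 19999 here, if SE does get it right
>
> or just run "opt -analyze -scalar-evolution" on the IR and read the
> backedge-taken count off the report.)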
>
> After separation, we'd have this:
>
> int i = 0;
> while (true) {
>   while (i < 20000) {
>     i++;
>     if (i % 20 == 0) { // now a loop exit
>       a[i] = 0;
>       goto next;
>     }
>   }
>   break;
> next:
>   ;
> }
>
> It's not clear to me that SE will always be able to tell that both the
> inner and outer loops run a maximum of 20000 times. (In this particular
> case it might actually optimize the inner loop substantially and thus get a
> more precise answer, but let's ignore that.) We could potentially lose
> information about the loop by doing this transformation. Of course, this
> concern applies to the *existing* loop nest separation heuristics too. I
> don't have any particular reason to believe one heuristic is worse than the
> other, but I have no particular reason to believe otherwise either.
>
> As a final point, the inner loop now has two exiting blocks, not one. I
> know at least some of our loop transforms give up on anything with more
> than one loop exit. I'm less concerned by that since a) we should fix
> them, and b) the cases I've looked at don't look too hard.
>
>
> And that's everything that occurs to me at the moment. I'll probably have
> to add more in follow-up responses as I remember details; I know I'm
> forgetting some pieces.
>
> Philip