[LLVMdev] Separating loop nests based on profile information?
Philip Reames
listmail at philipreames.com
Fri Jan 9 11:39:28 PST 2015
On 01/08/2015 12:33 PM, Philip Reames wrote:
>
> On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>> How does this compare with classical approaches of loop peeling,
>> partitioning, fission, or whatever you might call it?
> I'm still developing a good sense for this, but let me lay out some
> observations. Fair warning, this is going to be long and rambling.
>
> Let's start with a toy example:
> while(c) {
>   x = this->x;
>   y = this->y;
>   if (x == y) {
>     rare();
>   }
> }
>
> If we could tell x and y were loop invariant, we could unswitch this
> loop. However, the rare call clobbers our view of memory, so LICM
> fails, and thus unswitch fails.
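To make the target concrete, here is roughly what LICM plus unswitch would produce for that toy loop, written out as runnable C. The struct, the trip-count parameter, and the call counter are invented purely for illustration; the key assumption (stated in the comment) is the one that makes the transform legal.

```c
#include <assert.h>

struct S { int x, y; };

static int rare_calls;
static void rare(void) { rare_calls++; }

/* Original shape: reload the fields and re-test every iteration. */
static void original(struct S *self, int trips) {
  for (int c = 0; c < trips; c++) {
    int x = self->x;
    int y = self->y;
    if (x == y)
      rare();
  }
}

/* After LICM + unswitch: loads hoisted, compare evaluated once.
 * Legal only under the assumption that rare() cannot modify *self --
 * exactly the fact the clobber prevents LICM from proving. */
static void unswitched(struct S *self, int trips) {
  int x = self->x;
  int y = self->y;
  if (x == y) {
    for (int c = 0; c < trips; c++)
      rare();
  } else {
    for (int c = 0; c < trips; c++)
      ; /* hot path: no loads, no compare in the body */
  }
}
```
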
>
> We'd like to apply PRE here and push the reload into the loop
> preheader and the rare case. This currently fails for two reasons: 1)
> We don't allow code duplication during PRE, and 2) the canonical
> loop-simplify form of having a single latch means that PRE doesn't
> actually see the rare block at all, it sees the preheader and the join
> point after the if block.
>
> I think both of the previous are fixable:
> 1) This calls for profile guided duplication. If we can trade one hot
> load for two cold ones, that's great. Also, there might be static
> heuristics we could construct. (e.g. if it's not available in the
> preheader and one of many latches...)
> 2) We need PRE to look back through merge blocks. This should
> possibly even live in MemoryDependenceAnalysis. Alternatively, we
> could create a "deconstruct loop simplify form" pass to run right
> before GVN to side step this issue.
Point 2 here may be inaccurate. I've definitely seen problems here in
the past, but when I sat down to write some test cases, it looks like
the simple merge block for latch (with the load being considered in the
loop header) works just fine. I suspect my original example is actually
blocked by something else; I'm just not sure what that is yet.
One case where I can get PRE to fail is if there's a load in one of two
paths through the loop, in the preheader, but not in the header. Here's
the IR:
; Function Attrs: readonly
declare void @hold(i32) #0
declare void @clobber()
define i32 @test1(i1 %cnd, i32* %p) {
entry:
  %v2 = load i32* %p
  call void @hold(i32 %v2)
  br label %header

header:                    ; preds = %merge, %entry
  br i1 %cnd, label %bb1, label %bb2

bb1:                       ; preds = %header
  %v1 = load i32* %p
  call void @hold(i32 %v1)
  br label %merge

bb2:                       ; preds = %header
  call void @clobber()
  br label %merge

merge:                     ; preds = %bb2, %bb1
  br label %header
}

attributes #0 = { readonly }
I haven't looked to see what the issue here is yet.
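For reference, here is the shape PRE ought to produce for that test case, hand-applied and rewritten in C rather than IR. This is a sketch: the loop is bounded and the helpers are given bodies purely so the fragment is runnable (the original loop is infinite and the helpers are declarations), but the placement of the loads is the point.

```c
#include <assert.h>

static int mem = 5;          /* the location *p points at */
static int held[64];
static int nheld;

static void hold(int v) { held[nheld++] = v; }
static void clobber(void) { mem = 7; } /* may write through %p */

/* Hand-applied PRE on @test1: the bb1 load is fully redundant along
 * entry->header->bb1 (available from the preheader load), so PRE
 * should leave a compensation load only on the clobbering bb2 path
 * and reuse the value everywhere else. */
static void test1_after_pre(int cnd, int *p, int trips) {
  int v = *p;                /* preheader load, as in the original */
  hold(v);
  for (int i = 0; i < trips; i++) {
    if (cnd) {
      hold(v);               /* reuse: no load left in bb1 */
    } else {
      clobber();
      v = *p;                /* compensation load after the clobber */
    }
  }
}
```
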
>
> Neither is entirely simple to implement, but both are doable. Probably.
>
> In theory, we've now solved the same problem my loop nest transform
> does. There's two snags: one minor(ish), one major.
>
> The minor one is that many of my rare conditions involve volatile
> loads. MemDepAnalysis currently completely gives up when faced with
> volatile loads. This is mostly fixable, but given the structure of
> the code may be fairly invasive. Oddly, LICM actually performs better
> in this case.
>
> For the major one, let's tweak our example slightly:
> while(c) {
>   x = this->x;
>   if (x == null) return;
>   y = this->y;
>   if (y == null) return;
>   if (x == y) {
>     rare();
>   }
> }
>
> Let's assume that PRE worked as designed for 'x'. We now need to
> perform a transform analogous to loop unswitch, but a bit different.
> We need to move the conditional loop exits into each of the
> predecessors where the answer is not available. I don't believe we have anything
> like this today, and it involves a non-trivial bit of code. Even if
> we get that done, we're left with a pass ordering problem where we
> need to run PRE, then branch-PRE, then PRE, then branch-PRE, etc.
> (In practice, these chains can be quite long due to null checks and
> range checks.)
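For concreteness, here is the end state that PRE/branch-PRE chain is driving toward, hand-transformed into runnable C. The struct, counter, and trip-count parameter are invented for illustration; the assumption, as before, is that the fields are invariant and rare() can't change them.

```c
#include <assert.h>
#include <stddef.h>

struct S2 { int *x, *y; };

static int rare_calls2;
static void rare2(void) { rare_calls2++; }

/* Desired end state once the PRE / branch-PRE chain converges:
 * both loads and both null-check exits hoisted out of the loop,
 * then the invariant compare unswitched. */
static void hoisted(struct S2 *s, int trips) {
  int *x = s->x;
  if (x == NULL) return;
  int *y = s->y;
  if (y == NULL) return;
  if (x == y) {
    for (int c = 0; c < trips; c++)
      rare2();
  } else {
    for (int c = 0; c < trips; c++)
      ; /* hot loop: straight-line, no loads or checks left */
  }
}
```
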
>
> If we'd split this into a loop nest structure, we'd still have the
> chain, but a) that's easier to control with a custom pass order since
> LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM
> do a bit of trivial loop unswitch for the terminator of the header
> appears quite tractable.
>
> One other point worth mentioning is that LLVM does not currently have
> a loop peeling pass. LoopRotate seems to get us most of the way there
> in common cases, but not always. If we did peel the above loop once,
> we'd have the same merge block problem I mentioned previously (this
> time in the preheader *and* the latch), but the profitability becomes
> slightly more clear without the need for profile information or loop
> specific heuristics.
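To illustrate what peeling buys here, this is the first toy loop with one iteration peeled by hand (trip-count form, with hypothetical names; not the output of any existing pass). After the peel, the field values are available on entry to the remaining loop, which is what turns the in-loop loads into PRE candidates.

```c
#include <assert.h>

struct P { int x, y; };

static int rare_hits;
static void rare_p(void) { rare_hits++; }

/* The loop as written, with a concrete trip count standing in
 * for while(c). */
static void plain(struct P *s, int trips) {
  for (int c = 0; c < trips; c++) {
    if (s->x == s->y)
      rare_p();
  }
}

/* One iteration peeled: the loads in the peeled copy make the
 * loads inside the remaining loop fully redundant on the
 * fall-through edge (modulo the rare() clobber). */
static void peeled(struct P *s, int trips) {
  if (trips > 0) {
    if (s->x == s->y)       /* peeled first iteration */
      rare_p();
    for (int c = 1; c < trips; c++) {
      if (s->x == s->y)
        rare_p();
    }
  }
}
```
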
>
> To be clear, I'm not opposed to fixing some of the issues above. I
> think that's a great idea, I'm just trying to find my way through to
> practical performance improvements with minimal invested effort. I
> probably will in fact fix some of these since they show up in a number
> of other cases as well.
>
>
> Ok, so that's that. Let's move on to why my proposed loop nest
> separation change is a bit scary.
>
> While the loop nest does allow things to be lifted from the inner loop
> to the outer loop - without any of the complex PRE or PGO reasoning
> required above - it suffers from all of the limitations of the
> existing loop optimization passes. In particular, LICM may not
> actually be able to do much if the inner loop is still complicated
> enough to defeat either it's fairly simple alias analysis or fault
> analysis. GVN/PRE is strictly more powerful - in theory if not always
> practice - here than what we have in LICM.
>
> We've also potentially created a loop nest which is harder to analyse
> than the original loop with two latches. Consider this example:
> for(int i = 0; i < 20000;) {
>   i++;
>   if(i % 20 != 0) continue;
>   a[i] = 0;
> }
> (This is not a real example of anything useful. The important point
> is that the early continue is the common case.)
>
> This loop clearly runs exactly 20000 times. ScalarEvolution almost
> certainly gets this right. (I haven't checked.)
>
> After separation, we'd have this:
>
> int i = 0;
> while(true) {
>   while(i < 20000) {
>     i++;
>     if(i % 20 == 0) { // now a loop exit
>       a[i] = 0;
>       goto next;
>     }
>   }
>   break;
>   next:;
> }
>
> It's not clear to me that SE will always be able to tell that both the
> inner and outer loop runs a maximum of 20000 times. (In this
> particular case it might actually optimize the inner loop
> substantially and thus get a more precise answer, but let's ignore
> that.) We could potentially lose information about the loop by doing
> this transformation. Of course, this concern applies to the
> *existing* loop nest separation heuristics too. I don't have any
> particular reason to believe one heuristic is worse than the other,
> but I have no particular evidence either way.
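Setting the ScalarEvolution question aside, the two shapes can at least be checked for semantic equivalence directly. Here is both forms as runnable C, with the trip count and array made concrete stand-ins:

```c
#include <assert.h>
#include <string.h>

#define N 20000

static int a1[N + 1], a2[N + 1];

/* The single loop with two latches (continue is the common case). */
static void one_loop(int *a) {
  for (int i = 0; i < N;) {
    i++;
    if (i % 20 != 0) continue;
    a[i] = 0;
  }
}

/* The separated nest: the rare store path is now an inner-loop
 * exit, and the inner loop restarts from the outer one. */
static void nested(int *a) {
  int i = 0;
  while (1) {
    while (i < N) {
      i++;
      if (i % 20 == 0) { /* now a loop exit */
        a[i] = 0;
        goto next;
      }
    }
    break;
  next:;
  }
}
```

Both versions store to the same 1000 indices (every 20th element up to 20000), so the transformation is behavior-preserving even though the loop structure SE must reason about is quite different.
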
>
> As a final point, the inner loop now has two exiting blocks, not one.
> I know at least some of our loop transforms give up on anything with
> more than one loop exit. I'm less concerned by that since a) we
> should fix them, and b) the cases I've looked at don't look too hard.
>
>
> And that's everything that occurs to me at the moment. I'll probably
> have to append in further responses as I remember more details, I know
> I'm forgetting some pieces.
>
> Philip
>
>
>