[PATCH] D55851: Implement basic loop fusion pass

Mon Mar 25 20:18:39 PDT 2019

kbarton marked 10 inline comments as done.
kbarton added inline comments.

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:858
+
+      // TODO: Can we actually use the dependence info analysis here?
+      return false;
----------------
dmgreen wrote:
> kbarton wrote:
> > dmgreen wrote:
> > > kbarton wrote:
> > > > dmgreen wrote:
> > > > > I would imagine, although I'm not sure, that there would at least be a lot of bugs here. We are dealing with different loops, but we can say that they are very similar.
> > > > > 
> > > > > What does it currently give? Anything useful? The SCEV checks you are doing above is obviously quite similar to what DA would be trying to do, but with the added loop rewrite. It would be a shame to duplicate all the effort but may, I guess, be necessary if DA doesn't do well enough.
> > > > At this point DA doesn't give anything useful - at least not for the test cases that I have tried. I have not had a chance to investigate why, or if there is a better/different way to do things where it can be useful (which is why I marked the TODO here). 
> > > > 
> > > > The one thing that DA is able to do, that SCEV currently cannot, is understand the restrict keyword and accurately identify no dependencies between the two loops in this case. Perhaps it would be better to try and teach SCEV about restrict, and then only rely on SCEV in the long run?
> > > > 
> > > I remember DA having problems on 64bit systems due to all the sexts that kept happening. Delinearising constant's too, maybe? There are certainly problems in it that would be worth trying to fix (or find a replacement to do it better).
> > > 
> > > The delinearisation that DA will attempt may also be helpful. It is nowadays all built on SCEV's too, so should in theory (baring the facts that we know here about dealing with different but similar loops) be a more powerful version of the SCEV code here. That power might be confusing it at the moment though.
> > > 
> > > I would expect that better DA is something llvm will need sooner or later as more loop optimisations are implemented. Perhaps this is something that we can improve in the future, relying on the SCEV code at the moment, but getting more help from DA as it is improved. The no-aliasing checks it can do are useful at least, it would seem.
> > I completely agree about improving DA and the need for it with other loop optimizations. I still haven't had a chance to look at it in detail, but would like to start looking at it once this patch is finalized and lands.
> > 
> > For now are you OK with the current usage of DA here? I'm hoping that as we extend/improve it, we can modify this code to take advantage of it.
> Sounds good to me. The !DepResult check can help in some cases at least, and the rest we can add as DA is improved.
I'm going to mark this as done, since I think we are in agreement we want to work on improving the dependence analysis in the future. If this is incorrect, please let me know. 

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:129
+  BasicBlock *Header;
+  BasicBlock *ExitingBlocks;
+  BasicBlock *ExitBlock;
----------------
dmgreen wrote:
> Perhaps name it ExitingBlock, if there is only one.
Good idea. Will fix in patch I'm going to post soon. 

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:929
+          // defined in that loop, and we likely don't want to disturb it.
+          PHINode *DefPHI = dyn_cast<PHINode>(Def);
+          if (DefPHI)
----------------
dmgreen wrote:
> kbarton wrote:
> > dmgreen wrote:
> > > I think that you can just do something like
> > >   if (Instruction *Def = dyn_cast<Instruction>(Op))
> > >     if (FC0.L->contains(Def->getParent()))
> > >       return false;
> > > Providing there are no lcssa phis, this should rule out anything that depends on the first loop. 
> > I just finished up testing this change, and then realized I think this is overly restrictive.
> > I had originally restricted it to PHIs because I only wanted to include things that are (re)defined in the first loop. Basically, the first loop needs to complete in order to get the correct definition going into the second loop.
> > 
> > This alternate sequence will also include stores, not just PHIs. The store can be loop invariant, and if it is not hoisted out of the loop prior, then it would also prevent fusion.
> This is the kind of thing I was thinking of:
> 
> ```
> define float @test(float* nocapture %a, i32 %n) {
> entry:
>   %conv = zext i32 %n to i64
>   %cmp32 = icmp eq i32 %n, 0
>   br i1 %cmp32, label %for.cond.cleanup7, label %for.body
> 
> for.body:                                         ; preds = %for.body, %entry
>   %i.034 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
>   %sum1.033 = phi float [ %add, %for.body ], [ 0.000000e+00, %entry ]
>   %idxprom = trunc i64 %i.034 to i32
>   %arrayidx = getelementptr inbounds float, float* %a, i32 %idxprom
>   %0 = load float, float* %arrayidx, align 4
>   %add = fadd float %sum1.033, %0
>   %inc = add nuw nsw i64 %i.034, 1
>   %cmp = icmp ult i64 %inc, %conv
>   br i1 %cmp, label %for.body, label %for.body8
> 
> for.body8:                                        ; preds = %for.body, %for.body8
>   %i2.031 = phi i64 [ %inc14, %for.body8 ], [ 0, %for.body ]
>   %idxprom9 = trunc i64 %i2.031 to i32
>   %arrayidx10 = getelementptr inbounds float, float* %a, i32 %idxprom9
>   %1 = load float, float* %arrayidx10, align 4
>   %div = fdiv float %1, %add
>   store float %div, float* %arrayidx10, align 4
>   %inc14 = add nuw nsw i64 %i2.031, 1
>   %cmp5 = icmp ult i64 %inc14, %conv
>   br i1 %cmp5, label %for.body8, label %for.cond.cleanup7
> 
> for.cond.cleanup7:                                ; preds = %for.body8, %entry
>   %sum1.0.lcssa36 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body8 ]
>   ret float %sum1.0.lcssa36
> }
> ```
> 
> The importand part being %add, that is defined in the first loop and used in the second. I believe that using normal SSA def-use chains, anything that is def'd in the first loop (and used in the second) will need to use the value from the last iteration of the loop, so fusion will be illegal.
> 
> There may be cases where a def has the same value on every iteration of the loop, but I imagine those would already have been hoisted/sunk, or be quite rare.
Yes, I see now. It's possible to have a def in addition to a PHI.
If I wanted to catch this, I would also need to traverse all the inputs to any defs and make sure they don't come from PHIs within loop 0 as well.
I'm OK with restricting this to defs, with the assumption that if there is any loop invariant defs they would have been hoisted out of the loop at this point. 

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:1083
+    // Update DT/PDT
+    DTU.applyUpdates(TreeUpdates);
+
----------------
dmgreen wrote:
> Whilst checking the example, I noticed the dom tree updater has become stricter about "inbalanced" tree operations. I think that if FC0.ExitingBlocks == FC0.Latch, we add a link to FC1.Header twice, which is maybe what it's complaining about (?)
Yup, good catch. I found the same problem when I tried your test case.
I've fixed the problem and added your test above to cannot_fuse.ll. 

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55851/new/

https://reviews.llvm.org/D55851