[PATCH] D55851: Implement basic loop fusion pass

Tue Feb 12 09:59:23 PST 2019

kbarton marked 16 inline comments as done.
kbarton added a comment.

I've addressed almost all of the comments.
I will work on an update to the problem that dmgreen provided an example for and post a new version of the patch when that is done. 
In the meantime, if my understanding of the comments is incorrect, someone please correct me :)

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:142-143
+
+  FusionCandidate(Loop *L, const DominatorTree *DT,
+                  const PostDominatorTree *PDT)
+      : Preheader(L->getLoopPreheader()), Header(L->getHeader()),
----------------
jdoerfert wrote:
> kbarton wrote:
> > Meinersbur wrote:
> > > [suggestion] IMHO it is good practice to not have much logic in object constructors. Reasons:
> > > 1. When exceptions are thrown, the object is partially initialized
> > > 2. With class polymorphism, calling virtual functions will call the non-overidden method.
> > > 3. It is not possible to return anything, in this case, an error code.
> > > 
> > > 1. and 2. do not apply to this code. Point 3. is handled using the `Valid` flag.
> > > 
> > > Nonetheless, I think having lots of logic in constructor is a bad pattern. Instead, I suggest to have something return nullptr or `llvm::ErrorOr`.
> > I understand your points here. I did not realize issue #2 (although I agree it doesn't apply here). 
> > 
> > However, if I understand your suggestion, I would need to add an initialize method (or something similar) to do the logic that is currently in the constructor and return a success/fail code. The problem is that before anything can be done with the object, you need to ensure that it has been initialized. This can be accomplished by adding a boolean flag (isInitialized) to the class and ensuring that all accessors check that before doing anything with it. However, I think that is more complicated then leaving the initialization logic in the constructor.
> > 
> > That said, I'm not particularly tied to either approach. I can change the implementation if you feel strongly. Or, if there is an alternative that I'm not seeing, I'm happy to do that instead. 
> It is not uncommon in LLVM to do it this way. I personally do not try to force a solution similar to `bool createXXX(...)` but I also do not necessarily object if there is an actual reason.
OK, so I'm still not sure what the preference is here:

  # Keep it as is.
  # Change to create an empty object and then require an Initialize method call on it.
  # Create a wrapper, createFusionCandidate(), that does the construction and initialization.
  # Some other alternative that I don't see.

If I'm going to change, I prefer the createFusionCandidate route (which I hadn't thought of earlier) which I think is the cleanest, although I don't know how to enforce it's usage. 

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:440-441
+  /// are control-flow equivalent.
+  bool controlFlowEquivalent(const FusionCandidate &FC0,
+                             const FusionCandidate &FC1) const {
+    if (VerboseFusionDebugging)
----------------
Meinersbur wrote:
> [style] Start functions with a verb: `isControlFlowEquivalent`
I addressed this a while ago, I just forgot to check it off as done. 

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:858
+
+      // TODO: Can we actually use the dependence info analysis here?
+      return false;
----------------
dmgreen wrote:
> kbarton wrote:
> > dmgreen wrote:
> > > I would imagine, although I'm not sure, that there would at least be a lot of bugs here. We are dealing with different loops, but we can say that they are very similar.
> > > 
> > > What does it currently give? Anything useful? The SCEV checks you are doing above is obviously quite similar to what DA would be trying to do, but with the added loop rewrite. It would be a shame to duplicate all the effort but may, I guess, be necessary if DA doesn't do well enough.
> > At this point DA doesn't give anything useful - at least not for the test cases that I have tried. I have not had a chance to investigate why, or if there is a better/different way to do things where it can be useful (which is why I marked the TODO here). 
> > 
> > The one thing that DA is able to do, that SCEV currently cannot, is understand the restrict keyword and accurately identify no dependencies between the two loops in this case. Perhaps it would be better to try and teach SCEV about restrict, and then only rely on SCEV in the long run?
> > 
> I remember DA having problems on 64bit systems due to all the sexts that kept happening. Delinearising constant's too, maybe? There are certainly problems in it that would be worth trying to fix (or find a replacement to do it better).
> 
> The delinearisation that DA will attempt may also be helpful. It is nowadays all built on SCEV's too, so should in theory (baring the facts that we know here about dealing with different but similar loops) be a more powerful version of the SCEV code here. That power might be confusing it at the moment though.
> 
> I would expect that better DA is something llvm will need sooner or later as more loop optimisations are implemented. Perhaps this is something that we can improve in the future, relying on the SCEV code at the moment, but getting more help from DA as it is improved. The no-aliasing checks it can do are useful at least, it would seem.
I completely agree about improving DA and the need for it with other loop optimizations. I still haven't had a chance to look at it in detail, but would like to start looking at it once this patch is finalized and lands.

For now are you OK with the current usage of DA here? I'm hoping that as we extend/improve it, we can modify this code to take advantage of it.

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:1003
+                                                          FC1.Header);
+    TreeUpdates.push_back(
+        {DominatorTree::Delete, FC0.ExitingBlocks, FC1.Preheader});
----------------
dmgreen wrote:
> kbarton wrote:
> > dmgreen wrote:
> > > It can sometimes be better to insert edges into the DT before deleting old ones. It keep the tree more intact (especially pdts), with less subtrees becoming unreachable and less recalculations needed. It means it can be simpler and quicker for the updates.
> > OK, that's good to know, thanks.
> > 
> > Is this specifically for inserting/deleting edges between similar blocks, or is this for all inserts/deletes in the entire tree? In other words, should I rearrange this code, and the code below (lines 1048-1051) that updates the latch blocks to do all the insertions before any deletions?
> I think it's probably fine, just something I've run into in the past with PDT's. If it's something we run into, we can fix it then. And if it does become an issue, it may be better to teach the DTU to do this itself.
OK, sounds good. I will mark this comment as done then. 

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:1015
+    // Moves the phi nodes from the second to the first loops header block.
+    while (PHINode *PHI = dyn_cast<PHINode>(&FC1.Header->front())) {
+      if (SE.isSCEVable(PHI->getType()))
----------------
jdoerfert wrote:
> kbarton wrote:
> > dmgreen wrote:
> > > Is there anywhere that we check that these won't have phi operands from inside the first loop? i.e. that the phis can be moved into the first loop / before all of it's instructions.
> > This is a really good catch!] We completely missed this.
> > I've added a check in dependencesAllowFusion that walks through the PHIs in the header of FC1 and check if the incoming block is in FC0 then it is the header block. If it is not, then we do not fuse. 
> I don't get this. Could one of you produce a problematic example for me? (also good as a test case)
dmgreen added an example above that illustrates the issue (after some manual modifications to the IR to get around other limitations in fusion). 

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:529
+
+    if (FC.Preheader->size() != 1) {
+      LLVM_DEBUG(dbgs() << "Loop " << FC.L->getName()
----------------
dmgreen wrote:
> Is there reason in the current version to prevent fusion if "FC0" has instructions in its preheaders?
Strictly speaking, no, I there is no reason to prevent fusion if FC0 has instructions in its preheader. 
This check was meant to prevent fusion if FC1 has instructions in the preheader since we don't have any analysis to move them around yet
I've moved this check later when we are trying to fuse pairs of candidates. If FC1 has a non-empty preheader then fusion is skipped. 

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:848
+             !DepResult->getNextSuccessor() &&
+             "TODO: JD: Implement pred/succ dependence handling!");
+
----------------
jdoerfert wrote:
> This should probably be a conditional and a `return false` ;)
I've changed from the assert to a check and print message.
After that, all the conditions just result in a return false, which cleans this up a lot. 

================
Comment at: llvm/lib/Transforms/Scalar/LoopFuse.cpp:917
+          return false;
+        }
+
----------------
dmgreen wrote:
> I'm not sure if this is quite strong enough. Consider something like this, where the sum would be used in the second block, but not as phi:
>   for(i = 0; i < n; i++)
>     sum += a[i];
>   for(i = 0; i < n; i++)
>     b[i] = a[i]/sum;
> These can be fused, but use the wrong "sum" on each iteration of the loop. Unroll and jam side-steps all this by requiring lcssa nodes (and, you know, not requiring generalised loop fusion :) )
Yes, you're right. If I generate the ll for this example and massage it to make it eligible for fusion, we will fuse these.
However, I'm confused by your statement: 

> These can be fused, but use the wrong "sum" on each iteration of the loop. 

I don't see how fusion is legal here. Are you saying that the current check will (incorrectly) still allow fusion? Or there is a way to fuse these loops?

For this example, lcssa will create a non-empty preheader for the second loop and the earlier checks preventing fusion will bail before we get to this test. If I manually sink the PHI from lcssa into the header of the second loop, I can create an empty preheader for the second loop to allow fusion to continue. 

At any rate, I think this check needs to be strengthened to prevent fusion from happening in this case. Do you agree?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55851/new/

https://reviews.llvm.org/D55851