[LLVMdev] proof of concept for a loop fusion pass
Hal Finkel
hfinkel at anl.gov
Sun Jan 25 19:19:36 PST 2015
----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Ramshankar Ramanarayanan" <Ramshankar.Ramanarayanan at amd.com>
> Cc: llvmdev at cs.uiuc.edu, "chandlerc" <chandlerc at gmail.com>, "Adam Nemet" <anemet at apple.com>
> Sent: Sunday, January 25, 2015 9:16:26 PM
> Subject: Re: [LLVMdev] proof of concept for a loop fusion pass
>
> ----- Original Message -----
> > From: "Ramshankar Ramanarayanan" <Ramshankar.Ramanarayanan at amd.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>, "Adam Nemet" <anemet at apple.com>
> > Cc: llvmdev at cs.uiuc.edu, "chandlerc" <chandlerc at gmail.com>
> > Sent: Monday, January 19, 2015 7:57:42 AM
> > Subject: RE: [LLVMdev] proof of concept for a loop fusion pass
> >
> > Hi Hal,
> >
> > There are two diffs under revision update history:
> >
> > The clang patch is:
> > http://reviews.llvm.org/differential/diff/18250/
> > The llvm patch is: http://reviews.llvm.org/differential/diff/18249/
>
> Okay, please don't do that. Please open one differential per patch.
> It is not at all obvious that the different updates refer to
> completely different patches (plus, it will mess up the incremental
> patch differencing capability).
Also, please post your patches with full context (see http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface for instructions), and make sure they're not inverted (your current diff makes it look like your new files are being deleted, etc.).
-Hal
>
> >
> > The loop fusion pass is in Scalar opts.
>
> That seems like the right place.
>
> > One of the challenges in
> > the legality checks was aliasing with extern variables. However, I
> > derived some metadata from GlobalsModRef and used that in the scalar
> > opt to get what I wanted. Do you have thoughts on how we can
> > propagate GlobalsModRef analysis into the scalar opt?
>
> So it seems, specifically, you're attempting to propagate whether a
> global does not have its address taken. For basic blocks, we stick
> this information into their value subclass data; perhaps we could do
> the same here? We could also have an AddressTaken member variable,
> as we do for MachineBasicBlock in CodeGen. Regardless, the metadata
> does not seem unreasonable.
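>
> As an aside, here is a minimal C sketch (illustrative only, not taken
> from the patch; the names are made up) of why the address-taken
> property matters for the fusion legality checks:
>
>   extern int counter;              /* address never taken anywhere      */
>
>   void f(int *p, int n) {
>     for (int i = 0; i < n; i++)
>       counter += p[i];             /* loop 1: reads p[], writes counter */
>     for (int i = 0; i < n; i++)
>       p[i] *= 2;                   /* loop 2: writes p[]                */
>   }
>
>   /* If counter's address is never taken, p cannot point to it; the
>      only remaining dependence is the read of p[i] before the write
>      of p[i] in the same iteration, which fusion preserves. Otherwise
>      the legality check must conservatively fail. */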
>
> > That could be one of the
> > first sub-patches. Going by the various suggestions I will put out
> > a
> > diff just for loop fusion (along with control dependences/closures)
> > with some synthetic test cases.
>
> Great!
>
> >
> > This phase ordering question is of interest: Will inlining (and
> > other
> > IPO analysis/opt incl globals mod/ref, function unswitch)
> > facilitate
> > better loop fusion?
>
> I'm sure they will, especially inlining, but I see no reason that
> loop fusion should occur outside of the inliner-driven CGSCC pass
> manager (so that it occurs iteratively with inlining). We might want
> to make ICA (the inliner's cost analysis) smarter about loops that
> might be fused, but that's a separable improvement.
>
> > I experimented with an IPO analysis that spans
> > call graphs and checks "adjacent" code for loops that can be fused.
> > However, this made the compiler very slow, and I ended up inlining
> > all functions that have a singly nested loop. I could check
> > that there is a path from entry to the loop header that does not
> > have "bad" basic blocks: blocks that have calls, for example.
>
> I see no reason to avoid fusing loops with calls (so long, I suppose,
> as we can split later based on register pressure considerations).
> Why are you doing that?
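>
> For example (a hedged sketch, not one of the patch's test cases; the
> function names are hypothetical), two loops whose only calls are to a
> function known to be readnone should still be fuseable:
>
>   int g(int x);                    /* assume g is known readnone       */
>
>   void h(int *A, int *B, int n) {
>     for (int i = 0; i < n; i++) A[i] = g(A[i]);
>     for (int i = 0; i < n; i++) B[i] += A[i];
>   }
>
>   /* With g free of side effects, the only dependence is A[i] written
>      in loop 1 and read in loop 2 at the same index, so fusion keeps
>      the original order of those accesses. */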
>
> > Alternatively, I could use profile or static block frequencies. We
> > need to prune the search space somehow, and we have to do it across
> > call boundaries.
>
> Better use of BPI within ICA is something that the new pass-manager
> infrastructure should allow.
>
> >
> > I think that vectorization should come later than fusion and
> > distribution in the optimization flow. Vectorization adds loop
> > versioning. That creates additional loops and probably a more
> > challenging search space for loop fusion.
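> >
> > To make the versioning point concrete, here is a rough C-level sketch
> > of what the vectorizer's runtime alias check conceptually produces
> > (not actual compiler output):
> >
> >   /* before vectorization: one loop */
> >   for (i = 0; i < n; i++) A[i] = B[i] + 1;
> >
> >   /* after vectorization with a runtime memcheck: two loops */
> >   if (A + n <= B || B + n <= A) {
> >     /* vectorized copy of the loop */
> >   } else {
> >     /* original scalar loop */
> >   }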
>
> Good point. As I mentioned, I think that fusing early and splitting
> late makes sense. Loop fusion potentially exposes many other
> simplification opportunities, and predicting them ahead of time is
> unlikely to be very successful.
>
> > Interestingly, I tried
> > vectorization by hand for 462.libquantum and it did not help: there
> > is an interplay with if-converting stores (from what I have seen so
> > far). Perhaps it is an architecture thing.
>
> Interesting.
>
> -Hal
>
> >
> > Thanks,
> > Ram
> >
> > -----Original Message-----
> > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > Sent: Saturday, January 17, 2015 2:30 AM
> > To: Adam Nemet; Ramanarayanan, Ramshankar
> > Cc: llvmdev at cs.uiuc.edu; chandlerc
> > Subject: Re: [LLVMdev] proof of concept for a loop fusion pass
> >
> > ----- Original Message -----
> > > From: "Adam Nemet" <anemet at apple.com>
> > > To: "Ramshankar Ramanarayanan" <Ramshankar.Ramanarayanan at amd.com>
> > > Cc: llvmdev at cs.uiuc.edu
> > > Sent: Saturday, January 17, 2015 12:20:55 AM
> > > Subject: Re: [LLVMdev] proof of concept for a loop fusion pass
> > >
> > >
> > > On Jan 15, 2015, at 4:22 PM, Ramanarayanan, Ramshankar
> > > <Ramshankar.Ramanarayanan at amd.com> wrote:
> > >
> > >
> > > Hi,
> > >
> > >
> > > We are proposing a loop fusion pass that tries to proactively fuse
> > > loops across function call boundaries and arbitrary control flow.
> > >
> > >
> > > http://reviews.llvm.org/D7008
> >
> > This link contains the Clang patch. Did you intend to post the LLVM
> > patch as well?
> >
> > >
> > >
> > > With this pass, we get 103 loop fusions in SPEC CPU2006 INT
> > > 462.libquantum, with rate performance improving close to 2.5x on
> > > x86 (results from an AMD A10-6700).
> > >
> > >
> > > I took some liberties in patching up some of the code in
> > > ScalarEvolution/DependenceAnalysis/GlobalsModRef/Clang options,
> > > and also adjusted the IPO/LTO pass managers. I would need to do a
> > > better job there, but I would like to put this out as a WIP for
> > > high-level review. At this time, standard loop fusion test cases
> > > may not work (cases where the control dependences are the same or
> > > the loops are truly adjacent, etc.). I will be addressing that as
> > > well.
> > >
> > >
> > > Appreciate suggestions and other help.
> > >
> > >
> > > The pass initially attempts to inline calls which have loops in
> > > them. Following inlining, a separate scalar "main" pass tries to
> > > fuse loops. It is not necessary for the loops to be adjacent or to
> > > have the same control dependence. Specifically, a control
> > > dependence closure is computed, and loops that share a control
> > > dependence closure prefix are taken as candidates for fusion,
> > > provided there are no other loops on the path between them. A loop
> > > graph is built with edges between loops which share the control
> > > dependence closure prefix. Fusion is then applied recursively on
> > > the graph, "bottom-up".
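> > >
> > > As a small illustration of the candidate selection (a sketch, not
> > > one of the actual test cases):
> > >
> > >   if (p) {
> > >     if (q) {                     /* closure of L1: {p, q} */
> > >       for (i = 0; i < n; i++) A[i] += x;
> > >     }
> > >   }
> > >   if (p) {
> > >     if (r) {                     /* closure of L2: {p, r} */
> > >       for (i = 0; i < n; i++) B[i] += y;
> > >     }
> > >   }
> > >
> > >   /* L1 and L2 share the control dependence closure prefix {p} and
> > >      no other loop lies on the path between them, so an edge is
> > >      added between them in the loop graph and they are considered
> > >      for fusion. */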
> > >
> > >
> > > During fusion, program slices are taken based on the control flow
> > > closures of the two loops being fused. Example: suppose two loops
> > > A and B are going to be fused. They share the same control
> > > dependence prefix, but their suffixes differ. The suffixes are
> > > used to duplicate the control flow paths that pass through both
> > > loops: specifically, the paths from the first block in the
> > > control-dependence closure suffix of the first loop, through the
> > > first loop's exit, into the suffix of the second loop, through the
> > > second loop's exit, and on to the common post-dominator of the two
> > > loops' exits. The number of such paths grows exponentially, so at
> > > this time some heuristics are used when exploring paths. Using
> > > profile-based heuristics would be a better idea.
> >
> > On this last point, would it be possible to encode your static
> > heuristics as improvements to our static branch probability
> > heuristics, and then naturally adapt to profile-based branch
> > probabilities when those are available?
> >
> > >
> > > Hi Ramshankar,
> > >
> > > This is interesting, indeed. Thanks for soliciting input.
> > >
> > >
> > > My main comment has to do with the fact that in order to get this
> > > gain
> > > you need many pieces to work together. My suggestion is to
> > > proceed
> > > in
> > > steps and develop it incrementally on trunk.
> >
> > I second this. We definitely should have loop fusion in LLVM, and
> > incremental in-trunk development is the best way to make that
> > happen.
> >
> > >
> > >
> > > For example it would be good to first add a simple loop fusion
> > > pass
> > > that only merges adjacent loops and then build it up from there.
> > > Can
> > > you break down your approach into smaller, incremental steps like
> > > this? What would these steps be?
> > >
> > >
> > > My other comment is about the choice of DependenceAnalysis
> > > framework.
> > > No pass currently uses this framework and I am not sure how
> > > complete
> > > the implementation is
> >
> > DependenceAnalysis has known issues -- as I recall, the way it
> > pulls
> > apart GEP indices without verifying the boundary conditions is not
> > really sound, but this could be replaced by our delinearization
> > functionality now (what Polly is using) -- and if this work
> > provides
> > the impetus for cleaning that up, so much the better. That having
> > been said, care is required. DependenceAnalysis is definitely more
> > powerful than the vectorizer's dependence analysis, and
> > conceptually
> > cleaner. However, using what the vectorizer uses should provide a
> > conservatively-correct base.
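> >
> > To make the legality requirement concrete, here is an illustrative
> > sketch (not a test from the patch) of a case any dependence analysis
> > used for fusion has to reject:
> >
> >   for (i = 0; i < n; i++) A[i] = B[i] + 1;    /* writes A[i]   */
> >   for (i = 0; i < n; i++) C[i] = A[i + 1];    /* reads A[i+1]  */
> >
> >   /* In the original code every A[i+1] with i+1 < n is already
> >      written when the second loop reads it; in a fused loop,
> >      iteration i would read A[i+1] before iteration i+1 writes it,
> >      so fusion is illegal here without further transformation. */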
> >
> > > and whether it has adequate test coverage.
> >
> > DependenceAnalysis's test coverage actually isn't bad, at least for
> > loop bodies in traditional form.
> >
> > > This was my main reason for reusing the dependence analysis from
> > > the
> > > LoopVectorizer. Have you considered the same?
> >
> > Finally, it might be a good idea to discuss the overall
> > optimization
> > plan for fuseable loops. One possibility that I had discussed with
> > Chandler a couple of months ago was: aggressively fuse early,
> > expose
> > optimization opportunities, and then split late (after
> > vectorization) based on target-specific modeling (accounting for
> > register pressure, hardware prefetching resources, etc.). There are
> > obviously also other possible approaches, and I'd like to know what
> > you think would be an optimal strategy.
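> >
> > As a rough C-level sketch of that fuse-early/split-late idea
> > (illustrative only):
> >
> >   /* fused early: exposes reuse of A[i], removes one loop's
> >      overhead, and gives later passes one body to simplify */
> >   for (i = 0; i < n; i++) { s0 += A[i] * B[i]; s1 += A[i] * C[i]; }
> >
> >   /* split late (after vectorization) if the fused body exceeds the
> >      target's register or prefetch-stream budget */
> >   for (i = 0; i < n; i++) s0 += A[i] * B[i];
> >   for (i = 0; i < n; i++) s1 += A[i] * C[i];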
> >
> > Thanks again,
> > Hal
> >
> > >
> > >
> > > Adam
> > >
> > > Also function unswitching helps by cleaning up control flow.
> > >
> > >
> > > Example:
> > >   if (a & b) {
> > >     if (a & c) {
> > >       if (t & 0xF) {
> > >         for (i = 0; i < size; i++) {
> > >           if (A[i] & d) {
> > >             A[i] |= (1 << t);
> > >           }
> > >         }
> > >       }
> > >     }
> > >   }
> > >
> > >
> > >   if (b & d) {
> > >     if (a & d) {
> > >       if (t & 0xA) {
> > >         for (i = 0; i < size; i++) {
> > >           if (A[i] & d) {
> > >             A[i] |= (1 << t);
> > >           }
> > >         }
> > >       }
> > >     }
> > >   }
> > >
> > >
> > > After fusion we will have:
> > >
> > >
> > >   if (a&b && a&c && t&0xF && b&d && a&d && t&0xA) {
> > >     for (i = 0; i < size; i++) {
> > >       if (A[i] & d) {
> > >         A[i] |= (1 << t);
> > >       }
> > >       if (A[i] & d) {
> > >         A[i] |= (1 << t);
> > >       }
> > >     }
> > >   } else {
> > >     // original code
> > >     if (a & b) {
> > >       if (a & c) {
> > >         if (t & 0xF) {
> > >           for (i = 0; i < size; i++) {
> > >             if (A[i] & d) {
> > >               A[i] |= (1 << t);
> > >             }
> > >           }
> > >         }
> > >       }
> > >     }
> > >
> > >     if (b & d) {
> > >       if (a & d) {
> > >         if (t & 0xA) {
> > >           for (i = 0; i < size; i++) {
> > >             if (A[i] & d) {
> > >               A[i] |= (1 << t);
> > >             }
> > >           }
> > >         }
> > >       }
> > >     }
> > >   }
> > >
> > >
> > > Thanks,
> > > Ramshankar
> > >
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > >
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory