[PATCH] [InstCombine] Combine adjacent i8 loads.

Quentin Colombet qcolombet at apple.com
Fri May 2 10:43:03 PDT 2014


On May 2, 2014, at 1:16 AM, Chandler Carruth <chandlerc at gmail.com> wrote:

> At a very high level (and separate from where or how this happens within the IR domain) I think it is definitely the right design to merge loads and stores as early as possible in the IR.
> 
> 1) It is quite possible (and actually happens in real test cases after inlining) that the loads and stores get twisted across multiple basic blocks even though at the IR level we have enough analysis information to very easily show that all of the loads (or stores) are guaranteed to be executed and thus at least *correct* to merge. The memory model makes it extremely hard to merge stores (and even loads) in the DAG. There is an old thread you can check for the full discussion of this.
> 
> 2) Once merged, there are other optimizations that may apply, so I think it's useful to expose this throughout the pipeline.
> 
> 3) Ultimately, it's fewer IR instructions, and smaller IR tends to be faster to analyze.
> 
> 4) We've had good success with slicing up loads and stores during lowering to ensure we don't do unfriendly things to the targets. That code path was already really well exercised due to SROA; in the old thread I mentioned we looked at slicing in both places, and someone (I think Quentin?) had good success slicing in the DAG to make sure we handle this well in the target.

We could indeed extend the load slicing mechanism of DAG combine (look for SliceUpLoad). Currently it does not understand vector operations like extract, because we did not have motivating examples from which to build the cost model.
Moreover, IIRC the slicing is limited to two slices per load. Again, this is not a fundamental limitation, just a missing cost model.
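For readers unfamiliar with the transformation, here is a minimal sketch of what load slicing buys (this is an illustration, not LLVM's actual implementation, and the function names are made up): a wide load whose halves are consumed independently can be replaced by two narrow loads, avoiding the shift/truncate. The byte-offset correspondence assumes a little-endian target.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct Halves { uint32_t lo, hi; };

// "Before" form: one i64 load, then shift/truncate to extract the halves.
Halves wideLoadThenSplit(const uint64_t *p) {
  uint64_t v = *p;                            // one wide load
  return { (uint32_t)v, (uint32_t)(v >> 32) };
}

// "After" form: two i32 loads of the same eight bytes. Matches the
// shift-based version on a little-endian target.
Halves slicedLoads(const uint64_t *p) {
  uint32_t lo, hi;
  std::memcpy(&lo, p, 4);                     // low 4 bytes
  std::memcpy(&hi, (const char *)p + 4, 4);   // high 4 bytes
  return { lo, hi };
}
```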

The related patch (for DAG combine) was discussed here:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130916/188497.html
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20131007/190273.html

(If you look back into the initial thread, you’ll find the previous version of the patch (part of instcombine) and the related discussion.)

Thanks,
-Quentin
> 
> 
> On Thu, May 1, 2014 at 10:34 PM, Nadav Rotem <nrotem at apple.com> wrote:
> Michael described the motivation to his patch by showing the LoadU64_x8 function and how it loads multiple scalars from memory just to combine them into a single word. If this function represents the kind of function that he wants to handle then I think that we should consider implementing this in SelectionDAG. We already merge consecutive loads and stores in SelectionDAG. Problems like this should be local to one basic block so the fact that we are working on a single basic block should not be a problem. At SelectionDAG level we should have enough information to make informed decisions on the profitability of merging loads/stores. I can’t imagine propagating all of the relevant information into IR land. The SLP vectorizer is designed to create vectors and teaching the SLP vectorizer to handle non-vector types is not a good idea. Also, generating vectors for code that should obviously stay scalar is not ideal. Michael, was I correct about the kind of problems that you are trying to solve? Have you considered SelectionDAG? I think MergeConsecutiveStores is a good place to look at.
> 
> On May 1, 2014, at 10:08 PM, Andrew Trick <atrick at apple.com> wrote:
> 
> > I agree with Chandler that this should only be done once, late in the pipeline (post GVN). I am also concerned that if this runs before SLP vectorizer it will interfere with it. I'd like to get Arnold's comments on this.
> >
> > Load combining should probably be done in a single pass over the block. First collect all the offsets, then sort, then look for pairs. See the LoadClustering stuff in MachineScheduler. Your RangeLimit skirts around this problem, but I don't think the arbitrary threshold is necessary.
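The single-pass structure Andrew suggests (collect offsets, sort, look for pairs) can be sketched as follows; LoadInfo and the greedy pairing predicate are simplifications for illustration, not the patch's actual data structures, and real code would also check the base pointer, alignment, and intervening stores.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified model of a load: byte offset from a common base pointer,
// plus access size in bytes.
struct LoadInfo { int64_t Offset; int64_t Size; };

// Sort the loads by offset, then greedily pair loads that are exactly
// adjacent in memory (first ends where the second begins).
std::vector<std::pair<LoadInfo, LoadInfo>>
pairAdjacentLoads(std::vector<LoadInfo> Loads) {
  std::sort(Loads.begin(), Loads.end(),
            [](const LoadInfo &A, const LoadInfo &B) {
              return A.Offset < B.Offset;
            });
  std::vector<std::pair<LoadInfo, LoadInfo>> Pairs;
  for (size_t i = 0; i + 1 < Loads.size(); ++i) {
    if (Loads[i].Offset + Loads[i].Size == Loads[i + 1].Offset) {
      Pairs.push_back({Loads[i], Loads[i + 1]});
      ++i;  // both loads of the pair are consumed
    }
  }
  return Pairs;
}
```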
> >
> > Doing this per basic block is ok. Although there's no reason you can't do it almost as easily on an extended basic block (side exits ok, no merges). Chandler said do it on the domtree, but handling CFG merges would be too complicated and expensive.
> >
> > Did you forget to check for Invokes?
> >
> > Conceptually this is doing SLP vectorization, but it doesn't fit with our SLP algorithm, which first finds a vectorizable use-def tree. Sticking it in GVN is another option, but again I'm concerned about it running before SLP. Maybe it can run in the SLP pass after the main algorithm.
> >
> > http://reviews.llvm.org/D3580
> >
> >
> >
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
