[PATCH] [InstCombine] Combine adjacent i8 loads.

Thu May 1 23:30:50 PDT 2014

atrick:
> Load combining should probably be done in a single pass over the block. First collect all the offsets, then sort, then look for pairs. See the LoadClustering stuff in MachineScheduler. Your RangeLimit skirts around this problem, but I don't think the arbitrary threshold is necessary.
> 
> Doing this per basic block is ok. Although there's no reason you can't do it almost as easily on an extended basic block (side exits ok, no merges). Chandler said do it on the domtree, but handling CFG merges would be too complicated and expensive.

I agree this should be a basic block pass. The problem with multiple basic blocks is that you cannot introduce a load into a path that it would not otherwise be loaded. This would lead to potential a race conditions. So if all paths have the load, sure. I'm not sure how expensive this is to calculate though. I also believe we would have already hoisted the load into the parent BB.

> Did you forget to check for Invokes?

Invokes are terminating instructions. There won't be any loads in the same basic block after it.

> 
> Conceptually this is doing SLP vectorization, but it doesn't fit with our SLP algortihm which first finds a vectorizable use-def tree. Sticking it in GVN is another option, but again I'm concerned about it running before SLP. Maybe it can run in the SLP pass after the main algorithm.

In this case wouldn't it be just as easy to make it a separate pass that runs after SLP?

Nadav:
> Michael described the motivation to his patch by showing the LoadU64_x8 function and how it loads multiple scalars from memory just to combine them into a single word. If this function represents the kind of function that he wants to handle then I think that we should consider implementing this in SelectionDAG. We already merge consecutive loads and stores in SelectionDAG. Problems like this should be local to one basic block so the fact that we are working on a single basic block should not be a problem. At SelectionDAG level we should have enough information to make informed decisions on the profitability of merging loads/stores. I can’t imagine propagating all of the relevant information into IR land. The SLP vectorizer is designed to create vectors and teaching the SLP vectorizer to handle non-vector types is not a good idea. Also, generating vectors for code that should obviously stay scalar is not ideal. Michael, was I correct about the kind of problems that you are trying to solve? Have you considered SelectionDAG? I think MergeConsecutiveStores is a good place to look at.

My specific reason for looking into this was the stated problem, but I believe there are lots of other cases that can benefit from this. The main reason I didn't want to do it in SDAG is because we don't do bswap recognition in SDAG, and I'd rather not have multiple implementations of it. Another issue is the inliner, bswap is generally a great thing to inline, but if we only do this in SDAG we may choose not to.

http://reviews.llvm.org/D3580