[PATCH][instcombine]: Slice a big load in two loads when the element are next to each other in memory.

Quentin Colombet qcolombet at apple.com
Fri Sep 20 14:56:12 PDT 2013

Hi Chandler,

Thanks for the review!

On Sep 19, 2013, at 11:16 PM, Chandler Carruth <chandlerc at google.com> wrote:

> On Thu, Sep 19, 2013 at 10:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi,
> Here is the new patch.
> The optimization is now done as part of DAG combine.
> The patch looks pretty much the same as its InstCombine counter part (SDNode, instead of Instruction) except that it introduces a new profitability model.
> Hence, when the transformation was done as InstCombine the idea  was to perform cannonicalization. Now, this is an optimization.
> For this purpose, the patch introduces a cost model (represented by the helper class LoadedSlice::Cost), which accounts for the instructions that are saved by slicing a load compared to the original big load.
> It also introduces a new target hook, to query about load pairing capabilities (type and required alignment).
> Basically, this cost model counts how many truncates, shifts, etc, are saved with slicing compared to the number of introduced loads.
> It takes advantages of information like isTruncateFree, etc.
> Currently the cost model is used in a very specific case: only two slices that are next to each other in memory.
> This limitation can be lifted as soon as we consider the cost model mature enough and when machine passes that deal with load pairing manage to handle more slices.
> Thanks for your reviews!
> I'm probably not the right person to do a detailed review (I just don't know the DAG that well) but some high level thoughts:
> - I *really* like the design you're building here. I like the use of a concrete cost model, tracking register pairing, etc. That looks like serious goodness, and I'm excited to see it expanded.
> - I wonder if it would help to explicitly shift the data structures to handle N slices used of M adjacent loads? I mention M adjacent loads because SROA (or user-authored code) will pre-slice stuff to only use target-legal integers (or 'int') in some cases. I expect it to be rare, but likely easy to handle the generality.
That may be a good idea indeed.
Do you have a motivating example that may drive this?
> - It wasn't clear how you account for multiple uses. I'm specifically imagining the fact that a wide load may be used N times for N slices; the narrow loads for N-1 slices don't replace the wide load. It'll be important to keep an eye on the use # and not assume (for cost) that you simplify away things that have more uses anyways.
Let me rephrase to be sure I understood your point. Your concern here is that we may slice the wide load, but still have it afterwards, is that it?

If yes, this cannot happen, the slicing abort for each unknown uses. Thus, currently, a wide load is completely sliced or left untouched.
> - The cost comparator doesn't really make sense to me. the early bias followed by summing is just strange, but maybe that's just me.
Right now, the cost model boils down to this.
Each slice spares:
- a truncate
- [a shift]
- [a costly bitcast]
and introduces:
- a load
- [a zext]
[] means optionally and costly bit cast means bit cast which source and destination are in different register banks.

Thus, the cost of a slice is load + [zext].

The wide load needs:
- a load.
* several shift.
* several truncates.
* several costly bit cast.

The * are what the slices eliminates.

- cost of slicing =  sum overall slices of load + [zext].
- cost of wide load = load + several shift + several truncates + several bit cast 

Then, the cost is adjusted according to isTruncateFree, isZExtFree, and hasPairedLoad.

The patch assumes that costly bit casts and loads are fairly more expensive than other operations.
So, we end up with:
1. Try to minimize the number of expensive operation (load + bit cast).
2. If this number is equal, try to minimize the number of other operations.

The complete summing is just 2. and also makes sense for code size :).

I agree this can be improved, especially with target hooks for the cost of bit cast and load (again we assume they are the same). I thought it was good enough for now and can be improved later.
Anyway, I am not a huge fan of making the cost model much smarter for now, because to me this is a Fermi problem. If we try to be more clever, I am afraid there are too much things we will miss to make it accurate anyway.

Do not misinterpret me, if we can come up with something better, I am for!

> And two nits:
> - There is some weird indenting in the patch. I would try running it through clang-format.
> - I saw some strange variable names like 'Nb' for 'Number'
Will do.

> Thanks for shifting this down to the DAG level. The whole thing looks really nice now. One crazy idea: Do you want to see if there is a vector load + extract that would suffice and throw it into the cost mix? The SLP vectorizer *might* get this, but honestly, it seems better to do here.
Honestly, I have thought about it :).
That would be nice but I did not want to tackle this at this time, because I think we will need to have a broader discussion about the cost model with the vectorizer experts!

> -Quentin
> On Sep 17, 2013, at 3:05 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>> After discussing with Chandler off-line, we decided to perform the load slicing at isel time.
>> Thus, this patch has been reverted in r190891.
>> -Quentin
>> On Sep 17, 2013, at 2:10 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>> On Mon, Aug 26, 2013 at 1:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>> After looking into SROA, Nadav and I agreed that it does the right thing.
>>> Therefore, the initial proposed patch is still the one to be reviewed.
>>> Very, very sorry for the late comment here. I misunderstood this comment and the result.
>>> I agree that SROA is correct here, and I also think that this should be the canonical form. With the LLVM memory model, it is very hard to merge two smaller loads back together if that is ever profitable. It is essentially impossible in many cases to merge two smaller stores back together if that is ever profitable. As such, it is very useful to preserve the widest known-safe load and store size as far as possible in the optimizer. At least getting it to the backend where the cost factor for various load and store operations is known is essential.
>>> Can we canonicalize toward the wide loads and stores with appropriate masking to extract narrow values, and then match these back to the small stores in the backend?
>>> Both SROA and optimal bitfield code generation rely on this (at least for x86) so changing it will regress some things.
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20130920/e09beb56/attachment.html>

More information about the llvm-commits mailing list