[PATCH][instcombine]: Slice a big load in two loads when the element are next to each other in memory.

Mon Oct 7 15:47:18 PDT 2013

On Oct 7, 2013, at 1:24 PM, Quentin Colombet <qcolombet at apple.com> wrote:

> Hi Owen, hi Chandler,
> 
> Owen, you may be a good match to review this patch, which is SDSel related and not instcombine related (as the subject says… it changes :)).

I’m not sure I like the cost model this represents.  It seems very x86-centric.  For example, some RISCs (PowerPC and AArch64, offhand) can do single-instruction paired-register bitfield extracts, which means that original input sequence was better for them (1 load + 1 extract, instead of 2 loads).  The issue is even worse on GPUs which generally have undifferentiated register files, so the bit cast and the extraction are both free (it just becomes a sub-register extraction), so you’ve turned one load into two to no benefit.  Even on x86, you’re probably best off trying to turn this into a <2 x i32> vector load + lane copy rather than splitting the load.

—Owen

> Chandler, I have not came up with a motivating example to support slicing over multiple loads. I would like to move forward with this patch and possibly rework the design later if we come up with a motivating example.
> What do you think?
> 
> Cheers,
> -Quentin
> 
> On Sep 30, 2013, at 11:38 AM, Quentin Colombet <qcolombet at apple.com> wrote:
> 
>> Ping?
>> 
>> -Quentin
>> 
>> On Sep 24, 2013, at 10:48 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>> 
>>> Hi,
>>> 
>>> Here is the new clang-formatted patch with the renaming Nb => Number.
>>> 
>>> Cheers,
>>> -Quentin
>>> 
>>> <DAGCombineLoadSlicing.svndiff>
>>> 
>>> On Sep 20, 2013, at 2:56 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>> 
>>>> Hi Chandler,
>>>> 
>>>> Thanks for the review!
>>>> 
>>>> On Sep 19, 2013, at 11:16 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>> 
>>>>> On Thu, Sep 19, 2013 at 10:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>>> Hi,
>>>>> 
>>>>> Here is the new patch.
>>>>> The optimization is now done as part of DAG combine.
>>>>> 
>>>>> The patch looks pretty much the same as its InstCombine counter part (SDNode, instead of Instruction) except that it introduces a new profitability model.
>>>>> Hence, when the transformation was done as InstCombine the idea  was to perform cannonicalization. Now, this is an optimization.
>>>>> 
>>>>> For this purpose, the patch introduces a cost model (represented by the helper class LoadedSlice::Cost), which accounts for the instructions that are saved by slicing a load compared to the original big load.
>>>>> It also introduces a new target hook, to query about load pairing capabilities (type and required alignment).
>>>>> Basically, this cost model counts how many truncates, shifts, etc, are saved with slicing compared to the number of introduced loads.
>>>>> It takes advantages of information like isTruncateFree, etc.
>>>>> 
>>>>> Currently the cost model is used in a very specific case: only two slices that are next to each other in memory.
>>>>> This limitation can be lifted as soon as we consider the cost model mature enough and when machine passes that deal with load pairing manage to handle more slices.
>>>>> 
>>>>> Thanks for your reviews!
>>>>> 
>>>>> I'm probably not the right person to do a detailed review (I just don't know the DAG that well) but some high level thoughts:
>>>>> 
>>>>> - I *really* like the design you're building here. I like the use of a concrete cost model, tracking register pairing, etc. That looks like serious goodness, and I'm excited to see it expanded.
>>>>> 
>>>>> - I wonder if it would help to explicitly shift the data structures to handle N slices used of M adjacent loads? I mention M adjacent loads because SROA (or user-authored code) will pre-slice stuff to only use target-legal integers (or 'int') in some cases. I expect it to be rare, but likely easy to handle the generality.
>>>> That may be a good idea indeed.
>>>> Do you have a motivating example that may drive this?
>>>>> 
>>>>> - It wasn't clear how you account for multiple uses. I'm specifically imagining the fact that a wide load may be used N times for N slices; the narrow loads for N-1 slices don't replace the wide load. It'll be important to keep an eye on the use # and not assume (for cost) that you simplify away things that have more uses anyways.
>>>> Let me rephrase to be sure I understood your point. Your concern here is that we may slice the wide load, but still have it afterwards, is that it?
>>>> 
>>>> If yes, this cannot happen, the slicing abort for each unknown uses. Thus, currently, a wide load is completely sliced or left untouched.
>>>>> 
>>>>> - The cost comparator doesn't really make sense to me. the early bias followed by summing is just strange, but maybe that's just me.
>>>> Right now, the cost model boils down to this.
>>>> Each slice spares:
>>>> - a truncate
>>>> - [a shift]
>>>> - [a costly bitcast]
>>>> and introduces:
>>>> - a load
>>>> - [a zext]
>>>> [] means optionally and costly bit cast means bit cast which source and destination are in different register banks.
>>>> 
>>>> Thus, the cost of a slice is load + [zext].
>>>> 
>>>> The wide load needs:
>>>> - a load.
>>>> * several shift.
>>>> * several truncates.
>>>> * several costly bit cast.
>>>> 
>>>> The * are what the slices eliminates.
>>>> 
>>>> Therefore:
>>>> - cost of slicing =  sum overall slices of load + [zext].
>>>> - cost of wide load = load + several shift + several truncates + several bit cast 
>>>> 
>>>> Then, the cost is adjusted according to isTruncateFree, isZExtFree, and hasPairedLoad.
>>>> 
>>>> The patch assumes that costly bit casts and loads are fairly more expensive than other operations.
>>>> So, we end up with:
>>>> 1. Try to minimize the number of expensive operation (load + bit cast).
>>>> 2. If this number is equal, try to minimize the number of other operations.
>>>> 
>>>> The complete summing is just 2. and also makes sense for code size :).
>>>> 
>>>> I agree this can be improved, especially with target hooks for the cost of bit cast and load (again we assume they are the same). I thought it was good enough for now and can be improved later.
>>>> Anyway, I am not a huge fan of making the cost model much smarter for now, because to me this is a Fermi problem. If we try to be more clever, I am afraid there are too much things we will miss to make it accurate anyway.
>>>> 
>>>> Do not misinterpret me, if we can come up with something better, I am for!
>>>> 
>>>>> 
>>>>> And two nits:
>>>>> - There is some weird indenting in the patch. I would try running it through clang-format.
>>>>> - I saw some strange variable names like 'Nb' for 'Number'
>>>> Will do.
>>>> 
>>>>> 
>>>>> Thanks for shifting this down to the DAG level. The whole thing looks really nice now. One crazy idea: Do you want to see if there is a vector load + extract that would suffice and throw it into the cost mix? The SLP vectorizer *might* get this, but honestly, it seems better to do here.
>>>> Honestly, I have thought about it :).
>>>> That would be nice but I did not want to tackle this at this time, because I think we will need to have a broader discussion about the cost model with the vectorizer experts!
>>>> 
>>>> Cheers,
>>>> -Quentin
>>>>> 
>>>>> 
>>>>> -Quentin
>>>>> 
>>>>> 
>>>>> On Sep 17, 2013, at 3:05 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>>> 
>>>>>> After discussing with Chandler off-line, we decided to perform the load slicing at isel time.
>>>>>> Thus, this patch has been reverted in r190891.
>>>>>> 
>>>>>> -Quentin
>>>>>> 
>>>>>> On Sep 17, 2013, at 2:10 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Aug 26, 2013 at 1:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>>>>> After looking into SROA, Nadav and I agreed that it does the right thing.
>>>>>>> Therefore, the initial proposed patch is still the one to be reviewed.
>>>>>>> 
>>>>>>> Very, very sorry for the late comment here. I misunderstood this comment and the result.
>>>>>>> 
>>>>>>> I agree that SROA is correct here, and I also think that this should be the canonical form. With the LLVM memory model, it is very hard to merge two smaller loads back together if that is ever profitable. It is essentially impossible in many cases to merge two smaller stores back together if that is ever profitable. As such, it is very useful to preserve the widest known-safe load and store size as far as possible in the optimizer. At least getting it to the backend where the cost factor for various load and store operations is known is essential.
>>>>>>> 
>>>>>>> Can we canonicalize toward the wide loads and stores with appropriate masking to extract narrow values, and then match these back to the small stores in the backend?
>>>>>>> 
>>>>>>> Both SROA and optimal bitfield code generation rely on this (at least for x86) so changing it will regress some things.
>>>>>> 
>>>>>> _______________________________________________
>>>>>> llvm-commits mailing list
>>>>>> llvm-commits at cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>> 
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>> 
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20131007/200d8b34/attachment.html>