[PATCH][instcombine]: Slice a big load into two loads when the elements are next to each other in memory.

Chandler Carruth chandlerc at google.com
Thu Sep 19 23:16:28 PDT 2013


On Thu, Sep 19, 2013 at 10:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:

> Hi,
>
> Here is the new patch.
> The optimization is now done as part of DAG combine.
>
> The patch looks pretty much the same as its InstCombine counterpart
> (SDNode instead of Instruction), except that it introduces a new
> profitability model.
> Indeed, when the transformation was done in InstCombine, the idea was
> to perform canonicalization; now, this is an optimization.
>
> For this purpose, the patch introduces a cost model (represented by the
> helper class LoadedSlice::Cost), which accounts for the instructions that
> are saved by slicing a load compared to the original big load.
> It also introduces a new target hook to query the target's load pairing
> capabilities (type and required alignment).
> Basically, this cost model counts how many truncates, shifts, etc., are
> saved by slicing, compared to the number of loads introduced.
> It takes advantage of information like isTruncateFree, etc.
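>
> Roughly, the bookkeeping has this shape (a simplified sketch; the
> actual fields of LoadedSlice::Cost in the patch may differ):
>
>   // Per-slice account of instructions introduced vs. saved.
>   struct Cost {
>     unsigned Loads;     // narrow loads introduced by the slicing
>     unsigned Truncates; // truncates saved by the slicing
>     unsigned Shifts;    // shifts saved by the slicing
>     Cost() : Loads(0), Truncates(0), Shifts(0) {}
>   };
>   // Slicing is deemed profitable when the saved instructions
>   // (weighted by hooks like isTruncateFree) outweigh the extra loads.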
>
> Currently, the cost model is used in a very specific case: only two
> slices that are next to each other in memory.
> This limitation can be lifted once we consider the cost model mature
> enough and the machine passes that deal with load pairing manage to
> handle more slices.
>
> Thanks for your reviews!
>

I'm probably not the right person to do a detailed review (I just don't
know the DAG that well), but here are some high-level thoughts:

- I *really* like the design you're building here. I like the use of a
concrete cost model, tracking register pairing, etc. That looks like
serious goodness, and I'm excited to see it expanded.

- I wonder if it would help to explicitly shift the data structures to
handle N slices used out of M adjacent loads? I mention M adjacent loads
because SROA (or user-authored code) will pre-slice things to use only
target-legal integers (or 'int') in some cases. I expect that to be rare,
but it is likely easy to handle in full generality.

- It wasn't clear how you account for multiple uses. I'm specifically
imagining the case where a wide load is used N times for N slices; the
narrow loads for N-1 of those slices don't replace the wide load. It'll be
important to keep an eye on the use count and not assume (for the cost)
that you simplify away things that still have other uses.
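
For instance (illustrative C++; 'consume' is just a stand-in for any
extra use of the wide value), in code like:

  #include <cstdint>

  void consume(uint64_t);  // some additional use of the wide value

  void f(const uint64_t *P) {
    uint64_t Wide = *P;
    uint32_t Lo = (uint32_t)Wide;          // slice 1
    uint32_t Hi = (uint32_t)(Wide >> 32);  // slice 2
    consume(Wide);                         // the wide load must stay
    // ... use Lo and Hi ...
  }

slicing the two halves adds two narrow loads but cannot remove the wide
load, so the saved truncate and shift have to pay for two extra loads.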

- The cost comparator doesn't really make sense to me. The early bias
followed by summing is just strange, but maybe that's just me.

And two nits:
- There is some weird indenting in the patch. I would try running it
through clang-format.
- I saw some strange variable names, like 'Nb' for 'Number'.

Thanks for shifting this down to the DAG level. The whole thing looks
really nice now. One crazy idea: do you want to see if there is a vector
load + extract that would suffice, and throw it into the cost mix? The SLP
vectorizer *might* get this, but honestly, it seems better to do it here.
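
Something with roughly this shape (illustrative C++ using Clang's vector
extension; not a concrete proposal):

  #include <cstdint>

  typedef uint32_t v2u32 __attribute__((vector_size(8)));

  void f(const void *P, uint32_t *Out) {
    v2u32 V = *(const v2u32 *)P;  // one vector load
    Out[0] = V[0];                // lane extract instead of truncate
    Out[1] = V[1];                // lane extract instead of shift+truncate
  }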


> -Quentin
>
>
> On Sep 17, 2013, at 3:05 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> After discussing with Chandler off-line, we decided to perform the load
> slicing at isel time.
> Thus, this patch has been reverted in r190891.
>
> -Quentin
>
> On Sep 17, 2013, at 2:10 PM, Chandler Carruth <chandlerc at google.com>
> wrote:
>
>
> On Mon, Aug 26, 2013 at 1:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>> After looking into SROA, Nadav and I agreed that it does the right thing.
>> Therefore, the initial proposed patch is still the one to be reviewed.
>>
>
> Very, very sorry for the late comment here. I misunderstood this comment
> and the result.
>
> I agree that SROA is correct here, and I also think that this should be
> the canonical form. With the LLVM memory model, it is very hard to merge
> two smaller loads back together if that is ever profitable. It is
> essentially impossible in many cases to merge two smaller stores back
> together if that is ever profitable. As such, it is very useful to preserve
> the widest known-safe load and store size as far as possible in the
> optimizer. At least getting it to the backend where the cost factor for
> various load and store operations is known is essential.
>
> Can we canonicalize toward the wide loads and stores with appropriate
> masking to extract narrow values, and then match these back to the small
> stores in the backend?
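>
> That is, something of this shape as the canonical form (illustrative
> C++, assuming little-endian):
>
>   #include <cstdint>
>
>   void f(const void *P, uint16_t *Out) {
>     // One wide load; narrow values extracted by mask and shift.
>     uint32_t Wide = *(const uint32_t *)P;
>     Out[0] = (uint16_t)(Wide & 0xFFFF); // low half via mask
>     Out[1] = (uint16_t)(Wide >> 16);    // high half via shift
>   }
>
> and then let the backend split it back into narrow accesses where the
> target's costs say that is better.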
>
> Both SROA and optimal bitfield code generation rely on this (at least for
> x86), so changing it will regress some things.
>
>

