[PATCH][SROA] Also slice the STORE when slicing a LOAD in AllocaSliceRewriter

Jiangning Liu liujiangning1 at gmail.com
Tue Sep 2 21:57:38 PDT 2014


Hi James,

Sorry, I'm still not convinced here.

2014-09-02 18:46 GMT+08:00 James Molloy <james at jamesmolloy.co.uk>:

> Hi all,
>
> I've just thrashed this out further with Chandler on IRC, and thought I'd
> summarize what came up here. Chandler will rebut if I make a mistake, I'm
> sure :)
>
> The main point is that there are instcombines and a bunch of
> arithmetic-simplifying optimizations that rely on SROA producing wide
> stores. In fact, it isn't the fact that the store is wide that matters, it
> is that there is a merged SSA node.
>
> %a = i32 ...
> %b = i32 ...
> %c = shl i64 (zext i32 %a to i64), 32
> %d = or i64 %c, (zext i32 %b to i64)
> store i64 %d, i64* %p
>
> %e = load i64* %p
> ...
>
> In the above example, optimizers can follow the load of %p back through the
> store and end up with the two constituent i32's, which they can then do
> magic with. In the alternative scenario:
>
> %a = i32 ...
> %b = i32 ...
> %q = bitcast i64* %p to i32*
> store i32 %a, i32* (getelementptr i32* %q, i32 0)
> store i32 %b, i32* (getelementptr i32* %a, i32 1)
>

I think this is a typo; it should be %q rather than %a.
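That is, presumably the second store was meant to read:

  store i32 %b, i32* (getelementptr i32* %q, i32 1)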


>
> %e = load i64* %p
> ...
>
> The optimizer can no longer easily see that %e is the concatenation of %a
> and %b.
>

Why?

With this LLVM IR sequence, I think it is even easier to see that the value
at %p is the combination of %a and %b than with the SHIFT+OR solution,
because the two separate stores use two sequential getelementptr results,
i.e. (i32* %q, i32 0) and (i32* %q, i32 1).

If InstCombine can cover the SHIFT+OR case, why can't it cover the
separate-stores case as well? I think the latter is even easier.
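
To make the comparison concrete, here is a rough hand-written sketch of the
two forms side by side (the names %a.ext, %b.ext and %q.hi are just mine for
illustration, not what SROA would actually emit):

  ; wide-store form (what SROA produces today)
  %a.ext = zext i32 %a to i64
  %c = shl i64 %a.ext, 32
  %b.ext = zext i32 %b to i64
  %d = or i64 %c, %b.ext
  store i64 %d, i64* %p
  %e = load i64* %p        ; forwarding can replace %e with %d directly

  ; split-store form (with the patch)
  %q = bitcast i64* %p to i32*
  %q.hi = getelementptr i32* %q, i32 1
  store i32 %a, i32* %q
  store i32 %b, i32* %q.hi
  %e = load i64* %p        ; the load now overlaps two narrow stores

In both cases the layout is obvious from the address arithmetic; the question
is just whether InstCombine can reassemble %e from the two adjacent i32
stores as easily as it can from the single i64 SSA value %d.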

Thanks,
-Jiangning


> This is an important property, and is the main reason for not splitting
> wide stores to match their loads. Most importantly, this optimization
> happens in InstCombine and we run InstCombine after SLP and Loop
> Vectorization, which means the IR should be in this form at least up until
> the end of vectorization (or we lose this optimization after vectorization).
>
> This then means that we need to teach the vectorizers and code metrics
> about the cost of splitting concatenated stores, which is likely to be
> awkward but we are left with little choice.
>
> Cheers,
>
> James
>
>
> On 2 September 2014 11:20, Jiangning Liu <liujiangning1 at gmail.com> wrote:
>
>> Hi Chandler,
>>
>>
>>> Once you start slicing up memory accesses, *you break SSA form* and all
>>> of the analyses that depend on it. I cannot express how strongly I feel
>>> this is a very bad idea and the wrong direction in the middle end.
>>>
>>>
>> What do you mean by "you break SSA form" for the case Hao's patch is
>> solving? Do you mean some SSA form info could be lost?
>>
>> The transformation of Hao's patch is to change a single wide store to two
>> separate narrow stores, but the addresses of those two narrow stores are
>> still sequential. For the memory stored to here, we don't have any SSA
>> information attached at all, right?
>>
>> So what SSA form information could be lost? And what optimization could
>> be affected? Can you give an example?
>>
>> Thanks,
>> -Jiangning
>>
>
>