[PATCH][SROA] Also slice the STORE when slicing a LOAD in AllocaSliceRewriter

James Molloy james at jamesmolloy.co.uk
Fri Sep 12 05:32:20 PDT 2014


Hi Chandler,

I had a discussion with Jiangning about this based upon our chat on IRC,
and he asked a question I couldn't answer, so possibly you could?

On IRC you asserted that we need the wide store so that instcombines can
see that the store was made up of two constituent values. Why is this the
case? Why can't memdep get you the same information with two narrow stores?

Surely a query such as "what does this i64 load depend on?" would return
"these two i32 stores"? If this is not the case, why not? And if it is the
case, why can't instcombine use this information to dive across load/store
boundaries?
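
For concreteness, the narrow-store shape I have in mind is roughly this
(hand-written IR just to illustrate the question, not taken from the patch):

  %q = bitcast i64* %p to i32*
  store i32 %a, i32* %q
  %q1 = getelementptr i32* %q, i32 1
  store i32 %b, i32* %q1
  ...
  %e = load i64* %p    ; "what does this i64 load depend on?"

If memdep reports both i32 stores as the dependencies of %e, it seems like
instcombine ought to be able to reconstruct %e from %a and %b here just as it
does when it looks through the single wide store.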

Cheers,

James

On 9 September 2014 02:21, Hao Liu <Hao.Liu at arm.com> wrote:

> Hi all,
>
>
>
> ping…
>
> I agree with Jiangning. I'm still confused about why the wide-store version
> is better than the narrow-stores version. If we think the wide store should
> eventually be optimized into the narrow-stores version, why don't we optimize
> it in SROA at the very beginning, when we generate the SHL/ZEXT/OR?
>
>
>
> Thanks,
>
> -Hao
>
>
>
> *From:* llvm-commits-bounces at cs.uiuc.edu [mailto:
> llvm-commits-bounces at cs.uiuc.edu] *On Behalf Of *Jiangning Liu
> *Sent:* Wednesday, September 03, 2014 12:58 PM
> *To:* James Molloy
> *Cc:* Commit Messages and Patches for LLVM;
> reviews+D4954+public+39a520765005fb8e at reviews.llvm.org
> *Subject:* Re: [PATCH] [PATCH][SROA]Also slice the STORE when slicing a
> LOAD in AllocaSliceRewriter
>
>
>
> Hi James,
>
>
>
> Sorry, I'm still not convinced here.
>
> 2014-09-02 18:46 GMT+08:00 James Molloy <james at jamesmolloy.co.uk>:
>
> Hi all,
>
>
>
> I've just thrashed this out further with Chandler on IRC, and thought I'd
> summarize what came up here. Chandler will rebut if I make a mistake, I'm
> sure :)
>
>
>
> The main point is that there are instcombines and a bunch of
> arithmetic-simplifying optimizations that rely on SROA producing wide
> stores. In fact, it isn't the width of the store itself that matters; it is
> that there is a merged SSA node.
>
>
>
> %a = i32 ...
> %b = i32 ...
> %a.ext = zext i32 %a to i64
> %c = shl i64 %a.ext, 32
> %b.ext = zext i32 %b to i64
> %d = or i64 %c, %b.ext
> store i64 %d, i64* %p
>
> %e = load i64* %p
> ...
>
>
>
> In the above example, optimizers can follow the load of %p back through the
> store and end up with the two constituent i32 values, which they can then do
> magic with.
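>
> For instance (a rough sketch of the kind of fold I mean, not necessarily
> InstCombine's exact output), uses of %e that pick out the two halves can be
> folded straight back to the original values:
>
> %lo = trunc i64 %e to i32        ; folds to %b
> %hi.s = lshr i64 %e, 32
> %hi = trunc i64 %hi.s to i32     ; folds to %a
>
> In the alternative scenario: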
>
>
>
> %a = i32 ...
>
> %b = i32 ...
>
> %q = bitcast i64* %p to i32*
>
> store i32* (getelementptr i32* %q, i32 0), i32 %a
>
> store i32* (getelementptr i32* %a, i32 1), i32 %b
>
>
>
> I think this is a typo; it should be %q rather than %a.
>
>
>
>
>
> %e = load i64* %p
>
> ...
>
>
>
> The optimizer can no longer see easily that %e is the concatenation of %a
> and %b.
>
>
>
> Why?
>
> With this LLVM IR sequence, I think it is even easier to see that the value
> at %p is the combination of %a and %b than with the SHIFT+OR solution,
> because the two separate stores use two sequential getelementptr results,
> i.e. (i32* %q, i32 0) and (i32* %q, i32 1).
>
>
>
> If InstCombine can cover the SHIFT+OR case, why can't it cover the
> separate-stores case as well? I think the latter is even easier.
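>
> Concretely, the fold would have to look roughly like this (just a sketch on
> my part, not a transform InstCombine has today as far as I know; I'm assuming
> a little-endian target, so the store at offset 0 supplies the low 32 bits of
> the loaded i64):
>
> %q = bitcast i64* %p to i32*
> %q1 = getelementptr i32* %q, i32 1
> store i32 %a, i32* %q             ; bytes 0-3 of %p
> store i32 %b, i32* %q1            ; bytes 4-7 of %p
> %e = load i64* %p
>
> ; what the fold would need to produce in place of the load:
> %a.ext = zext i32 %a to i64
> %b.ext = zext i32 %b to i64
> %b.shl = shl i64 %b.ext, 32
> %e2 = or i64 %b.shl, %a.ext
>
> Proving that the two stores cover the load exactly looks straightforward
> here, since both addresses are sequential getelementptrs off the same base.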
>
>
>
> Thanks,
>
> -Jiangning
>
>
>
>
>
> This is an important property, and is the main reason for not splitting
> wide stores to match their loads. Most importantly, this optimization
> happens in InstCombine and we run InstCombine after SLP and Loop
> Vectorization, which means the IR should be in this form at least up until
> the end of vectorization (or we lose this optimization after vectorization).
>
>
>
> This then means that we need to teach the vectorizers and code metrics
> about the cost of splitting concatenated stores, which is likely to be
> awkward, but we are left with little choice.
>
>
>
> Cheers,
>
>
>
> James
>
>
>
> On 2 September 2014 11:20, Jiangning Liu <liujiangning1 at gmail.com> wrote:
>
> Hi Chandler,
>
>
>
>
>
> Once you start slicing up memory accesses, *you break SSA form* and all of
> the analyses that depend on it. I cannot express how strongly I feel this
> is a very bad idea and the wrong direction in the middle end.
>
>
>
>
>
> What do you mean by "you break SSA form" for the case Hao's patch is
> solving? Do you mean some SSA form info could be lost?
>
>
>
> The transformation in Hao's patch changes a single wide store into two
> separate narrow stores, but the addresses of those two narrow stores are
> still sequential. For the memory being stored to, we don't have any SSA
> information attached at all, right?
>
>
>
> So what SSA form information could be lost? And what optimization could be
> affected? Can you give an example?
>
>
>
> Thanks,
>
> -Jiangning
>
>
>
>
>