[PATCH] [PATCH][SROA] Also slice the STORE when slicing a LOAD in AllocaSliceRewriter

Hao Liu Hao.Liu at arm.com
Mon Sep 8 18:21:41 PDT 2014


Hi all,

 

ping…

I agree with Jiangning. I'm still confused about why the wide-store version is better than the narrow-stores version. If we think the wide store should eventually be optimized into the narrow-stores version, why don't we optimize it in SROA at the very beginning, when we generate the SHL/ZEXT/OR?

 

Thanks,

-Hao

 

From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-bounces at cs.uiuc.edu] On Behalf Of Jiangning Liu
Sent: Wednesday, September 03, 2014 12:58 PM
To: James Molloy
Cc: Commit Messages and Patches for LLVM; reviews+D4954+public+39a520765005fb8e at reviews.llvm.org
Subject: Re: [PATCH] [PATCH][SROA] Also slice the STORE when slicing a LOAD in AllocaSliceRewriter

 

Hi James,

 

Sorry, I'm still not convinced here.

2014-09-02 18:46 GMT+08:00 James Molloy <james at jamesmolloy.co.uk>:

Hi all,

 

I've just thrashed this out further with Chandler on IRC, and thought I'd summarize what came up here. Chandler will rebut if I make a mistake, I'm sure :)

 

The main point is that there are instcombines and a bunch of arithmetic-simplifying optimizations that rely on SROA producing wide stores. In fact, it isn't the width of the store that matters; it is that there is a merged SSA node.

 

%a = i32 ...
%b = i32 ...
%a.ext = zext i32 %a to i64
%c = shl i64 %a.ext, 32
%b.ext = zext i32 %b to i64
%d = or i64 %c, %b.ext
store i64 %d, i64* %p

 

%e = load i64* %p

...

 

In the above example, optimizers can follow the load of %p back through the store and end up with the two constituent i32s, which they can then do magic with; a sketch of that recovery follows.
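For illustration (this sketch is mine, not from the patch; %hi.bits, %hi, and %lo are illustrative names): once store-to-load forwarding establishes that %e equals %d, the two halves can be peeled back out with pure arithmetic:

%hi.bits = lshr i64 %e, 32
%hi = trunc i64 %hi.bits to i32   ; recovers %a
%lo = trunc i64 %e to i32         ; recovers %b

In the alternative scenario: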

 

%a = i32 ...
%b = i32 ...
%q = bitcast i64* %p to i32*
%q0 = getelementptr i32* %q, i32 0
%q1 = getelementptr i32* %a, i32 1
store i32 %a, i32* %q0
store i32 %b, i32* %q1

 

I think this is a typo; it should be %q rather than %a in the second getelementptr.

 

 

%e = load i64* %p

...

 

The optimizer can no longer easily see that %e is the concatenation of %a and %b.

 

Why?

With this LLVM IR sequence, I think it is even easier to see that %e is the combination of %a and %b than with the SHIFT+OR solution, because the two separate stores use two sequential getelementptr results, i.e. (i32* %q, i32 0) and (i32* %q, i32 1).

 

If InstCombine can cover the SHIFT+OR case, why can't it cover the separate-stores case as well? I think the latter is even easier.
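For concreteness, here is roughly the pattern such a combine would have to recognize (a sketch, assuming a little-endian target where the store at offset 0 supplies the low 32 bits; the %a.ext, %b.ext, %b.shl, and %e.val names are illustrative):

%q = bitcast i64* %p to i32*
%q1 = getelementptr i32* %q, i32 1
store i32 %a, i32* %q
store i32 %b, i32* %q1
%e = load i64* %p
; to forward %e, the combine must prove the two stores exactly cover
; the loaded range, then rebuild the wide value:
%a.ext = zext i32 %a to i64
%b.ext = zext i32 %b to i64
%b.shl = shl i64 %b.ext, 32
%e.val = or i64 %a.ext, %b.shl   ; %e can then be replaced by %e.val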

 

Thanks,

-Jiangning

 

 

This is an important property, and it is the main reason for not splitting wide stores to match their loads. Most importantly, this optimization happens in InstCombine, and we run InstCombine after SLP and Loop Vectorization, which means the IR needs to stay in this form at least until the end of vectorization (otherwise we lose this optimization after vectorizing).

 

This in turn means we need to teach the vectorizers and the code metrics about the cost of splitting concatenated stores, which is likely to be awkward, but we are left with little choice.

 

Cheers,

 

James

 

On 2 September 2014 11:20, Jiangning Liu <liujiangning1 at gmail.com> wrote:

Hi Chandler,

 

 

Once you start slicing up memory accesses, *you break SSA form* and all of the analyses that depend on it. I cannot express how strongly I feel this is a very bad idea and the wrong direction in the middle end.

 

 

What do you mean by "you break SSA form" for the case Hao's patch is solving? Do you mean some SSA-form information could be lost?

 

The transformation in Hao's patch changes a single wide store into two separate narrow stores, but the addresses of those two narrow stores are still sequential (see the sketch below). For the memory stored to here, we don't have any SSA information attached at all, right?
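In IR terms, the rewrite is roughly the following (my sketch of its shape, reusing the names from James's examples above; %d.lo and %d.hi are illustrative names, and the layout assumes a little-endian target):

; before: one wide store
store i64 %d, i64* %p
; after: two narrow stores to adjacent addresses
%d.lo = trunc i64 %d to i32
%d.hi.bits = lshr i64 %d, 32
%d.hi = trunc i64 %d.hi.bits to i32
%q = bitcast i64* %p to i32*
%q1 = getelementptr i32* %q, i32 1
store i32 %d.lo, i32* %q
store i32 %d.hi, i32* %q1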

 

So what SSA form information could be lost? And what optimization could be affected? Can you give an example?

 

Thanks,

-Jiangning 

 

 