[PATCH][SROA] Also slice the STORE when slicing a LOAD in AllocaSliceRewriter

James Molloy james at jamesmolloy.co.uk
Tue Sep 2 02:26:40 PDT 2014


Hi Chandler,

I'm not an expert on the code that SROA produces, but I fail to see why
splitting a wide store into two narrower stores fundamentally loses
information. The information that is "lost" is that the stores are
consecutive and thus can be merged together. But we already reconstruct
that information in the compiler, both in the SLP vectorizer and the DAG
combiner. If we can't do that for narrow stores coming out of SROA then
surely that is a defect in code that already exists.

But that, I think, is a poor argument either way. If we canonicalise on one
form (wide or narrow) then the rest of the compiler needs to deal with that
form and that is "just coding work" (splitting user-given wide stores into
narrow stores, or making the backend deal nicely with wide stores). What I
think sways the argument for me is that wide stores add a constraint to the
IR in order to retain semantic information. I don't think that's the right
way forward.

We have mechanisms for keeping semantic information alongside IR. Metadata
and intrinsics both provide this, and in other cases the semantics can be
rediscovered by analysis later. The wide-store case introduces operators
that we expect to be blown away in the backend, and that makes the IR
diverge from the expected generated code, which will confuse anything that
relies upon the IR as a heuristic - the inliner, jump threader, vectorizer,
unroller...
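Concretely, the two forms under discussion look roughly like this (an
illustrative sketch only; value names are made up, and the narrow form
assumes the target's in-memory field order, e.g. little-endian):

```llvm
; Wide form: the two fields are stitched together with zext/shl/or
; and written with a single i64 store.
%lo64 = zext i32 %lo to i64
%hi64 = zext i32 %hi to i64
%hish = shl i64 %hi64, 32
%wide = or i64 %lo64, %hish
store i64 %wide, i64* %p

; Narrow form: two adjacent i32 stores, no synthetic operators.
%p0 = bitcast i64* %p to i32*
%p1 = getelementptr i32, i32* %p0, i64 1
store i32 %lo, i32* %p0
store i32 %hi, i32* %p1
```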

Here is a case where narrow stores may make a difference:

uint64_t *p = ...;
uint32_t a = ...;
for (x : ...) {
    p[x] = a | ((uint64_t)x << 32);
}

That loop is a perfect candidate for an interleaving store such as NEON's
VST2/ST2 - one stream performing a splat store to strided 32-bit addresses
and the other storing the induction variable to the alternate strided
32-bit addresses.

With a wide store, the loop vectorizer would just see the shift and OR, and
arrive at completely different code. In the case where a wide store would
make a difference, we've already got code at least in the SLP vectorizer
(LV too? can't remember) to identify consecutive loads and stores and take
the stitching of them together into account in the cost model.

So my (biased) summary would be:

*Wide stores*
  + Preserve semantic information about consecutive/wide accesses
  + Users can already write them, so we have to handle them somehow anyway

*Narrow stores*
  + Fewer constraints in the IR, provides more flexibility for optimizers
  + IR closer matches expected generated machine code - IR-based heuristics
more accurate
  - Have to write code to split up wide stores into narrow stores, if
deemed useful (if they come from an OR/SHL?)
  - Have to reconstruct consecutive-pointer information; we already do this,
but it has the potential to fail in some cases.

*An alternative?*
  * If the above hasn't convinced you, how about an intrinsic that
concatenates operands into memory? This could preserve the semantics and
also can be inspected and treated differently in the vectorizers (and
doesn't require an OR/SHL sequence).

      declare void @llvm.store.wide.i64(i64* %ptr, ...)
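A call to such a (purely hypothetical) intrinsic for the loop above might
then look something like:

```llvm
; Hypothetical: concatenate the 32-bit operands into the pointed-to i64,
; low operand first; vectorizers could pattern-match this call directly.
call void (i64*, ...) @llvm.store.wide.i64(i64* %p, i32 %a, i32 %x)
```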

Cheers,

James


On 2 September 2014 08:27, Chandler Carruth <chandlerc at gmail.com> wrote:

>
> On Tue, Sep 2, 2014 at 12:11 AM, Hao Liu <Hao.Liu at arm.com> wrote:
>
>> Hi James & Chandler,
>>
>>
>>
>> I have two small test cases to show James' first concern. Test results
>> show the loop vectorizer generates quite poor code for the wide store.
>> The results can be seen with the following command lines:
>>
>> opt -S -loop-vectorize < wide-store.ll
>>
>> opt -S -loop-vectorize < narrow-stores.ll
>>
>>
>>
>> The wide-store.ll and narrow-stores.ll are generated from the attached
>> struct.cpp, with and without my patch. This cpp case is simplified
>> from a hot function in SPEC CPU 2006 473.astar. Currently the poor code
>> hurts performance.
>>
>>
>>
>>
>>
>> Hi Chandler,
>>
>>
>>
>> I also agree with your concern. On the other hand, if the input is
>> zext/shl/or and a wide store, the patch in SROA cannot handle such a case.
>> For example, if the input is wide-store.ll, only a separate pass or
>> function specific to such cases can generate simpler code.
>>
>> But there is a conflict: even if we add code in the backend, we still
>> can't solve the problem of the wide store affecting the Loop
>> Vectorization issue. For this reason, I think maybe we prefer narrow
>> stores to wide stores.
>>
>
> Before I dig into trying to explain various ways it is or isn't possible
> to generate better code with the wide stores, I think it is really
> important to understand why you aren't concerned about the memory model
> implications here which cause us to *lose information* in the IR when
> splitting stores. Once fundamental invariants of the program are lost, they
> simply cannot be recovered. This seems to me to be the overriding concern.
> The fact that we need to improve lots of other parts of LLVM -- well, yes,
> we need to do lots of improvements to LLVM.
>
> And none of these improvements seem bad. A user could just as easily have
> written this kind of wide store in their code, and we will fail to optimize
> it in all the ways you outline. No changes to SROA will fix this. We can
> only emit efficient code when the *user* provides a wide store by actually
> teaching the optimizer to analyze and emit efficient code for it. Once we
> do that, we have also solved the "problem" for SROA.
>