<div dir="ltr">Hi Chandler,<div><br></div><div>I'm not an expert on the code that SROA produces, but I fail to see why splitting a wide store into two narrower stores fundamentally loses information. The information that is "lost" is that the stores are consecutive and thus can be merged together. But we already reconstruct that information in the compiler, both in the SLP vectorizer and the DAG combiner. If we can't do that for narrow stores coming out of SROA then surely that is a defect in code that already exists.</div>

<div><br></div><div>But that, I think, is a poor argument either way. If we canonicalise on one form (wide or narrow) then the rest of the compiler needs to deal with that form and that is "just coding work" (splitting user-given wide stores into narrow stores, or making the backend deal nicely with wide stores). What I think sways the argument for me is that wide stores add a constraint to the IR in order to retain semantic information. I don't think that's the right way forward.</div>

<div><br></div><div>We have mechanisms for keeping semantic information alongside IR. Metadata and intrinsics both provide this, and in other cases the semantics can be rediscovered by analysis later. The wide store case introduces operators that we expect to be blown away in the backend, and that makes the IR diverge from the expected generated code which will confuse anything that relies upon the IR as a heuristic - the inliner, jump threader, vectorizer, unroller...</div>

<div><br></div><div>Here is a case where narrow stores may make a difference:</div><div><br></div><div>uint64_t *p = ...;</div><div>uint32_t a = ...;</div><div>for (x : ...) {</div><div>    p[x] = a | (x << 32UL);</div>

<div>}</div><div><br></div><div>That loop is a perfect candidate for two interleaving store instructions, such as NEON's (V)ST.2 - one performing a splat store to strided 32-bit addresses and the other performing a store of the induction variable to alternate strided 32-bit addresses. With some </div>

<div><br></div><div>With a wide store, the loop vectorizer would just see the shift and OR, and arrive at completely different code. In the case where a wide store would make a difference, we've already got code at least in the SLP vectorizer (LV too? can't remember) to identify consecutive loads and stores and take the stitching of them together into account in the cost model.</div>

<div><br></div><div>So my (biased) summary would be:</div><div><br></div><div><b>Wide stores</b></div><div>  + Preserve semantic information about consecutive/wide accesses</div><div>  + Users can already write them, so we have to handle them somehow anyway</div>

<div><br></div><div><b>Narrow stores</b></div><div>  + Fewer constraints in the IR, provides more flexibility for optimizers</div><div>  + IR closer matches expected generated machine code - IR-based heuristics more accurate</div>

<div>  - Have to write code to split up wide stores into narrow stores, if deemed useful (if they come from an OR/SHL?)</div><div>  - Have to reconstruct consecutive pointer information, we already do this but has the potential to fail in some cases.</div>

<div><br></div><div><b>An alternative?</b></div><div>  * If the above hasn't convinced you, how about an intrinsic that concatenates operands into memory? This could preserve the semantics and also can be inspected and treated differently in the vectorizers (and doesn't require an OR/SHL sequence).</div>

<div><br></div><div>      declare void llvm.store.wide.i64(i64* %ptr, ...)</div><div><br></div><div>Cheers,</div><div><br></div><div>James</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 2 September 2014 08:27, Chandler Carruth <span dir="ltr"><<a href="mailto:chandlerc@gmail.com" target="_blank">chandlerc@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class=""><br><div class="gmail_quote">On Tue, Sep 2, 2014 at 12:11 AM, Hao Liu <span dir="ltr"><<a href="mailto:Hao.Liu@arm.com" target="_blank">Hao.Liu@arm.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Hi James & Chandler,<u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">I have two small test cases to show James’ first concern. Test results show loop vectorizor generates quite poor code for wide store. To see the result by following command lines:<u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">opt –S –loop-vectorize < wide-store.ll<u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">opt –S –loop-vectorize < narrow-stores.ll<u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">The wide-store.ll and narrow-stores.ll are generated from attached struct.cpp by with or without the my patch. This cpp case is simplified from a hot function in SPEC CPU 2006 473.astar. Currently the poor code affects the performance.<u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Hi Chandler,<u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">I also agree with your concern. On the other hand, If the input is zext/shl/or and a wide store, the patch in SROA can not handle such case. For example, if the input is wide-store.ll, only a separate pass or function specific to handle such case can generate simpler code.<u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">But there is a conflict, even though we add code in the backend, we still can’t solve the problem about the wide-store affecting the Loop Vectorization issue. For this concern, I think maybe we prefer narrow stores than wide store.</span></p>

</blockquote></div><br></div>Before I dig into trying to explain various ways it is or isn't possible to generate better code with the wide stores, I think it is really important to understand why you aren't concerned about the memory model implications here which cause us to *lose information* in the IR when splitting stores. Once fundamental invariants of the program are lost, they simply cannot be recovered. This seems to me to be the overriding concern. The fact that we need to improve lots of other parts of LLVM -- well, yes, we need to do lots of improvements to LLVM.</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">And none of these improvements seem bad. A user could just as easily have written this kind of wide store in their code, and we will fail to optimize it in all the ways you outline. No changes to SROA will fix this. We can only emit efficient code when the *user* provides a wide store by actually teaching the optimizer to analyze and emit efficient code for it. Once we do that, we have also solved the "problem" for SROA.</div>

</div>

</blockquote></div><br></div>