<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Sep 2, 2014 at 2:26 AM, James Molloy <span dir="ltr"><<a href="mailto:james@jamesmolloy.co.uk" target="_blank">james@jamesmolloy.co.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>So my (biased) summary would be:</div><div><br></div><div><b>Wide stores</b></div><div> + Preserve semantic information about consecutive/wide accesses</div><div> + Users can already write them, so we have to handle them somehow anyway</div></blockquote><div><br></div><div>You missed:</div><div> - We don't currently handle them well in all cases in the vectorizer or in the code generator.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
> Narrow stores
>  + Fewer constraints in the IR, provides more flexibility for optimizers

I don't understand this at all. The optimizer has *less* flexibility here.

Perhaps what you mean to say is that the optimizer already tends to
generate good code for these? That much is true.

> + IR more closely matches the expected machine code, so IR-based
>   heuristics are more accurate

I mean, sure. But this seems pretty insignificant to me. I don't
understand why jump threading would care, and I don't think the inliner
would care enough for it to ever matter.

If you want to tilt at this windmill, there are piles of places where we
diverge more wildly: the existence of bitcasts, for example, or any of the
illegal operations on vector types that cause an explosion of machine code
during legalization.
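As a concrete (hypothetical) instance of the latter, consider:

  %r = mul <3 x i64> %x, %y

That is a single valid IR instruction, but <3 x i64> is not a legal type
on most targets, so legalization widens and/or scalarizes it into a far
longer machine-code sequence than the IR suggests.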
> - Have to write code to split up wide stores into narrow stores, if
>   deemed useful (if they come from an OR/SHL?)

And fix all of the *myriad* places where we suddenly stop re-combining
this arithmetic later on. Passes like instcombine can reason about a
single store fed by this arithmetic *fundamentally better* than they can
about two stores through consecutive pointers. How would you even teach
instcombine about such pointers?

> - Have to reconstruct consecutive pointer information; we already do
>   this, but it has the potential to fail in some cases.

I guess maybe this is where you were hinting at the above problem.

Once you start slicing up memory accesses, *you break SSA form* and all
of the analyses that depend on it. I cannot express how strongly I feel
this is a very bad idea and the wrong direction for the middle end.
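To make that concrete, a sketch of the OR/SHL idiom in question
(hypothetical values):

  %lo64 = zext i32 %lo to i64
  %hi64 = zext i32 %hi to i64
  %shl  = shl i64 %hi64, 32
  %v    = or i64 %shl, %lo64
  store i64 %v, i64* %p

Everything instcombine knows about the two halves is encoded in the
def-use chains of the single SSA value %v. Split the store into the
two-store form sketched earlier and %v disappears; the relationship
between the halves then lives only in pointer arithmetic, which is
exactly what instcombine is not built to reason about.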
> An alternative?
>  * If the above hasn't convinced you, how about an intrinsic that
>    concatenates its operands into memory? This would preserve the
>    semantics, could be inspected and treated differently in the
>    vectorizers, and doesn't require an OR/SHL sequence:
>   declare void @llvm.store.wide.i64(i64* %ptr, ...)

I really don't know why we wouldn't just match the bit-math sequences
that form this. Is there something that makes matching these patterns
deeply problematic? I understand that the DAG may be missing the
information due to the basic-block boundary, but the vectorizers should
definitely be able to reconstruct it, as should passes like
CodeGenPrepare and LSR.
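(For clarity, the intrinsic as proposed would presumably be used along
these lines; the signature is hypothetical, taken from the sketch above:

  call void @llvm.store.wide.i64(i64* %p, i32 %lo, i32 %hi)

This carries exactly the same information as the zext/shl/or-plus-store
sequence above, and that sequence is the pattern I'm suggesting we simply
match.)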