<div dir="ltr">On Thu, Sep 19, 2013 at 10:49 PM, Quentin Colombet <span dir="ltr"><<a href="mailto:qcolombet@apple.com" target="_blank" class="cremed">qcolombet@apple.com</a>></span> wrote:<br><div class="gmail_extra">

<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word">Hi,<div><br></div><div>Here is the new patch.</div><div>The optimization is now done as part of DAG combine.</div>

<div><br></div><div>The patch looks pretty much the same as its InstCombine counter part (SDNode, instead of Instruction) except that it introduces a new profitability model.</div><div>Hence, when the transformation was done as InstCombine the idea  was to perform cannonicalization. Now, this is an optimization.</div>

<div><br></div><div>For this purpose, the patch introduces a cost model (represented by the helper class LoadedSlice::Cost), which accounts for the instructions that are saved by slicing a load compared to the original big load.</div>

<div>It also introduces a new target hook, to query about load pairing capabilities (type and required alignment).</div><div>Basically, this cost model counts how many truncates, shifts, etc, are saved with slicing compared to the number of introduced loads.</div>

<div>It takes advantages of information like isTruncateFree, etc.</div><div><br></div><div>Currently the cost model is used in a very specific case: only two slices that are next to each other in memory.</div><div>This limitation can be lifted as soon as we consider the cost model mature enough and when machine passes that deal with load pairing manage to handle more slices.</div>

<div><br></div><div>Thanks for your reviews!</div></div></blockquote><div><br></div><div>I'm probably not the right person to do a detailed review (I just don't know the DAG that well) but some high level thoughts:</div>

<div><br></div><div>- I *really* like the design you're building here. I like the use of a concrete cost model, tracking register pairing, etc. That looks like serious goodness, and I'm excited to see it expanded.</div>

<div><br></div><div>- I wonder if it would help to explicitly shift the data structures to handle N slices used of M adjacent loads? I mention M adjacent loads because SROA (or user-authored code) will pre-slice stuff to only use target-legal integers (or 'int') in some cases. I expect it to be rare, but likely easy to handle the generality.</div>

<div><br></div><div>- It wasn't clear how you account for multiple uses. I'm specifically imagining the fact that a wide load may be used N times for N slices; the narrow loads for N-1 slices don't replace the wide load. It'll be important to keep an eye on the use # and not assume (for cost) that you simplify away things that have more uses anyways.</div>

<div><br></div><div>- The cost comparator doesn't really make sense to me. the early bias followed by summing is just strange, but maybe that's just me.</div><div><br></div><div>And two nits:</div><div>- There is some weird indenting in the patch. I would try running it through clang-format.</div>

<div>- I saw some strange variable names like 'Nb' for 'Number'</div><div><br></div><div>Thanks for shifting this down to the DAG level. The whole thing looks really nice now. One crazy idea: Do you want to see if there is a vector load + extract that would suffice and throw it into the cost mix? The SLP vectorizer *might* get this, but honestly, it seems better to do here.</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><span class="HOEnZb"><font color="#888888"><div><br></div><div>-Quentin</div>

<div></div></font></span></div><br><div style="word-wrap:break-word"><div><br><div><div>On Sep 17, 2013, at 3:05 PM, Quentin Colombet <<a href="mailto:qcolombet@apple.com" target="_blank" class="cremed">qcolombet@apple.com</a>> wrote:</div>

<br><blockquote type="cite"><div style="word-wrap:break-word">After discussing with Chandler off-line, we decided to perform the load slicing at isel time.<div>Thus, this patch has been reverted in r<span style="font-family:Menlo;font-size:11px">190891.</span></div>

<div><font face="Menlo"><span style="font-size:11px"><br></span></font><div>

<div style="font-family:Helvetica;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:-webkit-auto;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wrap:break-word">

-Quentin</div>


</div>

<br><div><div>On Sep 17, 2013, at 2:10 PM, Chandler Carruth <<a href="mailto:chandlerc@google.com" target="_blank" class="cremed">chandlerc@google.com</a>> wrote:</div><br><blockquote type="cite"><div dir="ltr"><div class="gmail_extra">

<br><div class="gmail_quote">On Mon, Aug 26, 2013 at 1:08 PM, Quentin Colombet <span dir="ltr"><<a href="mailto:qcolombet@apple.com" target="_blank" class="cremed">qcolombet@apple.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>After looking into SROA, Nadav and I agreed that it does the right thing.</div><div>Therefore, the initial proposed patch is still the one to be reviewed.</div>


<div><div></div></div></blockquote></div><br>Very, very sorry for the late comment here. I misunderstood this comment and the result.</div><div class="gmail_extra"><br></div><div class="gmail_extra">I agree that SROA is correct here, and I also think that this should be the canonical form. With the LLVM memory model, it is very hard to merge two smaller loads back together if that is ever profitable. It is essentially impossible in many cases to merge two smaller stores back together if that is ever profitable. As such, it is very useful to preserve the widest known-safe load and store size as far as possible in the optimizer. At least getting it to the backend where the cost factor for various load and store operations is known is essential.</div>


<div class="gmail_extra"><br></div><div class="gmail_extra">Can we canonicalize toward the wide loads and stores with appropriate masking to extract narrow values, and then match these back to the small stores in the backend?</div>


<div class="gmail_extra"><br></div><div class="gmail_extra">Both SROA and optimal bitfield code generation rely on this (at least for x86) so changing it will regress some things.</div></div>

</blockquote></div><br></div></div>

_______________________________________________<br>llvm-commits mailing list<br><a href="mailto:llvm-commits@cs.uiuc.edu" target="_blank" class="cremed">llvm-commits@cs.uiuc.edu</a><br><a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank" class="cremed">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>

</blockquote></div><br></div></div>

<br></blockquote></div><br></div></div>