<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div apple-content-edited="true">Hi Owen,
</div><div apple-content-edited="true"><br></div><div apple-content-edited="true">All your points are true and fair.</div><div apple-content-edited="true">I have tried to comment on them inline in your mail.</div><div apple-content-edited="true"><br></div><div apple-content-edited="true">To sum up, maybe I took the wrong approach here.</div><div apple-content-edited="true"><br></div><div apple-content-edited="true">I thought exposing this feature in a generic way might be beneficial for most targets. However, it seems that we cannot come up with a reasonable cost model (at least I did not convince you with my proposal :)).</div><div apple-content-edited="true">Therefore, I am starting to think that I should keep this as a target-specific DAG combine for my own target.</div><div apple-content-edited="true"><br></div><div apple-content-edited="true">I will let you read my inline comments; please reply to this general question:</div><div apple-content-edited="true">Should we keep trying to improve the cost model, or should I keep this as a target-specific DAG combine?</div>
<br><div><div>On Oct 8, 2013, at 10:28 AM, Owen Anderson <<a href="mailto:resistor@mac.com">resistor@mac.com</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><meta http-equiv="Content-Type" content="text/html charset=windows-1252"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On Oct 8, 2013, at 9:30 AM, Quentin Colombet <<a href="mailto:qcolombet@apple.com">qcolombet@apple.com</a>> wrote:</div><br><blockquote type="cite"><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">To me, vector registers hold both values within the same register, whereas a paired load defines two different registers.</div></blockquote><div><br></div><div>That’s not really accurate on something like AArch32, where float vector registers are just pairs of float scalar registers. The problem generalizes to other targets with undifferentiated register files. Scalar GPUs typically don’t differentiate between a sequence of scalar registers and a vector of that width.</div></div></div></blockquote><div>I agree, and to be fair, it depends on the aliasing pattern of the register banks, which is target-dependent. E.g., AArch64 D registers are not a pair of S registers. 
Thus, a vector load of <2 x float> is not equivalent to a paired load of two S values if we want to use both scalar values separately, i.e., you will need more instructions with the vector load.</div><br><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div><br><blockquote type="cite"><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Assuming the vector registers and the regular registers are on the same bank, i.e., extracting both values from the vector register is free, it may still be interesting to use the paired load because it may be less constrained by register allocation.</div></blockquote><div><br></div><div>That seems like a cost analysis to me. 
Most CPU architectures I know of have plenty of registers in their FP/vector banks, so it seems unlikely that the small reduction in register pressure would be worth it for the increased pressure on the load pipe.</div></div></div></blockquote><div>I do not get your point regarding the increased pressure on the load pipe.</div><div><br></div><div>Let me explain.</div><div><br></div><div>The optimization kicks in only if we reduce the number of expensive operations, i.e., loads and, optionally, bitcasts if they do not share the same register bank.</div><div>The bitcasts are debatable; however, assuming all of your values map to the same register file, only the loads remain.</div><div>Thus, unless your target features paired loads, the slicing will *never* kick in, because we would increase the number of loads.</div><div>If your target features paired loads, the number of loads remains the same.</div>That is why I do not get the point about the increased pressure on the load pipe.</div><div><br><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div><br><blockquote type="cite"><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Now, if the registers are not on the same bank, we may definitely not want to generate the vector load, but still be able to catch the paired load.</div></blockquote><div><br></div><div>It depends on where the values are going to end up. 
</div></div></div></blockquote><div>I agree; that is why I thought it was not worth spending a lot of effort on having an accurate cost model, since it would not be accurate anyway!</div><div>With the proposed cost model, I was at least trying not to degrade the generated instructions (this is also why I did not want to introduce new types, in particular vector types).</div><br><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div><div>In the example you gave involving _Complex, the ultimate destination type is float, even though the load is currently written as occurring in the integer bank. On AArch32, it seems like the optimal code would be a <2 x float> load, and the individual lanes can then be sourced directly by their users.</div></div></div></blockquote><div>I agree.</div><div>However, I thought it might be easier to map, after ISel, two floating-point loads into a <2 x float> vector load than one big integer load with truncates and shifts.</div><br><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div><div> On X86, it’s unclear to me why it would be better to generate two scalar loads versus one <2 x float> load and a lane extract to get the high lane out.</div></div></div></blockquote><div>Same here.</div><div>I think this optimization is a step toward more vectorizable code in the sense that the output representation looks more vector-friendly. 
I did not attempt to solve that problem here.</div><div>I agree that the model has to mature to give the ultimate result for all targets, but I think it can still be at least as good as what we are generating now.</div><br><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div><div><br></div><blockquote type="cite"><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Even on x86, you’re probably best off trying to turn this into a <2 x i32> vector load + lane copy rather than splitting the load.</div></blockquote><div>My plan here was to turn this into two split loads, then have another DAG combine for merging those loads into a vector load. I thought maybe targets supporting paired loads already do this kind of target-specific combine.</div></div></blockquote><br></div><div>You certainly can go that way, but it sounds challenging to implement. 
My point was that it seems like you could achieve most of the same benefit with a DAG combine that turned this example into a <2 x float> vector load, with much less concern about the cost model, etc.</div></div></blockquote><div>My concern then is that we may introduce cross-register-bank copies (e.g., NEON to regular register) and, in that case, the new code would be more expensive.</div><div>My approach here was to not insert vector code. I believe the same approach was used in SROA: do not insert vector code unless the code has already been vectorized.</div></div></blockquote><br></div><div>Ultimately, I don’t think this is an optimization that you can do without a lot of target profitability knowledge. </div></div></blockquote><div>I agree.</div><div><br></div><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div> With that information, you can introduce vectors on targets that want them, and avoid doing so on targets that don’t.</div></div></blockquote><div>Then, what should we expose?</div><br><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div> Even within a single target, the choice of whether to split or to vectorize is dependent on the ultimate type of the users of the values.</div></div></blockquote><div>I agree, and that is why I thought SDISel may not be enough to provide a good cost model; thus I did not want to spend too much effort on it.</div><div>Ultimately, we could solve this kind of problem with the new global ISel.</div><div><br></div>-Quentin<br><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div> On AArch32, we’d prefer paired loads for integers (unless they’re being fed to integer NEON instructions!), but vector loads for floats.</div><div><br></div><div>—Owen</div></div></blockquote></div><br></body></html>