<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">Thanks, James. Please see below.<br class=""><div><blockquote type="cite" class=""><div class="">On Apr 22, 2015, at 2:15 AM, James Molloy <<a href="mailto:james@jamesmolloy.co.uk" class="">james@jamesmolloy.co.uk</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">r235431.<br class=""></div><br class=""><div class="gmail_quote">On Wed, 22 Apr 2015 at 08:45 James Molloy <<a href="mailto:james@jamesmolloy.co.uk" class="">james@jamesmolloy.co.uk</a>> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class="">Hi Michael and Gerolf,<br class=""><br class=""><span style="font-size:13.1999998092651px" class="">Gerolf, I agree in general terms. However in gobmk these madds being removed are not being strength reduced, they are redundant. Not only that but they hold two registers live as arguments, so we're reducing both number of instructions and register pressure and not increasing any other metric. However the gain will depend on the number of multiply functional units your microarchitecture has.</span><div class=""><br class="">Michael, thanks for that explanation. It shows that there really is a problem, where GEP matching is superior to arithmetic matching. It may just be that the pointer arithmetic emitted by GEP lowering differs slightly from canonical form and that is what our instruction selector is tuned for. But it's certainly nontrivial so that, combined with the other issues, makes me inclined to turn it off by default.<div class=""><br class=""></div><div class="">However, I robustly defend the position I took. I don't think pointing the finger at a pass due to one testcase over 9 months later and asking to disable that pass, without looking for an underlying cause, is the best way of improving LLVM. If the pass were newly added, I would feel differently of course.</div></div></div></blockquote></div></div></blockquote><div><br class=""></div>Understandably so. We were aware that this could be taken at heart. In reverse circumstance I’m sure initially we would have felt the same way. Our data and internal thumbnail assessment of the optimization were not great,  but sufficient to point to the need of discussion with a preference for disabling/back-out. I can see future learnings on how to present the evidence better. Especially we don’t want to come across as blamers, finger pointers,  ‘extremists' or anything like it. Clearly the long 9month delay exacerbated the discussion to begin with. This is due to the fact that we started tracking Geekbench for older version just recently. As tracking improves in future similar issues should get caught much earlier. Life will be easier then. For now there are 6 or so more regressions in Geekbench to recover from. No, no, nope, nothing in your code alley! :-)</div><div><br class=""></div><div><blockquote type="cite" class=""><div class=""><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class=""><div class=""><div class=""><br class=""></div><div class="">Cheers,</div><div class=""><br class=""></div><div class="">James</div><div class=""><br class=""></div><div class=""><br class=""></div></div></div><br class=""><div class="gmail_quote">On Wed, 22 Apr 2015 at 03:00 Michael Zolotukhin <<a href="mailto:mzolotukhin@apple.com" target="_blank" class="">mzolotukhin@apple.com</a>> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class="">Hi James,<div class=""><br class=""></div><div class="">I’ve spent some time getting familiar with the source and looking into the generated assembly, and here is what I found.<br class=""><div class=""><br class=""></div><div class="">First, let me start with some examples where the pass did some not-optimzal transformation. The code snippets are from random functions (not necessarily hot ones), but they still should give the idea.</div><div class=""><br class=""></div><div class=""><b class="">Case 1:</b></div><div class=""><div class=""><u class="">gepopt=true+D9148:</u></div><div class=""><font face="Menlo" class="">mul<span style="white-space:pre-wrap" class="">       </span> x8, x8, x9</font></div><div class=""><font face="Menlo" class="">stp<span style="white-space:pre-wrap" class="">  </span>x11, x8, [sp, #32]</font></div><div class=""><font face="Menlo" class="">add<span style="white-space:pre-wrap" class="">   </span>x8, x25, #8</font></div><div class=""><font face="Menlo" class="">str<span style="white-space:pre-wrap" class="">  </span>x8, [sp, #48]</font></div><div class=""><font face="Menlo" class="">add<span style="white-space:pre-wrap" class="">        </span>x8, x25, #56</font></div><div class=""><font face="Menlo" class="">str<span style="white-space:pre-wrap" class=""> </span>x8, [sp, #16]</font></div><div class=""><font face="Menlo" class="">movz<span style="white-space:pre-wrap" class="">       </span>w8, #72</font></div><div class=""><font face="Menlo" class="">mul<span style="white-space:pre-wrap" class="">      </span> x22, x26, x8</font></div><div class=""><font face="Menlo" class="">add<span style="white-space:pre-wrap" class="">        </span>x8, x25, x22</font></div><div class=""><font face="Menlo" class="">ldrb<span style="white-space:pre-wrap" class="">        </span>w10, [x8, #23]</font></div><div class=""><font face="Menlo" class="">sxtb<span style="white-space:pre-wrap" class="">      </span>w9, w10</font></div><div class=""><font face="Menlo" class="">ldr<span style="white-space:pre-wrap" class="">      </span>x11, [sp, #48]</font></div><div class=""><font face="Menlo" class="">ldr<span style="white-space:pre-wrap" class="">       </span>x11, [x11, x22]</font></div></div><div class=""><br class=""></div><div class=""><u class="">gepopt=false:</u></div><div class=""><div class=""><font face="Menlo" class="">mul<span style="white-space:pre-wrap" class="">        </span> x28, x8, x9</font></div><div class=""><font face="Menlo" class="">movz<span style="white-space:pre-wrap" class="">        </span>w8, #72</font></div><div class=""><font face="Menlo" class="">madd<span style="white-space:pre-wrap" class="">     </span>x8, x26, x8, x25</font></div><div class=""><font face="Menlo" class="">ldrb<span style="white-space:pre-wrap" class="">    </span>w10, [x8, #23]</font></div><div class=""><font face="Menlo" class="">sxtb<span style="white-space:pre-wrap" class="">      </span>w9, w10</font></div><div class=""><font face="Menlo" class="">ldr<span style="white-space:pre-wrap" class="">      </span>x11, [x8, #8]</font></div></div><div class=""><br class=""></div><div class="">In this case we have a load from [x25+8+x26*72] address, and the pass split the expression into two parts: (x25+8) and (x26*72). That increased register pressure, and we had to generate a spill-fill pair for (x25+8) expression. Without the pass we generated madd and a load with a simple addressing mode.</div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><b class="">Case 2:</b></div><div class=""><u class="">Gepopt=true+D9148:</u></div><div class=""><div class=""><font face="Menlo" class="">ldr<span style="white-space:pre-wrap" class="">      </span>x22, [x0, #136]</font></div><div class=""><font face="Menlo" class="">orr<span style="white-space:pre-wrap" class="">      </span>w9, wzr, #0x18</font></div><div class=""><font face="Menlo" class="">mul<span style="white-space:pre-wrap" class="">       </span> x8, x8, x9</font></div><div class=""><font face="Menlo" class="">add<span style="white-space:pre-wrap" class="">  </span>x23, x22, x8</font></div><div class=""><font face="Menlo" class="">str<span style="white-space:pre-wrap" class=""> </span>wzr, [x23]</font></div><div class=""><font face="Menlo" class="">orr<span style="white-space:pre-wrap" class="">   </span>x26, x8, #0x4</font></div><div class=""><font face="Menlo" class="">str<span style="white-space:pre-wrap" class="">        </span>wzr, [x26, x22]</font></div><div class=""><font face="Menlo" class="">add<span style="white-space:pre-wrap" class="">      </span>x24, x8, #8</font></div><div class=""><font face="Menlo" class="">str<span style="white-space:pre-wrap" class="">  </span>wzr, [x24, x22]</font></div><div class=""><font face="Menlo" class="">add<span style="white-space:pre-wrap" class="">      </span>x25, x8, #16</font></div><div class=""><font face="Menlo" class="">ldr<span style="white-space:pre-wrap" class=""> </span>x27, [x25, x22]</font></div><div class=""><font face="Menlo" class="">cbz<span style="white-space:pre-wrap" class="">      </span>x27, 0xABCDABCD</font></div><div class=""><br class=""></div></div><div class=""><u class="">Gepopt=false:</u></div><div class=""><div class=""><font face="Menlo" class="">ldr<span style="white-space:pre-wrap" class="">        </span>x9, [x0, #136]</font></div><div class=""><font face="Menlo" class="">orr<span style="white-space:pre-wrap" class="">       </span>w10, wzr, #0x18</font></div><div class=""><font face="Menlo" class="">madd<span style="white-space:pre-wrap" class="">     </span>x22, x8, x10, x9</font></div><div class=""><font face="Menlo" class="">str<span style="white-space:pre-wrap" class="">     </span>xzr, [x22]</font></div><div class=""><font face="Menlo" class="">str<span style="white-space:pre-wrap" class="">   </span>wzr, [x22, #8]</font></div><div class=""><font face="Menlo" class="">ldr<span style="white-space:pre-wrap" class="">       </span>x23, [x22, #16]</font></div><div class=""><font face="Menlo" class="">cbz<span style="white-space:pre-wrap" class="">      </span>x23, 0xABCDABCD</font></div><div class=""><br class=""></div></div><div class="">In this case the hoisted common part seems to be just x22, and the we generate stores to [x22+x8*0x18], [x22+x8*0x18+4], and [x22+x8*0x18+8]. New addresses also seem to be more confusing for other optimization passes, so we weren’t able to merge adjacent stores, as we did with gepopt=false. We also didn’t fuse ‘add’ and ‘mul’ to ‘madd’ in this case.</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Looking into these examples, I thought that the problem might be that we do this transformation always and unconditionally - we don’t check if it’s beneficial at all. This would work fine if the transformation is free, i.e. if codegen always can roll everything back. But it’s not free, and in cases when we don’t actually hit the target (i.e. hoist a common part of several GEP-expressions), we often lose.</div><div class=""><br class=""></div><div class="">I see how this pass can be beneficial for targets like NVPTX which (according to comments in SeparateConstOffsetFromGEP.cpp) don't support complex addressing modes with several registers. However, it’s not that clear whether it’s beneficial or not on targets like ARM64. I’m inclining to believe that turning this pass in the current state on by default was the ‘large hammer’. I still think it could be beneficial, but it needs some more work for it.</div><div class=""><br class=""></div><div class="">Thanks,</div><div class="">Michael</div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class=""><br class=""></div><div class=""><div class=""><blockquote type="cite" class=""><div class="">On Apr 21, 2015, at 11:38 AM, James Molloy <<a href="mailto:james@jamesmolloy.co.uk" target="_blank" class="">james@jamesmolloy.co.uk</a>> wrote:</div><br class=""><div class="">Hi Michael,<br class=""><br class="">Thanks. I'll take a look tomorrow. At least this is nothing to do with the pass itself- if re association regresses instruction selection, something is fragile!<br class=""><br class="">Will you have time to look at this one today? It's now 19:30 here and I'm away for the evening.<br class=""><br class="">Cheers,<br class=""><br class="">James<br class=""><div class="gmail_quote">On Tue, 21 Apr 2015 at 19:36, Michael Zolotukhin <<a href="mailto:mzolotukhin@apple.com" target="_blank" class="">mzolotukhin@apple.com</a>> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class="">Hi James,<div class=""><br class=""></div><div class="">It looks like the first patch gives nearly the same results as both patches together (at least, in terms of gains and regressions on geekbench). E.g. the code generated for mandelbrot is the same.</div><div class=""><br class=""></div><div class="">Thanks,</div><div class="">Michael</div></div><div style="word-wrap:break-word" class=""><div class=""><br class=""><div class=""><blockquote type="cite" class=""><div class="">On Apr 21, 2015, at 11:29 AM, James Molloy <<a href="mailto:james@jamesmolloy.co.uk" target="_blank" class="">james@jamesmolloy.co.uk</a>> wrote:</div><br class=""><div class="">Hi Michael,<br class=""><br class="">Jingue pointed out that my second patch did pessimize some code, so do you still see the regressions with my first patch (the reassoc one)?<br class=""><br class="">Cheers,<br class=""><br class="">James<br class=""><div class="gmail_quote">On Tue, 21 Apr 2015 at 19:26, Michael Zolotukhin <<a href="mailto:mzolotukhin@apple.com" target="_blank" class="">mzolotukhin@apple.com</a>> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class="">Hi James,<div class=""><br class=""></div><div class="">Thank you for taking your time to work on it! I tried the patches you posted and they did fix the problem on twofish. However, with them there are other regressions - and while you are at it, could you please take a look at the generated code? Probably, you can come up with another one-line fix.</div><div class=""><br class=""></div><div class="">So, one regression is on Geekbench/Mandlerbrot, here is the ‘fast’ and ‘slow’ disassembly files (fast - from reference compiler, slow - reference compiler+D9148+D9149).</div><div class=""><br class=""></div><div class=""></div></div><div style="word-wrap:break-word" class=""><div class=""></div></div><div style="word-wrap:break-word" class=""><div class=""></div><div class=""><br class=""></div><div class="">I can see that we generate less ‘madd’ instructions, but in some cases we end up with less-optimal sequences, including ‘mul', ‘add' and ‘or' instead. Also, in some cases we now generate 2 ‘str' instead of 1 ‘stp'.</div><div class=""><br class=""></div><div class="">Do you have any ideas what might go wrong this time?</div><div class=""><br class=""></div><div class="">Thanks,</div><div class="">Michael</div></div><div style="word-wrap:break-word" class=""><div class=""><br class=""><div class=""><blockquote type="cite" class=""><div class="">On Apr 21, 2015, at 6:55 AM, James Molloy <<a href="mailto:james@jamesmolloy.co.uk" target="_blank" class="">james@jamesmolloy.co.uk</a>> wrote:</div><br class=""></blockquote><blockquote type="cite" class=""><div class=""><div dir="ltr" class="">See:<br class=""><br class=""><a href="http://reviews.llvm.org/D9148" target="_blank" class="">http://reviews.llvm.org/D9148</a><br class=""><div class=""><a href="http://reviews.llvm.org/D9149" target="_blank" class="">http://reviews.llvm.org/D9149</a><br class=""></div><div class=""><br class=""></div><div class="">Cheers,</div><div class=""><br class=""></div><div class="">James</div></div><br class=""><div class="gmail_quote">On Tue, 21 Apr 2015 at 12:06 James Molloy <<a href="mailto:james@jamesmolloy.co.uk" target="_blank" class="">james@jamesmolloy.co.uk</a>> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class="">Hi Michael, Gerolf,<br class=""><br class=""><div class="">The main goal of this optimization was to reduce the number of multiply-adds emitted when complex GEPs are used. With regards to SPEC, this is particularly prevalent in sjeng and gobmk (although we don't just care about SPEC and there were plenty of other places we caught this idiom).</div><div class=""><br class=""></div><div class="">The pass is actively reducing the number of multiply-adds in gobmk:</div><div class=""><br class=""></div><div class="">$ llc engine.ll -O3 -mllvm -aarch64-gep-opt=false -o - |& grep madd | wc -l</div><div class=""> 297</div><div class=""><br class=""></div><div class="">$ llc engine.ll -O3 -mllvm -aarch64-gep-opt=true -o - |& grep madd | wc -l</div><div class=""> 158</div><div class=""><br class=""></div><div class="">So it demonstrably is making at least some code better.</div><div class=""><br class=""></div><div class="">I was able to reproduce your twofish example. Twofish, like most crypto algorithms, has a pattern repeated many times so the problem could just be down to one pattern producing one more instruction, which it turns out is exactly the case.</div><div class=""><br class=""></div><div class="">The problem is that with gep-opt enabled, we're associating shifts and constants the other way around.</div><div class=""><br class=""></div><div class="">With gep-opt disabled (synthetic code example):</div><div class="">   add x0, x0, #72</div><div class="">   ldr x0, [x1, x0, lsl #2]</div><div class=""><br class=""></div><div class="">With gep-opt enabled:</div><div class="">   add x0, x1, x0, lsl #2</div><div class="">   ldr x0, [x0, #72]</div><div class=""><br class=""></div><div class="">This is actually fine, until another expression reuses a different shift with the same offset:</div><div class=""><br class=""></div><div class="">With gep-opt disabled:</div><div class="">   add x0, x0, #72</div><div class="">   ldr x0, [x1, x0, lsl #2]</div><div class="">   ldr x2, [x1, x0, lsl #3]</div><div class=""><br class=""></div><div class="">With gep-opt enabled:</div><div class="">   add x0, x1, x0, lsl #2</div><div class="">   add x2, x1, x0, lsl #3</div><div class="">   ldr x0, [x0, #72]</div><div class="">   ldr x2, [x0, #72]</div><div class=""><br class=""></div><div class="">Amusingly, I saw the opposite behaviour on AArch32 on a different testcase and came to the conclusion that expressions with correlated immediate offsets and identical shift amounts were more likely than expressions with correlated shift amounts and identical immediate offsets, but that's a different story.</div><div class=""><br class=""></div><div class="">Anyway, the reason this is happening is because we're not running reassociate after GEP expansion. If we run reassociate it all goes away, it's a one-line fix.</div><div class=""><br class=""></div><div class="">The two testcases that were disabled - well, they weren't really disabled, but the behaviour was changed for them. I put that down to a bug in the SeparateConstOffsetFromGEP pass, where it tries to separate a GEP with only one variadic offset:</div><div class=""><br class=""></div><div class="">  GEP i8* %base, i32 %offset</div><div class=""><br class=""></div><div class="">Nothing beneficial can come of that, so I've got, again, a single line patch to fix this. I'll push both patches upstream just as soon as I verify them.</div><div class=""><br class=""></div><div class="">So in summary, I don't see any need to disable this pass. It's doing good work, it just has one or two bugs.</div><div class=""><br class=""></div><div class="">Cheers,</div><div class=""><br class=""></div><div class="">James</div></div><br class=""><div class="gmail_quote">On Tue, 21 Apr 2015 at 05:27 Michael Zolotukhin <<a href="mailto:mzolotukhin@apple.com" target="_blank" class="">mzolotukhin@apple.com</a>> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Tim and others,<br class="">

<br class="">

So, the entire discussion is moving around the assumption that this pass is beneficial in some cases, but we don’t have such data. Please note that I don’t want to say that the assumption is wrong, but as far as I understand, only Hao observed some gains. Thus, until we hear from him, we have real slow downs versus presumable gains.<br class="">

<br class="">

As for the looking into the code later - I’m not an expert in this area, and I don’t even have a motivational example for this optimization, so I’d barely be a right person for it. I’d be glad to assist with collecting some additional data/running tests if that might help though.<br class="">

<br class="">

Thanks,<br class="">

Michael<br class="">

<br class="">

<br class="">

> On Apr 20, 2015, at 8:40 PM, Tim Northover <<a href="mailto:t.p.northover@gmail.com" target="_blank" class="">t.p.northover@gmail.com</a>> wrote:<br class="">

><br class="">

> On 20 April 2015 at 20:34, Gerolf Hoflehner <<a href="mailto:ghoflehner@apple.com" target="_blank" class="">ghoflehner@apple.com</a>> wrote:<br class="">

>> It shouldn’t come across as ‘extreme’.<br class="">

><br class="">

> The "extreme" part wasn't so much the tentative disabling as "Could<br class="">

> you and/or Hao take this on after the code is disabled/backed out?". I<br class="">

> realise it wasn't an ultimatum, but it could very easily turn into<br class="">

> "we're washing our hands of this", which I'd object to.<br class="">

><br class="">

> I assume the pass was committed with some kind of metric that it<br class="">

> improved, and now we have a counter-example. It's down to everyone to<br class="">

> decide what the best way to proceed is.<br class="">

><br class="">

> Cheers.<br class="">

><br class="">

> Tim.<br class="">

><br class="">

> _______________________________________________<br class="">

> llvm-commits mailing list<br class="">

> <a href="mailto:llvm-commits@cs.uiuc.edu" target="_blank" class="">llvm-commits@cs.uiuc.edu</a><br class="">

> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank" class="">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br class="">

<br class="">

<br class="">

_______________________________________________<br class="">

llvm-commits mailing list<br class="">

<a href="mailto:llvm-commits@cs.uiuc.edu" target="_blank" class="">llvm-commits@cs.uiuc.edu</a><br class="">

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank" class="">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br class="">

</blockquote></div></blockquote></div>

</div></blockquote></div><br class=""></div></div></blockquote></div>

</div></blockquote></div><br class=""></div></div></blockquote></div>

</div></blockquote></div><br class=""></div></div></div></blockquote></div></blockquote></div>

_______________________________________________<br class="">llvm-commits mailing list<br class=""><a href="mailto:llvm-commits@cs.uiuc.edu" class="">llvm-commits@cs.uiuc.edu</a><br class="">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits<br class=""></div></blockquote></div><br class=""></body></html>