<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On Apr 8, 2014, at 12:07 PM, Dan Gohman <<a href="mailto:dan433584@gmail.com">dan433584@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Apr 7, 2014 at 5:09 PM, Andrew Trick <span dir="ltr"><<a href="mailto:atrick@apple.com" target="_blank">atrick@apple.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><br><div><div>On Mar 25, 2014, at 10:50 AM, Dan Gohman <<a href="mailto:dan433584@gmail.com" target="_blank">dan433584@gmail.com</a>> wrote:</div>


<br><blockquote type="cite"><div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Mar 25, 2014 at 7:24 AM, Rafael Espíndola <span dir="ltr"><<a href="mailto:rafael.espindola@gmail.com" target="_blank">rafael.espindola@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>On 25 March 2014 09:49, Dan Gohman <<a href="mailto:dan433584@gmail.com" target="_blank">dan433584@gmail.com</a>> wrote:<br>


> Hi Lang,<br>

><br>

> I can reproduce the performance regression on fourinarow, at least. With the<br>

> patch, the code size and static instruction count of the benchmark's one<br>

> embarrassingly-hot function is lower, the dynamic instruction count is lower,<br>

> and the stack frame is smaller, but it still runs slower. Instruction<br>

> selection is basically the same, except that there are fewer cmovs. There<br>

> appears to be a minor difference in instruction scheduling in the hot<br>

> function. The regression disappeared when I experimented with non-default<br>

> values for -pre-RA-sched. However, I'm not prepared for the adventure of<br>

> changing the instruction scheduler's heuristics at this time, so I'll just<br>

> let this patch go for now.<br>

<br>

</div>Do you have a small .ll testcase?<br></blockquote><div><br></div><div>Not handy anymore, but it's just MultiSource/Benchmarks/<div>FreeBench/fourinarow/fourinarow with -O3 -flto on x86-64.<br></div></div></div>


</div></div></blockquote></div><br></div><div>fourinarow is jittery, sensitive to register pressure, and doesn’t like codegen changes. Were there several other significant regressions and no significant improvements? Were the results overall bad on non -flto builds too? Or did we just have bad luck with LTO? Are there regressions on any real benchmarks?</div>


</div></blockquote><div><br></div><div>There is a very significant improvement in one of my own benchmarks. Also, it's an intuitively appealing patch because 0 selects is nicer than 1 select, and there are no apparent significant downsides. That said, in the LLVM testsuite, there appear to have been several regressions and no improvements.<br>


</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>Is there any reason to believe this patch is chronically increasing register pressure?</div>


</div></blockquote><div><br></div><div>No.<br></div></div></div></div></blockquote><blockquote type="cite"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto;">

<div style="word-wrap:break-word"><div>The default SD scheduler should be simply preserving IR order. If the patch fundamentally makes sense, and the generated code before register coalescing looks better by simple metrics (dynamic instruction count and critical path), then the only way forward is to file a bug against the register coalescer and MI scheduler (which are often two sides of the same problem).</div>


<div></div></div></blockquote><div><br></div><div>The dynamic instruction count was lower. The main difference is that a cmov is removed. I'll make a note to myself to file a bug against the MI scheduler.<br></div></div></div></div></blockquote><div><br></div><div>Ok. If we know that the number of instructions before coalescing is less-or-equal and the spill count is greater in each of the regressions, that's enough of a clue to pin it on coalescing/scheduling/regalloc.</div><div><br></div><div>I don’t like to prevent the right thing at the IR level just because downstream codegen happens to make bad decisions.</div><div><br></div><div>-Andy</div><br><blockquote type="cite"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">

<div>Dan<br><br></div></div></div></div>

</blockquote></div><br></body></html>