<div dir="ltr">Wow, these are some fantastic results! Android is definitely in favor of fixing the defaults, so this proposal looks great from our perspective.<div><br></div><div>Thanks,</div><div>Steve</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 31, 2017 at 5:57 AM, Kristof Beyls <span dir="ltr"><<a href="mailto:Kristof.Beyls@arm.com" target="_blank">Kristof.Beyls@arm.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="word-wrap:break-word">

<div><b>Motivation</b></div>

<div><br>

</div>

At the moment, when targeting armv7a, clang defaults to generating code as if -mcpu=cortex-a8 had been specified.

<div>When targeting armv8a, it defaults to generating code as if -mcpu=cortex-a53 had been specified.</div>

<div><br>

</div>

<div>This leads to surprising code generation: the compiler optimizes for one specific micro-architecture, whereas the user probably intended to generate code that is "blended" for all the cores implementing the requested architecture.

 One example of a user being surprised like this is at <a href="https://bugs.llvm.org//show_bug.cgi?id=27219" target="_blank">https://bugs.llvm.org//<wbr>show_bug.cgi?id=27219</a>, where vmla instructions are not produced because the compiler is tuning for a Cortex-A8-specific micro-architectural behaviour,

 even though the user didn't ask to optimize specifically for Cortex-A8.</div>

<div><br>

</div>

<div>It would be much cleaner conceptually if clang defaulted to -mcpu=generic when no specific cpu is specified.</div>

<div><br>

</div>

<div><b>What is the impact of this change on execution speed?</b></div>

<div><b><br>

</b></div>

<div>I think the main reason to be hesitant to change the default CPU for ARM to -mcpu=generic is the potential impact on performance of generated code.</div>

<div><b><br>

</b></div>

<div>I've measured quite a wide selection of benchmarks with this change, on the following cores: Cortex-A9, Cortex-A53, Cortex-A57, Cortex-A72.</div>

<div><br>

</div>

<div>Impact on execution speed, for each core, when using -march=armv7a, after changing the default cpu from cortex-a8 to generic is as follows.</div>

<div>A positive number means a speedup, a negative number means a slow-down. These are the geomean results over 350 programs coming from benchmark suites such as the test-suite, SPEC2000, SPEC2006 and a range of proprietary suites.</div>

<div><br>

</div>

<div>Cortex-A9: 0.96%</div>

<div>Cortex-A53: -0.64%</div>

<div>Cortex-A57: 1.04%</div>

<div>Cortex-A72: 1.17%</div>
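<div>For readers unfamiliar with geomean figures, here is a minimal Python sketch of how a number like the ones above can be computed from per-program execution times. The function name, the sample timings, and the definition of speedup as (old time / new time) - 1 are assumptions for illustration; they are not the measurement scripts or data behind the results in this mail.</div>

```python
import math

def geomean_speedup_percent(baseline_times, new_times):
    """Geometric-mean speedup of the new build over the baseline, in percent.

    Assumes speedup per program is baseline_time / new_time, so a
    positive result means the new build is faster on average.
    """
    if len(baseline_times) != len(new_times) or not baseline_times:
        raise ValueError("need two non-empty lists of equal length")
    # Average the logarithms of the per-program ratios, then exponentiate.
    log_sum = sum(math.log(old / new) for old, new in zip(baseline_times, new_times))
    geomean_ratio = math.exp(log_sum / len(baseline_times))
    return (geomean_ratio - 1.0) * 100.0

# Hypothetical execution times in seconds (NOT data from this thread).
baseline = [10.0, 20.0, 5.0]   # e.g. built with the cortex-a8 default
generic = [9.0, 20.0, 5.5]     # e.g. built with -mcpu=generic
print(f"{geomean_speedup_percent(baseline, generic):+.2f}%")
```

<div>The geometric mean is the usual choice for this kind of summary because a 2x speedup on one program and a 2x slow-down on another cancel out exactly, regardless of how long each program runs.</div>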

<div><br>

</div>

<div>Impact on execution speed, for each core, when using -march=armv8a, after changing the default cpu from cortex-a53 to generic:</div>

<div><br>

</div>

<div>(Cortex-A9 is an armv7a core, so can't execute armv8a binaries)</div>

<div>

<div>Cortex-A53: -0.09%</div>

<div>Cortex-A57: -0.12%</div>

<div>Cortex-A72: 0.03%</div>

</div>

<div><br>

</div>

<div><b>Should we enable scheduling for an in-order core even for -mcpu=generic?</b></div>

<div><b><br>

</b></div>

<div>The above measurements show that the biggest negative impact is with -march=armv7a on Cortex-A53: -0.64%.</div>

<div>It seems that the in-order Cortex-A53 core is losing quite a bit of performance when the instructions aren't scheduled - which is to be expected.</div>

<div>Therefore, I also experimented with scheduling instructions according to the Cortex-A8 pipeline model even for -mcpu=generic, to find out whether scheduling for an in-order core is more beneficial than not

 scheduling at all.</div>

<div><br>

</div>

<div>Measurement results:</div>

<div><br>

</div>

<div>-march=armv7a</div>

<div><br>

</div>

<div>

<div>

<div>Cortex-A9: 1.57% (up from 0.96%)</div>

<div>Cortex-A53: 0.47% (up from -0.64%)</div>

<div>Cortex-A57: 1.74% (up from 1.04%)</div>

<div>Cortex-A72: 1.72% (up from 1.17%)</div>

<div><br>

</div>

<div>-march=armv8a (Note that there isn't a pipeline model for Cortex-A53 in the 32-bit ARM backend):</div>

<div><br>

</div>

<div>(Cortex-A9 is an armv7a core, so can't execute armv8a binaries)</div>

<div>

<div>Cortex-A53: 0.49% (up from -0.09%)</div>

<div>Cortex-A57: 0.09% (up from -0.12%)</div>

<div>Cortex-A72: 0.20% (up from 0.03%)</div>

</div>

</div>

<div><br>

</div>

<div>Conclusion: for all the in-order and out-of-order cores I measured, it's beneficial to get the instructions scheduled using the Cortex-A8 pipeline model in combination with -mcpu=generic.</div>
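<div>To make the effect of scheduling easier to compare across cores, the small Python sketch below subtracts the two sets of figures reported above. The numbers are copied from this mail; the table layout and function name are illustrative only. Subtracting two percentages gives a difference in percentage points, which is only an approximation of the relative speedup between the two generic configurations.</div>

```python
# Geomean speedups (percent) copied from the measurements above.
# Each entry is (generic without scheduling, generic with Cortex-A8 scheduling).
RESULTS = {
    "armv7a": {
        "Cortex-A9": (0.96, 1.57),
        "Cortex-A53": (-0.64, 0.47),
        "Cortex-A57": (1.04, 1.74),
        "Cortex-A72": (1.17, 1.72),
    },
    "armv8a": {
        "Cortex-A53": (-0.09, 0.49),
        "Cortex-A57": (-0.12, 0.09),
        "Cortex-A72": (0.03, 0.20),
    },
}

def scheduling_gain(results):
    """Return {arch: {core: gain}} in percentage points, rounded to 2 decimals."""
    return {
        arch: {
            core: round(scheduled - unscheduled, 2)
            for core, (unscheduled, scheduled) in cores.items()
        }
        for arch, cores in results.items()
    }

for arch, cores in scheduling_gain(RESULTS).items():
    for core, gain in cores.items():
        print(f"-march={arch} {core}: {gain:+.2f} percentage points")
```

<div>Every entry comes out positive, and the largest gain is on the in-order Cortex-A53 with -march=armv7a, which matches the conclusion above.</div>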

<div><br>

</div>

<div><br>

</div>

<div>Taking into account the above measurements, my conclusions are:</div>

</div>

<div>1. We should make -mcpu=generic the default cpu for -march=armv7a and -march=armv8a, instead of Cortex-A8 and Cortex-A53 respectively.</div>

<div>2. We probably want to let the compiler schedule instructions using the Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit of speedup on all cores tested.</div>

<div><br>

</div>

<div>Do people agree with these conclusions?</div>

<div>Any objections to implementing this?</div>

<div>Any other potential impact this may have that I forgot to consider above?</div>

<div><br>

</div>

<div>Thanks,</div>

<div><br>

</div>

<div>Kristof</div>

</div>

</blockquote></div><br></div>