[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and armv8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53

Tue Jun 20 07:05:01 PDT 2017

Hi Evandro,

For now, I'm only looking at AArch32, not AArch64.
We could indeed also perform in-order scheduling for -mcpu=generic on AArch64, and Cortex-A53 seems to be the best/only choice available for that.
But making that change will first require another round of extensive benchmarking.

So in summary: I'll put the idea on my backlog, but I probably won't have time to get all the benchmarking done in the very near future.

Thanks,

Kristof

On 1 Jun 2017, at 22:23, Evandro Menezes <e.menezes at samsung.com> wrote:

Hi, Kristof.

It sounds like a good plan, but one thing is not clear to me from your
post.  Which pipeline model will be used for AArch64, A53's (i.e., none)?

Thank you,

--
Evandro Menezes

On 06/01/2017 01:37 AM, Kristof Beyls wrote:
Thanks to everyone for giving their feedback!
I saw pretty unanimous support for making -mcpu=generic the default
and making -mcpu=generic schedule for an in-order CPU (Cortex-A8 in
this case).
I'll be making those changes shortly.

I think the comments also make clear that it's less obvious whether
we'd want -mcpu=native to become a default. It's probably good for
some use cases, but really not good for other use cases. I won't be
making that change, nor advocate for it.

Thanks!

Kristof

On 31 May 2017, at 17:57, Stephen Hines <srhines at google.com> wrote:

Wow, these are some fantastic results! Android is definitely in favor
of fixing the defaults, so this proposal looks great from our
perspective.

Thanks,
Steve

On Wed, May 31, 2017 at 5:57 AM, Kristof Beyls <Kristof.Beyls at arm.com> wrote:

   *Motivation*

   At the moment, when targeting armv7a, clang defaults to generating
   code as if -mcpu=cortex-a8 were specified.
   When targeting armv8a, it defaults to generating code as if
   -mcpu=cortex-a53 were specified.

   This leads to surprising code generation: the compiler optimizes
   for a specific micro-architecture, whereas the user probably
   intended to generate code that is "blended" for all the cores
   implementing the requested architecture. One example of a user
   being surprised like this is at
   https://bugs.llvm.org//show_bug.cgi?id=27219, where vmla
   instructions are not produced, as an optimization for a
   Cortex-A8-specific micro-architectural behaviour, even though
   the user didn't request to optimize specifically for Cortex-A8.

   It would be much cleaner conceptually if clang would default to
   -mcpu=generic when no specific cpu is specified.

   *What is the impact of this change on execution speed?*

   I think the main reason to be hesitant to change the default CPU
   for ARM to -mcpu=generic is the potential impact on performance
   of generated code.

   I've measured quite a wide selection of benchmarks with this
   change, on the following cores: Cortex-A9, Cortex-A53,
   Cortex-A57, Cortex-A72.

   Impact on execution speed, for each core, when using
   -march=armv7a, after changing the default cpu from cortex-a8 to
   generic is as follows.
   A positive number means a speedup, a negative number means a
   slow-down. These are the geomean results over 350 programs coming
   from benchmark suites such as the test-suite, SPEC2000, SPEC2006
   and a range of proprietary suites.

   Cortex-A9: 0.96%
   Cortex-A53: -0.64%
   Cortex-A57: 1.04%
   Cortex-A72: 1.17%
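
For readers unfamiliar with geomean figures like the ones above, here is a minimal Python sketch of how such a number could be computed from per-program timings. It assumes that the per-program speedup is the ratio of old to new execution time; the message does not spell out the exact definition used. The function name and the timings are hypothetical and are not taken from the measurements.

```python
import math


def geomean_speedup_percent(baseline_times, new_times):
    """Geometric-mean speedup of new over baseline, in percent.

    Assumption: per-program speedup = baseline_time / new_time, so a
    positive result means the new configuration is faster on
    (geometric) average, and a negative result means it is slower.
    """
    if len(baseline_times) != len(new_times) or not baseline_times:
        raise ValueError("need two non-empty, equally long lists of timings")
    # Average the logs of the per-program ratios, then exponentiate.
    log_sum = sum(math.log(b / n) for b, n in zip(baseline_times, new_times))
    geomean_ratio = math.exp(log_sum / len(baseline_times))
    return (geomean_ratio - 1.0) * 100.0


# Hypothetical timings in seconds for three made-up programs.
baseline = [10.0, 20.0, 5.0]
candidate = [9.8, 20.4, 5.0]
print(f"{geomean_speedup_percent(baseline, candidate):+.2f}%")
```

The geometric mean is the usual choice for ratios like these because a 2x speedup on one program and a 2x slow-down on another cancel out exactly, which an arithmetic mean of the ratios would not do.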

   Impact on execution speed, for each core, when using
   -march=armv8a, after changing the default cpu from cortex-a53 to
   generic:

   (Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
   Cortex-A53: -0.09%
   Cortex-A57: -0.12%
   Cortex-A72: 0.03%

   *Should we enable scheduling for an in-order core even for
   -mcpu=generic?*

   The above measurements show that the biggest negative impact
   is with -march=armv7a on Cortex-A53: -0.64%.
   It seems that the in-order Cortex-A53 core is losing quite a bit
   of performance when the instructions aren't scheduled - which is
   to be expected.
   Therefore, I also experimented with letting instructions be
   scheduled according to the Cortex-A8 pipeline model, even for
   -mcpu=generic, trying to figure out if it's beneficial to
   schedule instructions for an in-order core rather than not trying
   to schedule them at all, for -mcpu=generic.

   Measurement results:

   -march=armv7a

   Cortex-A9: 1.57% (up from 0.96%)
   Cortex-A53: 0.47% (up from -0.64%)
   Cortex-A57: 1.74% (up from 1.04%)
   Cortex-A72: 1.72% (up from 1.17%)

   -march=armv8a (Note that there isn't a pipeline model for
   Cortex-A53 in the 32-bit ARM backend):

   (Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
   Cortex-A53: 0.49% (up from -0.09%)
   Cortex-A57: 0.09% (up from -0.12%)
   Cortex-A72: 0.20% (up from 0.03%)

   Conclusion: for all the in-order and out-of-order cores I
   measured, it's beneficial to get the instructions scheduled using
   the Cortex-A8 pipeline model in combination with -mcpu=generic.
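
As a quick cross-check of the "up from" figures, the following Python sketch tabulates the numbers reported in this message and computes, per core, how much is gained by adding Cortex-A8 scheduling to -mcpu=generic. The numbers are copied from the tables above; the dictionary layout and the function name are purely illustrative. Note that subtracting two percentages gives a difference in percentage points, which only approximates the relative gain of one configuration over the other.

```python
# Geomean speedups in percent, copied from the tables in this message.
# Key: (-march value, core)
# Value: (generic without scheduling, generic with Cortex-A8 scheduling)
RESULTS = {
    ("armv7a", "Cortex-A9"): (0.96, 1.57),
    ("armv7a", "Cortex-A53"): (-0.64, 0.47),
    ("armv7a", "Cortex-A57"): (1.04, 1.74),
    ("armv7a", "Cortex-A72"): (1.17, 1.72),
    ("armv8a", "Cortex-A53"): (-0.09, 0.49),
    ("armv8a", "Cortex-A57"): (-0.12, 0.09),
    ("armv8a", "Cortex-A72"): (0.03, 0.20),
}


def scheduling_gain(results):
    """Return the gain from scheduling, in percentage points, per entry."""
    return {
        key: round(with_sched - without_sched, 2)
        for key, (without_sched, with_sched) in results.items()
    }


gains = scheduling_gain(RESULTS)
for (march, core), gain in gains.items():
    print(f"-march={march:7s} {core:11s} {gain:+.2f} percentage points")
```

Every entry comes out positive, which matches the conclusion that scheduling helped on all the cores measured, with the largest gain on the in-order Cortex-A53 for -march=armv7a.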

   Taking into account the above measurements, my conclusions are:
   1. We should make -mcpu=generic the default cpu for -march=armv7a
   and -march=armv8a, rather than Cortex-A8 or Cortex-A53.
   2. We probably want to let the compiler schedule instructions
   using the Cortex-A8 pipeline model for -mcpu=generic, since it
   gives a bit of speedup on all cores tested.

   Do people agree with these conclusions?
   Any objections against implementing this?
   Any other potential impact this may have that I forgot to
   consider above?

   Thanks,

   Kristof
