[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and armv8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53

Evandro Menezes via llvm-dev llvm-dev at lists.llvm.org
Wed May 31 08:25:17 PDT 2017


Hi, Kristof.

I think that it makes sense.  Your results also somewhat corroborate the
model adopted in GCC for the generic tuning, especially WRT in-order
scheduling.

Thank you,

-- 
Evandro Menezes

On 05/31/2017 07:57 AM, Kristof Beyls wrote:
> *Motivation*
>
> At the moment, when targeting armv7a, clang defaults to generating code
> as if -mcpu=cortex-a8 was specified.
> When targeting armv8a, it defaults to generating code as if
> -mcpu=cortex-a53 was specified.
>
> This leads to surprising code generation: the compiler optimizes for a
> specific micro-architecture, whereas the user's intent was probably to
> generate code that is "blended" for all the cores implementing the
> requested architecture. One example of a user being surprised like
> this is at https://bugs.llvm.org//show_bug.cgi?id=27219, where vmla
> instructions are not produced, in order to optimize for a
> Cortex-A8-specific micro-architectural behaviour, even though the user
> didn't request optimization specifically for Cortex-A8.
>
> It would be much cleaner conceptually if clang defaulted to
> -mcpu=generic when no specific cpu is specified.
>
> *What is the impact of this change on execution speed?*
>
> I think the main reason to be hesitant to change the default CPU for 
> ARM to -mcpu=generic is the potential impact on performance of 
> generated code.
>
> I've measured quite a wide selection of benchmarks with this change, 
> on the following cores: Cortex-A9, Cortex-A53, Cortex-A57, Cortex-A72.
>
> Impact on execution speed, for each core, when using -march=armv7a,
> after changing the default cpu from cortex-a8 to generic, is as follows.
> A positive number means a speedup, a negative number means a slowdown.
> These are the geomean results over 350 programs coming from benchmark
> suites such as the test-suite, SPEC2000, SPEC2006 and a range of
> proprietary suites.
>
> Cortex-A9: 0.96%
> Cortex-A53: -0.64%
> Cortex-A57: 1.04%
> Cortex-A72: 1.17%
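
For anyone wanting to reproduce this kind of figure, here is a minimal
sketch of how a geomean speedup percentage could be computed from
per-program run times. It assumes speedup is defined as baseline time
divided by new time; the function name and the timings are made up for
illustration and are not Kristof's script or data.

```python
import math


def geomean_speedup_percent(baseline_times, new_times):
    """Geometric mean of per-program speedups, as a percentage.

    Assumption: speedup = baseline_time / new_time, so a positive
    result means the new configuration is faster on average.
    """
    if not baseline_times or len(baseline_times) != len(new_times):
        raise ValueError("need two non-empty lists of equal length")
    # Average the logs of the ratios, then exponentiate: this is the
    # geometric mean, which treats a 2x speedup and a 2x slowdown as
    # cancelling each other out.
    log_sum = sum(math.log(b / n) for b, n in zip(baseline_times, new_times))
    geomean = math.exp(log_sum / len(baseline_times))
    return (geomean - 1.0) * 100.0


# Hypothetical run times in seconds for three programs.
baseline = [10.0, 20.0, 5.0]   # e.g. built with the current default cpu
candidate = [9.8, 20.4, 5.0]   # e.g. built with -mcpu=generic
print(f"geomean speedup: {geomean_speedup_percent(baseline, candidate):+.2f}%")
```

The geometric mean is the usual choice for ratios like these because it
does not let one long-running program dominate the summary.
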
>
> Impact on execution speed, for each core, when using -march=armv8a, 
> after changing the default cpu from cortex-a53 to generic:
>
> (Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
> Cortex-A53: -0.09%
> Cortex-A57: -0.12%
> Cortex-A72: 0.03%
>
> *Should we enable scheduling for an in-order core even for -mcpu=generic?*
>
> The above measurements show that the biggest negative impact is with
> -march=armv7a on Cortex-A53: -0.64%.
> It seems that the in-order Cortex-A53 core loses quite a bit of
> performance when the instructions aren't scheduled, which is to be
> expected.
> Therefore, I also experimented with letting instructions be scheduled
> according to the Cortex-A8 pipeline model, even for -mcpu=generic, to
> find out whether it is beneficial to schedule instructions for an
> in-order core rather than not scheduling them at all.
>
> Measurement results:
>
> -march=armv7a
>
> Cortex-A9: 1.57% (up from 0.96%)
> Cortex-A53: 0.47% (up from -0.64%)
> Cortex-A57: 1.74% (up from 1.04%)
> Cortex-A72: 1.72% (up from 1.17%)
>
> -march=armv8a (Note that there isn't a pipeline model for Cortex-A53 
> in the 32-bit ARM backend):
>
> (Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
> Cortex-A53: 0.49% (up from -0.09%)
> Cortex-A57: 0.09% (up from -0.12%)
> Cortex-A72: 0.20% (up from 0.03%)
>
> Conclusion: for all the in-order and out-of-order cores I measured, 
> it's beneficial to get the instructions scheduled using the Cortex-A8 
> pipeline model in combination with -mcpu=generic.
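
The conclusion above can be checked directly against the numbers quoted
in the message. The sketch below only restates those figures and
subtracts them; the variable and function names are mine, and the
differences are in percentage points between two geomeans measured
against the same baseline, not exact relative speedups.

```python
# Geomean speedups (percent) quoted in the message, relative to the
# current defaults (cortex-a8 for armv7a, cortex-a53 for armv8a).
GENERIC_UNSCHEDULED = {
    "armv7a": {"Cortex-A9": 0.96, "Cortex-A53": -0.64,
               "Cortex-A57": 1.04, "Cortex-A72": 1.17},
    "armv8a": {"Cortex-A53": -0.09, "Cortex-A57": -0.12,
               "Cortex-A72": 0.03},
}
# Same measurements with the Cortex-A8 pipeline model used for scheduling.
GENERIC_A8_SCHEDULED = {
    "armv7a": {"Cortex-A9": 1.57, "Cortex-A53": 0.47,
               "Cortex-A57": 1.74, "Cortex-A72": 1.72},
    "armv8a": {"Cortex-A53": 0.49, "Cortex-A57": 0.09,
               "Cortex-A72": 0.20},
}


def scheduling_gains(unscheduled, scheduled):
    """Percentage-point gain from scheduling, per architecture and core."""
    return {
        arch: {core: round(scheduled[arch][core] - value, 2)
               for core, value in cores.items()}
        for arch, cores in unscheduled.items()
    }


gains = scheduling_gains(GENERIC_UNSCHEDULED, GENERIC_A8_SCHEDULED)
for arch, cores in gains.items():
    for core, gain in cores.items():
        print(f"-march={arch} {core}: {gain:+.2f} points")

# True only if scheduling helped on every measured core.
all_improved = all(g > 0 for cores in gains.values() for g in cores.values())
print("scheduling helps on every measured core:", all_improved)
```
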
>
>
> Taking into account the above measurements, my conclusions are:
> 1. We should make -mcpu=generic the default cpu, not Cortex-A8 or
> Cortex-A53, for -march=armv7a and -march=armv8a.
> 2. We probably want to let the compiler schedule instructions using 
> the Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit 
> of speedup on all cores tested.
>
> Do people agree with these conclusions?
> Any objections against implementing this?
> Any other potential impact this may have that I forgot to consider above?
>
> Thanks,
>
> Kristof


