[PATCH] D25020: [ARM] Fix 26% performance regression on Cortex-A9 caused by not using VMLA/VMLS

Fri Oct 14 08:29:51 PDT 2016

rengolin added a comment.

Hi Evgeny,

I think the best thing to do right now is to check the documentation, prepare a plan, and then test on the different cores.

In the manuals [1], I could only find instruction cycle timings for A8 [2] and A9 [3,4], but not for the other cores. So we'll have to make some assumptions and verify them on the cores themselves.

I imagine that A8's model is followed by A7 (in-order cores), while A9's model is followed by A15 and Krait (OOO cores), but we'll have to make sure our assumptions are correct. Full benchmarks may be too big for this, so we could try running fabricated snippets of VMUL/VADD/VMLA, with and without dependencies, in tight loops, as they should yield big differences on different cores.
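
A rough sketch of what such a fabricated snippet could look like, in C. Nothing here comes from the patch: the function names and the `USE_VMLA` / `USE_VMUL_VADD` macros are made up for illustration, and the inline assembly assumes a GCC/Clang toolchain targeting 32-bit ARM with VFP. Without either macro the code falls back to plain C and the compiler picks the instructions, which is only useful for checking that the harness works.

```c
#include <stdio.h>
#include <time.h>

/* One multiply-accumulate step: returns acc + a * b.
 * USE_VMLA and USE_VMUL_VADD are hypothetical build-time switches that pin
 * the instruction sequence on 32-bit ARM with VFP. */
static inline float mla_step(float acc, float a, float b) {
#if defined(__arm__) && defined(__ARM_FP) && defined(USE_VMLA)
    __asm__ volatile("vmla.f32 %0, %1, %2" : "+t"(acc) : "t"(a), "t"(b));
    return acc;
#elif defined(__arm__) && defined(__ARM_FP) && defined(USE_VMUL_VADD)
    float tmp;
    __asm__ volatile("vmul.f32 %0, %1, %2" : "=t"(tmp) : "t"(a), "t"(b));
    __asm__ volatile("vadd.f32 %0, %0, %1" : "+t"(acc) : "t"(tmp));
    return acc;
#else
    return acc + a * b; /* portable fallback, instruction choice left to the compiler */
#endif
}

/* With dependency: every step needs the accumulator from the previous step. */
float dependent_chain(long n, float a, float b) {
    float acc = 0.0f;
    for (long i = 0; i < n; ++i)
        acc = mla_step(acc, a, b);
    return acc;
}

/* Without dependency between neighbours: four accumulators are interleaved,
 * so adjacent steps do not wait on each other. Assumes n is a multiple of 4;
 * any remainder is ignored. */
float independent_chains(long n, float a, float b) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (long i = 0; i + 3 < n; i += 4) {
        acc0 = mla_step(acc0, a, b);
        acc1 = mla_step(acc1, a, b);
        acc2 = mla_step(acc2, a, b);
        acc3 = mla_step(acc3, a, b);
    }
    return (acc0 + acc1) + (acc2 + acc3);
}

/* Runs fn once, stores its result, and returns the CPU time used in seconds. */
double seconds_for(float (*fn)(long, float, float), long n, float a, float b,
                   float *result) {
    clock_t start = clock();
    *result = fn(n, a, b);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Prints both timings; the results are printed too so the loops stay live. */
void report_timings(long n, float a, float b) {
    float r_dep, r_ind;
    double t_dep = seconds_for(dependent_chain, n, a, b, &r_dep);
    double t_ind = seconds_for(independent_chains, n, a, b, &r_ind);
    printf("dependent:   %.6f s (result %g)\n", t_dep, r_dep);
    printf("independent: %.6f s (result %g)\n", t_ind, r_ind);
}
```

Calling `report_timings` from `main` with a large iteration count, once per build variant and once per core, would give four numbers to compare per core. The interesting signal is whether the dependent VMLA chain is noticeably slower than the VMUL+VADD one on a given core, which is what the hazard and forwarding features are supposed to model.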

An alternative would be to control MLxHazard and MLxForwarding via flags and run `Benchmarks/Misc/matmul_f64_4x4`, which seems to show the biggest difference of them all.

We have Cortex-A7 (RPi2), A8 (Beagle), A9 (Panda), A15 (Chromebooks) and Krait (Dragon). We'd only be missing Swift to make sure we have covered all relevant cores.

I think the best course of action now is to combine forces on coding and testing and come up with a concrete solution, based on real data, for applying those feature flags to the right cores. Right now, the situation is a big mess and I don't want to make it worse.

cheers,
--renato

[1] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.cortexa/index.html
[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/BCGDCECC.html
[3] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409i/BCGJIBBD.html
[4] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409i/BCGDCIBA.html

https://reviews.llvm.org/D25020