[PATCH] D39415: [ARMISelLowering] Better handling of NEON load/store for sequential memory regions

Fri Nov 3 08:10:47 PDT 2017

eastig added a comment.

In https://reviews.llvm.org/D39415#915116, @rengolin wrote:

> Interesting. I'm guessing all TSVC reductions are due to the same function.
>
> It's also interesting that TSVC was the code that changed the most, and had no visible performance benefits.

Actually, it did, though not many.
There are some other regressions/improvements; I did not list them because the hottest code in those benchmarks is unchanged.
Full tables are attached:
F5467382: regr_a57.png <https://reviews.llvm.org/F5467382>
F5467384: imprv_a57.png <https://reviews.llvm.org/F5467384>

MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt has a 1.86% execution time regression and a 41.88% code size improvement.
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt has a 1.48% execution time improvement and a 39.61% code size improvement.
MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt has a 1.26% execution time improvement and a 41.61% code size improvement.

> Maybe hinting that the variation in other benchmarks could have been a side effect, not the use of post-increment, including the matrix multiply. Neither lencod nor salsa show up in code size changes.
> 
> It's quite likely that your A9 won't show much better results (or even different random results), given that it's OOO and pipelined between memory, ALU and vector operations.

What about running on a Cortex-A53 instead? Its pipeline is in-order, so it should be more sensitive to these changes.

https://reviews.llvm.org/D39415