[PATCH] D39415: [ARMISelLowering] Better handling of NEON load/store for sequential memory regions

Thu Nov 2 10:30:12 PDT 2017

rengolin added a comment.

In https://reviews.llvm.org/D39415#914104, @evgeny777 wrote:

> The only user of Addr operand (VST1_UPD) is Store(Value, Addr). What should I do next?

All three stores chain up to the same load, with different offsets. Can't you look for that pattern?

@eli.friedman / @majnemer should really give their input here, as I'm reaching the limit of my combine knowledge.

> Also sort (qsort) introduces additional O(N*logN) complexity in average case and O(N*N) in the worst case.

First, there is no requirement for standard libraries to use qsort. I believe stdlibc++ uses introsort (worst-case nlogn), not sure about libc++.

Still, you only sort a very short list of integers IFF the pattern is matched, paying the (small) cost only for the rare cases where it does.

Let's assume you have the absurd amount of 10 integers. That's 100 passes for N^2 in the worst possible case, ~33 for nlogn, only when the list is non-empty.

While iterating through the use chain and dereferencing pointers through N for loops will grant you cache misses, huge number of loads, potential branch prediction failures. For *all* cases.

I'll let you guess which one is worse.

> also I ran LLVM test-suite

No you didn't. You only compiled the test suite.

> However I could only cross-compile it and check that there are no more asserions. I don't have any linux ARM devices and running test suite on Android looks difficult. 
>  I'll try jailbroken iPad tomorrow which might probably be easier

You can't push a patch that will affect every Arm device on the planet by testing on NDA hardware with NDA operating system and hope all is well. But it gets worse, because you have only "tested" your small matrix multiply snippet and claimed victory.

In reality, you have to demonstrate, from your own effort, that you have done your homework.

In general, your homework is at least:

1. Build and check-all, making sure you built the Arm back-end (so Lit tests run on Arm code-gen). Check.
2. Convince people your algorithm is the best we can possibly do in this case. We're doing that, but I'm not convinced yet.

Meanwhile, we have demonstrated that you have broken the test-suite on Arm (the architecture your patch targets). In this case you *must*:

3. *Run* the test-suite in test mode, on Arm hardware (Linux, Apple, even Android would be fine). Everything must pass. Not check.

Since this is a performance improvement with serious risk of compile time issues, the additional items are required:

4. Run standard compiler benchmarks (SPEC, EEMBC, Coremark) on Arm hardware (Linux, Apple, even Android would be fine). Not check.
  1. Report results for anything greater than 1% change (up and down) on both run time and compile time.
  2. Report the geomeans of all benchmarks, discuss the results.
  3. A nice touch would be running the test-suite in benchmark mode, which does that more or less automagically.
5. Invite other people to run benchmarks on other machines which the patch targets, discuss results. I've done that, but they will only do it when the algorithm is right.

In this particular case, the Arm back-end targets a lot more ARMv7 cores than ARMv8, so at least some benchmarking has to be done on ARMv7 hardware.

It would be fine for you to benchmark on ARMv8 and wait for other people to do the same on ARMv7.

But before anyone else gets involved in their own benchmarks, you *have* to have finished, and presented, steps 1~4 above.

cheers,
--renato

https://reviews.llvm.org/D39415