[PATCH] D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos

Wed Oct 25 08:05:16 PDT 2017

az marked 2 inline comments as done.
az added a comment.

> Thanks. I assume you confirmed there are no performance regressions at all in the test-suite?

Yes, there is no regression in the llvm test-suite.

>> Note that other targets that we tested did not in general have this problem. However, I noticed through unit testing that ARM A57 had this problem with st4 only (and not with st2) even though A57 latencies in AArch64SchedA57.td do not indicate any problem and hence this pass does not affect A57 (somebody from ARM may need to double check on those latency).
> 
> I had a very brief glance and the latencies specified for ST4 in AArch64SchedA57.td. They didn't seem wrong at a glance. There's a chance you saw some other effects? I'll try to find some time to check those instruction latencies for Cortex-A57 more closely.

Unless there is something wrong with my experimental setup, I am confirming that I see a problem related to the ST4 latencies. I am experimenting now with timing the following one-line loop that has an st4 instruction in it:  
#pragma nounroll
for (i=0; i<10000000; i++)

  vst4q_s32 (ptr, st);

With my patch, no change to the st4 instruction under -mcpu=cortex-a57 because of the advertised latency. If I change my patch (by commenting calls to shouldExitEarly and shouldReplaceInstruction) so that we do not check latencies and do always the replacement, I get significant improvement.

For the ST2 experiment (using vst2q_s32()), my results are not conclusive given that results are close but there is chance that this has a small problem as well.

> More importantly this potentially being an issue on cores for ST4 only and not ST2 highlights the need for AArch64SIMDInstrOpt::shouldExitEarly to not bail out early if it finds the optimization isn't useful for ST2, but instead it should also check the other optimizations (e.g. ST4) this pass can do.

Agree but when the original pass is written, there is a concern from another target without the problem that we should a have a minimal check before really entering the pass. So, I am doing the smallest check that shows the problem. I have no problem doing both checks or doing the more expensive ST4 check only so that it can trigger the pass for ARM and Exynos. In case you are getting back to us soon about the ARM latency, then let's wait until then and we can then decide what to have in early exit code.

> Other than the above I don't think I've got other major concerns left-over.
>  Thanks very much for your prompt replies and actions!

https://reviews.llvm.org/D38196