[PATCH] D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos

Kristof Beyls via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Jan 11 02:55:01 PST 2018


kristof.beyls added a comment.

In https://reviews.llvm.org/D38196#951927, @az wrote:

> Thank you for providing several very useful feedback to improve this patch. In the future (not very soon), I am planning to make this pass more complete by 1) implementing the existing vector element sub-pass in the same way as optimize interleaved store instructions sub-pass for consistency reason (put all rewrite rules and not just one, caching, instruction replacement table, etc.) 2) Addressing the ST3 instruction inefficiency assuming I can find an efficient combination of instructions to replace ST3. I hope I can put you as the reviewer.
>
> I would like also to get your feedback on a couple of things:
>
> 1. Is it worth replacing how code generation (i.e. building new instructions) is implemented in this pass? It is currently hard-coded but we can replace it by a table driven approach similar to the one used in the analysis phase (see IRT Table). In particular, I need to find a simple way to express through a table or other means how the new instructions are built out of the old one and more specifically how the operands of new instructions related to old ones. This adds complexity to the implementation and understanding of the code generation but makes adding more rewrite rules quite simple. It tried this before but the patch was not accepted but may be I should bring it again now that we added several new rewrite pattern in this patch and expect to add more for ST3.


I only have some hand-wavy high-level thoughts/opinions on this.
I see this pass, at a high level, to be about rewriting instruction sequences into sequences that are faster, when the info about instruction latency suggests that should be the case.
I would expect that this could also be beneficial for other, non-AArch64, targets. If that proves to be the case; and if there prove to be quite a few rewriting rules, I think the ideal end game is being able to represent potential rewriting rules per instruction set in some kind of TableGen syntax.
That would allow to keep the rewriting rules specified in a concise, easily understandable and maintainable way.

Going for that design right now seems premature to me: it isn't proven yet that there will be many useful such rewriting rules and it hasn't been proven yet that this is also useful for other instruction sets.
What the current implementation has proven, I think, is that it's possible to implement this functionality in a way such that it doesn't negatively impact compile time too much unnecessarily.

Would it be worth to try and take some steps when adding support for the ST3 pattern towards that hypothetical Tablegen end goal?
I'm not sure - it depends on how many more rewrite rules we expect will get added over time.
Do you happen to have any ideas of what other rewriting rules might plausibly be useful?
Or a method of somehow mining the latency information to somehow automatically suggest instruction sequences that might be worthwhile to consider for rewriting in a different way?

> 
> 
> 2. I still need your investigation about the latency for st4/st2 on A57. Based on my unit test case (which is posted somewhere in the comments of this patch), st4 is inefficient for cortex-a57 and the replacement instructions in this patch perform better. If you or somebody from ARM can verify on that, then this means that the latency information in A57 .td file is not accurate and we need to update it. My test for st2 is not conclusive related to whether st2 is better than the replacement instructions or not. Note that, we have few benchmarks that are compiled with cortex-a57 and we are not supposed to change that flag. It would be nice if the A57 .td file has accurate information.

As far as I know, the latency information for Cortex-A57 is correct in the .td file. There's a good chance that in your unit test you might be seeing the effect of another micro-architectural limiting factor than latency, e.g. resource usage, that this pass doesn't take into account when making decisions.


https://reviews.llvm.org/D38196





More information about the llvm-commits mailing list