[llvm] [AArch64][GlobalISel] Combine G_EXTRACT_VECTOR_ELT and G_BUILD_VECTOR sequences into G_SHUFFLE_VECTOR (PR #110545)
via llvm-commits
llvm-commits at lists.llvm.org
Wed Oct 2 17:01:17 PDT 2024
ValentijnvdBeek wrote:
First off, thanks for the great, detailed comment, Amara. I don't have a lot of experience with AArch64, so your perspective is very helpful.
> So while I see some nice code quality improvements in the tests, I'm not convinced this is a good transformation to make in the general case. The problem is that vector extracts->inserts are simple operations and a shuffle is in general a large and expensive operation, unless it can be pattern matched to a more precise variant like zip.
The combiner is based on how the shuffle vector is lowered. When a shuffle vector is lowered, it [turns into a sequence of extract->buildvector](https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/GlobalISel/LegalizeHelper.cpp#L8065) operations, which is exactly what I match here. So any backend either uses this lowering and performs the same, or has a more efficient implementation of shuffle vector and performs better. However, I am not sure whether this assumption actually holds in practice, so hopefully you can shed some light on that.
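To make the claimed equivalence concrete, here is a minimal sketch (plain C++, not LLVM code; the function name is mine) of what the generic lowering computes: each mask element becomes a G_EXTRACT_VECTOR_ELT from one of the two sources, and the results feed a G_BUILD_VECTOR.

```cpp
#include <cstddef>
#include <vector>

// Generic shuffle lowering, modeled on ints: mask indices 0..N-1 select
// a lane from Src1, indices N..2N-1 select from Src2, and -1 is an
// undef lane (modeled here as 0). Each loop iteration corresponds to
// one G_EXTRACT_VECTOR_ELT; the result corresponds to the operands of
// the final G_BUILD_VECTOR.
std::vector<int> lowerShuffle(const std::vector<int> &Src1,
                              const std::vector<int> &Src2,
                              const std::vector<int> &Mask) {
  std::vector<int> Result;
  const size_t N = Src1.size();
  for (int Idx : Mask) {
    if (Idx < 0)
      Result.push_back(0); // undef lane
    else if (static_cast<size_t>(Idx) < N)
      Result.push_back(Src1[Idx]); // extract from the first source
    else
      Result.push_back(Src2[Idx - N]); // extract from the second source
  }
  return Result;
}
```

The combine in this PR pattern-matches the inverse direction: a sequence of extracts feeding a build_vector is folded back into one shuffle with the corresponding mask.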
> In your example in this PR description, your input could have been optimized instead into a G_CONCAT_VECTOR right?
You are right, and it does. [Another combiner](https://github.com/llvm/llvm-project/blob/6c7a3f80e75de36f2642110a077664e948d9e7e3/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp#L475) turns the shuffle vector into a `G_CONCAT_VECTORS`, which is a consequence of this combiner being there. This is where the performance improvements in the tests come from. For example, it turns the large sequence of operations in [arm64-neon-copy.ll](https://github.com/llvm/llvm-project/pull/110545/commits/d3edbb416eab0ada38ac0146f1de61507039e608#diff-7f02a23a0657bd646f3015c3e828c7cd50f0cda5aec72f4536766f97b4c67cf2L1925) into a shuffle vector, which is then turned into a `G_CONCAT_VECTORS`.
I didn't mention it since it is a consequence of another combiner kicking in, although the effect is intentional. I probably should have added a comment explaining the speedups for the cases where the output matches the SelectionDAG output, but I didn't want to overload the PR with additional comments. Apologies for the confusion caused by that oversight.
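The concat case above comes down to a simple mask-shape check. A hedged sketch (my own illustrative helper, not the actual CombinerHelper code): a shuffle whose mask is the identity over both inputs, `0, 1, ..., 2N-1`, just concatenates them, so it can become a `G_CONCAT_VECTORS`.

```cpp
#include <cstddef>
#include <vector>

// Returns true if the shuffle mask simply concatenates the two sources,
// i.e. it is the identity permutation over all 2N input lanes.
// Illustrative sketch only.
bool isConcatMask(const std::vector<int> &Mask, size_t NumSrcElts) {
  if (Mask.size() != 2 * NumSrcElts)
    return false;
  for (size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] != static_cast<int>(I))
      return false;
  return true;
}
```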
> More specific operations are usually faster and they're also easier for other optimizations to reason about. Shuffles suffer from the problem that they're more opaque unless we do expensive analysis.
The background of the PR is that during my internship I worked on extending the analysis of shufflevectors and replacing them with more specific opcodes. By pulling these sequences into a shufflevector, the analysis runs on it, and then either the sequence is replaced by an equivalent opcode or, in the generic case, the shufflevector is lowered back into exactly the same sequence as before. You mention that the analysis is expensive, which is true, but we already run this analysis for [shufflevectors](https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp#L475), and I am hoping that the speed improvements in the generated code will make up for the additional compile-time overhead.
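As one example of the kind of pattern such an analysis looks for, a "zip1"-style mask interleaves the low halves of the two sources, `{0, N/2... wait, {0, N, 1, N+1, ...}`; recognizing it would let the shuffle be replaced by a cheaper target opcode such as AArch64 ZIP1. A hedged sketch of the mask check (my own illustrative code, not the backend matcher):

```cpp
#include <vector>

// Returns true if Mask interleaves the low halves of two N-element
// sources: {0, N, 1, N+1, ...}. This is the shape a zip1-style
// instruction implements. Illustrative sketch only.
bool isZip1Mask(const std::vector<int> &Mask, int N) {
  if (static_cast<int>(Mask.size()) != N)
    return false;
  for (int I = 0; I < N; ++I) {
    int Expected = (I % 2 == 0) ? I / 2 : N + I / 2;
    if (Mask[I] != Expected)
      return false;
  }
  return true;
}
```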
With some luck, my previous company will let me upstream that code and implement some of the ideas I have for reducing the cost of this analysis. But no promises.
https://github.com/llvm/llvm-project/pull/110545