[llvm] [AArch64][GlobalISel] Combine G_EXTRACT_VECTOR_ELT and G_BUILD_VECTOR sequences into G_SHUFFLE_VECTOR (PR #110545)

Amara Emerson via llvm-commits llvm-commits at lists.llvm.org
Thu Oct 3 11:03:19 PDT 2024


aemerson wrote:

> First off, thanks for the great, detailed comment, Amara. I don't have a lot of experience with AArch64, so your perspective is very helpful.
> 
> > So while I see some nice code quality improvements in the tests, I'm not convinced this is a good transformation to make in the general case. The problem is that vector extracts->inserts are simple operations and a shuffle is in general a large and expensive operation, unless it can be pattern matched to a more precise variant like zip.
> 
> The combiner is based on the lowering of the shuffle vector. When you lower a shuffle vector, it will [turn into a sequence of extract->buildvector](https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/GlobalISel/LegalizeHelper.cpp#L8065) operations, which is exactly what I match here. So any backend either uses this lowering and performs the same, or has a more efficient implementation of shuffle vector and performs better. However, I am not sure whether this assumption actually holds in reality, so hopefully you can shine a light on that.

Yes, that's true, but let's think about this from first principles. In LLVM IR, we don't have separate instructions for things like vector concat, zip, etc., because the general shuffle can implement those. We rely on optimizations later in codegen to select the most optimal instructions for some patterns.

In GISel MIR, we don't have the same philosophy as the IR. Our goal is not to have a minimal MIR, but instead to aid in lowering and selecting good target instructions. So we decompose some complex operations that targets don't natively support into simpler ops, because one of the factors we now have to take into account is instruction legality.

If we were to perform this transform, we would be going from simpler (but more numerous) operations to a single, more complex operation. To get performant output, we then rely on another transform to re-optimize our output into the ideal form. Therefore, our transform is not so much an optimization as a form of canonicalization.
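To make the canonicalization concrete, here is a hypothetical MIR sketch (register names and types are invented for illustration, and the syntax is simplified; `%c0`/`%c1` stand for `G_CONSTANT` lane indices) of the extract->buildvector sequence and the shuffle it would be canonicalized into:

```
; Before: per-lane extracts rebuilt into a wider vector
%e0:_(s32) = G_EXTRACT_VECTOR_ELT %a:_(<2 x s32>), %c0:_(s64)
%e1:_(s32) = G_EXTRACT_VECTOR_ELT %a:_(<2 x s32>), %c1:_(s64)
%e2:_(s32) = G_EXTRACT_VECTOR_ELT %b:_(<2 x s32>), %c0:_(s64)
%e3:_(s32) = G_EXTRACT_VECTOR_ELT %b:_(<2 x s32>), %c1:_(s64)
%v:_(<4 x s32>) = G_BUILD_VECTOR %e0, %e1, %e2, %e3

; After the proposed combine: the canonical shuffle form
%v:_(<4 x s32>) = G_SHUFFLE_VECTOR %a, %b, shufflemask(0, 1, 2, 3)
```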

If we instead decided to pattern match directly into, for example, a vector concat, we'd skip the canonicalization step and go straight to the instruction we want to emit. There's no ambiguity about the best final output; in this case it's always a G_CONCAT_VECTORS. So going directly has benefits in 1) not having to spend time going through the intermediate step, and 2) guaranteeing the output we generate.
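For that particular mask, a direct combine would produce the concat in one step; sketched in the same hypothetical, simplified MIR notation:

```
; Direct match: the (0, 1, 2, 3) pattern over two <2 x s32> inputs
; is just a concatenation
%v:_(<4 x s32>) = G_CONCAT_VECTORS %a:_(<2 x s32>), %b:_(<2 x s32>)
```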

> 
> > In your example in this PR description, your input could have been optimized instead into a G_CONCAT_VECTORS, right?
> 
> You are right and it does actually. [Another combiner](https://github.com/llvm/llvm-project/blob/6c7a3f80e75de36f2642110a077664e948d9e7e3/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp#L475) turns the shufflevector into a `G_CONCAT_VECTORS`, which is a consequence of this combiner being there. This is where the performance improvements in the tests come from. For example, it turns the large sequence of operations in [arm64-neon-copy.ll](https://github.com/llvm/llvm-project/pull/110545/commits/d3edbb416eab0ada38ac0146f1de61507039e608#diff-7f02a23a0657bd646f3015c3e828c7cd50f0cda5aec72f4536766f97b4c67cf2L1925) into a shufflevector, which is then turned into a `G_CONCAT_VECTORS`.
> 
> I didn't mention it since it is a consequence of another combiner kicking in, although this is an intentional effect. I probably should have added a comment explaining the speedups for the cases where the output conforms to the SelectionDAG output, but I didn't want to overload the PR with additional comments. Apologies for the confusion caused by that oversight.
> 
> > More specific operations are usually faster and they're also easier for other optimizations to reason about. Shuffles suffer from the problem that they're more opaque unless we do expensive analysis.
> 
> The background of the PR is that during my internship I worked on extending the analysis of shufflevectors and replacing them with more specific opcodes. By pulling these sequences into a shufflevector, the analysis runs on it, and then either the sequence is replaced by an equivalent opcode or, in the generic case, the shufflevector is lowered back into the exact same sequence as before. You mention that the analysis is expensive, which is true, but we are already running it for [shufflevectors](https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp#L475), and I am hoping the speed improvements in the generated code will make up for the additional compile-time overhead.
> 
> With some luck, my previous company will let me upstream that code and allow me to implement some of the ideas I have for reducing the cost of this analysis. But no promises.

Improving the pattern matching of shuffle idioms so that we lower to more optimal instructions sounds like a good idea. I don't think that means we need to canonicalize everything into shufflevectors, however.



https://github.com/llvm/llvm-project/pull/110545


More information about the llvm-commits mailing list