[PATCH] D152826: [AArch64] Improve shuffles of i1 vectors (WIP)
Ricardo Jesus via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Jun 14 07:28:26 PDT 2023
rjj added a comment.
In D152826#4420599 <https://reviews.llvm.org/D152826#4420599>, @dmgreen wrote:
> Thanks for uploading this. It does look very similar to the case I was looking at recently, but goes further than the optimization I was trying. (That was about altering `anyext(buildvector(..))`, not doing it earlier which looks like it allows the shuffles to lower more efficiently). I will try to take a look, to get my head around how this works.
Thanks for taking a look! To give a bit of context, this was prompted by the following example (from CMSIS-DSP, iirc):
  define <4 x i1> @t1(<2 x i1> %a, <2 x i1> %b) {
    %r = shufflevector <2 x i1> %a, <2 x i1> %b, <4 x i32> <i32 0, i32 3, i32 0, i32 3>
    ret <4 x i1> %r
  }
This was getting lowered to:
  t1:                                     // @t1
          fmov    d2, d0
          mov     w8, v1.s[1]
          mov     v0.16b, v2.16b
          mov     v0.h[1], w8
          mov     v0.h[2], v2.h[0]
          mov     v0.h[3], w8
          ret
which is essentially an element-by-element insert.
However, the mask is symmetric, `<(0, 3), (0, 3)>`: its second half repeats its first. We could take advantage of this by building the first half of the result and then using a wider element type to copy it into the second half, e.g.:
  t1:
          mov     v0.h[1], v1.h[2]
          mov     v0.s[1], v0.s[0]
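As a sanity check on the lane arithmetic, here is a small Python model of that two-move sequence. It is only a sketch, not LLVM code: the helper names (`shuffle`, `two_move_sequence`) are made up for illustration, and it assumes each `<2 x i1>` argument arrives as two 32-bit lanes viewed as four 16-bit lanes, with the payload in the even lanes and don't-care bits in the odd ones.

```python
def shuffle(a, b, mask):
    """Model of shufflevector: index into the concatenation of a and b."""
    both = list(a) + list(b)
    return [both[i] for i in mask]


def two_move_sequence(d0, d1):
    """Model of the two-move sequence on 16-bit lanes (h[0]..h[3])."""
    v0 = list(d0)
    v0[1] = d1[2]      # mov v0.h[1], v1.h[2]
    v0[2:4] = v0[0:2]  # mov v0.s[1], v0.s[0]  (s[0] = h[0..1], s[1] = h[2..3])
    return v0


mask = [0, 3, 0, 3]
# The symmetry being exploited: the second half of the mask repeats the first.
assert mask[:2] == mask[2:]

# Assumed register layout: payload in even h lanes, don't-care bits in odd ones.
d0 = ["a0", "a0.hi", "a1", "a1.hi"]
d1 = ["b0", "b0.hi", "b1", "b1.hi"]

expected = shuffle(["a0", "a1"], ["b0", "b1"], mask)
assert two_move_sequence(d0, d1) == expected
```

The first move builds the `(0, 3)` pair in the low 32 bits, and the second move copies that pair as a single 32-bit element, which is where the wider element type pays off.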
As I was working on this, I saw that the compiler was already able to do something similar if we used i32s (for example) instead of i1s. It was getting stuck with i1s because the initial SelectionDAG consisted of `vector_shuffle (concat (tx, undef), concat (ty, undef))`, which would get type-legalised into `BUILD_VECTOR`s and eventually lead to a sequence of inserts.
Since the `CopyFromReg` nodes were getting bitcast into smaller types during legalisation (`v4i16` in this case), I thought I could cut out the middleman and cast them earlier in the pipeline (making the necessary adjustments), thus avoiding the `undef` concatenations. The rest of the optimisations followed from this.
In the end, for the example above we got the following assembly:
  t1:
          mov     v0.h[2], v1.h[2]
          uzp1    v0.4h, v0.4h, v0.4h
          ret
which, despite being a bit less obvious, is functionally equivalent to the assembly with two moves and should have the same performance on the V2.
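For readers who find the `uzp1` form less obvious, the equivalence can be checked with a small Python model. This is a sketch under assumptions, not LLVM code: the helper names (`uzp1`, `patched_sequence`) are made up for illustration, and the inputs are assumed to be laid out as four 16-bit lanes with the payload in the even lanes. `uzp1` is modelled as taking the even-indexed elements of the concatenation of its two sources.

```python
def uzp1(vn, vm):
    """Model of UZP1: even-indexed elements of the concatenation vn:vm."""
    both = list(vn) + list(vm)
    return both[0::2]


def patched_sequence(d0, d1):
    """Model of the post-patch assembly on 16-bit lanes (h[0]..h[3])."""
    v0 = list(d0)
    v0[2] = d1[2]        # mov  v0.h[2], v1.h[2]
    return uzp1(v0, v0)  # uzp1 v0.4h, v0.4h, v0.4h


# Assumed register layout: payload in even h lanes, don't-care bits in odd ones.
d0 = ["a0", "a0.hi", "a1", "a1.hi"]
d1 = ["b0", "b0.hi", "b1", "b1.hi"]

# Result the shuffle mask <0, 3, 0, 3> asks for: a[0], b[1], a[0], b[1].
assert patched_sequence(d0, d1) == ["a0", "b1", "a0", "b1"]
```

After the insert, the even lanes of `v0` hold `a0` and `b1`, so unzipping `v0` with itself yields that pair twice, which is exactly the symmetric result.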
You can check this example on Alive2 here: https://alive2.llvm.org/ce/z/297iBB. There I've essentially written the IR that corresponds to the initial SelectionDAGs before and after the patch.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D152826/new/
https://reviews.llvm.org/D152826