[PATCH] D152826: [AArch64] Improve shuffles of i1 vectors (WIP)

Wed Jun 14 07:28:26 PDT 2023

rjj added a comment.

In D152826#4420599 <https://reviews.llvm.org/D152826#4420599>, @dmgreen wrote:

> Thanks for uploading this. It does look very similar to the case I was looking at recently, but goes further than the optimization I was trying. (That was about altering `anyext(buildvector(..))`, not doing it earlier which looks like it allows the shuffles to lower more efficiently). I will try to take a look, to get my head around how this works.

Thanks for taking a look! To give a bit of context, this was brought up by the following example (from cmsisdsp iirc):

  define <4 x i1> @t1(<2 x i1> %a, <2 x i1> %b)  {
    %r = shufflevector <2 x i1> %a, <2 x i1> %b, <4 x i32> <i32 0, i32 3, i32 0, i32 3>
    ret <4 x i1> %r
  }

This was getting lowered to:

  t1:                                     // @t1
          fmov    d2, d0
          mov     w8, v1.s[1]
          mov     v0.16b, v2.16b
          mov     v0.h[1], w8
          mov     v0.h[2], v2.h[0]
          mov     v0.h[3], w8
          ret

which is essentially an element-by-element insert.

However, the mask is symmetric, `<(0, 3), (0, 3)>`, so we could do better: build the first half of the result, then replicate it into the second half with a single move of a wider element type, e.g.:

  t1:
          mov     v0.h[1], v1.h[2]
          mov     v0.s[1], v0.s[0]
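
To make the lane arithmetic concrete, here is a small Python model of the two-instruction sequence above. It is only a sketch and not part of the patch. It assumes each `<2 x i1>` argument arrives promoted to two 32-bit lanes, viewed here as four 16-bit lanes with the payload in the even ones. The helper names `ins_h`, `ins_s` and `two_moves`, and the lane labels, are made up for illustration.

```python
# Toy model of a 64-bit NEON register as four 16-bit lanes (h[0]..h[3]).
# Assumption: a <2 x i1> argument is promoted to 2 x 32-bit lanes, so
# element k of the argument sits in h[2*k]; odd lanes hold don't-care bits.

def ins_h(dst, d, src, s):
    """Model `mov dst.h[d], src.h[s]`: copy one 16-bit lane."""
    out = list(dst)
    out[d] = src[s]
    return out

def ins_s(dst, d, src, s):
    """Model `mov dst.s[d], src.s[s]`: copy one 32-bit lane (two h lanes)."""
    out = list(dst)
    out[2 * d:2 * d + 2] = src[2 * s:2 * s + 2]
    return out

def two_moves(v0, v1):
    v0 = ins_h(v0, 1, v1, 2)   # mov v0.h[1], v1.h[2]
    v0 = ins_s(v0, 1, v0, 0)   # mov v0.s[1], v0.s[0]
    return v0

a = ["a0", "x", "a1", "x"]     # %a in v0
b = ["b0", "x", "b1", "x"]     # %b in v1
result = two_moves(a, b)
print(result)                  # lanes follow the mask <0, 3, 0, 3>
```

The first move fills in the missing element of the lower half, and the 32-bit move then copies that half wholesale, which is where the mask symmetry pays off.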

As I was working on this, I saw that the compiler could already do something similar with i32 elements (for example) instead of i1s. It was getting stuck on i1s because the initial SelectionDAG consisted of `vector_shuffle (concat (tx, undef), concat (ty, undef))`, which was type-legalised into `BUILD_VECTOR`s and eventually lowered to a sequence of inserts.

Since the `CopyFromReg` nodes were getting bitcast into smaller types during legalisation (`v4i16` in this case), I thought I could cut out the middleman and cast them earlier in the pipeline (making the necessary adjustments), thus avoiding the `undef` concatenations. The rest of the optimisations fell out of this.

In the end, for the example above we got the following assembly:

  t1:
          mov     v0.h[2], v1.h[2]
          uzp1    v0.4h, v0.4h, v0.4h
          ret

which, despite being a bit less obvious, is functionally equivalent to the two-move sequence above and should have the same performance on the V2.
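
As a rough sanity check of that equivalence, the following Python sketch models the `mov` + `uzp1` pair on 16-bit lanes. It is an illustration only, under the same assumption that element k of a promoted `<2 x i1>` argument sits in `h[2*k]`. The helper names `ins_h`, `uzp1_4h` and `patched`, and the lane labels, are hypothetical.

```python
# Toy model: a 64-bit NEON register is a list of four 16-bit lanes.
# Assumption: element k of a promoted <2 x i1> argument sits in h[2*k].

def ins_h(dst, d, src, s):
    """Model `mov dst.h[d], src.h[s]`: copy one 16-bit lane."""
    out = list(dst)
    out[d] = src[s]
    return out

def uzp1_4h(vn, vm):
    """Model `uzp1 vd.4h, vn.4h, vm.4h`: even lanes of vn, then of vm."""
    return (vn + vm)[0::2]

def patched(v0, v1):
    v0 = ins_h(v0, 2, v1, 2)   # mov  v0.h[2], v1.h[2]
    return uzp1_4h(v0, v0)     # uzp1 v0.4h, v0.4h, v0.4h

a = ["a0", "a0_hi", "a1", "a1_hi"]   # %a in v0
b = ["b0", "b0_hi", "b1", "b1_hi"]   # %b in v1
final = patched(a, b)
print(final)                         # lanes follow the mask <0, 3, 0, 3>
```

Here the `mov` places both wanted elements in even lanes, and `uzp1` with the same register for both operands selects the even lanes twice, giving the same lane order as the two-move version.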

You can check this example on Alive2 at https://alive2.llvm.org/ce/z/297iBB, where I've essentially written the IR that corresponds to the initial SelectionDAGs before and after the patch.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D152826