[PATCH] D145301: Add more efficient vector bitcast for AArch64

Eli Friedman via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 28 09:25:42 PDT 2023


efriedma added inline comments.


================
Comment at: llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll:41
+; CHECK-NEXT:  fmov	    w8, s1
+; CHECK-NEXT:  orr	    w0, w9, w8, lsl #8
+; CHECK-NEXT:  ret
----------------
efriedma wrote:
> lawben wrote:
> > efriedma wrote:
> > > lawben wrote:
> > > > efriedma wrote:
> > > > > Instead of addv.8b+addv.8b+fmov+fmov+orr, you could use zip1+addv.8h+fmov, I think?
> > > > I did a [quick implementation with NEON intrinsics](https://godbolt.org/z/nz5P8TYn4). Your idea is correct, but it is combined into a different set of instructions in the end.
> > > > 
> > > > The gist of it: if we use `vzip_u8` to combine both halves, it returns a `uint8x8x2_t`, which we need to combine into a `uint8x16_t` for the `vadd.8h`. But this is essentially the same as just shuffling the input bytes of the original comparison result into the order `0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15`. As far as I know, there is no instruction to zip two "smaller" vectors into a "larger" one, so we need the shuffle (as `tbl`) here. 
> > > > 
> > > > On my M1 MacBook Pro, this is actually ~50% faster than my original code with two `addv`. We are replacing an `extract + addv + fmov + or` sequence with an `adrp + ldr + tbl` sequence. This seems to be a good trade-off, at least on the M1. I read somewhere that `addv` is quite expensive, so maybe removing one in favor of a `tbl` is good.
> > > > 
> > > > @efriedma @dmgreen What are your thoughts on this? I'm currently building on this patch in D148316. I would suggest merging that one first and then updating the `v16i8` strategy. 
> > > > 
> > > ext+zip1 vs. tbl isn't a huge difference in most cases.  (Maybe we're combining the NEON intrinsics a little too aggressively, though?  tbl is sort of slow on some chips.)
> > > 
> > > Fixing this as a followup to D148316 seems fine.
> > I did some benchmarks on my M1, Graviton 2, and Graviton 3. The results indicate the following things:
> > 
> > - Doing this for `v2i64` is beneficial on M1 and Graviton 2. On Graviton 3, it's slightly slower. So this puts us in a bit of a pickle. The gains on M1 and Graviton 2 are larger (+50% and +20%) than the loss on Graviton 3 (-10%), so I think it makes sense to keep this as is. What do you think?
> > - The `tbl` variant (the combination that I mentioned based on the `zip` approach proposed by @efriedma) is faster by ~50% on all three CPUs. So I think it makes sense to change the implementation to use a `shufflevector` + `addv` instead of two `addv`, as `addv` seems to be more expensive across the board.
> > - As soon as there are 4 elements or more, it pays off on all measured CPUs.
> > 
> > I'll submit a patch in the next few days to update the `v16i8` implementation, and possibly the `v2i64` one, depending on your comments.
> > 
> > > ext+zip1 vs. tbl isn't a huge difference in most cases
> > I'm still not quite sure I fully understood your suggested approach, though. I don't see how we can get this down to a single `zip1` instruction. As far as I can tell, we need 2x `ext`, `zip1`, and `zip2` + `ins` to combine the two again. In that case, `tbl` is probably a decent choice. But if you have a suggestion that actually gets this down to a single `zip`, please let me know so I can test it.
> > 
> ```
> #include <arm_neon.h>
> 
> uint16_t ext_zip(uint8x16_t a, uint8x16_t b) {
>     constexpr uint8x16_t mask = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};
>     auto matches = vceqq_u8(a, b);
>     auto masked_matches = vandq_u8(matches, mask);
>     auto zipped = vzip1q_u8(masked_matches, vextq_u8(masked_matches, masked_matches, 8));
>     return vaddvq_u16(vreinterpretq_u16_u8(zipped));
> }
> ```
> 
> https://godbolt.org/z/P9b8qWT3Y
> 
> Apparently there's a bug somewhere that makes `vzip1q_u8` not produce the right instruction with clang, but it works fine with gcc.
Filed https://github.com/llvm/llvm-project/issues/62434 for the odd vzip1q_u8 behavior.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D145301/new/

https://reviews.llvm.org/D145301



More information about the llvm-commits mailing list