[PATCH] D145301: Add more efficient vector bitcast for AArch64

Lawrence Benson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Apr 22 04:05:10 PDT 2023


lawben added inline comments.


================
Comment at: llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll:41
+; CHECK-NEXT:  fmov	    w8, s1
+; CHECK-NEXT:  orr	    w0, w9, w8, lsl #8
+; CHECK-NEXT:  ret
----------------
efriedma wrote:
> Instead of addv.8b+addv.8b+fmov+fmov+orr, you could use zip1+addv.8h+fmov, I think?
I did a [quick implementation with NEON intrinsics](https://godbolt.org/z/nz5P8TYn4). Your idea is correct, but the compiler ultimately combines it into a different set of instructions.

The gist of it: if we use `vzip_u8` to combine both halves, this returns a `uint8x8x2_t`, which we then need to combine into a `uint8x16_t` for the `addv.8h`. But that is essentially the same as just shuffling the input bytes of the original comparison result in the order `0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15`. As far as I know, there is no instruction to zip two "smaller" vectors into a "larger" one, so we need the shuffle (as `tbl`) here.

On my M1 MacBook Pro, this is actually ~50% faster than my original code with two `addv`. We are replacing an `extract + addv + fmov + orr` with `adrp + ldr + tbl`. This seems to be a good trade-off, at least on an M1. I have read that `addv` is quite expensive, so trading one for a `tbl` may be a win.

@efriedma @dmgreen What are your thoughts on this? I'm currently building on this patch in D148316. I would suggest merging that one first and then updating the `v16i8` strategy.




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D145301/new/

https://reviews.llvm.org/D145301
