[PATCH] D145301: Add more efficient vector bitcast for AArch64
Eli Friedman via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Apr 24 09:35:17 PDT 2023
efriedma added inline comments.
================
Comment at: llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll:41
+; CHECK-NEXT: fmov w8, s1
+; CHECK-NEXT: orr w0, w9, w8, lsl #8
+; CHECK-NEXT: ret
----------------
lawben wrote:
> efriedma wrote:
> > Instead of addv.8b+addv.8b+fmov+fmov+orr, you could use zip1+addv.8h+fmov, I think?
> I did a [quick implementation with NEON intrinsics](https://godbolt.org/z/nz5P8TYn4). Your idea is correct, but it is combined into a different set of instructions in the end.
>
> The gist of it is: if we use `vzip_u8` to combine both halves, this returns a `uint8x8x2_t`, which we need to combine into a `uint8x16_t` for the `vadd.8h`. But this is essentially the same as just shuffling the input bytes of the original comparison result in the form `0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15`. As far as I know, there is no instruction to zip two "smaller" vectors into a "larger" one, so we need the shuffle (as `tbl`) here.
>
> On my M1 MacBook Pro, this is actually ~50% faster than my original code with two `addv`. We are replacing an `extract + addv + fmov + or` with `adrp + ldr + tbl`. This seems to be a good trade-off, at least on an M1. I read somewhere that `addv` is quite expensive, so maybe trading one for a `tbl` is good.
>
> @efriedma @dmgreen What are your thoughts on this? I'm currently building on this patch in D148316. I would suggest merging that one first and then updating the `v16i8` strategy.
ext+zip1 vs. tbl isn't a huge difference in most cases. (Maybe we're combining the NEON intrinsics a little too aggressively, though? tbl is sort of slow on some chips.)
Fixing this as a followup to D148316 seems fine.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D145301/new/
https://reviews.llvm.org/D145301
More information about the llvm-commits
mailing list