[PATCH] D145301: Add more efficient vector bitcast for AArch64
Eli Friedman via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Apr 24 09:35:17 PDT 2023
efriedma added inline comments.
================
Comment at: llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll:41
+; CHECK-NEXT: fmov w8, s1
+; CHECK-NEXT: orr w0, w9, w8, lsl #8
+; CHECK-NEXT: ret
----------------
lawben wrote:
> efriedma wrote:
> > Instead of addv.8b+addv.8b+fmov+fmov+orr, you could use zip1+addv.8h+fmov, I think?
> I did a [quick implementation with NEON intrinsics](https://godbolt.org/z/nz5P8TYn4). Your idea is correct, but it is combined into a different set of instructions in the end.
>
> The gist of it is: if we use `vzip_u8` to combine both halves, this returns a `uint8x8x2_t`, which we need to combine into a `uint8x16_t` for the `vadd.8h`. But this is essentially the same as just shuffling the input bytes of the original comparison result in the form `0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15`. As far as I know, there is no instruction to zip two "smaller" vectors into a "larger" one, so we need the shuffle (as `tbl`) here.
>
> On my M1 MacBook Pro, this is actually ~50% faster than my original code with two `addv`. We are replacing an `extract + addv + fmov + or` with `adrp + ldr + tbl`. This seems to be a good trade-off, at least on an M1. I read somewhere that `addv` is quite expensive, so maybe trading one for a `tbl` is good.
>
> @efriedma @dmgreen What are your thoughts on this? I'm currently building on this patch in D148316. I would suggest merging that one first and then updating the `v16i8` strategy.
ext+zip1 vs. tbl isn't a huge difference in most cases. (Maybe we're combining the NEON intrinsics a little too aggressively, though? tbl is sort of slow on some chips.)
Fixing this as a followup to D148316 seems fine.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D145301/new/
https://reviews.llvm.org/D145301
More information about the llvm-commits
mailing list