[PATCH] D148316: [AArch64] Add support for efficient bitcast in vector truncate store.

Wed Apr 26 02:47:27 PDT 2023

lawben added inline comments.

================
Comment at: llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll:227
+
+define void @store_2_elements_64_bit_vector(<2 x i32> %vec, ptr %out) {
+; CHECK-LABEL: lCPI8_0:
----------------
dmgreen wrote:
> Some of these with low vector lanes are starting to look worse than the code before. The fmov/strb could be done on the fp side, but I don't think that would be enough to make them profitable. Is it worth limiting it to >= 4 vector lanes?
On my M1, it is still faster though. But I agree that this needs a bit of investigation on other ARM CPUs.

Suggestion: In a follow-up patch, with the changes to `v16i8` suggested by @efriedma, I'll run a set of benchmarks for the `v16i8` and the `v2i64` (and other) cases on my M1, Graviton 2, Graviton 3, and a Pi 4 (and possibly an A64FX, but that has terrible NEON performance across the board). I already have this set up for another project, so it should give us a wider range of performance characteristics.

So I'd suggest leaving this as is in this patch and then doing a performance-based follow-up patch. Thoughts?
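
For readers without the test file at hand, here is a rough scalar sketch of what the 2-lane case is assumed to compute. The body of `store_2_elements_64_bit_vector` is not shown in this comment, so this assumes it compares each `i32` lane against zero and stores the resulting `<2 x i1>` as a packed bitmask with lane 0 in the least significant bit (little-endian layout). The name `pack_nonzero_mask_2` is hypothetical and not part of the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference, NOT code from the patch.
// Assumption: the test does `icmp ne <2 x i32> %vec, zeroinitializer`
// and stores the <2 x i1> result, i.e. one mask bit per lane.
std::uint8_t pack_nonzero_mask_2(const std::array<std::int32_t, 2>& vec) {
    std::uint8_t mask = 0;
    for (std::size_t lane = 0; lane < vec.size(); ++lane) {
        if (vec[lane] != 0) {
            // Assumption: lane i maps to bit i of the stored byte.
            mask |= static_cast<std::uint8_t>(1u << lane);
        }
    }
    return mask;
}
```

With only two lanes the scalar version is two compares and an OR, which is why the vector sequence has little room to win here and why the benchmarks matter.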

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148316/new/

https://reviews.llvm.org/D148316