[PATCH] D148316: [AArch64] Add support for efficient bitcast in vector truncate store.

Wed Apr 26 08:59:08 PDT 2023

dmgreen accepted this revision.
dmgreen added a comment.
This revision is now accepted and ready to land.

Thanks for checking about the performance. LGTM in that case.

================
Comment at: llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll:227
+
+define void @store_2_elements_64_bit_vector(<2 x i32> %vec, ptr %out) {
+; CHECK-LABEL: lCPI8_0:
----------------
lawben wrote:
> dmgreen wrote:
> > Some of these with low vector lanes are starting to look worse than the code before. The fmov/strb could be done on the fp side, but I don't think that would be enough to make them profitable. Is it worth limiting it to >= 4 vector lanes?
> On my M1, it is still faster though. But I agree that this needs a bit of investigation on other ARM CPUs.
> 
> Suggestion: In a follow-up patch, with the changes to `v16i8` suggested by @efriedma, I'll run a set of benchmarks for the `v16i8` and the `v2i64` (and other) cases on my M1, Graviton 2, Graviton 3, and a Pi 4 (and possibly an A64FX, but that has terrible NEON performance across the board). I have this set up for another project at the moment anyway. Then this should give us a bit of a wider range of performance characteristics.
> 
> So I'd suggest leaving this as is in this patch and then doing a performance-based follow-up patch. Thoughts?
Sure - I was just going from the number of instructions and the extra constant pools. The critical path looks longer (but that might not matter much), and there are fewer FPR->GPR transfers, which will help.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148316/new/

https://reviews.llvm.org/D148316