[PATCH] D48332: [AArch64] Add custom lowering for v4i8 trunc store

Wed Jun 20 11:50:04 PDT 2018

zatrazz added a comment.

In https://reviews.llvm.org/D48332#1136937, @efriedma wrote:

> I wonder if we should prefer to widen `<2 x i8>` and `<4 x i8>` to `<8 x i8>` instead of promoting to `<4 x i16>`.   It would make stores like this a bit cheaper.  Maybe an interesting experiment at some point (mostly just modifying AArch64TargetLowering::getPreferredVectorAction, I think, and seeing what happens to the generated code).

I tried your suggestion, but without further tuning of the vector lowering it does not yield much gain for a vector store. The operation:

  %0 = trunc <4 x i32> %a to <4 x i8>
  store <4 x i8> %0, <4 x i8>* %p, align 4, !tbaa !2

is scalarized because LowerBUILD_VECTOR can't really see a good pattern to use on it:

  Custom lowering: t49: v8i8 = BUILD_VECTOR t37, t40, t43, t46, undef:i32, undef:i32, undef:i32, undef:i32
  AArch64TargetLowering::ReconstructShuffle
  Reshuffle failed: span too large for a VEXT to cope
  LowerBUILD_VECTOR: alternatives failed, creating sequence of INSERT_VECTOR_ELT
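
To make the difference between the two legalization strategies concrete, here is a small standalone model (not LLVM code; `VecType`, `promoteInteger`, `widenVector` and `kMinLegalBits` are hypothetical names, and the 64-bit minimum legal vector width is an assumption). It shows why the widened type ends up as a v8i8 whose upper four lanes are undef, while promotion gives v4i16:

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a fixed-width integer vector type such as v4i8.
struct VecType {
  unsigned lanes;    // number of elements
  unsigned eltBits;  // bits per element
  unsigned totalBits() const { return lanes * eltBits; }
  std::string name() const {
    return "v" + std::to_string(lanes) + "i" + std::to_string(eltBits);
  }
};

// Assumption: 64 bits is the smallest legal vector register width.
constexpr unsigned kMinLegalBits = 64;

// Promotion keeps the lane count and makes each element wider.
VecType promoteInteger(VecType t) {
  while (t.totalBits() < kMinLegalBits) t.eltBits *= 2;
  return t;
}

// Widening keeps the element width and adds extra (undefined) lanes.
VecType widenVector(VecType t) {
  while (t.totalBits() < kMinLegalBits) t.lanes *= 2;
  return t;
}
```

In this model, `promoteInteger(VecType{4, 8})` gives v4i16 and `widenVector(VecType{4, 8})` gives v8i8, matching the two options discussed in the quoted comment.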

Maybe if we handle v4i8 as v4i32 we could get better code generation, but that would also require more tuning in generic code. I do see better code generation for the trunc store of v2i32 to v2i8, but I am not convinced that this vector type is worth tuning.
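
For reference when comparing the generated code for these cases, the per-lane semantics of the truncating stores under discussion can be sketched in scalar C++ as follows. This is only an illustration; `truncStoreI32ToI8` is a hypothetical helper name, not something in the patch:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar reference for "trunc <N x i32> to <N x i8>" followed by a store:
// keep the low 8 bits of each lane and write N contiguous bytes to p.
template <std::size_t N>
void truncStoreI32ToI8(const std::array<std::uint32_t, N> &a,
                       std::uint8_t *p) {
  std::array<std::uint8_t, N> narrow{};
  for (std::size_t i = 0; i < N; ++i)
    narrow[i] = static_cast<std::uint8_t>(a[i] & 0xFFu);
  // For N == 4 the result is 4 bytes, so ideally one 32-bit store.
  std::memcpy(p, narrow.data(), N);
}
```

With N = 4 this corresponds to the v4i32 to v4i8 case from the IR above, and with N = 2 to the v2i32 to v2i8 case.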

> Do we need similar handling to this patch for `<2 x i16>` or `<2 x i8>`?

The trunc stores for v2i16 to v2i8 and v4i32 to v4i8 can indeed be optimized, but I think that is orthogonal to this optimization.

Repository:
  rL LLVM

https://reviews.llvm.org/D48332