[PATCH] D103629: [AArch64] Cost-model i8 vector loads/stores

Fri Jun 4 09:00:27 PDT 2021

SjoerdMeijer added a comment.

In D103629#2797291 <https://reviews.llvm.org/D103629#2797291>, @efriedma wrote:

>> And while we don't have a load instruction that supports this
>
> If `<4 x i8>` loads matter, we should probably convert them to a 32-bit load followed by a zip1, which would have a cost of 2.  (Or possibly 3 on big-endian, I guess.)  Basically the inverse of LowerTruncateVectorStore.

A question about this. I will keep looking a bit longer because my zip1-fu is not so strong, but I am struggling to see what the codegen would look like. For an example like this:

  define <4 x i32> @f(<4 x i8>* %a, <4 x i32> %b) {
    %x = load <4 x i8>, <4 x i8>* %a
    %y = sext <4 x i8> %x to <4 x i32>
    %z = add <4 x i32> %y, %b
    ret <4 x i32> %z
  }

I am failing to see how with something like

  fmov s0, w0
  zip1.8b v0, v0, v0

I would get the bytes sign-extended and in the right place for the 128-bit add.
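
One way to reason about the lane movement is a small Python model. This is only a sketch under assumptions: little-endian, the load done as a 32-bit `ldr s0, [x0]`, zip1 on byte lanes (the `.8b` arrangement), and a shl/sshr pair for the in-register sign extension. The helper names are hypothetical, and the sequence is not claimed to be what the backend would emit. In this model zip1 only puts byte i into the low half of 16-bit lane i; the sign extension has to come from separate instructions.

```python
# Lane-level model of the sequence discussed above (little-endian assumed).
# All helper names are hypothetical; they are not LLVM or ACLE APIs.

def zip1_8b(vn, vm):
    """ZIP1 on 8 byte lanes: interleave the low four bytes of vn and vm."""
    out = []
    for i in range(4):
        out += [vn[i], vm[i]]
    return out

def as_4h(bytes8):
    """Reinterpret eight bytes as four little-endian 16-bit lanes."""
    return [bytes8[2 * i] | (bytes8[2 * i + 1] << 8) for i in range(4)]

def sext(value, bits):
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def sign_extend_in_reg_4h(lanes):
    """Model shl #8 followed by sshr #8 on each 16-bit lane."""
    return [sext((lane << 8) & 0xFFFF, 16) >> 8 for lane in lanes]

def f(mem, b):
    """Model of @f: load <4 x i8>, sext to <4 x i32>, add %b."""
    v = list(mem[:4]) + [0] * 4        # 32-bit load, upper bytes zeroed
    lanes = as_4h(zip1_8b(v, v))       # byte i now sits in 16-bit lane i
    y = sign_extend_in_reg_4h(lanes)   # zip1 alone does not sign-extend
    # Widening to 32 bits keeps the signed values; the add wraps at 32 bits.
    return [sext(yi + bi, 32) for yi, bi in zip(y, b)]
```

For input bytes 0x01, 0x80, 0x7F, 0xFF, the lanes straight after the zip1 step are 0x0101, 0x8080, 0x7F7F, 0xFFFF, so the high byte of each lane is a copy of the low byte rather than a sign extension.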

https://reviews.llvm.org/D103629