[PATCH] D103629: [AArch64] Cost-model i8 vector loads/stores

Fri Jun 4 00:07:11 PDT 2021

SjoerdMeijer added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1254
+    // a vector with:
+    //   ld1 {v0.b}[0], [x0]
+    // followed by some offset calculation like:
----------------
dmgreen wrote:
> Prior to D102938, this wasn't true, and it still doesn't seem to be very true in general:
> https://godbolt.org/z/7KMrEqcMW
> Although the adds can be removed, the serialized ld1s won't be very cheap on many CPUs.
> 
> A factor of two might be enough to show they are expensive, but there would probably be some cases where performance was worse. 
> As Eli says, optimizing at least the 4 x i8 case, using a 32-bit load and a shuffle, sounds like a good idea.

> Prior to D102938, this wasn't true, and it still doesn't seem to be very true in general:
> https://godbolt.org/z/7KMrEqcMW

While there is some variation, the trend is still roughly 2 * #elements, i.e. about one ld1 plus one offset calculation per element.

> A factor of two might be enough to show they are expensive.

Yep, so that's what this patch does. Correct me if I am wrong, but it looks like we agree on that.
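A minimal sketch of the heuristic being discussed, assuming each i8 lane of a scalarized load/store costs one lane access (ld1 {v0.b}[N]) plus one offset calculation. The names InstrsPerI8Lane and estimateScalarizedI8MemOpCost are hypothetical; this is not the code in AArch64TargetTransformInfo.cpp, only an illustration of the "2 * #elements" rule.

```cpp
// Hypothetical model, not the actual AArch64 TTI code: each i8 lane of a
// scalarized vector load/store is assumed to cost one lane access
// (e.g. "ld1 {v0.b}[N], [x0]") plus one offset computation (e.g. an add).
constexpr unsigned InstrsPerI8Lane = 2;

// Cost grows linearly with the number of i8 elements.
constexpr unsigned estimateScalarizedI8MemOpCost(unsigned NumElements) {
  return InstrsPerI8Lane * NumElements;
}

// A <4 x i8> access is then modelled as 4 lanes x 2 = 8.
static_assert(estimateScalarizedI8MemOpCost(4) == 8, "4 x i8 cost");
```

The flat factor of two is the assumption under discussion here; as noted above, serialized ld1s may be more expensive than that on some CPUs.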

> As Eli says, optimizing at least the 4 x i8 case, using a 32-bit load and a shuffle, sounds like a good idea.

Yep, that's a nice suggestion, will look into that first.
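To illustrate the 4 x i8 suggestion, here is a scalar C++ sketch contrasting one byte load per lane with a single 32-bit load whose bytes are then placed into lanes. The function names are hypothetical, and the memcpy-based lane placement only stands in for the shuffle; it does not describe the exact AArch64 instructions the backend would select.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical baseline: one narrow load per lane, mirroring a serialized
// sequence of "ld1 {v0.b}[N], [x0]" lane inserts.
std::array<std::uint8_t, 4> loadV4i8PerLane(const std::uint8_t *P) {
  assert(P && "null source pointer");
  std::array<std::uint8_t, 4> V{};
  for (std::size_t I = 0; I < V.size(); ++I)
    V[I] = P[I]; // one byte load per element
  return V;
}

// Hypothetical alternative: a single 32-bit load, after which the bytes of
// that one value are distributed to the lanes (standing in for the shuffle).
// memcpy sidesteps alignment and strict-aliasing problems.
std::array<std::uint8_t, 4> loadV4i8Wide(const std::uint8_t *P) {
  assert(P && "null source pointer");
  std::uint32_t W;
  std::memcpy(&W, P, sizeof(W)); // one 32-bit load
  std::array<std::uint8_t, 4> V{};
  std::memcpy(V.data(), &W, sizeof(W)); // reinterpret as 4 x i8 lanes
  return V;
}
```

Both functions produce the same lanes; the point is that the wide version touches memory once instead of four times.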


https://reviews.llvm.org/D103629