[PATCH] D103629: [AArch64] Cost-model i8 vector loads/stores

Thu Jun 3 22:18:34 PDT 2021

dmgreen added a reviewer: asavonic.
dmgreen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1254
+    // a vector with:
+    //   ld1 {v0.b}[0], [x0]
+    // followed by some offset calculation like:
----------------
Prior to D102938 this wasn't true, and it still doesn't seem to hold in general:
https://godbolt.org/z/7KMrEqcMW
The adds can be removed, but the serialized ld1s still won't be cheap on many CPUs.

A factor of two might be enough to show they are expensive, but there would probably be some cases where performance was worse.
As Eli says, optimizing at least the 4 x i8 case by using a 32-bit load and a shuffle sounds like a good idea.
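
To make the 32-bit load suggestion concrete, here is a minimal portable C++ sketch of the idea: fetch the four bytes with one 32-bit access, treat them as byte lanes, then widen. The helper name `loadAndZext4xI8` is hypothetical and not part of the patch, and whether a given compiler lowers this to a single scalar load plus a vector extend on AArch64 is an assumption that has not been checked here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the patch): models "<4 x i8> load + zext to
// <4 x i16>" using one 32-bit memory access instead of four per-lane loads.
std::array<std::uint16_t, 4> loadAndZext4xI8(const std::uint8_t *Src) {
  // One 32-bit load of the four adjacent bytes.
  std::uint32_t Word;
  std::memcpy(&Word, Src, sizeof(Word));

  // Reinterpret the word as four byte lanes. Copying back through memcpy
  // keeps memory order, so the result does not depend on endianness.
  std::array<std::uint8_t, 4> Lanes;
  std::memcpy(Lanes.data(), &Word, sizeof(Word));

  // Zero-extend each lane from i8 to i16.
  std::array<std::uint16_t, 4> Out;
  for (std::size_t I = 0; I < Out.size(); ++I)
    Out[I] = Lanes[I];
  return Out;
}
```

The point of the sketch is only the shape of the access pattern: one wide load followed by lane-wise work, rather than a chain of dependent single-lane inserts.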

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1256
-      // We generate 2 instructions per vector element.
-      return NumVectorizableInstsToAmortize * NumVecElts * 2;
-    }
----------------
SjoerdMeijer wrote:
> I was also wondering if this was just a bug, because what we are doing here is `NumVecElts * 2 * NumVecElts * 2`. For a `<4 x i8>` that results in a cost of 64. If this was intentional, then I don't think I follow this.
My rough understanding was that you really don't want the vectorizer to produce
  <4 x i8> load
  <4 x i16> zext
You want to make sure it's at least 8 elements wide:
  <8 x i8> load
  <8 x i16> zext
That way the load/extend isn't serialized and uses d and q register instructions as expected.

So the costs are deliberately high: high enough to prevent the scalarization and the cross-register-bank moves. The cost may be higher than that of the individual instructions, but that is what it takes to steer the vectorizer towards profitable code.
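
For reference, the arithmetic in the quoted comment can be checked with a small sketch. The function name `oldI8VectorMemCost` is hypothetical, and the definition of `NumVectorizableInstsToAmortize` as `NumVecElts * 2` is inferred from the quoted `NumVecElts * 2 * NumVecElts * 2` rather than copied from the source file.

```cpp
// Hypothetical stand-in for the removed lines; not the actual LLVM function.
constexpr unsigned oldI8VectorMemCost(unsigned NumVecElts) {
  // Assumption taken from the quoted comment: the amortization factor is
  // itself NumVecElts * 2.
  const unsigned NumVectorizableInstsToAmortize = NumVecElts * 2;
  // "We generate 2 instructions per vector element."
  return NumVectorizableInstsToAmortize * NumVecElts * 2;
}

// <4 x i8>: (4 * 2) * 4 * 2 = 64, matching the figure in the quoted comment.
static_assert(oldI8VectorMemCost(4) == 64, "cost for <4 x i8>");
```

Under these assumptions the cost grows quadratically with the element count, which is what makes the narrow i8 vectors look unattractive to the vectorizer.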

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103629/new/

https://reviews.llvm.org/D103629