[PATCH] D104630: [AArch64][CostModel] Add cost model for experimental.vector.splice

Tue Jun 22 01:43:22 PDT 2021

sdesmalen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1900
       { TTI::SK_Reverse, MVT::nxv2i1,   1 },
+      // Handle the cases for vector.splice with scalable vectors
+      { TTI::SK_Splice, MVT::nxv16i8,  1 },
----------------
Separate from the type, I think we'll need to distinguish the costs based on the value of the index as well.

Given two scalable vectors <x0, x1, x2, x3> and <y0, y1, y2, y3>:
* For a positive offset we can use SVE's EXT instruction. E.g. when splicing at offset #1, the result of the splice will be <x1, x2, x3, y0>.
* For a negative offset we can't use EXT, but we can instead use SPLICE, which requires generating a predicate. For an offset of -1 we need the predicate <0, 0, 0, 1>. This means the operation can be done using whilelt+not+splice, so negative offsets would be more expensive.
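
To make the two cases concrete, here is a minimal scalar sketch of the splice semantics on fixed-length data. It is not LLVM code: the names `spliceModel`, `whileltModel`, `notModel` and `trailingPredicate` are hypothetical and only model the lane selection described above, not the actual EXT/SPLICE lowering.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference model of vector.splice: concatenate A and B, then
// take A.size() lanes starting at Offset (non-negative offset) or at
// A.size() + Offset (negative offset, i.e. the trailing -Offset lanes of A).
std::vector<int> spliceModel(const std::vector<int> &A,
                             const std::vector<int> &B, int Offset) {
  assert(A.size() == B.size());
  const int N = static_cast<int>(A.size());
  assert(Offset >= -N && Offset < N);
  std::vector<int> Concat(A);
  Concat.insert(Concat.end(), B.begin(), B.end());
  const int Start = Offset >= 0 ? Offset : N + Offset;
  return std::vector<int>(Concat.begin() + Start, Concat.begin() + Start + N);
}

// Scalar stand-in for whilelt: lane I is active iff I < Bound.
std::vector<bool> whileltModel(std::size_t N, std::size_t Bound) {
  std::vector<bool> P(N);
  for (std::size_t I = 0; I < N; ++I)
    P[I] = I < Bound;
  return P;
}

// Scalar stand-in for a predicate not: flip every lane.
std::vector<bool> notModel(std::vector<bool> P) {
  P.flip();
  return P;
}

// Predicate with only the last K lanes active, built as not(whilelt(N - K)).
// This is the predicate shape a negative offset of -K would need.
std::vector<bool> trailingPredicate(std::size_t N, std::size_t K) {
  assert(K <= N);
  return notModel(whileltModel(N, N - K));
}
```

With x = <0, 1, 2, 3> and y = <10, 11, 12, 13>, `spliceModel(x, y, 1)` yields <1, 2, 3, 10> and `spliceModel(x, y, -1)` yields <3, 10, 11, 12>, while `trailingPredicate(4, 1)` yields <0, 0, 0, 1>.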

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1901-1913
+      { TTI::SK_Splice, MVT::nxv16i8,  1 },
+      { TTI::SK_Splice, MVT::nxv8i16,  1 },
+      { TTI::SK_Splice, MVT::nxv4i32,  1 },
+      { TTI::SK_Splice, MVT::nxv2i64,  1 },
+      { TTI::SK_Splice, MVT::nxv2f16,  1 },
+      { TTI::SK_Splice, MVT::nxv4f16,  1 },
+      { TTI::SK_Splice, MVT::nxv8f16,  1 },
----------------
At the moment, the cost for these is actually quite high because they're expanded to two stores and one reload. That said, I'd prefer not to reflect that in the cost-model, because this is not the desired code-gen and we should favour enabling more scalable vectorization to get more testing coverage.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1914-1917
+      { TTI::SK_Splice, MVT::nxv16i1,  1 },
+      { TTI::SK_Splice, MVT::nxv8i1,   1 },
+      { TTI::SK_Splice, MVT::nxv4i1,   1 },
+      { TTI::SK_Splice, MVT::nxv2i1,   1 },
----------------
The predicates require two stores, a reload and an additional compare operation. Since there is no dedicated instruction for splicing predicates, it should be fair to model the cost as that of two stores, a reload and a compare.
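
As a rough sketch of what that would mean, the predicate cost is just the sum of the four component costs. The struct and function below are hypothetical, not the TTI interface, and the per-operation costs are placeholders that a real implementation would take from the target's cost model.

```cpp
// Hypothetical per-operation costs; placeholders, not values from LLVM.
struct MemOpCosts {
  int Store;
  int Load;
  int Compare;
};

// Cost of a predicate splice that is expanded through the stack:
// two stores, one reload and one compare to rebuild the predicate.
int predicateSpliceCost(const MemOpCosts &C) {
  return 2 * C.Store + C.Load + C.Compare;
}
```

For example, if every component is assumed to cost 1, `predicateSpliceCost(MemOpCosts{1, 1, 1})` gives 4 rather than the 1 currently in the table.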

================
Comment at: llvm/test/Analysis/CostModel/AArch64/splice.ll:34
+
+  %splice.nv16i8 = call < 16 x i8> @llvm.experimental.vector.splice.nv16i8(< 16 x i8> zeroinitializer, < 16 x i8> zeroinitializer, i32 -1)
+  %splice.nv32i8 = call < 32 x i8> @llvm.experimental.vector.splice.nv32i8(< 32 x i8> zeroinitializer, < 32 x i8> zeroinitializer, i32 -1)
----------------
nv?

================
Comment at: llvm/test/Analysis/CostModel/AArch64/splice.ll:59
+  %splice.nv8i1 =  call < 8 x i1> @llvm.experimental.vector.splice.nv8i1(< 8 x i1> zeroinitializer, < 8 x i1> zeroinitializer, i32 -1)
+  %splice.nv4i1 = call < 4 x i1> @llvm.experimental.vector.splice.nv4i1(< 4 x i1> zeroinitializer, < 4 x i1> zeroinitializer, i32 -1)
+  %splice.nv2i1 = call < 2 x i1> @llvm.experimental.vector.splice.nv2i1(< 2 x i1> zeroinitializer, < 2 x i1> zeroinitializer, i32 -1)
----------------
odd spaces.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104630/new/

https://reviews.llvm.org/D104630