[PATCH] D123638: [SLP][AArch64] Implement lookahead operand reordering score of splat loads for AArch64

Wed Apr 20 08:28:09 PDT 2022

vporpo marked an inline comment as done.
vporpo added a comment.

Thank you Dave for the review.

> Do you have any performance results to suggest it is a good idea, past the obvious that it sounds like it should be better? Some very quick runs didn't look amazing, but perhaps they are in some ways unlucky.

Yes, I tried this on an AArch64 machine. It improves the BM_Dot_RealComplex_EigenDotFixed<double_16> test from the Eigen benchmark by more than 10%. But this only tests for a `2 x double` type. I am not sure how it performs for other types.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2703
+    if (Kind == TTI::SK_Broadcast) {
+      bool IsLoad = !Args.empty() && llvm::all_of(Args, [](const Value *V) {
+        return isa<LoadInst>(V);
----------------
dmgreen wrote:
> I would expect it to have one Arg for a splat.
> 
> I guess this would not match the canonical representation for a splat load too, except from where it is being created from scalar code. Not sure what to do about that, but it seems unfortunate that the cost will go up after it has been vectorized.
> ```
>   %l = load i8, i8 *%p
>   %i = insertelement <16 x i8> poison, i8 %l, i32 0
>   %s = shufflevector <16 x i8> %i, <16 x i8> poison, <16 x i32> zeroinitializer
> ```
Yes, I think this won't be matched.  `Args` is meant to be used for scalar code, but that's a great point, we could improve this in a future patch.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h:283

+  bool isLegalBroadcastLoad(Type *ElementTy, unsigned NumElements) const {
+    // ld1r
----------------
dmgreen wrote:
> I'm not a fan of these "isLegal" methods for things that should just have a cost.
> It should either be passed a VectorType or the NumElements should be an ElementCount, to allow for scalable vector types.
I agree, ideally this should be handled transparently by cost functions. The reason why we need this is that we are not currently using the TTI cost model functions for the operand reordering scores in `getShallowScore()`. So we use this function as a way to check whether the target supports this type of combined load + broadcast instructions.

Regarding using `ElementCount` this needs some refactoring on the X86 side too, so I will upload one refactoring patch before this, with some of these changes.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h:285
+    // ld1r
+    return ST->hasNEON();
+  }
----------------
dmgreen wrote:
> The "legal" types for NEON would be something where the legalized type is one of
> {v8i8, v16i8, v4i16, v8i16, v2i32, v4i32, v1i64, v2i64}.
> But also the floating point equivalents of them. I guess I don't really know what "legal" means here. The X86 version of this method seems to account for NumElements, but if a v4i64 can just be legalized to two v2i64's, it can still generate a splatload efficiently and copy that to another value.
> 
> Hmm. For now lets say that the ElementCount size needs to be one of {8, 16, 32, 64}, and the size of the vector would be at least 64bits. That should rule out types we don't have at least, and larger vectors can still be treated as cheap.
"Legal" means that we can efficiently generate an instruction that handles a load + broadcast. In x86 this is done with the `movddup` instruction which seems to do this efficiently for two 64-bit elements, which is what gets accepted by `isLegalBroadcast()`.

The `ld1r` instruction in AArch64 seems to support a wide range of type, like the ones you listed, though I am not sure how efficient all of these are. I guess we can accept the element count sizes you propose {8, 16, 32, 64}, which sounds right, or we can stick to a `2 x double` that we know for sure that it works well.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123638/new/

https://reviews.llvm.org/D123638