[PATCH] D123638: [SLP][AArch64] Implement lookahead operand reordering score of splat loads for AArch64

Wed Apr 20 11:26:53 PDT 2022

dmgreen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2616
+      if (IsLoad) {
+        assert(isLegalBroadcastLoad(Tp->getElementType(),
+                                    LT.second.getVectorElementCount()) &&
----------------
This assert looks like it will fire a lot of the time. Should we use `IsLoad && isLegalBroadcastLoad(...)`?

That might make this testable from the cost model too, Even if it's slightly unorthodox to use a vector load for such cases.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2703
+    if (Kind == TTI::SK_Broadcast) {
+      bool IsLoad = !Args.empty() && llvm::all_of(Args, [](const Value *V) {
+        return isa<LoadInst>(V);
----------------
vporpo wrote:
> dmgreen wrote:
> > I would expect it to have one Arg for a splat.
> > 
> > I guess this would not match the canonical representation for a splat load too, except from where it is being created from scalar code. Not sure what to do about that, but it seems unfortunate that the cost will go up after it has been vectorized.
> > ```
> >   %l = load i8, i8 *%p
> >   %i = insertelement <16 x i8> poison, i8 %l, i32 0
> >   %s = shufflevector <16 x i8> %i, <16 x i8> poison, <16 x i32> zeroinitializer
> > ```
> Yes, I think this won't be matched.  `Args` is meant to be used for scalar code, but that's a great point, we could improve this in a future patch.
Three instructions patterns are tough to cost-model nicely is llvm at the moment. It needs too much code to check down through uses and up through operands.

The first part of my comment was meaning that Args should only have 1 element for a splat.
```
 IsLoad = Args.size() == 1 && isa<LoadInstr>(Args[0]);
```
Do we expect it get called when there are multiple loads too?

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h:285
+    // Return true if we can generate a `ld1r` splat load instruction.
+    if (!ST->hasNEON() || NumElements.isScalable())
+      return false;
----------------
fhahn wrote:
> It looks like there is at least some test coverage missing, e.g. I think we need a test with scalable vectors and with different element types other than double.
I don't think we _can_ test scalable vectors, this is only callable from the slp vectorizer, unfortunately.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h:285
+    // ld1r
+    return ST->hasNEON();
+  }
----------------
vporpo wrote:
> dmgreen wrote:
> > The "legal" types for NEON would be something where the legalized type is one of
> > {v8i8, v16i8, v4i16, v8i16, v2i32, v4i32, v1i64, v2i64}.
> > But also the floating point equivalents of them. I guess I don't really know what "legal" means here. The X86 version of this method seems to account for NumElements, but if a v4i64 can just be legalized to two v2i64's, it can still generate a splatload efficiently and copy that to another value.
> > 
> > Hmm. For now lets say that the ElementCount size needs to be one of {8, 16, 32, 64}, and the size of the vector would be at least 64bits. That should rule out types we don't have at least, and larger vectors can still be treated as cheap.
> "Legal" means that we can efficiently generate an instruction that handles a load + broadcast. In x86 this is done with the `movddup` instruction which seems to do this efficiently for two 64-bit elements, which is what gets accepted by `isLegalBroadcast()`.
> 
> The `ld1r` instruction in AArch64 seems to support a wide range of type, like the ones you listed, though I am not sure how efficient all of these are. I guess we can accept the element count sizes you propose {8, 16, 32, 64}, which sounds right, or we can stick to a `2 x double` that we know for sure that it works well.
It is the "efficiently generate" that gets me, it being a pretty imprecise term. Would an architecture that uses two operations count? Many architectures will split the operation into two micro-ops in any case. Why not just give it a cost like other functions. The reciprocal throughput is 1. It's more clear what that means compared to other instructions.  The other isLegalMaskedLoad/isLegalMaskedGather functions I can see a place for - they effectively tell you what the canonical representation for a gather should be - should it be treated as a vector operation that has vector optimizations on it or as a group of scalar operations that go through scalar optimizations.

Ignore me though. It's a fairly minor complaint, and it working is much better than it not working :)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123638/new/

https://reviews.llvm.org/D123638