[PATCH] D138203: [AArch64][InstCombine] Simplify repeated complex patterns in dupqlane

Matt Devereau via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jan 11 12:00:36 PST 2023


MattDevereau added inline comments.


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1443-1448
+  for (auto Lhs = Vec.begin(), Rhs = Vec.begin() + HalfVecSize;
+       Rhs != Vec.end(); Lhs++, Rhs++) {
+    if (*Lhs != nullptr && *Rhs != nullptr && *Lhs == *Rhs)
+      continue;
+    return false;
+  }
----------------
sdesmalen wrote:
> nit: please capitalise Lhs -> LHS and Rhs -> RHS (that's more common in LLVM for these variable names).
> 
> And feel free to ignore this suggestion, but I personally find this a little easier to read:
> 
>   ArrayRef<Value*> Ref(Vec);                                       
>   auto LHS = Ref.take_front(HalfVecSize);
>   auto RHS = Ref.drop_front(HalfVecSize);
>   for (unsigned I=0; I<HalfVecSize; ++I) {
>     if (LHS[I] != nullptr && RHS[I] != nullptr && LHS[I] == RHS[I])
>       continue;
> 
> (i.e. using ArrayRef and indexing, rather than two changing pointers)
I think there are pros and cons to this approach: the setup in your example looks nicer, but being able to resize the vector without passing indices through function parameters keeps the call sites cleaner and makes the result easier to process. I will stick with the current implementation for now, just to keep the review more compact.


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1451-1452
+  Vec.resize(HalfVecSize);
+  if (Vec.size() == 1)
+    return true;
+  SimplifyValuePattern(Vec);
----------------
sdesmalen wrote:
> Can this condition be moved to the start of this function?
Yes it can be, done


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1478
+
+  if (!isa<PoisonValue>(CurrentInsertElt) || !isa<PoisonValue>(Default) ||
+      !SimplifyValuePattern(InsertEltVec))
----------------
sdesmalen wrote:
> I don't think you'll need to check this, since you've already got the check for `*Lhs != nullptr && *Rhs != nullptr ` in SimplifyValuePattern, which should always return `false` if any of the lanes is not explicitly defined. And because the size of `InsertEltVec` is based on the minimum number of elements of the scalable vector, it will also return false for the following case:
> 
>   %0 = insertelement <2 x half> poison, half %a, i64 0
>   %1 = insertelement <2 x half> %0, half %b, i64 1
>   %2 = call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v2f16(<vscale x 8 x half> poison, <2 x half> %1, i64 0)
> 
> Because InsertEltVec will be defined as:
> 
>   [%a, %b, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr]
Fair enough, removed.


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1489
+  for (size_t I = 1; I < InsertEltVec.size(); I++) {
+    Constant *Idx = ConstantInt::get(Builder.getInt64Ty(), APInt(64, I));
+    InsertEltChain =
----------------
sdesmalen wrote:
> I think you can do just `Builder.getInt64(I)` (and then you don't need the extra variable for it, and can propagate it directly into the expression below, and also drop the curly braces :)
This loop no longer looks quite as hideous, thanks :)


================
Comment at: llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll:30
+; CHECK-NEXT:    [[TMP3:%.*]] = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> [[TMP2]], i64 0)
+; CHECK-NEXT:    [[TMP4:%.*]] = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> [[TMP3]], i64 0)
+; CHECK-NEXT:    ret <vscale x 8 x half> [[TMP4]]
----------------
MattDevereau wrote:
> sdesmalen wrote:
> > It's a pity this case isn't optimised. This is probably because instcombine has transformed it into a 'shufflevector' splat. You could do something like this:
> > 
> >   if (Value *V = getSplatValue(CurrentInsertElt)) {              
> >     InsertEltVec[0] = V;                                         
> >     InsertEltVec.resize(1);                                      
> >   } else {                                                       
> >     while (InsertElementInst *InsertElt =                        
> >                dyn_cast<InsertElementInst>(CurrentInsertElt)) {  
> >     ...
> >   }
> > 
> > to handle it.
> I think the assembly output ends up as just a single MOV instruction, so there isn't much to gain here (I'll double-check this); the thinking behind this test was more to prove the recursion can indeed reach the smallest possible pattern `a`.
This case currently gets emitted as a neon mov:

```
dup	v0.8h, v0.h[0]
mov	z0.q, q0
ret
```
Implementing your suggestion does change this to `mov z0.h, h0`, but it also drastically changes all tests in `sve-intrinsic-opts-cmpne.ll`. I think this optimization is best left for another patch.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138203/new/

https://reviews.llvm.org/D138203
