[PATCH] D138203: [AArch64][InstCombine] Simplify repeated complex patterns in dupqlane

Fri Jan 13 06:50:44 PST 2023

sdesmalen accepted this revision.
sdesmalen added a comment.

LGTM (please address my nit on the test before you land it)

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1443-1448
+  for (auto Lhs = Vec.begin(), Rhs = Vec.begin() + HalfVecSize;
+       Rhs != Vec.end(); Lhs++, Rhs++) {
+    if (*Lhs != nullptr && *Rhs != nullptr && *Lhs == *Rhs)
+      continue;
+    return false;
+  }
----------------
MattDevereau wrote:
> sdesmalen wrote:
> > nit: please capitalise Lhs -> LHS and Rhs -> RHS (that's more common in LLVM for these variable names).
> > 
> > And feel free to ignore this suggestion, but I personally find this a little easier to read:
> > 
> >   ArrayRef<Value*> Ref(Vec);                                       
> >   auto LHS = Ref.take_front(HalfVecSize);
> >   auto RHS = Ref.drop_front(HalfVecSize);
> >   for (unsigned I=0; I<HalfVecSize; ++I) {
> >     if (LHS[I] != nullptr && RHS[I] != nullptr && LHS[I] == RHS[I])
> >       continue;
> > 
> > (i.e. using ArrayRef and indexing, rather than two changing pointers)
> I think there's pros and cons to this approach, in your example the setup looks nicer but I think the ability to resize the vector without passing indices through function parameters looks cleaner and makes the result easier to process. I will stick with the current implementation for now I think just to keep the review more compact.
My suggestion was unrelated to resizing the vector, it was only meant for this loop. But I'm happy with the current version too.

================
Comment at: llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll:30
+; CHECK-NEXT:    [[TMP3:%.*]] = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> [[TMP2]], i64 0)
+; CHECK-NEXT:    [[TMP4:%.*]] = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> [[TMP3]], i64 0)
+; CHECK-NEXT:    ret <vscale x 8 x half> [[TMP4]]
----------------
MattDevereau wrote:
> MattDevereau wrote:
> > sdesmalen wrote:
> > > It's a pity this case isn't optimised. This is probably because instcombine has transformed it into a 'shufflevector' splat. You could do something like this:
> > > 
> > >   if (Value *V = getSplatValue(CurrentInsertElt)) {              
> > >     InsertEltVec[0] = V;                                         
> > >     InsertEltVec.resize(1);                                      
> > >   } else {                                                       
> > >     while (InsertElementInst *InsertElt =                        
> > >                dyn_cast<InsertElementInst>(CurrentInsertElt)) {  
> > >     ...
> > >   }
> > > 
> > > to handle it.
> > I think the assembly output ends up as just a single MOV instruction and there isn't much to gain here (I'll double check this), the thinking behind this test was more just to prove the recursion can indeed get the smallest possible pattern `a`.
> This case currently gets emitted as a neon mov:
> 
> ```
> dup	v0.8h, v0.h[0]
> mov	z0.q, q0
> ret
> ```
> Implementing your suggestion does change this to `mov z0.h, h0` but also drastically changes all tests in `sve-intrinsic-opts-cmpne.ll`. I think this optimization is best left for another patch.
Fair enough, I didn't expect that would happen.

================
Comment at: llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll:136-137
+  %7 = insertelement <8 x half> %6, half %c, i64 6
+  %8 = insertelement <8 x half> %7, half %c, i64 7
+  %9 = insertelement <8 x half> %8, half %d, i64 7
+  %10 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %9, i64 0)
----------------
sdesmalen wrote:
> If you swap `%c` and `%d` this at least becomes a negative test to ensure LLVM doesn't perform the wrong optimisation. That's better than it being a positive test where it just happens to do the right thing because another InstCombine rule has fired first, even though your code would have optimised this incorrectly.
Can you rename this test so that it's clear this is a negative test? (and also maybe add a comment?)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138203/new/

https://reviews.llvm.org/D138203