[PATCH] D147451: [CodeGen] Enable AArch64 SVE FCMLA/FCADD instruction generation in ComplexDeinterleaving

Igor Kirillov via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 11 09:17:20 PDT 2023


igor.kirillov marked an inline comment as not done.
igor.kirillov added inline comments.


================
Comment at: llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp:346
+  for (auto &B : F) {
+    if (HasScalableComplexSupport)
+      Changed |= evaluateBasicBlock(&B, true);
----------------
dmgreen wrote:
> Does this need to run twice with and without IsScalable? It doubles the scanning of instructions, and seems unnecessary if it only modifies the shuffles/intrinsic matches, which are either both valid or mutually exclusive.
I thought about it, but theoretically a target could support only one of the two (scalable or fixed-width vectors).
Alternatively, I could pass both flags to `ComplexDeinterleavingGraph` and check them each time a root node is found, but I am not sure that would be a better approach.


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:24460
+bool AArch64TargetLowering::isComplexDeinterleavingSupported(
+    bool UseScalable) const {
+  if (UseScalable)
----------------
mgabka wrote:
> I think it could be worth adding extra RUN lines to the existing AArch64 tests that operate on fixed-width vectors, to make sure that adding +sve does not stop generation of fcmla there. What do you think?
Added to `complex-deinterleaving-f16-add.ll`, where we have both shufflevectors and the deinterleave2 intrinsic applied to fixed-width vectors.


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:24473-24476
-  if ((VTyWidth < 128 && VTyWidth != 64) || !llvm::isPowerOf2_32(VTyWidth))
+  if ((VTyWidth < 128 && (VTy->isScalableTy() || VTyWidth != 64)) ||
+      !llvm::isPowerOf2_32(VTyWidth))
     return false;
 
   return (ScalarTy->isHalfTy() && Subtarget->hasFullFP16()) ||
----------------
mgabka wrote:
> NickGuy wrote:
> > When working with scalable vectors, they don't have the same restriction of bit width. Treating them with a max width of 128 bits seems wasteful and inefficient, is there any way to get the vector width at compile time (is there a `target->getMaxVectorWidth()` or something)?
> For the scalable vectors I don't think we want to use a min or max vector width; we should rather operate on the ElementCount and the size of the element types, I think.
> IIUC for the scalable vectors the condition we want to check is if we are operating on the packed vector types (in that case all are supported) or on the set of unpacked vectors we are supporting, am I correct?
> 
> In that case, maybe it is worth having dedicated sections for fixed-width and scalable vectors in this function?
@NickGuy Actually, this function returns false if VTyWidth is less than 128 bits, so any vector of 128 bits or more is supported.

@mgabka We support any unpacked type of size 2**X as long as 2**X >= 128; there is code in `AArch64TargetLowering::createComplexDeinterleavingIR` that splits those vectors until they reach a minimal size of 128 bits and then merges them back. We don't support 64-bit sized vectors (unlike Neon), and that's the condition I added to the `if` statement.


================
Comment at: llvm/test/CodeGen/AArch64/complex-deinterleaving-f16-add.ll:102-107
+; Expected to not transform
+define <4 x half> @complex_add_v4f16_with_intrinsic(<4 x half> %a, <4 x half> %b) {
+; CHECK-LABEL: complex_add_v4f16_with_intrinsic:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcadd v0.4h, v1.4h, v0.4h, #90
+; CHECK-NEXT:    ret
----------------
NickGuy wrote:
> Is the top comment correct? Is this case actually expected to not transform?
> Given the `fcadd` in the output, I'd assume that this is expected to transform.
Yes, indeed. I also fixed it in the other places.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D147451/new/

https://reviews.llvm.org/D147451


