[PATCH] D120912: [AArch64][SVE] Convert gather/scatter with a stride of 2 to contiguous loads/stores

Paul Walker via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 14 06:25:23 PDT 2022


paulwalker-arm added a comment.

Perhaps I've misunderstood something, but I'd like to take a step back to make sure we're not overcomplicating things here and, at the same time, making it harder to handle more cases in the future.



================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:16801-16806
+  if (MemVT == MVT::nxv2i64 || MemVT == MVT::nxv2f64)
+    ShiftAmt = 3;
+  else if (MemVT == MVT::nxv4i32 || MemVT == MVT::nxv4f32)
+    ShiftAmt = 2;
+  else
+    return SDValue();
----------------
Is this a temporary restriction?  I ask because, when I asked about doing this in IR instead of CodeGen, one of the answers was the ability to support illegal types.

To be clear, I'm not against solving this in CodeGen, but my assumption is that one of the biggest wins for this transform comes when gathering/scattering bytes or halfwords, which the gather/scatter instructions do not support and which thus require splitting.  However, they can be handled more efficiently with masked loads and stores.
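
For the sake of argument, here's a sketch of how the check might be generalised so that illegal types fall through to legalisation; it derives the shift from the element size rather than enumerating types (my extrapolation, not code from the patch):

  // Hypothetical generalisation: ShiftAmt is just log2 of the element
  // size in bytes, so nxv8i16/nxv16i8 would pass through here and be
  // split by type legalisation later.
  unsigned EltBits = MemVT.getScalarSizeInBits();
  if (!isPowerOf2_32(EltBits) || EltBits < 8)
    return SDValue();
  unsigned ShiftAmt = Log2_32(EltBits / 8);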


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:16816-16818
+  auto PFalse = SDValue(DAG.getMachineNode(AArch64::PFALSE, DL, MaskVT), 0);
+  auto PredL = DAG.getNode(AArch64ISD::ZIP1, DL, MaskVT, Mask, PFalse);
+  auto PredH = DAG.getNode(AArch64ISD::ZIP2, DL, MaskVT, Mask, PFalse);
----------------
Can these use `ISD::EXTRACT_SUBVECTOR` instead?  I ask because those nodes will emit better code (removing the need for `pfalse`) and also allow DAGCombine to optimise further where possible (e.g. when Mask is a constant).
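
For example, if the combine were restructured around a single double-width access (see my next comment), the half predicates fall out with generic nodes along these lines (a sketch only, with `WideMask`/`WideMaskVT` standing for an assumed interleaved double-width mask):

  // Hypothetical sketch: split a double-width mask into halves with
  // generic nodes instead of target-specific ZIP + PFALSE.
  EVT HalfVT = WideMaskVT.getHalfNumVectorElementsVT(*DAG.getContext());
  SDValue PredL = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, WideMask,
                              DAG.getVectorIdxConstant(0, DL));
  SDValue PredH = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, WideMask,
                              DAG.getVectorIdxConstant(
                                  HalfVT.getVectorMinNumElements(), DL));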

With that said, ...


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:16827-16845
+    // from the first in the second.
+    auto LoadL = DAG.getMaskedLoad(VT, DL, Chain, BasePtr, DAG.getUNDEF(BaseVT),
+                                   PredL, MGT->getPassThru(), MemVT, MMO,
+                                   ISD::UNINDEXED, MGT->getExtensionType());
+
+    MMO = DAG.getMachineFunction().getMachineMemOperand(
+             MPI, MGT->getMemOperand()->getFlags(), MemoryLocation::UnknownSize,
----------------
...sorry if this is a silly question, as I've not scrutinised the code, but is this manual splitting and recombining necessary?  It looks a little like manual type legalisation, but this combine is the sort of thing that is generally best done before type legalisation, so you should be able to create a single big load and let common code split it if necessary.  This might also resolve my "bytes" and "halfwords" comment above.
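
Something like the following is what I have in mind (purely a sketch; `WideMask` is an assumed interleaved double-width mask, and the pass-through handling plus the final de-interleave of the result are elided):

  // Hypothetical sketch: one double-width masked load, left for common
  // code to split if the wide type is illegal for the target.
  EVT WideVT = VT.getDoubleNumVectorElementsVT(*DAG.getContext());
  EVT WideMemVT = MemVT.getDoubleNumVectorElementsVT(*DAG.getContext());
  SDValue WideLoad = DAG.getMaskedLoad(
      WideVT, DL, Chain, BasePtr, DAG.getUNDEF(BaseVT), WideMask,
      DAG.getUNDEF(WideVT), WideMemVT, MMO, ISD::UNINDEXED,
      MGT->getExtensionType());
  // The gather's result would then be the even elements of WideLoad.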


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:16857-16858
+  // the inactive lanes won't be stored.
+  auto DataL = DAG.getNode(AArch64ISD::ZIP1, DL, VT, Data, Data);
+  auto DataH = DAG.getNode(AArch64ISD::ZIP2, DL, VT, Data, Data);
+
----------------
As above, using `ISD::EXTRACT_SUBVECTOR` here seems preferable, and it also means the scatter case doesn't require any target-specific nodes.
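
i.e. the store side could plausibly become a single wide masked store as well (hypothetical sketch; `WideData`/`WideMask` stand for the interleaved double-width data and mask):

  // Hypothetical sketch: one double-width masked store, with the unused
  // odd lanes suppressed by the interleaved mask.
  SDValue Store = DAG.getMaskedStore(Chain, DL, WideData, BasePtr,
                                     DAG.getUNDEF(BaseVT), WideMask,
                                     WideMemVT, MMO, ISD::UNINDEXED,
                                     /*IsTruncating=*/false);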


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:16914-16915
+  // a pair of contiguous loads or stores
+  if (Index->getOpcode() == ISD::STEP_VECTOR &&
+      IndexType == ISD::SIGNED_SCALED)
+    return tryToCombineToMaskedLoadStore(MGS, DAG, Subtarget);
----------------
I think these checks really belong inside `tryToCombineToMaskedLoadStore`, because that is the function that has the restrictions and thus the function that might later remove them.
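
For example (a sketch of the structure only, reusing the patch's function name):

  static SDValue
  tryToCombineToMaskedLoadStore(MaskedGatherScatterSDNode *MGS,
                                SelectionDAG &DAG,
                                const AArch64Subtarget *Subtarget) {
    // Keep the restrictions next to the code that imposes them, so the
    // caller doesn't need updating when they're relaxed.
    if (MGS->getIndex().getOpcode() != ISD::STEP_VECTOR ||
        MGS->getIndexType() != ISD::SIGNED_SCALED)
      return SDValue();
    // ... rest of the combine ...
  }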


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120912/new/

https://reviews.llvm.org/D120912


