[PATCH] D133421: [AArch64] break non-temporal loads over 256 into 256-loads and a smaller load

Fri Sep 9 08:55:41 PDT 2022

t.p.northover added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:17860
+      DAG.getMemBasePlusOffset(BasePtr, TypeSize::Fixed(PtrOffset), DL, Flags);
+  Align NewAlign = commonAlignment(LD->getAlign(), MemVT.getSizeInBits());
+  SDValue RemainingLoad =
----------------
Shouldn't the second operand of this call be `PtrOffset` again?

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:17867-17869
+  SDValue ExtendedReminingLoad =
+      DAG.getNode(ISD::INSERT_SUBVECTOR, DL, NewVT,
+                  {UndefVector, RemainingLoad, InsertIdx});
----------------
Is this (and the implementation generally) big-endian correct? I don't know the answer here, I can never remember what's supposed to happen. But someone should definitely try it on an `aarch64_be` target and at least eyeball the assembly to check the offsets and so on.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:17881
+                  {ConcatVectors, DAG.getVectorIdxConstant(0, DL)});
+  return DAG.getMergeValues({ExtractSubVector, Chain}, DL);
+}
----------------
The `Chain` here is the input chain, I think you need to `TokenFactor` all the loads' output chains together to make sure nothing gets reordered with them.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:19934
+        performTBISimplification(N->getOperand(1), DCI, DAG))
       return SDValue(N, 0);
+    return performLOADCombine(N, DCI, DAG, Subtarget);
----------------
This looks like it takes precedence over `performLOADCombine` and disables `ldnp` formation if the TBI feature is enabled.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133421/new/

https://reviews.llvm.org/D133421